Can We Trust Psychological Studies?
Making sense of the replication crisis in psychology
Posted Aug 28, 2015
Will the 28th of August 2015 go down in the science history books as the end of Psychology? Or will it mark a new beginning for our beautiful discipline? I hope the latter. But I cannot be sure after reading the article in Science by the Open Science Framework team (consisting of various research teams across the US and Europe), spearheaded by University of Virginia psychologist Brian Nosek. It reports the results of a painstaking effort to replicate 100 psychology experiments from leading journals in psychology. The team randomly selected these studies from a long list of studies that appeared in three top journals in social and cognitive psychology in 2008 (Journal of Experimental Psychology, Journal of Personality and Social Psychology, and Psychological Science). I must admit that I am not totally unbiased, having served as an Associate Editor of one of these journals during that period, so I was keen to see what conclusions the Reproducibility Project reached about the state of our field. The article is freely available online. My overall impression upon reading the paper is that things are not looking good. But I would like to add some nuance to that conclusion and note some positive signs coming out of this project.
The bottom line is that many replication studies showed weaker effects than the original experiments: the mean effect size of the replications was half the magnitude of the original effects. Furthermore, whereas 97% of the original experiments reported significant effects (otherwise it is simply hard to get published), only 36% of the replications did. Some studies failed to replicate outright, and some even showed opposite results. Among these failures are some high-profile studies, such as the one showing that people primed with free will rather than genetic determinism cheat less on various performance tasks. A notorious study suggesting that children who are exposed to more than one language at home can concentrate better also failed to replicate.
But some nuances to this general pattern of results are in order. First, when I looked at the replicated studies individually (the files are available on the Open Science Framework website), the replication teams picked only one experiment from each paper, even when a paper reported a set of as many as six different studies. There is nothing wrong with that in principle, until one considers that the replication adds just one extra data point, and that data point is itself subject to random noise or structural error. A second issue is that replication is a noble goal, but it is not always easy to recreate the conditions of the original experiment. I came across replications that were conducted in a different country from the original, or with a completely different sample: for instance, the original study contained predominantly male participants whereas the replication was conducted primarily with females. I also came across replications conducted under quite different lab conditions (for instance, individuals sat in separate cubicles in the original but shared a laboratory in the replication). Sometimes the incentives for participants differed: in the original they took part for course credit, while in the replication they received money. I suppose I could be accused of cherry-picking here, but that is what I came across when I looked at individual replications.
Nevertheless, there are some important lessons to be learned from this major scientific effort. First, the researchers apparently did not find any signs of data manipulation or fabrication in the 100 selected studies. This will silence some critics of our field.
Second, it seems that although many of the original studies did not fully replicate, the mean differences between conditions were usually trending in the expected direction. This suggests that perhaps a large number of the original studies were underpowered. The lesson here is to work with large sample sizes.
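To see why underpowered originals produce this pattern, here is a minimal sketch, not from the article, that approximates the power of a two-sided, two-sample test using a normal approximation (the effect sizes and sample sizes are illustrative assumptions, not figures from any of the replicated studies):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test for a
    standardized mean difference d (normal approximation;
    illustrative numbers only)."""
    z_crit = 1.959964  # critical z for alpha = 0.05, two-sided
    ncp = d * math.sqrt(n_per_group / 2.0)  # noncentrality under H1
    return normal_cdf(ncp - z_crit)

# A design with 64 participants per group has ~80% power
# to detect a medium effect of d = 0.5:
print(round(power_two_sample(0.5, 64), 2))   # ~0.81

# But if the true effect is half that size, as the replications
# suggest on average, the very same design has only ~29% power:
print(round(power_two_sample(0.25, 64), 2))  # ~0.29
```

In other words, if original effects were overestimated, replications run at the same scale would often miss significance while still trending in the right direction, which is roughly the pattern the project reports.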
The third lesson is that replication success was higher in cognitive psychology than in social psychology. This can perhaps be attributed to better experimental practices in the former, practices that could easily be adopted by the latter. For instance, cognitive psychologists usually work with within-subject designs (where each individual serves as their own control) and with studies that contain repeated measures.
The fourth lesson is that stronger effects replicate more easily. This is not surprising, but it suggests to me that we should assign more value to one high-powered study with a strong effect size than to a set of studies with relatively weak effect sizes.
Fifth, the prestige of the research team that carried out the original study did not matter. This is encouraging, as it means anyone with a good idea and a sound research paradigm can produce high-quality work. It is also a bit depressing, as we like to think of psychological science in terms of a status hierarchy, with top researchers having something "special" that we could model our efforts on.
Despite having some reservations, I think the Open Science team has done a great service to our field. In addition to improving our scientific practices, we should focus our efforts next on getting rid of bad psychological ideas!