About Two-Thirds of Psych Findings Hold Up in Top Journals

Studies in top-tier outlets are not exempt from concerns about reproducibility.

Posted Aug 27, 2018

Source: VikiVector/Shutterstock

A team of scientists that sought to repeat 21 experimental psychology findings published in the renowned journals Science and Nature was able to replicate 13 of them.

Psychology has been in the throes of a movement to recognize and reform problematic practices that lead to unreliable results. Findings from even the most highly regarded journals are not immune to these problems, which demonstrates the importance of continuing to implement policies that make new scientific findings more trustworthy.

“We’re all going to be striving for that counterintuitive, surprising result. That’s not a bad thing in science, because that’s how science breaks boundaries,” said Brian Nosek, the study’s lead author and a professor of psychology at the University of Virginia. “The key is recognizing and embracing the uncertainty of that, and it’s OK if some turn out to be wrong.” What the latest test shows, he explained in a web conference last week, “is that we can get a lot more efficient at identifying false leads rather than having them perseverate because we never bothered to replicate them in the first place.”

An international team of scientists reviewed every social science paper published in Science and Nature between 2010 and 2015. They planned to replicate a subset of studies that included an experimental intervention, generated a significant result, and were performed on a group of participants.

The team recreated the experimental design as closely as possible and worked with the original authors to do so. They also registered the study protocol, design, and analysis on the Open Science Framework, a system designed to increase reliability and transparency in science. They conducted each study with five times more people than the original so that the investigation would be especially sensitive to detecting any experimental effect.

The team successfully replicated 13 of the study findings, or 62 percent. The remaining eight studies failed to replicate. Past replication initiatives have produced a range of results, and the team estimates that psychology’s rate of reproducibility currently lies between 35 percent and 75 percent. The scientists also discovered that the strength of the experimental effects was about half of what it was in the original studies. The results were published today in the journal Nature Human Behaviour.
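For readers who want to check the headline figure, the arithmetic can be sketched in a few lines of Python. The counts come straight from the article; the variable names are only illustrative.

```python
# Counts reported in the article: 21 studies attempted, 13 replicated.
attempted = 21
replicated = 13
failed = attempted - replicated  # the "remaining eight studies"

replication_rate = replicated / attempted
failure_rate = failed / attempted

# Print both shares rounded to whole percentages.
print(f"Replicated: {replicated} of {attempted} ({replication_rate:.0%})")
print(f"Failed:     {failed} of {attempted} ({failure_rate:.0%})")
```

Rounded to a whole number, 13 of 21 comes to 62 percent, which falls inside the 35 to 75 percent range the team estimates for the field.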

Many factors have contributed to the problems with replication. Scientists typically have flexibility in how they analyze experimental data, and by trying different approaches, they can consciously or unconsciously nudge the findings toward the threshold for statistical significance. Researchers may also alter a hypothesis after seeing the results, which has the effect of weaving whatever significant results they found into a compelling narrative. They are not required to make their data available, which can allow questionable behavior to go unchecked. Perhaps most importantly, scientists and journal editors are incentivized to publish as many novel, flashy findings as possible rather than replicate previous findings to ensure reliability.
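A rough sketch shows why analytic flexibility matters. It rests on two assumptions that are not from the article: the significance threshold is the conventional 0.05, and each alternative analysis is statistically independent of the others. Real analyses of the same data are correlated, so the true inflation would usually be smaller than this. The function name is hypothetical.

```python
def chance_of_false_positive(num_analyses: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several independent tests of a
    nonexistent effect crosses the significance threshold.

    Assumes independence between analyses, which is a simplification.
    """
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them must avoid it for the researcher to find nothing.
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 3, 5, 10):
    print(f"{k:>2} analyses: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, a researcher who tries five different analyses of an effect that does not exist has roughly a 23 percent chance that at least one comes out significant, compared with 5 percent for a single analysis planned in advance.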

These elements lead to novel results and strong effects being reported at a misleadingly high rate. This is illustrated by the fact that the average size of the effects was half as large in the replication studies as it was originally. “This is a very consistent theme in replication,” said Sanjay Srivastava, a professor of psychology at the University of Oregon, who was not involved with the research. “If studies sometimes overshoot and sometimes undershoot, then it should be 50/50. But that’s not at all the case.”

The team also devised an experiment to see whether psychologists could detect solid, rigorous results. A group of 200 researchers bet on which studies would or would not hold up to scrutiny. The probability of replication implied by those bets correctly predicted the outcome for 18 of the 21 studies.

“As a community, we’re not totally stumbling around in the dark when it comes to what’s going to replicate,” Will Gervais, an associate professor of psychology at the University of Kentucky, said in the web conference. “You could potentially train peer reviewers to look out for the patterns people are picking up on. That way hopefully we can weed out some of these false positives before they pollute the literature.”

Gervais authored one of the studies that failed to replicate. The paper was published in Science in 2012, and it showed that analytic thinking suppressed belief in religion. At the time, concerns about bolstering credibility had not yet permeated the field. Now he recognizes that the experiment was fairly weak and hasn’t held up as well as other ideas from his career.

“Our study, in hindsight, was outright silly. It was a really tiny [number of participants] and just barely significant—kind of par for the course before we started taking replicability seriously,” Gervais said. “One of the best things coming out of the whole replicability movement is that it’s nudging reviewers and editors to be more savvy about what we should publish and endorse in the first place.”

The field has come a long way toward strengthening the credibility of new research. Solutions include testing experiments on larger numbers of people, creating a stricter threshold for statistical significance, making data publicly available, continuing replication efforts, and publicly preregistering the plan for a study before conducting it, which limits the sort of researcher flexibility that leads to false positives. The Open Science Framework was created in 2012 and now includes registrations of more than 20,000 studies, according to Nosek, who is executive director of the Center for Open Science. The rate of registrations has doubled each year since its creation.

“This study is really good motivation to continue pressing journals to update their policies and to change incentives so scientists will be rewarded for more of these practices,” Srivastava said. “It’s like pushing your car when it’s stalled. The car is moving, but you have to keep pushing, or else it’s going to stop.”