What a Replication Does (and Does Not) Tell You

Interpreting failed study replications is not straightforward.

Posted Sep 22, 2019

The use of direct replications is on the rise in psychological science. A direct replication takes an original study and repeats it as closely as possible.

I say "as closely as possible" because it is arguably impossible to replicate the procedure and materials of an original study 100%, let alone the sample. And especially in social psychology, the norms of the people being studied change across years, as does the entire worldly context of what is being studied.

That being said, replications are extremely important and a hallmark of science. If a finding cannot be replicated, that could signal that the finding is false. It could also mean a lot of other things, though, and that is where the difficulty lies in interpreting these results. It is that difficulty that makes the headlines some academics post on Twitter proclaiming the death of a theory, based on a few failed replications, absolute nonsense. Psychology isn't physics, where that might be the case. Things are complicated, and people are messy data points.

What a failed replication can mean:

1) The current study had bad luck

Essentially, an experimental finding is considered significant if the difference between groups of people being studied (e.g., one group does X, another does Y, and the two groups are compared on a third variable measured in both), or within the same people at different time points (everyone in the study does X, then does Y, and the difference between X and Y is measured), passes a certain threshold for being unlikely to have occurred by random chance.

Replication attempts address this chance factor by taking the size of the effect in the original study and recruiting enough participants that, if the effect is real, the replication will find it 80-95% of the time (this is called statistical power). The exact figure varies by study, but it basically means that 5-20% of replication attempts should fail even if the original finding is legitimate (a study with 90% power will fail to reproduce the effect 10% of the time, and so on). Thus, a failed replication might just fall into the bad luck category, and the original finding may still replicate at the expected rate. (For any Texas Hold 'Em players out there, 20% is roughly the odds of losing with pocket aces pre-flop against any one other hand. With enough hands played, you quickly learn how often 20% comes up!)
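To make the bad-luck point concrete, here is a minimal simulation sketch. The effect size, sample size, and alpha level are illustrative assumptions, not values from any particular study: with a real effect and roughly 80% power, about one in five simulated replications still comes up non-significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative assumptions: a true effect of d = 0.5 and n = 64 per group,
# which gives roughly 80% power for a two-sided t-test at alpha = .05.
d, n, alpha, n_replications = 0.5, 64, 0.05, 10_000

failures = 0
for _ in range(n_replications):
    control = rng.normal(0.0, 1.0, n)    # group without the manipulation
    treatment = rng.normal(d, 1.0, n)    # group with a real effect of size d
    _, p = stats.ttest_ind(treatment, control)
    if p >= alpha:                       # a "failed" replication despite a real effect
        failures += 1

print(f"Non-significant replications: {failures / n_replications:.1%}")
# Prints roughly 20%, even though the effect exists in every simulated study.
```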

2) The current study is missing a key detail (unknown to the researchers) which helped generate the original effect.

I have written before about the facial feedback hypothesis, which is basically the idea that people's emotions can be influenced by their bodily actions (e.g., smiling can make things funnier). This effect was having a very rough time on the replication "market" until researchers noticed that all the replications that failed to find the original effect had used a camera to record the study. The original studies did not have this feature, nor anyone watching the participants while they did the study. This proved to be a key detail: the effect does not occur when people are being watched or recorded, but it does occur when they are not (for an explanation, see here).

Thus, a failure to replicate a finding might be due to the difficulty of reproducing the procedure and materials of a study precisely. Knowing these differences tells us when an effect does and does not occur. Simply saying "the original study was false" denies us that opportunity and is, frankly, false.

3) The current study isn't rigorous enough

Null findings (finding no effect) in studies can be due to an effect not existing. They can also, as researchers know, be due to factors related to the participants (such as not paying attention or not fully understanding the materials). And they can be due to the researchers themselves being a bit sloppy. An example of the latter would be typos in the materials, or a failure to adequately screen the quality of the data received. While removing participants can be problematic for loads of reasons, failing to remove data that should not be used is, I would argue, just as problematic.

Recent research conducted on Amazon Mturk, a popular destination for replication studies because it enables quick recruitment of large numbers of participants, has in my experience (and others'; just Google Mturk bots and data farms) been a total minefield of participants who should not be included in the final data analysis: people who cannot pick what they just read about out of four options; IP addresses that do not match the location filter on the site; multiple responses from the same IP address or the same latitude and longitude; missing study codes that participants must provide to verify they only took the study once; people who skipped most of the items or took about 3 minutes to finish a 15-minute study; open-ended answers that make no sense, merely repeat the question verbatim, or copy a wiki entry about a key word in the question. Yet many replication attempts do not even report how they screened their data.
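As a rough illustration of what such screening might look like in practice, here is a minimal sketch using pandas. The file name, column names (comprehension_check, ip_address, latitude, longitude, completion_code, duration_minutes), and cutoffs are hypothetical placeholders; a real study would need to justify each criterion, ideally in a pre-registration, and report how many responses each one excluded.

```python
import pandas as pd

# Hypothetical file and column names; a real dataset would differ.
df = pd.read_csv("mturk_responses.csv")

keep = (
    (df["comprehension_check"] == "correct")                           # picked the right option about what they read
    & ~df.duplicated(subset="ip_address", keep="first")                # one response per IP address
    & ~df.duplicated(subset=["latitude", "longitude"], keep="first")   # one response per location
    & df["completion_code"].notna()                                    # provided the verification code
    & (df["duration_minutes"] >= 7)                                    # didn't rush a ~15-minute study
)

screened = df[keep]
print(f"Kept {len(screened)} of {len(df)} responses")
print(f"Excluded {(~keep).sum()} responses; report each criterion and its count in the write-up")
```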

4) The original finding stands, but it might not apply today or to a specific group of people.

Social and cultural norms change. Historical contexts change. A finding might hold only within a specific period of human history or among a specific group of people. But the initial finding was still true in its original context.

5) The original findings exist, but they are not as strong as believed.

It is possible that the researchers of the original study used "p-hacking" (e.g., removing participants after looking at the data in order to push the study into significance, or measuring multiple dependent variables and reporting only the significant ones) to boost the study into significance, but that the effect still exists. The p-hacking in this case would simply inflate the size of the effect. When a replication then uses that effect size in its power analysis (the estimate of how likely the study is to find the effect if it is real), it is working from inflated numbers. The replication thus inherits a weakness passed on by the original study's p-hacking. But that does not necessarily show that the idea or theory being tested is itself false.
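Here is a minimal sketch of how this plays out in a power analysis; the effect sizes are invented for illustration. If the original, p-hacked estimate is d = 0.6 but the true effect is closer to d = 0.3, a replication powered on the inflated number ends up badly underpowered for the real effect.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
alpha, target_power = 0.05, 0.90

# Illustrative assumptions: the published (inflated) effect vs. the true effect.
reported_d, true_d = 0.6, 0.3

# Sample size per group chosen to detect the *reported* effect with 90% power.
n_per_group = power_calc.solve_power(effect_size=reported_d, power=target_power, alpha=alpha)

# Actual power of that design if the *true* effect is smaller.
actual_power = power_calc.solve_power(effect_size=true_d, nobs1=n_per_group, alpha=alpha)

print(f"n per group planned from the reported effect: {n_per_group:.0f}")
print(f"Power against the true effect: {actual_power:.0%}")
# The replication looks well powered on paper but has only modest power in reality.
```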

6) The theory or idea being tested is true, but the materials don't adequately test it

Say you are studying whether belief in free will affects moral behavior. You have people read an essay arguing for free will in one condition but not in the other, and then measure a pro-social behavior, such as how much money they donate to an environmental charity. In the initial study, people who read the free will essay behave more pro-socially on these exact measures. In the replication, these effects do not occur. But whereas the free will essay was found to increase free will belief in the initial study, it did not in this one. Thus, the very premise of the idea being tested falls flat: it cannot be tested in the current study because the material believed to change free will belief did not do so. In this case, the study did not replicate, but that says nothing about whether the idea or theory itself is true. It just says the essay being used to test the key idea isn't reliable. There is a difference between testing ideas and testing materials, and the focus on replicating studies sometimes leads people to confound the two.
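A minimal sketch of that distinction, with hypothetical variable names and simulated data: before interpreting the main pro-social outcome, a replication can check whether the essay actually shifted free will belief. If it did not, the result speaks to the material, not the theory.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical manipulation-check scores (free will belief on a 1-7 scale)
# in the essay condition vs. the control condition.
essay_belief = rng.normal(4.1, 1.2, 80)
control_belief = rng.normal(4.0, 1.2, 80)

t, p = stats.ttest_ind(essay_belief, control_belief)
if p < 0.05:
    print("Manipulation check passed: the essay shifted free will belief, so the main test is interpretable.")
else:
    print("Manipulation check failed: the essay did not move belief, so the theory was never really tested.")
```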

I am sure I have missed possibilities for what a failed replication can mean. But I do not want to sound opposed to direct replications or to the other measures (such as pre-registration) being taken to improve psychological science. I agree wholeheartedly with all of them, as any student who has taken a class with me recently can attest. I am sure the word "rant" has come up among them when the topic does!

But we need to be careful, as a field, not to throw the baby out with the bathwater here. Careful interpretation of failed replications is tricky business, but it is essential if we are going to be a field whose chief goal is accuracy.