The Quest for Replicable Results in Psychology
One thing that shows signs of replicability is the sense of crisis in the field.
Posted Feb 17, 2014
~ The Book of p
Psychological science is—again—in the grip of an identity crisis. This time, the crisis is about its empirical base. Can we trust the findings in the archival journals? This trust depends on the replicability of these findings. Science is supposed to build a storehouse of knowledge. What happened in Laboratory X is supposed to be reproducible in Laboratory Y. As the world is a complex place, where myriad factors affect a particular result, the idea of replicability is necessarily probabilistic. Not having the luxury of certainty, we can only ask for a high probability that results will replicate when we repeat a study. The probability of result replication should be highest when we faithfully replicate the design and the methods of the original study. The sense of crisis comes from reports that even when repeated attempts to document a phenomenon are crafted with care, the results remain variable. When that is so, one wonders if the original reports entering the field’s archive are a positively biased sample of the attempted studies or if there is so much irreducible uncertainty in the subject matter that only research based on gigantic samples will ultimately reveal reality.
Others have discussed the issues of publication bias and intrinsic uncertainty at length. I will not add to that debate. Instead, I want to raise the point that the field has failed to find consensus on what is meant by ‘replicable.’ This seemingly clear and innocent term allows several interpretations, and hence lets scholars talk past each other. I will sketch some of these possible interpretations and note their implications. I limit myself to simple and ordinal relations (e.g., “Estimates of replicability will be greater when conditions x or y hold”), leaving the exact estimates to others who wish to survey the terrain.
Suppose we are interested in the question of whether different kinds of alcoholic beverages can affect creativity. Setting a no-alcohol control aside, let’s say Professor Antonius has found that scores on a test of creativity (say, oblique associations) are higher after the administration of a quart of Grenache than after the administration of a fifth of Glenfarclas, and he has published this result. Professor Bacchus runs a replication study. Consider Bacchus’s options when contemplating the issue of statistical replication.
 Repeating significance. Now that Antonius (A) has found a statistically significant result, p < .05, Bacchus (B) can wonder about the probability of finding p < .05 as well. B is thus interested in a higher-order p. Assuming that A’s result was unbiased (i.e., assuming that A did not run several studies and publish only the cherry), A’s observed effect size is the best estimate of the latent population effect size and his observed p value is the best estimate of the p value that will be obtained with an exact replication. Therefore, if B replicates A’s methods exactly, the probability of obtaining a p value smaller than A’s p value is .5. If A observed a p value of exactly .05, B’s probability of finding significance is .5. Inasmuch as A’s observed p value was less than .05, B’s probability of finding p < .05 will be > .5. For B to have an estimated replication probability of .95 (that is, the probability of getting a p value of .05 or smaller), A’s original p value would have to be very small indeed. In short, the ‘repeating significance’ approach to replicability is conservative. It favors failures to replicate. When the issue of replicability is framed this way, and when investigators falsely expect significant results in all or most replication studies, a sense of frustration and crisis is inevitable.
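For readers who want to check the arithmetic, here is a minimal sketch in Python. It rests on assumptions that are mine, not part of the argument above: a two-sided z test, an exact replication with the same sample size, a true effect equal to A's observed effect, and "success" meaning significance in the same direction. The function names are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def replication_probability(p_original, alpha=0.05):
    """Chance that an exact replication reaches p < alpha in the same direction.

    Assumption: the true effect equals the originally observed effect,
    so the replication z statistic is distributed N(z_original, 1).
    """
    z_original = _STD.inv_cdf(1 - p_original / 2)  # z behind the two-sided p
    z_critical = _STD.inv_cdf(1 - alpha / 2)       # about 1.96 for alpha = .05
    return _STD.cdf(z_original - z_critical)


def required_original_p(target_probability, alpha=0.05):
    """Two-sided original p needed for a given replication probability."""
    z_critical = _STD.inv_cdf(1 - alpha / 2)
    z_needed = z_critical + _STD.inv_cdf(target_probability)
    return 2 * (1 - _STD.cdf(z_needed))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(p, round(replication_probability(p), 3))
    print("original p needed for .95:", required_original_p(0.95))
```

Under these assumptions, an original p of exactly .05 gives a replication probability of .5, and a replication probability of .95 requires an original p well below .001.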
 Repeating direction. In null hypothesis significance testing (NHST), the probability of finding X (creativity after Grenache) = Y (creativity after Glenfarclas) is infinitesimally small. In an infinitely large sample, X is either > or < Y, in whatever small way. Once A has found that X > Y, B’s question of replicability is the probability of X > Y in his study. Without A’s result, this probability might be .5. With A’s result, it is higher. The question is whether A, B, or the rest of the field will see a small, nonsignificant result in B’s lab as a successful replication. Those who insist on repeated significance will not accept a mere directional replication.
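How much higher than .5? One rough way to put a number on it is sketched below. The assumptions are mine: a two-sided z test, equal sample sizes in both labs, and no prior knowledge about the effect beyond A's result, so that B's z statistic is centered on A's with variance 2 (sampling error in both studies). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def direction_replication_probability(p_original):
    """Chance that a same-sized replication shows the same sign of effect.

    Assumption: with a flat prior on the true effect, the replication
    z statistic given the original is distributed N(z_original, 2).
    """
    z_original = _STD.inv_cdf(1 - p_original / 2)
    return _STD.cdf(z_original / sqrt(2))


if __name__ == "__main__":
    for p in (1.0, 0.20, 0.05, 0.01):
        print(p, round(direction_replication_probability(p), 3))
```

With an entirely uninformative original result (p = 1) the sketch returns .5, and the probability rises as A's p value shrinks.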
A and B can pool their results and compute a new p value. If this meta-analytic p value is smaller than A’s original p value, then B’s result will have strengthened the case for Grenache. This is, however, a conservative strategy. A single non-significant replication attempt could sink an interesting finding even though both results point in the same direction. If Professors C, D, and E attempted further replications, they might all fail individually, but their pooled results could be significant if they all (or most of them) shared the ordinal finding that X > Y.
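The pooling idea can be illustrated with Stouffer's method of combining z scores. This is my choice for the sketch, not a method named above, and it assumes two-sided p values, independent studies, and equal weights (roughly equal sample sizes). The function name and the example p values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def pooled_p_value(p_values, same_direction=None):
    """Combine two-sided p values with unweighted Stouffer pooling.

    same_direction[i] is True when study i points the same way as the
    original finding (X > Y); by default all studies are assumed to agree.
    """
    if same_direction is None:
        same_direction = [True] * len(p_values)
    z_scores = []
    for p, agrees in zip(p_values, same_direction):
        z = _STD.inv_cdf(1 - p / 2)          # size of the effect in z units
        z_scores.append(z if agrees else -z)  # sign carries the direction
    z_pooled = sum(z_scores) / sqrt(len(z_scores))
    return 2 * (1 - _STD.cdf(abs(z_pooled)))


if __name__ == "__main__":
    # A is significant, B is not, but both show X > Y.
    print(pooled_p_value([0.04, 0.30]))
    # Four individually nonsignificant studies that agree in direction.
    print(pooled_p_value([0.20, 0.20, 0.20, 0.20]))
    # Two equally strong results pointing in opposite directions.
    print(pooled_p_value([0.04, 0.04], [True, False]))
```

In this sketch a nonsignificant result in the same direction lowers the pooled p below A's original value, several nonsignificant but directionally consistent studies pool to significance, and two equal and opposite results cancel.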
If indeed B’s nonsignificant result is a failure to replicate, the results of additional studies will show X > Y and X < Y with approximately the same frequency. From this perspective, it is doubtful whether one can even talk about the issue of replication if there are only one or two additional studies. It would even be rash to argue that a study with a small nonsignificant result of the opposite kind (X < Y) is a failed replication. If, however, a carefully conducted replication study yields an opposite effect, which is significant and of about the same size (which is rare but possible), expectations should probably return to baseline. Such a result would inhibit further research, for who would want to perform a third study if it is unlikely to break the stalemate even if it is individually significant?
 Replicating null results. NHST breeds categorical thinking. There is something (p < .05) or there is nothing (p > .05). When nothing can be said about nothing, while something can be said about something, the telling is asymmetrical. The issue of replicability inherits this asymmetry – sort of. When A finds that X > Y (p < .05), while B does not, B might be tempted to broadcast his “failure to replicate” (perhaps he is a closet admirer of Glenfarclas). If B is otherwise a faithful disciple of NHST, he should watch out, though. NHST orthodoxy demands caution when results are not significant. ‘Nothing can be concluded’ is a standard phrase. If nothing can be concluded from nonsignificant results, how can one conclude that these results undo significant results in another study?
Now suppose instead that A found nonsignificant results. B, giving it another shot (so to speak), repeats A’s study protocol and finds p < .05. B has failed to replicate A’s results. Unless there is a good reason to think that the order in which the studies were conducted matters (it should not since we are talking about exact replication studies), this scenario is the same as the original one. Those who believe that a single failed replication (failed in the sense of finding significance) decisively undermines a research hypothesis would have to conclude that it is a waste of time and effort to replicate an earlier study that had nonsignificant results.
An interesting scenario arises when researchers have strong reasons to believe that the null hypothesis of Nothing is, in fact, the best representation of reality. If positive results are published somewhere, these researchers are in an awkward position. If they run replication studies, they might also find positive results, thereby unwittingly strengthening the case for what they believe to be a false positive. Alternatively, if they fail, as they expected, to replicate the effect, they must make an argument for the idea that their null results matter. In this case, repeated null findings do not obscure a small positive effect, but are an important corrective. In other words, if published results are false positives, the field needs failures to replicate.
Failures to replicate serve an important corrective function. This function is not fully recognized in the current debate. Critics seem to think that the replicability of an effect should already be known at the time when the first report is published. But how can that be, unless there is a two-stage system of publication? In the first, limbo, stage, results are publicly available, but not considered ‘published.’ In the second, salvation, stage, studies lucky enough to attract some acceptable number of independent replication attempts, and lucky enough for most of these attempts to be corroborative, ascend to the realm of ‘true publications.’ Whether the replication studies (all of them?) are invited to ascend along with them is an interesting question.
 Strong NHST. So far, I have only considered the traditional, or weak, use of NHST. Here, the null hypothesis is the nil hypothesis of nothing. Paul Meehl and others have suggested that NHST be used on strong, substantive, non-nil hypotheses. Everything you have learned about NHST would be turned upside down (including the arguments in this post). With strong NHST, a result supports the substantive hypothesis when p > .05 (or any other value conventionally agreed upon), which means that the repeated-significance interpretation of replicability would turn into a repeated-nonsignificance interpretation, and that the repeated-direction interpretation would be moot. Any significant result would count as a failure to replicate. Even Meehl himself, who thought he was prescribing strong Popperian medicine for the ailing field of empirical psychology, backed off from the rigorous application of this regimen. A field that refutes hypotheses faster than it can generate them runs dry fast.
 Bayes. Years ago, I argued in print that the probability of replication is not determined unless we consider the prior probabilities of our hypotheses. Remember, the ordinary p value refers to the probability of the data (e.g., a difference between mean X and mean Y) or data more extreme (i.e., even larger differences between mean X and mean Y) assuming the null hypothesis is true. In a replication study, we might find that X > Y with p < .05, but that can occur for one of two reasons: it can occur if the null hypothesis is true and it can occur if the null hypothesis is false. A Bayesian says that we need the prior probabilities of both hypotheses before the first study, that we must update them in light of the first batch of evidence, and that only then can we estimate the probability of replicating the result of p < .05 with p < .05.
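A toy version of the Bayesian bookkeeping might look like the sketch below. The simplifications are mine: only two hypotheses (a null and one alternative), a known power under the alternative, and significance counted regardless of direction. The function names and the illustrative priors are hypothetical.

```python
def posterior_after_significance(prior_h1, power, alpha=0.05):
    """Probability that the effect is real, given one significant result.

    prior_h1: prior probability of the alternative hypothesis.
    power:    probability of p < alpha if the alternative is true.
    alpha:    probability of p < alpha if the null is true.
    """
    true_positive = prior_h1 * power
    false_positive = (1 - prior_h1) * alpha
    return true_positive / (true_positive + false_positive)


def bayesian_replication_probability(prior_h1, power, alpha=0.05):
    """Probability that a second, identical study also yields p < alpha."""
    posterior_h1 = posterior_after_significance(prior_h1, power, alpha)
    # The replication is significant with probability `power` if the effect
    # is real and with probability `alpha` if it is not.
    return posterior_h1 * power + (1 - posterior_h1) * alpha


if __name__ == "__main__":
    for prior in (0.5, 0.1, 0.01):
        print(prior, round(bayesian_replication_probability(prior, power=0.8), 3))
```

The same significant first result implies very different replication probabilities depending on the prior, which is the sense in which the probability of replication is undetermined without one.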
Truth and illusion
It is true that more evidence is better than less. Replication studies add to the store of data that can be mined and recombined. Unless there are systematic biases in these studies, more is more. It is an illusion, however, to think that there is something categorically different about replication, or rather that the probability of successful replication—however defined—is essentially different from little old p, the probability of the data under the null hypothesis. The probability of obtaining a successful replication is the probability of having data strong enough to make the complement of that probability (1 – p) small. Reframed in the language of NHST, we want data extreme enough to reject the hypothesis that the original result is not replicable. Yet, some critics of NHST write as if to suggest that an individual p value is nothing, while a successful replication is everything.
The goal of empirical science (at least during a Kuhnian lull of normalcy) is to produce gesichertes Wissen, in the beautiful if untranslatable German phrase (something like sound and secure knowledge), but omniscience remains elusive. Another illusion of categorical thinking is that if we do not have fully gesichertes Wissen, we know nothing. One thing we do know is that we will continue arguing over the proper interpretation of p.
I have chosen not to provide specific references. A casual survey of recent issues of Perspectives on Psychological Science will turn up more relevant sources than even the most sympathetic reader might contemplate reading.