Jesse Marczyk Ph.D.

Pop Psych

Psychology's Research Replication Problem

When more research isn't necessarily better

Posted Apr 20, 2016

By now, many of you have no doubt heard about the reproducibility project, where 100 psychological findings were subjected to replication attempts. In case you're not familiar with it, the results of this project were less than a ringing endorsement of research in the field: of the expected 89 replications, only 37 were obtained and the average size of the effects fell dramatically; social psychology research in particular seemed uniquely bad in this regard. This suggests that, in many cases, one would be well served by taking psychological findings with a couple of grains of salt.

Naturally, this leads many people to wonder whether there's any way they might be more confident that an effect is real, so to speak. One thing that might bolster your confidence is whether the research in question contains conceptual replications.

Conceptual replications are cases where the authors of a manuscript report the results of several different studies purporting to measure the same underlying thing with varying methods; that is, they are studying topic A with methods X, Y, and Z. If all of these turn up positive, you ought to be more confident that an effect is real. Indeed, I have had a paper rejected more than once for only containing a single experiment. Journals often want to see several studies in one paper, and that is likely part of the reason why: a single experiment is surely less reliable than multiple ones.

It doesn't go anywhere, but at least it does so reliably
Source: Flickr/Michael Caroe Andersen

According to the unknown moderator account of replication failure, psychological research findings are, in essence, often fickle. Some findings might depend on the time of day that measurements were taken, the country of the sample, some particular detail of the stimulus material, whether the experimenter is a man or a woman; you name it. In other words, it is possible that these published effects are real, but only occur in some rather specific contexts of which we are not adequately aware; that is to say they are moderated by unknown variables. If that's the case, it is unlikely that all replication efforts will be successful, as it is quite unlikely that all of the unique, unknown, and unappreciated moderators will be replicated as well. This is where conceptual replications come in: if a paper contains two, three, or more different attempts at studying the same topic, we should expect that the effect they turn up is more likely to extend beyond a very limited set of contexts and should replicate more readily.

That's a flattering hypothesis for explaining these replication failures; there's just not enough replication going on prepublication, so limited findings are getting published as if they were more generalizable. The less-flattering hypothesis is that many researchers are, for lack of a better word, cheating by employing dishonest research tactics. These tactics can include hypothesizing after data is collected, only collecting participants until the data says what the researchers want and then stopping, splitting samples up into different groups until differences are discovered, and so on.
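To see why a tactic like collecting participants only until the data say what you want is a problem, here is a minimal simulation sketch in Python. It is a toy illustration, not an analysis from any of the papers discussed: the sample sizes, the peeking schedule, the z-test with a known standard deviation of 1, and all of the function names are assumptions made for the example. Both groups are drawn from the same distribution, so every "significant" result it produces is a false positive.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% cutoff for a z-test


def z_stat(a, b):
    """z statistic for a difference in means (equal group sizes, known SD of 1)."""
    n = len(a)
    return (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)


def fixed_n_study(rng, n=50):
    """Honest design: collect n per group, then test exactly once."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    return abs(z_stat(a, b)) > Z_CRIT


def optional_stopping_study(rng, start=10, step=5, max_n=50):
    """Questionable design: test after every batch and stop at the first 'hit'."""
    a = [rng.gauss(0, 1) for _ in range(start)]
    b = [rng.gauss(0, 1) for _ in range(start)]
    while True:
        if abs(z_stat(a, b)) > Z_CRIT:
            return True   # "significant": stop collecting and write it up
        if len(a) >= max_n:
            return False  # ran out of participants without a hit
        a.extend(rng.gauss(0, 1) for _ in range(step))
        b.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(study, n_sims=2000, seed=1):
    """Share of simulated studies reporting an effect when none exists."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(n_sims)) / n_sims


fixed_rate = false_positive_rate(fixed_n_study)
peeking_rate = false_positive_rate(optional_stopping_study)
print(f"fixed sample size: {fixed_rate:.3f}")
print(f"optional stopping: {peeking_rate:.3f}")
```

Under these assumptions, the fixed-sample design should land near the nominal 5% false positive rate, while the peeking design reports noticeably more "effects" from pure noise, even though no data were made up.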

There's also the notorious issue of journals only publishing positive results rather than negative ones (creating a large incentive to cheat, as punishment for doing so is all but non-existent so long as you aren't just making up the data). It is for these reasons that requiring the pre-registration of research - explicitly stating what you're going to look at ahead of time - markedly reduces the number of positive findings. If research is failing to replicate because the system is being cheated, more internal replications (those from the same authors) don't really help that much when it comes to predicting external replications (those conducted by outside parties). Internal replications just provide researchers the ability to report multiple attempts at cheating.

These two hypotheses make different predictions concerning the data from the aforementioned reproducibility project: specifically, research containing internal replications ought to be more likely to successfully replicate if the unknown moderator hypothesis is accurate. It certainly would be a strange state of affairs from a "this finding is true" perspective if multiple conceptual replications were no more likely to prove reproducible than single-study papers. It would be similar to saying that effects which have been replicated are no more likely to subsequently replicate than effects which have not. By contrast, the cheating hypothesis (or, more politely, questionable research practices hypothesis) has no problem at all with the idea that internal replications might prove to be as externally replicable as single-study papers; cheating a finding out three times doesn't mean it's more likely to be true than cheating it out once.

It's not cheating; it's just a "questionable testing strategy"
Source: Flickr/vozach1234

This brings me to a new paper by Kunert (2016), who reexamined some of the data from the reproducibility project. Of the 100 original papers, 44 contained internal replications: 20 contained just one replication, 10 were replicated twice, 9 were replicated three times, and 5 contained more than three. These were compared against the 56 papers which did not contain internal replications to see which would subsequently replicate better (as measured by achieving statistical significance). As it turned out, papers with internal replications externally replicated about 30% of the time, whereas papers without internal replications externally replicated about 40% of the time. Not only were the internally-replicated papers not substantially better, they were actually slightly worse in that regard. A similar conclusion was reached regarding the average effect size: papers with internal replications were no more likely to show a larger effect size in the subsequent replication, relative to papers without such replications.
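For readers who like to check the arithmetic, here is a short Python sketch that tallies the figures quoted above. The variable names are my own, and the "implied" counts are only a rough back-of-the-envelope estimate derived from the rounded percentages in this post; they are not numbers taken from Kunert (2016).

```python
# Counts of papers by number of internal replications, as quoted in the text.
internal_counts = {"one": 20, "two": 10, "three": 9, "more than three": 5}

papers_total = 100
with_internal = sum(internal_counts.values())
without_internal = papers_total - with_internal

# The breakdown should add up to the totals given in the text (44 and 56).
assert with_internal == 44
assert without_internal == 56

# External replication rates as quoted in the text ("about" 30% and 40%).
rate_with = 0.30
rate_without = 0.40

# Rough implied counts. ASSUMPTION: these come from rounded percentages,
# so they are approximations, not figures reported in the paper.
implied_with = round(rate_with * with_internal)
implied_without = round(rate_without * without_internal)

# Gap between the two groups, in percentage points.
gap_points = round((rate_without - rate_with) * 100)

print(f"with internal replications:    ~{implied_with} of {with_internal}")
print(f"without internal replications: ~{implied_without} of {without_internal}")
print(f"gap: {gap_points} percentage points, favoring papers without them")
```

The point of the tally is simply that the gap runs in the opposite direction from what the unknown moderator hypothesis would predict.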

It is possible, of course, that papers containing internal replications are different from papers which do not contain such replications. This means it might be possible that internal replications are actually a good thing, but their positive effects are being outweighed by other, negative factors. For example, someone proposing a particularly novel hypothesis might be inclined to include more internal replications in their paper than someone studying an established one; the latter researcher doesn't need more replications in his paper to get it published because the effect has already been replicated in other work.

Towards examining this point, Kunert (2016) made use of the 7 identified reproducibility predictors from the Open Science Collaboration - field of study, effect type, original p-value, original effect size, replication power, surprisingness of original effect, and the challenge of conducting the replication - to assess whether internally-replicated work differed in any notable ways from the non-internally-replicated sample. As it turns out, the two samples were pretty similar overall on all the factors except one: field of study. Internally-replicated effects tended to come from social psychology more frequently (70%) than cognitive psychology (54%). As I mentioned before, social psychology papers did tend to replicate less often. However, the unknown moderator hypothesis was not particularly well supported for either field when examined individually.

In summary, then, papers containing internal replications were no more likely to do well when it came to external replications which, in my mind, suggests that something is going very wrong in the process somewhere. Perhaps researchers are making use of their freedom to analyze and collect data as they see fit in order to deliver the conclusions they want to see; perhaps journals are preferentially publishing the findings of people who got lucky, relative to those who got it right. These possibilities, of course, are not mutually exclusive. Now I suppose one could continue to make an argument that goes something like, "papers that contain conceptual replications are more likely to be doing something else different, relative to papers with only a single study," which could potentially explain the lack of strength provided by internal replications, and whatever that "something" is might not be directly tapped by the variables considered in the current paper. In essence, such an argument would suggest that there are unknown moderators all the way down.

"...and that turtle stands on the shell of an even larger turtle..."
Source: Flickr/ynnil

While it's true enough that such an explanation is not ruled out by the current results, it should not be taken as any kind of default stance on why this research is failing to replicate. The "researchers are cheating" explanation strikes me as a bit more plausible at this stage, given that there aren't many other obvious explanations for why ostensibly replicated papers are no better at replicating. As Kunert (2016) plainly puts it:

"This report suggests that, without widespread changes to psychological science, it will become difficult to distinguish it from informal observations, anecdotes and guess work."

This brings us to the matter of what might be done about the issue. There are procedural ways of attempting to address the problem - such as Kunert's (2016) recommendation for getting journals to publish papers independent of their results - but my focus has been, and continues to be, on the theoretical aspects of publication. Too many papers in psychology get published without any apparent need for the researchers to explain their findings in any meaningful sense; instead, they usually just restate and label their findings, or they posit some biologically implausible function for what they found (like, "X makes people feel good," or "self-control tasks are heavy metabolic drains"). Without the serious and consistent application of evolutionary theory to psychological research, implausible effects will continue to be published and subsequently fail to replicate because there's otherwise little way to tell whether a finding makes sense. By contrast, I find it plausible that unlikely effects can be more plainly spotted - by reviewers, readers, and replicators - if they are all couched within the same theoretical framework; even better, the problems in design can be more easily identified and rectified by considering the underlying functional logic, leading to productive future research.

References: Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-016-1030-9