Replication Problems in Psychology
Are replication failures in psychology a crisis or a 'tempest in a teapot'?
Posted Nov 15, 2015
In late August 2015 an article appeared in the New York Times with a loaded headline: Many Psychology Findings Not as Strong as Claimed, Study Says. The article reported on a recent publication in the journal Science, which raised important questions about the extent to which findings in psychological research are replicable. This is an important issue, since replicability is considered a hallmark of the scientific method. If findings cannot be replicated, the trustworthiness of the research becomes suspect. The publication in Science outlined the findings of a major project that had attempted to replicate 100 studies that had been published in major psychology journals in recent years. Led by Brian Nosek of the University of Virginia, the research was conducted by a network of research teams (there were 270 contributing authors to the publication) that had coordinated their efforts to replicate studies drawn from three top-tier psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. They found that just over one third of the studies could be replicated. As one could imagine, this study, designated as The Reproducibility Project, has caused quite a stir in both the professional world and the media.
After its publication, the September issue of American Psychologist, the flagship publication of the American Psychological Association, was devoted to what has been termed the “replication crisis” in psychology. Some of the articles in this issue attempted to dismiss or minimize the significance of The Reproducibility Project, while others attempted to identify factors endemic to the field of psychology which may contribute to the problem and suggested steps that can potentially be taken to address it moving forward.
The field of psychology has been aware of this problem for some time. In the mid-1970s Lee J. Cronbach, a prominent methodologist, noted the general tendency for effects in psychology to decay over time. At the same time, Michael Mahoney, a prominent cognitive behavioral theorist and researcher, drew attention to this problem in a more emphatic way. As he put it, “The average physical scientist would probably shudder at the number of social science ‘facts’ which rest on unreplicated research.” More recently, cognitive psychologist Jonathan Schooler garnered considerable attention with a publication in the high-profile journal Nature in which he described the ubiquity of the replication problem in psychology, and outlined some of the factors potentially responsible for it.
It is important to bear in mind that conversations about whether there is indeed a replication crisis in psychology take place within the broader context of an ongoing conversation about the cultural boundaries of science. Efforts to establish the boundaries of demarcation between the “sciences” and nonsciences have been of longstanding interest to philosophers, and increasingly to sociologists, anthropologists, and historians as well.
Within the field of psychology, conversations of this sort can arouse intense controversy and heated discussion. This is not surprising, given the fact that whether or not a given field of inquiry is considered to be a science has significant implications for the designation of epistemic authority in our culture. This in turn has immense implications for social prestige, power, credibility, and the allocation of resources, all of which shape research agendas and ultimately influence shared cultural assumptions.
Whether or not psychology (and other social sciences such as anthropology, sociology and economics) should be viewed as belonging in the same category as natural sciences such as physics, chemistry and biology can be, and has been, argued on various grounds including methodology, explanatory power, predictive power, and the ability to generate useful applications. But the bottom line is that these cultural boundaries shift over time, and at times become the focus of considerable controversy.
In the aftermath of World War II, the question of whether the social sciences should be included along with the natural sciences within the nascent National Science Foundation (NSF) was the topic of sustained debate. Proponents of including the social sciences argued that they share a common methodology with the natural sciences and that, although they have less predictive power, they can nevertheless lead to trustworthy developments in knowledge with important applications. Opponents argued that the social sciences have no advantages over common sense, that their inclusion would lead to a tarnishing of the public’s view of the natural sciences, and that unlike the natural sciences, which at the time were portrayed as “objective,” the objectivity of the social sciences is compromised by their inevitable entanglement with human values. Ultimately a decision was made not to completely exclude social sciences from the NSF, but to include them under a miscellaneous “other sciences” category (in contrast to sciences such as physics and chemistry which were designated by name).
Then in the 1960s a number of social science proponents argued vigorously for the establishment of a National Social Science Foundation (NSSF) to house the social sciences. At this time they argued that the social sciences do indeed differ from the natural sciences in important ways, and warrant dedicated funding, so that they don’t have to vie with the natural sciences for resources. An important concern at the time among those arguing for the creation of an NSSF was that the social sciences would always be treated as “second class citizens” within the NSF. In contrast to the argument for inclusion upon the initial establishment of the NSF, defenders of social sciences now argued that the social and natural sciences use different methods and that the social sciences need to be evaluated by different criteria in order to fully develop. Ultimately Congress decided not to establish a National Social Science Foundation, but the social sciences did receive increased attention within the NSF for a period of time.
Psychology has had a particularly strong investment in defining itself as a science that is similar in important respects to the natural sciences. Perhaps this is in part a function of its past successes in attracting government funding on this basis. Because of this, psychology tends to be somewhat vulnerable to what one might call “epistemic insecurity” when articles such as “Many Psychology Findings Not as Strong as Claimed, Study Says” appear in the New York Times. I suspect that the media will lose interest in this topic pretty quickly, and that the recent publication of a special issue of American Psychologist devoted to it will turn out to have been the peak of interest within the field of psychology. Be that as it may, I will devote some time in this essay to discussing factors contributing to the “replication problem” in my own field of research (i.e., psychotherapy research), as well as the way in which this theme is intertwined with the politics of funding within the healthcare system. I will then discuss factors contributing to replication failures in the field of psychology in general, with a particular emphasis on important discrepancies between the way in which research is represented in the published literature, and the realities of research as practiced on the ground. And finally I will argue that rather than simply striving to bring the everyday practice of psychology research closer in line with the idealized portrayal of psychology research that is common, there is potentially much to be learned from studying the way in which impactful research in psychology is actually practiced.
In the psychotherapy research field, the replication problem has been widely discussed for years, with some investigators ignoring it, others dismissing its significance, and others making serious efforts to grapple with its implications. One of the key ways in which the failure to replicate plays out in the psychotherapy research field is in the form of a phenomenon which is commonly referred to as the “therapeutic equivalence paradox” or the “dodo bird verdict,” alluding to an incident from Lewis Carroll’s Alice in Wonderland in which the Dodo Bird decrees “Everybody has won and all must have prizes.” In psychotherapy research, the “dodo bird verdict” refers to the finding that despite ongoing claims of proponents of different therapy schools regarding the superiority of their respective approaches, systematic and rigorous syntheses of large numbers of clinical trials conducted by different investigators over time fail to find that any one therapeutic approach is consistently more effective than others.
Those who ignore this phenomenon or dismiss its relevance tend to be proponents of the so-called evidence-based treatment approach. If one has a stake in claiming that a particular form of psychotherapy is more effective than others, the data consistent with the dodo bird verdict become a nuisance to be ignored, or worse, a threat to one’s funding. On the other hand, the dodo bird verdict is taken more seriously by those who 1) have an investment in arguing for the value of a therapeutic approach that is supposedly not “evidence based” (e.g., psychodynamic therapies), 2) take the position that all therapies work through common factors, or 3) believe that the effects of therapy are a function of the interaction between each unique therapist-patient dyad. The situation is complicated further by the fact that many of those who recognize the validity of the dodo bird verdict (e.g., colleagues of mine who are psychodynamic researchers) will conduct clinical trials evaluating the effectiveness of psychodynamic treatment, in an effort to have it recognized as effective by the field as a whole, where “evidence-based” is more or less a synonym for “funded.” To further complicate the picture, a number of prominent psychotherapy researchers in the 1970s, who were funded at the time by the National Institute of Mental Health, recognized the existence of the “therapeutic equivalence paradox,” and believed that clinical trials in psychotherapy research would be of limited value in advancing the field, arguing instead for the value of investigating how change takes place.
Nevertheless, with the rise of biological psychiatry in the late 1970s, and the looming threat of the government and healthcare insurers deciding that psychotherapy was of no value, they decided that it was worth putting what little influence they had left into launching the most expensive program of research that had ever taken place, evaluating the relative effectiveness of two forms of short-term psychotherapy versus antidepressant medication for the treatment of depression. Not only did this study ultimately demonstrate that these therapies were as effective as medication, it established the clinical trials method derived from pharmaceutical research as the standard for all psychotherapy research that would be fundable moving forward. Thus the stage was set for mainstream psychotherapy research valuing randomized clinical trials as the “gold standard” in methodology, despite the fact that prominent psychotherapy researchers had been making the case for some time that they were of limited value for purposes of genuinely advancing knowledge in the field.
In the mid-1970s, Lester Luborsky, a prominent psychodynamic researcher at the time, reanalyzed the aggregated results of a number of randomized clinical trials comparing the effectiveness of different forms of psychotherapy. This time, he used a statistical procedure to estimate how much impact the theoretical allegiance of the investigator has on the outcome of the study. He found that the investigator’s theoretical allegiance has a massive impact on treatment outcome, an impact that dwarfs the degree of impact attributable to the brand of therapy. Since that time, Luborsky’s findings have been replicated sufficiently often that they are beyond dispute. What accounts for the researcher allegiance effect? Although potential misrepresentation of findings may take place in some instances, there are a large number of other variables that are likely to be more common.
One factor is that psychotherapy researchers tend to select treatment outcome measures that reflect their understanding of what change should look like, and this understanding is shaped by different worldviews. Another is that most investigators understandably have an investment in demonstrating that their preferred approach (in some cases an approach that they have played a role in developing) is effective. This investment is likely to influence the outcome of the study in a variety of ways. For one thing, there is a phenomenon that can be called the ‘home team advantage.’ If an investigator believes in the value of the approach they are testing, this belief is likely to have an impact on the enthusiasm and confidence of the therapists who are implementing this approach in the study. In many cases the effectiveness of the ‘home team’ treatment is evaluated relative to the effectiveness of a treatment intentionally designed to be less effective. One consequence of the ‘home team advantage’ is that a replication study carried out by a different team with different theoretical commitments is likely to fail to replicate the findings of the first team, and may in fact come up with findings that are completely contradictory.
Psychology Research in General
Moving beyond the specifics of psychotherapy research to the field of psychology in general, there are a number of factors potentially contributing to the replication problem (no doubt some of the factors are relevant to other fields as well, but that is not the focus of this brief essay). The first can be referred to as the “originality bias.” In practice, if not in theory, controlled replications are considered to be one of the lowest priorities in the field. Thus straightforward replications are less likely to be accepted for publication in important journals, where there is a tendency for reviewers to dismiss them as “mere replications.” Psychology researchers learn this early on in graduate school and are thus less likely to conduct replication studies. When they do conduct replications they are apt to modify the design, so that the study has the potential of adding a “new wrinkle” on the topic. Studies of this type are thus unlikely to be exact replications.
Another factor is that studies that do not yield statistically significant findings are less likely to be accepted for publication by reviewers, and are thus less likely to be submitted for publication. When reviews of the literature are conducted that include unpublished Ph.D. dissertations, conclusions drawn on the basis of the published research alone tend to disappear. Michael Mahoney conducted a provocative study in which he sent out 75 manuscripts to reviewers from well-known psychology journals. All versions of the manuscript were anonymously authored and contained identical introductions, methodology sections and bibliographies. The manuscripts were, however, varied in terms of whether the results were significant, mixed or nonsignificant. His findings were striking. Manuscripts with significant results were consistently accepted. Manuscripts with nonsignificant or mixed findings were consistently rejected.
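To make the consequence of this filter concrete, here is a minimal simulation sketch in Python. It is an illustration only, and every number in it is an assumption chosen for the example rather than a figure from any study discussed in this essay: a small true effect of 0.2 standard deviations, 20 subjects per group, 5,000 hypothetical studies, and a simple two-sided z-test that treats the population standard deviation as known. A study counts as “published” only if it reaches significance.

```python
import math
import random
import statistics


def simulate_study(true_effect, n_per_group, rng):
    """Return the observed mean difference between two groups (sd = 1)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return statistics.fmean(treated) - statistics.fmean(control)


def is_significant(observed_diff, n_per_group, z_crit=1.96):
    """Two-sided z-test, treating the population sd as known (= 1)."""
    standard_error = math.sqrt(2.0 / n_per_group)
    return abs(observed_diff / standard_error) > z_crit


def run_literature(true_effect=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Simulate many studies; 'publish' only the significant ones."""
    rng = random.Random(seed)
    all_effects = [
        simulate_study(true_effect, n_per_group, rng) for _ in range(n_studies)
    ]
    published = [d for d in all_effects if is_significant(d, n_per_group)]
    return all_effects, published


all_effects, published = run_literature()
print(f"studies run:         {len(all_effects)}")
print(f"studies 'published': {len(published)}")
print(f"mean effect, all studies:       {statistics.fmean(all_effects):.2f}")
print(f"mean effect, published studies: {statistics.fmean(published):.2f}")
```

Under these assumptions only a minority of studies clear the significance bar, and the ones that do overstate the true effect, because a small study can only reach significance by drawing an unusually large difference. A faithful replication of such a “published” finding would therefore be expected to come in weaker than the original, even if nobody did anything wrong.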
A third factor is that there are important contrasts between codified formal prescriptions regarding the way research should be practiced in psychology and the reality of research as it is conducted in the real world. In graduate school, psychology students are taught that the scientific method consists of spelling out hypotheses and then conducting research to test them. In the real world researchers often modify their hypotheses in a fashion that is informed by the findings that are emerging. One of the articles in the recent issue of American Psychologist devoted to the “replication crisis” suggested that one way of remedying this problem would be for funding agencies to require that all principal investigators register their hypotheses with the funding agency before they begin collecting data. In a recent conversation on this topic, a colleague of mine who dismisses the significance of the “replication crisis” remarked that the requirement to register hypotheses in advance would be problematic, since “we all know that some of the more creative aspects of research involve modifying and refining hypotheses in light of the data that emerges (or something to that effect).” I am in complete agreement with him. My concern is that this type of post hoc attempt to find meaning in the data when one’s initial hypotheses are not supported is an implicit aspect of psychology research in the real world, rather than the formal position that is advocated. This is not as scandalous as it might seem. Abduction, the reasoning process through which theories are formulated to fit observed patterns, plays an important role in the natural sciences as well. If the field of psychology were to formally place more emphasis on the value of abduction, requiring investigators to register their hypotheses in advance might be less problematic.
Post hoc efforts to make sense of unexpected findings could potentially be considered as legitimate and interesting as findings that support one’s a priori hypotheses. There would thus be less incentive for investigators to report their research in a way that appears to support their initial hypotheses.
Another common practice in psychology research is referred to as “data mining.” Data mining involves the process of analyzing the data in every conceivable way, until significant findings emerge. In some cases this can involve examining aspects of the data that one had not initially planned to examine. In other cases it involves exploring the use of a range of different statistical procedures until significant results emerge. Once again, textbooks in psychology research methodology teach that data mining or “going on a fishing expedition” is not acceptable practice. There is, however, nothing inherently wrong with data mining. The reality of psychology research as it is practiced on the ground is that it does not take place in the linear fashion in which it is often portrayed in the published literature. Data are collected and analyzed, and researchers try to make sense of the data as they analyze it. This process often helps them to refine their thinking about the phenomenon of interest. This practice becomes problematic when researchers selectively report on all that has taken place between the data collection process and the final publication. But of course there are good reasons for selective reporting: clean stories are more compelling and easier to publish.
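The arithmetic behind the concern about unreported data mining can be sketched in a few lines of Python. This is a simplified illustration under assumptions of my own choosing, not a model of any particular study: every test is run on pure noise (no real effect exists anywhere), the tests are treated as statistically independent, and the conventional 5% significance threshold is used. In real data sets the measures are usually correlated, so the exact figures would differ.

```python
import random


def familywise_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_mining(n_tests, n_studies=20000, z_crit=1.96, seed=7):
    """Fraction of noise-only studies in which some test comes out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each test statistic is pure noise: no real effect exists anywhere.
        if any(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


for n_tests in (1, 5, 20):
    print(
        f"{n_tests:2d} analyses: "
        f"formula {familywise_rate(n_tests):.3f}, "
        f"simulated {simulate_mining(n_tests):.3f}"
    )
```

The formula and the simulation agree that the chance of finding at least one “significant” result rises steeply with the number of analyses tried. A reader who is told about all of the analyses can discount the result accordingly; a reader who sees only the one that worked cannot.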
Another standard practice consists of conducting pilot research. Pilot research is the trial and error process through which the investigator develops important aspects of his or her methodological procedure. One important aspect of this pilot phase entails experimenting with different ways of (what is termed) implementing the experimental manipulation (i.e., the conditions that the subjects are exposed to) until one is able to consistently demonstrate the phenomenon of interest. This type of “stage management” is part of the art of psychology research. Is such “stage management” inherently problematic? Not necessarily. What is problematic is that publications don’t as a rule describe the pilot work that led to the development of the effective experimental manipulation.
Vividness and the compelling demonstration of phenomena
An important aspect of the skill of psychological research consists of devising creative procedures for demonstrating phenomena. While psychology research is not performance art, a key element in whether or not a particular piece of research has an impact on the field is the vividness or memorability of the study. Two of the more influential and widely known experiments in the history of modern psychology exemplify this principle: Stanley Milgram’s classic research on obedience to authority, and Harry Harlow’s demonstration of the fundamental nature of the need for “contact comfort” in baby rhesus monkeys.
Milgram conducted his research in the context of the Eichmann trials and Hannah Arendt’s publication of Eichmann in Jerusalem: A Report on the Banality of Evil. He set about designing an experiment to demonstrate that given the right context, the average American could be manipulated to act in a cruel and inhumane way out of deference to an authority figure. Milgram worked with his research team to stage an elaborate deception in which subjects were recruited to participate in what they were told was a study on learning.
Upon arriving at the lab, subjects were assigned the role of “teacher” (supposedly on a random basis) and paired with another subject whom they were led to believe was randomly assigned to the role of the “student.” In reality, the so-called students were research confederates working with Milgram to stage the deception. The real subjects, who were always assigned to the role of the teacher, were instructed by an “experimenter” to administer electric shocks when the student gave incorrect answers, as a way of investigating whether punishment can facilitate the learning process. The “experimenter” wore a white lab coat, designed to give him the air of authority, and stood beside the subjects, instructing them when to administer the shocks and what voltage to use. The set-up was rigged so that students made ongoing mistakes. With each repeated mistake the “experimenter” instructed the “teacher” to increase the intensity of the shock, until they were administering potentially harmful levels of shock that left “students” screaming in pain. The published results of the research reported that over 60% of subjects actually cooperated with the “experimenter” to the point at which they were administering painful and potentially harmful levels of shock.
Archival research reveals that Milgram spent a tremendous amount of time piloting variations in the experimental procedure, until he was able to find one that produced the desired effect. It turns out that Milgram also conducted studies employing a range of variations on the experimental studies, and in some conditions, the proportion of subjects complying with the experimenter was considerably lower than it was in his publications. In interviews, Milgram’s subjects revealed that they construed the experimental situation in a variety of different ways. Some believed that they really had inflicted painful shocks and were genuinely traumatized. Others were skeptical and simply “played along.” None of these details were revealed in Milgram’s published papers (or the book he finally authored).
Considerable controversy erupted following the publication of Milgram’s dramatic results, with much of it focusing on whether or not it was ethical to deceive subjects in this fashion. In the aftermath of this controversy, new ethics policies would make precise replications of Milgram’s research paradigm impossible in the future. At the time there was also considerable controversy regarding the interpretation, meaning, and generalizability of his results. Despite this controversy, Milgram’s experiment had a lasting effect on public perception and remains one of the most well-known experiments in the history of psychology. Milgram’s notes reveal that he took great care in developing a compelling documentary of his research, carefully choosing footage of subjects and “experimenters” that he believed would be most impactful, and employing directorial and editing strategies to maximize the documentary’s dramatic impact. And to great effect: the image of Milgram’s “shock apparatus” and the compliant “teacher” inflicting painful levels of shock on a screaming “student” has become a fixed element in popular culture.
The compelling impact of vivid demonstrations can also be seen in Harry Harlow’s influential research of the 1950s investigating the nature of the factors underlying the infant-caregiver bond. At the time the dominant theory in American psychology was the behavioral notion that the baby’s affection for his or her mother is learned because she provides food to satisfy the child’s hunger. Harlow was interested essentially in demonstrating that the origins of love cannot be reduced to learning via association. In order to demonstrate this, he separated baby rhesus monkeys from their mothers and then put them in a room with two “surrogate mothers” constructed out of wire and wood. The first wire mother held a bottle of milk that the baby monkey could drink from. The other wire mother held no milk but was covered with terry cloth. Harlow found that the baby monkeys consistently preferred to spend time with the terry cloth mother with no milk, only quickly feeding from the bare wire mother before seeking comfort and cuddling for hours with the cloth mother. He argued on this basis that the need for closeness and affection (what he technically referred to as ‘contact comfort’) cannot be explained simply on the basis of learning to associate the caregiver with nourishment. Although Harlow’s research can hardly be considered definitive on logical grounds, it had a massive impact in the field and is still one of the more well-known studies in psychology. Like Milgram’s shock machine, Harlow’s baby monkey clinging to the terry cloth mother is etched into public consciousness.
Psychology and Science
Prior to the 1960s, attempts to understand how science works were primarily considered to be within the domain of the philosophy of science. Since Thomas Kuhn’s classic publication of The structure of scientific revolutions, it has become increasingly apparent that in the natural sciences there is a substantial discrepancy between any type of purely philosophical reconstruction of how science takes place and the reality of scientific practice in the real world. Thus there is a general recognition that the practice of science can and should be studied in the same way that science itself studies other areas. This naturalistic approach to understanding the nature of science has been contributed to in significant ways by historians, anthropologists, sociologists, and philosophers. Psychologists have been conspicuously absent in this field of study, with one important exception. Michael Mahoney (whose work I mentioned earlier in this essay) published a book on the topic in 1976 titled Scientist as subject: The psychological imperative.
There is an emerging consensus in both contemporary philosophy and in the field of science studies, that the practice of science is best understood as an ongoing conversation between members of a scientific community who attempt to persuade one another of the validity of their positions. Evidence plays an important role in this conversation, but this evidence is always subject to interpretation. The data do not “speak” for themselves, but are viewed from the perspective of a particular lens to begin with and are woven into narratives that stake out positions. 
Research methodology in mainstream psychology is shaped, broadly speaking, by a combination of neo-positivist and falsificationist philosophies that were developed prior to the naturalistic turn in science studies. In this essay I have been making the case that some of the more important aspects of research activity in psychology take place behind the scenes, and consequently are not part of the published record. Researchers in psychology mine their data to search for interesting patterns, they experiment with trial and error procedures in order to produce compelling demonstrations of phenomena, and they selectively ignore inconvenient findings in order to make their cases. The idea that formulating hypotheses and collecting data take place in a linear and sequential fashion is an idealized portrayal of what happens in psychology research, quite distinct from what happens on the ground. In practice, data analysis and theory formulation are more intertwined in nature.
I hope it is clear that I am not arguing that the recent evidence regarding the “replication problem” in psychology is a serious blow to the field, nor am I arguing against the “scientific method” in psychology. I am arguing, rather, that it is important for psychologists to adopt a reflexive stance on the field that encourages them to study the way in which research really takes place, as well as those factors, oftentimes operating underground, that contribute to important developments in the field. Whether or not psychology is classified as a “science” has more to do with the politics of credibility claims than with the question of what is most likely to advance knowledge in the field. Unfortunately, some prevailing disciplinary standards actually obstruct the momentum of the field: if we produce literature that reinforces an outmoded idealized notion of what science is, we prevent its becoming what it can be. Educating psychology graduate students in post-Kuhnian developments in the philosophy of science as well as contemporary naturalistic science studies (both ethnographic and historical) is every bit as important as teaching them conventional research methodology, and I believe that future psychologists will need backgrounds in both if they are going to have a progressive impact on the development of mainstream psychology.
*This essay was initially published in Public Seminar.
Mahoney, M.J. (1976). Scientist as subject: The psychological imperative. Cambridge, MA: Ballinger.
Much of this discussion about the history of the social sciences in relationship to the National Science Foundation is based on Gieryn, T.F. (1999). The cultural boundaries of science: Credibility on the line. Chicago: University of Chicago Press.
Perry, G. (2012). Behind the shock machine: The untold story of the notorious Milgram psychology experiments. New York: Scribe Publications.
Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Paris, France: La Découverte.
Bernstein, R.J. (1983). Beyond objectivism and relativism: Science, hermeneutics, and praxis. Philadelphia: University of Pennsylvania Press.
Callebaut, W. (1993). Taking the naturalistic turn on how science is done. Chicago: University of Chicago Press.
Shapin, S. (2008). The scientific life: A moral history of a late modern vocation. Chicago: University of Chicago Press.
Godfrey-Smith, P. (2003). Theory and reality: An introduction to the philosophy of science. Chicago: University of Chicago Press.
Laudan, L. (1996). Beyond positivism and relativism: Theory, method, and evidence. Boulder, CO: Westview.