Are Most Published Social Psychology Findings False?

Social psychology is in crisis. Is it all wrong?

Posted Feb 26, 2016

Social psychology is in crisis because no one knows what to believe anymore.  The journals are now filled with failed priming replication after failed priming replication.  (For lay readers, priming refers to the idea that making some concept, belief, attitude, or value salient can pervasively influence your subsequent perceptions and behaviors in ways that are entirely outside of your awareness – referred to as “automaticity” in social psych parlance.)  Priming studies once held great sway in social psychology because published studies showed amazing, world-changing, pervasive effects of priming.  Because priming often occurs outside of awareness, these studies seemed to show that, most of the time, people do not know why they are doing what they do.  Amazing! And if you think that is a straw claim, think back to The Unbearable Automaticity of Being (if you are a lay reader, just look it up on Google Scholar).

Priming elderly stereotypes supposedly led people to walk slowly.  Priming money supposedly led people to be less willing to help.  Exposing men to attractive women supposedly led to an increase in risk-taking and conspicuous consumption.  People were out of (their own) control!  Adopting strong assertive nonverbal positions (“power poses”) could supposedly improve your life by improving both your confidence and how people treat you.  But all of these findings, and far more, have proven sufficiently difficult to replicate that many scientists now consider them dubious at best. 

And the issues go well beyond failed replications of priming studies.  Stereotype threat research, widely interpreted as showing that “remove the threat, and black and white test scores are equal,” never actually showed any such thing.  Implicit prejudice research, widely interpreted as showing the existence of pervasive racial prejudice, has never shown that implicit association test scores supposedly reflecting prejudice (scores above 0) generally correspond to much discriminatory behavior (at least one study showed they correspond to egalitarian behavior).  Put differently, some of the most famous and most influential effects in social psychology, especially effects obtained within the last 20 years, have been called into question by failed replication after failed replication, and by revelations of questionable methodological, statistical, and interpretive practices.

And it gets worse before it gets better.

Part I: The (Ir?)replicability of Social Psychology

Some of the strongest evidence for the claim that "most social psych is false" comes from a single paper (Open Science Collaboration, 2015, published in Science) that examined research published in 2008 in several fields of psychology, including social psychology.

That paper was a multi-lab collaboration that attempted to replicate 52 studies published in two top social psych journals (the Journal of Personality and Social Psychology and Psychological Science).  What "counts" as a "successful replication" is itself not settled science, and neither is what counts as "evidence that the effect is real."  So the authors used multiple criteria.  Depending on the criterion, they found that between 25% and 43% of studies replicated or revealed a true effect.

So far, this sounds like "Most social psych findings are false" is on pretty safe grounds.  And it might be.  But I do not think that general conclusion is justified by this large scale replication study.

Part II: OSC 2015 is a Great Study, But Let's Not Overinterpret It

Here is the key thing that OSC did NOT do that renders the inference "most social psych findings are false" unjustified:

They did not identify a population of social psych studies (say, everything published since 1950, or 1970, or even 1990), randomly select studies from that population, and then attempt to replicate them.

Instead, they first restricted replication attempts to studies published in 2008.  Then they created subsamples of studies (e.g., the first 20 papers published in Psychological Science).  They then allowed their replication teams to select which papers to attempt to replicate. In general, by design, the last study in each multi-study report was selected for the replication attempt.  Beyond that, however, from the report published in Science, it is impossible to know how the replication teams selected which papers to replicate. It is possible that the teams disproportionately selected papers reporting studies they thought were unlikely to replicate (there is no way to know short of surveying the over 100 co-authors of those replications, which I have not done).  At minimum, this possibility cannot be ruled out.

Regardless, absent bona fide random sampling of studies over a long time period, no general conclusion about the replicability of social psych can be reached on the basis of this paper.  Hell, one cannot even reach clear conclusions about the replicability of social psych published in 2008 from this paper.  

Of course, these limitations do not mean social psych is on safe ground.  Nor do they mean the study's results are known to be unrepresentative of social psychology.  What the study certainly shows is that lots of stuff is getting published that is difficult to replicate.

Part III: Replication in Social Psychology is Hard Even When the Effect is Known to be True

Jon Krosnick is a social psychologist/political scientist at Stanford who is also internationally recognized as one of the premier survey researchers in the social sciences.  He once headed the American National Election Study, a nationally representative survey of political views that has been going on for decades, routinely appears in the NYTimes, and has received numerous awards for his work.

A few years ago, he collected survey data on almost 10,000 people.  A series of well-known survey effects (e.g., question order effects, acquiescence bias) reached statistical significance in this large sample.  Subsamples of about 500-1,000 people were then examined to determine how often the subsamples would show statistically significant evidence of the same effects.

Despite the fact that the phenomena under study were usually significant in the large sample, the subsamples yielded significant evidence of the effects only about half the time (analyses are still in progress, and the exact number of replications for each phenomenon is subject to change).  Even if the 50% “replication” number is only a ballpark figure pending final analyses, it speaks to the difficulties of replication, even with large samples, and even without any questionable research practices whatsoever.
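This pattern is exactly what statistical power predicts, and it is easy to see in a simulation.  The sketch below uses made-up numbers (a hypothetical true effect of Cohen's d ≈ 0.12, with per-group samples of 5,000 in the "full survey" and 500 in the subsamples); it is an illustration of the general phenomenon, not a reconstruction of Krosnick's actual data or analyses.

```python
import numpy as np

# Hypothetical numbers chosen to mimic the pattern described above:
# a small true effect that is decisively significant at survey scale,
# but that reaches p < .05 only about half the time in modest subsamples.
rng = np.random.default_rng(0)
d = 0.12                    # assumed true effect size (Cohen's d)
n_big, n_sub = 5000, 500    # per-group sizes: full survey vs. subsample

def significant(n, rng):
    """Simulate one two-group study; True if the effect reaches |z| > 1.96."""
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    return abs(b.mean() - a.mean()) / se > 1.96

full = np.mean([significant(n_big, rng) for _ in range(200)])
sub = np.mean([significant(n_sub, rng) for _ in range(2000)])
print(f"full-sample significance rate: {full:.2f}")   # essentially 1.0
print(f"subsample 'replication' rate:  {sub:.2f}")    # roughly one-half
```

Nothing questionable is happening in this simulation: the effect is perfectly real, and every "study" is run honestly.  Sampling variability alone produces a ~50% failure-to-replicate rate at the smaller sample size.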

That is, in some ways, good news.  It means that when smaller-sample studies replicate only 30% or 40% of the time, that is not necessarily evidence of rampant problematic practices.  It may simply be a testament to the large effects of sampling variability and of minor changes in context (e.g., being conducted in a different state or country) or procedure. And there is more good news.  At least with their large samples, Krosnick’s team's preliminary results suggest that, whether or not they found significant evidence of an effect, about 80% of the studies were not significantly different from one another.  Again, whether the final tally is 71% or 93% or 80%, that is a relatively high level of replication.

Why is this important?  It shows how the vagaries of sampling variability can make detecting even a true effect quite difficult. It also means that, perhaps, we need to reconsider our understanding of how frequently a finding needs to replicate for it to be credible, and how we can ever distinguish a credible finding from an incredible one.  Lots of scientists are working on just this issue and have developed whole new statistical tools for figuring out what is credible from what is not (p-curves, replication indices, statistical tests for identifying and controlling for publication biases, etc.).  Most of those methods are, however, sufficiently new that it will probably be a while before we know which work best.
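One of those tools, the p-curve, rests on a simple idea: among studies that reach p < .05, the p-values should be spread uniformly between 0 and .05 if there is no true effect, but should pile up at very small values if the effect is real.  Here is a minimal sketch of that logic; the effect sizes and sample sizes are invented for illustration, and this is the core idea only, not the full published p-curve procedure.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)

def frac_below_025(d, n_per_group, studies=20000):
    """Among significant (p < .05) two-sided z-tests, the fraction with p < .025."""
    # For a two-sample z-test, the test statistic is N(d * sqrt(n/2), 1).
    z = rng.normal(d * sqrt(n_per_group / 2), 1.0, studies)
    p = np.array([erfc(abs(zi) / sqrt(2)) for zi in z])  # two-sided p-value
    sig = p[p < 0.05]
    return (sig < 0.025).mean()

null_half = frac_below_025(0.0, 100)   # no true effect
real_half = frac_below_025(0.4, 100)   # assumed real effect (d = 0.4)
print(f"null effect: {null_half:.2f}")  # about 0.5: flat p-curve
print(f"real effect: {real_half:.2f}")  # well above 0.5: right-skewed p-curve
```

The asymmetry is the diagnostic: a literature whose significant p-values cluster just under .05 looks more like selective reporting than like a real effect.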

Part IV: The Replicability of Social Psychology

Some areas of social psychology are a mess, especially those involving “social priming” (see the references for links to articles discussing the various priming crises and failures to replicate).   I am not saying all such findings are false, but, with some rare exceptions, I do not know which social priming effects are credible and which are not.  Cognitive priming, by contrast, is not a mess.  There has long been excellent and easily replicable work on cognitive priming in cognitive psychology: after exposure to the word “black,” people more quickly recognize subsequent presentations of the word “black” (compared, e.g., to other words, such as “green” or “blasphemy”).

 In my lab, over 30 years, I have replicated each of the following phenomena:

  • Stereotypes bias how people judge an individual when people lack much information (other than stereotype category membership) about that individual
  • People judge individuals overwhelmingly on their personal characteristics -- e.g., their personality, accomplishments, behaviors, etc. -- and hardly at all on stereotypes, when such individuating information is available
  • Moderate to high levels of accuracy in many demographic stereotypes
  • Pervasive inaccuracy in national stereotypes when evaluated against big five personality self-report criteria
  • Teacher expectations produce self-fulfilling prophecies in the classroom -- but these effects tend to be weak, fragile, and fleeting (few other researchers would describe them this way, but when you look at the actual findings, this is pretty much what almost everyone has actually found).
  • Teacher expectations mostly predict student achievement because those expectations are accurate, not self-fulfilling.
  • Nonetheless, teacher expectations also bias teachers' evaluations of students to a modest degree.
  • Mortality salience increases anti-Semitism.
  • Self-consistency dominates cognitive reactions to performance feedback; self-enhancement dominates affective reactions to performance feedback
  • The fundamental attribution error
  • Self-serving biases
  • Politically motivated confirmation biases 

I did not discover these phenomena, so my replications constitute independent evidence that the phenomena are real.  However, none of these were direct replications; in modern parlance, all were conceptual replications.  Indeed, this distinction was itself not on my mind when I conducted those studies.  Twenty-five years ago (or 15, or even 5), no one was talking about direct versus conceptual replications.  I just took for granted that other research had found a phenomenon and went about seeing if I could find it, too, usually in the service of some other research effort (e.g., Rosenthal & Jacobson, 1968, demonstrated experimentally induced self-fulfilling prophecies; I wanted to see if expectations teachers developed on their own, without being misled by researchers, were also self-fulfilling – they were).  I often did reproduce others’ phenomena (most recently, we completed a successful conceptual replication of Jones and Harris’s 1967 pro-/anti-Castro speech correspondence bias study – but with sex stereotypes, rather than researcher requests, constraining behavior).  Now, most of these are not the "hot flashy topics" of the last 20 years: no priming, no implicit prejudice, no power posing, no stereotype threat.  Many, though not all, of these findings are accompanied by quite large effect sizes (which was one of the predictors of replication success in the OSC, 2015 paper).

That is just in my lab.  Counting only stuff I know of from other folks, that has been replicated in more than one independent lab:

  • Jon Haidt's moral foundations replicate. 
  • Similarity-attraction is very powerful. 
  • Rightwing prejudice against leftwing groups and leftwing prejudice against rightwing groups repeatedly replicates.
  • Exaggeration of political stereotypes replicates. 
  • Prejudice (disliking/liking a group) usually predicts all sorts of biases more strongly than do stereotypes (beliefs about groups). 
  • Above chance accuracy in person perception based on thin slices of behavior replicates. 
  • Kahneman & Tversky-like heuristics mostly replicate. 
  • Ingroup biases replicate most of the time.
  • Self-serving self-evaluations of competence, morality, and health replicate. 
  • In person perception, people seek diagnostic information more than confirmatory information in just about every study that has ever given people the chance to seek diagnostic information. 

As long as one is talking about technical results, rather than widespread overinterpretations of such results:

  • Racial IAT scores greater than zero widely replicate.
  • Conservatives routinely score higher on common measures of rigidity and dogmatism than do liberals.
  • Race/ethnicity and class differences in academic achievement abound. 

I am sure there are many more that I have not listed. 

Many findings are easy to replicate. 

On the other hand, this is no random sample of topics either.  It would not be justified to conclude from my personal experience, or from this off-the-top-of-the-head list, that, in fact, social psych is just fine, thank you very much.  And the problems go way beyond replication, but that is a missive for another day.

How will we figure out what, from the vast storehouse of nearly a century of social psychological research, is actually valid and believable?  How can we distinguish dramatic, world-changing results that are just hype, terrific story-telling, p-hacked results, wishful thinking and, ultimately, snake oil, from dramatic world-changing results that we can really hang our hats on and go out and change the world with?  No one really knows yet, and anyone who claims they do, without having subjected their claims to skeptical tests such as p-curves, replication indices, and pre-registered replication attempts, is just selling you repackaged snake oil. 

To me, there is a single, crucial ingredient for figuring this out: Diversity of viewpoints and deep skepticism of one another’s claims.  When answers are not settled science – and much of our science is currently unsettled – diversity and skepticism are essential tools for ferreting out truth from hype, signal from noise, real world-changing results from snake oil. 

Groupthink, and deference to scientific “authorities” and to oft-repeated “scientific” stories resting on empirical feet of unclear firmness, pose a significant threat to the validity of social psychology.  Big doses of humility and uncertainty, at least with respect to our claims about social psychology, seem to be in order.  In that spirit, we are probably best off eschewing extreme claims, including “most social psychology findings are false,” unless we know they have extremely strong foundations of scientific support. 

Who knew that Mark Twain was a scientist?  “It ain’t what you don’t know that gets you in trouble.  It’s what you know for sure that just ain’t so.” 

References

Jones, E. E., & Harris, V. A. (1967).  The attribution of attitudes.  Journal of Experimental Social Psychology, 3, 1-24.

Krosnick, J. A. (2015).  Replication.  Talk presented at the meeting of the Society for Personality and Social Psychology.

Loeb, A. (2014).  Benefits of diversity.  Nature Physics, 10, 616-617.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi: 10.1126/science.aac4716

Rosenthal, R., & Jacobson, L. (1968).  Pygmalion in the classroom: Teacher expectations and student intellectual development.  New York: Holt, Rinehart, and Winston.

Easy to Access On Line Resources on Problematic Priming and Other Difficult to Replicate Studies

Recent priming failures

Valid and invalid priming effects

An early failed priming replication

Unicorns of Social Psychology

Social Psychological Unicorns: Do Failed Replications Dispel Scientific Myths?

Is Power Posing Just Hype?