Evaluating Psychology Research

Many famous psychological studies cannot be reproduced.

Posted Aug 08, 2018

Studies in psychology often find different results. Even in fields like medicine, where one might expect a direct relationship between the intervention being tested and its effects, results can vary.

Stanford Prison Experiment. Source: Wikimedia Commons

For example, one study found that drinking one glass of orange juice a day could increase a person’s risk of getting Type 2 diabetes by 18 percent. Researchers at the University of California, Davis, however, found that drinking 100 percent juice reduced the risk for several chronic diseases, including cancer.

White Marshmallows. Source: Wikimedia Commons

But many think the situation is worse in psychology.

A recent New York Times article mentions several famous psychology studies of human behavior that cannot be reproduced. These include the Stanford Prison Experiment, which showed how people role-playing as guards quickly acted cruelly toward mock prisoners, and the "marshmallow test," which showed that young children who could delay gratification demonstrated greater educational achievement years later than those who could not.

Why do research results vary and fail to replicate? 

The relationship between an intervention and its effects may depend on many factors. And differences in context or implementation can have a large impact on a study's results. There are other reasons that studies might report different effects: Chance errors could affect a study’s results. Researchers may also consciously or inadvertently sway their results.

All these sources of variability have led to fears of a “replication crisis” in psychology and other social sciences. Given this concern, how should we evaluate psychology and social science research? 

The first rule of thumb is to not rely solely on any one study. If possible, review meta-analyses or systematic reviews that combine the results from multiple studies. Meta-analyses can provide more credible evidence and can suggest reasons why results differ.

A meta-analysis is a statistical analysis that combines the results of multiple research studies. The basic principle behind meta-analysis is that there is a common truth behind all conceptually similar research studies, but each individual study measures that truth with some error. The aim is to use statistics to get a pooled estimate that is as close as possible to the unknown common truth. A meta-analysis, then, yields a weighted average of the results of all the individual studies.
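The weighted average can be made concrete with a small sketch. The example below assumes the simplest common scheme, a fixed-effect model that weights each study by the inverse of its variance, so that more precise studies count for more. The function name and the three effect sizes and standard errors are hypothetical, chosen only for illustration; real meta-analyses often use more elaborate models.

```python
import math


def pooled_estimate(effects, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    # Each study is weighted by 1 / variance, so precise studies count more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect sizes and standard errors from three studies.
effects = [0.50, 0.20, 0.30]
std_errors = [0.25, 0.10, 0.10]

pooled, pooled_se = pooled_estimate(effects, std_errors)
print(f"pooled effect: {pooled:.3f}, standard error: {pooled_se:.3f}")
```

In this made-up case, the small, imprecise study reporting the largest effect pulls the pooled estimate up only slightly, and the pooled standard error is smaller than that of any single study.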

Aside from providing an estimate of the unknown common truth, a meta-analysis can also contrast results from different studies and identify patterns among them. It can identify sources of disagreement among these results, as well as other interesting relationships that emerge only in the context of multiple studies. A key benefit of the meta-analytic approach is that aggregating information yields higher statistical power and a more robust point estimate than is possible from any individual study.
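One common way to quantify disagreement among studies is Cochran's Q together with the I² statistic, which estimates the share of variation across studies that is due to real differences rather than chance. The sketch below is an illustration under the same fixed-effect assumption, not a description of any particular published analysis. The function name and the input numbers are hypothetical.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q and I-squared for a set of study results."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each study from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared is the fraction of Q in excess of what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i_squared


# Hypothetical studies that agree closely, then two that clearly disagree.
print(heterogeneity([0.30, 0.30, 0.30], [0.10, 0.10, 0.10]))
print(heterogeneity([0.00, 1.00], [0.10, 0.10]))
```

An I² near zero suggests the studies are telling a consistent story, while a value near one suggests that the differences among them need explaining.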

Still, there are some limitations of the meta-analytic approach to consider. The researcher must make choices about which studies to include (e.g., only published studies), and those choices can affect the results of the meta-analysis. The researcher must decide how to search for the studies. And the researcher must decide how to deal with incomplete data, analyze the data, and account for publication bias.

Sometimes, however, we want to evaluate a single psychology study. So how should we go about that? When considering how much weight to give to a study and its results, focus on sample size. Studies are more likely to fail to replicate if they used small samples. The most extreme results, whether positive or negative, often come from the studies with the smallest samples or widest confidence intervals. Smaller studies are more likely to fail to replicate in part because of chance, but effects may also shrink as sample size increases, for several reasons. If the study was testing an intervention, there may be capacity constraints that prevent high-quality implementation at scale. Smaller studies also often target the specific sample that would yield the biggest effects.
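The link between sample size and the width of a confidence interval can be shown with a short calculation. The sketch below assumes the simplest textbook case: estimating a mean from n independent observations with a known standard deviation, using the normal approximation for a 95 percent interval. The function name and the numbers are hypothetical.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    # The standard error of the mean shrinks with the square root of n.
    return z * sd / math.sqrt(n)


# Hypothetical outcome measured in standard deviation units (sd = 1.0).
for n in (25, 100, 400):
    print(f"n = {n:3d}: estimate +/- {ci_half_width(1.0, n):.3f}")
```

Under these assumptions, quadrupling the sample only halves the interval, which is why a study of 25 people leaves so much more room for chance than a study of 400.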

To see why small studies often target the most responsive sample, suppose you have an expensive diversity educational program that you can offer to only a limited number of students. You might run it in a single class and select the students who could benefit from it the most. That means the effect would likely be smaller if you implemented the diversity education in a larger group. More generally, it can be helpful to think about what might be different if the educational program were scaled up. For example, small diversity educational programs are unlikely to affect the broader institution, community, or society. But if scaled up, the institutional, community, or societal culture might change in response.

Similarly, consider specific features of the sample, context, and implementation. How did the researchers come to study this particular diversity educational program, at this institution and with these students? Would you expect this sample to have performed better or worse than the sample you are interested in? For example, if I were interested in testing the outcome of the teaching method I use in my web conference course, the Psychology of Diversity at Harvard Summer School, the setting and format (e.g., Harvard Summer School, web conference, campus) could have affected the results, too. Was there anything unique about the setting and format that could have made the results larger?

If the study was evaluating a diversity educational course, how that course was implemented is important, too. For example, suppose you hear that a web conference course on diversity can improve students’ feelings of belonging and inclusion. If you were considering implementing a similar course, you would probably want to know the format of the web conference course, the course content, and the training of the teaching staff in order to gauge whether you might get different results.

You may also have more confidence in the results of a study if there is some clear mechanism that explains the findings and is constant across settings. Some results in behavioral economics, for instance, suggest that certain rules of human behavior are hardwired. But these mechanisms can be difficult to uncover. And many experiments in behavioral economics that initially seemed to reflect a hardwired rule have failed to replicate, such as the finding that happiness increases patience and learning.

Still, if there is a convincing reason to expect the results that a study has found, or a strong theoretical reason to expect a specific result to generalize, that should lead us to trust the results from a single study a little more. Even so, we should take care to examine why we think the reason is convincing.

Finally, if a result appears too good to be true, it probably is. This is based on a principle from Bayesian statistics: Stranger assertions should require stronger evidence in order to change one’s “priors,” or beliefs. If we take our beliefs seriously (and there is reason to conclude that, on average, humans are fairly good at making predictions), then results that seem improbable actually are less likely to be true.
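The Bayesian principle can be illustrated with a worked example. The sketch below applies Bayes' rule to a single statistically significant finding. It assumes a hypothetical study with 80 percent power (the chance of detecting a true effect) and a 5 percent false positive rate. These values, the function name, and the two prior probabilities are illustrative assumptions, not figures from any particular study.

```python
def posterior_true(prior, power=0.8, alpha=0.05):
    """Probability a claim is true given one significant result (Bayes' rule)."""
    true_positive = power * prior            # claim is true and the study detects it
    false_positive = alpha * (1.0 - prior)   # claim is false but the study "finds" it
    return true_positive / (true_positive + false_positive)


# A plausible claim (50% prior) versus a surprising one (5% prior).
for prior in (0.50, 0.05):
    print(f"prior {prior:.2f} -> posterior {posterior_true(prior):.2f}")
```

Under these assumptions, the same evidence leaves a plausible claim very likely to be true, while a surprising claim remains more likely false than true after one positive study.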

In conclusion, all psychology research is subject to error, and hence the results may vary and fail to replicate. It is far better to be aware of this than to be uninformed of the errors potentially concealed in the research. The scientific method was developed to draw on empirical reasoning to help us resolve cases in which studies vary or fail to replicate. The application of the scientific method to the study of human behavior and psychology has not simplified human behavior; instead, it has suggested how complex human behavior is.

References

Weissmark, M. (forthcoming). The Science of Diversity. Oxford University Press, USA.

Weissmark, M. (2004). Justice Matters: Legacies of the Holocaust and World War II. Oxford University Press, USA.

Weissmark, M. & Giacomo, D. (1998). Doing Psychotherapy Effectively. University of Chicago Press, USA.