This is a modified version of a blog post by a group headed by my close friend and colleague Dr. Patrick McKnight (who co-leads my own Well-Being Lab). The Measurement, Research Methodology, Evaluation and Statistics (MRES) group started a blog on issues undergirding the foundation of science.

We raise a single criticism about the field of psychology in this blog post—most of the questions and hypotheses being tested by scientists are weak, uninteresting, and cowardly. We include ourselves in the category of people producing work that tests cowardly hypotheses. We offer some suggestions for pushing for braver scientists and disciplines. This topic is being ignored in discussions of the replication crisis in psychology, particularly social psychology and neuroscience. But maybe there isn't a replication crisis at all... read on.

There is a longstanding tradition in empirical science of advancing scientific knowledge by generating hypotheses that can be tested and falsified; the rationale is even more relevant today with the emerging push for replication. Scientists now hold replication as a sine qua non for scientific contribution. Findings that can be replicated in other labs by other scientists hold greater value than those that cannot. The logic seems sound until we start examining what constitutes replicability and whether the probability of falsification remains constant across all studies. We argue here that falsification—while discussed at length over centuries—remains at the heart of the problem with replication. Studies that are easily falsifiable (less risky) are easier to replicate, whereas studies that fail to meet that "easy" standard stand less of a chance of replicating. In some cases, falsifiability may be a better test of scientific contribution. We elaborate on these points below and offer an idea for how to quantify the probability of replication—we replace this probability with a simpler term: risk. Our aim is to help researchers, especially students and young scientists, better understand what they are asking and testing with their research studies. In addition, we want to contribute to the review process and help push the field toward more meaningful, impactful work.

We begin with a simple question: are research questions really risky predictions? Consider an analogy that my students find humorous and, I hope, illuminating. I ask them to consider the risk in the following prediction:

The New York Yankees will win the World Series.

The problem with this prediction is that there is no time restriction. So if the Yankees win the World Series in 2026, I win! Of note, the Yankees have won 27 World Series to date—more than double the wins of the second-place team and triple that of the third-place team in that category. In short, the Yankees are a good bet. Suppose I wanted to restrict my prediction to something a bit more timely by saying...

An American League team will win the World Series in 2018.

Currently, Major League Baseball in the USA has 30 teams—15 in the American League and 15 in the National League. My seemingly bold prediction states that one team in the American League (N=15) will beat one team in the National League (N=15) in some uncertain number of games (a best-of-7 series). We could sharpen that estimate—at least regarding the risk—by looking up how often an American League team has won the World Series in the past. The American League won 64 of the 112 World Series (p = 0.57, or 57 percent). Those values help us better appreciate how risky the prediction is. An even riskier prediction would be one in which I also state the number of games required to settle the series:

An American League team will win the World Series in 2018 in 6 games.

As we get more specific, the riskiness of our prediction increases, but that increase is easily accounted for by understanding the probability of each element described in the prediction. Only one team from each league reaches the series, so the first prediction amounts to roughly a coin flip (a 0.5 probability, or 0.57 if we use the historical record). By increasing the specificity of our prediction to the number of games, we need to know how many series were won by the American League team AND decided in 6 games. These statistics are available to us via the internet, so we have the ability—if we choose to do so—to look up the values and assess the probability that our prediction may be true. All of these probabilities are based on prior data and in no way provide exact probabilities for current or future values.
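To make the arithmetic explicit, here is a minimal sketch of how those historical counts turn into a risk estimate. The 112 series and 64 American League wins come from the paragraph above; the number of those wins decided in exactly 6 games is a made-up placeholder, since we did not look up the real figure.

```python
# A minimal sketch of how historical counts quantify the risk of each prediction.
# The totals come from the text above; al_wins_in_6 is a hypothetical placeholder,
# not the real figure.

total_series = 112      # World Series played to date (from the text)
al_wins = 64            # series won by an American League team (from the text)
al_wins_in_6 = 20       # hypothetical: AL victories decided in exactly 6 games

p_al_wins = al_wins / total_series            # "An American League team will win"
p_al_wins_in_6 = al_wins_in_6 / total_series  # "...and the series will take 6 games"

print(f"P(AL team wins)      ~ {p_al_wins:.2f}")
print(f"P(AL team wins in 6) ~ {p_al_wins_in_6:.2f}")
```

The more specific prediction carries the smaller probability, which is exactly what makes it the riskier, and more informative, bet.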

How does baseball relate to science?

Baseball provides us with a great analog for science. We use the examples above because they lay the foundation for our main point—scientists ought to be held to risky test standards so that we actually demonstrate how bold our predictions were and whether our findings fall into the "Whoa!" or the "Meh" category. Right now, I suspect most of our predictions fall into the latter group. Most social and behavioral science predictions are not terribly risky, at least not the ones formulated in the following manner:

We predict X significantly predicts Y.

If X produces a significant correlation (p < .05) with Y, the authors celebrate their finding as if it were a feat of strength. Our problem with questions formulated this way is that they are not terribly risky. What might be riskier is if the formulation included a direction:

We predict X significantly and positively predicts Y.

That prediction at least includes a direction. Only a positive relationship earns the celebration, and yet that still falls well short of a risky prediction. What if we were to hold ourselves and our colleagues to much riskier predictions such as:

We predict X significantly predicts Y with a correlation of .5 or better.

That prediction leaves us with a clearly defined test, but exactly how risky it is remains unclear. Yes, the last prediction is certainly riskier than the first, but how risky is risky?
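To show what testing that last prediction could look like in practice, here is a minimal sketch (not a procedure the field has settled on): a one-sided test of an observed correlation against the pre-stated benchmark of .5 using the Fisher z transform. The data below are simulated and stand in for a real study.

```python
import numpy as np
from scipy import stats

# A rough sketch, not a prescribed procedure: test the risky prediction
# "X predicts Y with a correlation of .5 or better" against observed data.
# The data here are simulated purely for illustration.

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)            # simulated data with a true positive relation

r_obs = np.corrcoef(x, y)[0, 1]             # observed correlation
r_pred = 0.5                                # the benchmark stated before seeing the data

# Fisher z transform: arctanh(r) is approximately normal with SE = 1/sqrt(n - 3)
z_obs, z_pred = np.arctanh(r_obs), np.arctanh(r_pred)
z_stat = (z_obs - z_pred) * np.sqrt(n - 3)
p_one_sided = 1 - stats.norm.cdf(z_stat)    # evidence that the true correlation exceeds .5

print(f"observed r = {r_obs:.2f}; test of r > .5: z = {z_stat:.2f}, p = {p_one_sided:.3f}")
```

Failing this kind of test is informative in a way that failing to reject a zero correlation is not: the prediction stuck its neck out.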

Risky tests in science

What does it mean to conduct a risky test? Norbert Kerr spent the past two decades urging us all to state our hypotheses up front without peeking at our data. He termed the violation "HARKing," or Hypothesizing After the Results are Known. Those familiar with his work will realize immediately that HARKing affects the risk of the test. If we already know the results from our data, there is no test. The evidence cannot be used to construct the hypothesis. Furthermore, we need a test that can indicate failure of a hypothesis. Statistics offer that ability to differing degrees, and we have no intention of wading into the p-value controversy. Suffice it to say that every tool has limitations.

Our primary concern here is that we want hypotheses to provide greater specificity as the evidence accrues. Similar to the baseball examples above, we might find that a research program produces a disproportionate number of findings in the positive direction (for instance: we get it, gratitude is positively related to greater well-being; time to move on and get more sophisticated with the questions being asked); thus, directional hypotheses are not enough. As the evidence increases for an area of inquiry, we need greater specificity in our predictions. We expect hypotheses to be independent of the evidence, more specific than previous hypotheses, and suitable for the statistical procedure used to test them. In other words, the predictions or hypotheses we state as we gain more insight into a phenomenon ought to be more restricted and more precise. Only by homing in on more specific predictions can we learn enough to warrant further inquiry in that direction.

A few points about point predictions

Paul Meehl wrote many times that psychological science needs to shift toward point predictions and away from non-directional hypotheses. These "point predictions" are risky tests that further our understanding of estimates—their stability, replicability, and perhaps even validity. Meehl's call fell on deaf ears or, perhaps, the listeners heard his call and ignored it for various reasons including inertia, defensiveness, or even disagreement. Here are a few points worth considering:

Point 1: High precision may not be justifiable in psychological science. One area where we may all disagree is the degree to which psychological science stands to gain from such precise estimates. Perhaps we have neither the data nor the theories to warrant precise predictions. We grant you that latitude but counter with... why not try? It is far too easy to simply respond, "I doubt precision exists in psychological science to the extent that point estimates would be justifiable." Rather than doubt point predictions, open your mind to the possibility. Consider precision an iterative process. Surely psychological science has evolved enough to move away from non-directional tests. How much further we can go demands serious conversation.

Point 2: Precision, risky tests, and point predictions may come in many forms. There exists no criterion that stipulates exact point estimates, but we ought to advance each hypothesis or prediction such that our predictions get riskier as our knowledge grows. Consider these alternatives to precise point estimates as risky tests:

1) Competing theory tests where your theory gets a direct test with an alternative, contrasting theory (for instance, which is the better explanation for why people crave self-esteem boosts - terror management theory, self-determination theory, or sociometer theory?).

2) Resampling statistics (e.g., bootstrap or jackknife estimates) that show how dependent your findings are upon your observations (see the sketch following this list).

3) Omitting theoretically relevant predictors or including irrelevant predictors to demonstrate the stability of your estimates.

4) Thresholds for variance accounted for (R-squared) that would demarcate what you deem useful scientific gains.

5) Independent testing of your theory by an outside group that analyzes your data without any knowledge of your preferences—only knowledge of the theory being tested.

Each of these stands as a riskier test than what we have today.
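As one concrete illustration of item 2 (the sketch promised above), here is a minimal example that bootstraps a correlation to show how much the estimate depends on the particular observations drawn; the data are simulated and the procedure is a sketch, not a standard we are prescribing.

```python
import numpy as np

# A minimal bootstrap sketch: resample the cases with replacement and recompute
# the correlation to see how dependent the estimate is on the observations.
# The data are simulated purely for illustration.

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

boot_rs = []
for _ in range(5000):
    idx = rng.integers(0, n, size=n)        # resample observations with replacement
    boot_rs.append(np.corrcoef(x[idx], y[idx])[0, 1])

lo, hi = np.percentile(boot_rs, [2.5, 97.5])
print(f"full-sample r = {np.corrcoef(x, y)[0, 1]:.2f}")
print(f"95% bootstrap interval: [{lo:.2f}, {hi:.2f}]")
```

A narrow interval suggests the finding does not hinge on a handful of cases; a wide one is itself worth reporting.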

Point 3: No real gains in science come for free or without costs. We must exert great effort to make small gains in science; history teaches us that point repeatedly. Right now, we need to exert minimal effort to produce imaginary gains—not real gains per se but the illusion that we have advanced knowledge (here is the 3,867th study showing that people who are optimistic also endorse feeling good about their lives). We are asking for standards that require great effort, a point we acknowledge and appreciate. The number of publications per person will no doubt decrease. Our CVs may be shorter (a point we shall return to in the future) but easier to understand with respect to our contributions; there are a few horrendous articles on my own CV. These types of risky tests may lead to greater failures, but those failures ought to be informative. We ought to learn as we move forward (or sideways, backwards, or some other direction). Knowledge gains and clarity in our science stand as the fortunes we may reap with riskier tests. Worth it? We think so.

Replication crisis?

We hold that there is no replication crisis. Sure, some predictions do not replicate. Perhaps their failure to replicate stems from many factors, including sampling, instrumentation, and statistical dependencies that the original researchers who first published the effects failed to articulate. What is more, social and behavioral scientists have been replicating the same old tired hypotheses of non-directional "significance" without much regard to advancing our collective knowledge. We ought to hold ourselves to higher standards by increasing the risk of our tests, predictions, and hypotheses.

We believe there are issues that require attention first. We need to fix how science is constructed by the initial investigators before addressing whether a second group can replicate the work. Here is a sampling of those issues:

1. the predictions being made (the point of this blog post)
2. the over-reliance on college students when they are not conceptually relevant to the topic under study. Now, if you are interested in mood contagion, studying college roommates is perfect. If you are interested in studying hook up culture, studying college students is useful because of the high base rate of activity compared to the general population. If you are interested in mentorship, college students in various clubs and sports and the staff that guide them are ideal.
3. the over-reliance on low-resource measures instead of the best measures. I am dying for someone to create an alternative to the Positive and Negative Affect Schedule or the Brief Mood Introspection Scale for measuring dispositional affect. Have you ever conducted a think-aloud paradigm, asking people to report on their thoughts while answering questions about emotion adjectives? People offer a wide variety of thoughts that are far from capturing dispositional mood.
 

Issues about the quality of science and the replication of findings can be handled simultaneously. There is merit, however, in first focusing on best practices in sampling, research methodology, statistical analyses, and statistical power in initial investigations. Then we can focus on replication and generalizability. The current push for pre-registration, transparency, and open science will certainly help.

So, what do you think? Do you think that by advocating for riskier tests we may learn something more from our efforts? We want to hear from you. Tell us your thoughts.
 

Acknowledgements
The blog post above was written by Patrick E. McKnight with the assistance of the entire MRES group. We discussed this topic during several meetings. Those who contributed to the written work received proper credit above but here is an alphabetical list of the key contributors (affiliated with George Mason University):

Dan Blalock - clinical psychology intern at Northwestern
David Disabato - clinical psychology graduate student
Simone Erchov - human factors graduate student
Amanda Harwood - human factors graduate student
Todd Kashdan - me
Nick Khaligh - undergraduate psychology honors student
Sam Monfort - human factors graduate student

Dr. Todd B. Kashdan is a public speaker, psychologist, professor of psychology and senior scientist at the Center for the Advancement of Well-Being at George Mason University. His latest book is The Upside of Your Dark Side: Why Being Your Whole Self—Not Just Your "Good" Self—Drives Success and Fulfillment. If you're interested in arranging a speaking engagement or workshop, visit toddkashdan.com.
