As I’ve discussed previously, there are a number of theoretical and practical issues that plague psychological research in terms of statistical testing. On the theoretical end of things, if you collect enough subjects, you’re all but guaranteed to find some statistically significant result, no matter how small or unimportant it might be. On the practical end of things, even if a researcher is given a random set of data, they can end up finding a statistically significant (though not actually significant) result more often than not by exercising certain “researcher degrees of freedom”. These degrees of freedom can take many forms: breaking the data down into different sections (by sex, say, or by high, medium, and low values of the variable of interest), or peeking at the data ahead of time and using that information to decide when to stop collecting subjects, among other methods. At the heart of many of these practical issues is the idea that the more statistical tests you run, the better your chances of finding something significant. Even if the false-positive rate for any one test is low, with enough tests the chance of at least one false-positive result rises dramatically. For instance, running 20 independent tests with an alpha of 0.05 on random data would yield at least one false positive around 64% of the time (1 − 0.95^20 ≈ 0.64).
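The 64% figure follows from simple arithmetic, which can be sketched in a few lines of Python. The function name here is my own, and the calculation assumes the tests are independent and that every null hypothesis is true (i.e., the data really are random):

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, assuming every null hypothesis is actually true."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 10, 20):
    print(n, round(familywise_error_rate(0.05, n), 3))
```

With 20 tests the rate comes out at roughly 0.64, and it crosses the 40% mark by 10 tests.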
“Hey everybody, we got one; call off the data analysis and write it up!”
In attempts to banish false-positives from the published literature, some have advocated the use of what are known as Bonferroni corrections. The logic here seems simple enough: the more tests you run, the greater the likelihood that you’ll find something by chance, so, to better avoid fluke results, you raise the evidentiary bar for each statistical test you run (or, more precisely, lower your alpha level). So, if you were to run the same 20 tests on random data as before, you could maintain an experiment-wide false-positive rate of 5% (instead of 64%) by adjusting your per-test alpha to 0.25% (instead of 5%), which is simply 0.05 divided by 20. The correction, then, makes each test you do more conservative as a function of the total number of tests you run. Problem solved, right? Well, no; not exactly. According to Perneger (1998), these corrections not only fail to solve the initial problem we were interested in, but also create a series of new problems that we’re better off avoiding.
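As a minimal sketch of that adjustment (the function name is my own, and the experiment-wide rate again assumes 20 independent tests on random data):

```python
def bonferroni_alpha(experiment_alpha: float, n_tests: int) -> float:
    """Per-test alpha under a Bonferroni correction."""
    return experiment_alpha / n_tests


n_tests = 20
per_test_alpha = bonferroni_alpha(0.05, n_tests)

# Experiment-wide false-positive rate once every test uses the stricter alpha,
# assuming independent tests and all null hypotheses true.
corrected_rate = 1.0 - (1.0 - per_test_alpha) ** n_tests

print(per_test_alpha)
print(round(corrected_rate, 4))
```

The per-test alpha is 0.0025, and the resulting experiment-wide rate lands just under 5%, which is the sense in which the correction is slightly conservative.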
Taking these two issues in order, the first is that the Bonferroni correction will only serve to keep the experiment-wide false-positive rate constant. While it might do a fine job at that, people very rarely care about that number. That is, we don’t care about whether there is a false-positive finding somewhere in the set; we care about whether a specific finding is a false positive, and these two values are far from the same thing. To understand why, let’s return to our researcher who was running 20 independent hypothesis tests. Let’s say that, hypothetically, out of those 20 tests, 4 come back as significant at the 0.05 level. Now we know that, if every null hypothesis were true, the probability of making at least one type 1 error (a false positive) would be 64%; what we don’t know is (a) whether any of our positive results are false-positives or, assuming at least one of them is, (b) which result(s) that happens to be. The most viable solution to this problem, in my mind, is not to raise the evidentiary bar across all tests, threatening to make all the results insignificant on account of the fact that one of them might just be a fluke.
There are two major reasons for not doing this: the first is that it will dramatically boost our type 2 error rate (failing to find an effect when one actually exists) and, even though this error rate is not the one that many conservative statisticians are predominantly interested in, those are still errors all the same. Even more worryingly, though, it doesn’t seem to make much sense to deem a result significant or not contingent on what other results you were examining. Consider two experimenters: one collects data on three variables of interest from the same group of subjects, while a second researcher collects data on those same three variables, but from three different groups. Both researchers are thus running three hypothesis tests, but they’re either running them together or separately. If the two researchers were using a Bonferroni correction contingent on the number of tests they ran per experiment, the results might be significant in the latter case but not in the former, even if the two researchers got identical sets of results. This lack of consistency in terms of which results get to be counted as “real” will only add to the confusion in the psychological literature.
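To make the two-researcher example concrete, here is a small sketch. The three p-values are invented purely for illustration, and the helper name is my own:

```python
def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant or not, correcting for the
    number of tests run within a single experiment."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values, identical for both researchers.
p_values = [0.03, 0.02, 0.04]

# Researcher 1: all three tests in one experiment (threshold = 0.05 / 3).
one_experiment = significant_after_bonferroni(p_values)

# Researcher 2: three experiments with one test each (threshold = 0.05).
three_experiments = [significant_after_bonferroni([p])[0] for p in p_values]

print(one_experiment)
print(three_experiments)
```

The same three results come out as all non-significant for the first researcher and all significant for the second; the only thing that differs is how the tests were packaged.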
“My results would have been significant, if it wasn’t for those other meddling tests!”
The full scale of the last issue might not have been captured by the two-researcher example, so let’s consider another, single-researcher example. Here, a researcher is measuring the same 20 variables of interest in a group of subjects, looking for differences between men and women. Among these variables, there is one hypothesis that we’ll call a “real” hypothesis: women will be shorter than men. The other 19 variables being assessed are being used to test “fake” hypotheses: things like whether men or women have a preference for drinking out of blue cups or whether they prefer green pens. A Bonferroni correction would, essentially, treat the results of the “fake” hypotheses as being just as likely to generate a false-positive as the “real” hypothesis. In other words, Bonferroni corrections are theory-independent. Given that some differences between groups are more likely to be real than others, applying a uniform correction to all those tests seems to miss the mark.
To build on that point, as I initially mentioned, any difference between groups, no matter how small, could be considered statistically significant if your sample size is large enough due to the way that significance is calculated; this is one of the major theoretical criticisms of null hypothesis testing. Conversely, however, any difference, no matter how large, could be considered statistically insignificant if you run enough additional irrelevant tests and apply a Bonferroni correction. Granted, in many cases that might require a vast number of additional tests, but the precise number of tests is not the point. The point is that, on a theoretical level, the correction doesn’t make much sense.
While some might claim that the Bonferroni correction guards against researchers making excessive, unwarranted claims, there are better ways of guarding against this issue. As Perneger (1998) suggests, if researchers simply describe what they did (“we ran 40 tests and 3 were significant, but just barely”), that can generally be enough to help readers figure out whether the results were likely to be the chance outcomes of a fishing expedition or not. The issue with this potential safeguard is that it would require researchers to accurately report all their failed manipulations as well as their successful ones, which, for their own good, many don’t seem to do. One guard that Perneger (1998) does not explicitly mention, and which can get around that reporting issue, is the importance of theory in interpreting the results. As most psychological literature currently stands, results are simply redescribed, rather than explained. In this world of observations equaling explanations and theory, there is little way to separate out the meaningful significant results from the meaningless ones, especially when publication bias generally hinders the failed experiments from making it into print.
What failures-to-replicate are you talking about?
So long as people continue to be impressed by statistically significant results, even when those results cannot be adequately explained or placed into some larger theoretical context, these statistical problems will persist. Applying statistical corrections will not solve, or likely even stem, the problems with the way psychological research is currently conducted. Even if such corrections were honestly and consistently applied, they would likely only change the way psychological research is conducted, with researchers turning to altogether less-efficient means in order to compensate for the reduced power (running one hypothesis per experiment, for instance). Rather than demanding a higher standard of evidence for fishing expeditions, one might instead focus on reducing the prevalence of these fishing expeditions in the first place.
References: Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments? BMJ, 316(7139), 1236–1238.
Copyright Jesse Marczyk