As I’ve discussed previously, there are a number of theoretical and practical issues that plague psychological research when it comes to statistical testing. On the theoretical end of things, if you collect enough subjects, you’re all but guaranteed to find some statistically significant result, no matter how small or unimportant it might be. On the practical end of things, even a researcher given a random set of data can end up finding a statistically significant (though not actually meaningful) result more often than not by exercising certain “researcher degrees of freedom”. These degrees of freedom can take many forms: breaking the data down into different sections, such as by sex, or by high, medium, and low values of the variable of interest; peeking at the data ahead of time and using that information to decide when to stop collecting subjects; among other methods. At the heart of many of these practical issues is the idea that the more statistical tests you can run, the better your chances of finding something significant. Even if the false-positive rate for any one test is low, with enough tests, the chance of a false-positive result rises dramatically. For instance, running 20 independent tests with an alpha of 0.05 on random data would yield at least one false positive around 64% of the time.
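As a quick sketch of where that 64% figure comes from (the function name here is my own, assuming the 20 tests are independent):

```python
def familywise_fp_rate(n_tests=20, alpha=0.05):
    """Probability of at least one false positive across
    n_tests independent tests, each run at the given alpha."""
    return 1 - (1 - alpha) ** n_tests

# With 20 tests at alpha = 0.05, this comes out to roughly 0.64
rate = familywise_fp_rate(20, 0.05)
print(round(rate, 2))
```

Each test has a 95% chance of correctly coming back null on random data, but the chance of all 20 doing so is only 0.95^20, or about 36%.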

Taking these two issues in order, the first is that the Bonferroni correction will only serve to keep the experiment-wide false-positive rate constant. While it might do a fine job at that, people very rarely care about that number. That is, we don’t care about whether there is *a false-positive* finding; we care about whether *a specific finding* is a false positive, and these two values are far from the same thing. To understand why, let’s return to our researcher who was running 20 independent hypothesis tests. Say that, hypothetically, 4 of those 20 tests come back significant at the 0.05 level. We know that the probability of making at least one type 1 error (a false positive) is around 64%; what we don’t know is (a) whether any of our positive results are false positives or, assuming at least one of them is, (b) which result(s) that happens to be. The most viable solution to this problem, in my mind, is not to raise the evidentiary bar across all tests, threatening to make all the results insignificant on account of the fact that one of them might just be a fluke.

There are two major reasons for not doing this. The first is that it will dramatically boost our type 2 error rate (failing to find an effect when one actually exists); even though this is not the error rate that many conservative statisticians are predominantly interested in, these are still errors all the same. Even more worryingly, though, it doesn’t seem to make much sense for a result to count as significant or not contingent on what other results were being examined alongside it. Consider two experimenters: one collects data on three variables of interest from the same group of subjects, while a second collects data on those same three variables, but from three different groups. Both researchers are thus running three hypothesis tests; they’re just running them either together or separately. If both were applying a Bonferroni correction based on the number of tests run per experiment, the results might be significant in the latter case but not in the former, even if the two researchers got identical sets of results. This lack of consistency in terms of which results get to count as “real” will only add to the confusion in the psychological literature.
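To make the inconsistency concrete, here is a minimal sketch with hypothetical p-values (the numbers are mine, invented for illustration): both researchers observe the same three results, but the per-experiment correction divides alpha by 3 for one of them and by 1 for the other.

```python
# Hypothetical identical p-values obtained by both researchers
p_values = [0.03, 0.04, 0.045]
alpha = 0.05

# Researcher A runs all three tests within one experiment,
# so a per-experiment Bonferroni correction divides alpha by 3
significant_a = [p < alpha / 3 for p in p_values]

# Researcher B runs one test per experiment,
# so each test is judged against the uncorrected alpha
significant_b = [p < alpha / 1 for p in p_values]

print(significant_a)  # nothing clears 0.0167
print(significant_b)  # everything clears 0.05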

To build on that point, as I initially mentioned, any difference between groups, no matter how small, could be considered statistically significant if your sample size is large enough due to the way that significance is calculated; this is one of the major theoretical criticisms of null hypothesis testing. Conversely, however, any difference, no matter how large, could be considered statistically *insignificant* if you run enough additional irrelevant tests and apply a Bonferroni correction. Granted, in many cases that might require a vast number of additional tests, but the precise number of tests is not the point. The point is that, on a theoretical level, the correction doesn’t make much sense.

While some might claim that the Bonferroni correct guards against researchers making excessive, unwarranted claims, there are better ways of guarding against this issue. As Perneger (1998) suggests, if researchers simply describes what they did (“we ran 40 tests and 3 were significant, but just barely”), that can generally be enough to help readers figure out whether the results were likely to be the chance outcomes of a fishing expedition or not. The issue is that this potential safeguard is that it would require researchers to accurately report all their failed manipulations as well their successful ones, which, for their own good, many don’t seem to do. One guard that Perneger (1998) does not explicitly mention which can get around that reporting issue, however, is the importance of theory in interpreting the results. As most psychological literature currently stands, results are simply redescribed, rather than explained. In this world of observations equaling explanations and theory, there is little way to separate out the meaningful significant results from the meaningless ones, especially when publication bias generally hinders the failed experiments from making it into print.

**References:** Perneger TV (1998). What’s wrong with Bonferroni adjustments? BMJ (Clinical research ed.), 316 (7139), 1236-8