Fear of False Positives

Is p < .005 better than p < .05?

Posted Jul 30, 2017

I am the only god who knows the keys / to the armory where the lightning-bolt is sealed. / No need for that, not here. / Let me persuade you. ~ Aischylos: The Eumenides [1]

In a much-circulated paper to appear in Nature, Benjamin and 71 co-authors ask that significance levels be tightened from the current convention of p < .05 to .005. The argument is that the published record of psychological science contains too many false positive results, leading us to believe in things that aren’t so, such as the Sasquatch or social priming. Lowering the significance threshold would reduce the incidence of false positives. At the same time, they say, the new convention, if adopted, would help fix the replication crisis. Or would it? If it is difficult to replicate a .05 finding at a given level of statistical power, then it will be difficult to replicate a .005 finding at that same power level. Remember that statistical power is the probability [conventionally set to .8] of finding significance [as defined by convention] if the original finding is real, that is, if it is a true and not a false positive. In order to make their proposal positively relevant to the replication crisis, the authors propose that a lowered significance threshold be applied only to novel hypothesis tests. In other words, they ask that we report a new piece of research only if p < .005, while allowing us to replicate it with p < .05. [This proposal raises the question of how we know what a novel test is.]
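To make the notion of power concrete, here is a minimal sketch in Python. It assumes a two-sided, two-sample z test as a stand-in for whatever test a study actually uses; the function name, the effect size of d = 0.5, and the 64 participants per group are illustrative assumptions, not figures from Benjamin et al.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_sided_z(effect_size: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided, two-sample z test.

    effect_size is a standardized mean difference (Cohen's d); both
    groups are assumed to contain n_per_group observations.
    """
    # Expected value of the z statistic if the effect is real.
    delta = effect_size * sqrt(n_per_group / 2)
    # Critical value implied by the chosen significance threshold.
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the observed z falls beyond either critical value.
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)


if __name__ == "__main__":
    # Hypothetical study: d = 0.5 with 64 participants per group.
    for alpha in (0.05, 0.005):
        print(alpha, round(power_two_sided_z(0.5, 64, alpha), 3))
```

Under these assumed inputs, power is roughly .8 at the .05 threshold and roughly .5 at the .005 threshold. Holding sample size constant, the stricter criterion makes a real effect harder to detect, in an original study and in a replication alike.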

This tightening-of-the-screws proposal is interesting but it flirts with incoherence. Remember the old saying that God loves p = .055 no less (or not much less) than she loves p = .045 (and Professor Gelman’s proof thereof). Now God also does not care much about which study was conducted first and gets to be considered the novel hypothesis test and which was conducted later. The order of these studies is theoretically and statistically irrelevant (Krueger, 2001). If we wish to hold first and second studies to different statistical standards, we might as well reverse the argument. Let us be easy on early hypothesis tests for they know not yet what they are. Early tests are exploratory, not confirmatory (Sakaluk, 2016). Early tests are the scientist’s way of foraging. The scientist understands that easy early tests will produce many leads that later turn out to be dead ends, but he or she also understands that such tests will turn up many findings that will later be counted as true discoveries.
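The point about p = .055 and p = .045 can be illustrated with a small sketch. It is written in the spirit of the argument that the difference between a significant and a nonsignificant result need not itself be significant, and it is not a reproduction of anyone's proof. It assumes two independent studies with equal standard errors and effects in the same direction; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def z_from_two_sided_p(p: float) -> float:
    """Absolute z score that corresponds to a two-sided p value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(p_first: float, p_second: float) -> float:
    """Two-sided p value for the difference between two study results.

    Assumes independent estimates with equal standard errors and
    effects pointing in the same direction.
    """
    # The difference of two independent z scores has variance 2.
    z_diff = (z_from_two_sided_p(p_first) - z_from_two_sided_p(p_second)) / sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z_diff)))


if __name__ == "__main__":
    print(round(p_for_difference(0.045, 0.055), 3))
```

Under these assumptions, the p value for the difference between the two results is above .9. The two studies sit on opposite sides of the .05 line, yet they are statistically almost indistinguishable from each other.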

Benjamin et al. know the risks of false negative errors, but they do not seem much concerned. This lack of concern is extra-statistical. It is a value judgment. If they believe that the horrors of false positives are greater than the horrors of false negatives, they must advocate a stricter p threshold. Because they do advocate a stricter p threshold, we can reverse-infer that they abhor false positives (Krueger, 2017). But as some of us have argued, we need to consider which direction science will take when considering changes in conventional practice (Fiedler, Kutzner, & Krueger, 2012). Yet, there are statistical considerations in that we can estimate the rate at which false positives and false negatives will change with changes in the p threshold. In simulation experiments, we find that lowering the p threshold degrades the overall validity of inductive inferences (Krueger & Heck, 2017). This is so because the proportion of Misses rises more steeply than the proportion of False Positives drops. To insist on lowering the significance threshold in light of these findings is to place a greater disutility on a false positive than a utility on a true positive. 
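The trade-off between the two types of error can be sketched analytically. What follows is not the simulation reported by Krueger and Heck (2017); it is a simplified illustration under loudly stated assumptions: a two-sided z test, a hypothetical base rate of .5 for true effects, and an expected z statistic of 2.8 when the effect is real. All names and numbers are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power(delta: float, alpha: float) -> float:
    """Power of a two-sided z test whose expected z statistic is delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)


def error_rates(base_rate: float, delta: float, alpha: float) -> tuple:
    """Return (false_positive, miss) as proportions of all studies run.

    base_rate is the assumed proportion of tested hypotheses that are true.
    """
    false_positive = (1 - base_rate) * alpha       # null is true, test rejects
    miss = base_rate * (1 - power(delta, alpha))   # effect is real, test fails
    return false_positive, miss


if __name__ == "__main__":
    # Hypothetical inputs: half of all tested effects are real, delta = 2.8.
    for alpha in (0.05, 0.005):
        fp, miss = error_rates(0.5, 2.8, alpha)
        print(alpha, round(fp, 4), round(miss, 4))
```

With these assumed inputs, the proportion of misses rises by more than the proportion of false positives falls when the threshold moves from .05 to .005. The size of this asymmetry depends on the assumed base rate and effect size, so other inputs can shift the balance.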

And why .005 and not .01 or .001? Benjamin et al. concede the choice is as arbitrary as it is pragmatic. They refer to social proof (many favor it) and the heightened Bayes factor that goes with it. The lower the p value, the higher the BF favoring the alternative hypothesis. This is a moment of truth for the Bayesians among the authors. The BF, as it turns out, is a log-linear transformation of the p value. Nothing statistical is added until the priors are included, but that’s another story.
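The link between p and the Bayes factor can be illustrated with one commonly cited mapping, the upper bound 1 / (-e p ln p), which is often attributed to Sellke, Bayarri, and Berger. Whether this is the exact transformation intended in the paragraph above is an assumption made here for illustration; the function name is hypothetical.

```python
from math import e, log


def bayes_factor_bound(p: float) -> float:
    """Upper bound on the Bayes factor favoring the alternative hypothesis.

    Uses the bound 1 / (-e * p * ln p), which applies only for 0 < p < 1/e.
    """
    if not 0 < p < 1 / e:
        raise ValueError("the bound is defined only for 0 < p < 1/e")
    return 1 / (-e * p * log(p))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005, 0.001):
        print(p, round(bayes_factor_bound(p), 2))
```

Under this bound, p = .05 corresponds to a Bayes factor of at most about 2.5, and p = .005 to at most about 13.9. The bound is computed from p alone, which illustrates why nothing statistical is added until priors enter.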

The 72-author report comes from the critical literature on significance testing. This literature boils down to two claims:

  1. p values are fatally flawed in the sense of being incoherent and unreliable;
  2. p values are not low enough.

The 72 stress the latter point, thereby de-emphasizing the former. Surely, it would be difficult to lodge both complaints in the same paper. It would be rather like the old Jewish quip that “The food was horrible, and the portions were so small!”

There is a third point, which is not about the statistical basics, but about their use. Critics complain that researchers mindlessly or slavishly use a significance threshold to make categorical inferences about the presence or absence of "something." Not even Fisher or Neyman and Pearson advocated rigid decision-making. Fisher viewed .05 as a reasonable threshold when little else is known, and Neyman and Pearson suggested that researchers should use .05, .01, or .001 depending on the relative utilities of the two types of error. Now the 72 come close to demanding a normative change, a new significance criterion that would be binding by social consensus and editorial fiat. With this, the 72 commit what is otherwise condemned as the cardinal sin of significance testing (ST), the drawing of a bright line between to be and not to be.

There is indeed a psychology of bright-line categorization. The early Tajfel (e.g., 1969) proposed accentuation theory as a way to make sense of diverse consequences of arbitrary (and non-arbitrary) categorization. He reported the replicable result that values placed on a continuum are perceived as respectively smaller and larger if they fall to the left (smaller) or the right (larger) side of a demarcation point (Krueger & Clement, 1994). Perceptual accentuation in the domain of statistical indices and decisions is not a particular sickness coming out of ST.

A final complication hiding in the 72-author report is what to do with past results. Perhaps the 72 mean to imply that all findings with .05 > p > .005 be disregarded. Indeed, this conclusion follows from their proposal. As noted above, God (and Fisher) do not care about the relative chronology of results. Here the 72 can make a difference. They may elect to go on record and disavow all their own past findings with .05 > p > .005. Any potential later replication of these results is immaterial because, according to their own logic, it should never have occurred.

[1] Aischylos, putting these words into Athena's mouth, emphasizes the power of persuasion over authority. Likewise, our scientific practices ought to respond to reasoned argument, not to proclamation by authority.

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22). Redefine statistical significance. Retrieved from osf.io/preprints/psyarxiv/mky9j

Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The long way from α-control to validity proper: Problems with a short-sighted false-positive debate. Perspectives on Psychological Science, 7, 661-669.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.

Krueger, J. I. (2017). Reverse inference. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 110-124). New York, NY: Wiley.

Krueger, J., & Clement, R. W. (1994). Memory-based judgments about multiple categories: A revision and extension of Tajfel's accentuation theory. Journal of Personality and Social Psychology, 67, 35-47.

Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology: Educational Psychology [Research Topic: Epistemological and ethical aspects of research in the social sciences]. https://doi.org/10.3389/fpsyg.2017.00908

Sakaluk, J. K. (2016). Exploring small, confirming big: An alternative system to the new statistics for advancing cumulative and replicable psychological research. Journal of Experimental Social Psychology, 66, 47-54.

Tajfel, H. (1969). Cognitive aspects of prejudice. Journal of Social Issues, 25, 79-97.