The witch hunt for false positives

I have had my results for a long time: but I do not know yet how I am to arrive at them.

~ Carl Friedrich Gauss

Psychologists are chronically – or rather intermittently – worried about how scientific their science is. The hallmark of a respectable science is that it yields reproducible results. When this reproducibility is threatened, the science itself is threatened. Psychologists have shown much ingenuity in finding clever, even elegant, ways to harness phenomena. They have also been at the forefront of developing sophisticated methods of analyzing numerical data. Yet, the old headache, the ancient fear of being shown up as an impostor, just won’t go away.

One reason for this perennial malaise is that all the progress in the design of experiments and in the massaging, squeezing, and fondling of the data covers up the brute fact that most strategies of data analysis still boil down to a Fisherian version of null (nil) hypothesis significance testing, NHST. I argued over a decade ago that NHST is fatally flawed yet resistant to death (Krueger, 2001). When the sense of crisis in the field periodically reaches the boiling point – as it does currently – the hand wringing comes in the form of p wringing. How can we tame that little p, when we expect so much from it and get so little?

The entire November issue of the esteemed journal Perspectives on Psychological Science, PPS, which is the place where the Association for Psychological Science, APS, airs its laundry, is dedicated to the current replicability crisis. Most of the papers in this issue are concerned with the question of how the incidence of false positives can be reduced. One paper, written by my colleagues Klaus Fiedler, Florian Kutzner, and myself, asks not to throw the baby out with the bathwater (to use another aqueous domestic metaphor). We think that tightening the screws on false positives threatens to choke the flow of true discoveries as well. In other words, it increases the incidence of false negatives, or misses.

Let’s clarify some of these terms. In a typical experiment, the null hypothesis says that there is no relationship between an experimentally manipulated variable and an observed outcome measure. Unless p < .05 the null hypothesis is not rejected. If p < .05, however, that hypothesis is rejected and it is inferred that there is a statistical relationship (perhaps a causal one) between what is manipulated and what is measured. A false positive, or Type I error, occurs if p < .05 although the null hypothesis is true. A false negative, or Type II error, occurs if p > .05 although the null hypothesis is false. It is trivially true that, other things being equal, as you require a lower p value for the rejection of the null hypothesis, you accept more Type II errors. Let’s take a look at one of the papers from the PPS special issue.

Pashler & Harris (PH) ask what proportion of published results are likely to be false positives. They develop four scenarios, where each has four input ingredients.

[1] The prior probability of the effect; i.e., the complement of the probability that the null hypothesis is true, 1 - p[H] or p[~H].

[2] The power of the study; i.e., the probability that a statistical test is significant if the null hypothesis is false, p[D|~H], where D stands for data sufficient to render p < .05.

[3] The probability of a hit, that is, the probability that the null hypothesis is false times the probability that a test is significant if the null hypothesis is false; i.e., p[~H] x p[D|~H].

[4] The probability of a false positive, that is, the probability that the null hypothesis is true times the probability that a test is significant if the null hypothesis is true; i.e., p[H] x p[D|H].

After setting p[D|H] to the conventional .05, PH derive [3] and [4] from [1] and [2]. They develop four scenarios by simply varying [1] and [2]. The scariest case is obtained when the null is highly probable (.9) and the study’s power to detect a true effect is low (.35). Here, the probability of a hit is .1 x .35 = .035 and the probability of a false positive is .9 x .05 = .045, yielding an unsettling probability of .56 (.045/[.035 + .045]) for drawing a false conclusion if the test is significant. PH also show that this probability falls to .13 if the probability of the null being true is lowered to .5.
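The arithmetic behind these two scenarios can be checked in a few lines of Python. This is only a sketch, not PH's own code: the function name `false_positive_share` and its parameter names are illustrative labels, and the inputs are the values quoted above (a significance level of .05, power of .35, and a prior for the null of .9 or .5).

```python
def false_positive_share(p_null, power, alpha=0.05):
    """Share of significant results that are false positives.

    p_null: prior probability that the null is true, p[H]
    power:  probability of a significant test if the null is false, p[D|~H]
    alpha:  probability of a significant test if the null is true, p[D|H]
    """
    hit = (1 - p_null) * power        # p[~H] x p[D|~H]
    false_positive = p_null * alpha   # p[H] x p[D|H]
    return false_positive / (hit + false_positive)


# The two scenarios discussed above: a probable null and a fifty/fifty null.
for p_null in (0.9, 0.5):
    share = false_positive_share(p_null, power=0.35)
    print(f"p[H] = {p_null}: false positives among significant results = {share:.4f}")
```

With these inputs the two shares come out at about .5625 and .125, which round to the .56 and .13 reported in the text.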

PH worry that the scarier case is the more realistic one because researchers tend to devise rationales for their non-null effects that make it seem like these effects are more probable than they really are (i.e., .5 instead of .1 for p[~H]). I cannot believe this. Even if researchers are motivated to rationalize their findings, they ought to be motivated to rationalize the posterior (after study) probability of the non-null, not its prior. A researcher garners more glory by being able to argue that he found strong evidence against a null hypothesis that was very probable at the outset, not a null that was improbable.

PH focus on the relative incidence of false positives, while ignoring the relative incidence of false negatives (i.e., missing a true phenomenon). The probability of a miss is the probability that the null is false times the probability that the test is not significant if the null is false, p[~H] x p[~D|~H], where the latter term is the complement of the study’s power. The probability of a correct rejection is the probability that the null is true times the probability that the test is not significant if the null is true, p[H] x p[~D|H]. Now, the probability of a false result (Type II error) if the test is not significant is the probability of a miss divided by the summed probabilities of misses and correct rejections. In PH’s scary example, where the null is probable, .9, and power is low, .35, the relative probability of a Type II error is .07 (.065/[.065 + .855]). When p[H] is lowered to .5, the relative probability of that error is .41 (.325/[.325 + .475]).
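The same kind of check works for the nonsignificant side of the ledger. The sketch below is an illustration under the assumptions stated in the text (a significance level of .05, power of .35); the function name `false_negative_share` and its parameter names are hypothetical labels, not anything taken from PH.

```python
def false_negative_share(p_null, power, alpha=0.05):
    """Share of nonsignificant results that are misses (Type II errors).

    p_null: prior probability that the null is true, p[H]
    power:  probability of a significant test if the null is false, p[D|~H]
    alpha:  probability of a significant test if the null is true, p[D|H]
    """
    miss = (1 - p_null) * (1 - power)          # p[~H] x p[~D|~H]
    correct_rejection = p_null * (1 - alpha)   # p[H] x p[~D|H]
    return miss / (miss + correct_rejection)


# A probable null (.9) versus a fifty/fifty null (.5), power held at .35.
for p_null in (0.9, 0.5):
    share = false_negative_share(p_null, power=0.35)
    print(f"p[H] = {p_null}: misses among nonsignificant results = {share:.4f}")
```

The shares are about .0707 for the probable null and .4063 for the fifty/fifty null, matching the .07 and .41 given above.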

So here’s the pickle. Holding everything else constant, using a prior for the null of .5 instead of .9 reverses the relative probability of the two types of error. If the null is considered highly probable at the outset, Type I errors are the dominant risk. If the null is a fifty/fifty proposition at the outset, Type II errors are the dominant risk.

What makes me think that the latter case is the one to worry about? A hypothesis of an effect size that is exactly 0 has no probability at all. It is a point on a line that is infinitely long and infinitely break-downable. Only regions or ranges have probabilities. We can say that finding a positive value on a scale centered on 0 has a probability and that finding a negative value also has a probability. If we are ignorant before collecting data, the probability of a positive numerical effect and the probability of finding a negative numerical effect may be said to be the same, namely .5. The result is that we should worry about missing true phenomena.

Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The long way from α-control to validity proper: Problems with a short-sighted false-positive debate. Perspectives on Psychological Science, 7, 661-669.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531-536.