Jacob Cohen (1923-1998) was a pioneer of psychological statistics. He taught us about effect sizes, power analysis, and multivariate regression, among many other things. I have always admired his ability to combine technical rigor with good judgment. During the last decade of his life, Cohen published two particularly insightful papers in the American Psychologist. Both had to do with Null Hypothesis Significance Testing (NHST). In "Things I have learned (so far)," Cohen (1990) questioned the idea of living by p values alone and suggested that researchers avail themselves of the many tools they can find in the statistical toolbox. In "The earth is round (p < .05)," Cohen (1994) outed himself as a Bayesian. He made clear what many were already dimly aware of, namely that what you want from statistical testing is the probability that an hypothesis is true given the evidence, whereas what you get from the standard tests is the probability of the evidence assuming that the (null) hypothesis is true. How to get from the latter to the former is a matter of ongoing debate (see the brouhaha over Daryl Bem's time travel studies).
The topic of today's post, however, is Cohen's opening salvo in Earth. He asks us to
"consider the following: A colleague approaches me with a statistical problem. He believes that a generally rare disease does not exist at all in a given population, hence H0: P = 0. He draws a more or less random sample of 30 cases from this population and finds that one of the cases has the disease, hence p = 1/30 = .033. He is not sure how to test H0, chi-square with Yates's (1951) correction or the Fisher exact test, and wonders whether he has enough power. Would you believe it? And would you believe that if he tried to publish this result without a significance test, one or more reviewers might complain? It could happen" (p. 997).
Cohen must have assumed that the silliness of testing a null hypothesis of zero was obvious to everyone, so he spilled no more ink to explain it. Yet his own story suggests that it is not obvious to everyone, editors and reviewers included, why it is silly to set up and test a null hypothesis of exactly zero. So let me try to explain what Cohen meant.
Suppose you are the silly scientist and you go ahead to compute a one-sample chi square test. The chi square test statistic is obtained by squaring the differences between observed and expected frequencies, dividing them by the expected frequencies and summing these ratios. If the expected frequency is zero, the ratio is undefined and thus the chi-square test fails to run. The Yates correction does not change this because all it does is make the numerator a bit smaller.
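Here is a minimal sketch of that failure in Python. The function name chi_square_gof is my own label for illustration; the code simply applies the verbal recipe above to Silly's data, with 1 diseased and 29 healthy cases observed and expected frequencies of 0 and 30 under the null.

```python
def chi_square_gof(observed, expected):
    """One-sample chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = 30
observed = [1, 29]                          # diseased, healthy
null_p = 0.0                                # H0: the disease does not exist
expected = [n * null_p, n * (1 - null_p)]   # [0.0, 30.0]

try:
    stat = chi_square_gof(observed, expected)
except ZeroDivisionError:
    # The first cell divides by an expected frequency of zero.
    stat = None

print(stat)  # None: the statistic is undefined
```

With any nonzero null value, the same function returns an ordinary test statistic.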
What about the Fisher exact test? No dice either. This test requires two samples for comparison; one is not enough. Perhaps Dr. Silly presses on and computes a binomial test. To get the probability of the data under the null, Silly computes p to the power of k, where p is the probability of finding a positive event ("success") under the null hypothesis and where k is the observed number of such successes (here k = 1). He then multiplies the result by (1-p) to the power of n-k, where n-k is the number of "failures" (here 29); the final multiplier is n factorial. The result is the numerator. It is divided by k!(n-k)!, but never mind that because you already see that the whole ratio will be zero if the numerator is zero - and it is when p = 0.
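The binomial computation can be sketched in a few lines of Python. The helper name binomial_pmf is mine, and math.comb stands in for n!/(k!(n-k)!).

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Under a null of exactly zero, one success in 30 cases has probability 0 ...
print(binomial_pmf(1, 30, 0.0))  # 0.0
# ... and the only outcome the null permits is no successes at all.
print(binomial_pmf(0, 30, 0.0))  # 1.0
```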
It is beginning to dawn on us that if the null hypothesis is zero, a single instance is sufficient refutation. This does not sound like an inductive inference but a deductive one. That is correct, but why? Look at it this way: coming from samples, empirical data carry a measure of uncertainty. If you took a sample of the same size again, the result might not be exactly the same. There is a probability distribution around the sampled value, and this distribution becomes narrower as the sample becomes larger. The standard deviation, that is, the average departure from the center, of the distribution of sampled values is the standard error. The standard error of a proportion is √(p(1-p)/n). We see that with p = 0 in the numerator, the standard error is zero and the null is doomed by a single observation. In other words, with a standard error of zero, the null cannot tolerate any departure from its assumed true value.
If the null is not nil (i.e., not exactly 0), it has a sampling distribution. When flipping a fair coin, p = .5 for heads, you may get various proportions in a sample. With true p = 0, however, you can't. To see why not, imagine you were repeatedly flipping a carnival coin with two heads.
But, you rebut, the standard error is computed from the observed proportion, not the theoretical one. In Silly's example, the standard error is almost identical to p itself, namely .0328. We also know that a 95% confidence interval is about 4 standard errors wide, so that 0 lies comfortably within the confidence limits. Those who believe that statistical significance can be read out of confidence intervals will claim that the null is not rejected. To re-rebut, the confidence exercise is strange because it places the lower limit in the murky realm of negative proportions. From the point of view of conventional NHST, the standard error is indeed estimated from the observed data, but it is then used to represent the expected slop under the null hypothesis. Cohen (1990, p. 1307) noted that "this seemed kind of strange and backward to me."
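To see the negative lower limit concretely, here is a small Python sketch. It assumes the usual normal-approximation interval (observed proportion plus or minus 1.96 standard errors); the function name wald_interval is my own label.

```python
from math import sqrt


def wald_interval(k, n, z=1.96):
    """Normal-approximation 95% interval around the observed proportion k/n."""
    p_hat = k / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error from the sample
    return p_hat - z * se, p_hat + z * se, se


lower, upper, se = wald_interval(1, 30)
print(round(se, 4))      # 0.0328
print(lower < 0 < upper)  # True: the lower limit is a negative proportion
```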
So what if we played it forwards? Dividing the empirical proportion by its standard error yields a z-score, which in turn has a cumulative probability. Using a correction suggested by Hays (1978, p. 372) we get z = .5 and p > .05 for Cohen's example of 1 positive case out of 30 tested. If, however, 4 out of 30 cases were positive, z = 1.88, with p < .05, one-tailed. It looks like the null hypothesis of zero was tested in both scenarios, but rejected only in the latter. Although I think that computing z is a reasonable way to index the distance of an empirical proportion from zero on a metric that takes sample size into account (and I confess that I did so in Krueger, 2000), I also believe - for reasons stated above - that Cohen correctly suggested that the p value does not have its usual meaning of the probability of the observed data (or data more extreme) conditional on the truth of the null hypothesis. We have to accept the following asymmetry: A single confirmed violation of a null hypothesis of zero is sufficient to refute that hypothesis. It then remains possible that in future samples, null results can occur with a probability > .95 without overturning the falsity of that hypothesis. Once you have a confirmed case of a flying pig, the hypothesis that no pigs can fly is refuted, and no number of flightless pigs can bring it back.
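The z values can be reproduced with a short Python sketch. One loud assumption: I am taking the correction to be the familiar continuity correction of 1/(2n) subtracted from the observed proportion. That assumption is mine for this illustration, but it does recover z of about .5 and 1.88. The function names are hypothetical.

```python
from math import sqrt, erfc


def corrected_z(k, n):
    """z for an observed proportion against zero, with an assumed
    continuity correction of 1/(2n) and the sample-based standard error."""
    p_hat = k / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - 1 / (2 * n)) / se


def upper_tail(z):
    """One-tailed p value: area of the standard normal above z."""
    return 0.5 * erfc(z / sqrt(2))


for k in (1, 4):
    z = corrected_z(k, 30)
    print(k, round(z, 2), upper_tail(z) < 0.05)
# 1 0.51 False
# 4 1.88 True
```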
Silly also asked about power. Power refers to the probability of finding a significant result assuming that a particular alternative hypothesis is true. There is no alternative hypothesis in Cohen's story, so we have to make one up, post hoc. Suppose the obtained result is the best estimate of the true nature of things. Here, p = .033 is the result and it is also now the alternative hypothesis. We know that a single "success" will nullify the null, so we may ask what is the probability that in another sample of 30 there will be no successes at all. We find that (1-p)^30 = .362. In other words, the probability that the null will be rejected is .638. If the sample is doubled, the two respective probabilities are .131 and .869. So perhaps Silly's power question wasn't all that silly. The result he obtained had an almost 2/3 chance of occurring given that it was true. This only sounds weird because the power question was asked after the fact. In another sense, Silly's question was silly because power analysis, like other forms of NHST, is supposed to protect you from false positive results. As I have shown above, there are no false positives if the null is zero. All positives are true.
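The post hoc power arithmetic takes only a few lines of Python. The function name is my own; the alternative hypothesis p = 1/30 is the made-up one from the paragraph above.

```python
def prob_at_least_one(p, n):
    """Chance of one or more successes in n cases when the true rate is p.
    With a null of exactly zero, this is the chance of rejecting the null."""
    return 1 - (1 - p) ** n


p_alt = 1 / 30   # the observed proportion, adopted post hoc as the alternative

for n in (30, 60):
    power = prob_at_least_one(p_alt, n)
    print(n, round(1 - power, 3), round(power, 3))
# 30 0.362 0.638
# 60 0.131 0.869
```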
Cohen then moved on to develop his Bayesian reformulation of NHST without looking back at Silly. Now, Bayes's Theorem allows you to derive the probability of the null being true given the data from the prior probability of the null being true, the probability of the data under the null, and the probability of the data under all other hypotheses. Can a Bayesian analysis change the verdict that an exact null is refuted by a single positive instance? No! The reason is that in the formula that delivers the probability of the null hypothesis given the data, the probability of the data given the null hypothesis is in the numerator. Again, when the numerator is zero, the entire ratio is zero. In Silly's case, it does not matter how probable or improbable the disease in question is at the outset. A single case refutes the claim that no one has the disease.
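A small Python sketch makes the point about the numerator. The function name is hypothetical, and lumping "all other hypotheses" into a single alternative with a rate of 1/30 is my simplifying assumption; any alternative that gives the data a nonzero probability leads to the same result.

```python
from math import comb


def posterior_null(prior_null, lik_null, lik_alt):
    """Bayes's Theorem for a null versus one lumped alternative:
    P(H0 | data) = P(H0) P(data | H0) / P(data)."""
    numerator = prior_null * lik_null
    evidence = numerator + (1 - prior_null) * lik_alt
    return numerator / evidence


n, k = 30, 1
lik_null = comb(n, k) * 0.0 ** k * 1.0 ** (n - k)            # 0.0 under p = 0
lik_alt = comb(n, k) * (1 / 30) ** k * (29 / 30) ** (n - k)  # assumed alternative

# However strongly the null is believed beforehand (short of certainty),
# the posterior probability of the null is zero.
for prior in (0.5, 0.9, 0.999):
    print(prior, posterior_null(prior, lik_null, lik_alt))   # 0.0 each time
```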
This brings me back to Bem's brouhaha. Bem pitted his ESP hypothesis against the null hypothesis of random success, with p = .5. The now famous result, one might recall, is that participants succeeded at a task 53% of the time when chance would grant them only 50%. NHST yielded p < .05, thereby putting ESP back on the map. Well, sort of. What we're witnessing now is a series of charges and parries, which all have to do with statistical questions such as "Did Bem's study have too little (or too much) power?" "What is the probability that the results will replicate?" "Would the null hypothesis survive a Bayesian test?"
None of this would occur if Bem had set up a null hypothesis of true zero. Instead of counting a choice a success that could also occur by chance, Bem could have looked for confirmatory evidence of ESP in a realm where chance does not help. There are claims of other paranormal feats that satisfy this condition. To show that levitation is possible, for example, you need only one confirmed levitator, not someone who succeeds at attempted levitation significantly more often than 50% of the time. James "The Amazing" Randi used to ask for just that, I think. One compelling existence proof would do the trick. Yet, no one has been able to deliver under mutually agreed upon testing conditions. To refute the hypothesis that the paranormal is impossible, demonstrate it just once and leave the significance tests at home.
Folk psychology gives an out: Miracles. A miracle is a cake you can have and eat too. It's an event that violates a law of nature while at the same time leaving it intact. If you thought that all swans are white, finding a black swan takes care of that hypothesis. The black swan is not a miracle because there is no law that says there shall be no black swans. If, however, you think no pig can fly, it's because the law of gravity forbids it. A flying pig would not only destroy porcine beliefs but it would unhinge the law of gravity itself. If we agree that gravity must survive as a law in the face of evidence to the contrary, we must declare that evidence to be miraculous.
So why not classify paranormal events as miracles? I suspect that although this solution is acceptable to many folk believers in the paranormal, their more academic brethren would balk at the idea. Using the notion of miracle to explain the phenomena would amount to a disavowal of science and an embrace of religion. Trying to stay within the framework of science, however, presents a paradox: If a presumed natural law is refuted by an empirical fact, then there must either be another natural law that accounts for both traditional facts and the new fact or there must be a collapse of the very notion of natural law. The latter option is undesirable because it denies science even more strongly than the miracle gambit does. The former option upholds science but devours the notion that something extraordinary has been observed. The presumed extra-sensory or extra-natural event has been assimilated and naturalized. Ergo, parapsychology is either false or self-eliminating.
You may encounter existence proofs in your own life. Unlike levitation, which appears to defy the laws of nature, there may be things worth doing once, things that are difficult or things that induce fear, but things that are possible. The movie "The bucket list" approached this topic from a comedic angle, but the issue is a deep one. What are the things you want to have done or want to have experienced before you sign off? The difference between never and once is categorical. After that it's just repetition. The single act, the single experience is significant in a personal and in a logical sense.
Back to Chi Square
In this essay I have explored several ways in which a proportion > 0 can be seen in relation to an hypothesized population value of 0. The emerging lesson is that logically a single positive instance falsifies that hypothesis. At the same time, the routines of NHST can still be run with the intuitively satisfying result that p values decrease as empirical proportions or sample sizes increase. My final exercise in this direction is to heterodoxically pretend the test of proportion p against 0 is a two-sample test. Suppose again that we're faced with Dr. Silly's sample of 1 positive instance in a sample of 30. Instead of comparing 1/30 against a true value of 0, we might pretend we had another sample of 30 with no successes. Now we find that chi square = 1.02 and p = .31. If we had 4 successes in one sample of 30 and none in the other, the same uncorrected test gives chi square = 4.29, p = .04. But I suppose we're not supposed to make up data points, even if they're all nothing.
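For readers who want to rerun the pretend two-sample test, here is a Python sketch. It assumes the ordinary Pearson chi-square for a 2 x 2 table without a continuity correction, which is the version that reproduces chi square = 1.02 and p = .31 for the first table; the function name is my own.

```python
from math import sqrt, erfc


def pearson_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the table [[a, b], [c, d]],
    with its p value on 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return chi2, p_value


# Real sample: 1 success, 29 failures. Pretend sample: 0 successes, 30 failures.
chi2, p_value = pearson_2x2(1, 29, 0, 30)
print(round(chi2, 2), round(p_value, 2))  # 1.02 0.31

# Same exercise with 4 successes in the real sample.
print(pearson_2x2(4, 26, 0, 30))
```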
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Hays, W. L. (1978). Statistics for the social sciences (2nd ed.). Holt, Rinehart & Winston.
Krueger, J. (2000). Distributive judgments under uncertainty: Paccioli's game revisited. Journal of Experimental Psychology: General, 129, 546-558.