The Life and Times of P
The American Statistical Association refuses to kill the p value.
Posted Jun 04, 2017
Totgesagte leben länger. ['Those pronounced dead live longer,' or in the Vulgar Latin: ‘Declaravit iam mortuum vivere’] ~ Origin unknown
I hope p values are legitimate measures. Otherwise I’ve learned nothing in stats. ~ Lauren Krueger, student of business and finance, Maastricht University
Statistics is about probability, and no single probability index has seen as much use and as much abuse as the so-called p-value (see here for an earlier essay). Little p expresses the probability of the data (or data more extreme) assuming that a particular hypothesis (i.e., a theoretical model of reality) is correct. Often, this theoretical model is atheoretical in the sense that it assumes that there is nothing there. You might say, I don’t believe that you can tell the difference – from tasting alone – between milk having been added to tea and tea having been added to milk. To say that you can’t tell the difference is to say that each time you try you have a .5 probability of being correct. If then you succeed in 8 out of 10 attempts, p = .055 with a one-tailed test. By convention, we’d be intrigued by your successes, but we would not infer that you had a demonstrable ability to detect the order of pouring.
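For readers who want to check the tea-tasting figure, here is a minimal sketch in Python. It assumes ten independent attempts and a chance success rate of .5, and the function name is made up for this illustration.

```python
from math import comb

def binomial_upper_tail(successes: int, trials: int, chance: float = 0.5) -> float:
    """Probability of at least `successes` hits in `trials` attempts under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 8, 9, or 10 correct calls out of 10: (45 + 10 + 1) / 1024
p_one_tailed = binomial_upper_tail(8, 10)
print(round(p_one_tailed, 3))  # 0.055
```

The sum runs over 8, 9, and 10 successes because p covers the observed data "or data more extreme."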
P is everywhere. Be it the assessment of associations among empirical variables or differences in means, medians, ranks, or proportions, p provides a common metric. The test statistics may vary (r, b, t, F, chi-square, U, or W), but p makes them comparable. Yet, many statisticians hate p because of the misinterpretation and misuse we have all seen or because of what p is not and does not pretend to be, namely the probability of the hypothesis given the data. The former grounds for grumpiness are a distraction because they are a matter of p’s reception and not of its nature. The latter are moot because p, if it could speak, would not claim to be the equal of its inverse conditional probability. Clearly, the probability of the data given the hypothesis, p(D|H), cannot pretend to be the probability of the hypothesis given the data, p(H|D). Only people who don’t understand how inverse conditionals are related can make that mistake, which returns us to the issue of ignorance and misuse.
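How the two conditionals are related is given by Bayes' theorem. The sketch below is only an illustration: the prior probabilities and the probability of the data under the alternative are invented numbers, and p(D|H) is treated loosely as a single probability rather than a tail area.

```python
def posterior_of_hypothesis(p_data_given_h: float,
                            prior_h: float,
                            p_data_given_not_h: float) -> float:
    """Bayes' theorem: p(H|D) = p(D|H) p(H) / [p(D|H) p(H) + p(D|not-H) p(not-H)]."""
    numerator = p_data_given_h * prior_h
    evidence = numerator + p_data_given_not_h * (1 - prior_h)
    return numerator / evidence

# Same p(D|H) = .05 in both cases; only the (hypothetical) prior differs.
even_prior = posterior_of_hypothesis(0.05, prior_h=0.5, p_data_given_not_h=0.5)
confident_prior = posterior_of_hypothesis(0.05, prior_h=0.9, p_data_given_not_h=0.5)
print(round(even_prior, 3), round(confident_prior, 3))  # 0.091 0.474
```

The same p(D|H) yields very different values of p(H|D) depending on the prior, which is why p alone cannot stand in for the probability of the hypothesis.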
Often, contempt for p is mixed in with or justified by contempt for null hypothesis testing. The null (or nil) hypothesis of no effect is often portrayed as a straw man. We already know that it is false, so showing that it is false by way of reporting a low p value is a charade masquerading as science. Really? Do we already know that you have the ability to detect whether tea was added to milk or milk to tea (or the notable “ability” to get it backwards)? Null hypotheses are set up as testable predictions when a reasonable person would expect no there to be there. Then, when in a well-designed and replicated set of studies, p remains low, we have a (probabilistic) existence proof.
There has been clamoring about the horrors of p for a century, and recently it is again reaching fever pitch, in large part because scandalous misuses of p have come to our attention, and not because the inherent horrors of the method have been revealed, either by smart mathematics or auto-da-fé. To whom do you turn for an authoritative judgment regarding p and its use? The American Statistical Association of course!
And behold! The ASA rose to the task and issued a statement regarding p. The board convened and invited experts of different schools of thought to offer their assessment, and in the end a judicious and cautious report was published (Wasserstein & Lazar, 2016). The tenor is that the p value has some evidential value but that it is easily misinterpreted and misused. Care should be taken and other statistical tools should be used as well. This is hardly a condemnation of p values as devil’s work. Nor is it a declaration that alternative methods are available that are so clearly superior that significance testing and the reporting of p can and should be abandoned. In other words, the ASA report is remarkable in what it does not say. Researchers and their students may carry on as they have, while trying to be ethical and mindful. No more, no less.
The ASA report is the work of a committee, reflecting a condensation of a range of opinion into one narrative designed to minimize disagreement on average. Interestingly (and to the ASA’s credit), 21 commentaries are published along with the report as supplementary materials. Many of the writers appear to have been involved with the preparation of the ASA report, so their individual assessments provide an interesting window into the variation in opinion that is aggregated out in the report. Here are some themes that emerge across the individual commentaries:
In my reading, four of the commentaries (Benjamin & Berger, Carlin, Johnson, and Rothman) clearly advocate an abandonment of the p value (i.e., the non-abandonment group is the majority, p = .007, two-tailed). The others grudgingly concede that p has some uses, that other methods (especially Bayesian calculations) have the same or different problems, or that the ‘real’ problem is not any particular statistical index, but the broader epistemological context. Some of the commentators even emphatically support the use of the p value if properly understood. Here are some memorable quotes, coming from 7 of the 21 commentaries:
“What made the p-value so useful and successful in science throughout the 20th century, despite of the misconceptions so well described in the statement? In some sense it offers a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs.” ~ Benjamini
“Sometimes, especially when using emerging new scientific technologies, the p-value is the only way to quantify uncertainty.” ~ Benjamini
“P-values are handy measures of extremity and serve to describe a set of numbers in a way similar to that of Z-scores and confidence intervals.” ~ Berry
P-values “serve to describe a dataset of numbers and in that sense they are useful tools.” ~ Berry
“It is not an issue of abandoning P-values, it is an issue of abandoning poor research.” ~ Ioannidis
“P-values will continue to offer helpful insights.” ~ Ioannidis
P-values are “an index to the evidential meaning of the data within a statistical model.” ~ Lew
“P-values are a useable and defensible answer to the question of what the data say.” ~ Lew
“It’s incorrect to claim a p-value is “invalid” for not matching a posterior probability based on one or another prior distribution.” ~ Little
“P-values should be retained for a limited role as part of the machinery of error-statistical approaches.” ~ Senn
“Science progresses in part by ruling out potential explanations of data. p-values help assess whether a given explanation is adequate.” ~ Stark
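As an aside on the tally of 4 abandonment commentaries out of 21: the p = .007 reported above can be reproduced with an exact two-tailed binomial test against an even split. Which test was actually used is not stated, so this sketch is an assumption, and the function name is invented for the illustration.

```python
from math import comb

def two_tailed_sign_test(minority: int, total: int) -> float:
    """Exact binomial test against a 50/50 split, doubling the smaller tail.

    Assumes `minority` is the smaller of the two counts.
    """
    lower_tail = sum(comb(total, k) for k in range(minority + 1)) / 2**total
    return min(1.0, 2 * lower_tail)

# 4 commentaries for abandonment versus 17 against, out of 21
p_majority = two_tailed_sign_test(4, 21)
print(round(p_majority, 3))  # 0.007
```

Doubling the tail is legitimate here only because the null distribution is symmetric at .5.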
But . . .
mis- and abuse remain a problem. When googling “the p value,” an essay by Deborah Rumsey comes in first. Writing for dummies.com, Deb declares that “a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.” She asks us to swallow her argument with a gustative example, inviting us to imagine that “a pizza place claims their delivery times are 30 minutes or less on average but you think it’s more than that. You conduct a hypothesis test because you believe the null hypothesis, Ho, that the mean delivery time is 30 minutes max, is incorrect. Your alternative hypothesis (Ha) is that the mean time is greater than 30 minutes. You randomly sample some delivery times and run the data through the hypothesis test, and your p-value turns out to be 0.001, which is much less than 0.05.”
And, to be sure you understand, Deb declaims that “In real terms, there is a probability of 0.001 that you will mistakenly reject the pizza place’s claim that their delivery time is less than or equal to 30 minutes.”
Were it only so. The ASA has a lot of work to do.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129-133. doi: 10.1080/00031305.2016.1154108
Commentaries are here