
Rethinking P-Values: Is "Statistical Significance" Useless?

The American Statistician has published 43 papers on "A World Beyond p < 0.05."

Source: GraphicMama-team/Pixabay

This week, The American Statistician published a special issue, "Statistical Inference in the 21st Century: A World Beyond p < 0.05," which includes 43 new papers by leading statisticians. The objective of this issue is “to end the practice of using a probability value (p-value) of less than 0.05 as strong evidence against a null hypothesis or a value greater than 0.05 as strong evidence favoring a null hypothesis."
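
For readers who have forgotten the mechanics: a p-value answers one narrow question: if the null hypothesis of "no effect" were true, how often would chance alone produce data at least as extreme as what was observed? Here is a minimal Python sketch (the data are invented for illustration, and Welch's t-test is just one common choice) showing why a bright line at 0.05 is hard to defend: two replications of the same experiment, with the same modest true effect, can land on opposite sides of it by sampling luck alone.

    # Minimal sketch (fabricated data): what a p-value is, and why the
    # 0.05 threshold is arbitrary.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)

    # Two replications of the SAME experiment: identical populations,
    # identical modest true effect (0.4 standard deviations).
    control_a = rng.normal(loc=0.0, scale=1.0, size=40)
    treated_a = rng.normal(loc=0.4, scale=1.0, size=40)
    control_b = rng.normal(loc=0.0, scale=1.0, size=40)
    treated_b = rng.normal(loc=0.4, scale=1.0, size=40)

    # Welch's t-test: p is the probability of seeing a difference at
    # least this large IF there were truly no effect -- nothing more.
    _, p_a = stats.ttest_ind(treated_a, control_a, equal_var=False)
    _, p_b = stats.ttest_ind(treated_b, control_b, equal_var=False)

    print(f"replication A: p = {p_a:.3f}")
    print(f"replication B: p = {p_b:.3f}")
    # With this design, sampling luck alone can put one replication
    # below 0.05 and the other above it.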

Before diving into the content of the latest issue of The American Statistician, it's important to note: I'm well aware that "probability values" are a geeky topic that probably seems dull and esoteric to the general reader. That said, p-values are really important and deserve more attention. Therefore, I'm going to do my best to make this as engaging and easy to digest as possible by writing in a first-person, conversational style.

For a science reporter, covering the latest “do’s and don’ts” of p-value-based scientific reporting is Exhibit A of metacognition in action. For example, choosing a "catchy" title for this post was practically impossible; I've never before had to debate whether to use a wonky equation like "p < 0.05" in a title.

While typing these introductory paragraphs, I’m "thinking a lot about my thinking" and about how to structure this post. My self-imposed goal here is to accurately convey the gist of the 43 recently published papers by leading statisticians around the world in under 1,500 words.

I also want my reportage on the editorial recap of these 43 new papers in The American Statistician’s latest special issue to be Exhibit B of the editors’ clarion call, summed up in the four-letter acronym ATOM: Accept that there will always be uncertainty, and be Thoughtful, Open, and Modest when reporting on science. Hopefully, this post reflects the ATOM model.

The three TAS editors of this special issue—Ronald Wasserstein, Allen Schirm, and Nicole Lazar—are on a mission to encourage scientists and science writers around the globe to adopt their ATOM acronym. In an editorial describing the layout of their March 2019 special issue, The American Statistician editors sum up the main takeaways and common threads in the 43 papers they've compiled:

“Based on our review of the articles in this special issue and the broader literature, we conclude that it is time to stop using the term 'statistically significant' entirely. No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false, or unimportant.”

Wasserstein, Schirm, and Lazar go on to say, “So, let’s do it. Let’s move beyond 'statistically significant,' even if upheaval and disruption are inevitable for the time being. It’s worth it. In a world beyond ‘p < 0.05,’ by breaking free from the bonds of statistical significance, statistics in science and policy will become more significant than ever. Regardless of whether it was ever useful, a declaration of 'statistical significance' has today become meaningless.”

Overall, I found the language and tone of this editorial about the potentially eye-glazing topic of p-values to be surprisingly playful and fun to read. For example, in their introduction, the TAS editors speak directly to the reader:

“Some of you exploring this special issue of The American Statistician might be wondering if it’s a scolding from pedantic statisticians lecturing you about what not to do with p-values, without offering any real ideas of what to do about the very hard problem of separating signal from noise in data and making decisions under uncertainty. Fear not. In this issue, thanks to 43 innovative and thought-provoking papers from forward-looking statisticians, help is on the way."

Before listing five don’ts of science reporting, the editors humorously write:

“Don’t. Don’t. Just…don’t. Yes, we talk a lot about don’ts. The ASA Statement on p-Values and Statistical Significance (Wasserstein & Lazar, 2016) was developed primarily because after decades, warnings about the don’ts had gone mostly unheeded. The statement was about what not to do, because there is widespread agreement about the don’ts. There’s not much we can say here about the perils of p-values and significance testing that hasn’t been said already for decades... But if you’re just arriving to the debate, here’s a sampling of what not to do.”

After this explanation, the editors provide a bullet-point list of their five don’ts (a short simulation after the list illustrates the first, third, and fifth):

  1. Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant.”
  2. Don’t believe that an association or effect exists just because it was statistically significant.
  3. Don’t believe that an association or effect is absent just because it was not statistically significant.
  4. Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
  5. Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).
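
To make the first, third, and fifth don'ts concrete, here is a small simulation (my own illustration with fabricated data, not the editors'): with an enormous sample, a scientifically trivial effect will typically earn the "statistically significant" label, while a large, potentially important effect in a small pilot study often will not.

    # Sketch (fabricated data): "significance" tracks sample size as
    # much as it tracks scientific importance.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)

    # Trivially small true effect (0.02 SD), enormous sample:
    control_big = rng.normal(0.00, 1.0, size=200_000)
    treated_big = rng.normal(0.02, 1.0, size=200_000)
    _, p_trivial = stats.ttest_ind(treated_big, control_big)

    # Large true effect (0.8 SD), small pilot-sized sample:
    control_small = rng.normal(0.0, 1.0, size=10)
    treated_small = rng.normal(0.8, 1.0, size=10)
    _, p_large = stats.ttest_ind(treated_small, control_small)

    print(f"trivial effect, n = 200,000 per arm: p = {p_trivial:.4f}")
    print(f"large effect,   n = 10 per arm:      p = {p_large:.4f}")
    # The trivial effect will usually cross p < 0.05 and the large one
    # often won't -- the label says little about practical importance.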

After rattling off this list of don’ts, the TAS editors reiterate some of the long-term benefits of avoiding the use of “statistical significance.” Wasserstein, Schirm, and Lazar write,

“As we venture down this path, we will begin to see fewer false alarms, fewer overlooked discoveries, and the development of more customized statistical strategies. Researchers will be free to communicate all their findings in all their glorious uncertainty, knowing their work is to be judged by the quality and effective communication of their science, and not by their p-values. As 'statistical significance' is used less, statistical thinking will be used more. For the integrity of scientific publishing and research dissemination, therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight."

Because there’s way too much material in these 43 papers to discuss in a single blog post, I’ve decided to curate a short, alphabetized list of selected quotations from a handful of these authors, taken from the TAS press release:

"Words like 'significance' in conjunction with p-values and 'confidence' with interval estimates mislead users into overconfident claims. We propose researchers think of p-values as measuring the compatibility between hypotheses and data, and interpret interval estimates as 'compatibility intervals' rather than 'confidence intervals.'" —From "Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication" by Valentin Amrhein, David Trafimow, and Sander Greenland

"Considerable social change is needed in academic institutions, in journals, and among funding and regulatory agencies. We suggest partnering with science reform movements and reformers within disciplines, journals, funding agencies and regulators to promote and reward 'reproducible' science and diminish the impact of statistical significance on publication, funding and promotion." —From "Why Is Getting Rid of P-Values So Hard? Musings on Science and Statistics" by Steven Goodman

"Reproduction of research should be encouraged by giving byline status to researchers who reproduce studies. We would like to see digital versions of papers dynamically updated to display 'Reproduced by...' below the original research authors' names or 'Not yet reproduced' until it is reproduced." —From "Quality Control for Scientific Research: Addressing Reproducibility, Responsiveness, and Relevance" by Douglas W. Hubbard & Alicia L. Carriquiry

"Evaluation of manuscripts for publication should be 'results-blind'. That is, manuscripts should be assessed for suitability for publication based on the substantive importance of the research without regard to their reported results." —From "The Impact of Results Blind Science Publishing on Statistical Consultation and Collaboration" by Joseph J. Locascio

"A number of factors should no longer be subordinate to 'p < 0.05'. These include relevant prior evidence, plausibility of mechanism, study design and data quality, and the real-world costs and benefits that determine what effects are scientifically important. The scientific context of the study matters and this should guide its interpretation." —From "Abandon Statistical Significance" by Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett

In closing, the TAS editors sum up the main takeaway of their March 2019 special issue, "Statistical Inference in the 21st Century: A World Beyond p < 0.05," without wasting any words. They conclude: “We summarize our recommendations in two sentences totaling seven words: ‘Accept uncertainty. Be thoughtful, open, and modest.’ Remember ‘ATOM.’”

References

Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar. "Moving to a World Beyond 'p < 0.05'." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2019.1583913

Valentin Amrhein, David Trafimow, and Sander Greenland. "Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2018.1543137

Steven Goodman. "Why Is Getting Rid of P-Values So Hard? Musings on Science and Statistics." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2018.1558111

Douglas W. Hubbard and Alicia L. Carriquiry. "Quality Control for Scientific Research: Addressing Reproducibility, Responsiveness, and Relevance." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2018.1543138

Joseph J. Locascio. "The Impact of Results Blind Science Publishing on Statistical Consultation and Collaboration." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2018.1505658

Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett. "Abandon Statistical Significance." The American Statistician (First published online: March 20, 2019). DOI: 10.1080/00031305.2018.1527253
