Though not useless, significance testing can still bias perception.
Posted Sep 03, 2019
I have defended the use of significance testing, suggesting that the p value can play a role as a heuristic inferential cue. It does predict the inverse conditional probability, that of the tested hypothesis given the evidence, surprisingly well under most conditions (Krueger & Heck, 2017).
Significance testing, and the p value it yields, was designed to be useful when two conditions are met. First, the samples are not huge. Second, the observed effects are neither huge nor tiny. Ritualistic applications of significance testing ignore these conditions. I have seen very large correlations obtained with large samples tested against zero, and I have seen very large proportions in large samples tested against 50%. I asked one author why the test was run. The answer was: “Editors and reviewers will demand it anyway.” They might; but are they not educable?
We find an example of significance testing on tiny effects in a very large sample in Quoidbach et al. (Psychological Science, 2019). Here is what the authors report on page 1115: The “average happiness level was related to how much time [the respondents] spent—in order of importance—with their romantic partners, r(30790) = .19, p < .0001; friends, r(30790) = .08, p < .0001; best friends, r(30790) = .06, p < .0001; children, r(30790) = .06, p < .0001; other family members, r(30790) = .04, p < .0001; siblings, r(30790) = .03, p < .0001; acquaintances, r(30790) = .02, p < .0001; coworkers or clients, r(30790) = .02, p < .0001; and parents, r(30790) = .01, p < .05. The time they spent in the company of strangers was unrelated to happiness, r(30790) = .00, p = .35.”
The authors acknowledge that the effects are very small; but what do we gain from significance testing? It seems to me that reporting the correlations, perhaps with a third decimal digit and a one-time note on the degrees of freedom, would be just fine. The final p value of .35 for a correlation of .00 is particularly mysterious. What if p had been .11? It seems that this sort of reporting violates the Gricean maxim of quantity: say as much as is needed, and no more.
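To see how little the p values add once r and the degrees of freedom are known, here is a minimal sketch that recovers p from r alone. It assumes the standard conversion t = r * sqrt(df / (1 - r^2)) and, because df = 30790 is so large, replaces the t distribution with a normal approximation so that only the Python standard library is needed. The function name p_from_r is my own, and the inputs are the rounded correlations from the quoted passage, not the authors' raw data.

```python
import math

def p_from_r(r, df):
    """Return (t, two-sided p) for H0: rho = 0.

    Assumption: with df this large, the t distribution is approximated
    by the standard normal, so p = erfc(|t| / sqrt(2)).
    """
    t = r * math.sqrt(df / (1.0 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

DF = 30790  # degrees of freedom as reported in the quoted passage

# Rounded correlations as printed in the quoted passage.
for r in (0.19, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01, 0.00):
    t, p = p_from_r(r, DF)
    print(f"r = {r:.2f}   t = {t:6.2f}   p = {p:.3g}")
```

Entered as rounded values, r = .01 yields a p of about .08 and r = .00 yields p = 1, so the published p values of < .05 and .35 were presumably computed from unrounded correlations. A third decimal digit would have told the reader the same thing more directly.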
The p value, once it has been computed, cannot be held responsible if it is used for categorical inferences using rigid thresholds. Once this is done, the narrative becomes classificatorial. Quoidbach et al. (2019, p. 1115) conclude that “social contacts—especially with people who are close—play a role in people’s happiness.” Contacts with parents (r = .01) are now in the same league as contacts with romantic partners (r = .19), although the happiness benefits of parents seem rather close to the non-benefits of strangers (r = .00). Oh, parents! We hardly know thee!
To be fair, Quoidbach et al. modulate their categorical inference with the ‘especially’ clause. Their assessment of ‘closeness’ is rather ad or post hoc, and it raises another question. Once significance testing is let loose, why test effects only against zero? We might find that happiness is significantly greater during contacts with friends (r = .08) than with best friends (r = .06). Then what?
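The friends versus best friends comparison can be sketched with Fisher's r-to-z transformation. This is a deliberately rough illustration, not the test one would actually run: it treats the two correlations as if they came from independent samples of equal size, which they do not, since the same respondents supplied both. A proper test for dependent correlations would also need the correlation between the two time measures, which the quoted passage does not give. The function name fisher_z_diff and the sample size N = df + 2 are my assumptions.

```python
import math

def fisher_z_diff(r1, r2, n1, n2):
    """Return (z, two-sided p) for H0: rho1 = rho2.

    Assumption (not true of the study): the two correlations come
    from independent samples of sizes n1 and n2.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher r-to-z
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal p
    return z, p

N = 30790 + 2  # sample size implied by the reported degrees of freedom

z, p = fisher_z_diff(0.08, 0.06, N, N)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Under these simplifying assumptions the difference of .02 clears the conventional .05 threshold, which is exactly the kind of result that invites a categorical story where the effect sizes do not support one.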
Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology, 8, 908. https://doi.org/10.3389/fpsyg.2017.00908
Quoidbach, J., Taquet, M., Desseilles, M., de Montjoye, Y.-A., & Gross, J. J. (2019). Happiness and social behavior. Psychological Science, 30, 1111-1122.