
What to do with significance testing, the bench scientist’s statistical workhorse? Short of tossing it out once and for all, there are proposals to toughen conventional practice. These proposals seek a reduction in the errors that befall decision-making under uncertainty. A Type I error is a False Alarm (FA). It occurs when a result is declared statistically significant although the null hypothesis is true. A Type II error is a Miss (M). It occurs when a result is not declared significant although the null hypothesis is false. Benjamin and 71 co-authors (2017) ask that the probability alpha (of an FA) be lowered from the conventional level of .05 to the tough level of .005. Button et al. (2013) ask that the probability beta (of an M) be lowered. For illustrative purposes, I will use beta = .2 as the conventional level and beta = .02 as the tough level, which preserves the tenfold increase in toughness. Many want to tighten significance testing at both ends; 3 of the 7 authors of the Button paper are also among the 72 authors of the Benjamin paper.

Lowering the significance threshold alpha decreases FA and Hits (H), while increasing Misses (M) and Correct Retentions (CR) of the null hypothesis. Raising power (1 - beta) by lowering beta (ceteris paribus) increases H and reduces M, while leaving FA and CR unaffected. It thereby reduces the FA ratio and the M ratio. The overall proportion of significant results (H and FA out of all results) increases. Stated differently, lowering the rejection threshold increases the specificity of the test (lowering the FA rate), and increasing statistical power increases its sensitivity (raising the Hit rate). The two major ways of making this happen are improvements in the precision of measurement and larger data sets. Power analysis targets the latter directly. It asks how many observations (N) are needed so that there is a specified probability (sensitivity) of obtaining a significant result, assuming that an effect of a particular size exists. Power analysis targets the former indirectly in that it implies that larger standardized effects require fewer observations to meet a particular power level; greater measurement precision makes standardized effects, if they exist, larger by reducing measurement error.
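
To make the sample-size arithmetic concrete, here is a minimal sketch in Python (my own illustration, not the calculator used for this post). It computes an approximate n per group for a two-sample comparison of means from the standardized effect size d, alpha, and beta, using the standard normal-approximation formula n = 2 * (z_(1-alpha/2) + z_(1-beta))^2 / d^2; it happens to reproduce the per-condition ns used in the example below.

    # Minimal sketch: approximate sample size per group for a two-sample
    # comparison of means, via the normal approximation.
    from math import ceil
    from statistics import NormalDist

    def n_per_group(d, alpha, beta, two_sided=True):
        """n per group so that a test at level alpha has power 1 - beta for effect size d."""
        z = NormalDist().inv_cdf
        z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
        z_beta = z(1 - beta)
        return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

    print(n_per_group(d=0.5, alpha=0.05, beta=0.20))   # conventional settings: 63
    print(n_per_group(d=0.5, alpha=0.005, beta=0.02))  # tough settings: 190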

Assuming that researchers already measure their constructs as precisely as they can, they can only increase sample size to meet the demands of the new, tougher statistics. Consider a comparison of two independently sampled means with a mean difference of mu = 5 and a standard deviation of sigma = 10, so that the standardized effect size is d = .5, and let the prior probability of the null hypothesis be p(H0) = .5. A study with conventional settings of alpha = .05 and beta = .2 requires fewer observations (n = 63 per condition) than a tough study where alpha = .005 and beta = .02 (n = 190). The total difference comes to N = 126 vs. 380. In the tough study, the FA ratio (the proportion of false alarms among significant results) drops from .059 to .005, and the M ratio (the proportion of misses among nonsignificant results) drops from .174 to .020. If we assume, as many scholars do, that the prior probability of the null hypothesis is high, say p(H0) = .8, the FA ratio drops from .20 to .02 and the M ratio drops from .050 to .005. These changes look like nice gains, particularly if we look at them as ratios. The improvements are tenfold. Yet the error rates are quite low to begin with. Researchers must ask if it is worth paying for a reduction in the incidence of FA from .04 to .004 with a threefold increase in sample size. The gain in true positives is particularly slim when the null hypothesis is probable a priori. In our example, the probability of H rises from .16 to .196. By the ratio metric, a 3-fold increase in sample size yields only a 1.225-fold increase in discoveries.
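
The joint outcome probabilities and the two error ratios follow directly from alpha, beta, and the prior probability of the null hypothesis. A small sketch (again my own restatement, not the author's worksheet) reproduces the figures quoted in this paragraph:

    # FA ratio = false alarms among significant results;
    # M ratio = misses among nonsignificant results.
    def outcome_ratios(alpha, beta, p0):
        """p0 is the prior probability of the null hypothesis."""
        hit  = (1 - p0) * (1 - beta)   # effect exists, result significant
        fa   = p0 * alpha              # null true, result significant
        miss = (1 - p0) * beta         # effect exists, result not significant
        cr   = p0 * (1 - alpha)        # null true, result not significant
        return fa / (fa + hit), miss / (miss + cr), hit

    for alpha, beta in [(0.05, 0.20), (0.005, 0.02)]:
        for p0 in (0.5, 0.8):
            fa_ratio, m_ratio, hit = outcome_ratios(alpha, beta, p0)
            print(f"alpha={alpha}, beta={beta}, p(H0)={p0}: "
                  f"FA ratio={fa_ratio:.3f}, M ratio={m_ratio:.3f}, p(H)={hit:.3f}")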

Considerations of error control should be informed by valuations of correct and incorrect decisions of either kind. In our examples, the probability of an FA if the null is true is lower than the probability of an M if the null is false, i.e., alpha < beta. The test’s sensitivity is lower than its specificity. The common interpretation of this inequality is that researchers abhor FA more than they abhor M. Yet the valuations of the four possible outcomes are rarely made explicit before a study is run. It is instructive to infer what these valuations should have been, given the available estimates of alpha, beta, and the prior probability of the null hypothesis, p(H0).

To begin, assume that the value of each correct decision is 1 and the value of each incorrect decision is -1, i.e., VH = VCR = 1 and VFA = VM = -1. Multiplying these values by their respective probabilities, the expected values are EVH = .16, EVFA = -.04, EVM = -.04, and EVCR = .76 for the conventional case of alpha = .05, beta = .2, and p(H0) = .8. In contrast, EVH = .196, EVFA = -.004, EVM = -.004, and EVCR = .796 for the tough case of alpha = .005, beta = .02, and p(H0) = .8. Adding EVH and EVFA, and adding EVM and EVCR, we obtain the expected values of significance and nonsignificance, respectively. In the conventional case, EVsignificance = .12 and EVnonsignificance = .72. In the tough case, EVsignificance = .192 and EVnonsignificance = .792.
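
For readers who want to check the bookkeeping, a brief sketch (my restatement of the calculation described above, with the symmetric values of 1 and -1) reproduces these expected values:

    def expected_values(alpha, beta, p0, v_h=1.0, v_fa=-1.0, v_m=-1.0, v_cr=1.0):
        """Expected values of a significant and a nonsignificant result."""
        pr_h, pr_fa = (1 - p0) * (1 - beta), p0 * alpha
        pr_m, pr_cr = (1 - p0) * beta, p0 * (1 - alpha)
        ev_sig    = v_h * pr_h + v_fa * pr_fa
        ev_nonsig = v_m * pr_m + v_cr * pr_cr
        return round(ev_sig, 3), round(ev_nonsig, 3)

    print(expected_values(0.05, 0.20, p0=0.8))   # conventional: (0.12, 0.72)
    print(expected_values(0.005, 0.02, p0=0.8))  # tough: (0.192, 0.792)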

The key assumption of this reverse analysis is that researchers place value on the outcomes in such a way that they are indifferent between obtaining a significant or a nonsignificant result. Both would be equally welcome. The researcher would have no incentive and no reason to nudge the study towards significance or away from it. The numerically symmetric values in the example (equal magnitude, opposite sign) reveal a great imbalance: no matter whether the study is conventional or tough, the expected value of significance is lower than the expected value of nonsignificance. If we reverse engineer the values, that is, if we assume rational indifference, we must conclude that researchers value H or FA more highly than the symmetric assignment allows. What would it take to achieve indifference? We can raise VH to equal (VM x prM + VCR x prCR - VFA x prFA) / prH. For the conventional case, VH = 4.75 renders the researcher indifferent to the outcome of the significance test; for the tough case, VH = 4.06 does.
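
The same sketch can be extended to solve for the VH that produces indifference, following the formula above (again only an illustration of the calculation, not the author's own code):

    def indifference_vh(alpha, beta, p0, v_fa=-1.0, v_m=-1.0, v_cr=1.0):
        """Value of a Hit that equates the expected values of significance and nonsignificance."""
        pr_h, pr_fa = (1 - p0) * (1 - beta), p0 * alpha
        pr_m, pr_cr = (1 - p0) * beta, p0 * (1 - alpha)
        return (v_m * pr_m + v_cr * pr_cr - v_fa * pr_fa) / pr_h

    print(round(indifference_vh(0.05, 0.20, p0=0.8), 2))   # conventional: 4.75
    print(round(indifference_vh(0.005, 0.02, p0=0.8), 2))  # tough: 4.06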

These inferred values of H are quite similar. The small difference that exists means that researchers who are a little less excited about true discoveries would be willing to pay the cost of tripling the sample. Conventional wisdom, however, suggests that it is the high negative value of FA that motivates the use of the tough strategy. But if we assign a more negative value to FA, the expected value of a significant result drops for both the conventional and the tough case. Given the assumptions of this exercise, it does not seem possible that those who value FA most negatively will accept the cost of raising statistical power.

If we repeat this exercise for a null hypothesis that is improbable a priori, p(H0) = .2, we find that indifference is achieved if VH drops to .0625 in conventional testing and to .235 in tough testing. In other words, if the null hypothesis is a weak contender to begin with, rejecting it with conventional means adds little value. An interesting and important regularity emerges: the value of a true discovery increases inasmuch as its prior probability is low, a regularity that aligns well with the general inverse relationship between expectancy and value in nature and culture (Pleskac & Hertwig, 2014). In the present case, of course, this regularity is a matter of mathematical necessity.
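
Reusing the indifference_vh helper from the sketch above (an illustration, not part of the original analysis) with the low null prior reproduces these values:

    print(round(indifference_vh(0.05, 0.20, p0=0.2), 4))   # conventional: 0.0625
    print(round(indifference_vh(0.005, 0.02, p0=0.2), 4))  # tough: 0.2347, i.e., about .235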

These considerations, informal as they are, make me wonder whether tightening the noose on significance testing is a productive idea. The value analysis I have attempted here is only illustrative. There is no easy way to estimate the values of the four test outcomes reliably and validly. Moreover, these values are bound to change over people and occasions. This is an argument against making them a regular feature of significance testing. What remains is the conclusion that the current settings of alpha = .05 and beta = .2 are heuristically adequate.

References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22). Redefine statistical significance. Retrieved from osf.io/preprints/psyarxiv/mky9j

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376. doi:10.1038/nrn3475

Pleskac, T. J., & Hertwig, R. (2014). Ecologically rational choice and the structure of the environment. Journal of Experimental Psychology: General, 143, 2000–2019.

The power estimates were obtained using this website.

The photo shows part of the front page of MaxPlanckResearch magazine, issue 2, 2017.
