The Statsman Always Rings Twice

Take another look at the pattern of results.

Posted Sep 14, 2018

Matej Kastelic/Shutterstock

Those of you paying some attention to the scene in the science of psychology know that another specter is roaming the streets, and its name is Failure to Replicate. Most findings, watchdogs and vigilantes tell us, are false, in psychology, medicine, and Lord knows where else. The reasons are many, but near the top of the list we find human shenanigans. Collectively, these shenanigans are known by the epithet of p-hacking. The ‘p’ stands for the p-value you harvest from statistical significance testing, and the ‘hacking’ refers to a suite of (self-)deceptive practices that depress these p-values below the conventional .05 threshold so that investigators may declare a result significant in the sense that the null hypothesis of noise renders the obtained data unlikely.

If we contemplate a single study with a p-value of, say, .03, we cannot, from this result alone, conclude that it was hacked. We’d need some information about how the researchers went about their business, or we need the results of replication studies to look for revealing patterns. If there is one attempt at replication and it yields p = .07, it would be as foolhardy to declare the original finding void as it would be to declare victory over the null hypothesis after the first study alone. More data is (as they write these days) needed.

Suppose we have multiple replication studies. Now the plot thickens. We can look at the distribution of p-values and deploy the tools of p-curve analysis (Simonsohn, Nelson, & Simmons, 2014). The basic idea is that under any set of rational assumptions, the frequency distribution of the p-values may be skewed, but it would be unimodal. There shouldn’t be any local peaks, and there should not be a particular peak in the sweet area between .05 and .01, the area that both yields significance and saves resources. This local peak would be suspicious because we know that the distribution of the p-value is flat (uniform) under a true null hypothesis and increasingly skewed (with more small p-values) under a false null hypothesis (Krueger & Heck, 2018).
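To see why that local peak would be suspicious, here is a minimal simulation sketch (my own illustration, not code from the p-curve papers), using one-sample z-tests with known variance: a true null yields a flat p-curve, while a real effect piles the p-values up near zero rather than in the .05-to-.01 band.

```python
import random
from statistics import NormalDist

def simulate_p_values(mu, n, nsim, seed=1):
    """Two-sided p-values from one-sample z-tests (known sigma = 1)."""
    rng = random.Random(seed)
    nd = NormalDist()
    ps = []
    for _ in range(nsim):
        xbar = sum(rng.gauss(mu, 1) for _ in range(n)) / n
        z = abs(xbar) * n ** 0.5          # z statistic, sigma = 1
        ps.append(2 * (1 - nd.cdf(z)))    # two-sided p-value
    return ps

null_ps = simulate_p_values(mu=0.0, n=25, nsim=2000)  # true null: flat p-curve
alt_ps = simulate_p_values(mu=0.5, n=25, nsim=2000)   # false null: right-skewed

def frac(ps, lo, hi):
    """Fraction of p-values falling in the half-open interval [lo, hi)."""
    return sum(lo <= p < hi for p in ps) / len(ps)

print(frac(null_ps, .00, .01), frac(null_ps, .04, .05))  # roughly equal
print(frac(alt_ps, .00, .01), frac(alt_ps, .04, .05))    # far more tiny p-values
```

Under the true null, every bin of p-values is equally populated; under the false null, the bin below .01 dwarfs the bin between .04 and .05. A cluster of results sitting mostly between .05 and .01 fits neither pattern.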

P-curve analysis does not exploit all of the available information, however. Looking over a set of studies, we also have—or can compute—information on sample size (or degrees of freedom) and effect size. Over studies, the intercorrelations among p-values, sample size (df), and effect size (ES) can be revealing, or at least they can—as contemporary pundits love to say—“raise questions.”

To illustrate the potential of this kind of approach [and it may not be novel], I use data from a publication by Lasaleta et al. (2014), again, not to impugn the authors, but to try out a kind of statistical pattern analysis. The authors sought to test the interesting hypothesis that being in a nostalgic frame of mind reduces the need for and appreciation of money. In six studies, they found that nostalgia increases the willingness to pay for products, increases generosity in a dictator game, reduces the perceived importance of money, reduces the perceived value of money, increases the willingness to endure aversive stimuli for a given amount of money, and reduces the perceived size of certain coins. The six p-values are .031, .020, .045, .027, .062, and .026. Notice the clustering in the sweet area between .05 and .01, with one tolerable exception. This provides only weak grounds for worry, because the authors might have predicted a medium effect size throughout, done a power analysis, and collected the advisable sample (but they do not report doing any of this). The effect sizes are .55, .48, .46, .48, .37, and .63. They are medium (d, the ratio of the difference between the means over the within-group standard deviation, is around .5). But there is also variation in the df (sample size), namely 67, 125, 81, 98, 102, and 56.
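What "the advisable sample" would be can be sketched with the usual normal approximation for a two-group comparison; the 80% power and .05 alpha used here are conventional defaults, not figures taken from the paper.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-group comparison of effect size d,
    via the normal approximation n = 2 * ((z_alpha/2 + z_power) / d) ** 2."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_power = nd.inv_cdf(power)          # 0.84 for power = .80
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.5))  # a medium effect needs roughly 63 per group
```

A researcher who predicted d = .5 throughout would thus plan roughly the same sample for every study, which is one way to read the clustering of the p-values.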

Now we can intercorrelate p, df, and ES, and ask if the results “raise questions.” Here’s what we get: First, the correlation between p-values and ES, r(p,ES), is -.71. Larger effect sizes go with smaller p-values. This is what we’d expect if we had predicted the same medium effect for all six studies, resulting in the same power analysis and the same df. Then ES, not being perfectly identical over studies, would correlate negatively with p. Second, the correlation between sample size (df) and effect size (ES), r(df,ES), is -.68. Larger ES go with smaller samples. This is what we’d expect if differences in ES had been predicted, and power analyses had yielded different recommendations for sample size. So we have one correlation, r(p,ES), that makes sense if a constant, medium ES had been predicted so that df could be constant. And we have another correlation, r(df,ES), that makes sense if variation in ES had been predicted so that small samples would suffice for large expected effects. It is one or the other, not both.

Having two conflicting correlations “raises questions” about the third, the correlation between df and p. We find that r(df,p) = .03. Larger samples may yield the same p values (on average) as small samples do if the differences in ES had been predicted, and power analyses had yielded different sample sizes. In other words, accurate power predictions shrink the range of the obtained p values and decouple them from df.

To review, ES is negatively correlated with both p and df. That is, as effect size gets larger, both p-values and sample sizes become smaller. This is the conflicting result. Again, we can imagine how, as ES gets larger, p gets smaller without a change in df. And we can imagine how, as ES gets larger, df gets smaller without much change in p. But we cannot imagine both at the same time. We can now ask what kind of correlation between p and df we are entitled to expect if there were no differences in ES correlating negatively with p and with df. The partial correlation between p and df, controlling for ES, is -.89. So once variation in ES is held constant, larger samples yield lower p values. This did not happen here, and it raises the question: Why is there considerable variation in df with the result that df is unrelated to p?

An alternative analysis

Responding to this essay, Uli Schimmack proposed this analysis:

The Test of Insufficient Variance is the most powerful test of publication bias (or some other fishy QRPs).

Step 1
Convert the p-values into z-scores, using z = -qnorm(p/2).

p = c(.031, .020, .045, .027, .062, .026)
z = -qnorm(p/2)
[1] 2.157073 2.326348 2.004654 2.211518 1.866296 2.226212

Step 2
Compute the variance of the z-scores.

var.z = var(z)
[1] 0.02808286

Step 3
Compare the observed variance to the expected variance (the z-scores of independent studies have a standard deviation of 1), using pchisq(var.z*(k-1), k-1), with k = number of p-values (here, 6).

> pchisq(var.z*5, 5)
[1] 0.0003738066
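For readers without R, the three steps above can be reproduced with the Python standard library; since the standard library has no chi-square CDF, the sketch below computes it with a short regularized incomplete-gamma series.

```python
from math import exp, lgamma, log
from statistics import NormalDist, variance

def chi2_cdf(x, k, terms=200):
    """Lower-tail chi-square CDF via the regularized incomplete gamma series."""
    a, half_x = k / 2, x / 2
    term = 1 / a
    total = term
    for n in range(1, terms):
        term *= half_x / (a + n)
        total += term
    return total * exp(-half_x + a * log(half_x) - lgamma(a))

p = [.031, .020, .045, .027, .062, .026]
z = [-NormalDist().inv_cdf(pi / 2) for pi in p]  # Step 1: p to z
var_z = variance(z)                              # Step 2: sample variance
k = len(p)
p_tiv = chi2_cdf(var_z * (k - 1), k - 1)         # Step 3: compare to var = 1
print(round(var_z, 5), round(p_tiv, 6))
```

Note that `statistics.variance` is the sample variance (dividing by k - 1), matching R's `var`, so the numbers agree with the R output above.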

Conclusion: The probability that the p-values stem from a set of independent studies is very small, p = .0004.

Fisher observed long ago, “[t]he political principle that anything can be proved by statistics arises from the practice of presenting only a selected subset of the data available” (Fisher, 1955, p. 75) [thanks to Deborah Mayo for the quote].


Krueger, J. I., & Heck, P. R. (2018). Testing significance testing. Collabra: Psychology, 4(1), 11.

Lasaleta, J. D., Sedikides, C., & Vohs, K. D. (2014). Nostalgia weakens the desire for money. Journal of Consumer Research, 41, 713-729.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534-547.