Another look at the so-called replicability crisis

Skeptical notes on recent offerings in Psychological Science

And if I by Beelzebub cast out devils, by whom do your children cast them out? therefore they shall be your judges.

~ Matthew 12:27

After yesterday’s argument that failures to see non-random reality, or Type II errors, are ignored at one’s own peril, today’s question is how badly the psychological literature is muddied by the inverse error, aptly called Type I, or the failure to not see what is not there. Psychologists, and the statisticians who keep them honest, worry about this type of failure. Paul Meehl famously told his students that half of what he told them was crap (i.e., not true); he just did not know which half. Ioannidis (e.g., 2005), trying to keep medical researchers honest, claims that over half of the published results are crap. Psychologists have heard his voice and now they worry – again.

In the November 2012 issue of Perspectives on Psychological Science, which is fully dedicated to the replicability crisis, Bakker and colleagues claim that the crap factor in the literature is grave (Meehl preferred the term crud factor). If almost all published papers report significant results, while the median statistical power is barely half that, there’s got to be massive publication bias in the form of inflated claims concerning things that are not there. Psychological science is, in other words, an exercise in collective hallucination.

Bakker et al. characterize the conduct of psychological science as a game, in which players, ah, scientists, have ample room to fudge, feign, and finagle, and to thereby fool themselves (and the rest of us). Their prime motivation is to score points by publishing papers, and to do so they need statistically significant results, for NHST (Null Hypothesis Significance Testing) is the creed of the land. Statistical significance is obtained when the probability of the evidence, or of evidence more extreme, under the null hypothesis of nothingness is .05 or less. Little p has a sampling distribution and it turns out that the distribution of published p values is skewed such that there is a hump right at and below .05. This is suspicious.

Bakker et al. explore this point by looking at funnel plots from published meta-analyses. A meta-analysis is a study of studies, which allows the analyst to compute a weighted mean of all the reported effect sizes. A funnel plot is a graphical display of these effect sizes as a function of statistical power (more precisely: the square root of N-3). The plot shows an area within which an individual empirical effect size would not be significant if the null were true. This area looks like a cone, wide at the bottom where N is small and narrow at the top where N is large. There is evidence of publication bias if the distribution of published effect sizes is asymmetrical around the meta-analytic effect size, particularly if too many effect sizes lie outside the cone of non-significance.

Reviewing a good number of meta-analyses, Bakker et al. find evidence for publication bias, but the incidence is not enough to explain the grand significance-power paradox that motivated their article. Only about a third of the meta-analyses appeared to be based on biased input. In an ironic twist, Bakker et al. use a chi-square significance test (Beelzebub) to detect and drive out the false positive devils. And who will judge the chi-square?
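The cone of non-significance can be sketched in a few lines. The sketch below assumes that the effect sizes are correlations analyzed with Fisher's z transformation, whose standard error is 1 over the square root of N-3; that reading fits the "square root of N-3" mentioned above, but it is an assumption here, not a description of Bakker et al.'s own code. The names `critical_r` and `inside_cone` are made up for illustration.

```python
import math

Z_CRIT = 1.96  # two-sided .05 criterion under the normal approximation


def critical_r(n):
    """Smallest |r| that reaches two-sided p < .05 with n observations.

    Assumes Fisher's z: atanh(r) is roughly normal with SE 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("Fisher's z needs n > 3")
    return math.tanh(Z_CRIT / math.sqrt(n - 3))


def inside_cone(r, n):
    """True if a correlation r from n cases would not be significant under the null."""
    return abs(r) < critical_r(n)


# The boundary narrows as N grows: wide at the bottom, narrow at the top.
for n in (20, 50, 100, 400):
    print(f"N = {n:>3}: |r| below {critical_r(n):.3f} is not significant")
```

A correlation of .30 sits inside the cone with 20 cases and outside it with 100, which is all the funnel shape says.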

A funnel plot can reveal a collective illusion, and to see if it does you only need to look at it. If there is at least one very large study and its effect size (or the effect sizes of several large studies) is close to zero, while most of the small-scale studies are significant, the case for publication bias is strong. One can also regress the effect sizes on the square root of N. If the line goes through zero, the significant effects obtained in small studies are not credible. Bakker’s selection of meta-analyses does not contain a case of this type. Some meta-analyses show an unbiased literature and some of these end up with a grand effect size near zero.

Funnel plot from Bakker et al. 2012

A prototypical case for worry is the meta-analysis Bakker et al. selected to illustrate their argument (see picture). There are 12 studies, all have significant results, and the effect sizes are negatively correlated with sample size. In other words, the effect sizes of the small studies somewhat inflated the weighted average effect size.
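A small simulation shows how selective publication alone can produce this pattern. It is a sketch under loud assumptions: a hypothetical true effect of 0.2 standard deviations, a known standard deviation of 1, a one-sided z-test, and a rule that only significant studies get published. None of these values come from Bakker et al.; they are chosen for illustration.

```python
import math
import random

TRUE_EFFECT = 0.2                      # hypothetical true effect, in SD units
Z_CRIT = 1.645                         # one-sided .05 criterion
SAMPLE_SIZES = (20, 40, 80, 160, 320)  # hypothetical study sizes


def published_effects(n, n_studies, rng):
    """Simulate n_studies of size n and keep only the significant effects."""
    se = 1.0 / math.sqrt(n)  # standard error of the mean, SD assumed to be 1
    kept = []
    for _ in range(n_studies):
        observed = rng.gauss(TRUE_EFFECT, se)  # sampling distribution of the mean
        if observed / se > Z_CRIT:             # the "publication" filter
            kept.append(observed)
    return kept


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(seed=1, n_studies=2000):
    """Mean published effect size for each sample size."""
    rng = random.Random(seed)
    means = {}
    for n in SAMPLE_SIZES:
        kept = published_effects(n, n_studies, rng)
        means[n] = sum(kept) / len(kept)
    return means


means = simulate()
slope = ols_slope([math.sqrt(n) for n in means], list(means.values()))
for n, m in means.items():
    print(f"N = {n:>3}: mean published effect = {m:.3f}")
print(f"slope of effect size on sqrt(N): {slope:.4f}")
```

In this toy world every published result is significant, the small studies overshoot the true effect, and the large ones land near it, so effect size falls as N rises even though the effect is real.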

How bad is it? I think Bakker may have overlooked something, namely the order in which the studies were done. It is reasonable to assume that in an area of research, small studies are initially conducted in an effort to poke around, find suitable study protocols and boundary conditions. Many studies fail and results are not reported. Significant results are published and encourage greater investments and large-scale studies. If those succeed, you get what Bakker reports in many examples, namely regression lines against the square root of N that do not run through the zero point. The reverse would seem nutty: early large-scale studies, followed by many small ones. Pursuing this strategy would be an inefficient way to play the game.

It seems to me, then, that the demonstrable publication bias in the scientific literature is a necessary byproduct of a rational strategy to search for non-null truths. Conducting small studies is a type of foraging. These studies explore the potential yield of a field. Large studies can then exploit what’s there – or put an end to that particular line of research.

When individual studies are planned and conducted, it is taboo to analyze data sequentially and stop when p = .05. One would be capitalizing on chance and contributing to the epidemic of false positives. The same logic cannot apply to meta-analysis [a fact that questions the logic of that logic]. There is no one who can plan or dictate how many studies ought to be conducted on a particular topic before a meta-analysis may be run to cull the truth from the field of effect sizes and p values. From a Bayesian point of view, each individual study contributes evidence that can be integrated with past knowledge to update the estimate of the probability of the null hypothesis being false. The same logic applies to individual data points as they pile up in a study.
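The reason for the taboo is easy to simulate. The sketch below assumes a true null, normally distributed data with a known standard deviation of 1, and a z-test run after every observation from the 10th to the 50th; these settings and the function names are illustrative choices, not anything prescribed in the literature discussed here.

```python
import math
import random

ALPHA = 0.05


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def study_is_significant(rng, n_min, n_max, peek):
    """Run one study under a true null; optionally test after every observation."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)  # the null is true: mean 0, known SD 1
        # With SD 1, z = mean / (1 / sqrt(n)) = total / sqrt(n).
        if peek and n >= n_min and two_sided_p(total / math.sqrt(n)) <= ALPHA:
            return True  # stop as soon as p dips to .05
    return two_sided_p(total / math.sqrt(n_max)) <= ALPHA


def false_positive_rate(peek, n_studies=5000, n_min=10, n_max=50, seed=7):
    """Share of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, n_min, n_max, peek) for _ in range(n_studies))
    return hits / n_studies


print("fixed N:      ", false_positive_rate(peek=False))
print("stop at p<.05:", false_positive_rate(peek=True))
```

With a fixed N the false positive rate stays near the nominal 5 percent; with peeking it climbs well above that, which is what "capitalizing on chance" means in practice.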

What then is a successful replication? The orthodoxy is to pray at the altar of the p value. Peter runs a study and finds p = .05. He rejects the null hypothesis and declares that there’s a there there (i.e., not nothing). Paul (or Peter himself) runs an exact replication (same protocol and sample size) and finds p > .05. He concludes that the attempt at replication was unsuccessful. Over time, Peter, Paul, and the rest of the field look back on a store of (un)successful replications. Meta-analyses eventually cut through this morass by aggregating the evidence, estimating grand effect sizes, and if the analysts are so inclined, they ask if that effect size is significantly different from zero. Once the meta-analytic effect size is on record, one could go back to the individual studies and ask if their effects differ significantly from that value instead of from zero. Now significance testing is used in its strong sense, as Paul Meehl (1978) demanded long ago. A significant result would suggest that the study is an outlier relative to the cumulative scientific record. In other words, NHST would be used to reject data, not hypotheses (Krueger, 2007).
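Here is one way the strong use of significance testing might look in code. It is a sketch with invented numbers: the meta-analytic correlation `R_META` and the two studies are hypothetical, the effects are assumed to be correlations tested with Fisher's z, and the meta-analytic value is treated as fixed (its own uncertainty is ignored for simplicity).

```python
import math


def fisher_z_test(r, n, r_null=0.0):
    """Two-sided p for H0: rho == r_null, using Fisher's z (normal approximation)."""
    z = (math.atanh(r) - math.atanh(r_null)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


R_META = 0.20  # hypothetical meta-analytic effect size

# Hypothetical individual studies: (observed r, sample size)
studies = {"study A": (0.55, 40), "study B": (0.10, 40)}

for name, (r, n) in studies.items():
    p_zero = fisher_z_test(r, n)          # the usual test against nothingness
    p_meta = fisher_z_test(r, n, R_META)  # the strong test against the record
    print(f"{name}: p vs zero = {p_zero:.3f}, p vs meta-analytic r = {p_meta:.3f}")
```

Under these made-up numbers, study A is significant against zero but also departs from the cumulative record, so it is the data that look suspect; study B fails the orthodox test yet is perfectly compatible with the meta-analytic value.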

Let us close with a small but telling example of how the results of an individual replication study might be treated. For simplicity’s sake, suppose Peter makes 20 observations with a mean of 2, a standard deviation of 5, and a null of null. Using a single-sample one-sided t-test, he finds that p = .045 [sound of cork popping]. Paul runs an exact replication, but gets only a mean of .76 (with the same standard deviation). Now, if Peter and Paul were to pool their data, they would find p = .044. Is it not rational to view a follow-up as a successful replication if it reduces p? Once an initial study is run, a gain range can be calculated that shows the minimum effect size in an exact replication that would be sufficient to decrease p.

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

Krueger, J. I. (2007). Null hypothesis significance testing. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics (Vol. 2, pp. 695-699). Thousand Oaks, CA: Sage.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
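For readers who want to check the closing Peter and Paul example, here is a sketch that recomputes the p values from the summary statistics and searches for the lower end of the gain range. It uses only the standard library, so the t distribution's tail is obtained by numerical integration (Simpson's rule). The helper names and the bisection search are illustrative choices, and the pooling assumes the two samples are simply combined into one data set of 40 observations.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """One-sided p, P(T >= t), via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    area = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


def one_sample_p(mean, sd, n):
    """One-sided p for a single-sample t-test against a null of zero."""
    return t_upper_tail(mean / (sd / math.sqrt(n)), n - 1)


def pool(m1, s1, n1, m2, s2, n2):
    """Mean, SD, and N of two samples combined into one data set."""
    n = n1 + n2
    gm = (n1 * m1 + n2 * m2) / n
    ss = (n1 - 1) * s1**2 + (n2 - 1) * s2**2 + n1 * (m1 - gm) ** 2 + n2 * (m2 - gm) ** 2
    return gm, math.sqrt(ss / (n - 1)), n


def min_replication_mean(m1, sd, n, lo=0.0, hi=None, iters=40):
    """Bisection for the smallest replication mean (same sd and n) that lowers p."""
    hi = m1 if hi is None else hi
    target = one_sample_p(m1, sd, n)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if one_sample_p(*pool(m1, sd, n, mid, sd, n)) < target:
            hi = mid  # pooled p already below Peter's p: try a smaller mean
        else:
            lo = mid
    return hi


p_peter = one_sample_p(2.0, 5.0, 20)
p_pooled = one_sample_p(*pool(2.0, 5.0, 20, 0.76, 5.0, 20))
print(f"Peter alone:  p = {p_peter:.4f}")
print(f"Pooled data:  p = {p_pooled:.4f}")
print(f"Gain range starts near a replication mean of {min_replication_mean(2.0, 5.0, 20):.3f}")
```

The two p values should land close to the .045 and .044 quoted in the example, and the search suggests that Paul's mean of .76 sits just above the minimum needed to reduce p.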