Once More With Feeling
On the importance of the so-called replicability crisis and what it means
Posted Sep 15, 2015
Crisis? What Crisis?
You stride into the casino as if you own the joint (if it helps, envisage yourself as either Eva Green or Sean Connery dressed appropriately for the role). You stroll over to the roulette wheel and get a feeling for the future. You casually put a big chip down on one of the 36 bettable spaces…the wheel spins…the ball settles…you’ve won! What are the odds?
(Slightly under 3%, actually).
Gathering up your winnings, you nonchalantly saunter to the craps table, effortlessly throwing a hard eight—double fours with two dice. What are the chances of that?
(Again, about 3%).
There’s a blackjack table, and you try your luck there. Not bothering with the complications of doubling down, or counting cards, you just get a straight pontoon. Crossing the ace over the king you look at the croupier expectantly. Wow. Your luck is really in tonight…
(Is it? How much is it in? About a 2% chance as it happens).
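For anyone who wants to check the casino arithmetic, here is a minimal sketch. The wheel sizes (37 pockets on a European wheel, 38 on an American one) and the single freshly shuffled 52-card deck are my assumptions, not something the story specifies, and the variable names are purely illustrative.

```python
from fractions import Fraction

# Single-number roulette bet. Assumed wheel sizes: 37 pockets (European), 38 (American).
roulette_european = Fraction(1, 37)
roulette_american = Fraction(1, 38)

# Hard eight: both dice show a four, one outcome out of 6 * 6 = 36.
hard_eight = Fraction(1, 6) * Fraction(1, 6)

# Two-card 21, assuming one freshly shuffled 52-card deck.
# Ace plus a king specifically, in either order.
ace_king = 2 * Fraction(4, 52) * Fraction(4, 51)
# Ace plus any ten-valued card (10, J, Q, K), in either order.
natural_21 = 2 * Fraction(4, 52) * Fraction(16, 51)

odds = [
    ("Roulette, European wheel", roulette_european),
    ("Roulette, American wheel", roulette_american),
    ("Hard eight", hard_eight),
    ("Ace plus king", ace_king),
    ("Ace plus any ten-valued card", natural_21),
]
for label, p in odds:
    print(f"{label}: {float(p):.2%}")
```

Under these assumptions the roulette and craps figures land just under 3%, while the two-card 21 comes out nearer 1% or 5% depending on whether only a king counts or any ten-valued card does. Either way, each event sits in the same "rare but not that rare" band.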
How unlikely is all this to happen? Well, not that unlikely actually: about 2% or 3% in each case. And that’s why replication matters in science. When a result is reported as statistically significant, followed by the arcane symbols (p<.05 or p<.01), all this means is that, if nothing real were going on, chance alone would throw up a result at least this striking (an association or a difference between what you have measured) less than 5% or 1% of the time, respectively*.
Now, and this is very important, this does not mean that we are 95% (or 99% or even 99.999%) sure that the result is true. Science isn’t school. There is no authoritative teacher’s edition of the textbook with the answers in the back. Nature doesn’t give us a teacher’s edition. All we have are the investigative tools we have painstakingly developed over the years. And we haven’t been using those tools on humans for very long, so don’t expect perfect results the first time.
(And while I’m on the subject, peer-review is not a guarantee of truth either. When I get papers to review all I can do is check if the authors have done their due diligence in reading and understanding the previous work in the field, check their sums, suggest alternative hypotheses to discuss, and so on. I don’t have the answers in the back either.)
All it means when scientists report a finding (“p<.05”) is that we think it’s worthy of attention. Like rolling a hard eight, hitting a roulette layout bet, or making pontoon. It’s interesting. It’s exciting. It’s worth having another go, a closer look, an attempt at replicating…
(And by the way, is it a coincidence that the standard levels for reporting statistical significance—around 1%-5%—are about the same as those for exciting single events in gambling games? I wonder…)
Betteridge's law: “Any headline that ends in a question mark can be safely answered with the word no”
This is important because the way that the so-called psychology replication crisis is being reported in certain quarters, you might imagine that a bunch of scientists have been caught with their metaphorical hands in the methodological cookie jar of nature. Leaving aside the frauds for the moment (and they do exist) this simply isn’t true. (1)
Brian Nosek and his team have completed a heroic replication of a hundred important studies in various areas of psychology, mainly social and cognitive psych. (2) They found that 36% of the replications showed statistical significance (in other words, results worthy of attention) this time around. The details of how this has been done are interesting and should be studied by students of the subject but are probably a bit arcane for the casual reader. Still, the transparency and rigor of this replication is to be applauded. Not just applauded: emulated in other scientific fields. And, I don’t want to scare anybody, but replication rates in the basic science underlying cancer research have been reported as low as 25% (or even 11%) by some authorities (3). Jerry Coyne recently estimated the replication rate in evolutionary biology to be about 50% (4).
And when it comes to physics, and the proposed entities that didn’t survive close scrutiny, a recent commenter on physics.org listed the following:
SUSY, LGQ and stringy theories, extra-dimensions, scalar field, quintessence, mirror matter, quantum gravitation, axions, dilatons, inflatons, heavy and dark photons, leptoquarks, dark atoms, fat strings and gravitons, magnetic monopoles and anapoles, sterile neutrinos, colorons, fractionally charged particles, chameleon particles, dark fluid and dark baryons, fotinos, gluinos, gauginos, gravitinos and sparticles and WIMPs, SIMPs, MACHOs, RAMBOs, DAEMONs, Randall-Sundrum 5-D phenomena (dark gravitons, K-K gluons and micro black holes).
And the research here (often by banging sub-atomic particles into one another at near light speeds in the Large Hadron Collider) can cost a fortune.
And science isn’t “broken” either
And that’s fine too. Money seriously well spent in a species that wants to understand itself and its universe. Bold conjectures, conclusive refutations, as Popper said science should be. We forget this history because, like the credulous client of a psychic cold reader, we only remember the hits (“is there someone here with a name beginning with a ‘J’?”) and not the misses (“Oh, was your father not run over by an egg lorry, then?”).
Is it really the case that all the music in the past was better? Or is it more likely that we remember the good stuff and forget the rubbish, while the vast morass of art that wasn’t up to the mark simply fades from memory? Scientific method is about the gradual sifting of the hits from the misses, and it takes a while, and if we knew what we would find before we started looking, it wouldn’t be research, would it now?
Incidentally, this gradual building up of a picture of the way the world works (“theory is what tells you which questions to ask the data”) is one of the reasons why it’s disturbing that an ex-presidential candidate and neurosurgeon doesn’t understand evolution. It’s not a case of believing in evolution or not. Nature doesn’t give two hoots if you believe in her or not. But if you think that evolution is just a hunch that is seriously up for question then you simply don’t get how science works. Or, as Pope John Paul II put it: “the theory of evolution… (is) more than a hypothesis… The convergence, neither sought nor fabricated, of the results of work that was conducted independently is in itself a significant argument in favor of the theory.”
Precisely. New findings are like pieces of a jigsaw: we turn them this way and that, some pieces turn out to be from another jigsaw…or we were holding them upside down…or we thought they were a bit of sky when in fact they were a bit of sea…and so on.
There isn’t a “scandal” either
But what do these replication rates mean? Do they mean that the initial studies were fraudulent or badly done, or that somehow the whole enterprise was worthless? Not at all: this is how science works. If the studies could not even be re-run, this might mean that the methods were not transparent, and that would be a real worry.
The first paper my Ph.D. supervisor gave me to read was Ioannidis’ seminal 2005 demonstration of why most published findings are false. I pass the same paper on to all my students. (5) Since then the work of sifting has been attracting attention, hence the Nosek et al. study. So why don't all studies replicate? Dozens of reasons. From the Ioannidis abstract:
“In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser pre-selection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance.”
Some of these are normal technical issues. A lot of studies, even if they are statistically significant (i.e. unlikely to be random noise), are not actually important because the measured effect is tiny. Remember the latest tabloid headline that says that coffee (or wine, or chocolate, or puppies, or lately bacon) doubles your chance of cancer? Frightening (and very likely statistically significant). But what if it doubles it from a 1 in 100,000 to a 1 in 50,000 chance? How frightening is that really? (Given that the lifetime chance of dying in a car crash in the UK is 1 in 240 and that won’t stop you driving, will it?) (6).
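The gap between a scary relative risk and a tiny absolute risk is easy to put into numbers. The sketch below uses only the figures quoted above; the variable names are mine, and the comparison with car crashes is as rough as it is in the text.

```python
from fractions import Fraction

baseline_risk = Fraction(1, 100_000)   # risk before the "doubling"
doubled_risk = Fraction(1, 50_000)     # risk after the "doubling"
car_crash_risk = Fraction(1, 240)      # UK lifetime car-crash figure quoted above

# Relative risk: the headline number.
relative_risk = doubled_risk / baseline_risk

# Absolute increase: the number that matters to any one person.
absolute_increase = doubled_risk - baseline_risk
extra_cases_per_100k = absolute_increase * 100_000

# How much bigger the quoted car-crash risk is than the doubled risk.
crash_vs_doubled = car_crash_risk / doubled_risk

print(f"Relative risk: {float(relative_risk):.1f}x")
print(f"Extra cases per 100,000 people: {float(extra_cases_per_100k):.0f}")
print(f"Car-crash risk is about {float(crash_vs_doubled):.0f} times the doubled risk")
```

A doubling here means one extra case per 100,000 people, and the quoted car-crash risk is still roughly two hundred times larger.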
Let’s say you are an early human and you are testing out the power of rocks and you do this by chucking them at your neighbor’s head and measuring the bumps that result. A test of significance will tell you if the rocks work, but only by measuring the size of the bumps will you know what effect you had.
This brings up the issue of effect sizes and power. Power is the likelihood that, if the effect tested for is real, your test will be able to detect it. This in turn is determined by the type of test, the number of people tested and the effect size. And the effect size (what actually happens when you change X) is itself determined by the reliability of the change and the size of the change.
What this means in practice is that if the thing you are testing for has a large effect, to detect it at the p < .05 significance level you need a sample of 60 (30 per group if testing and retesting the same people), and for things that create only a small effect this climbs to a large (for many psychology tests) sample of 800. And most psychology studies (and don’t even get me started on medical tests) lack this sort of power. For more details Tom Stafford has an excellent discussion of this issue. (7)
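To see roughly where sample sizes like these come from, here is a back-of-envelope sketch using the standard normal approximation for comparing two independent groups. The 80% power target, the two-sided test, and the conventional "large" (d = 0.8) and "small" (d = 0.2) effect sizes are my assumptions; the text does not say which values lie behind its figures, and an exact t-test calculation gives slightly larger numbers. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate people needed per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    n = n_per_group(d)
    print(f"{label} effect (d = {d}): about {n} per group, {2 * n} in total")
```

Under these assumptions a large effect needs a few dozen people in total and a small one needs the best part of 800, which is the same ballpark as the figures above. Halving the effect size roughly quadruples the sample you need.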
How to solve all this? It can get quite technical. Beyond the clichéd “more research is needed in this area” and “larger sample sizes are required” there have been proposals for (and some brief comments on each from me and others):
1) Using different probability measures that take into account prior assumptions. This is called Bayesian inference. (This probably works sometimes but we should be cautious of too swiftly abandoning uniform frequentist models that allow for meaningful comparisons of studies);
2) Publishing null findings. (This probably would be smart in some cases but it’s tough to distinguish failure to replicate from bad design);
3) Discouraging data mining: given any large data set, all sorts of spurious links can be found between things. (Making research more theory driven will help here. See below for some details. In the meantime enjoy this XKCD spoof);
4) Using better statistics such as Bonferroni corrections (which correct for figuratively rolling the dice multiple times), and publishing of overall effect sizes, and use of power measures. (Always a good idea; this will certainly improve matters but it's not the whole story);
5) Discouraging salami science, the practice of publishing small studies (to increase publication output) rather than large bodies of cumulative work with internal replication. (This is a great idea but requires a revision of the models we currently use to publish and assess impact of our science. This probably means extending the publication bases so that more stuff can get out there more cheaply so that nerds like myself and my colleagues can kick it around);
6) Pre-registering of experiments (so a study gets published on its method, not its findings) and being open in everything, such as code and data sets. (Very likely helpful, it's already being done, and science isn’t engineering where intellectual property must be protected. On the other hand, surgery clearly is more like engineering, which is why Ben Carson can be an excellent surgeon and know nothing about science).
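The "rolling the dice multiple times" problem in proposals 3 and 4 can be made concrete with a few lines. This is a sketch that assumes the tests are independent and that every null hypothesis is true; the function names are mine.

```python
def familywise_error(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / num_tests


for m in (1, 5, 20, 100):
    uncorrected = familywise_error(m)
    corrected = familywise_error(m, bonferroni_threshold(m))
    print(f"{m:>3} tests: {uncorrected:.1%} uncorrected, {corrected:.1%} with Bonferroni")
```

Run twenty tests on pure noise at p < .05 and you have roughly a two-in-three chance of at least one "significant" finding. Dividing the threshold by the number of tests pulls that back under 5%, at the price of making each individual test harder to pass.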
But probably the most important thing that will come out of all this is what I started out by discussing—namely what is a scientific theory really? A theory is not a hunch like the clever detective going “I have a theory that the murderer was a tall man, with a wooden leg called Gerald.”
A theory is a large scale framework into which the data are fitted and a picture builds: a program that generates research and suggests further questions. (9) As Steven Pinker recently put it in relation to this so-called crisis, many of the findings were cute but atheoretical, in that they were not driven by computational, evolutionary, optimality or neurobiological priors.
He’s being kind. What he means, at least in some social psychology, is that there is a hand-waving “standard social science” model, often characterised by folk saying that such and such a behavior is “socially constructed” without any attention paid to what it is socially constructed from. Reifying “the culture” as if it floated free from the human brains that make decisions that result in the aggregate of preferences, behaviors and decisions that we call a culture can result in some odd and unsupportable claims. Freeing behavioral science from hidden ideological constraints won't hurt either. (10) Humans are notoriously bad at separating facts and values and one reason we need diversity in the behavioral sciences is to help to correct for this.
Not only do values threaten to intrude into behavioral science, it's worse than that. Science is not common sense and this is obvious once you realize that even high school physics (which was only discovered in the last four hundred years) is wildly counter-intuitive. (8) For example, heavy objects, despite expectation and observation, don’t fall faster than light ones…But there is a difference in psychology, especially social psychology, in that sometimes what science tells us is not just weird but genuinely unwelcome.
Sometimes, for example, findings in social psychology tell us that ordinary people can be manipulated into committing atrocities and that this doesn’t require brainwashing: a light rinse will do the job. And the converse is also true: it turns out that improving society is going to require a little bit more than changing the words we use to describe it (as Dr Primestein hilariously satirises (11)) or just removing all the references to bad things in the environment (as those who think that banning violent videos would wish). And then there's that whole pesky confusing values with facts thing I mentioned above...
We don’t want to believe these uncomfortable things so we have a tendency to torture the data until it tells us the comforting story that we want to hear. It shouldn’t surprise us if these “confessions” turn out to be unsafe.
Recently Dan Gilbert et al. have published an analysis of the Nosek replications. It looks like a number of the replications used very different populations (such as asking Italian students rather than American ones about prejudicial attitudes towards African Americans). Needless to say, it's only a replication if you use relevantly similar variables. (12)
For anyone who remembers the "power posing" TED talk and subsequent so-called replication crisis, the following is an example of exactly what I am talking about: the original author showing that the evidence is against it and moving on. It would help if people didn't treat science as a zero-sum game with winners and losers. The winners are often those who admit when something was tried and found to be false. As here.
To repeat: trying something and finding that it was a dead end is itself a contribution to the sum of human knowledge.
Third Update (18/5/2017)
This piece in Slate, "Daryl Bem Proved ESP Is Real Which means science is broken", is actually a much better article than the title would lead one to believe. Up to a point. It still does that journalistic thing of insisting that science is about winners and losers. Needless to say, science isn't broken. In 2010, Daryl Bem, a prestigious psychologist, revealed that for the last ten years he had been conducting experiments into "Psi", those fringe areas of psychology known as parapsychology. Things like telepathy, clairvoyance and so on are not mainstream. Bem appeared to be showing, with every appearance of rigor, that some degree of precognition occurs in humans. No one (least of all me) is suggesting that Bem fabricated his results. Maybe it's an artefact of his design. More likely he somehow fell foul of that most subtle of dangers in science: self-fooling. Feynman himself put it perfectly when he said "The first principle is that you must not fool yourself, and you are the easiest person to fool". How did he do this? We may never know. But more importantly, he made all his methods replicable by others. And no one has been able to replicate his results. If they could, we'd have to revise our beliefs. But they can't. Replication is a cornerstone of science for just this reason. Once again, this is not science being "broken". This is the scientific process working exactly as it should.
1) Science in crisis? http://phys.org/news/2013-09-science-crisis.html#jCp
2) Brian Nosek and the huge team involved in OSF reproducibility project https://osf.io/ezcuj/
3) Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10(9), 712.
4) Jerry Coyne on the 50% reproducibility in evolutionary biology https://whyevolutionistrue.wordpress.com/2015/09/03/on-the-irreproducibi...
5) Ioannidis, J. P. (2005). Why most published research findings are false. Chance, 18(4), 40-47.
6) Chances of dying in various accidents and diseases http://www.medicine.ox.ac.uk/bandolier/booth/Risk/trasnsportpop.html
7) Tom Stafford on power and effect sizes http://www.tomstafford.staff.shef.ac.uk/?p=335
8) Physics and common sense http://www.newrepublic.com/article/118655/theoretical-phyisicist-explain...
9) Students of the philosophy of science will recognise that this is how Lakatos described how science works, and that's quite deliberate on my part because Lakatos was right. http://www.jstor.org/stable/495757?seq=1#page_scan_tab_contents
What's wrong with pseudo-science programs like the theory of intelligent design (for example) is not so much that they are false (although they are); it's that they generate nothing to test.
10) Jon Haidt et al on the importance of diversity in behavioral science http://heterodoxacademy.org/2015/09/14/bbs-paper-on-lack-of-political-di...
11) Dr Primestein’s scurrilous and hilarious take-downs of priming excesses can be found here http://www.psi-chology.com/anti-priming-tin-foil-hat/?utm_content=buffer...
13) Grafen, A., & Hails, R. (2002). Modern statistics for the life sciences (Vol. 123). Oxford: Oxford University Press. Highly recommended in explaining how the general linear model works and dispelling confusion.
* If you want you could express this as the reason for rejecting the null hypothesis (H0) if you are a follower of Ronald Fisher (the "F" in Fisher stands for "F" test), or you could use the number to distinguish between acceptance of the alternative hypothesis on the basis of the relative error rates (if a follower of Pearson). Or, you could use the hybrid method widely expressed in stats textbooks which manages to attract the ire of both groups.** Needless to say the proponents of each regard the others as heretics and deviants and all of them are correct in this. Splitters!
** Apparently this footnote wasn't enough for some people so I have footnoted the footnote. The p-value is a conditional probability. "It tells us the probability of obtaining that value of that test statistic or a more extreme value given that the null hypothesis is true" (Grafen & Hails, 2002, p. 334, italics in original). *** If folk want to quibble with me whether this way of putting it leads to systematic distortions and scads of Type I and Type II errors then please go ahead. The probability of my giving a toss (based on Bayesian analysis of previous quibbles) is about .05.
*** A picture is worth a thousand words they say. So here is one (after Grafen & Hails 2002, above). I realize that footnoting an already footnoted footnote might seem like overkill...but I'll run the risk
A & B are two sets. Helpfully labelled "A" and "B". The area around is the rest of the universe (helpfully labelled "U"). In this world A is the set of all circumstances (experiments) where the null hypothesis is true. B is the set of all circumstances (experiments) where the null hypothesis is rejected. No experiment can reveal to us the size of either A or B and, in an ideal universe, A & B would have no intersection. Unfortunately, we do not live in that universe so the area of intersection is labelled with appropriate darkness...however, what we can do is adjust the size of that dark region to any size, any size that is except nothing. It can't be 0. To err is human. This is why when students helpfully report a result where "p = 0.000", which is something SPSS can throw out, they make their tutors throw up.
So...with all this in mind, the p value that we set is the probability that we are in set B (yay!) while also being in set A (boo!). This is technically known as a Type I error ("Type I, the study's a con"). And, just as we do in other areas of human endeavor, we accept that this will happen some of the time. How much of the time? Five per cent of the time (as industry standard). Furthermore, this means that statements like
1) There is a 5% chance that the null hypothesis is true or
2) There is a 5% chance that there is no difference between condition X and condition Y
While sounding similar, are technically not the same. To say (1) or (2) would be to say that the size of A is .05. And, we can't say that because we can't know what size A is. We also can't really say that, having concluded that our results are significant (p<.05) this means that 5% of the time H0 (the null hypothesis) is true. This would only be true if A and B are exactly the same size, and we can't know that. Statistical analyses are very ingenious but they can only tell you so much.
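A small sketch makes the point about not knowing the size of A. If we pretend we do know it (the share of tested hypotheses where the null is true), and also assume a figure for power, we can work out what fraction of significant results are false alarms. Both inputs are assumptions for illustration only, as is the function name; nothing here can be read off from a single study.

```python
def prob_null_given_significant(share_null_true, alpha=0.05, power=0.80):
    """Fraction of 'significant' results for which the null is actually true.

    share_null_true plays the role of the (unknowable) size of set A.
    """
    false_alarms = alpha * share_null_true        # in A and in B
    real_hits = power * (1 - share_null_true)     # in B but not in A
    return false_alarms / (false_alarms + real_hits)


for share in (0.5, 0.9, 0.99):
    p = prob_null_given_significant(share)
    print(f"If the null is true in {share:.0%} of tested cases, "
          f"{p:.0%} of significant results are false alarms")
```

With the same 5% threshold, the share of false alarms among significant results runs from about 6% to well over 80%, depending entirely on the assumed size of A. That is why "p<.05" on its own cannot be turned into "a 5% chance that the null hypothesis is true".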