Bem, Bayes, and the limits of statistical inference
Is there psi? Don’t look to stats.
Posted Jan 19, 2011
"I've complained about it, but. . ."
-- House (Laurie),
pre-sponding to D. Bem
"The more extraordinary the event, the greater the need for it to be supported by strong proofs."
-- Pierre Simon Laplace (1814) stating a non-controversial principle of rational inference
When the Bem buzz broke, I doubted that the dispute could be settled by statistical arguments. Recall that in a paper published in the Journal of Personality and Social Psychology (2011), Daryl Bem claims to have found in 9 experiments that events that have not yet happened can causally affect human behavior. My disbelief in precognition, and backward causation in particular, comes rather from the strange implications these claims would have if they were true. Still, I expected that some of Bem's critics would seek to show that his conclusions do not hold up when different sorts of (presumably more advanced) statistical analyses are applied. Two teams of researchers, Jeff Rouder and Rich Morey, and Eric-Jan Wagenmakers and colleagues (JPSP, 2011) have re-analyzed Bem's data from a Bayesian point of view and concluded that the evidence against the null hypothesis of chance (no psi) is weak.
The statistical dispute was mentioned in the New York Times. The writer noted correctly that the fight over how to analyze data in the social sciences has been going on for decades. The Bem-vs.-Bayes episode is the most recent rehashing of familiar arguments. I don't think that this flare-up over statistical orthodoxy will be the last one unless the contestants relinquish some metaphysical (i.e., delusional) ideas.
Let's review what's going on. Bem deploys a simple version of "Null Hypothesis Significance Testing." The null hypothesis is that there is no association between current behavior (e.g., opening one of two boxes) and a future event (a box opened by a mechanical random choice generator). Each subject makes a number of choices, and the percentage of trials on which (wo)man and machine make the same choice is calculated. Some subjects have a percentage above 50, whereas others have one below; but if the null hypothesis is true, the average percentage should not differ significantly from 50%.
To evaluate significance, the researcher subtracts the hypothetical value of 50% from the empirically obtained average. The difference is then divided by the standard error of the empirical average, which is the standard deviation of the individual percentages divided by the square root of the number of subjects. The resulting ratio is a t statistic. These t values have a known distribution, and one can look up the probability of a t value at least as extreme as the obtained one under the assumption that the null hypothesis is true. This probability is the famous little p value on whose back academic reputations have been built and destroyed.
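The recipe in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit rates are invented (they are not Bem's data), the function names are hypothetical, and the p value is two-sided and obtained by numerically integrating the t density rather than by a table lookup.

```python
import math
import statistics

def one_sample_t(scores, null_value=50.0):
    """Return (t, degrees of freedom) for a one-sample t test against null_value."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)            # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)       # SD divided by the square root of n
    return (mean - null_value) / standard_error, n - 1

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20_000):
    """Two-sided p value via Simpson's rule on the t density from 0 to |t|.

    steps must be even for Simpson's rule.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total       # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)

# Hypothetical hit rates (percent) for ten subjects; NOT Bem's data.
hit_rates = [52, 55, 49, 53, 51, 56, 50, 54, 48, 52]
t, df = one_sample_t(hit_rates)
p = two_sided_p(t, df)
print(f"t({df}) = {t:.2f}, two-sided p = {p:.4f}")
```

Whether a one-sided or a two-sided p is appropriate is itself a judgment call; the two-sided version is used here only to keep the sketch simple.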
Before moving on, let's note 3 regularities. It is easily seen that t goes up and p goes down as (1) the difference between the observed and the expected average increases, as (2) the observed individual scores become less variable (i.e., as the precision of measurement increases), and as (3) the number of observations (subjects) increases. A researcher who wants p to be small can make it so for each of these 3 reasons.
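The three regularities follow directly from the formula for t. A minimal sketch with made-up summary statistics (the helper name and all numbers are assumptions chosen for illustration) makes each one visible by changing one ingredient at a time:

```python
import math

def t_from_summary(observed_mean, sd, n, null_value=50.0):
    """t statistic from summary statistics: difference over standard error."""
    return (observed_mean - null_value) / (sd / math.sqrt(n))

# Hypothetical baseline: mean hit rate 52%, SD 10, 100 subjects.
baseline = t_from_summary(observed_mean=52.0, sd=10.0, n=100)

# (1) A larger difference between observed and expected average.
bigger_effect = t_from_summary(observed_mean=54.0, sd=10.0, n=100)

# (2) Less variable individual scores.
less_noise = t_from_summary(observed_mean=52.0, sd=5.0, n=100)

# (3) More subjects.
more_subjects = t_from_summary(observed_mean=52.0, sd=10.0, n=400)

for label, value in [("baseline", baseline), ("bigger effect", bigger_effect),
                     ("less noise", less_noise), ("more subjects", more_subjects)]:
    print(f"{label:>14}: t = {value:.2f}")
```

Note that quadrupling the number of subjects buys the same increase in t as doubling the effect or halving the standard deviation, because n enters only through its square root.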
What does a small p indicate? It indicates that the observed data (or data more extreme) are not probable if the null hypothesis is true. Period. And this is the end of objectivity. Remember, though, that the researcher wants to test the null hypothesis and not the data. In other words, the researcher wants a low probability of the null hypothesis being true given the data, but what the study yields is the inverse. This is a real pickle. How do you reject a hypothesis as being improbable if what you have is a probability that can only speak to the data? The answer is you rely on plausibility and convention. If p is small then there is probably some hypothesis other than the null under which these data are more probable. The convention is that if p < .05, you reject the null.
The appeal of this method and its stubborn refusal to die lies, in my opinion, in its apparent objectivity. The probability p can be dispassionately calculated, and the leap of faith from data to hypothesis is constrained by a code of convention. There appears to be no room for researchers to insert their idiosyncratic biases.
To the Bayesian revisionist (I cleverly use this word in both its meanings), the objectivity wrested from calculation and convention is a false god. Bayesians make a game out of catching null testers betraying their objectivist ideology. Take Daryl Bem. He accepts Laplace's principle of rational inference, and he claims that he has evidence that meets the threshold of extraordinariness. In his reply to Alcock's critique, he writes that "Across all nine experiments, the combined odds against the findings being due to chance are greater than 70 billion to 1." In other words, he knows that a single significant t test is not enough to support the extraordinary claim of psi.
In Bayesian terms, an extraordinary claim is a hypothesis ("The future made me do it!") that is improbable even before study. What this means is that a p of .05 does not translate into a probability of .05 that the null hypothesis is true given the data. Likewise, it cannot be suggested that the odds against the null hypothesis being true are 70 billion to 1. If the prior probability of the null hypothesis was very high, a small p value can lower it, but it won't bring it down to its own value. This is the crux of the Rouder-and-Wagenmakers critique of Bem. The p value is deceptive; the null is rashly rejected; the posterior probability of the null (after study) tends to be higher than p.
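A small numerical sketch shows why the posterior probability of the null can sit far above p. All inputs are assumptions made for illustration: the data are taken to have probability .05 under the null (a simplification, since a p value is strictly a tail probability rather than the probability of the exact data) and probability .50 under a single hypothetical alternative.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Posterior probability of the null when it competes with one alternative."""
    prior_alt = 1.0 - prior_null
    joint_null = prior_null * p_data_given_null
    joint_alt = prior_alt * p_data_given_alt
    return joint_null / (joint_null + joint_alt)

# Hypothetical likelihoods: .05 under the null, .50 under the alternative.
for prior in (0.5, 0.9, 0.99, 0.999999):
    post = posterior_null(prior, p_data_given_null=0.05, p_data_given_alt=0.50)
    print(f"prior P(null) = {prior:<9} -> posterior P(null | data) = {post:.4f}")
```

Under these assumptions the posterior stays above .05 even for an even-odds prior, and the more extraordinary the claim (the higher the prior on the null), the less a single "significant" result moves it.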
Bayesians calculate the posterior probabilities for a set of hypotheses given the prior probability of these hypotheses and given the probabilities of the data under these hypotheses. The logic is simple, although the formulas look off-putting (see pic). The key difference between the Bayesian and the classic Fisherian approach is that using the former, the researcher must specify at least two hypotheses. The null cannot stand alone. What is more, the researcher must commit to a distribution of prior probabilities, thus making explicit what is meant by a claim being extraordinary or plain ordinary. Then, in an objective interlude, the numbers can be crunched. In the end, the researcher has a revised probability for each hypothesis, given the data, and given the prior probabilities selected before study.
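In symbols, for a set of mutually exclusive and exhaustive hypotheses and observed data D, the standard textbook form of Bayes's theorem reads as follows. The notation (H_i for the hypotheses, D for the data) is an assumption chosen for illustration.

```latex
P(H_i \mid D) \;=\; \frac{P(D \mid H_i)\,P(H_i)}{\sum_{j} P(D \mid H_j)\,P(H_j)}
```

The prior P(H_i) is the subjective ingredient, P(D | H_i) is what the data contribute, and the sum in the denominator runs over every hypothesis under consideration, which is why the null cannot stand alone.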
True Bayesians would stop here. They would say: "All my cards are on the table. I have not hidden my expectations, biases, or preconceptions." This up-front subjectivity means that researchers entering the fray with different expectations will still disagree after the data are analyzed. But they will disagree less than they did before. This is the beauty of Bayes. The data gradually crowd out differences in prior belief. "Gradually," ladies and gentlemen. No crowbar, black-and-white decisions about who gets to stay and play and who gets rejected.
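The crowding-out of priors can be sketched with two hypothetical researchers who see the same stream of studies. The assumptions are loud ones: each study is taken to favor a hypothesis H1 over H0 by the same likelihood ratio of 3, and the starting beliefs (1% and 50%) are invented.

```python
def update(prior_h1, likelihood_ratio):
    """One Bayesian update; likelihood_ratio = P(data | H1) / P(data | H0)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

def belief_trajectory(prior_h1, likelihood_ratio, n_studies):
    """Belief in H1 before any data and after each of n_studies updates."""
    beliefs = [prior_h1]
    for _ in range(n_studies):
        beliefs.append(update(beliefs[-1], likelihood_ratio))
    return beliefs

# Hypothetical researchers: a skeptic (1% prior) and an agnostic (50% prior).
skeptic = belief_trajectory(0.01, likelihood_ratio=3.0, n_studies=10)
agnostic = belief_trajectory(0.50, likelihood_ratio=3.0, n_studies=10)
gaps = [a - s for s, a in zip(skeptic, agnostic)]

for k, (s, a, g) in enumerate(zip(skeptic, agnostic, gaps)):
    print(f"after {k:2d} studies: skeptic {s:.4f}, agnostic {a:.4f}, gap {g:.4f}")
```

One caveat the sketch makes visible: measured as a difference in probabilities, the gap need not shrink after every single study (here it widens at first, because the agnostic moves faster than the skeptic), but it closes as the evidence piles up. "Gradually" is the operative word.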
But old habits - inculcated in graduate school - die hard. The objectivist desire is to make a categorical decision as to what to believe. Nuanced, graded, probabilistic beliefs in the tradition of the Reverend Bayes are hard to entertain. Although they discuss the role and the logic of prior probabilities, Rouder and Morey stake their claim on the Bayes factor alone, that is, on the ratio of the probability of the data under the alternative hypothesis to the probability of the data under the null hypothesis (the Bayesian cousin of little p). Although the authors are not crystal clear about how they selected the alternatives, and although they set aside the crucial Bayesian maneuver of computing posterior probabilities, they draw what appear to be conclusions. In their abstract, they write:
"We find the evidence that people can feel the future with neutral and erotic stimuli to be slight, with Bayes factors of 3.23 and 1.57, respectively. There is, however, a surprising degree of evidence for the hypothesis that people can feel the future with emotionally-valenced nonerotic stimuli, with a Bayes factor of about 40. Though this value is certainly noteworthy, it is several orders of magnitude lower than what is required to overcome appropriate skepticism of such implausible claims."
This passage is littered with subjective terms. Evidence is "slight," "surprising," or "noteworthy." At the same time, there lurks the normative claim that there is some sort of "requirement" that must be met before "appropriate skepticism" can be overcome. This is unfortunate. To me, the appeal of Bayesianism is that it permits the expression of subjective belief and the mathematical integration of belief and data. Raising normative demands undercuts principled subjectivism.
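A short worked example shows how much the verdict on the Bayes factor of about 40 in the quoted abstract depends on the prior one brings to it. The two sets of prior odds below are assumptions invented for illustration; neither is taken from Rouder and Morey.

```latex
\begin{align*}
\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}}
  &= \underbrace{\mathrm{BF}_{10}}_{\text{Bayes factor}}
     \times
     \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}} \\[6pt]
\text{skeptic (prior odds } 1 : 10^{6}\text{):} \quad
  & 40 \times \frac{1}{10^{6}} = \frac{1}{25{,}000} \\[4pt]
\text{agnostic (prior odds } 1 : 10\text{):} \quad
  & 40 \times \frac{1}{10} = 4
\end{align*}
```

The same Bayes factor leaves the skeptic at odds of 25,000 to 1 against psi and the agnostic at 4 to 1 in favor. How "appropriate" a given degree of skepticism is cannot be read off the arithmetic; it goes in with the prior.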
I conclude that, sadly, the orthodox Fisherians and the revisionist Bayesians continue to talk at cross-purposes. Their shared delusion is that the math, if done right, will eventually tell us for sure if there's psi. On this point, Bradley Efron, a professor of statistics at Stanford and editor of the Annals of Applied Statistics, put it this way: "No general formula will free the scientist, or anyone, from having to use judgment in interpreting evidence" (NYT, January 17, 2011).
Hence, I stake my skepticism about psi on logical rather than empirical grounds.
An afterthought comes to mind (October 23, 2011). In cognitive psychology, statistical principles often stand in as theories of mind. When making judgments and decisions, ordinary people are thought to think like Bayesians or to perform intuitive analogs of null hypothesis significance testing. About either approach, it has been said that it is nothing but a metaphor. At the output level, these statistical ideas may fit what humans do, but that does not mean that humans perform operations in their heads that in any way resemble what researchers do when calculating conditional probabilities. If human data fit well with Bayesian calculations, it has been said that this is only a matter of fit, and that one cannot infer anything about people's actual thought processes. Bayes's Theorem is, as it were, "as-if" psychology. Amos Tversky even questioned the fit. He once wrote (if pressed, I could unearth the exact reference) that people are not just bad Bayesians but no Bayesians at all. This brings to mind the question of how he, Tversky, arrived at that conclusion. Did he use Bayesian methods to refute Bayes, or, more intriguingly, did he refute Bayes using some type of null hypothesis significance testing?
Returning to the Bemian theme of this post, I want to share my premonition that telepathy does not exist.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426-432.