Two Implications of Bayes’ Theorem

The Rev teaches uncertainty.

Posted Mar 27, 2018

In science, progress is possible. In fact, if one believes in Bayes' theorem, scientific progress is inevitable as predictions are made and as beliefs are tested and refined. ~ Nate Silver

If the probability that Bayes' theorem is true is .9, what is the revised probability of it being true if we reject the hypothesis of it being false at p = .05? ~ JIK

Thomas Bayes was an English cleric and mathematician who was interested, among other things, in finding a proof of god. He couldn’t, but he left a treatise containing a theorem, which, after it was published posthumously (Bayes, 1764), became the basis of what we now call Bayesian statistics. What Bayes’ theorem does, in conceptual terms, is describe how pre-existing belief (conjecture, hypothesis, or hunch) should be updated in light of new evidence (observations, data) in such a way that there are no contradictions. In other words, Bayes’ theorem guarantees coherence and it promises gradually increasing degrees of belief accuracy. No wonder many people (statisticians, psychologists, machine learners) view the theorem as the definition of rationality. In this mildly technical essay, I point out two implications of Bayes’ theorem that are not particularly deeply hidden in the math, but that are profound in their relevance for research and religion. But first we need to introduce the terms of the theorem and how they are related to one another (which is the theorem’s job to illuminate).

Figure 1. Bayes' theorem. Source: J. Krueger

Figure 1 shows the theorem. The probability that a belief (H for hypothesis from here on out) is true given the evidence (D for data), or p(H|D), is equal to the product of the prior probability of the hypothesis, p(H), i.e., before the new data are introduced, and the “diagnostic ratio.” This ratio is the probability of the data assuming that the hypothesis is true, p(D|H), over the total probability of the data, p(D), i.e., the probability of the data summed over all hypotheses, each weighted by its prior probability. To make matters simple (yes!), let’s assume that there is only one alternative hypothesis, ~H, the probability of which is 1 – p(H). Now we can say that p(D) = p(H) * p(D|H) + p(~H) * p(D|~H). The theorem is complete. Look again at Figure 1 to appreciate this fact.
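To make the arithmetic concrete, here is a minimal sketch in Python of the theorem as just stated; the function name and the illustrative numbers are mine, not taken from the figure.

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """Bayes' theorem with a single alternative hypothesis ~H.

    p(H|D) = p(H) * p(D|H) / p(D), where
    p(D)   = p(H) * p(D|H) + p(~H) * p(D|~H) and p(~H) = 1 - p(H).
    """
    p_not_h = 1.0 - p_h
    p_d = p_h * p_d_given_h + p_not_h * p_d_given_not_h
    return p_h * p_d_given_h / p_d

# Made-up numbers: a coin-flip prior, data twice as likely under ~H as under H.
print(posterior(p_h=0.5, p_d_given_h=0.3, p_d_given_not_h=0.6))  # 0.333...
```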

The first implication of Bayes’ theorem is that the reverend could have proven god in theory, but that the necessary condition is extreme. It is possible for p(H|D) to be 1, but only if p(D|~H) = 0 (with p(D|H) > 0). Certainty of belief requires extreme data: data that are impossible under the alternative hypothesis and, in the cleanest case, certain under the hypothesis of interest, p(D|H) = 1. When this pair of conditions is met, the prior strength of the belief (in god or whatever) is irrelevant. Proof (i.e., the combination of p(D|H) = 1 and p(D|~H) = 0) eradicates the difference between the advocate and the skeptic.
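A quick numerical check of this prior-irrelevance point is sketched below; the helper function and the illustrative priors are mine, not the essay's.

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """p(H|D) by Bayes' theorem, with a single alternative hypothesis ~H."""
    p_d = p_h * p_d_given_h + (1.0 - p_h) * p_d_given_not_h
    return p_h * p_d_given_h / p_d

# Under "proof" conditions -- p(D|H) = 1 and p(D|~H) = 0 -- the prior drops out:
for prior in (0.01, 0.5, 0.99):
    print(prior, posterior(prior, p_d_given_h=1.0, p_d_given_not_h=0.0))
# The skeptic (prior .01) and the advocate (prior .99) both end at a posterior of 1.0.
```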

So much for religion. In most empirical sciences, incontrovertible proof is rare. Data come with noise and uncertainty, and hypotheses and the beliefs and assumptions they support tend to remain probabilistic. At most, researchers might say that they have ‘moral certainty’ that X is true. Morality being famously imperfect, the door for a change of mind given new data is left ajar.

The second implication of Bayes’ theorem concerns how well aligned the probability of the data under the hypothesis, p(D|H), is with the posterior probability of the hypothesis given the data, p(H|D). This question is of interest to all researchers who wish to test hypotheses rather than merely assess whether the data are credible. These researchers want to draw inferences from the data to the hypotheses. They want to use p(D|H) to infer p(H|D). To do so, they need the full theorem. They need to know (or postulate) p(H), p(~H), and p(D|~H). An inference from p(D|H) to p(H|D) is strong inasmuch as the two terms are correlated with one another. Using simulation experiments, we found that these correlations are positive, but that their magnitude can vary widely in predictable ways (Krueger & Heck, 2017). Here we want to find the conditions under which p(D|H) and p(H|D) are identical.
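The simulation idea can be illustrated with a toy version. The sketch below is not the procedure of Krueger and Heck (2017); it merely shows why the correlation between p(D|H) and p(H|D) comes out positive yet imperfect when priors and alternative likelihoods are allowed to vary.

```python
import random

def posterior(p_h, p_d_given_h, p_d_given_not_h):
    p_d = p_h * p_d_given_h + (1.0 - p_h) * p_d_given_not_h
    return p_h * p_d_given_h / p_d

random.seed(1)
xs, ys = [], []
for _ in range(10_000):
    p_h = random.uniform(0.01, 0.99)         # prior probability of H
    p_d_h = random.uniform(0.01, 0.99)       # p(D|H)
    p_d_not_h = random.uniform(0.01, 0.99)   # p(D|~H)
    xs.append(p_d_h)
    ys.append(posterior(p_h, p_d_h, p_d_not_h))

# Pearson correlation between p(D|H) and p(H|D)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
print(round(cov / (sx * sy), 2))  # positive, but clearly below 1
```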

Bayes’ theorem shows that p(D|H) = p(H|D) if and only if p(H) = p(D). Now let’s consider the case of p(D|H) = .05, where the researcher, following convention, declares the result to be significant. In all likelihood, p(H|D) will not be as low as p(D|H), but it might be. Today’s question is: What does it take to make it so? A little algebra reveals that p(D|H) = p(H|D) if p(D|~H) = p(H) * (1 – p(D|H)) / p(~H). Let’s try some examples. Having selected p(D|H) = .05, we might have a hypothesis that appears neither particularly likely nor unlikely at the outset, i.e., p(H) = .5. Now, if p(D|~H) = .95, we have our desired equality of p(H|D) = p(D|H) = .05. This is a nice arrangement. The prior belief is maximally uncertain (p(H) = .5); the results are significant (p(D|H) = .05) and highly likely under the alternative hypothesis (p(D|~H) = .95); and the null hypothesis is indeed rejectable (p(H|D) = .05, which means that p(~H|D) = .95).
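A short check of this algebra, assuming the same kind of helper as above:

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    p_d = p_h * p_d_given_h + (1.0 - p_h) * p_d_given_not_h
    return p_h * p_d_given_h / p_d

def required_p_d_given_not_h(p_h, p_d_given_h):
    """p(D|~H) that makes p(H|D) equal to p(D|H), i.e., that makes p(D) = p(H)."""
    return p_h * (1.0 - p_d_given_h) / (1.0 - p_h)

req = required_p_d_given_not_h(p_h=0.5, p_d_given_h=0.05)
print(req)                        # 0.95
print(posterior(0.5, 0.05, req))  # 0.05 -- the desired equality
```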

Now consider the more troubling consequences that emerge when we depart from this best-case scenario. What if the researcher selects a risky alternative hypothesis, that is, a case where p(H) is high? If p(H) = .8, for example, p(D|~H) would have to be 3.8 so that p(D|H) = p(H|D) = .05. An impossible result! Bayes’ theorem forbids it. If you pursue risky research (where p(H) is high) and manage to obtain statistical significance, it is guaranteed that the hypothesis is not as unlikely as the data that lead to its rejection. At p(H) ≈ .513, p(D|~H) = 1. For any higher value of p(H), p(H|D) > p(D|H). This is one horn of the dilemma.
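The impossibility can be made concrete with the same required-likelihood formula; the boundary prior is simply 1 / (2 − .05), a value I compute here for illustration.

```python
def required_p_d_given_not_h(p_h, p_d_given_h=0.05):
    """p(D|~H) needed for p(H|D) = p(D|H); values above 1 are impossible."""
    return p_h * (1.0 - p_d_given_h) / (1.0 - p_h)

print(required_p_d_given_not_h(0.8))  # 3.8 -- not a legitimate probability

# Boundary prior at which the required p(D|~H) just reaches 1:
alpha = 0.05
print(1.0 / (2.0 - alpha))            # ~0.513; above this, p(H|D) > p(D|H)
```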

The other horn emerges when research is safe. When p(H) is low, that is, when the probability of the alternative or substantive hypothesis, p(~H), is high a priori, the equality of p(H|D) and p(D|H) is easily obtained, but at the price that p(D|~H) is low. For example, if p(H) = .1, and both p(D|H) and p(H|D) = .05, then p(D|~H) = .106. This may seem like a grotesque result. On the one hand, the alternative hypothesis is regarded as very likely a priori (p(~H) = .9); on the other hand, this very hypothesis provides a fit with the data that is only about twice as good as the fit under the hypothesis (H) that is being rejected.
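And the safe horn, checked numerically (a sketch with the same helper):

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    p_d = p_h * p_d_given_h + (1.0 - p_h) * p_d_given_not_h
    return p_h * p_d_given_h / p_d

req = 0.1 * (1.0 - 0.05) / 0.9    # required p(D|~H) when p(H) = .1
print(round(req, 3))               # 0.106
print(posterior(0.1, 0.05, req))   # ~0.05 -- equality holds, but p(D|~H) is meagre
```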

The moral of the story is that Bayes’ theorem not only teaches us coherence, but it also urges us (if it could speak) to do our best to select hypotheses of intermediate likelihood for testing. It is here that empirical research yields the greatest rewards.

Proof? What proof? When writing down the first implication ('Proof eliminates the disagreement between the advocate and the skeptic') I was jolted out of my Humean slumber. David Hume (1739) famously argued (and proved!) that you cannot prove the validity of induction by deductive means (see the Stanford Encyclopedia of Philosophy entry on the problem of induction). The clichéd example for this very deep insight is that no matter how many white swans you have seen, you cannot take it as proven that no black swan exists. This is so when there is no bound on the possible number of swans out there; the argument does not hold in a finite population. Now we must ask whether p(D|H) can be 1. If we are working in the land of theory, assuming a Gaussian (or otherwise unbounded) distribution, it is hard to see how this could be asserted on the basis of data. Data, as they come in as measurements, are finite in their numerical value. Therefore, a more extreme value is always possible. Therefore, the probability of these data or data less extreme must be less than 1. Therefore, the argument that I have made, namely that Bayes' theorem allows us to extract certain belief from observed data, is valid only in theory but not in practice. Hume wins (an interesting historical note suggests that Bayes' efforts may have been motivated by the desire to refute Hume).
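The tail-probability point can be illustrated for a Gaussian. This sketch is my own, not the essay's; it shows that for any finite observation a more extreme value retains nonzero probability, so the probability of these data or data less extreme stays below 1.

```python
import math

def gaussian_tail(x, mu=0.0, sigma=1.0):
    """P(X > x) for a normal distribution; positive for every finite x."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

for x in (3.0, 6.0, 10.0):
    print(x, gaussian_tail(x))  # tiny but nonzero, so P(X <= x) < 1
```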

We end with a quote from David Hume, just to show that the great skeptic had a wicked sense of humor: "I have written on all sorts of subjects... yet I have no enemies; except indeed all the Whigs, all the Tories, and all the Christians."

Bayes, T. (1764). An Essay Toward Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

Hume, D. (1739). A Treatise of Human Nature. Oxford, England: Oxford University Press.

Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology, 8, 908. https://doi.org/10.3389/fpsyg.2017.00908