What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.
~ Harold Jeffreys
Exactly! ~ J. K.
Working in Enlightenment England in the 18th century, the Reverend Thomas Bayes tried to devise a mathematical system to prove the existence of god. With sufficient evidence for His good works, the Reverend reckoned, god’s existence would become a matter of rational belief rather than blind faith. Bayes failed and never published his treatise. That task was left to his friend Richard Price, who published the work under Bayes’s name two years after Bayes’s death (Bayes, 1763). Bayes’s theorem is now familiar. The probability of an hypothesis given the evidence, p(H|E), is equal to the prior probability of that hypothesis, p(H), times the likelihood ratio, LR. The LR is the probability of the evidence under the hypothesis in question, p(E|H), divided by the overall probability of the evidence, p(E). This latter term, p(E), is the sum of p(E|H) times p(H) and p(E|~H) times p(~H), where ~H refers to all other hypotheses or simply the idea that the tested hypothesis, H, is false. Formally,
p(H|E) = p(H) × p(E|H) / [p(H) × p(E|H) + p(~H) × p(E|~H)].
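To make the arithmetic concrete, here is a minimal sketch of the theorem with made-up numbers (the prior of .5 and the likelihoods of .8 and .2 are assumptions for illustration only):

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """p(H|E) = p(H) p(E|H) / [p(H) p(E|H) + p(~H) p(E|~H)]."""
    p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
    return p_h * p_e_given_h / p_e

print(posterior(0.5, 0.8, 0.2))   # 0.8: evidence raises a .5 prior to .8
print(posterior(1.0, 0.8, 0.2))   # 1.0: a prior of 1 is immune to evidence
print(posterior(0.99, 0.8, 0.2))  # < 1: any prior uncertainty survives
```

The second and third lines preview the point made below: only a prior of exactly 1 yields a posterior of 1.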
We see that certainty about god’s existence cannot be wrought from uncertainty. Unless the prior, p(H), is 1, the posterior, p(H|E), will be less than 1. When the prior probability is 1, we have a matter of faith, and no evidence can shake it. Although Bayes’s theorem is coherent in the sense that each of its components can be perfectly derived from all the others, it is an exercise in induction, and induction – as Hume taught – has no logical force beyond the evidence given. The future is irreducibly uncertain.
Bayesianism has returned to empirical science in the guise of “a new statistics,” which is ironic because Bayesianism is older than the frequentist methods it now seeks to replace. Frequentist statistics are conventionally used in hypothesis or ‘significance’ testing. Here, prior beliefs may be acknowledged as reasonable assumptions about the riskiness of a research question, but they are not quantified to be part of the analysis. Most frequentist methods yield a p value, which expresses the probability of the data (or rather, the test statistic) – or data more extreme than the obtained data – under the assumption that the tested hypothesis is true. The inductive leap is to infer the falsity of that hypothesis when the p value is low (conventionally < .05). The tested hypothesis is often – but not necessarily – the idea of ‘no effect,’ i.e., that the data reflect only noise and no signal. This version of significance testing is called Null Hypothesis Significance Testing (NHST), and it has been the workhorse of empirical psychology (see Krueger, 2001, for a flogging).
There are a handful of authors who for years have been diligently and prolifically advocating a switch from frequentist to Bayesian methods in research practice. Their arguments are a mix of a critique of NHST and a promotion of Bayes. Most interested readers are familiar with these arguments. Over the last decade, there has been little news other than the increased availability of easy-to-use programs for Bayesian analysis. With this concerted delivery of the “frequentist = bad, Bayes = good” mantra, the promoters may eventually get their way. If they do, will it be for good reason? I doubt it, and in this essay I explain why.
My approach is to revisit the arguments presented in a recent chapter (Ortega & Navarrete, 2017), which I take to be a prototypical effort. There are no new arguments against NHST or in favor of Bayes; each presented argument has by now become a stock in trade. I am noting this because in my opinion, the critical literature has become degenerate. The push for Bayes has become tiresome; its goal no longer seems to be to change researchers’ minds but to wear them down. In the process, the internal contradictions of these arguments and their rhetorical fast-and-looseness are becoming more evident.
In the interest of not making this into a dissertation, I simply quote from Ortega & Navarrete (in italics) and add “oh really.”
It has also been emphasized that such a cumulative knowledge—for a true psychological science—is not possible through the current and widespread paradigm of hypothesis testing. (236)
What is “true psychological science”? How is cumulative knowledge not possible with ST (significance testing)? Suppose hypotheses H1 and H2 have each been tested 5 times. H1 yielded p = .10, .05, .01, .005, and .001, whereas H2 yielded .10, .20, .30, .40, and .50. While we see no difference between the two hypotheses after the first experiment, the accumulated p values suggest that H1 is false and H2 is not.
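For what it is worth, ST has a conventional tool for exactly this kind of accumulation: Fisher’s method of combining p values. A minimal sketch, applied to the hypothetical p values from the example above (the closed-form chi-square tail works because the degrees of freedom, 2k, are even):

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(ln p) ~ chi-square with 2k df
    under the tested hypothesis. For even df the chi-square
    survival function has a closed form."""
    x = -2.0 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)                   # df = 2k
    term, total = 1.0, 1.0             # i = 0 term of the series
    for i in range(1, k):
        term *= (x / 2.0) / i          # (x/2)^i / i!
        total += term
    return math.exp(-x / 2.0) * total

h1 = [.10, .05, .01, .005, .001]   # H1's five test results
h2 = [.10, .20, .30, .40, .50]     # H2's five test results
print(fisher_combined_p(h1))       # ≈ 3e-6: the case against H1 accumulates
print(fisher_combined_p(h2))       # ≈ .20: no cumulative case against H2
```

Five inconclusive studies of H2 remain inconclusive in the aggregate, whereas H1’s record becomes damning – cumulative knowledge with p values alone.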
Null hypothesis testing can actually impede scientific progress. (236)
“Can” is a hedge; but does it? What does “impede” mean? Is progress inverted into regress; i.e., are we stupider after applying ST than after doing nothing? Or is progress slower than it would be with alternative methods such as Bayes? The former is hard to believe (a priori, as it were) and requires a strong argument, which is not given. The latter grants that ST does promote scientific progress, just that it is not as fast as it could be.
Bayesian analysis allows us to move from a dichotomous way of reasoning about results (e.g., either an effect exists of it does not) to a less artificial view that interprets results in terms of magnitude of evidence (e.g., the data are more likely under H0 than Ha), and therefore, allows us to better depict to which extent a phenomenon may occur. (236)
The claim that ST allows dichotomous decisions between rejecting and not rejecting the tested hypothesis is a favorite complaint. It has been repeated so many times that it is beginning to appear self-evident. Yet, nature often either is or is not in a state of interest, and we have to act on our inductive inference. The woman is pregnant or she is not. Gravity either bends light or it does not. Cow manure makes good fertilizer or it does not. Alas, the same cannot be said about the claim that Bayesian analysis is superior or not superior to ST. Besides, ST does not force researchers to make dichotomous decisions (although the Neyman-Pearson school suggests it more strongly than the Fisherian school does).
A Bayesian approach naturally allows us to directly test the plausibility of both the null and the alternative hypothesis, but the current NHST paradigm does not. In fact, when a researcher does not reach a desired p-value oftentimes it is—falsely— assumed that the effect “does not exist.” (236)
The first part of this statement asserts that only Bayesianism but not ST directly compares hypothesis H with the alternative ~H. However, when two specific hypotheses are articulated, p values can be computed with ST for each. Taking the ratio of the two (or the ratio of their corresponding likelihoods) yields the Bayesian ratio. In other words, Bayesianism exhausts itself by using the ingredients provided by ST. The second part of the statement is ironic as it charges practitioners of ST with the commission of a Type II error, when in fact it appears to be the Bayesians who seek to show that ‘the effect does not exist.’ Bayesians usually attack ST on the grounds that it yields too many Type I errors (false rejections of true null hypotheses).
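The point can be sketched with made-up numbers. Given two point hypotheses about a normal mean, the sample mean and standard error that any frequentist test already computes yield the likelihood ratio directly (the observed mean of .30, the standard error of .10, and the two hypothesized means are assumptions for illustration):

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Observed sample mean and its standard error -- the same
# ingredients a frequentist test statistic is built from.
mean, se = 0.30, 0.10

lik_h0 = normal_pdf(mean, 0.0, se)   # likelihood under H0: mu = 0
lik_h1 = normal_pdf(mean, 0.5, se)   # likelihood under H1: mu = 0.5
lr = lik_h1 / lik_h0
print(lr)   # ≈ 12.2: the data favor H1 over H0
```

Nothing in this computation goes beyond the sampling model that ST supplies; the “Bayesian” step is merely the division on the last line.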
NHST constitutes an amalgamation of two irreconcilable schools of thought in modern statistics: the Fisher test of significance, and the Neyman and Pearson hypothesis test. (237)
Bayesians roundly criticize all versions of ST. If Fisherian significance testing and Neyman-Pearson hypothesis testing are both fatally flawed, then an amalgam of the two cannot make matters worse for ST (or better for Bayes). What, for example, is objectionable in reporting exact p values (Fisher) and performing a power analysis (Neyman)? It is, however, logically possible that an amalgamation might improve statistical practice.
Most scientists from different research fields adopted standard significance levels (i.e., α = 0.05 or α = 0.01), which have been used—or misused—regardless of the hypotheses being tested. (238)
Which is it: used or misused? If current practice is a misuse, the problem does not lie with the method. Most Bayesians dismiss dichotomization as bad practice, and thereby dismiss decision-making in favor of belief updating. Again, however, when action must follow belief, action reveals decision. If, for example, you reject the hypothesis that fertilizer F does not work, you will use it, right?
This simple and appealing decision rule may constitute a very seductive way of thinking about results, that is: A phenomenon either exists or it does not. However, thinking in this fashion is fallacious, led to misinterpretations of results and findings, and more importantly “it can distract us from a higher goal of scientific inquiry. That is, to determine if the results of a test have any practical value or not.” (238)
Why is it fallacious to think that a phenomenon either does or does not exist? Isn’t a better understanding of nature the goal of science? Questions of practical value often arise for humans studying nature, but they don’t have to (at least not at the time of study). What would be, for example, the practical value of knowing whether life exists somewhere in the Andromeda galaxy? It would be mightily interesting to know, though. Besides, questions of practical value are as external to Bayesian statistics as they are to ST (in fact, the Neyman-Pearson school comes closest to acknowledging them).
Badenes-Ribera et al. recently reported the results of a survey conducted to 164 academic psychologists who were questioned about the meaning of p-values. Results confirmed previous findings regarding the occurrence of wrongful interpretations of p-values. For instance, the false belief that the p-value indicates the conditional probability of the null hypothesis given certain data (i.e., p (H0|D)), instead of the probability of witnessing a given result, assuming that the null hypothesis is true. (239)
This complaint about the “inverse probability fallacy” is so tiresome that it has achieved canard status. Of course, the p value does not reflect the posterior probability of the tested hypothesis, but it is a useful heuristic cue predicting it (Krueger & Heck, 2017). The same, incidentally, may be said for the Bayesian likelihood ratio, LR, which predicts p(H|D) only when multiplied with the prior, p(H).
NHST uses inference procedures based on hypothetical data distributions, instead of being based on actual data. (239)
This statement is descriptively correct, but that does not turn it into a valid criticism. Bayesian analysis is based on the observed data and it ignores the sampling distribution of the data, which would speak to the probability of observing more (or less) extreme data under any hypothesis of interest. The neglect of sampling variation in the data may itself be the target of vigorous criticism.
NHST does not provide clear rules for stopping data collection; therefore, as long as sample size increases any H0 can be rejected. (239)
Standards for a priori estimates of sample size are readily available in the Neyman-Pearsonian context of power analysis. The idea that any null hypothesis will be rejected by tests performed on a very large sample is another canard. If, for example, we assign participants randomly to two groups without a treatment and measure a random psychological property, the shrinking standard error will chase the shrinking observed chance effect with no attraction to a non-null result (Hagen, 1997). Besides, what are the stopping rules for data collection in Bayes? Does Bayes offer protection against too-small samples (or too large ones)?
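The claim is easy to check by simulation. A minimal sketch, assuming normal populations with known variance and a two-sided z-test (the sample sizes, number of simulations, and seed are arbitrary choices):

```python
import math
import random

def two_sample_p(n, rng):
    """Two-sided z-test p value for two random samples drawn
    from the same null population (known sd = 1 in both groups)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2.0 / n)              # standard error of the difference
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2))   # two-sided p

rng = random.Random(1)
for n in (20, 200, 2000):
    sims = 300
    rate = sum(two_sample_p(n, rng) < .05 for _ in range(sims)) / sims
    print(n, rate)   # rejection rate hovers near .05 at every sample size
```

The rejection rate does not creep toward 1 as n grows; under a true null it stays pinned near the nominal alpha, as Hagen (1997) argued.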
The focus of research should be on what data tell us about the magnitude of effects. (240)
. . . if the effects are there! Aren’t the Bayesians the champions of null effects, or invariances in nature? If – if – effects exist, then surely an index of effect size going along with the p value is salutary. This is good advice that Bayesians themselves may want to heed. The Bayes Factor (the ratio of posterior over prior odds, i.e., a doubly derivative index) does not unveil effect size.
[Bayes] permits the continuous update of evidence as long as new data are available, which is in line with the nature of scientific inquiry. (240)
Continuous updating is fine, and it captures the cumulative aspirations of empirical science. But how much updating is enough? Where is the stopping rule? Recall the Reverend. Bayes surrendered his struggle to prove the existence of god inductively. Perhaps he read his fellow Briton Hume and came to realize that it could not be done. Besides, ST can handily accommodate the desire for continuous updating. Methods of meta-analysis, including sequential meta-analysis, abound.
One of the most common misinterpretations of p-values it has been to consider a p-value as a valid indicator of the magnitude of evidence of a result (i.e., effect size fallacy). (248)
Nobody really believes this anymore.
In a well-functioning scientific community, a debate should take this general form:

- Here is a statement of what needs to be achieved.
- Here are the pros and cons for method A.
- Here are the pros and cons for method B.
- Weighing the pros and cons of both methods, we conclude that A is better than B, or vice versa.

The current debate over statistical method in psychology is not of this type. Instead we see a self-replicating critique machine pummeling the conventional practice of ST, and, assuming its own success at being critical, offering another flawed method as a presumably superior alternative. Weaknesses of that alternative (Bayes) thereby never come into focus. We do not get to discuss how we might obtain reasonable constraints on variation in prior belief, and we do not get to discuss how a focus on relative support can make a horrible hypothesis look great simply because it is being compared with an hypothesis yet more horrible.
The greatest disappointment in the Bayesian attack on significance testing is the rhetorical nature of many of the arguments. One gets the feeling many of the writers took a course on Schopenhauer’s dialectic on how to win an argument without having truth on one’s side (Krueger, 2016). Schopenhauer’s first stratagem for successful – if insincere – argumentation is extension. Present the opponent’s view in the broadest terms, attack and refute one detailed aspect of it, and then declare general victory. A recurring theme in the critical literature on ST is that many practicing scientists fail to comprehend what the p value means. Therefore – so goes the illogical conclusion – ban the p value. It’s like blaming the victim. Oh, Arlene always dressed so provocatively; no wonder she was raped (this is a reference to the Dice Man, who was not a Bayesian, but that is another story).
A reader suggested that I do "not seem to understand that a result can be statistically significant but practically irrelevant," to which I retort that "questions of practical relevance are orthogonal to any statistical method, hence they don't affect the debate between Bayesians and frequentists. Conversely, data evaluation from the perspective of practical relevance alone, without regard to their statistical properties, raises the question of whether we need data at all."
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
Krueger, J. I. (2016). Schopenhauer talks back. Psychology Today Online.
Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology: Educational Psychology.
Ortega, A., & Navarrete, G. (2017). Bayesian hypothesis testing: An alternative to null hypothesis significance testing (NHST) in psychology and social sciences. In J. P. Tejedor (ed.), Bayesian inference (pp. 235-254). http://dx.doi.org/10.5772/intechopen.70230