Judgment Under Uncertainty: Statistics and Biases
Homo heuristicus goes to stats class.
Posted May 18, 2016
There are still some frequentists left. ~ Joe Austerweil, while mulling “a hairy” Bayesian problem
Significance testing is seen by many of its practitioners as the haven of objectivity, the heart of the scientific method, and the Holy Grail leading to career-defining discoveries. Data are gathered, a test statistic is computed, and the probability, under the null hypothesis, of a statistic at least this extreme is found. If this probability is less than .05, the null hypothesis is rejected. Something else, not nothing, is assumed to be going on. Typically that 'not nothing' is thought to be whatever treatment separated the experimental subjects from the controls. The method is objective in the sense that everyone who knows the drill gets the same result.
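The drill can be sketched in a few lines. Here is a minimal illustration (a one-sample z test with a known standard deviation; all the numbers are made up for the example):

```python
import math

def z_test_p(sample_mean, mu0, sigma, n):
    """Two-sided p value for a one-sample z test: the probability,
    under the null hypothesis (true mean == mu0), of a test statistic
    at least this extreme."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area
    return math.erfc(abs(z) / math.sqrt(2))

# made-up numbers: 100 subjects, sample mean 103 against a null mean of 100
p = z_test_p(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(round(p, 4), p < 0.05)  # prints 0.0455 True: the ritual says 'reject'
```

Everyone who runs this gets the same p; that is the objectivity. What the p means is another matter.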
‘Objective’ does not mean ‘valid.’ The validity of methods of significance testing has been questioned for a century (an early critique can be found in the Book of Job; see Job note). Yet, these methods prevail (at least for the moment; the party could be over tomorrow). Why? Gerd Gigerenzer (somewhere, sometime) observed that the use of the p value, that is, using the probability of the data under the null hypothesis, p(D|H), to infer the inverse, i.e., the probability of the hypothesis given the data, p(H|D), is an instance of judging by the representativeness heuristic. He did not elaborate, as far as I recall, so I will here.
Remember (or look up) that p(H|D) = p(D|H) * p(H) / p(D). The data do speak to the hypothesis. Their effect (the likelihood) must be multiplied by the prior probability of the hypothesis and divided by the overall probability of finding that kind of data (under whatever hypothesis). Reverend Bayes says thou shalt multiply and divide. Significance testing, however, the great seductress, tempts the researcher to leap directly from p(D|H) to p(H|D), and base rates be damned. This difference between using and ignoring background information is what distinguishes thinking from perceiving in Tversky and Kahneman’s work and in much of what they inspired.
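To see what the leap skips, here is the multiplication and division spelled out, with illustrative numbers of my own choosing (a risky hypothesis with a prior of .1, power of .8, and data 'significant' at .05 — assumptions, not results):

```python
def posterior_h1(prior_h1, p_data_h1, p_data_h0):
    """Bayes' rule: p(H1|D) = p(D|H1) * p(H1) / p(D), where
    p(D) = p(D|H1) * p(H1) + p(D|H0) * p(H0)."""
    prior_h0 = 1.0 - prior_h1
    p_data = p_data_h1 * prior_h1 + p_data_h0 * prior_h0
    return p_data_h1 * prior_h1 / p_data

# illustrative assumptions: prior .1, power .8, p(D|H0) = .05
print(round(posterior_h1(0.1, 0.8, 0.05), 2))  # prints 0.64
```

Despite a 'significant' p(D|H0) of .05, the null hypothesis retains a posterior probability of .36 under these assumptions. The leap from p(D|H) to p(H|D) is not a small one.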
The representativeness heuristic became famous for its definitional neglect, nay, dismissal, of base rates (priors). Let us listen to Tversky & Kahneman (TK; 1974): "Many of the probabilistic questions with which people are concerned [are of the type that asks] what is the probability that object A belongs to class B?” A refers to the findings of the study, and B is a potential underlying reality as described by the hypothesis. Then, “in answering such questions, people typically rely on the representativeness heuristic, in which probabilities are evaluated by the degree to which A is representative of B, that is, by the degree to which A resembles B.”
TK review six features of judgment by representativeness. Let’s see if they apply to significance testing and its practice.
1. Insensitivity to prior probability of outcomes. Does this apply? Yes. To a fault. Significance testing explicitly brackets out the prior probability of the null hypothesis, or any other hypotheses. Researchers may quietly contemplate the riskiness of their project (i.e., the chances of finding something as opposed to nothing), but they are not invited to formalize these contemplations and let them affect their inference about the hypothesis after they collected the evidence. In this sense, significance testing is even more robustly heuristical than the garden-variety representative thinking (er, perceiving) you and I settle for when wondering whether our daughter’s boyfriend belongs to the category of ‘jerks.’ He does not behave like a jerk, nor does he look like a jerk, ergo . . . and we ignore the size of the category of jerks, i.e., we ignore how probable it is a priori that the young man is a jerk. Incidentally, it is a bit odd that TK introduce the representativeness heuristic in terms of both its defining features and its outcomes. Base rate neglect seems to wear both hats.
2. Insensitivity to sample size. Significance testing is sensitive to sample size, so in this sense the method does not resemble the heuristic. The larger the sample, the more likely it is to detect an effect, if there is one. However, as TK note, many practitioners of significance testing show this kind of insensitivity. It is as if they think about one type of representativeness heuristic by using another.
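The method's own sensitivity can be made concrete. A rough sketch of the power of a two-sided z test for a fixed true effect (the effect size of .3 is an assumption chosen for illustration):

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power(effect_size, n, z_crit=1.96):
    """Approximate power of a two-sided z test: the probability of a
    'significant' result when the true standardized effect is effect_size."""
    shift = effect_size * math.sqrt(n)
    return phi(shift - z_crit) + phi(-shift - z_crit)

# the same modest effect, hunted with ever larger samples
for n in (20, 80, 320):
    print(n, round(power(0.3, n), 2))
```

Power climbs steadily with n; it is the researcher, not the test, who forgets this.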
3. Misconceptions about chance. Again, this is a problem of people rather than procedure. People have poor intuitions about chance, which is one reason for their vulnerability to exploitation by casinos, lottery mongers, and insurance saleswomen. Significance testing has assumptions about chance built in. They help produce the p value.
4. Insensitivity to predictability. Here, TK mean that people’s judgments are swayed by good stories. They predict an outcome (something positive or something negative) from the quality of the story while ignoring the reliability of the story, e.g., whether it is based on expert opinion or hearsay. Significance testing – and I am going out on a limb here – has what appears to be a similar (representative, as it were) feature. The inferences it suggests about the truth or falsity of the null hypothesis (i.e., the predictions) are based on the data only, and not on what other hypotheses are in play. It might just so happen that the p value under the null is low, but that the p value under an alternative hypothesis is far lower still, in which case a Bayesian would argue that there is relative evidence in favor of the null hypothesis.
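A toy case of such relative evidence, with a hypothetical observation and two hypothetical point hypotheses:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# hypothetical: an observation that is 'significant' against a null of N(0, 1) ...
x = 2.5
like_null = normal_pdf(x, mu=0.0)
# ... but astronomically less likely under a far-flung alternative, N(10, 1)
like_alt = normal_pdf(x, mu=10.0)
print(like_null / like_alt)  # a huge likelihood ratio: relative evidence for the null
```

The observation looks bad for the null in isolation, yet the null is by far the better of the two stories on offer.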
5. The illusion of validity. TK argue that reliance on representativeness fosters a false sense of validity. This would have to be so if people rely on a heuristic that is less than perfectly valid. If they had no illusion of validity, they wouldn’t be relying on the heuristic. At any rate, significance testing – as noted in the first sentence of this essay – seduces research folk to be illuded in the same way. Thinking that significance testing is the master tool for scientific discovery, they can only end up overconfident.
6. Misconceptions of regression. That’s a good one. Last but not lost. Looking for genius and finding little, Galton (Sir Francis) “discovered” regression (to the mean). The sons of outstanding men were just not as outstanding. Today we know regression as an essential feature of a probabilistic world. Yet, thinking representatively, we predict A from B as if the correlation between the two over cases were perfect even when it isn’t. In the context of significance testing, regression rears its head when researchers assume that significant findings will replicate. This is related to earlier points, and it’s mainly a problem of the tests’ users and only partly a problem of the p value; p does speak to its own replicability, but with a very low voice.
The rest of the story is this: TK intone in the long-forgotten discussion section of their famous paper, “It is not surprising that useful heuristics such as representativeness [. . .] are retained, even though they occasionally lead to errors in prediction and estimation.” There it is: TK themselves claimed that these heuristics are useful and that we should not be surprised that people use them. If significance testing is indeed – as I have attempted to show – a formalized version of the representativeness heuristic, there may still be some life left in it.
And what is meant by “useful”? A heuristic is useful if it produces sufficiently accurate judgments and choices at low cost. Just how well significance testing and its p value do in this regard is still being debated. After some simulation work, I am beginning to think that significance testing is not as bad as its critics make it out to be.
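For the curious, here is one way such a simulation might look (a sketch under assumed parameters – a base rate of true effects of .1 and power of .8 – not the simulations alluded to above):

```python
import random

def false_discovery_rate(prior_h1, power, alpha=0.05, trials=100_000, seed=1):
    """Monte Carlo sketch: among 'significant' results, what fraction are
    false alarms? The answer depends heavily on the base rate of true effects."""
    rng = random.Random(seed)
    hits = false_hits = 0
    for _ in range(trials):
        effect_is_real = rng.random() < prior_h1
        p_significant = power if effect_is_real else alpha
        if rng.random() < p_significant:
            hits += 1
            if not effect_is_real:
                false_hits += 1
    return false_hits / hits

# with few true effects around, over a third of 'discoveries' are false alarms
print(round(false_discovery_rate(prior_h1=0.1, power=0.8), 2))
```

Raise the base rate of true effects and the false-alarm fraction plummets, which is the Bayesian point of this essay in miniature.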
Job note. Job, steadfast man of legend, refused to reject the hypothesis that god was good despite overwhelming evidence to the contrary.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131.
Loose association: If you can stomach another, let’s say ‘remote,’ association, how about this one: Critics of significance testing charge that the method is biased against the null hypothesis, i.e., the idea that there is ‘not nothing’ is accepted too easily. Does this mean that the Null Hypothesis suffers from ‘rejection sensitivity’?
This post was ghost-written by Ovum Caput, Ph.D.