The mother of all decision heuristics.
Posted Mar 18, 2012
If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
Psychologists remember Tom W. as the most famous fictional graduate student of the last century. Kahneman and Tversky (1973, p. 238) described him as "high in intelligence, although lacking in creativity. He has a need for order and clarity, and for neat and tidy systems in which every detail finds its appropriate place. [etc. etc.]." Kahneman and Tversky asked one group of students (the similarity group) to judge how similar Tom W. was to the typical graduate student in nine fields of study. They asked another group of students (the prediction group) to rank the nine fields according to their likelihood of being Tom W.'s field. They asked a final group of students (the base rate group) to rank these fields according to the relative percentages of graduate students enrolled in them.
The findings are legendary. Judgments of similarity predicted predictions almost perfectly (r = .98). Tom W. looked like a computer science student, i.e., his personality sketch was representative of the stereotype, and he did not look like the typical humanities student. Base rate estimates did not predict predictions (r = -.61). At the time, there were far more humanities students than computer science students, and that should have had a positive impact on predictions. The negative correlation between predictions and base rates does not mean, however, that respondents actively decided against base rates; they simply ignored them. When the correlation between predictions and base rates is computed while controlling for similarity judgments, it is about zero.
What should respondents have done instead? They should have integrated similarity information with base rate information. Lacking the information that would have allowed them to do this precisely, they could have split the difference. Had they done that, their predictions would have ended up being correlated at about .5 with base rates and similarity each. As a result, respondents would have concluded that Tom W. was most likely a student of business administration and least likely a student of medicine or physical and life sciences. These more even-handed predictions would have been correlated at about .4 with actual predictions. In short, respondents' predictions were indeed biased, as Kahneman and Tversky claimed, but they were not horrible.
Splitting the difference is a crude, though serviceable, way to integrate predictive information (Dawes, 1979). A more principled way of integrating information involves Bayesian updating. This method bases predictions on base rates after revising them in light of relevant evidence. In Bayesian jargon, p(C) is the prior probability, or base rate, that a case (here: Tom W.) belongs to a category (here: a particular field of study). Next, p(D|C) is the probability that a member of category C can be described with personality sketch D. This conditional probability represents representativeness. Each p(C) can then be multiplied by its corresponding p(D|C), and the sum of these products is the overall probability p(D) with which the description D is observed in the student population.
Now we can compute for each field of study (category) the probability that someone belongs to it if he fits description D. This is the posterior or revised probability of the category in light of the evidence, or p(C|D). There is no single correct way to do this with the ranking data Kahneman and Tversky provided, but we can approximate a solution by transforming the ranks to a scale ranging from 0 to 1 to represent values of p(D|C) and by making the simplifying assumption that the average of these values as well as the overall value of p(D) is .5. Now we can compute the nine Bayesian predictions as p(C|D) = p(C) x p(D|C) / p(D). We find that Tom W. was most likely a business student (p = .19) and least likely a student of library science (p = .04). Across the nine fields of study, similarity (p(D|C)) turns out to be a slightly better predictor of optimal prediction, r = .43, than base rates, r = .37.
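The arithmetic behind these Bayesian predictions can be sketched in a few lines of Python. The three fields and all of the numbers below are hypothetical placeholders, not Kahneman and Tversky's data, and the sketch computes p(D) as the sum of the products p(C) x p(D|C) rather than fixing it at .5 as in the approximation above.

```python
# Hypothetical base rates p(C) and similarities p(D|C) for three made-up fields.
# These numbers are illustrative assumptions, not data from the 1973 study.
base_rates = {"humanities": 0.60, "business": 0.30, "computer science": 0.10}
similarity = {"humanities": 0.10, "business": 0.40, "computer science": 0.90}


def bayes_posteriors(prior, likelihood):
    """Return p(C|D) for every category C, given p(C) and p(D|C)."""
    # Numerator of Bayes's Theorem for each category: p(C) x p(D|C).
    joint = {c: prior[c] * likelihood[c] for c in prior}
    # p(D) is the sum of those products across all categories.
    p_d = sum(joint.values())
    return {c: joint[c] / p_d for c in joint}


posteriors = bayes_posteriors(base_rates, similarity)
for field, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{field:18s} p(C|D) = {p:.2f}")
```

With these made-up inputs, the field that is neither the most common nor the most similar comes out on top, which is the kind of compromise that integrating both cues tends to produce.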
This is a surprising result and it suggests an equally surprising vindication of the ordinary prognosticator. Faced with the difficult task of making sense of a personality sketch that was more typical of rare than of common fields of study, and being unwilling to let the two conflicting predictors (base rates and similarity) cancel each other out, respondents may have resorted to using the more powerful cue, and that happened to be similarity. When optimal prediction requires the integration of multiple cues, simply taking the best is not a bad idea (Gigerenzer & Goldstein, 1996).
Alas, similarity is not always the best cue. In fact, base rates usually are. This is not an assertion of empirical fact but an implication of Bayesian logic. Suppose base rates, p(C), and similarity, p(D|C), both have equal chances to take any value from 0 to 1, and that they are independent of each other. In this complete and unbiased survey situation, the expected correlation between base rate and prediction is .8, whereas the correlation between similarity and prediction is a mere .38. Base rates should indeed dominate, but only when the evidence has not been selected to be more telling than the base rate, as it was in Kahneman & Tversky's study.
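A small simulation gives a feel for this claim. It is only a sketch: the text does not say how p(D|~C) was treated, so the code assumes it is also drawn uniformly and independently from 0 to 1. The seed and the sample size are arbitrary choices.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2012)  # arbitrary seed, for reproducibility only
base, sim, post = [], [], []
for _ in range(100_000):
    p_c = rng.random()        # base rate p(C)
    p_d_c = rng.random()      # similarity p(D|C)
    p_d_not_c = rng.random()  # ASSUMPTION: p(D|~C) is also uniform and independent
    p_d = p_c * p_d_c + (1 - p_c) * p_d_not_c
    if p_d == 0:              # skip the (practically impossible) degenerate draw
        continue
    base.append(p_c)
    sim.append(p_d_c)
    post.append(p_c * p_d_c / p_d)  # Bayesian prediction p(C|D)

r_base = pearson(base, post)
r_sim = pearson(sim, post)
print(f"r(base rate, prediction)  = {r_base:.2f}")
print(f"r(similarity, prediction) = {r_sim:.2f}")
```

Under these assumptions the base rate should come out as the clearly stronger predictor. The exact correlations depend on how p(D|~C) is modeled.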
Why do base rates dominate in a complete and unbiased survey? The answer is subtle and it requires a moment of contemplation. Consider Bayes's Theorem.
p(C|D) = p(C) x p(D|C) / p(D), i.e.,
p(C) x p(D|C) / [p(C) x p(D|C) + p(~C) x p(D|~C)]
Note that p(C|D) becomes larger with both p(C) and p(D|C) because both are parts of the numerator of the Bayes ratio. When p(C) becomes larger it strongly affects the numerator; it does not have much of an effect on the denominator because p(~C) = 1-p(C). As p(C) gets larger, p(~C) gets smaller, which reduces the overall effect on the denominator. In contrast, when p(D|C) gets larger, it not only affects the numerator, it also affects the denominator. Both become larger, which reduces the correlation between similarity, p(D|C), and prediction, p(C|D).
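One way to see the asymmetry is to sweep one input at a time while holding the others fixed. The sketch below is an illustration under assumed values: the inputs that are held constant are pinned at .5, a choice that does not come from the text.

```python
def posterior(p_c, p_d_given_c, p_d_given_not_c):
    """Bayes's Theorem for two categories, C and ~C."""
    numerator = p_c * p_d_given_c
    return numerator / (numerator + (1 - p_c) * p_d_given_not_c)


grid = [i / 10 for i in range(1, 10)]  # .1, .2, ..., .9

# Vary the base rate while the evidence is held neutral (assumed .5 and .5).
by_base_rate = [posterior(p, 0.5, 0.5) for p in grid]

# Vary similarity while p(C) and p(D|~C) are held at .5 (assumed).
by_similarity = [posterior(0.5, s, 0.5) for s in grid]

range_base = max(by_base_rate) - min(by_base_rate)
range_sim = max(by_similarity) - min(by_similarity)
print(f"prediction moves by {range_base:.2f} across base rates")
print(f"prediction moves by {range_sim:.2f} across similarity values")
```

Across the same grid of inputs, the base rate moves the prediction over a wider range than similarity does, because a rising p(D|C) inflates the denominator along with the numerator.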
Leaving Tom behind, Kahneman & Tversky pressed on with Dick and Jack (not Harry). Whereas Jack sounded like an engineer (he was conservative, careful, and liked mathematical puzzles), Dick was a regular Joe. He had a family and his colleagues liked him. For each target person, respondents had to estimate the probability that he was an engineer (vs. a lawyer). In this study, base rates were explicitly provided, with the prior probability of being an engineer being either .3 or .7. Kahneman & Tversky reported that over descriptions, the judged probability of the person being an engineer went up with the diagnosticity of the description—as it should—but that it was unaffected by the base rate. Dick's description, being nondiagnostic of engineers or lawyers, should have led to predictions that were equal to the base rates. Instead, Dick was estimated to be equally likely a lawyer or an engineer because his description looked equally uninformative with respect to both.
The general impression one gets from textbook descriptions is that judgments by representativeness lead to error (they can), and that the size of the error increases with the degree to which the evidence is diagnostic of or similar to the judged outcome. This graph shows that the latter impression is false. The largest errors are made for Dick, whose description is entirely unrepresentative of either engineers or lawyers. The sum of the errors is .4 ([.5 - .3] + [.7 - .5]). As we move away from the coordinates of .5:.5, the sum of the errors shrinks. At the limit, the evidence dictates the prediction. When a description is a perfectly positive fit with one target category and a perfectly negative fit with the other, base rates no longer matter.
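The shrinking error can be checked with a short sketch. It rests on two assumptions that should be read as such: a respondent who ignores base rates is modeled as reporting p(D|C) itself, and a description's fit to lawyers is taken to be the complement of its fit to engineers.

```python
def posterior(p_c, p_d_given_c, p_d_given_not_c):
    """Bayes's Theorem for two categories (engineer vs. lawyer)."""
    numerator = p_c * p_d_given_c
    return numerator / (numerator + (1 - p_c) * p_d_given_not_c)


def summed_error(p_d_given_c, base_rates=(0.3, 0.7)):
    """Sum of absolute errors across the two base-rate conditions."""
    judged = p_d_given_c  # ASSUMPTION: the judgment equals similarity alone
    return sum(
        abs(judged - posterior(b, p_d_given_c, 1 - p_d_given_c))
        for b in base_rates
    )


for s in (0.5, 0.6, 0.8, 0.95, 1.0):
    print(f"p(D|engineer) = {s:.2f}  summed error = {summed_error(s):.2f}")
```

The first row is Dick, whose uninformative description yields the summed error of .4 given above. The last row is the limiting case in which the evidence dictates the prediction and the error vanishes.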
Why is it tempting to think that greater representativeness leads to greater error? My hunch is that this impression stems from a common focus on low base rates. Many of the canonical studies in the research literature were concerned with the overestimation of small risks. In medical diagnosis, for example, a positive test result is evidence representative of a disease. If physicians or patients conclude that the probability of the disease given a positive test result is the same as the probability of a positive test result given the disease, they are making an error. The error will be large to the extent that the disease is rare and to the extent that the evidence is representative (i.e., to the extent that p(positive test | disease) is high). The graph makes clear, however, that for every overestimation given a low base rate, there is an underestimation given a high one. It is the sum of the errors that gives the full picture of how far off one drifts when relying only on evidence.
Kahneman & Tversky made one interesting simplification when drawing up the graph. They assumed that a description's representativeness of one category (e.g., engineer) is the complement of its representativeness for the other (i.e., lawyer). In other words, they assumed that p(D|~C) = 1 - p(D|C), which makes the graph pretty and mirror-imaginary. What happens when the assumption of complementariness does not hold? Suppose, for example, that descriptions vary only in how representative they are for one profession, while being neutral with respect to the other. We find that the errors are now larger. On average, they rise from .13 to .19. This is counterintuitive because Kahneman & Tversky's complementariness assumption creates likelihood ratios (p(D|C)/p(D|~C)) ranging from 0 to infinity, whereas holding p(D|~C) to .5 sets an upper limit of 2. The explanation for this discrepancy lies in the fact that if the base rate is low, its revision is larger when the similarity is divided by its complement than when it is divided by .5. To illustrate, if p(C) = .3 and p(D|C) = .95, then p(C|D) = .45 when p(D|~C) = .5 and .89 when p(D|~C) = .05. In other words, Kahneman & Tversky's assumption of complementary values of representativeness and counter-representativeness downplayed the potential magnitude of base rate neglect.
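The worked example, and the direction of the change in average error, can be reproduced with a short sketch. The grid of similarity values, the two base rates (.3 and .7), and the modeling of the judgment as p(D|C) itself are assumptions made here for illustration, so the averages need not match the quoted figures exactly.

```python
def posterior(p_c, p_d_given_c, p_d_given_not_c):
    """Bayes's Theorem for two categories, C and ~C."""
    numerator = p_c * p_d_given_c
    return numerator / (numerator + (1 - p_c) * p_d_given_not_c)


# The worked example: p(C) = .3, p(D|C) = .95.
neutral_case = posterior(0.3, 0.95, 0.5)         # p(D|~C) held at .5
complementary_case = posterior(0.3, 0.95, 0.05)  # p(D|~C) = 1 - p(D|C)
print(round(neutral_case, 2), round(complementary_case, 2))


def mean_error(counter_fit, base_rates=(0.3, 0.7), steps=99):
    """Average |judgment - Bayesian prediction| over a grid of p(D|C).

    counter_fit maps p(D|C) to p(D|~C).
    ASSUMPTION: the judgment that ignores base rates equals p(D|C).
    """
    errors = []
    for i in range(1, steps + 1):
        s = i / (steps + 1)  # p(D|C) = .01, .02, ..., .99
        for b in base_rates:
            errors.append(abs(s - posterior(b, s, counter_fit(s))))
    return sum(errors) / len(errors)


complementary = mean_error(lambda s: 1 - s)
neutral = mean_error(lambda s: 0.5)
print(f"mean error, complementary fit: {complementary:.2f}")
print(f"mean error, neutral counter-fit: {neutral:.2f}")
```

Under these assumptions the neutral case produces the larger average error, in line with the argument that the complementariness assumption understates base rate neglect.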
The phenomenon of base rate neglect has stirred up controversy in cognitive and social psychology for 40 years now. We now know more about when, why, and under what conditions base rates are used, and what we can do to encourage their use. Some have suggested that we often use base rates when we should not, as in the case of stereotyping. According to this argument, stereotypes are base rates, and the moral imperative is not to use them. When we do use them, we appear to be bucking morality and cognitive psychology at the same time. A comment is in order. Base rate neglect occurs in the context of categorization. Given some descriptive information, we place a person in a social category without regard to the category's size. This is, in fact, a form of stereotyping. Computer scientists are stereotyped as nerdy. Here is a nerdy person. He is probably a computer scientist.
Base rate neglect is not, however, a feature of stereotype application. In stereotype application, we update a probability that is already conditional. The probability that someone is nerdy given that he is a computer scientist is already perceived to be high. This is the stereotype. Note that it is the type of conditional probability that we treated as the index of representativeness above. Now we learn that the person did a very nerdy thing. He wrote some clever code. This calls for a second round of revision. The probability that he is a computer scientist given that he is nerdy and wrote a clever piece of code is higher than the probability that he is a computer scientist given that he is nerdy. Additional behavioral information affects final judgments, but so do base rates (Krueger & Rothbart, 1988). This is stereotype (base rate) application, not neglect.
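The two rounds of revision can be sketched as repeated Bayesian updating. Every number below is hypothetical, and the sketch assumes that the two pieces of evidence (being nerdy, writing clever code) are conditionally independent given the category, which the text does not claim.

```python
def update(prior, p_e_given_c, p_e_given_not_c):
    """One round of Bayesian revision for evidence E and category C."""
    numerator = prior * p_e_given_c
    return numerator / (numerator + (1 - prior) * p_e_given_not_c)


# All values are hypothetical and chosen only for illustration.
base_rate = 0.05                             # p(computer scientist)
after_nerdy = update(base_rate, 0.8, 0.2)    # revise on "he is nerdy"
after_code = update(after_nerdy, 0.6, 0.1)   # revise again on "wrote clever code"

print(f"p(CS)                     = {base_rate:.2f}")
print(f"p(CS | nerdy)             = {after_nerdy:.2f}")
print(f"p(CS | nerdy, clever code) = {after_code:.2f}")

# The same evidence with a different base rate ends up somewhere else,
# which is what it means for base rates to be applied rather than neglected.
higher_start = update(update(0.5, 0.8, 0.2), 0.6, 0.1)
print(f"same evidence, base rate .5 = {higher_start:.2f}")
```

Each new piece of behavioral information raises the probability, yet the final judgment still depends on where the base rate started.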
Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650-669.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.
Krueger, J., & Rothbart, M. (1988). Use of categorical and individuating information in making inferences about personality. Journal of Personality and Social Psychology, 55, 187-195.