Sir Francis Galton found the vox populi by averaging the voices of individuals. When farmers estimated the weight of an ox at a country fair, the average of their estimates was more accurate than the estimates of most individual farmers. To Galton (1907), this finding could not have been surprising because the average is the most regressive prediction. When individual estimates are independent of one another, their average has a good chance of being the most accurate prediction (see Galton, 1886, on regression, and Fiedler & Krueger, 2012, for a recent overview).
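A small numerical sketch may make the aggregation claim concrete. The true weight and the five estimates below are hypothetical values chosen for illustration, not Galton's data, and the function name is our own.

```python
from statistics import mean

def crowd_vs_individuals(estimates, truth):
    """Return the absolute error of the averaged estimate and the share
    of individuals whose own absolute error is larger than that."""
    crowd_error = abs(mean(estimates) - truth)
    individual_errors = [abs(e - truth) for e in estimates]
    beaten = sum(err > crowd_error for err in individual_errors)
    return crowd_error, beaten / len(estimates)

# Hypothetical numbers for illustration only (not Galton's data).
truth = 1200
estimates = [1050, 1120, 1230, 1300, 1350]

crowd_error, share_beaten = crowd_vs_individuals(estimates, truth)
print(crowd_error, share_beaten)
```

In this made-up sample the errors partly cancel because some estimates are too low and others too high, so the average lands closer to the truth than any single estimate.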
In a provocative paper, Herzog & Hertwig (HH, 2009) showed that the aggregation effect (vox populi, wisdom of the crowd) can be observed within individuals. HH asked respondents to provide a second round of estimates after carefully contemplating how and why the first estimates could have been wrong. They then averaged the two estimates for each individual and re-assessed accuracy.
HH noted that the average estimate will be better than the first estimate, E1, if the second estimate, E2, falls within the gain range. If E1 is below the true value, T, then E1 is the lower bound of the gain range and E1 + 4(T - E1) is the upper bound. For example, if the question is when Harvard University was founded (T = 1636) and E1 = 1620, then the inner crowd will be wise unless E2 > 1683 or E2 < 1621. Analogously, if E1 is greater than T, E1 is the upper bound of the gain range and E1 + 4(T - E1) is the lower bound.
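The gain range can be checked with a few lines of code. This is a minimal sketch using the Harvard example from the text; the function names are ours, not HH's.

```python
def gain_range(e1, t):
    """Open interval of E2 values for which the average of E1 and E2
    is closer to the true value T than E1 alone."""
    far_bound = e1 + 4 * (t - e1)
    return (min(e1, far_bound), max(e1, far_bound))

def averaging_helps(e1, e2, t):
    """True if (E1 + E2) / 2 has a smaller absolute error than E1."""
    return abs((e1 + e2) / 2 - t) < abs(e1 - t)

# Harvard example: T = 1636, E1 = 1620.
low, high = gain_range(1620, 1636)
print(low, high)

# Whole-year estimates just inside and just outside the range.
for e2 in (1620, 1621, 1683, 1684):
    print(e2, averaging_helps(1620, e2, 1636))
```

The bounds themselves are excluded: an E2 that sits exactly on a bound yields an average that is as far from T as E1 was, so there is no gain.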
First estimates in Herzog & Hertwig
HH used 40 historical dates as T and recruited 50 respondents, thus obtaining a total of 2,000 estimates at both stages of the procedure. Figure 1 shows a histogram for the deviations E1 – T with T set to 0 and bin sizes of 20 years. On average, estimates were off by 152 years (SD = 118). The average of the signed values of E1 yielded a smaller deviation (M = 106, SD = 116), confirming that aggregation of many independent estimates can yield an accuracy gain of moderate size. In standard units, the effect size, obtained by dividing the difference between these averages by the average of the standard deviations, was .39.
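The effect size arithmetic described above can be reproduced directly from the reported means and standard deviations. The function name is ours; the formula is the one stated in the text.

```python
def standardized_difference(m1, sd1, m2, sd2):
    """Difference between two averages divided by the average of
    their standard deviations."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

# Mean absolute deviation of E1 (152, SD 118) versus the deviation of
# the averaged signed values (106, SD 116), as reported in the text.
effect = standardized_difference(152, 118, 106, 116)
print(round(effect, 2))
```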
Second estimates in Herzog and Hertwig
Figure 2 shows a histogram for the values of E2 when these estimates are scaled to a gain range bounded by -1 and 3. Fifty-six percent of E2 lay inside the gain range, which suggests that averaging E1 and E2 will improve accuracy.
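The text does not spell out the scaling formula. One linear rescaling that is consistent with the stated bounds, offered here as an assumption, is (E2 - T) / (T - E1): it maps E1 to -1, the true value to 0, and the far bound E1 + 4(T - E1) to 3, whether E1 lies below or above T.

```python
def scale_to_gain_range(e1, e2, t):
    """Rescale E2 so that E1 maps to -1, T to 0, and the far bound of
    the gain range to 3. Assumed formula, not taken from the source."""
    if e1 == t:
        raise ValueError("scaling is undefined when E1 equals T")
    return (e2 - t) / (t - e1)

def inside_gain_range(e1, e2, t):
    """True if the scaled E2 falls strictly between -1 and 3."""
    return -1 < scale_to_gain_range(e1, e2, t) < 3

# Harvard example from the text: T = 1636, E1 = 1620.
for e2 in (1620, 1636, 1650, 1684):
    print(e2, scale_to_gain_range(1620, e2, 1636))
```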
Averages of E1 and E2 in Herzog and Hertwig
Figure 3 shows a histogram for the bootstrapped estimates, which were obtained by averaging E1 and E2 for each individual case. The average of the signed deviation scores (M = 97, SD = 156) was lower than the average obtained with E1 alone, but more importantly, the average absolute deviation dropped from 152 to 146 (SD = 112). Scaled in standard units, the average accuracy gain was small (d = .11), but it was statistically significant thanks to the large number of observations, t(1999) = 5.07, p
We applied Herzog and Hertwig’s method to an area of social judgment. Respondents (N = 127) read 10 personality statements (e.g., “My hardest battles are with myself.”) and made two consensus estimates (What percentage of people agree with the statement?).
First estimates in social prediction study
Figure 4 shows a histogram for the E1 deviation scores with a bin size of 5 percent. The average size of the absolute estimation errors was 21 (SD = 15 over 127 x 10 cases). When the signed values of E1 were averaged, the result was a smaller deviation (M = -5.6, SD = 25). In standard units, the effect size was .77, which reflects a large accuracy gain.
Second estimates in social prediction
Figure 5 shows a histogram for the values of E2 scaled to a gain range bounded by -1 and 3. Fifty-one percent of these estimates lay inside the gain range.
Averages of E1 and E2 in social prediction
Figure 6 shows a histogram for the bootstrapped estimates. The average of the signed deviations (M = -6, SD = 23) was virtually the same as the average obtained with E1 alone. Nonetheless, the average absolute deviation dropped from 21 to 19 (SD = 14). The effect size of this gain (d = .15) was again small but significant, t(1269) = 5.42, p
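The text does not say how the d values for the within-person gains were computed. Assuming the t tests are paired comparisons over all cases, so that N = df + 1, the common conversion d = t / sqrt(N) reproduces both reported values. The sketch below is a consistency check under that assumption, not a description of the original analysis.

```python
import math

def paired_d_from_t(t_value, df):
    """Convert a paired-samples t statistic to a standardized effect
    size using d = t / sqrt(N), with N = df + 1 (assumed convention)."""
    n = df + 1
    return t_value / math.sqrt(n)

# Reported tests: t(1999) = 5.07 for the historical dates and
# t(1269) = 5.42 for the social prediction study.
d_dates = paired_d_from_t(5.07, 1999)
d_social = paired_d_from_t(5.42, 1269)
print(round(d_dates, 2), round(d_social, 2))
```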
Though small in size, the effects of the internal crowd make an important psychological point. Individuals can improve their perceptions, memories, and predictions by generating diverse judgments and averaging them. The strategy recommends itself in betting contexts, when money is at stake. A farmer who engaged in dialectical bootstrapping could have outperformed most of his competitors. It is improbable that people would think of this strategy on their own, and it remains to be seen how they respond when asked to use it. Some may argue that they would rather choose between E1 and E2 than average them.
We thank Stefan Herzog and Ralph Hertwig for sharing their data with us.
Fiedler, K., & Krueger, J. I. (2012). More than an artifact: Regression as a theoretical construct. In J. I. Krueger (Ed.), Social judgment and decision-making (pp. 171-189). New York, NY: Psychology Press.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Galton, F. (1907). Vox populi. Nature, 75, 450-451.
Herzog, S. M., & Hertwig, R. (2009). The wisdom of many in one mind: Improving individual judgments with dialectical bootstrapping. Psychological Science, 20, 231–237. doi:10.1111/j.1467-9280.2009.02271.x
White, C. M., & Antonakis, J. (2013). Quantifying accuracy improvement in sets of pooled judgments: Does dialectical bootstrapping work? Psychological Science.