Fifty million Frenchmen can be wrong. ~Cole Porter (paraphrased)
The Flat Earth Society has few members these days, but the Homeopathic Society is well enrolled. There is good evidence for the sphericity of the Earth, but there is no good evidence for homeopathy. Many people agree that the Earth is a sphere, but there are also many who assert that homeopathy heals (however mysteriously).
What is the relationship between agreement (social consensus) and accuracy (truth)? To a realist, facts and thus truth exist independently of perception. To most empirical scientists—and this may surprise many non-scientists—findings become facts only when dignified by the collective nod of the scientific community of recognized experts. Ludwik Fleck knew this and Thomas Kuhn popularized it. The idea that findings become facts became a fact, as it were, only after Kuhn got a lot of academics to agree.
When things go well, reality drives social consensus. Imagine a jar filled with jelly beans. When everyone estimates the number of beans, the average estimate is the best stab at the truth—until someone actually counts the beans. Then, reality asserts itself such that everyone will converge on the same number and be correct.
Often things are trickier, especially in domains that psychologists and the rest of you care about. In judgments of personality, for example, a hard criterion of what a person is “really like” is often lacking. In such cases, many researchers settle for agreement among observers or agreement between observers and the target person as a proxy for accuracy. Reports of agreement data usually come with the cautionary note that although agreement does not guarantee accuracy, it is good enough as an approximation.
Why do these researchers accept agreement as a proxy for accuracy? One proposal is that agreement is correlated with accuracy when accuracy can be directly assessed (by counting behaviors or by performing some other objective measurement). By this logic, we can infer high (low) accuracy from high (low) agreement even in cases without direct measures of accuracy. We can predict accuracy from agreement by using the statistical method of regression. If, for example, agreement and accuracy are correlated at .8, and if we observe an agreement score of .7 (which is also a correlation), then our best prediction is that accuracy is .8 x .7 = .56. This method does not equate accuracy with agreement unless the two are already known to be identical (that is, perfectly correlated). Otherwise, the predicted level of accuracy is lower than the observed level of agreement.
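The arithmetic can be spelled out in a few lines. This is only a sketch, and it assumes what the example in the text implicitly assumes: that agreement and accuracy are expressed on the same standardized scale, so that the prediction is simply the correlation times the observed score. The function name predict_accuracy is made up for illustration.

```python
def predict_accuracy(agreement, r_agreement_accuracy):
    """Regression prediction of accuracy from agreement.

    Assumes both scores are on the same standardized scale, so the
    prediction is the observed score shrunk by the correlation.
    """
    return r_agreement_accuracy * agreement


# The example from the text: correlation of .8, observed agreement of .7.
predicted = predict_accuracy(0.7, 0.8)
print(round(predicted, 2))  # 0.56

# Only a perfect correlation lets agreement stand in for accuracy unchanged.
print(predict_accuracy(0.7, 1.0))  # 0.7
```

Any correlation short of 1 pulls the prediction toward zero, which is why the predicted accuracy falls below the agreement we started with.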
There is a more serious complication. The logic of regression works only if the agreement scores and the accuracy scores come from the same population (e.g., the same set of studies). This condition is not satisfied when we need it the most. For example, we may be able to measure both agreement and accuracy in the domain of intellectual performance because both observers’ perceptions and countable performance data (criteria) are available. For this domain, we can compute a correlation between agreement and accuracy over, say, different tasks. As we turn to the domain of personality, however, we look in vain for computable accuracy criteria, and so we turn to the proxy of agreement. The problem is that with accuracy scores lacking, we cannot know the correlation between agreement and accuracy. Is this correlation the same as it is in the domain of performance? Is it also .8? We cannot know, which means that if we proceed to make predictions using a correlation from a different domain, we are using a double proxy: Agreement sits in for accuracy, and a known agreement-accuracy correlation from one domain sits in for the unknowable correlation in another.
The fallback position is to point out that the correlation between agreement and accuracy is far more likely to be positive than negative, even if we do not and cannot know its size. This claim, though weak, has some appeal. All that is required for it to be true is that overall, perceptions are more likely to be accurate than inaccurate. If there is a reality out there that people are more likely to perceive correctly than incorrectly, they will also find themselves in agreement. Accuracy thus generates agreement (not the reverse).
Consider an example. Two psychologists, Al and Bert, observe a group of five (Cesar, Diane, Ed, Fay, and Giulio). The five talk at different rates (C > D > E > F > G). If A and B pick up on these individual differences, their judgments of extroversion will agree with each other and they will also be accurate (inasmuch as the relative rate of talking in this situation reflects the trait of extroversion). It is possible to imagine a world in which both A and B think that the order from greatest extroversion to greatest introversion is G > F > E > D > C, but what kind of a world is it that leads perceivers to systematically see the opposite of what is true? It is less of a stretch to imagine a world (or a situation) in which perceptions are uncorrelated with reality.
Now consider the structural limitations of reality that link agreement to accuracy. Suppose there are two perceivers whose judgments can be correlated with each other to give an agreement correlation. Also suppose there is a set of true values, which, when correlated with each perceiver’s judgments, yields two accuracy correlations. Next, suppose that both the agreement correlation and the mean accuracy correlation are sorted into a high and a low group, where high means .8 or above, and low means 0 or thereabouts.
There are four theoretical combinations: (1) high agreement – high accuracy, (2) high agreement – low accuracy, (3) low agreement – high accuracy, and (4) low agreement – low accuracy. Notice that combination (3) is not possible. If the judgments of A and B are uncorrelated, the averages of their judgments will have little variance, thereby preventing a correlation with another variable. Hence, there can be no accuracy if there is no agreement.
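There is another way to see why low agreement cannot coexist with high accuracy. It is not the argument made above, but a standard property of correlations: three variables (here judge A, judge B, and the true values) cannot be correlated in arbitrary ways, because their correlation matrix must have a non-negative determinant. Given the two accuracy correlations, this sets a floor under the agreement correlation. The sketch below computes that floor; the function name min_agreement is made up for illustration.

```python
import math


def min_agreement(r_a_truth, r_b_truth):
    """Smallest possible correlation between judges A and B, given how
    strongly each judge correlates with the true values.

    Follows from requiring a non-negative determinant for the 3 x 3
    correlation matrix of judge A, judge B, and the truth.
    """
    slack = math.sqrt((1 - r_a_truth**2) * (1 - r_b_truth**2))
    return r_a_truth * r_b_truth - slack


# Both judges highly accurate (.8 each): agreement cannot be near zero.
print(round(min_agreement(0.8, 0.8), 2))  # 0.28

# Both judges inaccurate (0 each): agreement is unconstrained from below.
print(min_agreement(0.0, 0.0))  # -1.0
```

With accuracy at .8 for both judges, agreement has to be at least about .28, so "0 or thereabouts" is out of reach. With no accuracy, agreement can be anything.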
In syllogistic form, we can assert that if there is accuracy, there is agreement. Therefore, without agreement, there can be no accuracy (modus tollens). It does not follow, however, that if there is agreement, there is accuracy (affirming the consequent).
In statistical form, we have a situation consisting of four cells in a 2 x 2 table. Three of these cells are filled, and one is empty (number 3: low agreement – high accuracy). For lack of better knowledge, we assume that the number of observations in the three filled cells is the same. The correlation (coefficient phi) between the two variables of agreement (high vs. low) and accuracy (high vs. low) is then .5.
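The value of .5 can be checked with the usual formula for phi in a 2 x 2 table. The sketch below puts one observation in each filled cell; any equal count would give the same result. The cell labels a, b, c, and d are a conventional notation, not something taken from the text.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of counts laid out as:

                       accuracy high   accuracy low
        agreement high       a              b
        agreement low        c              d
    """
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Equal counts in the three filled cells; the low agreement - high accuracy
# cell (c) is empty.
phi = phi_coefficient(a=1, b=1, c=0, d=1)
print(phi)  # 0.5
```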
A correlation has no direction. Some scholars take this to mean that we can infer accuracy from agreement just as well as we can infer agreement from accuracy. But look at it in terms of probabilities. The conditional probability of agreement given accuracy is p(agreement | accuracy) = 1.0, whereas the conditional probability of accuracy given agreement is p(accuracy | agreement) = .5. Now we see that we are dealing with a case of reverse inference. The reverse inference from agreement to accuracy is weaker than the forward inference from accuracy to agreement because of the difference in the base rates. The base rate of agreement is high (2/3) because agreement can occur regardless of accuracy. The base rate of accuracy is low (1/3) because accuracy requires agreement. So yes, once we see agreement, the probability of accuracy has risen from 1/3 to .5, but we have come to a place of perfect uncertainty. Do we want to bet on an outcome (accuracy) that will be obtained with a flip of a coin?
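The same three-cell table yields all of these probabilities. The sketch below simply counts cells, again assuming one observation per filled cell, and uses exact fractions so that the base rates come out as 2/3 and 1/3 rather than as rounded decimals. The variable names are chosen for illustration.

```python
from fractions import Fraction

# Counts keyed by (agreement, accuracy); the low-high cell is empty.
cells = {
    ("high", "high"): 1,
    ("high", "low"): 1,
    ("low", "high"): 0,
    ("low", "low"): 1,
}
total = sum(cells.values())

# Base rates.
p_agreement = Fraction(sum(n for (ag, _), n in cells.items() if ag == "high"), total)
p_accuracy = Fraction(sum(n for (_, ac), n in cells.items() if ac == "high"), total)

# Joint probability of high agreement and high accuracy.
p_both = Fraction(cells[("high", "high")], total)

# Forward and reverse conditional probabilities.
p_agreement_given_accuracy = p_both / p_accuracy
p_accuracy_given_agreement = p_both / p_agreement

print(p_agreement, p_accuracy)      # 2/3 1/3
print(p_agreement_given_accuracy)   # 1
print(p_accuracy_given_agreement)   # 1/2
```

The two conditional probabilities share the same numerator and differ only in the base rate they are divided by, which is the whole asymmetry.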
In psychometrics, the asymmetrical relationship between agreement and accuracy is known as the reliability–validity paradox (Brennan, 2001). The reliability (agreement) of a measurement sets an upper limit to its validity (accuracy). Valid measures must be reliable, but reliable measures need not be valid. Distressingly, increases in reliability can bring along a decrease in validity (Lord & Novick, 1968).
Agreement is a blessing and a curse. We need it to quench our epistemological thirst, but it is never quite enough.
Brennan, R. L. (2001). Some problems, pitfalls, and paradoxes in educational measurement. Educational Measurement: Issues and Practice, 20, 6-18.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.