Testing, Testing!

How to make most people worse (or better) than average

Posted Jun 19, 2012

There are two major approaches to testing. One approach is to measure individual differences. The other is to measure whether a person has reached a minimum standard of competence. The first approach is associated with the measurement of presumably stable, trait-like personal properties, such as intelligence. The second is associated with the assessment of skill acquisition, as in driving a motor vehicle.

In education, both types of testing can be found, and they are often confounded. Students like to ask if the scores on an exam will be “curved.” They may also want to know what minimum score they need to succeed, not realizing that they are muddling the two schools of measurement. The educators who respond to such questions may not realize it either.

Most educational testing involves a distribution of points that can be grouped into distinct categories of grades. The point system suggests that the assessment of individual differences is the primary goal of testing. Students not only wish to do well, they also want to beat the minimum score needed to pass. If they do not, they may look for ways to take the exam again. This is where the trouble starts.

Suppose you give a psychometrically well-behaved test with normally distributed scores. The mean score is 100, the standard deviation is 10, and test-retest reliability is .8. Now suppose that all testees with below-average scores (and half of those with a score of 100) exercise an option to withdraw. It is as if they never took the test (some colleges allow such things). Now the average score of the remaining students is about 108, and ca. 58% of these remaining students see their scores go from above to below average. Now suppose the low-scoring half lobby to get a second chance, and succeed. Given the test’s less-than-perfect reliability, their scores regress toward the mean: the average retest score of this group will be about 94, up from about 92 on the first attempt, raising the population average to about 101. This may not seem like much, but it is a lawful distortion, and its size would increase with lower test reliability.
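The arithmetic of this scenario can be checked with a few lines of Python. This is only a sketch under stated assumptions: scores are taken to be normally distributed, retest scores are assumed to regress toward the mean in proportion to the reliability, and the function name `withdrawal_and_retest` is made up for this illustration.

```python
from math import pi, sqrt
from statistics import NormalDist


def withdrawal_and_retest(mean=100.0, sd=10.0, reliability=0.8):
    """Expected values for the withdraw-or-retest scenario (normal model)."""
    # Average distance of each half of a normal distribution from its center.
    offset = sd * sqrt(2.0 / pi)
    upper_mean = mean + offset  # average of the students who stay in
    lower_mean = mean - offset  # average of the students who withdraw

    # Share of the remaining (upper-half) students who now fall below
    # the new, higher average.
    scores = NormalDist(mean, sd)
    share_below_new_mean = (scores.cdf(upper_mean) - 0.5) / 0.5

    # Regression to the mean: the expected retest score keeps only the
    # fraction `reliability` of the first-test deviation from the mean.
    lower_retest_mean = mean + reliability * (lower_mean - mean)

    # Both halves are equally large, so the population mean is their midpoint.
    population_mean = (upper_mean + lower_retest_mean) / 2.0
    return upper_mean, share_below_new_mean, lower_retest_mean, population_mean


for r in (0.8, 0.5):
    upper, share, retest, overall = withdrawal_and_retest(reliability=r)
    print(f"reliability={r}: stay-in mean={upper:.1f}, "
          f"share below it={share:.0%}, retest mean={retest:.1f}, "
          f"population mean={overall:.1f}")
```

Running the loop with a lower reliability shows the population mean drifting further above 100, which is the sense in which the distortion grows as the test gets less reliable.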

The size of the bias increases with repeated test-taking, actual additional learning, and vacuous learning that is specific to the test but dissociated from the underlying construct the test is designed to measure (i.e., repeated testing capitalizes on unreliability and it undermines validity). At the limit, the initially low-scoring students persist until all have obtained a score of 100. Now the population mean has risen to about 104, but the distribution is no longer normal. Now, about 66% of the students have below-average scores. Everyone gets an apple for passing, but fewer can say they have excelled relative to the rest of the field. Putting these arguments in reverse, we can see how by selectively retesting only the high scorers, we can build a society in which most people are better than average.
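The limiting case can be worked out the same way. The sketch below assumes normally distributed first-attempt scores and assumes that every initially low scorer ends up at exactly the original mean; `retest_until_pass` is a hypothetical name used only here.

```python
from math import pi, sqrt
from statistics import NormalDist


def retest_until_pass(mean=100.0, sd=10.0):
    """Limiting case: each initially low scorer ends up at exactly `mean`."""
    # The upper half keeps its first-attempt scores.
    upper_mean = mean + sd * sqrt(2.0 / pi)

    # Half the population now sits at `mean`, the other half averages
    # `upper_mean`, so the new population mean is the midpoint.
    new_mean = (mean + upper_mean) / 2.0

    # Below the new mean: the whole retested half, plus those upper-half
    # students whose scores lie between the old mean and the new one.
    scores = NormalDist(mean, sd)
    share_below = 0.5 + (scores.cdf(new_mean) - scores.cdf(mean))
    return new_mean, share_below


new_mean, share_below = retest_until_pass()
print(f"new mean={new_mean:.1f}, share below average={share_below:.0%}")
```

The majority below the average comes from the pile-up of retested students at the passing score, not from any change among the high scorers.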

Retesting would be fine and even desirable if it were done for everyone and the scores averaged for each. The high scorers will not, however, agree to it. Psychometrically, though, this is a good idea because aggregation reduces unreliability and hence increases validity. The distribution of average scores will be narrower than either of the two individual distributions of test scores. It will appear as though the lowest scorers improved and that the highest scorers deteriorated. It is not, however, a good idea to allow testees to retain their best score and treat it as though it were the most valid one. This method converts random error to systematic bias. You could avoid this bias by lowering the passing threshold, which would make everyone look more competent.
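To put rough numbers on the difference between averaging and keeping the best score, here is a small sketch. It assumes the classical test theory model (observed score = true score + independent, normally distributed error, with reliability equal to the share of true-score variance) and two parallel tests; the function name `two_test_summaries` is invented for this example.

```python
from math import pi, sqrt


def two_test_summaries(sd=10.0, reliability=0.8):
    """Compare averaging two parallel tests with keeping the better score."""
    # Split the observed variance into true-score and error components.
    true_var = reliability * sd ** 2
    error_var = (1.0 - reliability) * sd ** 2

    # Averaging two tests halves the error variance, so the distribution
    # of averages is narrower than that of a single test.
    average_var = true_var + error_var / 2.0
    sd_of_average = sqrt(average_var)
    reliability_of_average = true_var / average_var

    # Keeping the better of two scores adds the expected maximum of two
    # independent normal errors, which is error_sd / sqrt(pi).
    best_of_two_bias = sqrt(error_var) / sqrt(pi)
    return sd_of_average, reliability_of_average, best_of_two_bias


sd_avg, rel_avg, bias = two_test_summaries()
print(f"SD of averaged scores={sd_avg:.2f}, "
      f"reliability of the average={rel_avg:.2f}, "
      f"upward bias from keeping the best score={bias:.2f} points")
```

Under these assumptions the averaged score is both narrower and more reliable than a single test, while the best-of-two rule shifts every score upward by a constant expected amount that reflects nothing but measurement error.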