Why we test determines what we test
Before selecting a test, consider why you are testing at all.
Posted Aug 18, 2013
This blog is inspired by thoughts that I had while I read (for a review) Scott Barry Kaufman’s recently published book Ungifted (Kaufman, 2013). Those interested can see my review, which appears in the journal Intelligence (Hunt, 2013). While writing it I had some thoughts about the use of tests of cognitive ability, which include honestly labeled intelligence tests and other tests of cognitive ability marketed under a variety of terms, such as “scholastic aptitude.”
Kaufman was concerned with the consequences of an obvious fact: in an imperfect world, tests can misclassify people. He then worried, and I join him in worrying, about the errors that result when one or just a few measures are used to place students in different academic streams in the K-12 system. Understandably, Kaufman’s greatest concern was for students placed in the slow section; in the United States, the euphemistically named “special education” class. The most extreme example of such streaming is probably the British “Eleven-Plus” examination, which at one time was given to all students in England, Wales, and Northern Ireland. The results were used to stream students into academic and non-academic programs. The Eleven-Plus is much less widely used today, but in its heyday, roughly from 1945 until 1974, it was a potent part of the educational system.
Misclassification is a real issue, especially if decisions based on test scores are close to irrevocable. Fortunately, one of the great strengths of the American school system, compared to the European systems, is that in the United States there are many opportunities for ‘second chances,’ including adult education and the community college system. I personally know of two cases, at different universities, of people who returned to graduate school in their late 40s and subsequently won ‘young investigator’ awards shortly after receiving their doctorates, although both were well into middle age! To their credit, many European countries are now changing their own educational systems to open up second career choices such as these.
Both Kaufman and I are very much for the provision of such opportunities. However, our critiques are not directed at the tests; they are directed at the misuse of test data. This got me thinking about how the intended use of a test determines both how the test should be constructed and what use should be made of it.
Intelligence tests were originally used, and sometimes misused, as Kaufman complains, to identify students unlikely to be able to keep up with the standard educational system. Today a more sophisticated approach is to use tests to diagnose some cognitive condition (dyslexia, ADHD, or generally low mental ability) that affects the examinee’s ability to perform in either school or some aspect of society. The motivation for testing is to gather data so that a therapist, social worker, etc. can help the examinee. If we look at things from a systems engineering viewpoint, the test should be designed to channel an individual into a treatment intended to remove or ameliorate a defect. It is simply good engineering practice to use periodic further evaluations to see if the treatment is working, has worked, or should be redesigned. The failure to do this is what Kaufman objected to, and yes, some school systems use extremely poor system design! The key point here is that the institution has a continuing obligation to the individual. Tests should feed information back to the institution, to help it fulfill its obligation.
Education is not alone in taking this tack. The use of tests in clinical psychology and in the courts fits this paradigm. The role of the test is to provide information that can be used to help the examinee.
However, tests are also used for another purpose: personnel selection. This role goes back to the Army Alpha Test of World War I days, and is alive in its modern form as the Armed Forces Qualification Test. It is also the basis of employment testing, and, less obviously, college and university entrance examinations. Test scores are not intended to help the examinee; they are intended to help the personnel selector (recruiting officer, university registrar, etc.) select a workforce or student body that best suits the institution’s mission. The tests are supposed to identify, within the population of applicants, those applicants who, depending on the institution, will provide the most effective fighting force, maximize economic returns to the owners of an enterprise, or, in the case of colleges and universities, provide highly educated people to the society as a whole. Compared to clinical and similar applications, the emphasis has (properly) shifted from focusing on an examinee’s absolute abilities to focusing on relative abilities within the applicant pool.
A third use of tests is to evaluate training or educational programs. A good illustration of this was the consternation expressed during the summer of 2013, when several states found that fewer than half the students in their K-12 systems could meet the benchmarks set by the “Common Core” standards for mathematics and English language arts.
I believe that in both education and psychology we fairly often get into trouble because we try to use the same or similar tests for all three purposes. Individual diagnosis, personnel screening, and system evaluation are different goals. Each requires its own means.
In the case of individual diagnosis, we don’t need a test battery; we need a test armory. In the artillery, which is where these terms came from, a battery is a collection of cannons that fire as a unit. An armory is a place where the cannons are stored. When an expeditionary force is sent out, the commander selects the appropriate cannons for the mission.
We take the battery approach when we give a child who is struggling in school the full Wechsler Intelligence Scale for Children (WISC) or some similar test. It is also what we do, inexplicably, when we attempt to determine criminal responsibility by, in part, giving a person the entire Wechsler Adult Intelligence Scale (WAIS). What is needed is an armory approach. Those in charge of evaluation should consider the nature of the cognitive demands in each person’s situation, and then select tests to evaluate the individual’s cognitive capacities with respect to those demands.
The rules change for personnel screening. In this case a person’s relative standing with respect to others in the applicant pool is a crucial piece of information. Relative standing can only be determined if the same measurements are made on each individual. (Note that this does not imply that the tests are used to derive a single IQ or g score. Multidimensional test batteries are used all the time!) In personnel screening three questions have to be asked about the battery: how much does it cost to give; how much does the performance of the system as a whole, not of any one person, increase if the test is used to select applicants; and is there an alternative way to achieve at least the same benefits at a lower cost?
Finally, what about the sort of testing needed for system evaluation? To take a frequently discussed case, suppose that we want to know if a K-12 system is providing adequate instruction in science. It does not follow that we should construct an omnibus “science” test. To do so would force us into what has derisively been called “swamp” testing: covering broad areas with shallow questions. Far more information can be obtained, at far less cost, by constructing separate and searching tests of biology, chemistry, physics, etc., and randomly selecting some students to take the physics test, some to take the chemistry test, and so forth.
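The random assignment described above is simple to carry out. Here is a minimal sketch, assuming each student takes exactly one subject test and that the school system only needs an average for each subject, not a score for each student. The function names, the student identifiers, and the score dictionary are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def assign_subject_tests(student_ids, subjects, seed=None):
    """Randomly assign each student to exactly one subject test,
    keeping the groups as close to equal in size as possible."""
    rng = random.Random(seed)      # a seed makes the assignment repeatable
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled students out to the subjects in turn.
    return {student: subjects[i % len(subjects)]
            for i, student in enumerate(shuffled)}


def subject_means(assignment, scores):
    """Average score per subject, computed from whichever students
    were assigned to that subject and actually have a score."""
    by_subject = defaultdict(list)
    for student, subject in assignment.items():
        if student in scores:
            by_subject[subject].append(scores[student])
    return {subject: mean(values) for subject, values in by_subject.items()}
```

Because assignment is random, each subject's group is a fair sample of the whole student body, so the subject averages describe the system. They say nothing reliable about any one child, which is exactly why this design suits system evaluation and not individual diagnosis.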
If there were a deeper inquiry into why a testing program was being used before asking how to design it, I believe that there would be fewer cases of injustice due to testing, of the type described by Kaufman, and much more useful evaluations of K-12 and other training systems. Personnel classification would be largely unchanged, for personnel selection tests are already widely used, if somewhat unthinkingly.
Hunt, E. (2013). A gifted discussion of misclassification: Ungifted: Intelligence Redefined by Scott Barry Kaufman. Intelligence (in press, August 2013).
Kaufman, S. B. (2013). Ungifted: Intelligence Redefined. New York: Basic.