Educational Testing and the Great Divergence
How changes in educational testing could increase the divergence in society
Posted Sep 27, 2012
Introduction: The different types of reasoning
This blog is going to cover a lot of ground, ranging from Sherlock Holmes to the No Child Left Behind (NCLB) act. I am going to deal with a possible unintended result of current educational testing practices. The result is important because, over a not-so-long time, it will exacerbate the move in the United States toward separate societies that are increasingly differentiated by wealth, location, and opportunities for economic and social advancement (Murray, 2012; Noah, 2012). Noah refers to this as “The Great Divergence,” and both he and Murray (and many others) have worried that the divergence is weakening the traditional idea that the United States is, as the pledge of allegiance claims, one nation with “liberty and justice for all.”
While there is no one cause for the Great Divergence, one factor is clearly differential access to opportunities to acquire the skills needed for the modern workplace. Reich (1991) presciently referred to today’s workplace as being the domain of the “symbol analyst,” people who work with their brains rather than their hands. The skills needed to be a symbol analyst are typically acquired through college or university level training, or beyond. Therefore, in order to avoid making the divergence still worse, we want to be sure that all students who can benefit from education to be a symbol analyst have a chance to sharpen their cognitive skills.
What are these skills? One could write a book about this…and in fact, I have (Hunt, 2011). There I pointed out that cognitive skills can be categorized in many ways, and that different categorizations have different purposes. Here I will take just one way, the distinction between fluid (Gf) and crystallized (Gc) intelligence, illustrate it by examining that epitome of intelligence, Sherlock Holmes, and then explore the consequences of the distinction for educational testing. The fact that Holmes is fictional is of no consequence. Doyle’s depiction of him fits into my argument.
Sherlock Holmes relied a great deal upon deductive inference, i.e., pure rules of reasoning that apply regardless of content. One of the rules of deductive logic, which people have difficulty understanding, is that if A implies B, and B is not observed, then A must not be the case. Holmes solved one case by reasoning that since an alert watchdog did nothing in the night, a burglar had not been present. In another case he observed that if you eliminate all other explanations, the one remaining, no matter how apparently unlikely, must be the correct one. This is a pithy description of the reasoning behind Bayesian induction, which is at the foundation of much of modern statistics.
Holmes also relied a great deal upon specific subject-area knowledge. He kept files of newspaper articles, social events, and criminal reports. These were used to identify, among other things, the second most dangerous man in London. He had made a study of poisons. And, like all people who rely upon their memory, from time to time he got his facts wrong. He identified one killing as a case of murder by snake, but the mechanism he described is impossible. But this is nit-picking. The point is that Holmes was comfortable solving crimes either by reasoning or by subject-matter expertise.
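The two Holmes rules can be put side by side in a few lines of Python. This is only a sketch: the prior and likelihood numbers are invented for illustration, and the function name `posterior` is hypothetical. It shows that once the evidence rules out every rival explanation, the surviving one receives all of the probability, however unlikely it looked at the start.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive explanations."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the evidence rules out every listed explanation")
    return {h: weight / total for h, weight in unnormalized.items()}

# Hypothetical numbers: a burglar seems the likelier explanation beforehand,
# but "burglar implies barking" makes a silent dog impossible under it
# (likelihood 0), which is the modus tollens step.
priors = {"burglar present": 0.90, "no burglar": 0.10}
likelihood_of_silent_dog = {"burglar present": 0.0, "no burglar": 0.8}

result = posterior(priors, likelihood_of_silent_dog)
print(result)
```

With these assumed inputs the "no burglar" explanation ends up with probability 1 even though it started at 0.10, which is the elimination argument in numerical form.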
These two types of thinking are enshrined in modern psychology in the distinction between fluid intelligence (Gf, solving problems by reasoning) and crystallized intelligence (Gc, solving problems by applying relevant knowledge). The two types of reasoning are highly correlated (r ~ .9 in several studies), which means that if you know a person’s score on a Gf test, you can make a pretty good prediction of that person’s score on a Gc test, but there are exceptions (Hunt, 2011).
It is very hard, if not impossible, to write a pure test of Gf that does not depend in any way upon what examinees have learned prior to taking the test. The reason is simple: problem solving procedures themselves can be acquired. Both of the reasoning methods in my Sherlock Holmes example, deductive logic and Bayesian inference, are taught in schools. Similarly, unless the questions are extremely specific, such as “Who was the first president of the United States?”, it is hard to write a test that evaluates only Gc. However, you can certainly write a test that emphasizes either Gf or Gc.
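To make "a pretty good prediction, but there are exceptions" concrete, here is a small worked example. It assumes both scores are expressed as standardized (z) scores and that the relation is linear, which the text does not state; the function names are hypothetical.

```python
import math

R_GF_GC = 0.9  # correlation quoted in the text (r ~ .9)

def predicted_gc_z(gf_z, r=R_GF_GC):
    """Best linear prediction of a standardized Gc score from a standardized Gf score."""
    return r * gf_z

def residual_sd(r=R_GF_GC):
    """Standard deviation of the prediction errors, in z-score units."""
    return math.sqrt(1.0 - r ** 2)

variance_explained = R_GF_GC ** 2  # share of Gc variance accounted for by Gf

print(predicted_gc_z(1.0), variance_explained, residual_sd())
```

Under these assumptions a student one standard deviation above the mean on Gf is predicted to be 0.9 standard deviations above the mean on Gc, and Gf accounts for about 81% of the variance in Gc. The prediction errors still have a standard deviation of roughly 0.44, which leaves room for the exceptions.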
Now let us look at this from the viewpoint of a student who has to take a “high stakes” educational test, such as a test that is used to screen applicants for college. Should the student rather take a test that emphasizes Gc or Gf?
The answer to this question depends upon how the student has been prepared for the test. The majority of students who take entrance examinations have had the opportunity to be “adequately prepared,” in the sense that their school has offered classes that cover the topics of the examination in approximately the same depth as the examination will probe their knowledge of these topics. This does not mean that the “average student” is going to do very well on an exam, for my statement is only that the student has been exposed to the necessary knowledge. Whether he or she has acquired the knowledge depends upon the student’s own cognitive power (intelligence at the time the class was taken) and how much he or she paid attention during the class.
Consider a student who has paid an average amount of attention in class, and has gone to a high school (HS) that offers an average array of courses. Because of the high correlation between the two types of examinations, the student’s standing relative to other applicants would be about the same regardless of the type of examination. Therefore the student should not care whether the test emphasizes Gf or Gc.
Now consider a student who, for some reason, has developed an exceptional interest in the topics to be covered in the Gc type of examination, and has, as a result, spent an inordinate amount of time studying them. This student has developed expertise in the subject matter, and therefore should prefer a test emphasizing Gc.
Finally, consider a student who has had less than the average exposure to relevant class material. This could happen in one of two ways. One is that the student, personally, may not have put much effort into the coursework that was offered. The student may be a classic “late bloomer,” the family situation may not have been conducive to study, or for several other reasons the student may have been distracted from course work. The other possibility is that the student’s school may not offer the relevant coursework, either at all or at the depth that is appropriate for the examination. In either case, the student should seek a more content-free Gf examination.
The Minimum Standards Tests
Probably the most talked-about tests used today are the state “standards” tests required by the No Child Left Behind (NCLB) act. Each state, individually, is required to generate tests covering English comprehension, mathematics, and science (considered as a single topic). The 10th grade test can be thought of as a summary evaluation of students’ knowledge shortly before they leave the educational system. At the individual level, these tests indicate whether or not a student has achieved a minimal level of academic accomplishment. In addition, the test scores serve as yardsticks by which state and local school systems are judged. Substantial federal aid is awarded according to whether or not students are meeting the state’s own standards. The NCLB places special emphasis upon the performance of students from demographic groups who, on the average, have shown low academic achievement. Attention is usually centered on the performance of African-American and Latino students, as these are by far the largest minority groups in the US.
The NCLB has become something of a political football, and will undoubtedly be modified. However, it is quite likely that the use of test scores to evaluate teachers, schools, and school districts, as well as students, will increase. In addition, there is pressure for the creation of a national ‘core’ curriculum, especially in the science, technology, engineering and mathematics (STEM) fields. The idea is that the federal government would propose such a curriculum and each state would voluntarily adopt some or all of it. If this happens, testing will become more uniform across states.
The standards tests intentionally emphasize Gc, because they are intended to determine students’ level of academic achievement. For instance, there is a test of knowledge of science, rather than separate tests for biology, physics, chemistry, earth sciences, etc. Similarly, “Language arts” covers ordinary reading comprehension and knowledge of vocabulary and grammar, but does not evaluate knowledge of literature, nor do most, if any, of the examinations cover the use of subtle mechanisms in language, such as irony or metaphor. Finally, because the NCLB and, more generally, public interest focus upon whether or not examinees meet minimum educational standards, there are relatively few questions that allow students to demonstrate superior educational attainment.
I have heard this sort of test described, somewhat amusingly, as a SWAMP test. Just as a geographic swamp covers a large area with shallow water, in education a SWAMP test covers many topics, but does not probe a student’s knowledge deeply in any of them. There is a strong bias toward asking for facts or demonstrating knowledge of algorithmic procedures taught in class, such as procedures for interpreting graphs.
These remarks are not meant to denigrate the standards movement. Ensuring that a high percentage of students meet acceptable standards of accomplishment is an important goal. At present, at least, testing is seen as the most cost-effective way of ensuring that this goal is being met. In addition, test results show educational policy makers what parts of the system need to be fixed. Running educational programs at the state and district level without testing would be analogous to driving an automobile without a gas gauge or a speedometer. Finally and arguably most importantly, a high-stakes standardized test is a policy instrument. If teachers’ evaluations and salaries are tied (partially) to student test scores, teachers will “teach to the test.” The teachers would be fools not to. Therefore test design is a powerful device for controlling teachers’ behaviors. Education policy makers should want teachers to teach to the test, provided that those same policy makers take care to design the test in such a way that it encourages the sort of teaching behaviors that they want to encourage.
Testing and college admissions
The second big use of testing in education is to screen applicants to colleges. Virtually all of our accredited, four-year colleges and universities screen applicants for academic potential by combining high school grades and scores on one of two examinations: either the SAT-Reasoning test (formerly SAT-I), sponsored by the College Board and administered by the Educational Testing Service (ETS), or the ACT, prepared by ACT Inc., a non-profit educational corporation. Historically the SAT and ACT were developed using different philosophies of examination. In general, the SAT placed more emphasis on reasoning abstracted from specific curricular content (Gf) while the ACT was developed with an emphasis on the content of common High School (HS) curricula (Gc). Nevertheless, and as would be expected, the two tests order individuals in essentially the same way. A study by the University of Texas, Austin, concluded that nationally the correlation between the two tests is .92, and that the correlations between equivalent subtests, such as reading comprehension or mathematics, are .80 or above (Lavergne & Walker, 2001).
Both the ACT and the SAT are considerably more difficult than the tests used to determine minimum levels of achievement. There is a need for a wide range of difficulty between items on the tests, for there is considerable difference in the performance of successful applicants at different universities. This is illustrated in the following table, which shows the 25th and 75th percentile points for the reading and mathematics sections of the SAT Reasoning Test within the ten campuses of the University of California system. There is a substantial variation in scores of successful applicants within the same well regarded public university system. Nationally the variation is even greater.
Campus          Reading 25%   Reading 75%   Math 25%   Math 75%
Berkeley        600           730           630        760
Davis           520           650           570        680
Irvine          510           620           560        680
Los Angeles     570           680           610        740
Merced          430           550           460        590
Riverside       450           560           480        610
San Diego       540           670           610        720
Santa Barbara   540           650           550        670
Santa Cruz      490           630           510        640
Table. The 25th and 75th percentile points on SAT reasoning subtests for successful applicants to the University of California campuses. The University of California, San Francisco is not shown for it is exclusively a graduate study institution.
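One way to see how large the variation in the table is: check which campuses have a 75th percentile score that falls below Berkeley's 25th percentile score. The sketch below copies the figures from the table; the dictionary and function names are hypothetical and chosen only for this illustration.

```python
# (reading 25th, reading 75th, math 25th, math 75th), copied from the table
UC_SAT = {
    "Berkeley": (600, 730, 630, 760),
    "Davis": (520, 650, 570, 680),
    "Irvine": (510, 620, 560, 680),
    "Los Angeles": (570, 680, 610, 740),
    "Merced": (430, 550, 460, 590),
    "Riverside": (450, 560, 480, 610),
    "San Diego": (540, 670, 610, 720),
    "Santa Barbara": (540, 650, 550, 670),
    "Santa Cruz": (490, 630, 510, 640),
}

def campuses_below(reference, low_idx, high_idx, table=UC_SAT):
    """Campuses whose 75th percentile lies below the reference campus's 25th percentile."""
    floor = table[reference][low_idx]
    return sorted(
        campus
        for campus, row in table.items()
        if campus != reference and row[high_idx] < floor
    )

reading_gap = campuses_below("Berkeley", 0, 1)
math_gap = campuses_below("Berkeley", 2, 3)
print(reading_gap, math_gap)
```

For both reading and mathematics the answer is Merced and Riverside. At those two campuses, three quarters of successful applicants scored below the level that three quarters of Berkeley's successful applicants reached or exceeded.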
There has to be some way of screening applicants, for there simply is not room for all the people who want to go to the most prestigious schools. In 2011 UC Berkeley admitted only 21% of its applicants.
Two alternatives to the present reliance in part upon test scores have been suggested.
The first is to rely on HS grades rather than test scores. There has been an extended and, I believe, somewhat pointless argument over whether HS grades or test scores are better predictors of success in college. This controversy is something of a red herring. In practice, most college and university admissions decisions are based on both HS grades and test scores.
A second proposal is somewhat more interesting. It was made by Richard Atkinson (2005), the (now retired) president of the U. of California system. A parallel proposal has been made by the social commentator Charles Murray (2012). As Atkinson’s discussion was more detailed than Murray’s I will consider it.
Atkinson observed that any test is “coachable,” in the sense that people can be trained to execute behaviors that will help them on the test. If the test emphasizes Gf (reasoning), practice emphasizes solving abstract reasoning puzzles, such as analogical word problems (e.g., cat is to lion as dog is to: zebra, wolf, elephant, horse). (Analogy problems have been dropped from the SAT, although they are fairly good indicators of general intelligence (Hunt, 2012).) By contrast, if the tests are based on curricular topics, the way to study for a test is to study the topic. Presumably what students learn while practicing for (Gc loaded) subject matter tests will be useful later in their college careers.
Atkinson expressed two concerns. One is that practicing behaviors that are only of use in taking a test is not a good use of a student’s time. The second, which seems to have been his chief concern, is that there are fairly costly commercial test preparation courses that specialize in training people to take the SAT Reasoning test. Atkinson felt that this might give a special advantage to students from wealthier families (“high socioeconomic status,” SES for short). For these reasons Atkinson suggested that greater weight be given either to ACT scores or to scores on the SAT, part II, in which students may selectively choose to take subject-matter tests in 20 different fields.
An additional trend in university admissions has been to consider how many “advanced placement” (AP) classes a student has taken, and how well he or she has done on nationally developed AP examinations. The AP courses are courses that go beyond the normal high school curriculum, both in the extent to which material in a discipline is covered and in the depth of understanding expected of the student.
Both the Atkinson-Murray suggestion and the increased use of AP courses would have the effect of increasing the weight of Gc skills in admission decisions. Would this be a good thing?
Interactions: The combined effects of standardized tests and changes in college admissions practices could increase the divergence
As was pointed out above, whether a test emphasizes Gc or Gf skills does not matter very much to students who have had the classes that are assumed by the Gc test developers, and who have paid an average amount of attention to their classwork. The problem arises when we consider those students who have not had or have not taken the opportunity to prepare for in-depth Gc tests.
High school AP offerings vary greatly with the school and school district. As a rough guide, the prevalence of AP classes depends upon the extent to which the community sees the schools’ role as preparing children to compete for places in high prestige public and private institutions, as opposed to ensuring that their children can either get jobs or go to community colleges and lesser ranked four-year institutions.
This is one example of a general finding: the extent to which students get something out of HS education is heavily influenced by the general community’s support for education. Both the educational opportunities offered and the student’s involvement in them are heavily influenced by general community expectations (Steinberg, 1996). This support can vary widely, ranging from the nearly hostile view of studying found in the adolescent culture of some low SES neighborhoods to what has been described as an “insane” high SES competition to get children into private schools. The result? Students who have graduated from schools in high SES neighborhoods will be better prepared for a Gc loaded test than students from schools in low SES neighborhoods.
This trend is likely to be exacerbated if teachers are held accountable for student results on statewide tests of academic achievement. In principle, there is nothing wrong with holding teachers accountable for student performance, any more than holding doctors and dentists accountable for their patients’ health. In both cases, though, the devil is in the details. All the standards movements of which I am aware have stressed the importance of having students demonstrate that they are above the minimally acceptable standard. If teachers and schools are going to be judged by the percentage of students meeting minimal standards, then teaching efforts are going to be directed toward those students who are at risk of not meeting those standards. Because the school system has far from limitless resources, remedial courses and drills will be stressed, inevitably at the expense of encouraging above-average performance and offering AP and gifted student programs.
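The incentive argument can be illustrated with a toy calculation. The scores and the cutoff below are invented for illustration, and `pass_rate` is a hypothetical stand-in for whatever "percentage meeting standard" statistic a state might actually report.

```python
def pass_rate(scores, cutoff):
    """Fraction of students at or above the minimum-standard cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)

CUTOFF = 60  # hypothetical minimum standard

baseline = [40, 55, 58, 70, 85, 90]
enriched = [40, 55, 58, 70, 95, 100]   # the two strongest students improve
remediated = [40, 60, 62, 70, 85, 90]  # two students near the cutoff cross it

print(pass_rate(baseline, CUTOFF))
print(pass_rate(enriched, CUTOFF))
print(pass_rate(remediated, CUTOFF))
```

In this made-up class, gains among the strongest students leave the pass rate unchanged, while moving two borderline students over the line raises it. A teacher judged only on that statistic is rewarded for the second kind of effort and not for the first.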
This will not be a great problem for schools in higher SES districts, because in those schools only a small percentage of students are in danger of failing the relatively easy SWAMP tests. A personal illustration is in order.
I live in the Bellevue, Washington school district. With the exception of one area containing a recent immigrant population, Bellevue is a comfortably middle and upper middle class community. As of 2011 four of the district’s five standard high schools were ranked in the top one percent of US News and World Report’s ranking of US high schools. My grandchildren and their friends have made it quite clear to me that they do not spend much time preparing for the state standards tests. The word “joke” was used. Statewide, though, about 50% of the students did not meet minimum standards on the mathematics and science assessments.
On the other hand the high schools in the Bellevue district trumpet another statistic on their websites…the high performance of their students on the SAT.
Consider two students, one from a low SES neighborhood and one from a middle-high neighborhood. Assume that they both have equal fluid intelligence, Gf. If they are asked to take a Gc loaded examination, such as the SAT part II or the ACT, the student from the higher SES neighborhood will have a better chance of doing well than the student from the lower SES neighborhood, simply because he or she has been better prepared for the examination! The advantage that students from high SES families hold in the admissions process would increase; exactly the opposite of the result intended by Atkinson and Murray.
As a historical note, this would be an example of ‘back to the future.’ The SAT was originally introduced, and intentionally slanted toward Gf rather than Gc testing, in order to reduce the advantage in college admissions held by students from high SES families, who often sent their children to private preparatory schools (Lemann, 1999).
The standards movement and reforms of the college entrance process, both eminently sensible in themselves, are likely to interact to increase divergence in educational opportunities. This is serious. Educational disparities are not the only cause of the divergence in American society, but they do contribute to it. What can be done? This blog is not a policy blog, nor am I a policy expert, so I will present solutions that I believe are consistent with findings in psychology and education, without concerning myself with their political feasibility.
Tests that emphasize Gf should continue to be used. They will provide late-blooming and/or poorly prepared applicants a chance to show their potential, without their relying on information presented in academic programs that may not have been available to them. The use of Gf tests will not eliminate the correlation between family SES and admissions to prestigious colleges and universities, for Gf and family SES are themselves positively correlated. However, it has been shown that Gf scores, such as the SAT Reasoning Test score, are predictors of success in college over and above family SES. (See Hunt, 2012, for details.)
The testing programs associated with the standards movements ought to be changed. At present the standards tests are used to do two things: to certify that students, as individuals, have met at least minimum standards in educational achievement, and to evaluate the academic quality of teachers and schools. The two goals are not the same. The SWAMP testing model is appropriate for determining which students are at or above acceptable standards. However, the SWAMP model does not provide information about the extent to which a teacher or school is producing superior students.
States should augment the SWAMP model by developing focused, in-depth examinations in specific fields, e.g., Physics and Chemistry rather than “Science.” If the state or school district (and, by inference, the community) is concerned about academic achievement outside of the fields dictated by federal and/or state regulations, testing could be extended to cover subjects such as History, Literature, Art, Music and so on. To avoid an unnecessary burden on students, the students taking a particular subject matter test should be chosen at random from those who have taken appropriate courses. Complete results from all students would not be needed because the goal would be to evaluate the quality of instruction at the school or district level, rather than to certify individual student achievement. If the resulting information about above-average student performance were to be used in teacher and school evaluation, in addition to the SWAMP results, teachers would have an incentive to be concerned about the development of the better students rather than, as at present, having incentives only to focus on meeting minimum standards.
In fact, such a program already exists at the national level: the National Assessment of Educational Progress (NAEP) examination program, sometimes referred to as the “Nation’s Report Card.” My proposal is to extend testing programs like the NAEP throughout the nation’s schools, on an annual basis, and to give the results some teeth because they would be used in teacher, school, and district evaluations.
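The random selection step in this proposal is simple to carry out. The following is a minimal sketch, assuming the school can list the students who completed the relevant course; the roster, the sample size, and the function name `sample_examinees` are all hypothetical.

```python
import random

def sample_examinees(eligible_students, sample_size, seed=None):
    """Draw a simple random sample, without replacement, of students to sit one subject test."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible for auditing
    k = min(sample_size, len(eligible_students))
    return rng.sample(list(eligible_students), k)

# Hypothetical roster of 120 students who completed a physics course.
physics_roster = [f"student_{i:03d}" for i in range(1, 121)]
chosen = sample_examinees(physics_roster, 30, seed=2012)
print(len(chosen))
```

Because only the sampled students sit each subject test, the burden on any one student stays small, while the school or district still gets an estimate of how well the course is being taught. A real program would also need to decide how large each sample must be, which this sketch does not address.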
The down-side, of course, is that everyone involved would have to develop more sophisticated standards for schools than the present, single-minded NCLB focus on minimally acceptable student achievement. That would mean that people would have to think more deeply about what the educational system is supposed to do. Granted, this implies more work for everyone, but it is probably a good idea.
What do you think?
Atkinson, R. C. (2005) College admissions and the SAT: A personal perspective. Observer (Magazine of the Association for Psychological Science), 18(5).
Hunt, E. (2011) Human Intelligence. Cambridge: Cambridge U. Press.
Lavergne, G. M., & Walker, B. (2001) Developing a concordance between the ACT Assessment and the SAT I: Reasoning Test for the University of Texas at Austin. Research Report, Office of Admissions Research, The University of Texas at Austin: Austin, TX.
Lemann, N. (1999) The big test: The secret history of American meritocracy. New York: Farrar, Straus and Giroux.
Murray, C. (2012) Coming Apart: The state of White America 1960-2010. New York: Crown Forum.
Noah, T. (2012) The Great Divergence: America’s Growing Inequality Crisis and What We Can Do About It. New York: Bloomsbury Press.
Reich, R. (1991) The work of nations: Preparing ourselves for 21st century capitalism. New York: Knopf.
Steinberg, L., with Brown, B. B., & Dornbusch, S. M. (1996) Beyond the classroom: Why school reform has failed and what parents need to do. New York: Simon & Schuster.