Measurement Validity Explained in Simple Language
How do we know when a personality test is valid?
Posted Nov 17, 2017
In my previous blog post, I noted that reliability and validity are two essential properties of psychological measurement. Measures of intelligence, personality, vocational interests, and so forth that lack reliability and validity are worse than useless. When we make important decisions for ourselves or others based on psychological measures that lack reliability or validity, those decisions are likely to be wrong and harmful. Therefore, I think it is important for anyone who undergoes psychological testing to understand reliability and validity and to recognize when a psychological measure might lack these vital characteristics.
As I indicated in my previous post, if a personality test is reliable, you will get almost exactly the same score anytime you take the test. Let's say you take a test that is supposed to measure your level of shyness. The first time you take the test you score a 90 out of a possible 100. Two weeks later you take the test a second time, and, again, you score a 90. And let's say the same is true for a thousand other people; their scores from the first testing are identical or almost identical to their scores two weeks later. People with low shyness scores, in the range of 0 to 34, also scored low the second time. People with scores in the middle, say 35 to 65, the first time also received average shyness scores the second time. And persons with high shyness scores, 66 and over, the first time also scored high the second time. We have what seems to be a reliable test for measuring shyness.
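The test-retest reliability described above is usually quantified as the correlation between the two administrations. Here is a minimal sketch of that idea; the scores are invented for illustration, and the Pearson correlation is computed by hand to keep the example self-contained.

```python
# Test-retest reliability, sketched as the Pearson correlation between
# two administrations of the same shyness scale taken two weeks apart.
# All scores below are made up for illustration.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

time1 = [90, 12, 55, 70, 30, 88, 45, 60, 20, 95]   # first administration
time2 = [88, 15, 53, 72, 28, 90, 47, 58, 22, 93]   # two weeks later

r = pearson_r(time1, time2)
print(round(r, 3))  # a value near 1.0 indicates high test-retest reliability
```

A coefficient near 1.0 means people kept their rank order across the two testings, which is exactly the pattern described in the paragraph above.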
But wait: this reliable test that gives virtually the same scores every time is clearly measuring some stable characteristic, but how do we know that this set of items actually measures shyness and not some other consistent trait? This is a question of test validity. A valid test measures what the test author claims it measures and not some other trait. Knowing that a test is valid requires more than just the appearance of validity. The shyness test might seem valid on the face of things because it contains items such as "I tend to shy away from social gatherings" and agreeing with such statements gives you points toward shyness. Personality tests that contain items whose content seems obviously related to what the test is supposed to measure possess what is sometimes called "face validity." But this is not enough evidence to conclude that these items actually measure shyness. If people are unable or unwilling to answer these items in accordance with their actual level of shyness, a personality test that looks valid might in fact lack validity. The question remains: how do we know that a personality test actually measures the personality trait the test author claims it measures?
This turns out to be a very difficult question. The literature on validity is large and complex. Dozens of articles on this topic have been published. Psychologists have written about different kinds of validity such as criterion validity, predictive validity, concurrent validity, and incremental validity. What I aim to do in this blog post is to cut through the complexities to explain validity in ordinary language that does not over-simplify the extremely important concept of measurement validity.
Despite the various "kinds" of validity that have been written about, psychologists agree that they all depend on a basic, central notion called construct validity, discussed in a classic monograph by Cronbach and Meehl (1955). (Paul Meehl has been described as the smartest psychologist of our time.) Psychological constructs such as shyness, social intelligence, depression, conscientiousness, and so forth are theoretical ideas that cannot be easily reduced to one simple behavior. Shyness is not just avoiding people, although in everyday life non-psychologists might use this one behavior to distinguish shy people from those who are not shy. For research psychologists, shyness as a theoretical construct is something that underlies and explains a wide range of thoughts, feelings, physical states, and behaviors. A shyness test with demonstrated construct validity is backed by evidence that it really measures differences in this theoretical construct, shyness. To see how evidence for construct validity is established, let's see what researchers have said about shyness.
Jonathan Cheek, an expert on shyness, has suggested that shyness underlies inner states such as self-critical and self-conscious thoughts, worries about being evaluated by others, fear of rejection, and feelings of tenseness, upset, and awkwardness in social settings. It also underlies physical symptoms such as sweating, trembling, and blushing in the presence of others, as well as clearly visible behaviors such as quietness, not looking people in the eye, stumbling awkwardly in conversations, and avoiding social situations altogether.
So what exactly is this theoretical construct, shyness, that leads to such a wide range of consistent thoughts, feelings, and behaviors? Well, researchers don't know exactly. The construct of shyness is like the construct graviton in theoretical physics, which is hypothesized to play a role in gravitational force. Presumably individual differences in shyness correspond ultimately to some yet-unobserved consistencies in brain functioning. There is something about the brains of shy individuals that differs from the brains of people who are not shy. But just as physicists lack a way to detect individual gravitons, psychologists cannot yet detect all of the differences in brain functioning that correspond to individual differences in shyness (although theories have been offered). Thus, shyness remains a theoretical construct.
Theories in science make predictions about what will be observed under certain circumstances. Theories of shyness predict what we will observe when a person is put into various social situations (or is asked to imagine being in certain social situations). Depending on one's theory of shyness, we might predict that shy people show more physical signs of anxiety (muscle tension, trembling, sweating) in a group of people engaged in a competitive game than in a group of people who are all watching a video. Testing such a prediction requires us to measure shyness in some way—whether it is with a shyness questionnaire, a simple self-rating of shyness, judgments of shyness from knowledgeable acquaintances, or some other shyness measure. Each time we conduct a study to test a prediction about the construct of shyness, we are simultaneously testing the construct validity of the method we are using to measure shyness.
In the words of Hogan and Nicholson (1988), "construct validation is nothing more or less than hypothesis testing" (p. 622).
Let's say we actually conduct the study described above. We have everyone in the study complete the 20-item Revised Cheek and Buss Shyness Scale (RCBS). We attach non-obtrusive sensors for measuring muscle tension, trembling, and sweating to all research participants, and we randomly assign them to groups. Some groups are given a competitive game to play and others are instructed to watch a video. After all the data are gathered, we compare the psychophysiological recordings with scores on the RCBS. We find that participants with high scores on the RCBS showed slightly more muscle tension, trembling, and sweating than those with low RCBS scores while watching a video. But when engaged in a competitive game, persons with high scores on the RCBS showed significantly more muscle tension, trembling, and sweating than those with low scores. Our prediction was confirmed.
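The pattern of results just described can be sketched as a simple group comparison. All numbers below are invented; the point is only to show what "a larger high-vs-low difference in the game condition than in the video condition" looks like as an analysis.

```python
# A minimal sketch of the analysis described above: compare mean arousal
# (an aggregate of muscle tension, trembling, and sweating readings) for
# high- vs low-RCBS groups in each condition. All numbers are invented.

arousal = {
    # (shyness group, condition): list of arousal scores
    ("high", "video"): [3.2, 3.5, 3.1, 3.4],
    ("low",  "video"): [2.9, 3.0, 2.8, 3.1],
    ("high", "game"):  [6.8, 7.2, 6.5, 7.0],
    ("low",  "game"):  [3.4, 3.6, 3.2, 3.5],
}

def mean(xs):
    return sum(xs) / len(xs)

# Group difference within each condition
diff_video = mean(arousal[("high", "video")]) - mean(arousal[("low", "video")])
diff_game = mean(arousal[("high", "game")]) - mean(arousal[("low", "game")])

print(f"high - low (video): {diff_video:.2f}")
print(f"high - low (game):  {diff_game:.2f}")
# The prediction is confirmed if the gap is much larger in the game
# condition than in the video condition (an interaction effect).
```

In a real study this interaction would be tested with inferential statistics rather than a bare comparison of means, but the logic is the same.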
Is that the end of it? Can we now say that the RCBS possesses construct validity, that it really measures shyness?
In a word, no. Confirming one prediction is just a piece of evidence supporting the construct validity of the RCBS. No theory of shyness says that shyness is nothing more and nothing less than experiencing muscle tension, trembling, and sweating during competitive activities. Shyness is so much more than this, and a powerful theory of shyness can generate enough testable predictions to keep researchers busy for a lifetime. Every time a new prediction is confirmed, this simultaneously provides evidence for the construct validity of one's shyness measure and for the validity of the theory that generated the hypothesis we tested.
But let's say that our prediction was not confirmed. Let's say that persons with high RCBS scores showed much more muscle tension, trembling, and sweating in both conditions—watching the video and engaging in the competitive game. Does this mean that the RCBS has zero construct validity and should be scrapped for a new measure of shyness? Not necessarily. When predictions are not confirmed, this might mean that the measure lacks construct validity. But it might also mean that there was a flaw in the underlying theory. Perhaps shy people experience physical symptoms of anxiety in virtually any group setting, not just competitive situations where they are worried about being evaluated. Or maybe there was a methodological problem. Maybe the video chosen for the study portrayed social situations that made the shy participants self-conscious. Perhaps a video about animals would have produced the predicted results.
Just as one confirmed prediction does not give us absolute confidence in a theory and in the construct validity of a test, one failed prediction does not necessarily mean scrapping the theory or the test. A close examination of the results might lead to the abandonment of a theory and/or measure. But it is more likely that researchers will make slight revisions to the theory, methods, or measure and try again. As I indicated earlier, construct validation and theory-testing are never-ending processes, keeping researchers busy for their entire careers.
Naturally, like anyone else, academic psychologists want to have successful careers, and one claim to a successful career in psychological measurement is coming up with a new measure that is regarded as reliable and valid by the research community. Unfortunately, the desire to prove your success sometimes leads researchers to make premature claims about the construct validity of their measures. I don't know how many times I have reviewed a manuscript submitted for publication—or even seen a published article—where the authors claimed to have "established" the construct validity of their new measure in one set of studies. Sometimes the claim is made on the basis of a single factor analysis of one data set! Cronbach and Meehl (1955) do mention factor analysis as one statistical procedure for investigating construct validity. It seems that some researchers, in a hurry to advance their careers, focused on that one portion of Cronbach and Meehl's monograph and ignored what they said about construct validation being a never-ending process.
So, do not believe psychologists who say they have demonstrated the construct validity of a measure in one paper. Neither should one failed prediction convince you that a theory is wrong or a scale is invalid. Scientific knowledge is not like a tower of bricks, where knocking out one brick would topple the tower. Scientific knowledge is more like a web, what Cronbach and Meehl called a "nomological net." If you cut one strand of a spider web, the entire web does not fail. Valid scientific knowledge does not stand or fall on one study. If our web or net of interlocking ideas is large and well-established, it stands even if one study fails.
I recently heard Al Gore make this point at a climate convention. Climate change deniers are in error to single out the few studies that failed to find evidence that climate change is primarily caused by human activities when the web of scientific findings overwhelmingly supports the theory of human-caused climate change. Consider this: did you ever conduct an experiment in a high-school physics or chemistry course and fail to get the expected results? As often as this has happened in high schools around the world, this does not mean that the laws of physics and chemistry need to be revised.
Ultimately, attempts to establish construct validity are a search for truth, and the search for truth has always been difficult. Since the beginning of civilization, philosophers have asked, "What do we know, and how do we know that we know it?" Even a completely bias-free person has trouble answering that question. Science, as a group exercise in establishing knowledge, has a great track record, as evidenced by its accomplishments. But individual scientists can have biases. Sadly, scientists sometimes design their research to produce results that please the corporations that fund them. And scientists sometimes get so attached to their theories that they argue for them one-sidedly, like lawyers rather than impartial investigators.
One form of bias I've seen in construct validation is including items that do not describe defining features of the construct, but instead predict outcomes that the researcher wants to be associated with the construct. In fact, what motivated me to write this blog post on validity was a story I read about attempts to measure spirituality and demonstrate the outcomes of living a spiritual life. Let me explain.
After seeing how complex a definition of the shyness construct can be, you can probably imagine the complexities and ambiguities of defining spirituality. The article I read on attempts to measure spirituality noted that clear definitions of spirituality are hard to come by, although more than three dozen measures of spirituality can be found in the psychological literature.
Some research claims that spirituality is linked to positive social relations and good health outcomes. Yet measures of spirituality sometimes contain items about positive social relations such as "I have a general sense of belonging" and "I feel a kinship with other people." Because other research has already demonstrated that good health outcomes are associated with positive social relationships, notes David Speed (2017), claiming that spirituality leads to good health with such measures is like including an item about "not smoking" in a spirituality scale and then claiming that spirituality protects people from cancer.
The lesson here for researchers is that they need to carefully define the constructs they are measuring and to avoid including items that represent predicted outcomes rather than items that define a construct. The lesson for consumers is that when you read that personality trait X predicts life outcome Y, you might want to check to see if the measure of personality trait X contains items about Y. Even one Y item will make it look like X predicts Y, when actually it is the one Y item that is predicting Y. Caveat emptor.
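The contamination problem described above can be made concrete with a hypothetical simulation. Here, trait X is genuinely unrelated to outcome Y, but one item on the X scale is itself a measure of Y; the contaminated total score then appears to "predict" Y. All of the data and names are invented for illustration.

```python
# A hypothetical simulation of criterion contamination: trait X is
# unrelated to outcome Y, but the X scale includes one item that is
# itself a measure of Y. The contaminated total then "predicts" Y.
import random

random.seed(1)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

n = 1000
y = [random.gauss(0, 1) for _ in range(n)]          # outcome Y (e.g., health)
pure_x = [random.gauss(0, 1) for _ in range(n)]     # trait items, unrelated to Y
y_item = [yi + random.gauss(0, 0.5) for yi in y]    # one item that really measures Y
contaminated_x = [p + q for p, q in zip(pure_x, y_item)]

print(round(pearson_r(pure_x, y), 2))          # near 0: the trait alone does not predict Y
print(round(pearson_r(contaminated_x, y), 2))  # clearly positive, driven by the one Y item
```

The apparent X-Y relationship in the second correlation is manufactured entirely by the single contaminating item, which is exactly the smoking-and-cancer trap Speed describes.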
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302. DOI: 10.1037/h0040957
Hogan, R., & Nicholson, R. A. (1988). The meaning of personality test scores. American Psychologist, 43, 621-626. DOI: 10.1037/0003-066X.43.8.621
Speed, D. (2017, October 11). What is spirituality, anyway? eSkeptic. Retrieved from https://www.skeptic.com/reading_room/is-spirituality-so-broadly-defined-that-testing-is-meaningless/