Do We Have a Measurement Problem?

Can psychology's allergy to measurement explain problems in willpower work?

Posted Oct 31, 2019

Flake and Fried argue that psychology takes a "measurement schmeasurement" attitude towards quantifying important constructs.
Source: From Pixabay at Pexels

Standardized testing is controversial. Does an IQ test really measure intelligence, or is it just related to your ability to take tests—not solve more practical everyday problems? Does the GRE really measure who will do best if they’re admitted to a graduate program? Even if these tests pick up on these quantities reasonably well, are there certain systematic biases in the results they give? How precise are their measurements—should we trust them to two decimals, or round them to the nearest whole number?

Addressing these questions is part of research into psychological measurement, a forbidding-sounding area of research that often gets overlooked when people think about psychology. But measurement isn't something that should only get discussed in regard to high-stakes testing. Jessica Flake and Eiko Fried want you to know that it's central to almost all areas of psychology, even if we don't talk about it enough.

Flake is a quantitative psychologist at McGill University with thoughtful and opinionated critiques of current research culture. Her Twitter thread about the GRE led me to write an earlier article about whether the GRE should be dropped from graduate admissions. Her recent interview on the Two Psychologists Four Beers podcast better explains where she's coming from. Most in-depth is her academic article with Fried titled "Measurement Schmeasurement," where they demonstrate "that psychology is plagued by a measurement schmeasurement attitude: QMPs [questionable measurement practices] are common, offer a stunning source of researcher degrees of freedom, pose a serious threat to cumulative psychological science, but are largely ignored" (from the abstract).

Flake and Fried use the term QMPs as an analogy to QRPs—questionable research practices—a term that gained ubiquity in psychology in 2011 and 2012. QRPs relate to statistical inferences: using them, you might be able to go from claiming that the effect of one variable on another (say, the effect of dieting on self-control) isn't present to claiming that it is. You go from saying "I don't have an effect" to "I do have an effect."

But Flake and Fried open up a broader framework of issues, based on measurement theory. Making the right call on a statistical inference ("there is an effect" vs. "there isn't") is just one type of validity. Another is internal validity, which establishes causal relationships between variables. A common issue raised here is whether variables have the same relationships in different cultures or settings. For example, self-control might be measured by asking people to keep persisting on a hard puzzle, to avoid laughing at a funny movie, and to keep gripping a gripmaster for as long as they can. Among middle-class American college undergraduates, performance on all of these tasks might correlate positively, giving an internally valid measure of self-control. Among Mensa members, performance on the puzzle might be differently related to the other tasks, because they love them some brain teasers. This measure would therefore be limited in its applicability.
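The "do the tasks hang together?" question comes down to correlations. Here's a minimal sketch with entirely invented scores: in one group the puzzle task tracks the grip task, and in the other it doesn't, so the same battery no longer behaves like one measure.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation, no external libraries."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores for five participants per group, illustration only.
# In the student sample, puzzle persistence tracks grip persistence...
students_puzzle = [4, 6, 7, 8, 10]
students_grip = [30, 41, 45, 52, 60]
# ...but in the brain-teaser-loving sample, puzzle scores are high regardless.
mensa_puzzle = [10, 9, 10, 9, 10]
mensa_grip = [30, 41, 45, 52, 60]

print(pearson(students_puzzle, students_grip))  # strongly positive
print(pearson(mensa_puzzle, mensa_grip))        # near zero: tasks no longer hang together
```

The point isn't these particular numbers; it's that a battery validated in one population can quietly stop measuring a single construct in another.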

External validity establishes how generalizable findings are. Maybe we saw that dieting reduced willpower among college students in the Northeast in the 80s. Will it do the same thing among working adults in the Southwest in the 2010s? If not, the finding may be less useful, because it doesn’t hold in general.

Construct validity establishes how we measure the variables in a study. Is the important thing for the “not laughing at a funny movie” task to hold off on laughing for as long as possible—or is it to consistently regain composure and make as little noise as possible? You could get very different numbers for a person’s self-control score from this measure if you have someone who “breaks” and then can’t stop laughing. If you score the task the first way, we have a self-control champ; score it the second way, and they’re a self-control chump.
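The champ/chump reversal is easy to make concrete. Below is a hypothetical sketch (the timestamps, clip length, and scoring rules are all invented for illustration) showing how the same raw behavior yields opposite rankings under the two scoring rules:

```python
# Hypothetical timestamps (in seconds) at which a participant laughed
# during a 120-second funny clip. All numbers are invented.

def score_time_to_first_laugh(laugh_times, clip_length=120):
    """Rule 1: self-control = how long you held out before first laughing."""
    return min(laugh_times) if laugh_times else clip_length

def score_total_composure(laugh_times, clip_length=120, laugh_cost=2):
    """Rule 2: self-control = fraction of the clip spent composed,
    treating each laugh as costing a fixed couple of seconds."""
    return (clip_length - laugh_cost * len(laugh_times)) / clip_length

# Participant A "breaks" late but then can't stop laughing.
a = [90, 93, 96, 99, 102, 105, 108, 111, 114, 117]
# Participant B giggles once early, then stays composed.
b = [10]

print(score_time_to_first_laugh(a), score_time_to_first_laugh(b))  # 90 vs 10: A wins
print(score_total_composure(a), score_total_composure(b))          # ~0.83 vs ~0.98: B wins
```

Nothing about the participants changed between the two print statements; only the scoring rule did. That choice is a researcher degree of freedom hiding inside "we measured self-control."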

Measurement questions seem so fundamental in everyday terms that we forget how serious and tricky they are for psychology.
Source: Photo by Pixabay on Pexels.

Flake and Fried cite a lot of "yikes!" statistics about measurement. For example, a review of the Journal of Personality and Social Psychology (JPSP) found that 19% of the time a standard scale was modified in some ad hoc (not validated) way, and a review of measures of emotion specifically found that about 90% were modified. The review of JPSP also found that 40% of the scales used didn't make clear where they came from, 19% didn't report how many items they contained, and 9% didn't report what people's response options were.

When we talk about standardized tests, scientists and the general public know to be very careful about establishing that they really work. This is a continuous process, and new ways to detect bias in tests are continually being researched. We should care this much about the more abstract and less well understood concepts being studied in many areas of psychology!

For example, we should continually be worrying about whether a scale for depression really measures depression, whether it works for all groups of people, and whether we’re combining the items in a way that makes sense (should some be weighted more than others?). We need to do this before we start making claims about which treatments reduce depression. If we don’t know that our measure of depression is giving valid readings, what’s the point of seeing if we can change it? (Fried has a great paper suggesting that measures of depression are very inconsistent.)
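The weighting question in that parenthetical matters more than it sounds. Here's a minimal sketch with an invented four-item checklist (not a real instrument, and the weights are made up): two patients can swap places depending on whether you just sum the items or weight some more heavily.

```python
# Hypothetical 4-item depression checklist, each item rated 0-3.
# Items, responses, and weights are all invented for illustration.
items = ["sad_mood", "sleep_problems", "fatigue", "worthlessness"]

patient_x = {"sad_mood": 3, "sleep_problems": 0, "fatigue": 0, "worthlessness": 3}
patient_y = {"sad_mood": 1, "sleep_problems": 3, "fatigue": 3, "worthlessness": 0}

def unweighted(resp):
    """Score by simply summing the items."""
    return sum(resp[i] for i in items)

def weighted(resp, w):
    """Score by weighting each item before summing."""
    return sum(w[i] * resp[i] for i in items)

# Suppose a validation study suggested core mood symptoms matter more.
weights = {"sad_mood": 2.0, "sleep_problems": 0.5, "fatigue": 0.5, "worthlessness": 2.0}

print(unweighted(patient_x), unweighted(patient_y))                # 6 vs 7: Y looks worse
print(weighted(patient_x, weights), weighted(patient_y, weights))  # 12.0 vs 5.0: X looks worse
```

If a treatment helps sleep and fatigue but not mood, the two scoring schemes can disagree about whether "depression" improved at all, which is exactly why the scoring model needs validating before the treatment claims.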

I believe this kind of measurement issue is key to understanding one of the most controversial research areas in recent psychology: ego depletion. Reading the original, classic manuscript on ego depletion, you find four studies that claim to measure "willpower" (a synonym for ego depletion in the literature) using four different experimental setups.

In one, participants sat in a room with a bowl of fresh-baked cookies and a bowl of radishes; some were allowed to eat the cookies, others only the radishes. Not getting to eat cookies was meant to tax willpower. In another, participants either recorded a prewritten speech in favor of a position they chose or were assigned to record a speech that did not support their own position. Giving a speech against your own position was meant to tax willpower. In the third, participants watched highly emotional movie clips and were told either that they could express their emotions freely or that they should hide them. Hiding their emotions was meant to tax willpower. In the fourth, participants were told either to cross off every instance of the letter "e" on a page of text or to apply multiple rules about which letters to cross off. Having to apply more rules was meant to tax willpower.

To me, the biggest claim in the paper was not the central claim (that willpower can be depleted by common tasks). It was the assumption that all these tasks had the same effect! The researchers seemed to believe that any task that seemed hard or inconvenient was necessarily depleting in the same way. They never did the preliminary work of establishing that the four tasks were measuring the same thing.

Willpower researchers have not worried about which tasks were valid measures.
Source: By A. Danvers (photo credits in image).

As a result, decades later, when an ego depletion task failed to replicate, the original researchers claimed that the reason was just that the right task hadn’t been used. Because no one had ever gone back and taken the time to establish the validity of the various willpower tasks, it was impossible to say whether the one used was good or bad. It was all left up to the intuition of different groups of scientists. Even now, none of the solutions on the table to understanding willpower involve going back and doing the missing measurement research.

Like intelligence and depression, willpower is a big, important topic with implications for a lot of day-to-day life. We should care about measuring all of these constructs accurately! If you're fired up about whether standardized tests are fair, you should be fired up about whether we're diagnosing depression accurately! Or actually figuring out what it means to have willpower in your day-to-day life! Psychology measurement research isn't just for stats nerds and opinionated people on Twitter—it's for anyone who cares about getting an accurate understanding of how people think.