Measurement Reliability Explained in Simple Language

Why you absolutely must understand the basics of psychological measurement.

Posted Oct 29, 2017

Those of us in the business of psychological measurement use the terms reliability and validity a lot. You've probably seen those terms on the Psychology Today website (they appear thousands of times) and elsewhere. You might have some sense of what it means for a psychological test to be reliable or valid. You have probably assumed that a good test must be both reliable and valid (and you would be right about that).

But what are reliability and validity exactly, how do we assess reliability and validity, and why are these properties of psychological tests so crucially important? In this and a following blog post, I hope to answer these questions in a totally non-technical way, avoiding statistical language as much as humanly possible. If I succeed, you will see why understanding measurement reliability and validity is so important for judging the usefulness of an IQ or personality test. Many psychological "quizzes" on the Web have absolutely no evidence of reliability or validity, so you should not take them seriously. Even the claims about the reliability or validity of professionally-developed tests are sometimes overstated. Your understanding of reliability and validity from reading this blog post may help you to recognize when this happens and to use caution before accepting results based on overstated claims.

A word of warning: Even though I am writing about reliability and validity in a non-technical way, my two blog posts are in-depth, intensive treatments of these topics. As a result, they are longer than a typical PT blog post. So if you are looking for fluff and entertainment about personality, these posts are not for you. If you are serious about understanding reliability and validity in psychological measurement, welcome aboard.

To avoid total information overload, I decided to write about reliability and validity in two different posts. In this first one, I'll cover measurement reliability, because that property is more basic. It is possible to have reliable measurements that lack validity. However, unreliable measurements can never be valid. So let's start here with reliability.

The psychologist Edward Thorndike (1918, p. 16) famously wrote, "Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality." Measuring quantities is a basic activity of any science, whether we are talking about measuring the size, mass, temperature, and velocity of physical objects or the intellectual and personality traits of human beings. And measurement in any science assumes that our attempts to measure the actual quantities of things or people will inevitably involve some measurement error. As we strive to determine actual quantities with our measuring devices, the measurements we record will sometimes be too high, sometimes be too low, and sometimes be right on.

Measurement reliability refers to how closely a measurement procedure gets us to the actual quantity we are trying to measure. Stated another way, reliability is the absence of measurement error. The closer a measurement procedure can get to the actual amount of something, the less measurement error we have, and the more reliable the measurement procedure. But how can we know the reliability of any measurement procedure?

Let's look at this question first with an example of physical measurement. Let's say that we have a piece of wood that we somehow know to be exactly three feet (or 36 inches) long. (We will ignore for now how we know that.) We have two tape measures, one made out of cloth fabric, and one made of steel. To see how closely these two tape measures assess the actual lengths of boards, we try them out on the three-foot board. To increase our confidence in our little experiment, we measure the board 200 times—100 times with the cloth tape measure and 100 times with the steel tape measure, recording our reading each time.

Lo and behold, we discover that the cloth tape measure tended to produce somewhat inconsistent results. About 70% of the time it did indicate that the board was exactly 36 inches long, but about 5% of the time it produced measurements that were too large, like 36 1/16 inches or even 36 1/8 inches. And 25% of the time it underestimated the true length of the board, with measurements like 35 15/16 inches. We might say that the cloth tape has some reliability, but perhaps not enough to trust it for woodworking projects.

In contrast, we found that the steel tape showed readings of exactly 36 inches 98% of the time. Once (that is, 1% of the time) it showed a reading of 35 15/16 inches, and once (1% of the trials) it produced a measurement of 36 1/16 inches. These results suggest that the steel tape measure is reliable enough to use in your woodworking projects.
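For readers who like to see the tally, here is a minimal sketch of how the two sets of 100 readings could be summarized. The exact readings are assumptions made for illustration: the text only says the cloth tape's low readings were "like" 35 15/16 inches and does not say how its five high readings divide between 36 1/16 and 36 1/8 inches.

```python
from fractions import Fraction
from statistics import mean, pstdev

SIXTEENTH = Fraction(1, 16)
TRUE_LENGTH = Fraction(36)

# Assumed readings in inches: all 25 low cloth readings set to 35 15/16, and the
# 5 high cloth readings split 3 and 2 between 36 1/16 and 36 1/8 (hypothetical).
cloth = (
    [TRUE_LENGTH - SIXTEENTH] * 25
    + [TRUE_LENGTH] * 70
    + [TRUE_LENGTH + SIXTEENTH] * 3
    + [TRUE_LENGTH + 2 * SIXTEENTH] * 2
)
steel = [TRUE_LENGTH - SIXTEENTH] * 1 + [TRUE_LENGTH] * 98 + [TRUE_LENGTH + SIXTEENTH] * 1


def summarize(readings, true_length=TRUE_LENGTH):
    """Describe how far a set of readings strays from the known length."""
    errors = [float(reading - true_length) for reading in readings]
    return {
        "share_exact": sum(1 for e in errors if e == 0) / len(errors),
        "mean_error": mean(errors),   # negative means the tape tends to read short
        "spread": pstdev(errors),     # larger means less consistent readings
    }


cloth_summary = summarize(cloth)
steel_summary = summarize(steel)
print(cloth_summary)
print(steel_summary)
```

Under these assumptions the cloth tape both reads short on average and scatters more widely, while the steel tape's two errors cancel out.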

Now let's see how some issues concerning the reliability of cloth and steel tape measures apply to the reliability of intelligence tests or personality questionnaires.

Standards for Measurement

First, even though the steel tape measure gave us much more consistent results than the cloth tape measure and therefore could be said to be more reliable, we have to remember that we were just pretending to know ahead of time that the board we were measuring was exactly 36 inches long. In most real life situations, we do not know for certain the real, actual amount of anything we measure before we measure it. So how in the world can we see how close a measuring device comes to the actual amount of something if we don't know the actual amount ahead of time?

For physical properties, this problem has been successfully handled by simply defining the three basic units of measurement (length, mass, and time) according to agreed-upon standards. These standards have changed over the history of measurement. For example, in A.D. 1120 the king of England declared that the standard of length would be called a yard, defined by the distance from the tip of his nose to the end of his outstretched arm. Modern physics has depended on various standards of distance, defining a meter until 1960 by a particular platinum-iridium bar stored under controlled conditions and since 1983 as the distance traveled by light in a vacuum in 1/299,792,458 of a second. As physics developed more reliable methods of measurement, we have been able to improve the measurement precision and accuracy to enable remarkable technological achievements, from producing nuclear energy to connecting the world through the Internet to safely flying more than 8 million people through the sky each day.

In psychology we have yet to establish such standards for measuring intellectual and personality traits. There is no platinum-iridium IQ or personality test. Part of the problem is that, unlike in physics, we are still arguing about what, exactly, is the nature of the psychological characteristics we are trying to measure. It is hard to settle on a unit of intelligence or personality when we are not in agreement about the definition of intelligence or personality. Still, over time, the psychological research community has tended to rally around preferred measures. For example, the NEO Personality Inventory has been used so often to measure the five major factors of personality that it has been called the "gold standard" for measuring those factors (Muck, Hell, & Gosling, 2007).

Assessing Reliability with Repeated Measurement

In our tape measure example we found that 98 out of 100 measurements with the steel tape produced the same result, while only 70 out of 100 measurements with the cloth tape produced the same result. Even without knowing the actual length of the board, we could say that the steel tape produces more consistent measurements, and is, in that sense, more reliable. But without knowing the actual length of the board, we wouldn't know for sure whether those 98 measurements of exactly 36 inches are right on the mark, consistently high, or consistently low. If you could not borrow the platinum-iridium bar to measure the wood or have a way of timing the fraction of a second it would take light to travel from one end of the piece of wood to the other, you wouldn't really know whether your steel tape's 98 measurements of 36" are right on the mark.

In psychology, where we do not even have a platinum-iridium bar, we've decided to accept finding the same measurement over and over again as sufficient evidence for the reliability of a psychological test or questionnaire. This might sound a little crazy, because you might think that a consistent score might be either a consistent overestimate or underestimate of someone's intelligence or conscientiousness. But using the consistency of scores to assess reliability in psychology is not as crazy as it might seem, as I will explain.

Unlike physical measurements, most psychological measurements are interpreted relatively, by comparing one person's score to other people's scores (e.g., this woman is more conscientious than 80% of women). This is not true for physical measurements, where the measurement of, say, a board represents how far the board's length is above zero rather than whether the board is shorter or longer than other boards.

Without an objective zero point for intelligence (what would it mean to have an intelligence of zero?) there is no way to describe an individual's actual intelligence level in objective units above zero. And because we can't describe an individual's actual intelligence level as "X units above zero," we cannot define reliability in terms of how close a score is to the actual level, X.

Instead, "actual intelligence" ends up being defined as how much higher or lower your score is than the average score for your reference group. And reliability is described by groups of people getting the same score with repeated measurement. If everyone gets the same score on several different testing occasions, then any individual's score will be consistently higher than, lower than, or right on the average score for the group.

Repeated Measurement Assumes Consistency of the Property You Are Measuring

When we measured the three-foot board 100 times with the two tape measures, we expected to get the same measurement each time because we assumed that the length of the board was not changing between measurements. And that was probably true if we took the measurements one right after the other. But what if we waited two weeks between measurements? Changes in heat and humidity might cause the board to shrink or lengthen slightly. A reliable steel tape measure would show different lengths for the board over these time periods, giving the impression that the tape measure is less reliable than it really is.

In psychology, one long-standing method for assessing reliability is the test-retest method. If a test is perfectly reliable, each person will receive the same score on both the first testing and retesting (which is often one or two weeks later, although any time interval can be used), but only if each person's level of actual intelligence or personality does not change over the time interval. By most definitions, intelligence and personality do not change over short periods of time, so if the scores on retest differ from the first testing, that indicates imperfect reliability.

In psychological measurement we like to quantify the amount of reliability of a test with a statistic called the Pearson correlation coefficient. There's no need to explain here how it is computed; you can look that up if you like. It is enough to know that Pearson correlation coefficients of reliability nearly always range between 0 and 1.00. (It is possible to find negative values for reliability correlations, but when this happens something is seriously, seriously wrong.) There is no one standard for acceptable reliability, but .70 is often suggested as a minimum level of acceptable reliability. Good personality tests regularly show reliabilities above .80, while good measures of intelligence and cognitive abilities often show reliabilities above .90. It is possible to draw tentative conclusions about the relation between psychological variables when the tests show reliabilities below .70. But one should never draw strong conclusions or make significant decisions about individuals with tests that do not meet the .70 standard.
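For readers who do want to see the computation, here is a minimal sketch of a test-retest reliability estimate. The scores are hypothetical, and the `pearson_r` helper is written out by hand for illustration rather than taken from any particular statistics package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical scores for the same eight people, tested twice two weeks apart.
first_testing = [12, 15, 9, 20, 17, 11, 14, 18]
second_testing = [13, 14, 10, 19, 18, 10, 15, 17]

test_retest_r = pearson_r(first_testing, second_testing)
meets_minimum = test_retest_r >= 0.70  # the commonly suggested floor mentioned above

print(round(test_retest_r, 2), meets_minimum)
```

Because nobody's score moves by more than a point between testings, people keep nearly the same rank order and the correlation comes out well above the .70 floor.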

Test-retest is not the only method for estimating the reliability of a psychological measure. If you want to estimate reliability with just one test administration, you can use the split-half method. In this method you give each person two scores, each based on half the items in the test. Typically we compute one score based on the odd-numbered items and one based on the even-numbered items, although there are many ways to group items to form two scores (e.g., summing items 1, 2, 5, 6, 9, and 10 makes one score and summing items 3, 4, 7, 8, 11, and 12 makes a second score). We compute a Pearson correlation coefficient between the two scores and then adjust it upward slightly with something called the Spearman–Brown formula because we know that tests with fewer items are less reliable than tests with more items.
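The odd-even version of the split-half method can be sketched as follows. The response data are hypothetical (six people answering six items on a 1-to-5 scale), and the helper names are illustrative. The Spearman–Brown step for two halves is the standard 2r / (1 + r).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def spearman_brown(half_r):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * half_r / (1 + half_r)


# Hypothetical data: each row is one person, each column one item (items 1-6).
responses = [
    [5, 4, 5, 5, 4, 5],
    [2, 1, 2, 2, 1, 1],
    [3, 3, 4, 3, 3, 4],
    [4, 5, 4, 4, 5, 4],
    [1, 2, 1, 2, 2, 1],
    [3, 2, 3, 4, 3, 3],
]

odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5
even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

half_r = pearson_r(odd_scores, even_scores)
split_half_reliability = spearman_brown(half_r)

print(round(half_r, 3), round(split_half_reliability, 3))
```

The adjusted value is always a little higher than the raw half-test correlation whenever that correlation falls between 0 and 1.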

The split-half method used to be very popular but has been replaced by a logical extension of it called Cronbach's Coefficient Alpha. Again, there is no need to bother with the math; we can think of Cronbach's Coefficient Alpha as the average of all the split-half reliabilities we could compute by all possible ways of dividing the items into two groups. Cronbach's Coefficient Alpha has become the most popular way of reporting estimates of the reliability of psychological measures. Again, an Alpha of .70 is generally regarded as the minimum level of acceptable reliability.
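If you are curious about the math after all, Coefficient Alpha is usually computed from item variances rather than by literally averaging split halves. Here is a minimal sketch using the usual textbook formula, with hypothetical responses from six people to six items. The function name is illustrative.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for rows of people by columns of item responses.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])  # number of items
    item_variances = [pvariance(item) for item in zip(*responses)]
    total_variance = pvariance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: each row is one person, each column one item (1-to-5 scale).
responses = [
    [5, 4, 5, 5, 4, 5],
    [2, 1, 2, 2, 1, 1],
    [3, 3, 4, 3, 3, 4],
    [4, 5, 4, 4, 5, 4],
    [1, 2, 1, 2, 2, 1],
    [3, 2, 3, 4, 3, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

When people who score high on one item also score high on the others, the variance of the total scores dwarfs the summed item variances and alpha approaches 1.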

Self-report questionnaires are not the only way we measure personality. We also often ask acquaintances to judge a person's personality by marking rating scales, sorting descriptive statements, or completing questionnaires written in the third person. Because these methods contain multiple items, we can compute Cronbach's Coefficient Alphas just like we do for self-reports. But there is another angle here because we have multiple people (sometimes up to 6 or 10) making the judgments. The amount of agreement among judges can be quantified by yet another variant of correlation called the Intraclass Correlation Coefficient or ICC. The ICC is a noteworthy form of measurement reliability because it shows the consistency of measurement across different judges instead of just the consistency of scores produced by individual persons.
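There are several versions of the ICC. As one illustration, here is a sketch of the simplest, the one-way ICC for a single judge, which treats each person's judges as interchangeable. The ratings are hypothetical, and the choice of this particular version is an assumption made for the example, not something the discussion above specifies.

```python
def icc_one_way(ratings):
    """One-way, single-judge ICC for rows of rated persons by columns of judges."""
    n = len(ratings)      # number of persons being rated
    k = len(ratings[0])   # number of judges per person
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    person_means = [sum(row) / k for row in ratings]

    # How much the rated persons differ from one another.
    ms_between = k * sum((m - grand_mean) ** 2 for m in person_means) / (n - 1)
    # How much judges disagree about the same person.
    ms_within = sum(
        (rating - m) ** 2 for row, m in zip(ratings, person_means) for rating in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical conscientiousness ratings: five persons, each rated by three judges.
ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
]

judge_agreement = icc_one_way(ratings)
print(round(judge_agreement, 2))
```

The value is high here because the judges disagree by at most one point about any person, while the persons themselves differ from one another by much more.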

Sources of Error that Detract from Reliability

In all forms of measurement there is some degree of measurement error. The greater the error, the lower the reliability of measurement. Even simple tasks of physical measurement involve measurement error, either due to the measuring tool itself or the way it is used by the person who is doing the measuring. In our example, the cloth measuring tape produced a number of readings that were either greater than or less than the actual length of the board. Perhaps the 25 readings that were too low resulted from stretching the tape inappropriately. And the 5 readings that were too high? Maybe the tape bunched up when it was placed on the board. Some of the erroneous readings could be due more to human carelessness than to the physical properties of the tape. Even with the very reliable steel tape one reading was too low and one was too high. We cannot always say how much of imperfect reliability is due to the measuring instrument itself and how much is due to the way it is used by the person who is measuring.

The same thing is true for psychological tests. I think there is a tendency by psychologists to think of reliability as a "property" of a test or questionnaire. But any time that tests are administered, the results can be affected by the behavior of the person administering the test—by their tone of voice and body language, even when standard instructions are being followed. And when tests are administered on the Internet, who knows how the conditions in a person's immediate environment (noise level, distractions from other people) and their own state of mind (whether they are attentive, sleepy, or drunk) are affecting the reliability of the test.

How Psychologists Create Reliability with Repeated Measurement

You might be familiar with an old carpenter's adage, "Measure twice, cut once." Actually, the original adage on which this saying is based is attributed to Giovanni Florio, 1591, "Alwaies measure manie, before you cut anie" (Rosenbaum, Vaughan, & Wyble, 2015, p. 10). The wisdom of this adage is its recognition of measurement error. In carpentry, it is good sense to measure a piece of wood multiple times before cutting it to avoid cutting a board too short and wasting wood. Repeated measurement improves reliability.

Similarly, in psychology we can increase measurement reliability by taking multiple measurements of any sort (be they self-judgments, acquaintance ratings, or laboratory measurements). As I noted earlier, when we have several acquaintances who are rating the same person's personality, we can assess reliability by the degree of agreement among judges. If that agreement is high enough we can then take the average judgment of all judges as our most reliable, accurate estimate of the person's personality. Such an average, composite judgment of personality will be more reliable than judgments from a single rater, and will be more accurate for predicting additional judgments of personality or future behavior (Hofstee, 1994). The theory behind this is that any individual judge might have some unique, idiosyncratic biases and errors in his or her judgments. A good friend might overestimate a person's conscientiousness, while a critical supervisor might underestimate it. But when you average judgments from a large set of judges, these unique biases and errors cancel each other out, leaving a more accurate, reliable estimate of personality.
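The cancelling-out argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions that go beyond the text: every judge's rating is the person's true level plus that judge's own random error, the errors are independent of each other, and they average zero. All names and numbers are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


random.seed(0)  # fixed seed so the simulation is repeatable
N_PEOPLE = 2000

# Assumed "true" trait levels, which we could never observe directly in real life.
true_levels = [random.gauss(0, 1) for _ in range(N_PEOPLE)]


def composite_rating(true_level, n_judges):
    """Average the ratings of n_judges, each with independent random error."""
    judge_ratings = [true_level + random.gauss(0, 1) for _ in range(n_judges)]
    return sum(judge_ratings) / n_judges


accuracy = {}
for n_judges in (1, 3, 10):
    composites = [composite_rating(level, n_judges) for level in true_levels]
    accuracy[n_judges] = pearson_r(composites, true_levels)

for n_judges, r in accuracy.items():
    print(n_judges, "judge(s):", round(r, 2))
```

In this simulation the averaged ratings track the true levels more closely as judges are added. If judges shared a common bias, averaging would not remove it, which is one reason the independence assumption matters.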

For self-judgments, we have only one self, so it is impossible to average information from multiple judges. However, the summing of responses to different items for measuring the same trait accomplishes the purpose. Normally, we think of adding items from a personality self-report questionnaire as showing the degree or level of some trait. For example, if we have a ten-item anxiety questionnaire, someone who answers all ten questions in a way that indicates anxiety would be said to have a high level of anxiety, someone who answered only about half the items this way would be said to have moderate anxiety, and someone who answered only one or two items this way, low anxiety. But Paul E. Meehl, who has been described as the most intelligent psychologist of our times, said that counting item responses is not like accumulating centimeters in physical measurement. Rather, extremely high or low scores merely represent an increased probability or confidence of correct decision-making. For our anxiety scale, from a score of 9 or 10 we can decide with confidence that the person is anxious, from a score of 1 or 2 that the person is calm, and from scores in the middle we can't decide anything about the person with confidence.

From this viewpoint, each item on the anxiety scale is basically asking "Is this person anxious or calm?" With the ten-item scale you are asking this question ten times. If the answer is the same every time (either 10 anxious answers or 10 calm answers) this indicates reliable measurement, just like finding a board is 36 inches long every time we measure it.

So summing the responses of 10 items on a personality scale can be seen as analogous to averaging the judgments of 10 acquaintances about the rated person's anxiety. (Whether you divide the sum by the number of items to get an average is unimportant; sums and averages provide the same information because they differ only by a constant factor, the number of items.)

This view of reliability has interesting implications for providing feedback to people who complete personality questionnaires. Because our confidence about questionnaire results is high only for relatively high or low scores, it is probably wise to return only three categories of feedback: one for relatively high scores, one for relatively low scores, and one for scores in the middle. This is precisely what Hofstee (1994) recommended, given the typical reliability of personality tests. Any feedback scheme attempting to use more than three categories (e.g., very low, moderately low, average, moderately high, very high) is likely to provide inconsistent results because you are trying to make decisions that are more fine-grained than the reliability of the questionnaire supports.

There are, of course, practical limits to increasing reliability by using more and more items on a questionnaire to measure a trait. A ten-item personality test will almost certainly be more reliable and useful than a one-item personality test. This is particularly true for assessing broad traits such as the Big 5 (Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Intellect/Imagination). And a 20-item measure should be more reliable than a 10-item measure. But will a 50-item measure be better than a 20-item measure? The problem with very long questionnaires is that respondents can become bored, fatigued, and sometimes even suspicious ("Why do they keep asking me the same question in different ways? If I answer inconsistently will I be penalized?"). If you want to measure a lot of different traits with one questionnaire, the questionnaire might have to be 200 or 300 or 400 items long. Questionnaires of this length have been used successfully. But optimal reliability demands a balance between using multiple measurements and limiting the length of measures to keep respondents engaged.
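The diminishing returns from adding items can be sketched with the general form of the Spearman–Brown formula, which projects the reliability of a test made n times as long. The starting reliability of .70 for a 10-item scale is an assumed value for illustration, and the projection assumes the added items are comparable to the existing ones. It says nothing about boredom or fatigue.

```python
def projected_reliability(current_reliability, length_factor):
    """Spearman-Brown projection for a test made length_factor times as long."""
    r = current_reliability
    return length_factor * r / (1 + (length_factor - 1) * r)


# Assumption for illustration: a 10-item scale whose reliability is .70.
TEN_ITEM_RELIABILITY = 0.70

projections = {
    n_items: projected_reliability(TEN_ITEM_RELIABILITY, n_items / 10)
    for n_items in (10, 20, 50)
}

for n_items, reliability in projections.items():
    print(n_items, "items:", round(reliability, 2))
```

Under these assumptions, doubling the scale to 20 items lifts the projected reliability from .70 to about .82, but adding 30 more items to reach 50 only lifts it to about .92. Each extra item buys less than the one before it, and that is before respondent fatigue is taken into account.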

Applying What You've Learned about Reliability to the Real World

I have been writing here as a professional in personality assessment. Although I have made a number of technical points about measurement reliability, I hope what I have written has been understandable. Contrary to my advice about long questionnaires, I have probably gone on about these issues longer than I should. I would like to end with some practical points about how you can apply the information I've presented here to your interaction with psychological measures.

First, the reliabilities of most so-called "quizzes" on the Web probably have not even been examined, much less reported. Go ahead; find a psychological quiz on Facebook, take it, and see if they tell you the Cronbach coefficient alpha reliability estimate for the measure. The unknown reliability of these informal quizzes means that you do not know how much measurement error you can expect from the quiz. You can, however, complete the quiz several times to see if it gives you the same result each time. If it does not, the quiz is unreliable (at least for you) and is basically useless to you.

Professionals are a lot better when it comes to reporting reliability because reviewers and editors require researchers to report this information for psychological tests and questionnaires in order for research to be published in professional journals. However, when a professional writes more informally for a general audience on the Web, he or she might omit that information. Unless you can obtain information about reliability from them, you need to take whatever they are saying with a grain of salt.

There is one group of professional researchers that has often been exempt (even though they should not be) from reporting measurement reliability: experimentalists who present stimuli to research participants (either in a laboratory or real-life situations) and measure their reactions. For example, a famous study by Hartshorne and May investigated the consistency of honesty in school children by giving them opportunities to lie or cheat in different school situations. The typical correlation between any two such situations was only .23, leading many to conclude that honesty/dishonesty is not a consistent trait. The problem with this conclusion is that each of Hartshorne and May's test situations can be thought of as a one-item test with unknown (but probably low because it is only one item) reliability. In fact, combining their single-item tests into multi-item measures yields reliability estimates in the .70s or .80s (Epstein & O'Brien, 1985). So, the next time an experimentalist (or anyone, for that matter) tries to tell you that inconsistent behaviors across two experimental situations prove that there is no consistency to personality, remember that the one-item behavioral measures in the two situations are likely to have low reliability and be skeptical about those conclusions.

Finally, it is important to remember that reliability is not validity. Reliability indicates measurement precision, reflected in producing similar measurements on multiple occasions. Validity, on the other hand, refers to whether a measurement procedure is actually measuring what it is supposed to measure. Someone might tell you that a certain quiz will show you how much social intelligence you have. Furthermore, the quiz has demonstrated reliability: the test-retest correlation over a two-week period is .90 and a Cronbach alpha of .85 has been computed for a research sample. But how do we know that this quiz actually measures social intelligence and not something else? How do we know that it is not simply reliable but also valid? Explaining validity is the topic of my next blog post.

References

Epstein, S., & O'Brien, E. J. (1985). The person-situation debate in historical and current perspective. Psychological Bulletin, 98, 513-537.

Hofstee, W. K. B. (1994). Who should own the definition of personality? European Journal of Personality, 8, 149-162.

Muck, P. M., Hell, B., & Gosling, S. D. (2007). Construct validation of a short five-factor model instrument: A self-peer study on the German adaptation of the ten-item personality inventory (TIPI-G). European Journal of Psychological Assessment, 23, 166–175.

Rosenbaum, D. A., Vaughan, J., & Wyble, B. (2015). MATLAB for behavioral scientists (2nd ed.). New York: Routledge.

Thorndike, E. L. (1918). The nature, purposes, and general methods of measurements of educational products. In G.M. Whipple (Ed.), The seventeenth yearbook of the National Society for Study of Education. Part II. The measurement of educational products (pp. 16-24). Bloomington, IL: Public School Publishing Co.
