My eldest son's girlfriend works as a diamond grader. Pretty cool job. She works in a room with trays of diamonds, examining each one and scoring it on clarity, flaws, etc.
We were chatting about this last night over dinner and I realized her job illustrates two of the exact same principles I was teaching in my statistics course last week: minimizing measurement error and test-retest reliability.
If you imagine that a diamond grader sits at a desk with a bright light, a magnifying glass, and trays full of diamonds, you'd be absolutely correct. What you might not imagine is that many diamonds are very, very tiny. If you've ever dropped a stone from a piece of jewelry, you know how hard they are to handle. Sometimes, she tells me, they stick to your finger like sand and it's hard to know if you've put it down.
Except this sand is very, very expensive.
As you'd expect, security is important. You go to the supervisor and s/he hands you a precisely weighed tray of diamonds. You are responsible for returning EXACTLY that same weight.
After grading, each diamond is weighed to a particular level of precision (two decimal places) and put in a pile with other stones of similar weight. Each pile of stones is then weighed, also to two decimal places. After all the stones are graded, you add up the weights of the piles. They have to add up to what you started with. They are very, very unhappy when they don't.
Why don't they weigh each diamond instead of giving you perhaps 100 at a time? Measurement error.
When checking that graders have returned all their stones, supervisors look at masses of diamonds, not individual ones. Why? Measurement error. With each stone you will have some rounding and measurement error. You might be slightly low or slightly high at the second decimal place, which isn't recorded. With very small weights, it is easy for small fluctuations (the sweat from your hand) to change the weight. The ERROR (differences between actual weight and recorded weight) will be large relative to the TRUE weight.
This brings us to test-retest reliability. Test-retest reliability is the extent to which you get the same score every time you measure something. I weigh myself every morning. My new digital bathroom scale will give me a number showing my weight. If I get off the scale, let it recalibrate itself, and get back on, it will almost always give me a weight that is identical to the previous one. This shows good test-retest reliability. My weight hasn't changed in the previous two minutes. Neither has the number I am using to measure my weight.
Random error relative to true score. My old bathroom scale was one I picked up on the sidewalk that someone else had thrown away. It was mechanical, not digital. Its reading varied by several pounds depending on where I shifted my feet or where I put it on the floor. When used to measure my weight fluctuations, it had somewhat poor test-retest reliability, because it was less precise than I needed to tell whether I had gained or lost a pound. Look at the picture below. In this illustration, my true weight has changed two pounds. But because the scale is inaccurate (it might read two pounds low or two pounds high), I can't see the true change in my weight. From the scale readings, it looks like my weight has gone up and then gone down again, because differences in the magnitude of the true score are small relative to the error of measurement.
On the other hand, even my lousy bathroom scale would be perfectly adequate for judging whether some people are heavier than others. If you look at the second graph, I'm trying to find out whether there are differences among people who weigh from 120 to 160 pounds. That same +/- two pound error is small relative to the magnitude of differences between people.
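If you'd rather see this in code than in a graph, here's a little Python simulation of both situations. All the numbers (the +/- two pound error, the weights, the two-pound weekly change) are invented for illustration:

```python
import random

random.seed(1)

def noisy_scale(true_weight):
    """Old mechanical scale: each reading is off by up to +/- 2 pounds."""
    return true_weight + random.uniform(-2, 2)

# One person whose true weight drifts down 2 pounds over a week: the
# error is as big as the change, so the trend can easily be hidden.
true_weights = [180 - 2 * day / 6 for day in range(7)]
readings = [noisy_scale(w) for w in true_weights]
print([round(r, 1) for r in readings])

# Two people who truly weigh 120 and 160 pounds: the 40-pound gap is
# huge relative to +/- 2 pounds of error, so the comparison survives.
light, heavy = noisy_scale(120), noisy_scale(160)
print(round(light, 1), round(heavy, 1))
```

With a two-pound error band, the two-pound weekly trend can vanish into the noise, but the forty-pound gap between people never does.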
Back to diamonds. Because diamonds weigh so little, the error of measurement is going to be relatively large. One way to compensate for that is to aggregate measures.
If I put many diamonds together and measure them all at once, the error of measurement will become small relative to the weight of all the diamonds together. Pretty cool.
This gets us to random error. Error that is truly random averages out. Literally, it averages out. Half the time the measured score should be higher than the true score; half the time it should be lower. If you measure the same thing over and over and average the results, those errors should cancel each other out and you should get the true score. Even in those larger categories of graded diamonds, there is some measurement error. However, if you have many categories of stones, adding them together should give you a correct answer. If it doesn't, they start checking: first re-weighing each diamond, then sweeping the desk, bench, and the area around them.
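Here's a rough Python sketch of the idea. The stone weights and the size of the scale error are invented numbers, but the arithmetic is the point:

```python
import random

random.seed(42)

# Invented data: 100 tiny stones, each 0.01 to 0.05 carats.
true_weights = [random.uniform(0.01, 0.05) for _ in range(100)]

def weigh(w):
    """Reading = true weight + a little scale error, recorded to
    two decimal places (the rounding is itself a source of error)."""
    return round(w + random.uniform(-0.005, 0.005), 2)

# Weigh each stone on its own: the recorded value can be off by a
# large fraction of the stone's tiny true weight.
per_stone = sum(abs(weigh(w) - w) / w for w in true_weights) / 100

# Weigh the whole tray at once: the same-sized error is now tiny
# relative to the combined weight of all the stones.
true_total = sum(true_weights)
tray = round(true_total + random.uniform(-0.005, 0.005), 2)
aggregate = abs(tray - true_total) / true_total

print(f"average per-stone relative error: {per_stone:.1%}")
print(f"whole-tray relative error: {aggregate:.2%}")
```

The absolute error is about the same in both cases; what shrinks is the error relative to the true weight being measured.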
Psychologists (and professors) use the same principles in developing measures; the exact same logic applies when creating the items for a test. Imagine you are trying to develop a new measure. Last week my class worked on a project where we tried to develop a 10-item scale to measure student conscientiousness. We wanted a Likert scale, meaning that people rated themselves on how well each item described them, from 1 (not at all like me) to 5 (describes me exactly).
To start with, we developed more than 60 potential questions that we could ask, all of which were thought to tap student conscientiousness. A few sample items are listed below.
Once we picked our final 10 questions, our plan was to average the answers to all 10 to give each person a score, with 5 being the highest (very conscientious) and 1 the lowest (not conscientious). The goal was for each of the ten questions to get at some aspect of student conscientiousness. Essentially, we were asking the same question 10 different ways.
Why didn't we just ask one really good question? It is rare for psychologists to develop a measure or questionnaire that is just one question long; usually we have at least three. Many intro psych students have filled out depression inventories or measures of self-esteem that are 10-100 questions long. Why? For the same reason that diamond graders weigh large numbers of diamonds rather than each diamond one at a time: measurement error.
Every time you answer a question on a questionnaire, your answer reflects two components: your true score and measurement error. In the case of our measure, your true score would reflect how conscientious you were as a student. But other things go into your answer as well. For example, it would also reflect how well you read and how you interpreted the question. One person might understand 'working on tough assignments until I've exhausted all my options' as being a 20 minute process of trying to figure something out. Another person might interpret that as meaning re-reading their book, going through their notes, checking on the internet, talking to a TA or going to help desk, and finding other students. The difference in how people READ the question helps determine how they answer the question. And the differences in their scores that are due to interpretation, not conscientiousness, are error.
Some error is systematic (I tend to circle 5s or 1s and avoid 2, 3, and 4, while you might like those middle numbers). But some is just random. Asking more questions on our questionnaire will minimize random error, just as weighing more diamonds at a time minimizes random errors in weight.
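A quick simulation makes the point. The "true" conscientiousness level and the size of the interpretation noise here are invented numbers:

```python
import random

random.seed(0)

def likert_item(true_score):
    """One 1-5 rating: the true level plus random 'interpretation'
    noise, rounded to a whole number and clipped to the scale."""
    rating = round(true_score + random.gauss(0, 1))
    return min(5, max(1, rating))

true_score = 3.6  # invented "true" conscientiousness level

# Average absolute error over many simulated respondents:
trials = 2000
err_1_item = sum(abs(likert_item(true_score) - true_score)
                 for _ in range(trials)) / trials
err_10_items = sum(abs(sum(likert_item(true_score) for _ in range(10)) / 10
                       - true_score)
                   for _ in range(trials)) / trials
print(round(err_1_item, 2), round(err_10_items, 2))
```

Averaging ten noisy items lands much closer to the true score, on average, than any single item does: the item-by-item random errors partly cancel.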
The same principle applies in classroom tests. A long test with many questions will tend to minimize the effect of students randomly misreading one question or randomly guessing on a question they didn't happen to study. On a short test, those random errors have a much larger effect on scores.
Although long tests can be exhausting, they tend to reflect true knowledge better than a short test made up of similarly good questions.