When most people think of tests, they think about things they TAKE. Especially this time of year, when students are hunkered down for finals.
But I'm a professor. When I think of tests, I think of writing and giving them.
How psychologists think about measurement
The other day, I was writing a test for my research methods class, focusing on correlations and regression, with a bit of t-tests, ANOVAs, crosstabs, and z's on the side. Although it was a unit test, not a final exam, it was cumulative in the sense that I expected students to be able to apply material from all parts of the semester. We had begun the unit on correlation by talking about measurement and how psychological measures are designed, so measurement was on my mind.
When designing the measure of a construct (what you're trying to assess), psychologists need to deal with two major issues: validity and reliability.*
* I'm going to be talking about paper-and-pencil measures - questionnaires - although similar principles hold in all areas of assessment. I'll also be focusing on measures whose goal is to assess individual differences.
For example, if I am trying to measure emotional intimacy (my construct), the measure is valid to the extent that the score derived from the scale I use to assess emotional intimacy accurately reflects individual differences in people's intimacy and captures all aspects of that construct. It is reliable to the extent that every time I administer the scale to the same person I get the same score (assuming their intimacy hasn't changed) and that every item in my scale measures some aspect of emotional intimacy.
Validity is reduced to the extent that my intimacy scale works differently for people who have the same level of intimacy. For example, if I measure emotional intimacy in words that women tend to be comfortable with but men aren't (sharing, close, exposing my thoughts and feelings), then women will tend to have higher intimacy scores EVEN IF THEY AREN'T MORE INTIMATE. That reduces the validity of my measure.
Reliability is reduced to the extent that someone's score on a measure will bounce around depending on things that have nothing to do with the construct. For example, if I just had a romantic evening with my husband I might be inclined to answer questions about emotional intimacy more positively than I would if we just had a fight, even if our intimacy hadn't changed.
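Reliability can actually be quantified. One common index of the "every item measures the same thing" kind of reliability is Cronbach's alpha. Here is a minimal sketch of the computation; the three-item intimacy scale and the scores below are invented purely for illustration, not real data.

```python
import statistics

def cronbach_alpha(items):
    """Internal-consistency reliability (Cronbach's alpha).

    items: list of item-score columns; items[i][p] is person p's
    score on item i. Uses sample variances throughout.
    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of totals)
    """
    k = len(items)                      # number of items
    n = len(items[0])                   # number of respondents
    item_vars = [statistics.variance(col) for col in items]
    totals = [sum(items[i][p] for i in range(k)) for p in range(n)]
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 5-point responses from six people on a three-item scale.
item1 = [4, 5, 3, 2, 4, 1]
item2 = [4, 4, 3, 2, 5, 2]
item3 = [5, 4, 2, 3, 4, 1]
print(round(cronbach_alpha([item1, item2, item3]), 2))  # prints 0.92
```

When the items hang together - people who score high on one tend to score high on the others - alpha approaches 1; items that don't measure the same construct drag it down.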
Designing a good scale - to measure intelligence, parental monitoring, or love - is very difficult. Psychologists can spend years developing a good measure. Draft items are developed, different versions of the scale are tested, and the advantages and disadvantages of different variants explored.
Often, graduate students will take a full semester course just on the topic of measurement. And it's big business, too. All those standardized tests assessing school achievement - including the SAT, GRE, and ACT - are developed by psychologists. (Want a job? Study statistics.)
So how SHOULD we develop final exams?
Which got me thinking. Exams are, of course, measures. In most classes, their job is to rank students according to their knowledge of the course material.
What are the steps to developing a good exam? The same as for developing any other measure: thinking about validity and reliability.
First, the domains of knowledge the course covers must be defined. Second, sample items (questions) must be mapped out assessing each domain. They need to have face validity in that - on the face of it - they appear to be measuring the material covered in class. Third, items must be written so that they assess the material and ONLY THE MATERIAL.
And there are trade-offs, just as in developing any other measure. Multiple-choice items, for example, can be graded with perfect consistency but may miss deeper understanding; essay questions probe depth but are harder to grade reliably.
And then there's the setting: time limits, anxiety, and distractions can all shift scores in ways that have nothing to do with what students know.
And, of course, we need to grade reliably (equivalent answers always get exactly the same scores).
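Grading reliability can be checked, too. A standard way is to have two graders mark the same answers and compute Cohen's kappa, which corrects raw agreement for agreement expected by chance. The letter grades below are hypothetical, made up only to show the arithmetic.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two graders' categorical marks."""
    n = len(rater_a)
    # Proportion of answers where the two graders agree outright.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each grader's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical letter grades two graders gave the same ten essay answers.
grader1 = ["A", "A", "B", "B", "C", "A", "B", "C", "C", "B"]
grader2 = ["A", "B", "B", "B", "C", "A", "B", "C", "B", "B"]
print(round(cohen_kappa(grader1, grader2), 2))  # prints 0.69
```

A kappa of 1 means the graders always agree; 0 means they agree no more often than chance. Here the two graders agree on 8 of 10 answers, but after the chance correction the agreement looks less impressive - which is exactly the point of the statistic.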
It's all very complicated.
And how DO professors develop tests?
In reality, most professors develop exams as best they can. Few have any formal training in assessment (the field that focuses on how to accurately measure performance). Although many professors spend most of their time teaching, most of us have no formal training in education whatsoever.
So we tend to write questions that sound good and make sense to us.
We try to minimize cheating by writing new exams every semester, which means we never get a chance to weed out bad questions and develop really good measurement instruments.
We often use the same types of tools used by our own professors to assess the skills and learning of our students instead of thinking about what would work best.
We often don't think clearly enough about our course goals to accurately measure them.
And sometimes our questions aren't clear enough, so different students interpret them differently - and we only give credit to interpretations that match our own.
And all this happens despite our best efforts and all our hard work.
For better or for worse.