How to Write a Final Exam
Exams are measures too
Posted December 11, 2012
When most people think of tests, they think about things they TAKE. Especially this time of year, when students are hunkered down for finals.
But I'm a professor. When I think of tests, I think of writing and giving them.
How psychologists think about measurement
The other day, I was writing a test for my research methods class, focusing on correlations and regression, with a bit of t-tests, ANOVAs, crosstabs, and z's on the side. Although it was a unit test, not a final exam, it was cumulative in the sense that I expected students to be able to apply material from all parts of the semester. We had begun the unit on correlation by talking about measurement and how psychological measures are designed, so measurement was on my mind.
When designing the measure of a construct (what you're trying to assess), psychologists need to deal with two major issues.*
- Validity is the extent to which the measure assesses what you are trying to measure.
- Reliability is the extent to which the measure is stable.
* I'm going to be talking about paper-and-pencil measures - questionnaires - although similar principles hold in all areas of assessment. I'll also be focusing on measures whose goal is to assess individual differences.
For example, if I am trying to measure emotional intimacy (my construct), the measure is valid to the extent that the score derived from the scale I use to assess emotional intimacy accurately reflects individual differences in people's intimacy and captures all aspects of that construct. It is reliable to the extent that every time I administer the scale to the same person I get the same score (assuming their intimacy hasn't changed) and that every item in my scale measures some aspect of emotional intimacy.
Validity is reduced to the extent that my intimacy scale works differently for people who have the same level of intimacy. For example, if I measure emotional intimacy in words that women tend to be comfortable with but men aren't (sharing, close, exposing my thoughts and feelings), then women will tend to have higher intimacy scores EVEN IF THEY AREN'T MORE INTIMATE. That reduces the validity of my measure.
Reliability is reduced to the extent that someone's score on a measure will bounce around depending on things that have nothing to do with the construct. For example, if I just had a romantic evening with my husband I might be inclined to answer questions about emotional intimacy more positively than I would if we just had a fight, even if our intimacy hadn't changed.
Designing a good scale - to measure intelligence, parental monitoring, or love - is very difficult. Psychologists can spend years developing a good measure. Draft items are developed, different versions of the scale are tested, and the advantages and disadvantages of different variants explored.
Often, graduate students will take a full-semester course just on the topic of measurement. And it's big business, too. All those 'standardized' tests assessing school achievement - including the SAT, GRE, and ACT - are developed by psychologists. (Want a job? Study statistics.)
So how SHOULD we develop final exams?
Which got me thinking. Exams are, of course, measures. In most classes, their job is to rank students according to their knowledge of the course material.
What are the steps to developing a good exam? The same as for developing any other measure: thinking about validity and reliability.
- First, the goals of the course and the material covered must be mapped out. This is equivalent to defining the construct to be assessed in a measure of, for example, love or self-esteem, or legitimacy of parental authority.
- A test is VALID to the extent that the material covered on the test accurately reflects the material that was to have been mastered in the course. This, in turn, should reflect the course goals.
- Second, sample items (questions) must be mapped out assessing each course domain. They need to have face validity in that - on the face of it - they appear to be measuring the material covered in class.
- Third, items must be written so that they assess the material and ONLY THE MATERIAL.
- This is always the hard part. If I write a question that is hard for people to understand, then I am assessing reading ability and not knowledge of statistics or psychology.
- If I write an essay question where I have specific answers in mind but don't make the requirements clear to my students, then test-savvy students will do better than test-naive students, even if they don't know the material better. Again, that undermines the validity of my test.
And there are trade-offs, just as in developing any other measure.
- The more items in a measure, the easier it is to accurately assess all aspects of the construct. You aren't just sampling the construct; you're getting at all parts of it.
- In addition, the more questions you have, the less 'noise' in the test score from randomly asking a question a student just doesn't know (good reliability).
- But the more items you have, the longer the test takes and the more exhausted the poor student becomes. Thus concentration, patience, and perseverance are also assessed in a long test, reducing validity.
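Psychometrics even quantifies the first half of this trade-off. The Spearman-Brown prophecy formula predicts how reliability changes when a test is lengthened or shortened by some factor. A quick sketch, with made-up numbers:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test
# whose length is multiplied by a factor n, given current reliability r.
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

# An exam with reliability .70 (invented), doubled in length:
print(round(spearman_brown(0.70, 2), 2))    # → 0.82

# The same exam cut in half:
print(round(spearman_brown(0.70, 0.5), 2))  # → 0.54
```

The formula captures only the statistical side; it says nothing about the fatigue that a doubled exam inflicts, which is exactly the validity cost described above.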
And then there's the setting.
- Timed tests - the standard - are 'fair' to the extent that everyone is taking the test under identical conditions (it is 'controlled').
- On the other hand, timed tests measure reading and writing speed as well as knowledge of material. Although any given person will be able to perform faster when they know material better, individual differences in speed can overwhelm that. So students who read slowly, who know English as a second language, who write physically slowly, or who are more thoughtful in their responses may know the material tested as well as other students, but perform more poorly.
- On the third hand, if you give extended time, students who have other courses (or work) immediately after the scheduled exam period are systematically disadvantaged.
And, of course, we need to grade reliably (equivalent answers always get exactly the same scores).
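Grading reliability is itself measurable: have two graders score the same answers and check how often they agree beyond what chance would produce. Cohen's kappa is one standard statistic for this; here's a minimal sketch, with invented grades:

```python
# Cohen's kappa: inter-rater agreement corrected for chance agreement.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    # Proportion of answers the two graders scored identically.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each grader's marginal rates.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Two graders' letter grades for the same ten essays (made up):
g1 = ["A", "B", "B", "C", "A", "B", "C", "C", "A", "B"]
g2 = ["A", "B", "C", "C", "A", "B", "C", "B", "A", "B"]
print(round(cohens_kappa(g1, g2), 2))  # → 0.7
```

A kappa near 1 means equivalent answers are getting the same scores; a kappa near 0 means the graders agree no more often than chance.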
It's all very complicated.
And how DO professors develop tests?
In reality, most professors develop exams as best they can. Few have any formal training in assessment (the field that focuses on how to accurately measure performance). Although many professors spend most of their time teaching, most of us have no formal training in education whatsoever.
So we tend to write questions that sound good and make sense to us.
We try to minimize cheating by writing new exams every semester so we never have a chance to weed out bad questions and develop really good measurement instruments.
We often assess our students' skills and learning with the same types of tools our own professors used, instead of thinking about what would work best.
We often don't think clearly enough about our course goals to accurately measure them.
And sometimes our questions aren't clear enough, so different students interpret them differently and we only recognize interpretations that match our own.
And all this happens despite our best efforts and all our hard work.
For better or for worse.