Should Personality Inventories Be Gender Normed?
Gender-normed personality tests are not intrinsically sexist.
Posted January 26, 2025 Reviewed by Gary Drevitch
Key points
- Most personality inventories compare a respondent's scores to a sample of male or female respondents.
- For traits that show sex differences, identical scores can be interpreted differently for men and women.
- A science reporter has claimed that using different score interpretation criteria for men and women is sexist.
- The current post explains why gender norming is not sexism but, rather, helpful, objective information.
Five years ago, Olivia Goldhill, then a science reporter for the online news website Quartz, interviewed me and several other personality psychologists about the use of gender norms for interpreting scores on personality inventories. The result of these interviews was an article, "We took the world’s most scientific personality test—and discovered unexpectedly sexist results."
Goldhill's concern was that when she completed several Big Five personality inventories, her level of agreeableness was consistently interpreted as higher when she indicated that she was a man than when she indicated that she was a woman, even though she responded to the items in exactly the same way each time. She concluded that this was prima facie evidence that these personality inventories are sexist, because women need to achieve a higher score on Agreeableness than men to be judged as highly agreeable.
If we were talking about a high school mathematics test, using different grading keys for girls and boys would certainly be seen as sexist and unfair. If girls needed to answer 95% of the math problems correctly to earn an A but boys needed to answer only 85% correctly, the grading system would obviously be biased against girls.
But personality tests are not like math tests. Math problems assess your knowledge of mathematics, and they do so by determining if you know how to get the right answers. But items on personality tests do not have right and wrong answers. Rather, items on personality tests are designed to reflect differences in individuals' thoughts, feelings, and behavior. A reliable, valid personality test simply reflects who you really are, based on your responses to the items on the test.
How do we know if a personality test is valid—that is, if it reflects who you really are? The long answer can be found in one of my previous PT blog posts. The short answer is that scores on a valid test will predict things that our theories say they should predict. For example, a valid Extraversion-Introversion scale should predict how often you engage in social interaction and whether others see you as more extraverted or introverted. A valid Agreeableness-Disagreeableness scale should predict how often you agree with or argue with others and whether others see you as more agreeable or disagreeable.
Now, it is a scientific fact that, on average, women are more pleasant, agreeable, and empathic than men. Differences in biology, childrearing, and cultural conditioning all contribute to this personality difference, but how this happens is a question for another time. The important consequence of the fact that women are more agreeable than men is that men have a lower bar to clear than women to be considered above average in agreeableness. If we could compile a checklist of specific behaviors associated with agreeableness, a man who exhibits exactly the same agreeable behaviors as a woman would tend to be perceived by others as more agreeable because men, on average, show fewer agreeable behaviors than women. It is like saying, "He is pretty agreeable—for a man."
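For readers who like to see the arithmetic, here is a minimal sketch of how norming turns one raw score into two different interpretations. The means and standard deviations are hypothetical numbers invented for illustration, not published norms for any inventory, and the assumption that scores are normally distributed within each group is a simplification.

```python
from statistics import NormalDist

# HYPOTHETICAL norm-group statistics, invented for illustration only.
# They are not taken from the NEO, the IPIP-NEO, or any published norms.
HYPOTHETICAL_NORMS = {
    "women": NormalDist(mu=45.0, sigma=8.0),
    "men": NormalDist(mu=41.0, sigma=8.0),
}


def percentile_in_group(raw_score: float, group: str) -> float:
    """Percent of the reference group expected to score below raw_score."""
    return 100.0 * HYPOTHETICAL_NORMS[group].cdf(raw_score)


# The same raw Agreeableness score, compared against each reference group.
raw = 45.0
for group in HYPOTHETICAL_NORMS:
    print(f"{group}: percentile {percentile_in_group(raw, group):.1f}")
```

With these made-up numbers, a raw score of 45 is exactly average among women (50th percentile) but sits at about the 69th percentile among men. The responses are identical; only the comparison group changes.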
Again, if an Agreeableness scale were a math test, the different standards for interpreting male and female scores would certainly be sexist. Higher math scores are better than lower math scores, so everyone wants a higher math score. But are higher Agreeableness scores always better than lower Agreeableness scores? On the surface, it may seem so.
Consider the contrast between the two interpretations of Goldhill's agreeableness that she received from Costa and McCrae's NEO Five-Factor Inventory. When she completed it as if she were a man, the description was "compassionate, good-natured, and eager to cooperate and avoid conflict." When she completed it as a woman, the description was "Generally warm, trusting, and agreeable, but you can sometimes be stubborn and competitive." The description for a man does seem "nicer" than the description for a woman. But is higher agreeableness always better in every situation than lower agreeableness?
Goldhill correctly claims that this is not the case, and she chides the personality inventory authors who do not emphasize that the value of high or low agreeableness depends on context. She notes that my own IPIP-NEO inventory recognizes that the value of agreeableness depends on context by quoting a portion of the report generated by my inventory for agreeableness. This is what the report says:
"Agreeableness is obviously advantageous for attaining and maintaining popularity. Agreeable people are better liked than disagreeable people. On the other hand, agreeableness is not useful in situations that require tough or absolute objective decisions. Disagreeable people can make excellent scientists, critics, or soldiers." I might add science reporters.
What Goldhill does not mention is that I emphasize how the value of traits depends on context in the heading of every personality report with the following lines: "Please keep in mind that 'low,' 'average,' and 'high' scores on a personality test are neither absolutely good nor bad. A particular level on any trait will probably be neutral or irrelevant for a great many activities, be helpful for accomplishing some things, and detrimental for accomplishing other things."
Sexism is usually defined as a negative prejudice against one of the sexes (usually women) with no objective basis. Gender-normed interpretations of personality scores are not negative prejudices. As Goldhill and I both recognize, low agreeableness is not necessarily a negative attribute because it can be quite useful in certain occupations and situations. Furthermore, validated Agreeableness scores do have an objective basis: the validity studies showing that scores predict relevant behaviors, life events, and how a person is perceived by others.
Goldhill's claim of systematic sexist bias in personality inventories against women is further undermined by the fact that men need a lower raw score on Neuroticism than women to be classified as non-neurotic. This is not because the interpretation of Neuroticism scores is biased against men. It is because men, on average, present fewer signs of neuroticism (anxiety, depression, self-consciousness, etc.) than women in everyday life. As a result, the cutoff for a "low" Neuroticism score sits at a lower raw score for men than for women.
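The same logic can be run in the other direction, from a percentile cutoff back to a raw score. The sketch below again uses hypothetical, invented norms, and it assumes that "low" means the bottom 30% of the reference group; real inventories may define their categories differently.

```python
from statistics import NormalDist

# HYPOTHETICAL Neuroticism norms, invented for illustration only.
HYPOTHETICAL_NEUROTICISM_NORMS = {
    "women": NormalDist(mu=34.0, sigma=9.0),
    "men": NormalDist(mu=30.0, sigma=9.0),
}

# Assumption: "low" means falling in the bottom 30% of the reference group.
LOW_FRACTION = 0.30


def low_cutoff(group: str) -> float:
    """Raw score below which a respondent is in the bottom LOW_FRACTION of the group."""
    return HYPOTHETICAL_NEUROTICISM_NORMS[group].inv_cdf(LOW_FRACTION)


for group in HYPOTHETICAL_NEUROTICISM_NORMS:
    print(f"{group}: a 'low' score is a raw score below {low_cutoff(group):.1f}")
```

Because the hypothetical male distribution is centered four points lower, the male cutoff for "low" also lands four raw-score points lower than the female cutoff.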
Vocational-personality psychologist John Holland went through a period in which he was criticized for using gender-based norms to interpret scores on his inventories of the six vocational-personality types, such as the Vocational Preference Inventory. His critics demanded that he interpret scores as low or high based on the same rules for males and females because the use of different cutoff scores was, in their view, intrinsically sexist. They were also outraged that women tend to score higher on Social interests (the helping professions) and men tend to score higher on Realistic and Investigative interests (engineering and science), a pattern that reflects cultural stereotypes. What his critics failed to see is that Holland's gender-based norms actually made it more likely for the sexes to consider non-stereotypical careers. Girls have a lower bar than boys for their scores on Realistic and Investigative to be judged as above average, which points them toward careers in science and engineering.
In short, gender norming for personality inventories is not necessarily prejudice against women. It is a method for helping men and women see themselves accurately, as others see them, so that they can make realistic choices about careers and other important life decisions.
[Incidentally, I have temporarily stopped using gender norms for my online IPIP-NEO inventory. Score interpretation is currently based on the sample of over a million people who have completed the inventory. The latest iteration of the inventory asks for both sex assigned at birth and current gender identity. When enough new data are collected, respondents will be given the choice of which reference group they want to be compared to, with caveats on interpreting scores realistically.]