A More Optimistic Perspective on Performance Appraisals

Discover a more promising outlook on the accuracy of performance appraisals.

Key points

  • Performance appraisals are crucial for organizational decisions but face skepticism regarding their accuracy.
  • Recent research indicates that performance appraisal reliability is higher than previously believed.
  • Further improvements to performance appraisals can likely be achieved by using best-practice designs.

The clocks just turned back with the end of daylight saving time, and managers can look forward to the upcoming performance appraisal season. Yes, every manager’s least favorite activity: constrained budgets, rigid rating scales, contrived competencies, the uncomfortable feeling that comes with judging the worth of other human beings, and, of course, the reality of having to give negative reviews to your employees. Not fun.

Performance appraisals (PAs) are consequential to organizations and employees, influencing developmental feedback, pay decisions, and promotional opportunities. Yet, there have long been concerns over the accuracy of PA ratings. When a supervisor marks an employee as a “3 – meets expectations” on a 1-to-5 scale, is that truly reflective of that person’s performance? Did the person who received a rating of 4 actually exhibit superior performance? Think back to times when you were rated and how you felt.

Reliability of Ratings

The reliability of PA ratings is a statistical index that loosely reflects how much two raters agree with one another when assessing the same employee. If two managers cannot agree that Joe is a good performer, how can we trust that either manager's evaluation actually reflects Joe’s performance at work?

For decades, academic research has worked under the assumption that the reliability of PA ratings was low—so low, in fact, that the most commonly used estimate of PA reliability suggested that only approximately 50 percent of a manager’s PA rating was due to the rated employee’s performance behaviors (Viswesvaran et al., 1996). That’s a disappointing value, especially considering how important PA ratings are to an employee’s career (pay increases!).
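For readers curious about the statistic itself, here is a minimal sketch (not from the research described here, using made-up numbers) of how interrater reliability is often estimated: as the correlation between two direct supervisors’ ratings of the same employees, which is loosely read as the share of rating variance attributable to the employee’s actual performance.

    # A minimal illustration with hypothetical data; not the authors' analysis.
    import numpy as np

    # Hypothetical 1-to-5 ratings of ten employees from two direct supervisors.
    supervisor_a = np.array([3, 4, 2, 5, 3, 4, 3, 2, 4, 5])
    supervisor_b = np.array([3, 4, 3, 4, 3, 5, 2, 2, 4, 4])

    # Interrater reliability is commonly indexed by the correlation between raters.
    reliability = np.corrcoef(supervisor_a, supervisor_b)[0, 1]

    # Loosely, this coefficient is the proportion of rating variance that
    # reflects the employee's actual job performance.
    print(f"Estimated interrater reliability: {reliability:.2f}")

A value around .50 corresponds to the older estimate described above; the newer direct-supervisor estimate discussed below corresponds to roughly .65.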

Recently, my colleagues (Angie Delacruz, Lauren Wegmeyer, and James Perrotta) and I conducted a study that addressed some methodological challenges with previous PA reliability estimates (in press at the Journal of Applied Psychology). Specifically, we conducted a meta-analysis (a study that aggregates the results from many studies) that isolated PA reliability estimates to situations where an employee was rated by two direct supervisors—two managers directly overseeing the employee’s work.

Our logic for this is simple. When discussing the reliability of PA ratings, we are generally talking about whether a person’s direct supervisor is reliable. Despite this, many past PA reliability estimates have relied on designs where non-direct supervisors made ratings (e.g., a manager from another team or a more senior manager). In such cases, those managers are less likely to have adequately observed the employee’s job performance. For instance, consider a department divided into two teams, A and B. Each team is led by a manager, with a vice president overseeing the entire department. In this structure, can we expect the team A manager to have an in-depth understanding of the team B employees? While possible in certain cases, it's improbable that a manager would frequently supervise employees from a different team or possess the same level of insight as those employees' direct supervisor. Similarly, it's doubtful that the vice president would have sufficient opportunity to observe the day-to-day activities of a lower-level employee, including their work's volume and quality, the necessity for rework, or their interactions with colleagues. Typically, such close supervision falls under the purview of the employee’s immediate manager.

Thus, in our research, we restricted our analysis to PA reliability estimates that came from two managers who each directly supervised the employee in question. After winnowing studies down to a more restrictive set of 22 PA reliability estimates, we found that the average PA reliability for direct supervisor ratings increased, with 65 percent of a manager’s rating being attributable to the employee’s job performance. That’s a marked increase over the previous estimate.

Sound Performance Appraisal Design

Of course, one could easily hope for even higher reliability given that performance ratings facilitate employment decisions that play a considerable role in employees’ lives. Although no measure is perfect (i.e., there will always be error), 65 percent is still probably not high enough for most people. The good news is that by working from this higher baseline, organizations can likely achieve more reasonable reliability given sound PA design. For example, training raters (Roch et al., 2012), incorporating more sophisticated rating scale formats (Hoffman et al., 2012), implementing rater accountability (Mero & Motowidlo, 1995; Roch, 2006; Tenbrink & Speer, 2022), and requiring calibration meetings (Speer et al., 2019) have all been shown to have positive effects on the quality of performance ratings. Thus, if companies implement best-practice designs, that might be enough to achieve more appropriate PA reliability.

Just as we recently turned back the clocks, we also now turn a new page in understanding performance appraisals. The findings from this research offer optimism despite longstanding skepticism regarding PA evaluations, though more research and attention are needed.

References

Hoffman, B. J., Gorman, C. A., Blair, C. A., Meriac, J. P., Overstreet, B., & Atchley, E. K. (2012). Evidence for the effectiveness of an alternative multisource performance rating methodology. Personnel Psychology, 65(3), 531–563.

Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85(2), 370–395.

Roch, S. G. (2006). Discussion and consensus in rater groups: Implications for behavioral and rating accuracy. Human Performance, 19(2), 91–115.

Speer, A. B., Delacruz, A. Y., Wegmeyer, L. J., & Perrotta, J. (in press). Meta-analytical estimates of interrater reliability for direct supervisor performance ratings: Optimism under optimal measurement designs. Journal of Applied Psychology.

Speer, A. B., Tenbrink, A., & Schwendeman, M. (2019). Let’s talk it out: The effects of calibration meetings on the accuracy of performance ratings. Human Performance, 32(3-4), 107–128.

Tenbrink, A. P., & Speer, A. B. (2022). Accountability during performance appraisals: The development and validation of the rater accountability scale. Human Performance, 1–23.

Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81(5), 557–574.
