Critical Thinking About Measuring Critical Thinking

A list of critical thinking measures.

Posted May 18, 2018

In my last post, I discussed the nature of engaging the critical thinking (CT) process and mentioned individuals who draw a conclusion and wind up being correct. But being right doesn’t mean they used CT to get there. I illustrated this with an observation made in recent years regarding extant measures of CT, many of which assess CT via multiple-choice questions (MCQs). In the case of CT MCQs, you can guess the "right" answer 20-25% of the time, without any need for CT. So, the question is, are these CT measures really measuring CT?

As my previous articles explain, CT is a metacognitive process consisting of a number of sub-skills and dispositions that, when applied through purposeful, self-regulatory, reflective judgment, increase the chances of producing a logical solution to a problem or a valid conclusion to an argument (Dwyer, 2017; Dwyer, Hogan & Stewart, 2014). Most definitions, though worded differently, tend to agree with this perspective – CT consists of certain dispositions, specific skills and a reflective sensibility that governs application of these skills. That’s how it’s defined; however, it’s not necessarily how it’s been operationally defined.

Operationally defining something means specifying the process or measure required to determine the nature and properties of a phenomenon. Simply, it is defining the concept with respect to how it can be done, assessed or measured. If the manner in which you measure something does not match, or does not assess, the parameters set out in your definition of it, then you have not succeeded in operationally defining it.

Though most theoretical definitions of CT are similar, the manner in which they vary often impedes the construction of an integrated theoretical account of how best to measure CT skills. As a result, researchers and educators must consider the wide array of CT measures available in order to identify the most appropriate ones, based on the CT conceptualisation used for training. There are various extant CT measures – the most popular among them are the Watson-Glaser Critical Thinking Appraisal (WGCTA; Watson & Glaser, 1980), the Cornell Critical Thinking Test (CCTT; Ennis, Millman & Tomko, 1985), the California Critical Thinking Skills Test (CCTST; Facione, 1990a), the Ennis-Weir Critical Thinking Essay Test (EWCTET; Ennis & Weir, 1985) and the Halpern Critical Thinking Assessment (HCTA; Halpern, 2010).

Some commentators have noted that these different measures of CT ability may not be directly comparable (Abrami et al., 2008). For example, the WGCTA consists of 80 MCQs that measure the ability to draw inferences; recognise assumptions; evaluate arguments; and use logical interpretation and deductive reasoning (Watson & Glaser, 1980). The CCTT consists of 52 MCQs that measure skills of critical thinking associated with induction; deduction; observation and credibility; definition and assumption identification; and meaning and fallacies. Finally, the CCTST consists of 34 MCQs and measures CT according to the core skills of analysis, evaluation and inference, as well as inductive and deductive reasoning.

As addressed above, the MCQ-format of these three assessments is less than ideal – problematic even, because it allows test-takers to simply guess when they do not know the correct answer, instead of demonstrating their ability to critically analyse and evaluate problems and infer solutions to those problems (Ku, 2009). Furthermore, as argued by Halpern (2003), the MCQ format makes the assessment a test of verbal and quantitative knowledge rather than CT (i.e. because one selects from a list of possible answers rather than determining one’s own criteria for developing an answer). The measurement of CT through MCQs is also problematic given the potential incompatibility between the conceptualisation of CT that shapes test construction and its assessment using MCQs. That is, MCQ tests assess cognitive capacities associated with identifying single right-or-wrong answers and as a result, this approach to testing is unable to provide a direct measure of test-takers’ use of metacognitive processes such as CT, reflective judgment, and disposition towards CT.
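To make the guessing problem concrete, here is a rough sketch of how many items a test-taker could expect to get right by chance alone. The item counts come from the descriptions above; the number of answer options per item (four or five) is an assumption inferred from the 20-25% guessing rate, and real tests may mix item formats, so treat the numbers as illustrative only. The function names are hypothetical.

```python
from math import comb

# Item counts as reported in the text above.
TEST_ITEMS = {"WGCTA": 80, "CCTT": 52, "CCTST": 34}


def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected number of correct answers from blind guessing."""
    return n_items / n_options


def prob_at_least(n_items: int, n_options: int, k: int) -> float:
    """Binomial probability of at least k correct answers by guessing alone."""
    p = 1 / n_options
    return sum(
        comb(n_items, i) * p**i * (1 - p) ** (n_items - i)
        for i in range(k, n_items + 1)
    )


if __name__ == "__main__":
    for name, n in TEST_ITEMS.items():
        # ASSUMPTION: every item has 4 or 5 options (25% or 20% chance each).
        for options in (4, 5):
            score = expected_chance_score(n, options)
            print(f"{name}: {n} items, {options} options -> "
                  f"{score:.1f} expected correct by chance")
```

Under these assumptions, someone guessing blindly on a 34-item test would average between 6.8 and 8.5 correct answers, none of which reflect any CT.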

Instead of using MCQ items, a better measure of CT might ask open-ended questions, which would allow test-takers to demonstrate whether or not they spontaneously use a specific CT skill. One commonly used CT assessment, mentioned above, that employs an open-ended format is the Ennis-Weir Critical Thinking Essay Test (EWCTET; Ennis & Weir, 1985). The EWCTET is an essay-based assessment of the test-taker’s ability to analyse, evaluate, and respond to arguments and debates in real-world situations (Ennis & Weir, 1985; see Ku, 2009 for a discussion). The authors of the EWCTET provide what they call a “rough, somewhat overlapping list of areas of critical thinking competence”, measured by their test (Ennis & Weir, 1985, p. 1). However, this test, too, has been criticised – for its domain-specific nature (Taube, 1997), the subjectivity of its scoring protocol and its bias in favour of those proficient in writing (Adams, Whitlow, Stover & Johnson, 1996).

Another, more recent CT assessment that utilises an open-ended format is the Halpern Critical Thinking Assessment (HCTA; Halpern, 2010). The HCTA consists of 25 open-ended questions based on believable, everyday situations, followed by 25 specific questions that probe for the reasoning behind each answer. The multi-part nature of the questions makes it possible to assess the ability to use specific CT skills when the prompt is provided (Ku, 2009). The HCTA’s scoring protocol also provides comprehensible, unambiguous instructions for how to evaluate responses by breaking them down into clear, measurable components. Questions on the HCTA represent five categories of CT application: hypothesis testing (e.g. understanding the limits of correlational reasoning and how to know when causal claims cannot be made), verbal reasoning (e.g. recognising the use of pervasive or misleading language), argumentation (e.g. recognising the structure of arguments, how to examine the credibility of a source and how to judge one’s own arguments), judging likelihood and uncertainty (e.g. applying relevant principles of probability, how to avoid overconfidence in certain situations) and problem-solving (e.g. identifying the problem goal, generating and selecting solutions among alternatives).

Up until the development of the HCTA, I would have recommended the CCTST for measuring CT, despite its limitations. What’s nice about the CCTST is that it assesses the three core skills of CT: analysis, evaluation, and inference, which other scales do not (explicitly). So, if you were interested in assessing students’ sub-skill ability, this would be helpful. However, as we know, though CT skill performance is a sequence, it is also a collation of these skills – meaning that for any given problem or topic, each skill is necessary. Administering an analysis problem, an evaluation problem and an inference problem, and having the student score top marks on all three, doesn’t guarantee that the student will apply these skills to a broader problem that requires all three. That is, these questions don’t measure CT skill ability per se, but rather analysis skill, evaluation skill and inference skill in isolation. Simply, scores may predict CT skill performance, but they don’t measure it.

What may be a better indicator of CT performance is assessment of CT application. As addressed above, there are five general applications of CT: hypothesis testing, verbal reasoning, argumentation, problem-solving and judging likelihood and uncertainty – all of which require a collation of analysis, evaluation, and inference. Though the sub-skills of analysis, evaluation, and inference are not directly measured in this case, their collation is measured through five distinct applications, which, as I see it, provides a 'truer' assessment of CT. In addition to assessing CT via an open-ended, short-answer format, the HCTA measures CT according to the five applications of CT; thus, I recommend its use for measuring CT.

However, that’s not to say that the HCTA is perfect. Though it consists of 25 open-ended questions, followed by 25 specific questions that probe for the reasoning behind each answer, when I first used it to assess a sample of students, I found in setting up my data file that there were actually 165 opportunities for scoring across the test. Past research suggests that the assessment takes roughly 45 to 60 minutes to complete. However, many of my participants reported it requiring closer to two hours (sometimes longer). It’s a long assessment – thorough, but long. Fortunately, adapted, shortened versions are now available, and it’s an adapted version that I currently administer to assess CT. Another limitation is that, despite the rationale above, it would be nice to have some indication of how participants get on with the sub-skills of analysis, evaluation, and inference, as I do think there’s a potential predictive element in the relationship between the individual skills and the applications. With that, I suppose it is feasible to administer both the HCTA and CCTST to assess such hypotheses.

Though it’s obviously important to consider how assessments actually measure CT and the ways in which each is limited, the broader, macro-problem still requires thought. Just as conceptualisations of CT vary, so too do the reliability and validity of the different CT measures, which has led Abrami and colleagues (2008, p. 1104) to ask: “How will we know if one intervention is more beneficial than another if we are uncertain about the validity and reliability of the outcome measures?” Abrami and colleagues add that, even when researchers explicitly declare that they are assessing CT, there still remains the major challenge of ensuring that measured outcomes are related, in some meaningful way, to the conceptualisation and operational definition of CT that informed the teaching practice in cases of interventional research. Often, the relationship between the concepts of CT that are taught and those that are assessed is unclear, and a large majority of studies in this area include no theory to help elucidate these relationships.

In conclusion, solving the problem of consistency across CT conceptualisation, training, and measurement is no easy task. I think recent advancements in CT scale development (e.g. the development of the HCTA and its adapted versions) have eased the problem, given that they now bridge the gap between current theory and practical assessment. However, such advances need to be made clearer to interested populations. As always, I’m very interested in hearing from any readers who may have any insight or suggestions!


Abrami, P. C., Bernard, R. M., Borokhovski, E., Wade, A., Surkes, M. A., Tamim, R., & Zhang, D. (2008). Instructional interventions affecting critical thinking skills and dispositions: A stage 1 meta-analysis. Review of Educational Research, 78(4), 1102–1134.

Adams, M.H., Whitlow, J.F., Stover, L.M., & Johnson, K.W. (1996). Critical thinking as an educational outcome: An evaluation of current tools of measurement. Nurse Educator, 21, 23–32.

Dwyer, C.P. (2017). Critical thinking: Conceptual perspectives and practical guidelines. Cambridge, UK: Cambridge University Press.

Dwyer, C.P., Hogan, M.J., & Stewart, I. (2014). An integrated critical thinking framework for the 21st century. Thinking Skills & Creativity, 12, 43–52.

Ennis, R.H., Millman, J., & Tomko, T.N. (1985). Cornell critical thinking tests. CA: Critical Thinking Co.

Ennis, R.H., & Weir, E. (1985). The Ennis-Weir critical thinking essay test. Pacific Grove, CA: Midwest Publications.

Facione, P.A. (1990a). The California critical thinking skills test (CCTST): Forms A and B; The CCTST test manual. Millbrae, CA: California Academic Press.

Facione, P.A. (1990b). The Delphi report: Committee on pre-college philosophy. Millbrae, CA: California Academic Press.

Halpern, D.F. (2003). The “how” and “why” of critical thinking assessment. In D. Fasko (Ed.), Critical thinking and reasoning: Current research, theory and practice. Cresskill, NJ: Hampton Press.

Halpern, D.F. (2010). The Halpern critical thinking assessment: Manual. Vienna: Schuhfried.

Ku, K.Y.L. (2009). Assessing students’ critical thinking performance: Urging for measurements using multi-response format. Thinking Skills and Creativity, 4(1), 70–76.

Taube, K.T. (1997). Critical thinking ability and disposition as factors of performance on a written critical thinking test. Journal of General Education, 46, 129–164.

Watson, G., & Glaser, E.M. (1980). Watson-Glaser critical thinking appraisal. New York: Psychological Corporation.