Artificial Intelligence
Will 2025 Be a "Technology Wake-Up Call" for Clinicians?
Clinicians who fail to embrace artificial intelligence risk falling behind.
Posted December 19, 2024 Reviewed by Michelle Quirk
Key points
- OpenAI's o1-preview achieved 88 percent accuracy in diagnosis, far surpassing human doctors' 35 percent.
- Eighty-four percent of o1-preview's reasoning matched or exceeded that of human experts.
- AI shines in diagnosis but still struggles with probabilistic reasoning and triage decision-making.

The year 2025 may well mark a pivotal moment in the evolution of artificial intelligence (AI) in medicine. A new preprint study evaluating OpenAI’s GPT-4 and o1-preview model demonstrates that AI is not only achieving impressive feats in clinical reasoning but is doing so without supplemental training on domain-specific data. This achievement represents a significant leap in what general-purpose large language models (LLMs) can accomplish, fueled by innovations in reasoning frameworks such as chain-of-thought (CoT) processing.
The findings are both promising and provocative. On one hand, the o1-preview model excels in tasks requiring complex diagnostic and management reasoning, rivaling human clinicians. On the other, it reveals critical gaps in probabilistic reasoning and triage differential diagnosis, areas where human expertise remains paramount. This duality raises important questions about how AI will integrate into medical workflows and redefine the role of clinicians.
There's a lot to unpack here, and I suggest reading the study carefully as I'm only touching on some of the key points, particularly the results with the o1-preview model.
A Tale of Strengths and Weaknesses
The study evaluated the o1-preview model across five experiments, including differential diagnosis generation, diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning. The results were adjudicated by physician experts using validated psychometrics, providing a benchmark for comparison against human controls.
Strengths
- Differential diagnosis generation: The o1-preview model achieved an 88 percent accuracy rate, far surpassing the 35 percent accuracy demonstrated by human clinicians in the same task. Its output was consistently rated as more comprehensive and precise, particularly in rare and complex diagnostic scenarios, where the model’s CoT reasoning allowed it to identify conditions often overlooked by clinicians.
- Diagnostic and management reasoning: The o1-preview model displayed significant advancements in diagnostic and management tasks. In 84 percent of cases, the model’s reasoning was rated as on par with or exceeding that of human experts, who achieved comparable accuracy in only 64 percent of cases. Physicians praised the model’s structured and logical approach, which mirrored the stepwise critical thinking employed by clinicians and synthesized data from diverse clinical inputs to produce actionable recommendations.
Limitations
- Probabilistic reasoning: The model struggled with tasks requiring nuanced probabilistic reasoning—a cornerstone of medical decision-making. While the o1-preview model’s performance was consistent with prior LLMs, human clinicians continued to excel in this area, demonstrating greater adaptability in assigning likelihoods to competing diagnoses and dynamically balancing risks in uncertain situations.
- Triage differential diagnosis: No improvements were observed in triage tasks that require prioritizing cases by severity. While human clinicians achieved a 70 percent accuracy rate in these high-pressure, dynamic scenarios, the model’s logical but rigid outputs fell short, lacking the adaptive nuance required for real-time decision-making in emergency or critical care settings.
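The probabilistic reasoning the study describes can be illustrated with a simple Bayesian update: given a disease's prior prevalence and a test's sensitivity and specificity, compute the post-test probability of disease after a positive result. A minimal sketch (the numbers are hypothetical, not drawn from the study):

```python
def posterior_probability(prior, sensitivity, specificity):
    """Bayes' rule: probability of disease given a positive test result."""
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

# Hypothetical example: a disease with 2 percent prevalence, tested with
# 90 percent sensitivity and 95 percent specificity.
p = posterior_probability(prior=0.02, sensitivity=0.90, specificity=0.95)
print(f"Post-test probability: {p:.1%}")  # roughly 27%
```

Even with a fairly accurate test, the low prior keeps the post-test probability well under 50 percent; dynamically weighing competing diagnoses this way, case by case, is exactly where the study found human clinicians still outperform the model.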
The Role of Chain-of-Thought Reasoning
A standout feature of the o1-preview model is its reliance on chain-of-thought (CoT) reasoning, a framework that enables the AI to generate intermediate steps in its reasoning process before arriving at a final answer. This process allows the model to explain its thought process, making its outputs more transparent and easier for clinicians to interpret.
By breaking down complex problems into smaller steps, CoT reasoning reduces the risk of logical errors, particularly in tasks requiring critical thinking. Moreover, this approach mimics the way clinicians address diagnostic challenges—systematically considering symptoms, test results, and medical history to form conclusions. The use of CoT reasoning may be an important factor in the model’s success with diagnostic and management reasoning, even as it struggles with the more dynamic aspects of clinical practice, such as triage.
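The contrast between direct prompting and CoT prompting can be sketched as follows; the prompt wording below is illustrative only, not OpenAI's actual reasoning framework:

```python
case = "65-year-old with fever, productive cough, and pleuritic chest pain."

# Direct prompt: asks for an answer in a single step.
direct_prompt = f"Case: {case}\nWhat is the most likely diagnosis?"

# Chain-of-thought prompt: elicits intermediate reasoning steps
# (findings, differential, narrowing) before the final answer,
# mirroring a clinician's stepwise workup.
cot_prompt = (
    f"Case: {case}\n"
    "Reason step by step before answering:\n"
    "1. List the key clinical findings.\n"
    "2. Generate a differential diagnosis for each finding.\n"
    "3. Narrow the differential using the full clinical picture.\n"
    "4. State the single most likely diagnosis.\n"
)
```

The intermediate steps give clinicians something to audit: if the model's final answer is wrong, the point where its reasoning went astray is visible rather than hidden inside a one-shot response.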
The Remarkable Absence of Supplemental Clinical Training
Another striking aspect of the o1-preview model is that it was not trained on supplemental clinical data. Unlike earlier AI systems fine-tuned on medical data sets, o1-preview achieved its performance using general-purpose training. This accomplishment suggests that broad, general training data combined with advanced reasoning frameworks can rival domain-specific training, reducing the need for costly and time-intensive fine-tuning processes.
The absence of supplemental training also reduces concerns about patient privacy, biased data sets, and overfitting to specific scenarios. However, it means the model’s performance is limited to patterns present in its general training data, leaving gaps in areas requiring contextual nuance. This highlights both the promise and the current limitations of generalist AI systems in specialized domains like healthcare.
A Wake-Up Call for Clinicians
The o1-preview model’s performance highlights both the promise and the limitations of LLMs in medicine. For clinicians, this study serves as a wake-up call: AI is no longer a futuristic concept—it’s here, and it’s redefining what is possible in patient care.
- AI as a partner: Models like o1-preview are not replacing clinicians but augmenting their capabilities. They excel at tasks like differential diagnosis generation and management planning, freeing up clinicians to focus on patient interaction and decision-making.
- Closing the gaps: While o1-preview shines in structured reasoning tasks, its struggles with probabilistic reasoning and triage emphasize the irreplaceable value of human expertise. These gaps point to opportunities for future AI development.
- The need for new benchmarks: Current evaluation methods, such as multiple-choice question benchmarks, fail to capture the complexity of real-world clinical scenarios. Robust, scalable benchmarks and clinical trials are essential to understand AI’s true potential in healthcare.
Digital Health and "Another" Inflection Point?
The o1-preview model may represent a turning point in the integration of AI into medicine. While we've heard this claim many times before, the model's ability to perform superhuman reasoning tasks without supplemental clinical training stands out, both as an achievement and as a challenge. As AI continues to evolve, clinicians must adapt to this new reality, embracing AI as a cognitive partner while maintaining the human expertise that defines the art of medicine.
The year 2025 doesn't just represent a wake-up call; it may mark the beginning of a new era. The question is no longer whether AI will transform medicine, but how clinicians and AI will work together to shape the future of healthcare.