Article Analysis: Communications Medicine
AI in Clinical Diagnosis & Management
Article Title: Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management
Authors: Peikai Chen, Jifu Cai, Jiaying Zhou, Shaoxi Chen, Chenguang Xu, Lihua Yuan, Xiaoying Dai, Xiaowei Chen, Yanzhe Wei, Xia Li, Shaofeng Gong, Xiaolong Liang, Jiancheng Yang, Jun Jin, Kanglin Dai, Yuzhen Cui, Guan-Ming Kuang, Jiansheng Xie, Libing Luo, Haibing Xiao, Shijie Yin, Jun Yang, Yulan Yan, Jianliang Chen, Yihua Chen, Qianshen Zhang, Qingshan Zhou, Lina Zhao, Min Wu, Xin Tang, Lei Rong, Zanxin Wang, Weifu Qiu, Yanli Wang, Liwen Cui, Xiangyang Li, Yong Hu, Huiren Tao, Nan Wu, David J. H. Shih, Pearl Pai, Minxin Wei, Michael Kai-tsun To & Kenneth M. C. Cheung
Abstract: Direct clinical uses of large language models (LLMs) remain controversial, partly because of the lack of methodological rigor in assessing their risks and benefits in medicine. We developed Medieval, a multidisciplinary, randomized, and blinded expert evaluation framework. A ten-point Dreyfus-based scoring scale linked to career stages of human physicians was designed to reflect response qualities. Seven advanced LLMs or their distilled versions that were released within a short time-frame (≤45 days) in early 2025 were tested. The incidence of fabricated medical facts was documented. Linear mixed-effects models and variance-stabilizing Bayesian generalized linear mixed models were employed to perform statistical analyses. We first develop a high-quality question bank comprising 685 real and simulated clinical cases across 13 specialties. An expert panel of 27 clinicians (average years of service: 25.9) evaluated the 4,795 model responses. We show that these LLM ratings (n = 9856) have excellent reliability (intraclass correlation coefficients >0.9). Among the seven LLMs tested, Gemini 2.0 Flash achieved the highest raw scores. However, after adjusting for confounders, DeepSeek-R1 was the top-performing model with a mean score of 6.36 (95% confidence interval 6.03–6.69), a performance level equivalent to an early-career physician. Despite these strengths, 3–19% of LLM responses were rated as incompetent and 40 instances of LLM hallucination were also identified. Our study shows that in spite of LLMs' substantial potential in medicine, their unguarded clinical application could present serious risks, which must be continuously monitored by human expert panels. The evaluation framework developed and validated in this study will facilitate such efforts.
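The abstract reports excellent inter-rater reliability (intraclass correlation coefficients >0.9) for the expert ratings. As an illustration only, the sketch below computes a simple one-way random-effects ICC(1) on a toy ratings matrix; the paper's actual analyses used linear mixed-effects and Bayesian generalized linear mixed models, which this does not reproduce.

```python
from statistics import mean

def icc1(ratings):
    """One-way random-effects ICC(1) for a targets-by-raters matrix.

    `ratings[i][j]` is rater j's score for response i (toy data below;
    not the study's actual estimator or dataset).
    """
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # raters per response
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-target and within-target mean squares (one-way ANOVA)
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((v - m) ** 2
              for row, m in zip(ratings, row_means)
              for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

With closely agreeing raters the statistic approaches 1, matching the ">0.9 is excellent reliability" reading used in the abstract.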
Plain Language Summary: Artificial intelligence (AI) systems are becoming increasingly capable of answering medical questions, but it is still important to understand how safe and reliable they truly are. In this study, we created a structured evaluation framework where experienced doctors reviewed the responses of seven newly released AI models to hundreds of real and simulated clinical scenarios. Across nearly 5,000 answers, doctors used a scoring system designed to reflect the quality expected at different stages of a medical career. Some AI models performed well, occasionally reaching a level similar to that of early-career physicians. However, doctors also identified answers that were incomplete, inaccurate, or based on invented details. Our study shows that while these AI systems are promising, ensuring their safe use in medicine will require ongoing oversight. Standard engineering tests alone are not enough; evaluations by clinical experts remain a crucial safeguard to help identify potential risks and prevent misuse.
Executive Impact
This study establishes a rigorous, blinded 'human-in-the-loop' framework to evaluate advanced LLMs in medicine using real cases. While top-tier models demonstrate sophisticated clinical reasoning, they also exhibit critical variability and safety risks, underscoring the need for continuous expert oversight before wider clinical adoption.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Evaluation Framework: Medieval
| Feature | Medieval Framework | Traditional Benchmarks |
|---|---|---|
| Case Source | 685 real and simulated clinical cases across 13 specialties | Typically static, exam-style question banks |
| Evaluators | Panel of 27 senior clinicians (average 25.9 years of service) | Typically automated metrics or non-expert raters |
| Scoring | Ten-point Dreyfus-based scale linked to physician career stages | Typically binary accuracy or multiple-choice scores |
| Bias Mitigation | Randomized, blinded expert review | Typically none |
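The bias-mitigation row hinges on randomizing presentation order and masking model identities from the evaluators. A minimal sketch of that idea, with hypothetical model names and a single-evaluator setup (the real framework also randomized assignments across the 27-clinician panel):

```python
import random

def blind_and_randomize(responses, seed=2025):
    """Blind model identities and shuffle presentation order.

    `responses` maps model name -> response text (toy stand-in for the
    study's setup). Returns the blinded list shown to an evaluator and
    an unblinding key to be held separately by the study administrators.
    """
    rng = random.Random(seed)
    items = list(responses.items())
    rng.shuffle(items)                 # random presentation order
    key = {}                           # label -> model, kept from raters
    blinded = []
    for i, (model, text) in enumerate(items, 1):
        label = f"Response {i}"
        key[label] = model
        blinded.append((label, text))
    return blinded, key
```

Keeping the unblinding key out of the evaluators' hands is what makes the scoring blinded; the fixed seed here is only for reproducibility of the sketch.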
Hallucination Example: Prenatal Diagnosis
In prenatal diagnosis, 17 instances of hallucination were identified. Common errors involved misinterpretation or incorrect application of variant classification criteria (e.g., PVS1, PS3/4, PM1/2/4, PP1/3). For example, Qwen 32B, ChatGPT-4o, and Gemini misused ACMG-AMP criteria, leading to potentially critical misdiagnoses. These errors demonstrate the need for human oversight to prevent medical malpractice.
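Since the misapplied evidence codes follow a predictable naming scheme, one simple safeguard is to surface every ACMG-AMP criterion an LLM cites so a human geneticist can verify each one. A minimal sketch (the regex is a simplified screen of my own, not a tool from the study, and it checks only that a code was cited, not that it was applied correctly):

```python
import re

# ACMG-AMP evidence codes: a prefix (PVS, PS, PM, PP, BA, BS, BP)
# followed by a digit. This flags citations for review; it cannot
# judge whether the criterion was applied correctly.
ACMG_CODE = re.compile(r"\b(PVS|PS|PM|PP|BA|BS|BP)[1-9]\b")

def flag_acmg_claims(llm_answer):
    """Return the distinct ACMG-AMP criterion codes cited in an LLM
    answer, for verification by a human expert."""
    return sorted(set(m.group(0) for m in ACMG_CODE.finditer(llm_answer)))
```

A screen like this does not prevent hallucination; it only routes every criterion-based claim to the expert oversight the study argues is necessary.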
LLM Strengths: DeepSeek R1 in STEMI
In a simulated A&E case with STEMI complicated by right ventricular infarction and hypotension, DeepSeek R1 demonstrated profound understanding by integrating the patient's low blood pressure and right heart failure signs to infer right ventricular involvement and explicitly recommend avoiding nitroglycerin, in line with American College of Cardiology guidelines. This nuanced response contrasts with other LLMs, which offered only routine reminders.
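The clinical logic DeepSeek R1 applied here can be stated as a simple rule: nitrates reduce preload, so they should be avoided in preload-dependent states such as right ventricular infarction or hypotension. A toy encoding of that rule, for illustration only (the threshold and wording are my assumptions, and this is not clinical guidance or part of the study):

```python
def nitrate_caution(systolic_bp, suspected_rv_infarction):
    """Toy decision-support rule, illustrative only: flag nitroglycerin
    in preload-dependent states (suspected right ventricular infarction)
    or hypotension (assumed here as SBP < 90 mmHg)."""
    if suspected_rv_infarction or systolic_bp < 90:
        return "avoid nitroglycerin: preload-dependent / hypotensive state"
    return "no nitrate contraindication detected by this rule"
```

The point of the example is the contrast the study draws: a strong model internalized this kind of conditional reasoning, while weaker models emitted only unconditional routine advice.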
| Risk Type | Description | Example |
|---|---|---|
| Incompetence | 3–19% of responses per model were rated below competent practice | Incomplete or inappropriate management plans flagged by the expert panel |
| Hallucination | 40 instances of fabricated medical facts identified across models | Misapplied ACMG-AMP variant classification criteria in prenatal diagnosis |
| Bias | Systematic skew in responses or in their evaluation when identities are unmasked | Mitigated in Medieval through randomized, blinded expert review |
Estimate Your Enterprise AI Impact
Quantify potential efficiency gains and cost savings by integrating AI into your clinical workflows. Adjust parameters to see the estimated ROI for your organization.
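A back-of-envelope version of such an ROI estimate can be written in a few lines. Every input below is an adjustable assumption (case volume, time saved, staff cost, AI subscription cost), not a figure from the study:

```python
def estimate_roi(cases_per_month, minutes_saved_per_case,
                 clinician_hourly_cost, monthly_ai_cost):
    """Back-of-envelope monthly ROI estimate. All inputs are
    organization-specific assumptions to adjust.

    Returns (net_monthly_savings, roi_ratio)."""
    hours_saved = cases_per_month * minutes_saved_per_case / 60
    gross_savings = hours_saved * clinician_hourly_cost
    net = gross_savings - monthly_ai_cost
    roi = net / monthly_ai_cost if monthly_ai_cost else float("inf")
    return net, roi
```

For example, 1,000 cases per month saving 6 minutes each at a $120/hour clinician cost against a $2,000/month AI cost yields $10,000 net savings, a 5x return. Any real estimate must also budget for the expert oversight the study shows is necessary.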
Your AI Implementation Roadmap
A strategic overview of the phased approach to integrate advanced AI solutions into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Discovery & Strategy
Conduct a comprehensive assessment of current workflows, identify key pain points, and define strategic AI integration goals. Develop a tailored AI adoption roadmap.
Phase 2: Pilot & Validation
Implement a pilot AI solution in a controlled environment. Evaluate performance against defined KPIs and gather feedback for refinement and optimization.
Phase 3: Scaled Deployment
Roll out AI solutions across relevant departments. Provide extensive training and support to ensure user adoption and maximize efficiency gains.
Phase 4: Continuous Optimization
Establish ongoing monitoring, performance analytics, and iterative improvement cycles to adapt AI systems to evolving needs and technological advancements.
Ready to Transform Your Enterprise with AI?
Unlock the full potential of advanced AI and achieve unprecedented efficiency and innovation. Our experts are ready to guide you.