Article Analysis: Communications Medicine
AI in Clinical Diagnosis & Management
Article Title: Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management
Authors: Peikai Chen, Jifu Cai, Jiaying Zhou, Shaoxi Chen, Chenguang Xu, Lihua Yuan, Xiaoying Dai, Xiaowei Chen, Yanzhe Wei, Xia Li, Shaofeng Gong, Xiaolong Liang, Jiancheng Yang, Jun Jin, Kanglin Dai, Yuzhen Cui, Guan-Ming Kuang, Jiansheng Xie, Libing Luo, Haibing Xiao, Shijie Yin, Jun Yang, Yulan Yan, Jianliang Chen, Yihua Chen, Qianshen Zhang, Qingshan Zhou, Lina Zhao, Min Wu, Xin Tang, Lei Rong, Zanxin Wang, Weifu Qiu, Yanli Wang, Liwen Cui, Xiangyang Li, Yong Hu, Huiren Tao, Nan Wu, David J. H. Shih, Pearl Pai, Minxin Wei, Michael Kai-tsun To & Kenneth M. C. Cheung
Abstract: Direct clinical uses of large language models (LLMs) remain controversial, partly because of the lack of methodological rigor in assessing their risks and benefits in medicine. We developed Medieval, a multidisciplinary, randomized, and blinded expert evaluation framework. A ten-point Dreyfus-based scoring scale linked to career stages of human physicians was designed to reflect response qualities. Seven advanced LLMs or their distilled versions that were released within a short time-frame (≤45 days) in early 2025 were tested. The incidence of fabricated medical facts was documented. Linear mixed-effects models and variance-stabilizing Bayesian generalized linear mixed models were employed to perform statistical analyses. We first develop a high-quality question bank comprising 685 real and simulated clinical cases across 13 specialties. An expert panel of 27 clinicians (average years of service: 25.9) evaluated the 4,795 model responses. We show that these LLM ratings (n = 9856) have excellent reliability (intraclass correlation coefficients >0.9). Among the seven LLMs tested, Gemini 2.0 Flash achieved the highest raw scores. However, after adjusting for confounders, DeepSeek-R1 was the top-performing model with a mean score of 6.36 (95% confidence interval 6.03–6.69), a performance level equivalent to an early-career physician. Despite these strengths, 3–19% of LLM responses were rated as incompetent and 40 instances of LLM hallucination were also identified. Our study shows that in spite of LLMs' substantial potential in medicine, their unguarded clinical application could present serious risks, which must be continuously monitored by human expert panels. The evaluation framework developed and validated in this study will facilitate such efforts.
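The abstract reports excellent inter-rater reliability (intraclass correlation coefficients >0.9) for the expert ratings. As an illustration only, the sketch below computes a simple one-way random-effects ICC(1) on a toy ratings matrix; the paper's actual analyses used linear mixed-effects and Bayesian generalized linear mixed models, which this does not reproduce.

```python
from statistics import mean

def icc1(ratings):
    """One-way random-effects ICC(1) for a targets-by-raters matrix.

    `ratings[i][j]` is rater j's score for response i (toy data below;
    not the study's actual estimator or dataset).
    """
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # raters per response
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-target and within-target mean squares (one-way ANOVA)
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((v - m) ** 2
              for row, m in zip(ratings, row_means)
              for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

With closely agreeing raters the statistic approaches 1, matching the ">0.9 is excellent reliability" reading used in the abstract.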
Plain Language Summary: Artificial intelligence (AI) systems are becoming increasingly capable of answering medical questions, but it is still important to understand how safe and reliable they truly are. In this study, we created a structured evaluation framework where experienced doctors reviewed the responses of seven newly released AI models to hundreds of real and simulated clinical scenarios. Across nearly 5,000 answers, doctors used a scoring system designed to reflect the quality expected at different stages of a medical career. Some AI models performed well, occasionally reaching a level similar to that of early-career physicians. However, doctors also identified answers that were incomplete, inaccurate, or based on invented details. Our study shows that while these AI systems are promising, ensuring their safe use in medicine will require ongoing oversight. Standard engineering tests alone are not enough; evaluations by clinical experts remain a crucial safeguard to help identify potential risks and prevent misuse.
Executive Impact
This study establishes a rigorous, blinded 'human-in-the-loop' framework to evaluate advanced LLMs in medicine using real cases. While top-tier models demonstrate sophisticated clinical reasoning, they also exhibit critical variability and safety risks, underscoring the need for continuous expert oversight before wider clinical adoption.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Evaluation Framework: Medieval
| Feature | Medieval Framework | Traditional Benchmarks |
|---|---|---|
| Case Source | 685 real and simulated clinical cases across 13 specialties | Typically static, exam-style question banks |
| Evaluators | Panel of 27 senior clinicians (average 25.9 years of service) | Typically automated metrics or non-expert raters |
| Scoring | Ten-point Dreyfus-based scale linked to physician career stages | Typically binary accuracy or multiple-choice scores |
| Bias Mitigation | Randomized, blinded expert review | Typically none |
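The bias-mitigation row hinges on randomizing presentation order and masking model identities from the evaluators. A minimal sketch of that idea, with hypothetical model names and a single-evaluator setup (the real framework also randomized assignments across the 27-clinician panel):

```python
import random

def blind_and_randomize(responses, seed=2025):
    """Blind model identities and shuffle presentation order.

    `responses` maps model name -> response text (toy stand-in for the
    study's setup). Returns the blinded list shown to an evaluator and
    an unblinding key to be held separately by the study administrators.
    """
    rng = random.Random(seed)
    items = list(responses.items())
    rng.shuffle(items)                 # random presentation order
    key = {}                           # label -> model, kept from raters
    blinded = []
    for i, (model, text) in enumerate(items, 1):
        label = f"Response {i}"
        key[label] = model
        blinded.append((label, text))
    return blinded, key
```

Keeping the unblinding key out of the evaluators' hands is what makes the scoring blinded; the fixed seed here is only for reproducibility of the sketch.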
Hallucination Example: Prenatal Diagnosis
In prenatal diagnosis, 17 instances of hallucination were identified. Common errors involved misinterpretation or incorrect application of variant classification criteria (e.g., PVS1, PS3/4, PM1/2/4, PP1/3). For example, Qwen 32B, ChatGPT-4o, and Gemini misused ACMG-AMP criteria, leading to potentially critical misdiagnoses. These errors demonstrate the need for human oversight to prevent medical malpractice.
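Since the misapplied evidence codes follow a predictable naming scheme, one simple safeguard is to surface every ACMG-AMP criterion an LLM cites so a human geneticist can verify each one. A minimal sketch (the regex is a simplified screen of my own, not a tool from the study, and it checks only that a code was cited, not that it was applied correctly):

```python
import re

# ACMG-AMP evidence codes: a prefix (PVS, PS, PM, PP, BA, BS, BP)
# followed by a digit. This flags citations for review; it cannot
# judge whether the criterion was applied correctly.
ACMG_CODE = re.compile(r"\b(PVS|PS|PM|PP|BA|BS|BP)[1-9]\b")

def flag_acmg_claims(llm_answer):
    """Return the distinct ACMG-AMP criterion codes cited in an LLM
    answer, for verification by a human expert."""
    return sorted(set(m.group(0) for m in ACMG_CODE.finditer(llm_answer)))
```

A screen like this does not prevent hallucination; it only routes every criterion-based claim to the expert oversight the study argues is necessary.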
LLM Strengths: DeepSeek R1 in STEMI
In a simulated A&E case with STEMI complicated by right ventricular infarction and hypotension, DeepSeek R1 demonstrated profound understanding by integrating the patient's low blood pressure and right heart failure signs to infer right ventricular involvement and explicitly recommend avoiding nitroglycerin, in line with American College of Cardiology guidelines. This nuanced response contrasts with other LLMs, which offered only routine reminders.
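The clinical logic DeepSeek R1 applied here can be stated as a simple rule: nitrates reduce preload, so they should be avoided in preload-dependent states such as right ventricular infarction or hypotension. A toy encoding of that rule, for illustration only (the threshold and wording are my assumptions, and this is not clinical guidance or part of the study):

```python
def nitrate_caution(systolic_bp, suspected_rv_infarction):
    """Toy decision-support rule, illustrative only: flag nitroglycerin
    in preload-dependent states (suspected right ventricular infarction)
    or hypotension (assumed here as SBP < 90 mmHg)."""
    if suspected_rv_infarction or systolic_bp < 90:
        return "avoid nitroglycerin: preload-dependent / hypotensive state"
    return "no nitrate contraindication detected by this rule"
```

The point of the example is the contrast the study draws: a strong model internalized this kind of conditional reasoning, while weaker models emitted only unconditional routine advice.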
| Risk Type | Description | Example |
|---|---|---|
| Incompetence | 3–19% of responses per model were rated below competent practice | Incomplete or inappropriate management plans flagged by the expert panel |
| Hallucination | 40 instances of fabricated medical facts identified across models | Misapplied ACMG-AMP variant classification criteria in prenatal diagnosis |
| Bias | Systematic skew in responses or in their evaluation when identities are unmasked | Mitigated in Medieval through randomized, blinded expert review |
Estimate Your Enterprise AI Impact
Quantify potential efficiency gains and cost savings by integrating AI into your clinical workflows. Adjust parameters to see the estimated ROI for your organization.
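A back-of-envelope version of such an ROI estimate can be written in a few lines. Every input below is an adjustable assumption (case volume, time saved, staff cost, AI subscription cost), not a figure from the study:

```python
def estimate_roi(cases_per_month, minutes_saved_per_case,
                 clinician_hourly_cost, monthly_ai_cost):
    """Back-of-envelope monthly ROI estimate. All inputs are
    organization-specific assumptions to adjust.

    Returns (net_monthly_savings, roi_ratio)."""
    hours_saved = cases_per_month * minutes_saved_per_case / 60
    gross_savings = hours_saved * clinician_hourly_cost
    net = gross_savings - monthly_ai_cost
    roi = net / monthly_ai_cost if monthly_ai_cost else float("inf")
    return net, roi
```

For example, 1,000 cases per month saving 6 minutes each at a $120/hour clinician cost against a $2,000/month AI cost yields $10,000 net savings, a 5x return. Any real estimate must also budget for the expert oversight the study shows is necessary.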
Your AI Implementation Roadmap
A strategic overview of the phased approach to integrate advanced AI solutions into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Discovery & Strategy
Conduct a comprehensive assessment of current workflows, identify key pain points, and define strategic AI integration goals. Develop a tailored AI adoption roadmap.
Phase 2: Pilot & Validation
Implement a pilot AI solution in a controlled environment. Evaluate performance against defined KPIs and gather feedback for refinement and optimization.
Phase 3: Scaled Deployment
Roll out AI solutions across relevant departments. Provide extensive training and support to ensure user adoption and maximize efficiency gains.
Phase 4: Continuous Optimization
Establish ongoing monitoring, performance analytics, and iterative improvement cycles to adapt AI systems to evolving needs and technological advancements.
Ready to Transform Your Enterprise with AI?
Unlock the full potential of advanced AI and achieve unprecedented efficiency and innovation. Our experts are ready to guide you.