Enterprise AI Analysis
Diagnosis and Triage Performance of Contemporary Large Language Models on Short Clinical Vignettes
General-purpose large language models (LLMs) are increasingly proposed for diagnostic and triage decision support, yet their reliability relative to humans remains unclear. This study evaluated eight contemporary LLMs (ChatGPT-4, ChatGPT-01, DeepSeek-V3, DeepSeek-R1, Gemini-2.0, Copilot, Grok-2, Llama-3.1) on 48 single-turn clinical vignettes spanning four triage levels (Emergent, 1-day, 1-week, Self-care). Models were tested without prompts and with structured prompts. Structured prompting significantly improved both diagnostic accuracy (from 89.84% to 91.67%) and triage accuracy (from 76.82% to 86.20%). The best diagnostic accuracy was 93.75% (ChatGPT-01 and DeepSeek-R1). Prompting also shifted models toward a more precautionary stance, increasing safety of advice from 89.06% to 94.53%, accompanied by higher over-triage (from 53.15% to 65.62%). While advanced LLMs show high diagnostic accuracy, triage remains a significant challenge. Structured prompting is a practical lever to enhance robustness.
Key Performance Indicators
Leveraging structured prompting, LLMs demonstrate significant gains in diagnostic and triage accuracy and enhanced safety of advice for clinical support.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The rapid evolution of large language models (LLMs) like ChatGPT-4, DeepSeek, and Gemini has profoundly transformed natural language processing. These models excel in text generation, complex reasoning, and integrating domain-specific knowledge, making them highly promising for healthcare applications such as disease diagnosis and patient management. However, their reliability compared to humans is still an open question, with prior studies showing mixed findings and highlighting the need for standardized evaluation frameworks.
This study utilized a validated dataset of 48 synthetic single-turn clinical vignettes, categorized into four triage urgency levels (Emergent, 1-day, 1-week, Self-care). Eight contemporary LLMs were evaluated under two scenarios: without prompts and with structured prompts comprising exemplar cases. Performance was assessed using diagnostic and triage accuracy, confusion matrices, over-triage, safety of advice, and the Capability Comparison Score (CCS) to account for case difficulty.
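The triage-oriented metrics above can be made concrete with a short sketch. This is an illustrative assumption about how such metrics are typically defined (correct level for accuracy, a more urgent level than needed for over-triage, and "correct or more urgent" for safety of advice), not the authors' actual scoring code; the `TRIAGE_ORDER` mapping and `triage_metrics` helper are hypothetical.

```python
# Hypothetical scoring sketch: the four triage levels are ordered by urgency,
# and a prediction is "safe" when it is at least as urgent as the reference.
TRIAGE_ORDER = {"Emergent": 0, "1-day": 1, "1-week": 2, "Self-care": 3}

def triage_metrics(predicted, reference):
    """Compute triage accuracy, over-triage rate, and safety of advice."""
    assert len(predicted) == len(reference) and reference
    n = len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    # Over-triage: the model picks a more urgent (lower-index) level than needed.
    over = sum(TRIAGE_ORDER[p] < TRIAGE_ORDER[r] for p, r in zip(predicted, reference))
    # Safe advice: the correct level or a more urgent one (never under-triage).
    safe = sum(TRIAGE_ORDER[p] <= TRIAGE_ORDER[r] for p, r in zip(predicted, reference))
    return {"accuracy": correct / n, "over_triage": over / n, "safety": safe / n}
```

Under these definitions, a model that always answers "Emergent" would score 100% on safety while inflating over-triage, which is exactly the trade-off the study reports for prompted models.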
Structured prompting consistently boosted performance across all models. Mean diagnostic accuracy increased from 89.84% to 91.67%, and mean triage accuracy from 76.82% to 86.20%. ChatGPT-01 and DeepSeek-R1 achieved the best diagnostic accuracy at 93.75%. Prompting also shifted models toward safety, increasing safety of advice to 94.53% but also raising over-triage to 65.62%. CCS values preserved rankings but highlighted the impact of vignette difficulty.
The integration of LLMs into clinical practice is gaining attention, especially for diagnosis and triage. Our findings show LLMs approaching physician-level diagnostic accuracy (up to 93.75% for best models) but remaining below physician benchmarks for triage, with a systematic tendency toward over-triage. Prompting served as a 'safety-first' lever, although it can drive unnecessary resource strain. The study also highlighted the sensitivity of LLM performance to vignette design and the importance of precise, comprehensive clinical descriptions for accurate predictions.
In conclusion, ChatGPT-01 and DeepSeek-R1 achieved the highest diagnostic accuracy among the evaluated LLMs, approaching primary care physician benchmarks on this specific vignette-based task. While triage remains a challenge, structured prompting proves to be a practical, training-free method to enhance robustness. Continued research is essential to ensure reliable and safe clinical integration of LLMs, focusing on uncertainty-aware prompting and real-world, multi-turn and multimodal cases.
Impact of Structured Prompting on Performance
+1.83% Average Diagnostic Accuracy Increase with Structured Prompting
Structured prompting proved to be a powerful lever, boosting mean diagnostic accuracy from 89.84% to 91.67% and mean triage accuracy from 76.82% to 86.20% across all models, demonstrating enhanced robustness.
LLM Evaluation Methodology Overview
The study's rigorous evaluation involved assessing contemporary LLMs on a validated clinical vignette dataset under both non-prompted and structured-prompted conditions, using a comprehensive set of metrics including accuracy, over-triage, safety, and CCS, with comparison to physician and layperson benchmarks.
Top Performers and Safety-First Trade-offs
Scenario: In evaluating LLMs for clinical decision support, the challenge lies in achieving high accuracy while prioritizing patient safety, especially in triage.
Solution: Models like ChatGPT-01 and DeepSeek-R1 achieved the highest diagnostic accuracy (93.75%). With structured prompting, the safety of advice across models significantly increased to 94.53%.
Outcome: This 'safety-first' approach, however, led to an increased over-triage rate of 65.62%. While beneficial for minimizing under-triage, it highlights a trade-off where models might err on the side of caution, potentially leading to increased resource utilization.
While specific LLMs like ChatGPT-01 and DeepSeek-R1 achieved top-tier diagnostic accuracy, structured prompting pushed all models towards a 'safety-first' mode. This, while desirable for avoiding under-triage, increased overall over-triage, highlighting a trade-off between caution and resource utilization in clinical decision support.
| Feature | Diagnostic Accuracy | Triage Performance |
|---|---|---|
| LLM Performance (Best) | 93.75% (ChatGPT-01, DeepSeek-R1) | 86.20% mean accuracy with structured prompting |
| Physician Benchmarks | Approached by the best models on concise vignettes | Remains below physician level |
| Key Challenge | Sensitivity to vignette design and completeness | Systematic over-triage (65.62% with prompting) |
| Impact of Prompting | +1.83% (89.84% to 91.67%) | +9.38% (76.82% to 86.20%) |
Advanced LLMs demonstrated impressive diagnostic accuracy, closely approaching physician benchmarks on concise vignettes. However, triage remained a more significant challenge, consistently showing lower accuracy and a tendency for over-triage compared to human experts, despite substantial improvements with structured prompting.
Calculate Your Potential AI Impact
Estimate the efficiency gains and cost savings your enterprise could achieve by integrating advanced AI solutions.
Your AI Implementation Roadmap
A strategic phased approach to integrate large language models for impactful clinical decision support.
Initial Assessment & Data Integration
Review existing clinical data, identify critical diagnostic and triage pathways, and integrate LLM outputs into a controlled environment for initial testing.
Structured Prompt Engineering & Validation
Develop and refine structured prompts based on identified best practices and test their impact on accuracy and safety metrics using internal datasets.
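A structured prompt of the kind the study evaluated pairs task instructions with exemplar cases (few-shot examples). The sketch below is a hypothetical template for illustration only: the exemplar vignettes, field names, and wording are assumptions, not the study's actual prompt.

```python
# Illustrative structured-prompt builder: instructions, exemplar cases, then
# the new vignette. All content here is a hypothetical example.
EXEMPLARS = [
    {
        "vignette": "45-year-old with sudden crushing chest pain radiating to the left arm.",
        "diagnosis": "Acute myocardial infarction",
        "triage": "Emergent",
    },
    {
        "vignette": "22-year-old with mild seasonal sneezing and itchy eyes.",
        "diagnosis": "Allergic rhinitis",
        "triage": "Self-care",
    },
]

def build_structured_prompt(case_text: str) -> str:
    """Assemble a single-turn prompt: instructions, exemplars, then the new case."""
    lines = [
        "You are a clinical decision-support assistant.",
        "For the vignette below, give the most likely diagnosis and one triage level:",
        "Emergent, 1-day, 1-week, or Self-care.",
        "",
    ]
    for ex in EXEMPLARS:
        lines += [
            f"Vignette: {ex['vignette']}",
            f"Diagnosis: {ex['diagnosis']}",
            f"Triage: {ex['triage']}",
            "",
        ]
    lines += [f"Vignette: {case_text}", "Diagnosis:"]
    return "\n".join(lines)
```

Keeping the template in code makes it straightforward to A/B test prompt variants against internal datasets and track their impact on the accuracy and safety metrics described above.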
Pilot Program & Clinician Feedback
Launch a small-scale pilot with a subset of clinical staff to gather real-world feedback on LLM-assisted decision support and identify areas for refinement.
Safety & Bias Audit
Conduct comprehensive audits to ensure LLM recommendations are safe, unbiased, and align with clinical guidelines, focusing on minimizing under-triage risks.
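Since under-triage is the failure mode an audit most needs to catch, a minimal audit pass can flag every case where the model's recommended level is less urgent than the reference. This is a hypothetical sketch, not a clinical tool; the `under_triage_cases` helper and record format are assumptions.

```python
# Minimal audit sketch: flag records where the predicted triage level is less
# urgent than the reference level (under-triage). Levels ordered by urgency.
TRIAGE_ORDER = {"Emergent": 0, "1-day": 1, "1-week": 2, "Self-care": 3}

def under_triage_cases(records):
    """Return (case_id, predicted, reference) for every under-triaged record."""
    flagged = []
    for case_id, predicted, reference in records:
        # A higher index means a less urgent recommendation than required.
        if TRIAGE_ORDER[predicted] > TRIAGE_ORDER[reference]:
            flagged.append((case_id, predicted, reference))
    return flagged
```

Routing the flagged cases to clinician review closes the loop between the audit and the guideline-alignment checks described above.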
Scalable Deployment & Continuous Monitoring
Implement LLM solutions across broader clinical workflows, establish robust monitoring systems for ongoing performance, and integrate feedback loops for continuous improvement.
Ready to Transform Your Healthcare Operations?
Schedule a personalized consultation to discuss how our enterprise AI solutions can be tailored to your specific clinical needs and strategic objectives.