Enterprise AI Analysis
Diagnosis and Triage Performance of Contemporary Large Language Models on Short Clinical Vignettes
General-purpose large language models (LLMs) are increasingly proposed for diagnostic and triage decision support, yet their reliability relative to humans remains unclear. This study evaluated eight contemporary LLMs (ChatGPT-4, ChatGPT-01, DeepSeek-V3, DeepSeek-R1, Gemini-2.0, Copilot, Grok-2, Llama-3.1) on 48 single-turn clinical vignettes spanning four triage levels (Emergent, 1-day, 1-week, Self-care). Models were tested without prompts and with structured prompts. Structured prompting significantly improved both diagnostic accuracy (from 89.84% to 91.67%) and triage accuracy (from 76.82% to 86.20%). The best diagnostic accuracy was 93.75% (ChatGPT-01 and DeepSeek-R1). Prompting also shifted models toward a more precautionary stance, increasing safety of advice from 89.06% to 94.53%, accompanied by higher over-triage (from 53.15% to 65.62%). While advanced LLMs show high diagnostic accuracy, triage remains a significant challenge. Structured prompting is a practical lever to enhance robustness.
Key Performance Indicators
Leveraging structured prompting, LLMs demonstrate significant gains in diagnostic and triage accuracy and enhanced safety of advice for clinical support.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The rapid evolution of large language models (LLMs) like ChatGPT-4, DeepSeek, and Gemini has profoundly transformed natural language processing. These models excel in text generation, complex reasoning, and integrating domain-specific knowledge, making them highly promising for healthcare applications such as disease diagnosis and patient management. However, their reliability compared to humans is still an open question, with prior studies showing mixed findings and highlighting the need for standardized evaluation frameworks.
This study utilized a validated dataset of 48 synthetic single-turn clinical vignettes, categorized into four triage urgency levels (Emergent, 1-day, 1-week, Self-care). Eight contemporary LLMs were evaluated under two scenarios: without prompts and with structured prompts comprising exemplar cases. Performance was assessed using diagnostic and triage accuracy, confusion matrices, over-triage, safety of advice, and the Capability Comparison Score (CCS) to account for case difficulty.
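The triage-oriented metrics above can be made concrete with a short sketch. This is an illustrative assumption about how such metrics are typically defined (correct level for accuracy, a more urgent level than needed for over-triage, and "correct or more urgent" for safety of advice), not the authors' actual scoring code; the `TRIAGE_ORDER` mapping and `triage_metrics` helper are hypothetical.

```python
# Hypothetical scoring sketch: the four triage levels are ordered by urgency,
# and a prediction is "safe" when it is at least as urgent as the reference.
TRIAGE_ORDER = {"Emergent": 0, "1-day": 1, "1-week": 2, "Self-care": 3}

def triage_metrics(predicted, reference):
    """Compute triage accuracy, over-triage rate, and safety of advice."""
    assert len(predicted) == len(reference) and reference
    n = len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    # Over-triage: the model picks a more urgent (lower-index) level than needed.
    over = sum(TRIAGE_ORDER[p] < TRIAGE_ORDER[r] for p, r in zip(predicted, reference))
    # Safe advice: the correct level or a more urgent one (never under-triage).
    safe = sum(TRIAGE_ORDER[p] <= TRIAGE_ORDER[r] for p, r in zip(predicted, reference))
    return {"accuracy": correct / n, "over_triage": over / n, "safety": safe / n}
```

Under these definitions, a model that always answers "Emergent" would score 100% on safety while inflating over-triage, which is exactly the trade-off the study reports for prompted models.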
Structured prompting consistently boosted performance across all models. Mean diagnostic accuracy increased from 89.84% to 91.67%, and mean triage accuracy from 76.82% to 86.20%. ChatGPT-01 and DeepSeek-R1 achieved the best diagnostic accuracy at 93.75%. Prompting also shifted models toward safety, increasing safety of advice to 94.53% but also raising over-triage to 65.62%. CCS values preserved rankings but highlighted the impact of vignette difficulty.
The integration of LLMs into clinical practice is gaining attention, especially for diagnosis and triage. Our findings show LLMs approaching physician-level diagnostic accuracy (up to 93.75% for best models) but remaining below physician benchmarks for triage, with a systematic tendency toward over-triage. Prompting served as a 'safety-first' lever, although it can drive unnecessary resource strain. The study also highlighted the sensitivity of LLM performance to vignette design and the importance of precise, comprehensive clinical descriptions for accurate predictions.
In conclusion, ChatGPT-01 and DeepSeek-R1 achieved the highest diagnostic accuracy among the evaluated LLMs, approaching primary care physician benchmarks on this specific vignette-based task. While triage remains a challenge, structured prompting proves to be a practical, training-free method to enhance robustness. Continued research is essential to ensure reliable and safe clinical integration of LLMs, focusing on uncertainty-aware prompting and real-world, multi-turn and multimodal cases.
Impact of Structured Prompting on Performance
+1.83% Average Diagnostic Accuracy Increase with Structured Prompting
Structured prompting proved to be a powerful lever, boosting mean diagnostic accuracy from 89.84% to 91.67% and mean triage accuracy from 76.82% to 86.20% across all models, demonstrating enhanced robustness.
LLM Evaluation Methodology Overview
The study's rigorous evaluation involved assessing contemporary LLMs on a validated clinical vignette dataset under both non-prompted and structured-prompted conditions, using a comprehensive set of metrics including accuracy, over-triage, safety, and CCS, with comparison to physician and layperson benchmarks.
Top Performers and Safety-First Trade-offs
Scenario: In evaluating LLMs for clinical decision support, the challenge lies in achieving high accuracy while prioritizing patient safety, especially in triage.
Solution: Models like ChatGPT-01 and DeepSeek-R1 achieved the highest diagnostic accuracy (93.75%). With structured prompting, the safety of advice across models significantly increased to 94.53%.
Outcome: This 'safety-first' approach, however, led to an increased over-triage rate of 65.62%. While beneficial for minimizing under-triage, it highlights a trade-off where models might err on the side of caution, potentially leading to increased resource utilization.
While specific LLMs like ChatGPT-01 and DeepSeek-R1 achieved top-tier diagnostic accuracy, structured prompting pushed all models towards a 'safety-first' mode. This, while desirable for avoiding under-triage, increased overall over-triage, highlighting a trade-off between caution and resource utilization in clinical decision support.
| Feature | Diagnostic Accuracy | Triage Performance |
|---|---|---|
| LLM Performance (Best) | 93.75% (ChatGPT-01, DeepSeek-R1) | 86.20% mean accuracy with structured prompting |
| Physician Benchmarks | Approached by the best models on concise vignettes | Remains below physician level |
| Key Challenge | Sensitivity to vignette design and completeness | Systematic over-triage (65.62% with prompting) |
| Impact of Prompting | +1.83% (89.84% to 91.67%) | +9.38% (76.82% to 86.20%) |
Advanced LLMs demonstrated impressive diagnostic accuracy, closely approaching physician benchmarks on concise vignettes. However, triage remained a more significant challenge, consistently showing lower accuracy and a tendency for over-triage compared to human experts, despite substantial improvements with structured prompting.
Calculate Your Potential AI Impact
Estimate the efficiency gains and cost savings your enterprise could achieve by integrating advanced AI solutions.
Your AI Implementation Roadmap
A strategic phased approach to integrate large language models for impactful clinical decision support.
Initial Assessment & Data Integration
Review existing clinical data, identify critical diagnostic and triage pathways, and integrate LLM outputs into a controlled environment for initial testing.
Structured Prompt Engineering & Validation
Develop and refine structured prompts based on identified best practices and test their impact on accuracy and safety metrics using internal datasets.
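A structured prompt of the kind the study evaluated pairs task instructions with exemplar cases (few-shot examples). The sketch below is a hypothetical template for illustration only: the exemplar vignettes, field names, and wording are assumptions, not the study's actual prompt.

```python
# Illustrative structured-prompt builder: instructions, exemplar cases, then
# the new vignette. All content here is a hypothetical example.
EXEMPLARS = [
    {
        "vignette": "45-year-old with sudden crushing chest pain radiating to the left arm.",
        "diagnosis": "Acute myocardial infarction",
        "triage": "Emergent",
    },
    {
        "vignette": "22-year-old with mild seasonal sneezing and itchy eyes.",
        "diagnosis": "Allergic rhinitis",
        "triage": "Self-care",
    },
]

def build_structured_prompt(case_text: str) -> str:
    """Assemble a single-turn prompt: instructions, exemplars, then the new case."""
    lines = [
        "You are a clinical decision-support assistant.",
        "For the vignette below, give the most likely diagnosis and one triage level:",
        "Emergent, 1-day, 1-week, or Self-care.",
        "",
    ]
    for ex in EXEMPLARS:
        lines += [
            f"Vignette: {ex['vignette']}",
            f"Diagnosis: {ex['diagnosis']}",
            f"Triage: {ex['triage']}",
            "",
        ]
    lines += [f"Vignette: {case_text}", "Diagnosis:"]
    return "\n".join(lines)
```

Keeping the template in code makes it straightforward to A/B test prompt variants against internal datasets and track their impact on the accuracy and safety metrics described above.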
Pilot Program & Clinician Feedback
Launch a small-scale pilot with a subset of clinical staff to gather real-world feedback on LLM-assisted decision support and identify areas for refinement.
Safety & Bias Audit
Conduct comprehensive audits to ensure LLM recommendations are safe, unbiased, and align with clinical guidelines, focusing on minimizing under-triage risks.
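Since under-triage is the failure mode an audit most needs to catch, a minimal audit pass can flag every case where the model's recommended level is less urgent than the reference. This is a hypothetical sketch, not a clinical tool; the `under_triage_cases` helper and record format are assumptions.

```python
# Minimal audit sketch: flag records where the predicted triage level is less
# urgent than the reference level (under-triage). Levels ordered by urgency.
TRIAGE_ORDER = {"Emergent": 0, "1-day": 1, "1-week": 2, "Self-care": 3}

def under_triage_cases(records):
    """Return (case_id, predicted, reference) for every under-triaged record."""
    flagged = []
    for case_id, predicted, reference in records:
        # A higher index means a less urgent recommendation than required.
        if TRIAGE_ORDER[predicted] > TRIAGE_ORDER[reference]:
            flagged.append((case_id, predicted, reference))
    return flagged
```

Routing the flagged cases to clinician review closes the loop between the audit and the guideline-alignment checks described above.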
Scalable Deployment & Continuous Monitoring
Implement LLM solutions across broader clinical workflows, establish robust monitoring systems for ongoing performance, and integrate feedback loops for continuous improvement.
Ready to Transform Your Healthcare Operations?
Schedule a personalized consultation to discuss how our enterprise AI solutions can be tailored to your specific clinical needs and strategic objectives.