Enterprise AI Analysis: Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department

Research Paper


This study comprehensively evaluated the performance of seven large language models (LLMs) in emergency department (ED) triage using a large dataset of 39,375 patient cases. While some LLMs showed moderate agreement with physician decisions for triage and admission prediction, none achieved strong agreement. LLMs performed best in anatomically well-defined clinical scenarios but struggled with severity-based triage. The authors conclude that current LLMs are promising as supervised decision support tools rather than autonomous systems.

Executive Impact: Key Findings

Understand the critical insights and potential for AI in optimizing emergency department operations and patient care.

0.467 Highest ESI Agreement (Kappa)
67.1% Best Clinic Referral Accuracy
39,375 Total Patients Analyzed

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Triage Score Agreement
Clinic Referral Accuracy
Admission Prediction

DeepSeek achieved the highest agreement with physician ESI assessments (κw = 0.467; 95% CI: 0.457–0.476), followed closely by Gemini 2.5 (κw = 0.465; 95% CI: 0.457–0.471). Both models achieved moderate agreement. Claude Sonnet 4 showed slightly lower agreement (κw = 0.402; 95% CI: 0.394–0.409), remaining at the boundary between fair and moderate concordance.

Qwen (κw = 0.304), Grok (κw = 0.261), and Thinking GPT-5 (κw = 0.258) demonstrated fair agreement, while Instant GPT-5 showed poor agreement with physician triage decisions (κw = 0.176).
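Weighted kappa values like these can be reproduced from raw rating pairs. Below is a minimal pure-Python sketch of Cohen's kappa with quadratic weights, a standard choice for ordinal scales such as ESI; the study's exact weighting scheme is an assumption here:

```python
def quadratic_weighted_kappa(y1, y2, n_classes=5):
    """Cohen's weighted kappa with quadratic weights for ordinal
    ratings such as ESI levels (1 = most urgent, 5 = least urgent).
    y1, y2 are parallel lists of 1-based integer ratings."""
    n = len(y1)
    # observed joint probability of (rater-1 level, rater-2 level)
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(y1, y2):
        obs[a - 1][b - 1] += 1.0 / n
    # marginal probabilities for each rater
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    # quadratic penalty grows with squared distance between levels
    w = [[(i - j) ** 2 / (n_classes - 1) ** 2 for j in range(n_classes)]
         for i in range(n_classes)]
    d_obs = sum(w[i][j] * obs[i][j]
                for i in range(n_classes) for j in range(n_classes))
    d_exp = sum(w[i][j] * pa[i] * pb[j]
                for i in range(n_classes) for j in range(n_classes))
    return 1.0 - d_obs / d_exp
```

Quadratic weighting penalizes a model that calls a level-1 patient a level-5 far more than one that is off by a single level, which matches the clinical cost of triage errors.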

Claude Sonnet 4 achieved the highest agreement with physician referral decisions (accuracy: 67.1%; κ = 0.619, 95% CI: 0.614–0.624), corresponding to substantial agreement. DeepSeek demonstrated comparable performance (accuracy: 66.8%; κ = 0.615, 95% CI: 0.608–0.620), followed by Gemini 2.5 (accuracy: 64.5%; κ = 0.597, 95% CI: 0.591–0.602) and Grok (accuracy: 63.8%; κ = 0.580, 95% CI: 0.575–0.586).

Performance was strongest in anatomically well-defined specialties like Ophthalmology (F1 = 0.872), Pediatrics (F1 = 0.849), and Otolaryngology (ENT) (F1 = 0.810). Poorest performance was in Orthopedics (F1 = 0.018), Shock Room (F1 = 0.185), and Fast Track (F1 = 0.235).
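Per-specialty F1-scores of this kind come from one-vs-rest precision and recall over the referral labels. A minimal sketch, with hypothetical label names and toy data:

```python
def f1_per_specialty(y_true, y_pred, labels):
    """One-vs-rest F1 for each referral clinic label.
    y_true: physician referrals; y_pred: LLM suggestions."""
    scores = {}
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

Because F1 balances precision and recall per label, it exposes exactly the pattern reported here: high scores where the presenting anatomy maps cleanly to one clinic, near-zero scores where referral depends on severity judgment.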

Claude Sonnet 4 was the best-performing model for admission prediction, achieving a binary Cohen's kappa of approximately 0.46, indicating moderate agreement. The next-best models were Gemini 2.5 and DeepSeek (κ ≈ 0.37).

Qwen and DeepSeek exhibited a strong positive bias, generating 3–5 times more false admissions than missed admissions. Thinking GPT-5 showed a dangerous negative bias, significantly favoring false discharges (missed admissions). Gemini 2.5 demonstrated the most balanced error profile.
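The over- versus under-admission bias described above can be read directly off the confusion counts. A sketch, where the binary encoding (1 = admit, 0 = discharge) and the toy data are assumptions:

```python
def admission_bias(y_true, y_pred):
    """Compare false admissions (model admits, physician discharged)
    with missed admissions (model discharges, physician admitted).
    Returns (false_admissions, missed_admissions, ratio); a ratio
    well above 1 indicates the over-triage bias seen in Qwen and
    DeepSeek, while a ratio below 1 indicates the riskier
    under-triage bias seen in Thinking GPT-5."""
    false_adm = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    missed_adm = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    ratio = false_adm / missed_adm if missed_adm else float("inf")
    return false_adm, missed_adm, ratio
```

The asymmetry matters clinically: a false admission wastes resources, but a missed admission sends an unwell patient home.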

61.7% DeepSeek & Claude Sonnet 4 Raw ESI Accuracy

LLM Triage Process Flow

Patient Data Anonymization
LLM Input (Symptoms, Vitals)
ESI Score Prediction
Referral Clinic Suggestion
Admission Prediction
Physician Review & Override
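The flow above can be sketched as a supervised pipeline in which the LLM only suggests and the physician decides. The callables, field names, and record format below are hypothetical illustrations, not part of the paper:

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    esi: int          # predicted ESI level, 1 (critical) to 5 (non-urgent)
    clinic: str       # suggested referral clinic
    admit: bool       # predicted admission
    overridden: bool  # set when the physician overrides the model

def triage_pipeline(record, llm_predict, physician_review):
    """Anonymize the record, query the LLM for ESI, referral, and
    admission, then hand the suggestion to a physician for final
    review. `llm_predict` and `physician_review` are assumed
    callables supplied by the deployment."""
    # step 1: strip direct identifiers before any model call
    anonymized = {k: v for k, v in record.items() if k not in {"name", "mrn"}}
    # steps 2-5: single LLM call yields all three predictions
    esi, clinic, admit = llm_predict(anonymized)
    # step 6: physician review always has the last word
    return physician_review(TriageResult(esi, clinic, admit, overridden=False))
```

Keeping the physician-review step as the mandatory final stage encodes the paper's conclusion: decision support, not autonomy.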

LLM Performance by Specialty Category

Specialty Category | Top-Performing LLMs | Challenges
Anatomically defined (e.g., Ophthalmology, Pediatrics) | Claude Sonnet 4, DeepSeek, Gemini 2.5 | Minimal; high F1-scores (0.810–0.872)
Severity-based (e.g., Shock Room, Fast Track) | Claude Sonnet 4 (moderate) | High misclassification and low recall (F1-scores < 0.25) due to missing critical cues and limited gestalt reasoning
Orthopedics (special case) | N/A | Very poor performance (F1 = 0.018) due to the limited dataset and specific hospital context (no dedicated orthopedics department)

Case Study: Improving Pediatric Triage

In pediatric cases, LLMs like Claude Sonnet 4 achieved an F1-score of 0.849, significantly outperforming other categories. This highlights their potential as highly effective decision support tools in well-defined clinical areas with clear symptomatology. Integrating LLMs in pediatric triage could lead to faster, more accurate initial assessments, reducing wait times and improving patient flow. However, continuous physician oversight is crucial to prevent misinterpretation of nuanced cases.

Calculate Your Potential AI ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by implementing tailored AI solutions based on our research.


Your AI Implementation Roadmap

A phased approach to integrate advanced AI into your emergency department, ensuring successful adoption and measurable impact.

Phase 1: Pilot & Validation

Deploy selected LLMs in a controlled, supervised environment for 3 months. Collect data on physician override rates and compare LLM predictions to actual outcomes. Focus on anatomically well-defined specialties first.
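The physician override rate named above is straightforward to compute from paired decisions. The logging format here (a sequence of (LLM suggestion, final physician decision) pairs) is a hypothetical assumption for the pilot phase:

```python
def override_rate(decisions):
    """Share of cases where the physician's final decision differed
    from the LLM suggestion. `decisions` is an iterable of
    (llm_decision, final_decision) pairs logged during the pilot."""
    pairs = list(decisions)
    if not pairs:
        return 0.0
    return sum(1 for llm, final in pairs if llm != final) / len(pairs)
```

Tracked weekly and broken down by specialty, this one number shows where the model can be trusted more and where supervision must stay tight.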

Phase 2: Integration & Training

Integrate LLM insights directly into the EMR. Conduct comprehensive training for ED staff on interpreting LLM recommendations and best practices for supervised use. Refine prompts and data input methods based on pilot feedback.

Phase 3: Scaled Deployment & Optimization

Expand LLM use to broader triage categories, including severity-based scenarios, with enhanced monitoring. Implement continuous feedback loops for model optimization and bias detection. Evaluate impact on ED efficiency, patient outcomes, and physician workload.

Ready to Transform Your ED Operations?

Schedule a personalized consultation with our AI specialists to discuss how these findings can be applied to your unique institutional needs.
