Research Paper
Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department
This study comprehensively evaluated the performance of seven large language models (LLMs) in emergency department (ED) triage using a large dataset of 39,375 patient cases. Findings indicate that while some LLMs show moderate agreement with physician decisions for triage and admission prediction, none achieved strong agreement. LLMs performed best in anatomically well-defined clinical scenarios but struggled with severity-based triage. The conclusion is that current LLMs are promising as supervised decision support tools rather than autonomous systems.
Executive Impact: Key Findings
Understand the critical insights and potential for AI in optimizing emergency department operations and patient care.
Deep Analysis & Enterprise Applications
Explore the specific findings from the research, presented as enterprise-focused modules.
DeepSeek achieved the highest agreement with physician ESI assessments (κw = 0.467; 95% CI: 0.457–0.476), followed closely by Gemini 2.5 (κw = 0.465; 95% CI: 0.457–0.471). Both models achieved moderate agreement. Claude Sonnet 4 showed slightly lower agreement (κw = 0.402; 95% CI: 0.394–0.409), remaining at the boundary between fair and moderate concordance.
Qwen (κw = 0.304), Grok (κw = 0.261), and Thinking GPT-5 (κw = 0.258) demonstrated fair agreement, while Instant GPT-5 showed poor agreement with physician triage decisions (κw = 0.176).
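The κw values above quantify ordinal agreement between each model's ESI level and the physician's. As a minimal pure-Python sketch, here is the standard quadratic-weighted Cohen's kappa; note the paper does not state its weighting scheme, so the quadratic weights are an assumption, and the inputs below are illustrative rather than study data:

```python
def quadratic_weighted_kappa(rater_a, rater_b, labels):
    """Quadratic-weighted Cohen's kappa for ordinal ratings (e.g., ESI 1-5)."""
    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(rater_a)
    # Observed contingency matrix of rating pairs
    O = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        O[idx[a]][idx[b]] += 1
    row = [sum(O[i]) for i in range(k)]
    col = [sum(O[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            E = row[i] * col[j] / n          # chance-expected count
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den

# Illustrative only: perfect agreement yields kappa = 1.0
print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
```

Because the weights grow with the squared distance between ESI levels, a model that mislabels an ESI 1 case as ESI 5 is penalized far more than one off by a single level, which matches how triage errors are weighted clinically.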
Claude Sonnet 4 achieved the highest agreement with physician referral decisions (accuracy: 67.1%; κ = 0.619, 95% CI: 0.614–0.624), corresponding to substantial agreement. DeepSeek demonstrated comparable performance (accuracy: 66.8%; κ = 0.615, 95% CI: 0.608–0.620), followed by Gemini 2.5 (accuracy: 64.5%; κ = 0.597, 95% CI: 0.591–0.602) and Grok (accuracy: 63.8%; κ = 0.580, 95% CI: 0.575–0.586).
Performance was strongest in anatomically well-defined specialties such as Ophthalmology (F1 = 0.872), Pediatrics (F1 = 0.849), and Otolaryngology (ENT) (F1 = 0.810). Performance was poorest in Orthopedics (F1 = 0.018), Shock Room (F1 = 0.185), and Fast Track (F1 = 0.235).
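The per-specialty F1-scores above treat each specialty as a one-vs-rest classification target. A minimal sketch of that computation in pure Python (specialty names and inputs below are illustrative, not study data):

```python
def per_class_f1(y_true, y_pred):
    """One-vs-rest F1-score for each class label in a multi-class task."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    return scores

# Illustrative toy example with hypothetical specialty labels
scores = per_class_f1(
    ["Ophthalmology", "ENT", "Ophthalmology"],
    ["Ophthalmology", "Ophthalmology", "Ophthalmology"],
)
```

Because F1 balances precision and recall per specialty, a model that over-refers everything to one department (as in the toy example above) scores well on that department but collapses elsewhere, which is the failure mode the severity-based categories exhibit.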
Claude Sonnet 4 was the best-performing model for admission prediction, achieving a binary Cohen's kappa of approximately 0.46, indicating moderate agreement. The next best were Gemini 2.5 and DeepSeek (κ ≈ 0.37).
Qwen and DeepSeek exhibited a strong positive bias, generating 3–5 times more false admissions than missed admissions. Thinking GPT-5 showed a dangerous negative bias, significantly favoring false discharges (missed admissions). Gemini 2.5 demonstrated the most balanced error profile.
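The directional biases described above can be summarized from a binary confusion matrix: false admissions are false positives, missed admissions are false negatives, and their ratio characterizes whether a model over- or under-admits. A minimal sketch under that framing (labels and data below are illustrative, not from the study):

```python
def admission_error_profile(y_true, y_pred):
    """Directional error profile for binary admit/discharge predictions.
    Convention (assumed): 1 = admit, 0 = discharge."""
    false_admissions = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    missed_admissions = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Ratio > 1 means the model over-admits; < 1 means it under-admits
    ratio = (false_admissions / missed_admissions
             if missed_admissions else float("inf"))
    return {"false_admissions": false_admissions,
            "missed_admissions": missed_admissions,
            "over_admission_ratio": ratio}

# Illustrative: a model with a strong positive (over-admission) bias
profile = admission_error_profile([0, 0, 0, 1, 1], [1, 1, 1, 0, 1])
```

In a safety-critical setting the two error directions are not symmetric: an over-admission ratio well above 1 (Qwen, DeepSeek) wastes capacity, while a ratio below 1 (Thinking GPT-5) means patients who needed admission were sent home.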
LLM Triage Process Flow
| Specialty Category | High Performance LLMs | Challenges |
|---|---|---|
| Anatomically defined (e.g., Ophthalmology, Pediatrics) | | Minimal; high F1-scores (0.810–0.872) |
| Severity-based (e.g., Shock Room, Fast Track) | | High misclassification and low recall; F1-scores < 0.25 due to missing critical cues and lack of gestalt reasoning |
| Orthopedics (special case) | | Very poor performance (F1 = 0.018) due to limited dataset and specific hospital context (no dedicated department) |
Case Study: Improving Pediatric Triage
In pediatric cases, LLMs such as Claude Sonnet 4 achieved an F1-score of 0.849, outperforming most other specialty categories. This highlights their potential as effective decision support tools in well-defined clinical areas with clear symptomatology. Integrating LLMs into pediatric triage could enable faster, more accurate initial assessments, reducing wait times and improving patient flow. However, continuous physician oversight remains essential to prevent misinterpretation of nuanced cases.
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by implementing tailored AI solutions based on our research.
Your AI Implementation Roadmap
A phased approach to integrate advanced AI into your emergency department, ensuring successful adoption and measurable impact.
Phase 1: Pilot & Validation
Deploy selected LLMs in a controlled, supervised environment for 3 months. Collect data on physician override rates and compare LLM predictions to actual outcomes. Focus on anatomically well-defined specialties first.
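The pilot metrics named above (physician override rates, prediction-vs-outcome comparison) can be computed from a simple decision log. A minimal sketch, where the record schema and field names are hypothetical placeholders for whatever logging format the pilot actually uses:

```python
def pilot_metrics(records):
    """Summarize a supervised-pilot decision log.
    Each record is a dict with hypothetical keys:
      'llm_esi', 'physician_esi'   - triage levels assigned
      'llm_admit', 'outcome_admitted' - predicted vs. actual admission
    """
    n = len(records)
    # Override: the physician's final ESI differs from the LLM suggestion
    overrides = sum(1 for r in records if r["llm_esi"] != r["physician_esi"])
    # Outcome check: did the LLM's admission call match what actually happened?
    correct_admit = sum(1 for r in records if r["llm_admit"] == r["outcome_admitted"])
    return {"override_rate": overrides / n,
            "admit_accuracy": correct_admit / n}

# Illustrative two-case log
metrics = pilot_metrics([
    {"llm_esi": 3, "physician_esi": 3, "llm_admit": True, "outcome_admitted": True},
    {"llm_esi": 2, "physician_esi": 3, "llm_admit": True, "outcome_admitted": False},
])
```

Tracking the override rate per specialty, rather than only in aggregate, would also confirm whether the anatomically well-defined categories chosen for the pilot behave as the study predicts.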
Phase 2: Integration & Training
Integrate LLM insights directly into the EMR. Conduct comprehensive training for ED staff on interpreting LLM recommendations and best practices for supervised use. Refine prompts and data input methods based on pilot feedback.
Phase 3: Scaled Deployment & Optimization
Expand LLM use to broader triage categories, including severity-based scenarios, with enhanced monitoring. Implement continuous feedback loops for model optimization and bias detection. Evaluate impact on ED efficiency, patient outcomes, and physician workload.
Ready to Transform Your ED Operations?
Schedule a personalized consultation with our AI specialists to discuss how these findings can be applied to your unique institutional needs.