Benchmarking large language model-based agent systems for clinical decision tasks
Unpacking the Efficacy and Efficiency of AI Agents in Clinical Decision Making
This study systematically evaluates LLM-based agent systems (OpenManus, Manus) against baseline LLMs (GPT-4.1, Qwen-3, Gemma-3, MedGemma, Llama-4) on clinical decision tasks: diagnostic simulations (AgentClinic), medical QA (MedAgentsBench), and challenging questions (Humanity's Last Exam). Agent systems offer modest accuracy gains (peaking at 60.3% accuracy on AgentClinic MedQA), but these gains come with substantial increases in computational cost (>10x token usage, >2x latency) and with hallucinations that persist even after mitigation (affecting ~30% of clinical scenarios). The study underscores the need for more accurate, efficient, and clinically viable agent systems: current designs present significant practical trade-offs and are not yet robust enough for routine clinical deployment without further refinement and rigorous validation.
Executive Impact: Key Findings for Enterprise AI Strategy
Our comprehensive analysis reveals critical performance metrics and operational trade-offs for deploying agentic AI in healthcare.
Deep Analysis & Enterprise Applications
Agent systems achieved a peak accuracy of 60.3% in simulated diagnostic dialogues (AgentClinic MedQA), a modest gain over baseline LLMs, underscoring the limited performance benefits despite advanced capabilities.
| System Type | Text-Only Tasks (MedQA, MIMIC-IV, HLE) | Multimodal Tasks (NEJM, HLE) |
|---|---|---|
| Baseline LLMs | | |
| Agent Systems (OpenManus variants) | | |
A comparative analysis reveals that agent systems offer marginal accuracy improvements in text-only tasks but inconsistent performance in multimodal tasks, always at significantly higher computational costs.
Agent systems consistently used more than ten times as many tokens as baseline LLMs across all benchmarks, a substantial increase in computational demand.
Response times for agent systems were at least twice those of baseline LLMs, a significant latency penalty attributable to their complex reasoning processes and tool invocations.
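The token and latency overheads described above can be measured with a few lines of bookkeeping. The following is a minimal sketch, assuming per-question run logs are available; the `RunRecord` schema, the `overhead` helper, and the sample numbers are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One benchmark question answered by one system (hypothetical schema)."""
    system: str        # "baseline" or "agent"
    tokens: int        # total prompt + completion tokens consumed
    latency_s: float   # wall-clock seconds to produce the final answer


def overhead(records, baseline="baseline", agent="agent"):
    """Return (token_ratio, latency_ratio) of agent means over baseline means."""
    base = [r for r in records if r.system == baseline]
    agnt = [r for r in records if r.system == agent]
    if not base or not agnt:
        raise ValueError("need at least one record for each system")
    token_ratio = mean(r.tokens for r in agnt) / mean(r.tokens for r in base)
    latency_ratio = mean(r.latency_s for r in agnt) / mean(r.latency_s for r in base)
    return token_ratio, latency_ratio


# Illustrative numbers only; they are NOT results from the study.
records = [
    RunRecord("baseline", 1_000, 4.0),
    RunRecord("baseline", 1_200, 6.0),
    RunRecord("agent", 12_000, 11.0),
    RunRecord("agent", 14_400, 14.0),
]
token_ratio, latency_ratio = overhead(records)
print(f"tokens: {token_ratio:.1f}x, latency: {latency_ratio:.1f}x")
# prints "tokens: 12.0x, latency: 2.5x"
```

Comparing means is the simplest choice; with skewed latency distributions, medians or percentiles may give a fairer picture.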
Enterprise Process Flow
The intricate, multi-step nature of agent systems' workflows, involving planning, tool invocation, and iterative reasoning, contributes directly to their increased computational burden and response times.
Hallucinations were prevalent, and in-agent safeguards filtered 89.9% of them. Even so, unblocked hallucinations still affected approximately 30% of clinical scenarios after mitigation.
Hallucination Impact in Clinical Scenarios
In AgentClinic-MedQA, the original OpenManus generated 728 hallucinations across 97.2% of scenarios. Although 532 were blocked, the remaining 196 unblocked hallucinations affected diagnoses in 149 scenarios (69.6%). Targeted interventions (output filtering, prompt engineering) reduced diagnostically relevant hallucinations to 155 across all variants. No statistically significant difference in diagnostic accuracy was found between scenarios with and without hallucinations, possibly because the hallucinations triggered further information gathering.
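The counts for the original OpenManus variant can be cross-checked with simple arithmetic. This is a small sketch using only the figures quoted above; the one assumption, flagged in the code, is that 69.6% means affected scenarios divided by all scenarios. The blocked share computed here covers this single variant and need not match aggregate figures reported elsewhere.

```python
# Counts reported for the original OpenManus variant on AgentClinic-MedQA.
total_hallucinations = 728
blocked = 532
affected_scenarios = 149
affected_share = 0.696  # 69.6%, as reported

unblocked = total_hallucinations - blocked
blocked_share = blocked / total_hallucinations

# Assumption: 69.6% = affected scenarios / all scenarios, so the total
# scenario count can be recovered by inverting that ratio.
implied_scenarios = round(affected_scenarios / affected_share)

print(unblocked)                 # 196
print(f"{blocked_share:.1%}")    # 73.1%
print(implied_scenarios)         # 214
```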
Key Takeaway: While mitigation strategies are effective at reducing the *number* of impactful hallucinations, their persistence across a significant portion of scenarios (around 30% post-mitigation) remains a critical safety concern for real-world clinical deployment, even if they don't always directly alter final accuracy in simulated settings.
This module illustrates the persistent challenge of hallucinations in agent systems, detailing their frequency, mitigation, and the remaining impact on diagnostic outcomes in simulated clinical environments.
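A comparison of diagnostic accuracy between scenarios with and without hallucinations can be framed as a test on two proportions. The sketch below uses a pooled two-proportion z-test; this is one common choice, and the source does not say which test the authors used. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value) using the pooled standard error.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all outcomes identical; no detectable difference
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT from the study: 60/100 correct diagnoses in
# scenarios with hallucinations versus 65/100 in scenarios without.
z, p_value = two_proportion_z_test(60, 100, 65, 100)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these illustrative counts the p-value is well above 0.05, which is the pattern a "no significant difference" finding would show. For small groups, an exact test would be a safer choice than the normal approximation.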
The study concludes that current LLM-based agent systems, despite improvements, remain insufficiently robust, efficient, and reliable for routine clinical deployment, emphasizing the need for significant refinement.
| Challenge | Impact on Clinical Use | Recommended Action |
|---|---|---|
| Modest Accuracy Gains | | |
| High Computational Cost | | |
| Persistent Hallucinations | | |
| Idealized Benchmarks | | |
Key challenges of agent system deployment in healthcare, along with their practical implications and recommended actions for future development, emphasizing a cautious and rigorous approach.
Implementation Roadmap
A phased approach to integrate agentic AI, ensuring strategic deployment and maximum value.
Phase 1: Pilot & Data Integration
Initial setup of agent systems with existing clinical data (e.g., EHR excerpts). Focus on low-risk, well-defined tasks like preliminary report generation or basic information retrieval. Establish baseline performance and monitor resource consumption and hallucination rates.
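Monitoring resource consumption and hallucination rates during a pilot can start as a simple report over per-case logs. The following is a minimal sketch under stated assumptions: the `CaseLog` schema, the `pilot_report` helper, and all thresholds are hypothetical, and the hallucination flag is assumed to come from human review.

```python
from dataclasses import dataclass


@dataclass
class CaseLog:
    """One pilot case (hypothetical schema)."""
    tokens: int
    latency_s: float
    unblocked_hallucination: bool  # reviewer found a hallucination in the output


def pilot_report(logs, max_mean_tokens, max_mean_latency_s, max_hallucination_rate):
    """Summarize pilot logs and list which limits were exceeded."""
    if not logs:
        raise ValueError("no cases logged")
    n = len(logs)
    metrics = {
        "mean_tokens": sum(c.tokens for c in logs) / n,
        "mean_latency_s": sum(c.latency_s for c in logs) / n,
        "hallucination_rate": sum(c.unblocked_hallucination for c in logs) / n,
    }
    limits = {
        "mean_tokens": max_mean_tokens,
        "mean_latency_s": max_mean_latency_s,
        "hallucination_rate": max_hallucination_rate,
    }
    breaches = [name for name, limit in limits.items() if metrics[name] > limit]
    return metrics, breaches


# Illustrative logs and thresholds only; none of these values come from the study.
logs = [
    CaseLog(10_000, 10.0, True),
    CaseLog(12_000, 12.0, False),
    CaseLog(8_000, 8.0, False),
    CaseLog(10_000, 10.0, False),
]
metrics, breaches = pilot_report(
    logs, max_mean_tokens=15_000, max_mean_latency_s=8.0, max_hallucination_rate=0.1
)
print(metrics)
print(breaches)  # ['mean_latency_s', 'hallucination_rate']
```

Appropriate limits depend on the deployment; the pilot itself is where those baselines would be established.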
Phase 2: Customization & Tool Refinement
Tailor prompt engineering and integrate specialized medical tools (e.g., advanced imaging analysis, specific knowledge bases). Conduct iterative testing in simulated environments with clinical experts to refine agent behavior and improve accuracy and efficiency for target tasks.
Phase 3: Controlled Clinical Validation
Deploy agent systems in a controlled, supervised clinical environment, possibly as 'shadow AI' alongside human clinicians. Rigorously assess performance on complex diagnostic pathways, treatment planning, and multimodal data, focusing on safety, interpretability, and human-AI collaboration. Obtain regulatory approvals.
Phase 4: Scaled Deployment & Continuous Monitoring
Expand deployment to broader clinical settings, ensuring seamless integration with existing IT infrastructure. Implement continuous monitoring for performance, safety, and efficiency. Establish feedback loops with clinicians for ongoing improvement and adaptation to evolving clinical needs.