Benchmarking large language model-based agent systems for clinical decision tasks

Unpacking the Efficacy and Efficiency of AI Agents in Clinical Decision Making

This study systematically evaluates LLM-based agent systems (OpenManus, Manus) against baseline LLMs (GPT-4.1, Qwen-3, Gemma-3, MedGemma, Llama-4) on clinical decision tasks: diagnostic simulations (AgentClinic), medical QA (MedAgentsBench), and challenging questions (Humanity's Last Exam). Agent systems offer modest accuracy gains (e.g., reaching 60.3% accuracy on AgentClinic MedQA), but these gains come with substantial increases in computational resources (>10x token usage, >2x latency) and with hallucinations that persist even after mitigation (affecting ~30% of clinical scenarios). The study underscores the need for more accurate, efficient, and clinically viable agent systems. Current designs carry significant practical trade-offs and are not yet robust enough for routine clinical deployment without further refinement and rigorous validation.

Executive Impact: Key Findings for Enterprise AI Strategy

Our comprehensive analysis reveals critical performance metrics and operational trade-offs for deploying agentic AI in healthcare.

60.3% Peak Accuracy (AgentClinic MedQA)

>10x Increase in Token Usage

>2x Increase in Latency

~30% Scenarios Affected by Hallucinations (post-mitigation)

Deep Analysis & Enterprise Applications

60.3% Agent Systems' Peak Accuracy in AgentClinic MedQA

Agent systems achieved a peak accuracy of 60.3% in simulated diagnostic dialogues (AgentClinic MedQA), a modest gain over baseline LLMs, underscoring the limited performance benefits despite advanced capabilities.

System Type	Text-Only Tasks (MedQA, MIMIC-IV, HLE)	Multimodal Tasks (NEJM, HLE)
Baseline LLMs	Modest accuracy, lower resource usage, less complex workflows.	Low accuracy, variable performance.
Agent Systems (OpenManus variants)	Modest accuracy gains (0.5-8.9% over best LLMs); high resource usage; complex workflows with tool use.	Low accuracy, inconsistent improvement over baselines; high resource usage.

A comparative analysis reveals that agent systems offer marginal accuracy improvements in text-only tasks but inconsistent performance in multimodal tasks, always at significantly higher computational costs.

>10x Token Usage Increase for Agent Systems

Agent systems consistently used more than ten times as many tokens as baseline LLMs across all benchmarks, a substantial increase in computational demand.

>2x Response Time Increase for Agent Systems

Response times for agent systems were more than twice those of baseline LLMs, a significant latency penalty attributable to their multi-step reasoning and tool invocations.
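The two overhead figures above are simple ratios of agent cost to baseline cost. The sketch below shows one way such ratios might be computed for a single question. The `RunStats` type, the `overhead` helper, and the numbers are all hypothetical and chosen for illustration; none of them come from the study.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate cost of answering one benchmark question."""
    tokens: int        # total prompt + completion tokens
    latency_s: float   # wall-clock seconds


def overhead(agent: RunStats, baseline: RunStats) -> tuple[float, float]:
    """Return (token ratio, latency ratio) of the agent relative to the baseline."""
    if baseline.tokens <= 0 or baseline.latency_s <= 0:
        raise ValueError("baseline stats must be positive")
    return agent.tokens / baseline.tokens, agent.latency_s / baseline.latency_s


# Hypothetical numbers for illustration only, not taken from the study.
baseline = RunStats(tokens=1_200, latency_s=4.0)
agent = RunStats(tokens=15_000, latency_s=11.0)

token_ratio, latency_ratio = overhead(agent, baseline)
print(f"token overhead: {token_ratio:.1f}x, latency overhead: {latency_ratio:.2f}x")
```

Tracking both ratios per task, rather than a single blended cost, makes it easier to see whether a given accuracy gain is being paid for mainly in tokens or in waiting time.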

Enterprise Process Flow

Plan Task → Invoke Tools (Web Search, Code Execution) → Process Results → Iterative Reasoning → Formulate Diagnosis

The intricate, multi-step nature of agent systems' workflows, involving planning, tool invocation, and iterative reasoning, contributes directly to their increased computational burden and response times.
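To make the cost argument concrete, here is a minimal sketch of such a loop. It is not the OpenManus implementation: the tools are canned stubs, the plan is hard-coded rather than produced by a model, and `count_tokens` is a crude whitespace count. It only illustrates why re-reading a growing context at every step inflates token usage relative to a single pass.

```python
from typing import Callable


# Hypothetical stand-ins for real tools; each returns canned text.
def web_search(query: str) -> str:
    return f"search results for: {query}"


def run_code(snippet: str) -> str:
    return f"output of: {snippet}"


TOOLS: dict[str, Callable[[str], str]] = {"web_search": web_search, "run_code": run_code}


def count_tokens(text: str) -> int:
    """Crude whitespace token count, used only to illustrate growth."""
    return len(text.split())


def run_agent(case: str, plan: list[tuple[str, str]]) -> tuple[str, int]:
    """Walk a fixed plan of (tool, argument) steps; return (diagnosis, tokens read).

    A real agent would ask an LLM to produce the plan and the diagnosis; here
    both are hard-coded so the example stays self-contained.
    """
    context = case
    tokens_read = 0
    for tool_name, arg in plan:
        tokens_read += count_tokens(context)   # the full context is re-read at each step
        result = TOOLS[tool_name](arg)         # invoke the tool
        context += "\n" + result               # process results by appending them
    tokens_read += count_tokens(context)       # final pass to formulate the diagnosis
    return "diagnosis placeholder", tokens_read


case = "45-year-old with chest pain and shortness of breath"
plan = [("web_search", "chest pain differential"), ("run_code", "risk_score(45)")]

_, agent_tokens = run_agent(case, plan)
single_pass_tokens = count_tokens(case)
print(agent_tokens, single_pass_tokens)
```

Even with two short tool calls, the tokens read grow several-fold over a single pass, because each iteration pays again for everything accumulated so far.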

89.9% Hallucinations Blocked by Safeguards

Despite the prevalence of hallucinations, in-agent safeguards filtered 89.9% of them. Even after mitigation, however, unblocked hallucinations still affected approximately 30% of clinical scenarios.

Hallucination Impact in Clinical Scenarios

In AgentClinic-MedQA, the original OpenManus generated 728 hallucinations across 97.2% of scenarios. Although 532 were blocked, the 196 unblocked hallucinations affected diagnoses in 149 scenarios (69.6%). Targeted interventions (output filtering, prompt engineering) reduced diagnostically relevant hallucinations to 155 across all variants. No statistically significant difference in diagnostic accuracy was found between scenarios with and without hallucinations, possibly because the hallucinations triggered further information gathering.

Key Takeaway: Mitigation strategies reduce the number of impactful hallucinations, but their persistence in around 30% of scenarios after mitigation remains a critical safety concern for real-world clinical deployment, even if they do not always alter final accuracy in simulated settings.

This module illustrates the persistent challenge of hallucinations in agent systems, detailing their frequency, mitigation, and the remaining impact on diagnostic outcomes in simulated clinical environments.
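The counts reported for the original OpenManus can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The implied scenario total is a derived estimate, not a number stated in this summary. The blocked share it computes applies to this one variant, and is a different quantity from the 89.9% headline figure, whose denominator the summary does not state.

```python
# Counts reported for the original OpenManus on AgentClinic-MedQA.
total_hallucinations = 728
blocked = 532
unblocked_reported = 196
affected_scenarios = 149
affected_share = 0.696  # 69.6% of scenarios

unblocked = total_hallucinations - blocked
blocked_share = blocked / total_hallucinations

# Implied number of scenarios, inferred from 149 being 69.6% of the total.
# This is a derived estimate, not a figure stated in the summary.
implied_scenarios = round(affected_scenarios / affected_share)

print(f"unblocked: {unblocked} (reported: {unblocked_reported})")
print(f"blocked share: {blocked_share:.1%}")
print(f"implied scenarios: {implied_scenarios}")
```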

Limited Robustness: Current Status for Routine Clinical Deployment

The study concludes that current LLM-based agent systems, despite improvements, remain insufficiently robust, efficient, and reliable for routine clinical deployment, emphasizing the need for significant refinement.

Challenge	Impact on Clinical Use	Recommended Action
Modest Accuracy Gains	Limited direct clinical benefit over advanced LLMs.	Prioritize novel architectural designs for significant accuracy leaps.
High Computational Cost	Impractical for high-volume, real-time clinical workflows.	Develop more efficient agent designs and optimized tool interactions.
Persistent Hallucinations	Significant patient safety risk, even with mitigation.	Integrate stronger real-time verification and human-in-the-loop mechanisms.
Idealized Benchmarks	Overestimation of real-world reliability.	Validate on authentic EHR data, complex longitudinal scenarios.

Key challenges of agent system deployment in healthcare, along with their practical implications and recommended actions for future development, emphasizing a cautious and rigorous approach.

Implementation Roadmap

A phased approach to integrate agentic AI, ensuring strategic deployment and maximum value.

Phase 1: Pilot & Data Integration

Initial setup of agent systems with existing clinical data (e.g., EHR excerpts). Focus on low-risk, well-defined tasks like preliminary report generation or basic information retrieval. Establish baseline performance and monitor resource consumption and hallucination rates.

Phase 2: Customization & Tool Refinement

Tailor prompt engineering and integrate specialized medical tools (e.g., advanced imaging analysis, specific knowledge bases). Conduct iterative testing in simulated environments with clinical experts to refine agent behavior and improve accuracy and efficiency for target tasks.

Phase 3: Controlled Clinical Validation

Deploy agent systems in a controlled, supervised clinical environment, possibly as 'shadow AI' alongside human clinicians. Rigorously assess performance on complex diagnostic pathways, treatment planning, and multimodal data, focusing on safety, interpretability, and human-AI collaboration. Obtain regulatory approvals.
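One way to read the "shadow AI" idea is that the agent's output is logged for later review but never returned to the care pathway. The sketch below illustrates that pattern under stated assumptions. The `ShadowLog` class and its methods are hypothetical, and exact string matching is a deliberately crude stand-in for a proper clinical comparison of diagnoses.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowLog:
    """Records agent output next to the clinician's decision for later review."""
    records: list[dict] = field(default_factory=list)

    def record(self, case_id: str, clinician_dx: str, agent_dx: str) -> str:
        """Log both diagnoses and return only the clinician's, which drives care."""
        self.records.append({
            "case_id": case_id,
            "clinician": clinician_dx,
            "agent": agent_dx,
            # Crude agreement check; a real study would use expert adjudication.
            "agree": clinician_dx.strip().lower() == agent_dx.strip().lower(),
        })
        return clinician_dx

    def agreement_rate(self) -> float:
        """Fraction of logged cases where agent and clinician agreed."""
        if not self.records:
            raise ValueError("no cases recorded")
        return sum(r["agree"] for r in self.records) / len(self.records)


log = ShadowLog()
log.record("case-001", "Pneumonia", "pneumonia")
log.record("case-002", "Pulmonary embolism", "Myocardial infarction")
print(f"agreement: {log.agreement_rate():.0%}")
```

Because `record` returns only the clinician's decision, the agent cannot influence care during this phase, while disagreements accumulate as review material.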

Phase 4: Scaled Deployment & Continuous Monitoring

Expand deployment to broader clinical settings, ensuring seamless integration with existing IT infrastructure. Implement continuous monitoring for performance, safety, and efficiency. Establish feedback loops with clinicians for ongoing improvement and adaptation to evolving clinical needs.
