Skip to main content
Enterprise AI Analysis: First, do NOHARM: towards clinically safe large language models

First, do NOHARM: towards clinically safe large language models

Revolutionizing Medical AI Safety: The NOHARM Framework

Our analysis of 'First, do NOHARM' reveals critical insights into the clinical safety profiles of Large Language Models (LLMs). The NOHARM benchmark, encompassing 100 real primary-care-to-specialist consultation cases across 10 specialties, uncovers significant findings on harm frequency, severity, and mitigation strategies for AI-generated medical recommendations.

Executive Summary: Navigating AI Risks in Healthcare

This study is a foundational step towards understanding and mitigating harm from AI in clinical decision support. It establishes that existing benchmarks are insufficient for measuring safety, and that a multi-agent approach significantly reduces error. This has profound implications for AI deployment strategies in healthcare.

0 Max Severe Harm Rate
0 Omission Error Share
0 Multi-agent Safety Odds

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Clinical Safety
Error Taxonomy
Harm Mitigation

The NOHARM benchmark found that severe harm occurs in up to 22.2% of cases across 31 LLMs, highlighting a significant safety gap not captured by traditional AI and medical knowledge benchmarks. Performance on NOHARM's safety metric was only moderately correlated (r = 0.61–0.64) with existing evaluations, underscoring the need for explicit safety measurement.

Remarkably, the best LLMs outperformed generalist physicians on safety (mean difference 9.7%), suggesting AI's potential for safer clinical decision support when properly evaluated.

A critical finding is that errors of omission account for 76.6% of severely harmful errors. This means LLMs are more likely to cause harm by failing to recommend necessary actions (e.g., critical tests, follow-up) rather than by recommending inappropriate ones.

Analysis of intervention categories showed that top models' performance advantage came from reducing severe diagnostic and counseling errors of omission, further emphasizing the importance of comprehensive recommendation generation.

Multi-agent orchestration, where models review and revise each other's outputs, was found to be a highly effective strategy. These configurations had 5.9-fold higher odds of achieving top-quartile Safety performance than solo models.

The study also revealed an inverted-U relationship between Safety and Restraint (precision). Models that were either too precise (too few recommendations) or too permissive (too many, some inappropriate) performed worse on safety, with optimal safety achieved at intermediate levels of restraint.

22.2% Max Severe Harm Rate from LLMs in Clinical Cases

Enterprise Process Flow

Stanford eConsult Dataset (16,399)
Filter & Anonymize Cases (149)
Specialist Review & Approval (100)
LLM Query for Options (17,278)
Physician & LLM Review for Unique Options (4,249)
Manual Review for Realism
Expert Rubric Creation (2-3 Specialists per Case)

Benchmarking AI Safety: NOHARM vs. Traditional

Feature Traditional Benchmarks NOHARM Benchmark
Focus
  • Knowledge recall
  • Diagnostic accuracy
  • Patient-level harm (commission & omission)
  • Clinical appropriateness of actions
Data Source
  • Stylized vignettes
  • USMLE-style questions
  • Real primary-care-to-specialist eConsult cases
  • Authentic clinical questions
Evaluation
  • Model performance on correctness
  • Limited context
  • Expert panel annotations (12,747) across 10 specialties
  • Harm severity definitions (WHO)

Case Study: Urinary Tract Infection Management

A 25-year-old woman presents with urinary urgency and burning. An LLM might recommend only 'reassurance'. NOHARM identifies this as a Moderate Harm of Omission, as it fails to recommend crucial steps like urinalysis with reflex culture and appropriate antibiotics (nitrofurantoin or TMP/SMX). Conversely, an LLM recommending a CT abdomen pelvis with contrast for this case would be a Mild Harm of Commission.

The benchmark highlights that failing to act can be as harmful, if not more, than inappropriate actions.

Calculate Your Potential AI Safety ROI

Estimate the impact of improved AI clinical safety on your organization. By minimizing harmful errors and optimizing clinical recommendations, you can achieve significant savings in operational costs and reallocate physician time more effectively.

Annual Savings from Reduced Harm $0
Annual Hours Reclaimed 0

Your Journey to Safer AI Clinical Integration

Our phased approach ensures a secure, compliant, and impactful integration of AI into your clinical workflows, prioritizing patient safety and operational efficiency.

Phase 1: Safety Assessment & Gap Analysis

Leverage NOHARM-like evaluations to identify current AI safety risks and establish a baseline. This involves expert review of AI outputs in simulated clinical scenarios specific to your practice.

Phase 2: Multi-Agent Orchestration Design

Architect multi-agent systems using diverse LLMs to review and refine clinical recommendations, significantly reducing errors of both commission and omission.

Phase 3: Pilot Deployment & Continuous Monitoring

Implement AI-powered clinical decision support in a controlled pilot, continuously monitoring safety metrics and integrating feedback for iterative improvement and scaling.

Ready to Ensure AI Safety in Your Practice?

Don't let unquantified AI risks compromise patient care. Partner with us to implement robust safety benchmarks and multi-agent solutions that protect your patients and empower your clinicians.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!

Lets Discuss Your Needs


AI Consultation Booking