Enterprise AI Analysis: Benchmarking Large Language Models in Breast Cancer Care: Agreement with Radiology-Led Multidisciplinary Tumor Board Decisions


Unlocking Precision in Breast Cancer Treatment with AI

Our analysis of recent research examines how closely Large Language Models (LLMs) agree with multidisciplinary tumor board decisions in breast cancer care, where that agreement is strongest, and where it breaks down.

Quantifying AI's Impact in Oncology

The headline figures below summarize how often LLM recommendations matched tumor board decisions in the study.

83.2% Overall Concordance (ChatGPT-4o)
>90% Agreement in HER2-Enriched and Triple-Negative Subtypes
100% Adjuvant Systemic Therapy F1 Score

Deep Analysis & Enterprise Applications

83.2% ChatGPT-4o's Overall Concordance

The study retrospectively analyzed 286 breast cancer cases, comparing LLM-generated treatment recommendations with decisions from a radiology-led Multidisciplinary Tumor Board (MDTB). The goal was to assess agreement and identify the contexts in which LLMs are most reliable. The models evaluated were ChatGPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Recommendations were evaluated across treatment categories, disease stages, and molecular subtypes, using concordance, Cohen's kappa, precision, recall, and F1 scores.
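As a rough illustration of the two agreement measures named above, the sketch below computes concordance (raw agreement) and Cohen's kappa (agreement corrected for chance) for one model against the MDTB. The function names and the toy labels are hypothetical and are not taken from the study; the study's own software and label scheme are not described here.

```python
from collections import Counter


def concordance(llm, mdtb):
    """Fraction of cases where the LLM label equals the MDTB label."""
    if len(llm) != len(mdtb) or not llm:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(llm, mdtb)) / len(llm)


def cohens_kappa(llm, mdtb):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(llm)
    p_o = concordance(llm, mdtb)
    llm_counts, mdtb_counts = Counter(llm), Counter(mdtb)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    p_e = sum(llm_counts[k] * mdtb_counts[k] for k in llm_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is formally undefined here; returning 1.0 is a convention
        # chosen for this sketch (both raters used one identical label).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration only (not study data).
mdtb = ["NACT", "surgery", "NACT", "surgery", "NACT", "surgery"]
llm = ["NACT", "surgery", "NACT", "NACT", "NACT", "surgery"]

print(f"concordance = {concordance(llm, mdtb):.3f}")
print(f"kappa       = {cohens_kappa(llm, mdtb):.3f}")
```

In the toy data the two raters agree on 5 of 6 cases, but half of that agreement would be expected by chance, which is why kappa is lower than raw concordance.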

Enterprise Process Flow

Structured Case Data
Standardized Prompt to LLMs
Output Comparison with MDTB
Performance Metrics Calculation
Subgroup Analyses

Patient data from 286 cases were anonymized and structured into clinical vignettes, including demographics, medical history, pathology (histologic subtype, tumor size, grade, ER/PR/HER2/Ki-67), radiological findings (multifocality, nodal status, PET/CT), and AJCC 8th edition clinical staging. A uniform prompt referencing ASCO, ESMO, and NCCN guidelines was used for all LLMs. MDTB consensus decisions served as the benchmark. LLM outputs were mapped to predefined treatment categories (systemic therapy, breast surgery, axillary management) and evaluated for concordance, Cohen's kappa, precision, recall, and F1 scores. Subgroup analyses examined performance across molecular subtypes and disease stages.
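Precision, recall, and F1 for a single treatment category can be sketched as follows, assuming each case is reduced to a yes/no flag for "this category was recommended" and the MDTB decision is treated as ground truth. The helper name and the example flags are hypothetical; how the study encoded its categories in detail is not stated here.

```python
def precision_recall_f1(llm_flags, mdtb_flags):
    """Per-category metrics with the MDTB decision as the reference standard.

    Each list holds one boolean per case: True if that rater recommended
    the treatment category (for example, mastectomy) for the case.
    """
    if len(llm_flags) != len(mdtb_flags):
        raise ValueError("flag lists must be the same length")
    pairs = list(zip(llm_flags, mdtb_flags))
    tp = sum(1 for l, m in pairs if l and m)        # both recommend
    fp = sum(1 for l, m in pairs if l and not m)    # only the LLM recommends
    fn = sum(1 for l, m in pairs if not l and m)    # only the MDTB recommends
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy flags, invented for illustration only (not study data).
mdtb_mastectomy = [True, True, True, False, False, False, True, False]
llm_mastectomy = [True, False, False, False, True, False, True, False]

p, r, f1 = precision_recall_f1(llm_mastectomy, mdtb_mastectomy)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

A model that rarely outputs a category at all will show low recall for it, which is one way a low F1 for a surgical category can arise even when overall concordance is high.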

>90% Concordance in HER2-Enriched & Triple-Negative BC

ChatGPT-4o achieved the highest overall concordance with MDTB decisions (83.2%), followed by Claude (79.7%) and Gemini (79.4%). Agreement exceeded 90% in HER2-enriched and triple-negative breast cancer. F1 scores were highest for adjuvant systemic therapy (100%) and neoadjuvant chemotherapy (≥91%). Performance declined markedly for surgical decisions, including mastectomy (F1 <58%) and axillary lymph node dissection (F1 ≤23.5%). Stage-based analyses showed varied concordance, with high agreement in some stage III-IV subgroups and lower agreement in scenarios requiring complex, individualized decisions.
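A subgroup analysis of the kind described above amounts to computing concordance separately within each molecular subtype or stage. The sketch below shows one way to do that; the tuple layout, function name, and example cases are assumptions made for illustration and do not reproduce the study's data.

```python
from collections import defaultdict


def concordance_by_subgroup(cases):
    """Return {subgroup: fraction of cases where LLM and MDTB labels match}.

    `cases` is an iterable of (subgroup, llm_label, mdtb_label) tuples.
    """
    totals = defaultdict(int)
    agreements = defaultdict(int)
    for subgroup, llm_label, mdtb_label in cases:
        totals[subgroup] += 1
        if llm_label == mdtb_label:
            agreements[subgroup] += 1
    return {group: agreements[group] / totals[group] for group in totals}


# Toy cases, invented for illustration only (not study data).
cases = [
    ("TNBC", "NACT", "NACT"),
    ("TNBC", "NACT", "NACT"),
    ("Luminal A", "surgery", "surgery"),
    ("Luminal A", "NACT", "surgery"),
    ("Luminal A", "surgery", "surgery"),
]

for group, rate in concordance_by_subgroup(cases).items():
    print(f"{group}: {rate:.1%}")
```

With small subgroups, a single discordant case moves the rate substantially, so subgroup figures are best read alongside the number of cases in each group.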

LLM Performance and MDTB Rationale by Decision Domain

Systemic Therapy
  LLM Performance:
  • High concordance
  • Excellent F1 scores (91-100%)
  • Guideline-aligned for standard cases
  MDTB Rationale:
  • Protocol-driven, standardized risk stratification
  • Algorithms well-defined

Surgical Decisions (Mastectomy, ALND)
  LLM Performance:
  • Significantly lower F1 scores (<58%)
  • Limited outputs for ALND
  • Difficulty with individualized surgical planning
  MDTB Rationale:
  • Requires nuanced clinical judgment
  • Integration of imaging findings, patient preferences
  • Multidisciplinary trade-offs

Molecular Subtyping (HER2-enriched, TNBC)
  LLM Performance:
  • High agreement (>90%)
  MDTB Rationale:
  • Standardized management protocols

Molecular Subtyping (Luminal A)
  LLM Performance:
  • Lowest concordance (~66%)
  MDTB Rationale:
  • Requires individualized consideration of tumor biology, patient preferences, quality-of-life trade-offs

Complex/Multimodal Cases
  LLM Performance:
  • Declined performance (F1: 26.7-37.8%)
  • Inter-model discordance up to 52.7% (ALND)
  MDTB Rationale:
  • Requires expert multidisciplinary oversight
  • Nuanced interpretation of available findings
  • Iterative deliberation
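The inter-model discordance figure cited above can be illustrated with a small sketch. It assumes discordance means the share of cases in which the models do not all give the same answer for a category; the study's exact definition is not given here, and the model keys and flags below are hypothetical.

```python
def inter_model_discordance(outputs_by_model):
    """Share of cases where the models do not all give the same output.

    `outputs_by_model` maps a model name to a list with one output per case;
    all lists must cover the same cases in the same order.
    """
    outputs = list(outputs_by_model.values())
    if not outputs or not outputs[0]:
        raise ValueError("need at least one model with at least one case")
    n_cases = len(outputs[0])
    if any(len(o) != n_cases for o in outputs):
        raise ValueError("every model must have one output per case")
    # zip(*outputs) yields, for each case, the tuple of all models' outputs.
    disagreements = sum(1 for case in zip(*outputs) if len(set(case)) > 1)
    return disagreements / n_cases


# Toy ALND flags, invented for illustration only (not study data).
alnd_outputs = {
    "model_a": [True, False, True, False],
    "model_b": [True, True, True, False],
    "model_c": [True, False, False, False],
}

print(f"inter-model discordance = {inter_model_discordance(alnd_outputs):.1%}")
```

Under this definition a case counts as discordant as soon as one model differs from the others, so the figure tends to rise as more models are compared.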

LLMs demonstrate substantial agreement with MDTB recommendations in structured, guideline-based breast cancer settings, particularly for systemic therapy planning. However, their performance declines when decisions require individualized clinical judgment, complex multimodal trade-offs, or nuanced interpretation of findings. These findings support further evaluation of LLMs as decision-support tools in straightforward cases, but complex surgical or multimodal treatment planning still requires expert multidisciplinary oversight. LLMs should be regarded as adjunctive tools that enhance, but do not replace, human expertise.

Your AI Implementation Roadmap

Our proven methodology guides enterprises from initial strategy to successful AI integration and sustained impact.

Phase 1: Discovery & Strategy

In-depth analysis of your current workflows, identifying key AI opportunities and defining a tailored strategic roadmap aligned with your business objectives.

Phase 2: Pilot & Validation

Develop and deploy a proof-of-concept AI solution in a controlled environment, rigorously testing its performance and validating its business value.

Phase 3: Integration & Scaling

Seamlessly integrate the validated AI solution into your existing enterprise architecture and scale it across relevant departments for maximum impact.

Phase 4: Monitoring & Optimization

Continuous monitoring of AI performance, ongoing optimization, and training to ensure long-term effectiveness and adaptation to evolving needs.

Optimize Your Oncology Workflows with AI

Ready to integrate advanced AI into your clinical decision-making processes? Our experts can help you design and implement solutions that enhance accuracy and efficiency.
