
Enterprise AI Analysis

An updated analysis of large language model performance on ophthalmology speciality examinations

Current-generation large language models (LLMs) significantly outperform previous models on postgraduate ophthalmology examinations, with Gemini 2.5 Pro achieving the highest accuracy. While performance has improved, an error analysis reveals persistent safety-critical reasoning deficits, particularly in integrating discordant findings, necessitating human oversight in clinical applications.

Executive Impact: Key Findings at a Glance

Our analysis uncovers critical performance shifts and areas for strategic focus when integrating advanced AI into medical practice.

27.2% Median Accuracy Increase (Part One)
17.1% Median Accuracy Increase (Part Two)
98.0% Gemini 2.5 Pro Accuracy (Part One)
90.7% Gemini 2.5 Pro Accuracy (Part Two)

Deep Analysis & Enterprise Applications

Each section below takes a deeper dive into a specific finding from the research, reframed as an enterprise-focused module.

Current-generation LLMs demonstrate significant improvements, with Gemini 2.5 Pro leading in both Part One (Basic Sciences) and Part Two (Clinical Application) examinations.

LLM Performance (2025 Models)

Model           | Part One Accuracy | Part Two Accuracy
Gemini 2.5 Pro  | 98.0%             | 90.7%
Claude Sonnet 4 | 94.0%             | 74.4%
Grok 3          | 92.0%             | 88.4%
ChatGPT 4.0     | 90.0%             | 88.4%
ChatGPT 5       | 90.0%             | 86.0%
DeepSeek-V3     | 82.0%             | 83.7%


LLM Performance Baseline (2023 Models)

Model       | Part One Accuracy | Part Two Accuracy
Bing Chat   | 78.9%             | 82.9%
ChatGPT-4   | 65.5%             | 74.4%
ChatGPT-3.5 | 65.5%             | 65.9%
Google Bard | 40.0% (est.)      | 48.0% (est.)

The 2023 models serve as a baseline, showing substantial performance gaps compared to the 2025 models, highlighting rapid advancements in LLM capabilities.
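As a sanity check, median accuracy gains can be recomputed from the model-level figures in the two tables above. Note that this small sample closely reproduces the reported Part Two gain but not the Part One figure, since the study's reported medians presumably derive from its full result set rather than the models listed here. A minimal sketch:

```python
from statistics import median

# Accuracy (%) per model, taken from the 2025 and 2023 tables above.
part_one_2025 = [98.0, 94.0, 92.0, 90.0, 90.0, 82.0]
part_two_2025 = [90.7, 74.4, 88.4, 88.4, 86.0, 83.7]
part_one_2023 = [78.9, 65.5, 65.5, 40.0]
part_two_2023 = [82.9, 74.4, 65.9, 48.0]

def median_gain(new, old):
    """Percentage-point gain in median accuracy between model cohorts."""
    return round(median(new) - median(old), 1)

print("Part One median gain:", median_gain(part_one_2025, part_one_2023))
print("Part Two median gain:", median_gain(part_two_2025, part_two_2023))
```

On these table values the Part Two gain lands near the reported 17.1 percentage points, while the Part One gain comes out lower than the reported 27.2, as expected for a subset of models.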

98.0% Highest Part One Accuracy (Gemini 2.5 Pro)

Gemini 2.5 Pro demonstrates exceptional performance in the Part One (Basic Sciences) examination, achieving a near-perfect score, indicating strong foundational knowledge in ophthalmology.

85.0% Top Human Candidate Score (Part One), Exceeded by Most LLMs

Most current LLMs now surpass the top human candidate score, showcasing their advanced capabilities in medical knowledge recall and application.

An analysis of errors revealed LLMs sometimes rely on superficial pattern matching, failing to integrate contradictory information, leading to premature diagnostic closure and potentially missing critical conditions like compressive pathology.

LLM Reasoning Failure Mechanisms

Question Prompt → Superficial Pattern Matching → Failure to Integrate Discordant Findings → Premature Diagnostic Closure → Missed Compressive Pathology


Case Study: Glaucoma Patient Mismanagement

A patient with a family history of glaucoma, bilateral IOPs of 25 mmHg, optic atrophy, and reduced visual acuity was presented. LLMs unanimously opted for pharmacological intervention with latanoprost. However, the ground truth was MRI brain and orbit. Expert consultants noted the LLMs' superficial reasoning, failing to integrate atypical features (e.g., absence of correlating RNFL damage, asymmetry of visual function despite symmetric IOP) necessitating neuroimaging to rule out compressive pathology. This highlights a significant safety risk in real-world clinical application.

Takeaway: LLMs must improve integration of atypical and discordant clinical findings to avoid safety-critical errors.

This case exemplifies the critical reasoning deficits in current LLMs, particularly their inability to process complex clinical scenarios and prioritize neuroimaging when atypical signs suggest a broader differential diagnosis beyond standard protocols.
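The safeguard the consultants describe, checking for discordant findings before accepting a pattern-matched diagnosis, can be sketched as a simple rule-based gate. The field names and rules below are illustrative assumptions, not part of the study:

```python
# Illustrative sketch of a "discordance gate": before accepting a routine
# glaucoma management plan, flag the feature combinations the case study
# identifies as atypical. Field names and rules are hypothetical.

def discordance_flags(case: dict) -> list[str]:
    flags = []
    # Optic atrophy without correlating RNFL thinning is atypical for glaucoma.
    if case.get("optic_atrophy") and not case.get("rnfl_damage"):
        flags.append("optic atrophy without correlating RNFL damage")
    # Asymmetric visual function despite symmetric IOP suggests another cause.
    if case.get("symmetric_iop") and case.get("asymmetric_visual_function"):
        flags.append("visual function asymmetry despite symmetric IOP")
    return flags

case = {
    "optic_atrophy": True,
    "rnfl_damage": False,
    "symmetric_iop": True,
    "asymmetric_visual_function": True,
}
if discordance_flags(case):
    print("Escalate: consider MRI brain and orbit before treating as glaucoma")
```

A hand-written rule list obviously does not scale, but it illustrates the gap: the expert behaviour is to treat discordant negatives as escalation triggers rather than noise.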

LLM vs. Human Expert Reasoning

Feature              | LLM Reasoning (Observed)                        | Human Expert Reasoning (Desired)
Diagnostic Approach  | Superficial pattern matching                    | Systematic exclusion of alternatives
Data Integration     | Fails to integrate discordant negative findings | Integrates all findings, including atypical ones (lack of correlating RNFL damage; visual function asymmetry despite symmetric IOP)
Safety Consideration | Risk of premature diagnostic closure            | Prioritizes patient safety and a broad differential
Clinical Context     | Limited contextual understanding                | Comprehensive clinical judgment

A comparison reveals that while LLMs excel in knowledge recall, their current reasoning often lacks the depth and safety-consciousness of human experts, particularly in complex diagnostic scenarios.

Multimodal tasks, such as optical cross transposition, show varied LLM performance. DeepSeek-V3 could derive the correct prescription but failed to select the matching multiple-choice answer, while Claude Sonnet 4 struggled with axis alignment, pointing to deficits in spatial reasoning.

Multimodal Task Processing (Optical Cross Transposition)

Multimodal Input (Image + Text) → Derive Correct Prescription (DeepSeek-V3: success) → Select Multiple-Choice Answer (DeepSeek-V3: failure) → Axis Alignment (Claude Sonnet 4: failure) → Spatial Reasoning Deficit


66.7% Mean Accuracy on Single Multimodal Task

The current mean accuracy for the single multimodal task (optical cross transposition) indicates that while promising, there's significant room for improvement in visual and spatial reasoning for LLMs.
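The transposition rule itself is standard optics rather than something stated in the study: add the cylinder power to the sphere, flip the cylinder's sign, and rotate the axis by 90 degrees, keeping it in the 1-180 range. A minimal sketch of the calculation the models were asked to perform:

```python
def transpose(sphere: float, cyl: float, axis: int) -> tuple[float, float, int]:
    """Transpose a spectacle prescription between plus- and minus-cylinder form.

    Standard rule: new sphere = sphere + cylinder; cylinder sign flips;
    axis rotates by 90 degrees (wrapped back into the 1-180 range).
    """
    new_axis = axis + 90
    if new_axis > 180:
        new_axis -= 180
    return (sphere + cyl, -cyl, new_axis)

# Example: +2.00 / -1.00 x 90 becomes +1.00 / +1.00 x 180
print(transpose(2.00, -1.00, 90))  # -> (1.0, 1.0, 180)
```

The arithmetic is trivial; the reported failures concern reading the optical cross from the image and maintaining axis orientation, which is why they point to visual and spatial reasoning deficits rather than knowledge gaps.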


Your Strategic AI Implementation Roadmap

A phased approach to safely and effectively integrate advanced AI capabilities into your organization, leveraging insights from cutting-edge research.

Phase 1: Pilot & Validation

Conduct internal pilot studies with LLMs on anonymized clinical data. Establish robust validation protocols for accuracy and safety. Focus on non-diagnostic support roles.

Phase 2: Clinician-in-the-Loop Integration

Integrate LLMs into clinical workflows with mandatory human oversight. Develop interfaces for clinicians to easily review, override, and provide feedback on LLM outputs. Prioritize augmentation over automation.
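The mandatory-oversight pattern described in this phase can be sketched as a small state machine in which no LLM output reaches the downstream workflow until a clinician has explicitly approved it. All class and function names here are hypothetical:

```python
# Sketch of the clinician-in-the-loop gate: an LLM suggestion is held in a
# pending state and released only after explicit clinician review.
from dataclasses import dataclass

@dataclass
class Suggestion:
    text: str
    status: str = "pending"   # pending -> approved | overridden
    clinician_note: str = ""

def review(s: Suggestion, approve: bool, note: str = "") -> Suggestion:
    """Record the clinician's decision; pending items are never released."""
    s.status = "approved" if approve else "overridden"
    s.clinician_note = note
    return s

def released(suggestions: list[Suggestion]) -> list[str]:
    """Only clinician-approved outputs reach the downstream workflow."""
    return [s.text for s in suggestions if s.status == "approved"]

llm_out = Suggestion("Start latanoprost for presumed POAG")
review(llm_out, approve=False, note="Atypical features: order MRI first")
print(released([llm_out]))  # -> []
```

The design choice this encodes is augmentation over automation: the override path is first-class, and the clinician's note doubles as feedback for later model evaluation.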

Phase 3: Expanded Multimodal Benchmarking

Develop and deploy more diverse multimodal datasets encompassing various diagnostic modalities (e.g., OCT, fundus photos, visual fields) to rigorously test LLM visual and spatial reasoning capabilities. Address data contamination risks.

Phase 4: Advanced Reasoning & Safety Protocols

Research and implement LLM architectures specifically designed for complex, nuanced reasoning, focusing on integrating discordant findings and avoiding premature diagnostic closure. Implement fail-safes and anomaly detection.

Ready to Transform Your Enterprise with AI?

Connect with our AI specialists to discuss a tailored strategy for your organization. Unlock new efficiencies and drive innovation responsibly.
