SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Revolutionizing Surgical AI with Advanced Reasoning Capabilities
The SUREON project moves surgical AI beyond basic perception tasks toward complex reasoning. Using expert-narrated surgical videos, SUREON builds a large-scale video QA dataset with 12 question categories covering safety, decision-making, and forecasting. The SureonVLM models, trained with supervised fine-tuning and reinforcement learning, outperform general-domain models on most surgical perception and reasoning tasks, especially in safety-critical areas, and SureonVLM-R1 exhibits explicit reasoning behavior. This work is a step toward interpretable and clinically meaningful surgical AI.
Deep Analysis & Enterprise Applications
SUREON Data Synthesis Pipeline
Large-scale QA Dataset
206.8k QA Pairs Generated
SUREON is a large-scale video QA dataset systematically harvested from surgical academic videos, yielding 206.8k QA pairs across 12 question categories and 170 procedure types, with an expert-validated benchmark of 354 examples. This dataset is crucial for training models on surgical reasoning.
SUREON VLM Outperforms General Models
84% Average Accuracy on SUREON Benchmark
SureonVLM and SureonVLM-R1 achieve an average accuracy of 84-85% on the SUREON benchmark, outperforming GPT-5.1, Gemini 3.1 Pro, and Qwen3-VL, particularly in safety-critical categories like Safety Action Identification and Decision Reasoning. Per-category results for ten categories follow; a general-domain model scores higher than the SureonVLM models only on Entity Existence and Entity Localization.
| QA Type | GPT-5.1 | Gemini 3.1 Pro | Qwen3-VL (8B) | SureonVLM (ours) | SureonVLM-R1 (ours) |
|---|---|---|---|---|---|
| Action Description | 0.52 | 0.27 | 0.63 | 0.87 | 0.88 |
| Decision Reasoning | 0.70 | 0.60 | 0.83 | 0.98 | 1.00 |
| Forecasting | 0.53 | 0.60 | 0.53 | 0.73 | 0.62 |
| Instrument Action Interaction | 0.69 | 0.81 | 0.53 | 0.88 | 0.90 |
| Local Action Reasoning | 0.83 | 0.31 | 0.67 | 0.93 | 1.00 |
| Entity Attribute | 0.82 | 0.94 | 0.93 | 0.95 | 0.97 |
| Entity Existence | 0.68 | 0.75 | 0.57 | 0.70 | 0.70 |
| Entity Localization | 0.55 | 0.55 | 0.33 | 0.53 | 0.50 |
| Procedural Action Description | 0.90 | 0.73 | 0.63 | 0.97 | 0.93 |
| Safety Action Identification | 0.62 | 0.47 | 0.90 | 0.92 | 0.93 |
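The averages can be cross-checked against the table with a short script. This is a minimal sketch, assuming an unweighted (macro) mean over the ten rows shown; the source does not say how its 84-85% figure is aggregated, and the helper name `macro_average` is hypothetical.

```python
# Per-category accuracies copied from the table above (ten rows shown).
# Column order: GPT-5.1, Gemini 3.1 Pro, Qwen3-VL (8B), SureonVLM, SureonVLM-R1.
MODELS = ["GPT-5.1", "Gemini 3.1 Pro", "Qwen3-VL (8B)", "SureonVLM", "SureonVLM-R1"]
SCORES = {
    "Action Description": [0.52, 0.27, 0.63, 0.87, 0.88],
    "Decision Reasoning": [0.70, 0.60, 0.83, 0.98, 1.00],
    "Forecasting": [0.53, 0.60, 0.53, 0.73, 0.62],
    "Instrument Action Interaction": [0.69, 0.81, 0.53, 0.88, 0.90],
    "Local Action Reasoning": [0.83, 0.31, 0.67, 0.93, 1.00],
    "Entity Attribute": [0.82, 0.94, 0.93, 0.95, 0.97],
    "Entity Existence": [0.68, 0.75, 0.57, 0.70, 0.70],
    "Entity Localization": [0.55, 0.55, 0.33, 0.53, 0.50],
    "Procedural Action Description": [0.90, 0.73, 0.63, 0.97, 0.93],
    "Safety Action Identification": [0.62, 0.47, 0.90, 0.92, 0.93],
}


def macro_average(scores, models):
    """Unweighted mean accuracy per model across the listed categories."""
    n = len(scores)
    return {
        model: sum(row[i] for row in scores.values()) / n
        for i, model in enumerate(models)
    }


averages = macro_average(SCORES, MODELS)
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:16s} {avg:.3f}")
```

Over these ten rows the unweighted means come to roughly 0.846 for SureonVLM and 0.843 for SureonVLM-R1, which is consistent with the reported 84-85%. The reported figure may also include the categories not listed here, so exact agreement should not be expected.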
Explicit Reasoning via Thinking Tokens
Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context. The model generates interpretable surgical rationale via thinking tokens, demonstrating an ability to connect visual features to surgical meaning and reason about the intent behind maneuvers, not just their execution. For example, it correctly identifies why a vessel branch was sacrificed due to an enlarged interlobar lymph node (Fig. 1).
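Separating the rationale from the final answer makes the thinking tokens easier to inspect or log. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, a common convention for R1-style models; the source does not specify SureonVLM-R1's actual delimiters, and both `split_reasoning` and the sample output are hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>. The real
# SureonVLM-R1 output format is not described in the source.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> tuple[str, str]:
    """Return (rationale, answer) extracted from a raw model output string."""
    match = THINK_PATTERN.search(output)
    if match is None:
        # No thinking block found: treat the whole output as the answer.
        return "", output.strip()
    rationale = match.group(1).strip()
    answer = (output[:match.start()] + output[match.end():]).strip()
    return rationale, answer


# Hypothetical output, loosely paraphrasing the lymph-node example above.
raw = (
    "<think>An enlarged interlobar lymph node lies next to the vessel branch, "
    "which suggests why the branch was divided.</think>\n"
    "The branch was sacrificed because of the enlarged interlobar lymph node."
)
rationale, answer = split_reasoning(raw)
print("Rationale:", rationale)
print("Answer:", answer)
```

Keeping the rationale as a separate field lets a reviewer read the model's stated reasoning alongside the answer without further parsing.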
Results on surgical perception benchmarks (F1 or mAP, as indicated per task):
| Task | GPT-5.1 | Gemini 3.1 Pro | Qwen3-VL | SureonVLM |
|---|---|---|---|---|
| Action HeiChole F1 | 0.18 | 0.21 | 0.17 | 0.04 |
| CVS Endoscapes F1 | 0.08 | 0.14 | 0.02 | 0.32 |
| Phase Cholec80 F1 | 0.36 | 0.47 | 0.17 | 0.63 |
| Phase HeiChole F1 | 0.29 | 0.35 | 0.12 | 0.41 |
| Phase MultiBypass140 F1 | 0.13 | 0.22 | 0.08 | 0.40 |
| Tool Endoscapes mAP@.5:.95 | 0.00 | 0.61 | 0.00 | 0.22 |
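The perception results are mixed, so it helps to tabulate which model leads on each task. This is a small sketch over the values in the table above; the helper name `best_per_task` is hypothetical.

```python
# Scores copied from the perception table above.
# Column order: GPT-5.1, Gemini 3.1 Pro, Qwen3-VL, SureonVLM.
MODELS = ["GPT-5.1", "Gemini 3.1 Pro", "Qwen3-VL", "SureonVLM"]
RESULTS = {
    "Action HeiChole F1": [0.18, 0.21, 0.17, 0.04],
    "CVS Endoscapes F1": [0.08, 0.14, 0.02, 0.32],
    "Phase Cholec80 F1": [0.36, 0.47, 0.17, 0.63],
    "Phase HeiChole F1": [0.29, 0.35, 0.12, 0.41],
    "Phase MultiBypass140 F1": [0.13, 0.22, 0.08, 0.40],
    "Tool Endoscapes mAP@.5:.95": [0.00, 0.61, 0.00, 0.22],
}


def best_per_task(results, models):
    """Map each task to the model with the highest score on that task."""
    return {
        task: models[max(range(len(row)), key=row.__getitem__)]
        for task, row in results.items()
    }


winners = best_per_task(RESULTS, MODELS)
sureon_wins = [task for task, model in winners.items() if model == "SureonVLM"]

for task, model in winners.items():
    print(f"{task:28s} {model}")
print(f"SureonVLM leads on {len(sureon_wins)} of {len(RESULTS)} tasks")
```

On these numbers SureonVLM leads on the CVS task and all three phase-recognition tasks, while Gemini 3.1 Pro leads on HeiChole action recognition and Endoscapes tool detection.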
SureonVLM Training Stages
Your AI Implementation Roadmap
Our structured approach ensures a smooth and effective integration of cutting-edge AI, tailored to your enterprise needs.
01 Discovery & Strategy
We begin with a deep dive into your current surgical workflows, existing data, and specific challenges. This phase defines project scope, identifies key reasoning needs, and sets measurable objectives aligned with SUREON's capabilities.
02 Data Integration & Customization
Leveraging SUREON's architecture, we integrate your proprietary surgical video data. Our experts fine-tune the SureonVLM models to your specific procedural context, ensuring optimal performance and accurate reasoning for your unique environment.
03 Pilot Deployment & Validation
A pilot program is initiated within a controlled environment to test the customized SureonVLM. We rigorously validate its reasoning capabilities against real-world surgical scenarios, collecting feedback and making iterative improvements for precision and reliability.
04 Full-Scale Integration & Monitoring
Once validated, SureonVLM is integrated into your operational systems. We provide ongoing monitoring, support, and continuous model improvement, ensuring sustained performance and adaptability to evolving surgical practices and data.
Ready to Enhance Your Surgical Intelligence?
Partner with us to unlock advanced reasoning capabilities in surgical AI and drive unprecedented efficiency and safety.