SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Revolutionizing Surgical AI with Advanced Reasoning Capabilities

The SUREON project moves surgical AI beyond basic perception tasks toward complex reasoning. Using expert-narrated surgical videos, SUREON builds a large-scale video QA dataset with 12 distinct question categories covering safety, decision-making, and forecasting. The SureonVLM models, trained with supervised fine-tuning and reinforcement learning, outperform general-domain models on most SUREON benchmark categories, including safety-critical ones, and on most of the standard surgical perception benchmarks reported below. SureonVLM-R1 additionally shows explicit reasoning behavior. This work is a step toward interpretable and clinically meaningful surgical AI.


Executive Impact: Key Metrics at a Glance

SUREON's innovative approach yields significant advancements in surgical AI, as highlighted by these key performance indicators:

12 Question Types Defined

206.8k QA Pairs Generated

354 Expert-Validated Benchmark Examples

84% Average Accuracy on SUREON Benchmark


Deep Analysis & Enterprise Applications


SUREON Data Synthesis Pipeline

01 Identify Semantic Grounding Moments (SGMs)

02 Generate Semantically Structured Samples

03 Specialized Generator Agent

04 Specialized Filtering Agent
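
The four pipeline stages above can be read as a chain of functions. The following is a minimal sketch of that control flow only, not the authors' implementation: the source does not say how SGMs are detected or how the agents work, so cue-word matching, `toy_generate`, `toy_keep`, and the two-line transcript are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, narration text)


@dataclass(frozen=True)
class GroundingMoment:
    start_s: float
    end_s: float
    narration: str


@dataclass(frozen=True)
class QASample:
    category: str
    question: str
    answer: str
    moment: GroundingMoment


def find_grounding_moments(transcript: Iterable[Segment],
                           cue_words: Iterable[str]) -> List[GroundingMoment]:
    """Stage 1 stand-in: keep narration segments that contain a cue word."""
    cues = [c.lower() for c in cue_words]
    return [GroundingMoment(s, e, text)
            for s, e, text in transcript
            if any(c in text.lower() for c in cues)]


def run_pipeline(transcript: Iterable[Segment],
                 cue_words: Iterable[str],
                 generate: Callable[[GroundingMoment], List[QASample]],
                 keep: Callable[[QASample], bool]) -> List[QASample]:
    """Stages 1-4: find moments, generate candidate samples, then filter them."""
    moments = find_grounding_moments(transcript, cue_words)
    candidates = [qa for m in moments for qa in generate(m)]
    return [qa for qa in candidates if keep(qa)]


def toy_generate(moment: GroundingMoment) -> List[QASample]:
    # Hypothetical generator agent: one templated question per moment.
    return [QASample("Decision Reasoning", "Why was this step performed?",
                     moment.narration, moment)]


def toy_keep(sample: QASample) -> bool:
    # Hypothetical filtering agent: reject very short answers.
    return len(sample.answer.split()) >= 5


# Made-up narration, for illustration only.
transcript = [
    (0.0, 4.0, "Here we enter the abdomen."),
    (4.0, 9.5, "We divide this branch because the lymph node is enlarged."),
]
dataset = run_pipeline(transcript, ["because"], toy_generate, toy_keep)
```

Passing the generator and filter as callables keeps the orchestration independent of whatever models actually back those two agents.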

Large-scale QA Dataset

206.8k QA Pairs Generated

SUREON is a large-scale video QA dataset systematically harvested from surgical academic videos, yielding 206.8k QA pairs across 12 question categories and 170 procedure types, with an expert-validated benchmark of 354 examples. This dataset is crucial for training models on surgical reasoning.

SUREON VLM Outperforms General Models

84% Average Accuracy on SUREON Benchmark

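SureonVLM and SureonVLM-R1 achieve an average accuracy of 84-85% on the SUREON benchmark, outperforming GPT-5.1, Gemini 3.1 Pro, and Qwen3-VL in most categories, including the safety-critical Safety Action Identification (0.92-0.93) and Decision Reasoning (0.98-1.00). The margin is narrow in places: Qwen3-VL reaches 0.90 on Safety Action Identification, and general-domain models match or exceed the SureonVLM models on Entity Existence and Entity Localization.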

Comparison of SOTA Models on SUREON Benchmark (Accuracy)

QA Type	GPT-5.1	Gemini 3.1 Pro	Qwen3-VL (8B)	SureonVLM (ours)	SureonVLM-R1 (ours)
Action Description	0.52	0.27	0.63	0.87	0.88
Decision Reasoning	0.70	0.60	0.83	0.98	1.00
Forecasting	0.53	0.60	0.53	0.73	0.62
Instrument Action Interaction	0.69	0.81	0.53	0.88	0.90
Local Action Reasoning	0.83	0.31	0.67	0.93	1.00
Entity Attribute	0.82	0.94	0.93	0.95	0.97
Entity Existence	0.68	0.75	0.57	0.70	0.70
Entity Localization	0.55	0.55	0.33	0.53	0.50
Procedural Action Description	0.90	0.73	0.63	0.97	0.93
Safety Action Identification	0.62	0.47	0.90	0.92	0.93
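
As a quick check on the table, the sketch below recomputes each model's unweighted mean accuracy and the best model(s) per category. The numbers are copied from the rows above; note that only 10 categories are listed here, so the mean is over those rows only, and the function names are illustrative.

```python
MODELS = ("GPT-5.1", "Gemini 3.1 Pro", "Qwen3-VL (8B)", "SureonVLM", "SureonVLM-R1")

# Accuracy per QA type, in the same column order as MODELS.
ROWS = {
    "Action Description":            (0.52, 0.27, 0.63, 0.87, 0.88),
    "Decision Reasoning":            (0.70, 0.60, 0.83, 0.98, 1.00),
    "Forecasting":                   (0.53, 0.60, 0.53, 0.73, 0.62),
    "Instrument Action Interaction": (0.69, 0.81, 0.53, 0.88, 0.90),
    "Local Action Reasoning":        (0.83, 0.31, 0.67, 0.93, 1.00),
    "Entity Attribute":              (0.82, 0.94, 0.93, 0.95, 0.97),
    "Entity Existence":              (0.68, 0.75, 0.57, 0.70, 0.70),
    "Entity Localization":           (0.55, 0.55, 0.33, 0.53, 0.50),
    "Procedural Action Description": (0.90, 0.73, 0.63, 0.97, 0.93),
    "Safety Action Identification":  (0.62, 0.47, 0.90, 0.92, 0.93),
}


def mean_accuracy(rows):
    """Unweighted mean over the listed categories, per model."""
    n = len(rows)
    return {model: sum(vals[i] for vals in rows.values()) / n
            for i, model in enumerate(MODELS)}


def best_per_category(rows):
    """Model(s) with the highest accuracy in each category (ties kept)."""
    return {cat: [m for m, v in zip(MODELS, vals) if v == max(vals)]
            for cat, vals in rows.items()}


means = mean_accuracy(ROWS)
winners = best_per_category(ROWS)

for model in MODELS:
    print(f"{model:16s} {means[model]:.3f}")
```

The per-category view makes the exceptions visible: a general-domain model is best, or tied for best, on Entity Existence and Entity Localization.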

Explicit Reasoning via Thinking Tokens

Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context. The model generates an interpretable surgical rationale via thinking tokens, connecting visual features to surgical meaning and reasoning about the intent behind maneuvers, not just their execution. For example, it correctly attributes the sacrifice of a vessel branch to an enlarged interlobar lymph node (Fig. 1).
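
To make the rationale inspectable downstream, the thinking span can be separated from the final answer. This is a minimal sketch that assumes the model wraps its reasoning in a `<think>...</think>` pair; the closing tag, the `split_reasoning` helper, and the sample response string are assumptions for illustration, not taken from the paper.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    rationale: str  # text inside the thinking span, "" if none was found
    answer: str     # everything outside the thinking span


# Assumed delimiter format; DOTALL lets the rationale span multiple lines.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> ParsedResponse:
    """Separate the first <think>...</think> span from the rest of the response."""
    match = _THINK.search(text)
    if match is None:
        return ParsedResponse("", text.strip())
    rationale = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return ParsedResponse(rationale, answer)


# Made-up response text, loosely modeled on the lymph node example above.
example = (
    "<think>The interlobar lymph node looks enlarged,\n"
    "so preserving the branch is not feasible.</think>\n"
    "The branch was sacrificed because of the enlarged lymph node."
)
parsed = split_reasoning(example)
```

Responses without a thinking span fall through unchanged, so the same helper can handle output from a model that does not emit thinking tokens.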

Performance on Standard Surgical Perception Benchmarks (F1/mAP)

Task	GPT-5.1	Gemini 3.1 Pro	Qwen3-VL	SureonVLM
Action HeiChole F1	0.18	0.21	0.17	0.04
CVS Endoscapes F1	0.08	0.14	0.02	0.32
Phase Cholec80 F1	0.36	0.47	0.17	0.63
Phase HeiChole F1	0.29	0.35	0.12	0.41
Phase MultiBypass140 F1	0.13	0.22	0.08	0.40
Tool Endoscapes mAP@.5:.95	0.00	0.61	0.00	0.22

SureonVLM Training Stages

01 Supervised Fine-tuning (SFT) Stage 1: MLP projection layer update

02 SFT Stage 2: vision encoder and MLP update

03 SFT Stage 3: MLP and LLM update (vision encoder fixed), with open-ended exposure and explicit <think> tokens

04 Reinforcement Learning (GRPO) with Reward Design
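
The staged schedule above amounts to a per-stage choice of which modules receive gradient updates, followed by GRPO, whose core step normalizes each sampled response's reward against the other responses for the same prompt. The sketch below illustrates both under stated assumptions: the module names are hypothetical labels, the rewards are toy values (the source does not describe the reward design or which modules are updated during RL), and the population standard deviation is one common choice rather than a confirmed detail.

```python
from statistics import mean, pstdev

MODULES = ("vision_encoder", "mlp_projector", "llm")  # hypothetical labels

# Trainable modules per SFT stage, as listed in the stages above.
STAGES = [
    {"name": "SFT stage 1", "trainable": {"mlp_projector"}},
    {"name": "SFT stage 2", "trainable": {"vision_encoder", "mlp_projector"}},
    {"name": "SFT stage 3", "trainable": {"mlp_projector", "llm"}},
]


def frozen_modules(stage):
    """Modules that stay fixed (no gradient updates) in the given stage."""
    return sorted(set(MODULES) - stage["trainable"])


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one prompt's group.

    Each reward belongs to one sampled response for the same prompt. The
    advantage is (reward - group mean) / (group std + eps), so no learned
    value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for four sampled answers to one question (1 = correct, 0 = wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

for stage in STAGES:
    print(stage["name"], "frozen:", frozen_modules(stage))
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward design matters for this stage.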


Your AI Implementation Roadmap

Our structured approach ensures a smooth and effective integration of cutting-edge AI, tailored to your enterprise needs.

01 Discovery & Strategy

We begin with a deep dive into your current surgical workflows, existing data, and specific challenges. This phase defines project scope, identifies key reasoning needs, and sets measurable objectives aligned with SUREON's capabilities.

02 Data Integration & Customization

Leveraging SUREON's architecture, we integrate your proprietary surgical video data. Our experts fine-tune the SureonVLM models to your specific procedural context, ensuring optimal performance and accurate reasoning for your unique environment.

03 Pilot Deployment & Validation

A pilot program is initiated within a controlled environment to test the customized SureonVLM. We rigorously validate its reasoning capabilities against real-world surgical scenarios, collecting feedback and making iterative improvements for precision and reliability.

04 Full-Scale Integration & Monitoring

Once validated, SureonVLM is integrated into your operational systems. We provide ongoing monitoring, support, and continuous model improvement, ensuring sustained performance and adaptability to evolving surgical practices and data.

