Enterprise AI Analysis: SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Revolutionizing Surgical AI with Advanced Reasoning Capabilities

The SUREON project moves surgical AI beyond basic perception tasks toward complex reasoning. Using expert-narrated surgical videos, SUREON builds a large-scale video QA dataset with 12 distinct question categories covering safety, decision-making, and forecasting. The SureonVLM models, trained with supervised fine-tuning and reinforcement learning, outperform general-domain models on the SUREON benchmark, show explicit reasoning behavior, and perform strongly on most surgical perception and reasoning tasks, especially in safety-critical areas. This work paves the way for interpretable and clinically meaningful surgical AI.


Deep Analysis & Enterprise Applications


SUREON Data Synthesis Pipeline

Identify Semantic Grounding Moments (SGMs)
Generate Semantically Structured Samples
Specialized Generator Agent
Specialized Filtering Agent
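
The four steps above can be read as a generate-then-filter loop over narrated video moments. The following is a minimal sketch under that reading, not the authors' implementation: the `Moment` and `QASample` fields, the `synthesize` function, and the toy generator and filter are all hypothetical names introduced for illustration, and the narration text is made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Moment:
    """A semantic grounding moment (SGM): a narrated span of video (hypothetical fields)."""
    video_id: str
    start_s: float
    end_s: float
    narration: str


@dataclass
class QASample:
    """One question-answer pair tied to a moment (hypothetical fields)."""
    moment: Moment
    category: str
    question: str
    answer: str


def synthesize(
    moments: Iterable[Moment],
    generate: Callable[[Moment], List[QASample]],
    keep: Callable[[QASample], bool],
) -> List[QASample]:
    """Run the generator on each moment and keep only samples the filter accepts."""
    dataset: List[QASample] = []
    for moment in moments:
        for sample in generate(moment):
            if keep(sample):
                dataset.append(sample)
    return dataset


# Toy stand-ins for the generator and filtering agents (assumptions, not real agents).
def toy_generate(m: Moment) -> List[QASample]:
    return [
        QASample(m, "Action Description", "What is the surgeon doing?", m.narration),
        QASample(m, "Forecasting", "What happens next?", ""),  # unanswerable from narration
    ]


def toy_keep(s: QASample) -> bool:
    # Reject samples whose answer is empty.
    return bool(s.answer.strip())


moments = [Moment("vid001", 12.0, 20.5, "Dissecting with a hook.")]  # made-up example
dataset = synthesize(moments, toy_generate, toy_keep)
print(len(dataset), dataset[0].category)
```

Keeping the generator and the filter as separate callables mirrors the two specialized agents in the list: either one can be swapped out without touching the loop.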

Large-scale QA Dataset

206.8k QA Pairs Generated

SUREON is a large-scale video QA dataset systematically harvested from academic surgical videos. It contains 206.8k QA pairs across 12 question categories and 170 procedure types, together with an expert-validated benchmark of 354 examples. The dataset is intended for training models on surgical reasoning, and the expert-validated subset for evaluating them.

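SureonVLM Outperforms General Models

84% Average Accuracy on SUREON Benchmark

SureonVLM and SureonVLM-R1 achieve an average accuracy of 84-85% on the SUREON benchmark, ahead of GPT-5.1, Gemini 3.1 Pro, and Qwen3-VL. The margin over the best general-domain model is largest on Action Description (0.87-0.88 versus 0.63) and Decision Reasoning (0.98-1.00 versus 0.83), and both models reach 0.92-0.93 on the safety-critical Safety Action Identification category. The general-domain models stay ahead only on Entity Existence and Entity Localization.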

Comparison of SOTA Models on SUREON Benchmark (Accuracy)

| QA Type | GPT-5.1 | Gemini 3.1 Pro | Qwen3-VL (8B) | SureonVLM (ours) | SureonVLM-R1 (ours) |
|---|---|---|---|---|---|
| Action Description | 0.52 | 0.27 | 0.63 | 0.87 | 0.88 |
| Decision Reasoning | 0.70 | 0.60 | 0.83 | 0.98 | 1.00 |
| Forecasting | 0.53 | 0.60 | 0.53 | 0.73 | 0.62 |
| Instrument Action Interaction | 0.69 | 0.81 | 0.53 | 0.88 | 0.90 |
| Local Action Reasoning | 0.83 | 0.31 | 0.67 | 0.93 | 1.00 |
| Entity Attribute | 0.82 | 0.94 | 0.93 | 0.95 | 0.97 |
| Entity Existence | 0.68 | 0.75 | 0.57 | 0.70 | 0.70 |
| Entity Localization | 0.55 | 0.55 | 0.33 | 0.53 | 0.50 |
| Procedural Action Description | 0.90 | 0.73 | 0.63 | 0.97 | 0.93 |
| Safety Action Identification | 0.62 | 0.47 | 0.90 | 0.92 | 0.93 |
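
The reported averages can be checked against the per-category figures. The sketch below recomputes an unweighted mean per model from the ten categories listed in the comparison table and flags the categories where a general-domain model scores higher than both SureonVLM variants. It assumes equal weight per category and covers only the rows shown here, so it approximates the benchmark average and is not the authors' scoring script; the function names are illustrative.

```python
from statistics import mean

MODELS = ["GPT-5.1", "Gemini 3.1 Pro", "Qwen3-VL (8B)", "SureonVLM", "SureonVLM-R1"]

# Accuracy per QA type, in the same column order as MODELS (values from the table).
ACCURACY = {
    "Action Description":            [0.52, 0.27, 0.63, 0.87, 0.88],
    "Decision Reasoning":            [0.70, 0.60, 0.83, 0.98, 1.00],
    "Forecasting":                   [0.53, 0.60, 0.53, 0.73, 0.62],
    "Instrument Action Interaction": [0.69, 0.81, 0.53, 0.88, 0.90],
    "Local Action Reasoning":        [0.83, 0.31, 0.67, 0.93, 1.00],
    "Entity Attribute":              [0.82, 0.94, 0.93, 0.95, 0.97],
    "Entity Existence":              [0.68, 0.75, 0.57, 0.70, 0.70],
    "Entity Localization":           [0.55, 0.55, 0.33, 0.53, 0.50],
    "Procedural Action Description": [0.90, 0.73, 0.63, 0.97, 0.93],
    "Safety Action Identification":  [0.62, 0.47, 0.90, 0.92, 0.93],
}


def mean_accuracy(table, models):
    """Unweighted mean over the listed categories, one value per model."""
    columns = zip(*table.values())
    return {m: round(mean(col), 3) for m, col in zip(models, columns)}


def baseline_leads(table, n_baselines=3):
    """Categories where the best general-domain model beats both SureonVLM variants."""
    return [
        cat for cat, row in table.items()
        if max(row[:n_baselines]) > max(row[n_baselines:])
    ]


averages = mean_accuracy(ACCURACY, MODELS)
exceptions = baseline_leads(ACCURACY)
for model, value in averages.items():
    print(f"{model}: {value:.3f}")
print("General-domain model leads on:", exceptions)
```

Over these ten rows the two SureonVLM variants average about 0.85 and 0.84, which is consistent with the 84-85% figure quoted above.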

Explicit Reasoning via Thinking Tokens

Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context. The model produces an interpretable surgical rationale in its thinking tokens, connecting visual features to surgical meaning and reasoning about the intent behind a maneuver, not just its execution. For example, it correctly attributes the sacrifice of a vessel branch to an enlarged interlobar lymph node (Fig. 1).

Performance on Standard Surgical Perception Benchmarks (F1/mAP)

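| Task | Dataset | Metric | GPT-5.1 | Gemini 3.1 Pro | Qwen3-VL | SureonVLM |
|---|---|---|---|---|---|---|
| Action | HeiChole | F1 | 0.18 | 0.21 | 0.17 | 0.04 |
| CVS | Endoscapes | F1 | 0.08 | 0.14 | 0.02 | 0.32 |
| Phase | Cholec80 | F1 | 0.36 | 0.47 | 0.17 | 0.63 |
| Phase | HeiChole | F1 | 0.29 | 0.35 | 0.12 | 0.41 |
| Phase | MultiBypass140 | F1 | 0.13 | 0.22 | 0.08 | 0.40 |
| Tool | Endoscapes | mAP@.5:.95 | 0.00 | 0.61 | 0.00 | 0.22 |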

SureonVLM Training Stages

Supervised Fine-tuning (SFT) Stage 1: MLP projection layer update
SFT Stage 2: Vision encoder and MLP update
SFT Stage 3: MLP and LLM update (vision encoder fixed), open-ended exposure & explicit <think> tokens
Reinforcement Learning (GRPO) with Reward Design
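
The stage list above amounts to a schedule of which components are updated at each step, followed by a GRPO phase. The sketch below encodes that schedule and shows one common form of the group-relative advantage used in GRPO, where each sampled answer is scored against the mean and standard deviation of its own group. The component labels are illustrative, components not named in a stage are assumed frozen, and the 0/1 rewards are placeholders because the reward design is not described on this page.

```python
from statistics import mean, pstdev

# Which components receive gradient updates in each SFT stage.
# Labels are illustrative; anything a stage does not name is assumed frozen.
SFT_STAGES = {
    1: {"vision_encoder": False, "mlp_projector": True, "llm": False},
    2: {"vision_encoder": True, "mlp_projector": True, "llm": False},
    3: {"vision_encoder": False, "mlp_projector": True, "llm": True},
}


def trainable(stage):
    """Names of the components updated in the given SFT stage."""
    return sorted(name for name, on in SFT_STAGES[stage].items() if on)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


for stage in SFT_STAGES:
    print(f"SFT stage {stage}: update {trainable(stage)}")

# Placeholder rewards for four sampled answers to one question (1.0 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Because advantages are computed within a group, correct answers get a positive signal only relative to the other samples for the same question, which is why GRPO does not need a separate value model.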


Your AI Implementation Roadmap

Our structured approach ensures a smooth and effective integration of cutting-edge AI, tailored to your enterprise needs.

01 Discovery & Strategy

We begin with a deep dive into your current surgical workflows, existing data, and specific challenges. This phase defines project scope, identifies key reasoning needs, and sets measurable objectives aligned with SUREON's capabilities.

02 Data Integration & Customization

Leveraging SUREON's architecture, we integrate your proprietary surgical video data. Our experts fine-tune the SureonVLM models to your specific procedural context, ensuring optimal performance and accurate reasoning for your unique environment.

03 Pilot Deployment & Validation

A pilot program is initiated within a controlled environment to test the customized SureonVLM. We rigorously validate its reasoning capabilities against real-world surgical scenarios, collecting feedback and making iterative improvements for precision and reliability.

04 Full-Scale Integration & Monitoring

Once validated, SureonVLM is integrated into your operational systems. We provide ongoing monitoring, support, and continuous model improvement, ensuring sustained performance and adaptability to evolving surgical practices and data.

Ready to Enhance Your Surgical Intelligence?

Partner with us to unlock advanced reasoning capabilities in surgical AI and drive unprecedented efficiency and safety.
