
Vero: An Open RL Recipe for General Visual Reasoning

Revolutionizing Visual AI with Open, General-Purpose Reinforcement Learning

Gabriel Sarch*, Linrong Cai*, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu†
Princeton University

Executive Summary: Unlocking General Visual Reasoning

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats.

66% Vero Overall Avg. Score (SOTA)
5.3 pts Avg. Improvement Over Base Models
600K RL Samples in Vero-600K Dataset
59 Datasets Covered Across 6 Categories

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Overall Performance
Chart & OCR
STEM
Spatial & Action
Knowledge & Recognition
Grounding, Counting & Search
Captioning & Instruction Following

Vero: State-of-the-Art General Visual Reasoning

Vero achieves state-of-the-art overall performance among 8B VLMs, outperforming baselines across all six task categories. It improves over four base models by 3.6-5.3 points on average across the 30 challenging benchmarks in VeroEval. Notably, Vero-Qwen3T-8B outperforms Qwen3-VL-8B-Thinking on 24 of 30 benchmarks without additional proprietary thinking data, and Vero-Qwen3I-8B does so on 23 of 30 benchmarks.

Vero-600K, our 600K-sample dataset derived from 59 datasets, and its task-routed rewards are crucial for enabling this broad capability.
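The idea behind task-routed rewards can be illustrated with a small sketch. The routing logic, function names, and category keys below are illustrative assumptions, not the paper's actual implementation; the point is simply that each task category dispatches to a reward function suited to its answer format.

```python
# Hypothetical sketch of task-routed rewards: each task category is mapped
# to a reward function that can handle its answer format.

def exact_match_reward(pred: str, gold: str) -> float:
    """Binary reward for short-form textual answers (e.g., multiple choice)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_reward(pred: str, gold: str, rel_tol: float = 1e-2) -> float:
    """Reward numeric answers that fall within a relative tolerance of gold."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(abs(g), 1e-8) else 0.0

# Illustrative routing table; the real recipe covers six categories
# with format-specific verifiers.
REWARD_ROUTER = {
    "chart_ocr": exact_match_reward,
    "stem": numeric_reward,
    "knowledge": exact_match_reward,
}

def compute_reward(task: str, pred: str, gold: str) -> float:
    """Dispatch to the reward function registered for this task category."""
    fn = REWARD_ROUTER.get(task, exact_match_reward)
    return fn(pred, gold)
```

The appeal of this design is that adding a new task category only requires registering one new reward function, while the RL training loop stays unchanged.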

Performance in Chart & OCR

Vero-Qwen3I-8B achieves an average score of 69.8%, demonstrating a significant improvement of +8.6 points over the Qwen3-VL-8B-Instruct base model (61.2%). This highlights Vero's advanced capabilities in extracting and reasoning over structured information in documents, charts, tables, and infographics.

Performance in STEM

In STEM tasks, Vero-Qwen3I-8B scores 63.7% on average, an improvement of +6.4 points over the Qwen3-VL-8B-Instruct base model (57.3%). This covers mathematical diagram reasoning, scientific figure interpretation, and medical image understanding, with answers typically numeric or symbolic.

Performance in Spatial & Action

Vero-Qwen3I-8B achieves an average score of 66.3%, showing a gain of +3.7 points compared to the Qwen3-VL-8B-Instruct base model (62.6%). This category targets embodied reasoning, UI navigation, and 3D spatial understanding, requiring reasoning about spatial transformations and action sequences.

Performance in Knowledge & Recognition

For Knowledge & Recognition, Vero-Qwen3I-8B scores 53.3% on average, a +1.0 point improvement over the Qwen3-VL-8B-Instruct base model (52.3%). This spans visual question answering that combines object, scene, and entity recognition with external or commonsense knowledge.

Performance in Grounding, Counting & Search

Vero-Qwen3I-8B shows strong performance with an average score of 63.8%, a +5.3 point increase from the Qwen3-VL-8B-Instruct base model (58.5%). This category requires spatially localizing objects via bounding boxes, counting entity instances, and searching among visual distractors.
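Grounding answers are bounding boxes rather than strings, so a natural reward is based on intersection-over-union (IoU) with the annotation. The sketch below assumes (x1, y1, x2, y2) boxes and a 0.5 IoU threshold; both are common conventions but assumptions here, not details taken from the paper.

```python
def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box: tuple, gold_box: tuple,
                     thresh: float = 0.5) -> float:
    """Binary reward when the predicted box overlaps the annotation
    enough. The 0.5 threshold is an illustrative assumption."""
    return 1.0 if box_iou(pred_box, gold_box) >= thresh else 0.0
```

Counting rewards fit the same mold: compare the predicted integer to the annotated count, optionally with partial credit for near misses.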

Performance in Captioning & Instruction Following

In Captioning & Instruction Following, Vero-Qwen3I-8B achieves the highest average score of 83.8%, an impressive +5.6 point gain over the Qwen3-VL-8B-Instruct base model (78.2%). This encompasses open-ended image description and following prompt instructions, maintaining visual chat ability during RL.

The Open RL Recipe: Vero-600K Dataset

Vero-600K is a fully open, multi-task RL training set of 600K samples curated from 59 diverse datasets spanning six core task categories. This extensive dataset, combined with task-routed reward functions, enables Vero VLMs to achieve state-of-the-art performance across diverse visual reasoning tasks.
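Training on a multi-task mixture like this typically means sampling each RL batch across category weights. The weights and category names below are placeholders for illustration; the actual Vero-600K proportions are not restated here.

```python
import random

# Hypothetical category weights for mixing the six task categories
# into one RL training stream (placeholder values, not the paper's).
CATEGORY_WEIGHTS = {
    "chart_ocr": 0.20,
    "stem": 0.20,
    "spatial_action": 0.15,
    "knowledge": 0.15,
    "grounding_counting": 0.15,
    "captioning": 0.15,
}

def sample_category(rng: random.Random) -> str:
    """Draw one task category according to the mixture weights."""
    cats, weights = zip(*CATEGORY_WEIGHTS.items())
    return rng.choices(cats, weights=weights, k=1)[0]
```

Keeping the mixture explicit in one table makes it easy to rebalance categories when one reward saturates during training.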

Unlike proprietary reinforcement learning pipelines, Vero provides a transparent and accessible recipe, releasing all data, code, and models to facilitate future research. This openness fosters reproducibility, mechanistic understanding, and sustained scientific progress in visual reasoning.


Your AI Implementation Roadmap

A typical journey to integrating advanced visual AI into your enterprise.

Phase 1: Discovery & Strategy

Initial consultation to understand your specific needs, assess current workflows, and define key AI objectives. We'll identify high-impact use cases.

Phase 2: Data Curation & Model Training

Leveraging open recipes like Vero, we'll curate and filter your proprietary data, then train or fine-tune models for your unique visual reasoning requirements across diverse tasks.

Phase 3: Integration & Deployment

Seamless integration of the trained VLM into your existing enterprise systems and workflows, ensuring robust performance and scalability.

Phase 4: Monitoring & Optimization

Continuous monitoring of AI performance, iterative refinement, and strategic adjustments to ensure sustained value and adapt to evolving business needs.

Ready to Transform Your Enterprise with AI?

Schedule a personalized strategy session to discover how Vero's open-source RL recipe can empower your visual reasoning capabilities.

Ready to Get Started?

Book Your Free Consultation.
