Vero: An Open RL Recipe for General Visual Reasoning
Revolutionizing Visual AI with Open, General-Purpose Reinforcement Learning
Gabriel Sarch*, Linrong Cai*, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu†
Princeton University
Executive Summary: Unlocking General Visual Reasoning
What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats.
Vero: State-of-the-Art General Visual Reasoning
Vero achieves state-of-the-art overall performance among 8B VLMs, outperforming baselines across all six task categories. It improves over four base models by 3.6-5.3 points on average across the 30 challenging benchmarks of VeroEval. Notably, Vero-Qwen3T-8B outperforms Qwen3-VL-8B-Thinking on 24 of 30 benchmarks without additional proprietary thinking data, and Vero-Qwen3I-8B outperforms Qwen3-VL-8B-Instruct on 23 of 30 benchmarks.
Vero-600K, our 600K-sample dataset derived from 59 datasets, and its task-routed rewards are crucial for enabling this broad capability.
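As a rough illustration of how task-routed rewards might dispatch heterogeneous answer formats, here is a minimal sketch; the function names, category labels, and numeric tolerance are assumptions for exposition, not Vero's released implementation:

```python
# Illustrative sketch of task-routed rewards: each task category is routed
# to a verifier suited to its answer format. Names and tolerances are
# hypothetical, not the actual Vero code.

def reward_numeric(pred: str, gold: str, tol: float = 1e-2) -> float:
    """Tolerance-based match for numeric answers; falls back to string match."""
    try:
        return float(abs(float(pred) - float(gold)) <= tol)
    except ValueError:
        return float(pred.strip().lower() == gold.strip().lower())

def reward_exact(pred: str, gold: str) -> float:
    """Normalized string match for recognition / multiple-choice answers."""
    return float(pred.strip().lower() == gold.strip().lower())

# Route each category to its verifier; unknown categories default to exact match.
REWARD_ROUTERS = {
    "stem": reward_numeric,
    "chart_ocr": reward_numeric,
    "knowledge": reward_exact,
}

def compute_reward(category: str, pred: str, gold: str) -> float:
    return REWARD_ROUTERS.get(category, reward_exact)(pred, gold)
```

In practice such a router is what lets one RL loop train on numeric STEM answers, multiple-choice recognition, and other formats at once: the policy update is format-agnostic, and only the verifier behind each category changes.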
Performance in Chart & OCR
Vero-Qwen3I-8B achieves an average score of 69.8%, a +8.6-point improvement over the Qwen3-VL-8B-Instruct base model (61.2%). This highlights Vero's capabilities in extracting and reasoning over structured information in documents, charts, tables, and infographics.
Performance in STEM
In STEM tasks, Vero-Qwen3I-8B scores 63.7% on average, an improvement of +6.4 points over the Qwen3-VL-8B-Instruct base model (57.3%). This covers mathematical diagram reasoning, scientific figure interpretation, and medical image understanding, with answers typically numeric or symbolic.
Performance in Spatial & Action
Vero-Qwen3I-8B achieves an average score of 66.3%, showing a gain of +3.7 points compared to the Qwen3-VL-8B-Instruct base model (62.6%). This category targets embodied reasoning, UI navigation, and 3D spatial understanding, requiring reasoning about spatial transformations and action sequences.
Performance in Knowledge & Recognition
For Knowledge & Recognition, Vero-Qwen3I-8B scores 53.3% on average, a +1.0 point improvement over the Qwen3-VL-8B-Instruct base model (52.3%). This spans visual question answering that combines object, scene, and entity recognition with external or commonsense knowledge.
Performance in Grounding, Counting & Search
Vero-Qwen3I-8B shows strong performance with an average score of 63.8%, a +5.3 point increase from the Qwen3-VL-8B-Instruct base model (58.5%). This category requires spatially localizing objects via bounding boxes, counting entity instances, and searching among visual distractors.
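A standard way to score bounding-box grounding of this kind is to threshold intersection-over-union (IoU) against the gold box. The sketch below shows that formulation; the `[x1, y1, x2, y2]` box format and the 0.5 threshold are assumptions, not necessarily the thresholds Vero uses:

```python
# Sketch of an IoU-thresholded reward for bounding-box grounding.
# Box format [x1, y1, x2, y2] and the 0.5 cutoff are illustrative choices.

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gold_box, threshold: float = 0.5) -> float:
    """Binary reward: 1.0 if the predicted box overlaps the gold box enough."""
    return float(iou(pred_box, gold_box) >= threshold)
```

A binary thresholded reward is the simplest choice; a shaped variant could return the raw IoU instead, trading a denser learning signal for a looser notion of success.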
Performance in Captioning & Instruction Following
In Captioning & Instruction Following, Vero-Qwen3I-8B achieves its highest category average of 83.8%, a +5.6-point gain over the Qwen3-VL-8B-Instruct base model (78.2%). This encompasses open-ended image description and following prompt instructions, maintaining visual chat ability during RL.
The Open RL Recipe: Vero-600K Dataset
Vero-600K is a fully open, multi-task RL training set of 600K samples curated from 59 diverse datasets spanning six core task categories. This extensive dataset, combined with task-routed reward functions, enables Vero VLMs to achieve state-of-the-art performance across diverse visual reasoning tasks.
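One common way a multi-task pool like this feeds an RL loop is category-balanced batch sampling, so no single task category dominates the updates. The toy sketch below illustrates the idea; the category names, pool contents, and uniform weighting are assumptions, not the released Vero-600K mixture:

```python
# Toy sketch of category-balanced sampling from a multi-task RL pool.
# Categories and uniform weighting are illustrative, not the Vero-600K mix.
import random

POOL = {
    "chart_ocr": [{"id": i, "cat": "chart_ocr"} for i in range(100)],
    "stem":      [{"id": i, "cat": "stem"} for i in range(100)],
    "grounding": [{"id": i, "cat": "grounding"} for i in range(100)],
}

def sample_batch(pool, batch_size: int, seed: int = 0):
    """Draw a batch by first picking a category uniformly, then an item in it."""
    rng = random.Random(seed)
    cats = list(pool)
    return [rng.choice(pool[rng.choice(cats)]) for _ in range(batch_size)]

batch = sample_batch(POOL, 8)
```

Two-stage sampling like this decouples the mixture ratio from raw dataset sizes, which matters when source datasets range from a few thousand to hundreds of thousands of samples.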
Unlike proprietary reinforcement learning pipelines, Vero provides a transparent and accessible recipe, releasing all data, code, and models to facilitate future research. This openness fosters reproducibility, mechanistic understanding, and sustained scientific progress in visual reasoning.