Databricks AI Research
Advancing Knowledge Agents with Reinforcement Learning
Databricks AI Research introduces KARL, a breakthrough system for training enterprise search agents via reinforcement learning. KARL achieves state-of-the-art grounded reasoning across diverse, hard-to-verify search tasks, setting new benchmarks for efficiency and generalization.
Transforming Enterprise Search & Reasoning
KARL's novel approach addresses critical challenges in grounded reasoning for enterprise applications, offering significant advancements in accuracy, efficiency, and adaptability across various knowledge domains. Our key contributions enable robust, cost-effective knowledge agents that generalize well.
Unrivaled Performance & Generalization
KARL surpasses leading closed models such as Claude 4.6 and GPT 5.2 on KARLBench and is Pareto-optimal across diverse in-distribution and out-of-distribution tasks. Test-time compute scaling is complementary and adds further gains. Together, these results demonstrate strong generalization across heterogeneous search behaviors.
Innovative Training Methodology
Our iterative large-batch off-policy RL method (OAPL) is sample-efficient, robust to train-inference engine discrepancies, and extends naturally to multi-task training. It is combined with an agentic synthesis pipeline that uses long-horizon reasoning and tool use to generate diverse, grounded training data, so KARL can keep improving from data it helps produce.
Cost-Efficient Grounded Reasoning
KARL delivers frontier-quality search at a fraction of the cost and latency of alternatives, achieving competitive scores at under $0.10 per query. By learning more efficient search strategies through RL, KARL reduces token overhead and solves tasks in fewer steps, providing simultaneous quality gains and cost savings.
Deep Analysis & Enterprise Applications
KARLBench Task Capabilities
KARLBench is a multi-capability evaluation suite spanning six distinct search regimes, designed to assess grounded reasoning capabilities across diverse structural challenges. Each task isolates a distinct capability, from constraint-driven entity search to procedural reasoning over technical documentation.
| Name | Capability | Example Question | Example Answer |
|---|---|---|---|
| BrowseComp-Plus | Constraint-driven entity search | Which Nobel physicist taught in the city where the author of The Trial was born and later worked at the Institute for Advanced Study? | Albert Einstein |
| TREC-Biogen | Cross-document report synthesis | What evidence supports the effectiveness of mRNA vaccines against emerging SARS-CoV-2 variants? | A report integrating findings from clinical studies, observational analyses, and variant-specific evaluations. |
| FinanceBench | Long-document traversal with tabular numerical reasoning | Based on Company X's 2022 annual report, what was the percent change in operating income from 2021 to 2022? | Operating income increased by 12.4%, computed from $2.10B (2021) to $2.36B (2022). |
| QAMPARI | Exhaustive entity search over encyclopedic text | Which countries have won at least one FIFA World Cup? | Brazil; Germany; Italy; Argentina; France; Uruguay; England; Spain. |
| FreshStack | Procedural reasoning over technical documentation | How can a ModuleNotFoundError be resolved when running a Python script inside a virtual environment? | Activate the correct environment, verify installation with pip list, install the missing package using pip install <package>, and ensure the interpreter path matches the environment. |
| PMBench | Exhaustive fact search over internal company notes | What are the specific concerns raised regarding governance in production environments, and which customers raised them? | XYZ Corp and ABC Financial raised governance concerns around access controls for model updates, audit logging, and environment separation. |
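The FinanceBench row above rests on a small piece of arithmetic. As a minimal sketch, the percent change can be checked as follows; the function name `percent_change` is an illustrative assumption, and the two input figures are the ones quoted in the example answer.

```python
def percent_change(old: float, new: float) -> float:
    """Return the percent change from old to new."""
    if old == 0:
        raise ValueError("percent change is undefined when the base value is 0")
    return (new - old) / old * 100.0


# Figures from the FinanceBench example row, in billions of dollars.
operating_income_2021 = 2.10
operating_income_2022 = 2.36

change = percent_change(operating_income_2021, operating_income_2022)
print(f"{change:.1f}%")  # prints "12.4%"
```

The result agrees with the 12.4% stated in the example answer.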
Agentic Data Synthesis Pipeline
Our two-phase agentic pipeline dynamically explores corpora and iteratively bootstraps from increasingly capable models to generate diverse, grounded, and difficult training data.
Cost & Latency Advantage
Cost per query (single call): under $0.10.
KARL defines the Pareto frontier for cost-quality and latency-quality trade-offs, demonstrating frontier-quality search at a fraction of the cost and latency of alternatives. With parallel sampling, KARL matches Claude Opus 4.6 quality at roughly 33% lower cost per query.
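To make the Pareto-frontier claim concrete, here is a minimal sketch of how a cost-quality frontier can be computed. The system names and numbers are hypothetical placeholders chosen for illustration, not results from the research, and `Point` and `pareto_frontier` are names introduced here.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float     # dollars per query (lower is better)
    quality: float  # benchmark score (higher is better)


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates.

    A point is dominated when another point is at least as cheap and at
    least as good, and strictly better on one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.cost <= p.cost
            and q.quality >= p.quality
            and (q.cost < p.cost or q.quality > p.quality)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost)


# Hypothetical numbers for illustration only.
systems = [
    Point("system_a", cost=0.08, quality=0.60),
    Point("system_b", cost=0.30, quality=0.62),
    Point("system_c", cost=0.25, quality=0.55),
    Point("system_d", cost=0.05, quality=0.40),
]
frontier = pareto_frontier(systems)
print([p.name for p in frontier])
```

In this toy data, `system_c` falls off the frontier because `system_a` is both cheaper and better. The same check applies to latency-quality trade-offs by swapping the cost axis for latency.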
Parallel Thinking Test-time Compute
Parallel Thinking is a general-purpose test-time compute (TTC) strategy: it generates multiple independent rollouts in parallel and aggregates them into a single final answer, boosting performance across all benchmarks.
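The control flow of this strategy can be sketched in a few lines. Everything below is an assumption made for illustration: `fake_rollout` stands in for a full agent rollout, and `majority_vote` is a deliberately simple substitute for whatever aggregation step the system actually uses, which the text above does not specify.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_thinking(
    question: str,
    rollout: Callable[[str, int], str],
    aggregate: Callable[[str, list[str]], str],
    n_rollouts: int = 4,
) -> str:
    """Run n independent rollouts concurrently, then aggregate them."""
    with ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        candidates = list(
            pool.map(lambda seed: rollout(question, seed), range(n_rollouts))
        )
    return aggregate(question, candidates)


def fake_rollout(question: str, seed: int) -> str:
    # Hypothetical stand-in for one agent rollout: three of every four
    # seeds agree on the same answer.
    return "answer A" if seed % 4 != 3 else "answer B"


def majority_vote(question: str, candidates: list[str]) -> str:
    # Simple stand-in aggregator: return the most frequent candidate.
    return Counter(candidates).most_common(1)[0][0]


answer = parallel_thinking("example question", fake_rollout, majority_vote)
print(answer)
```

Passing the aggregator in as a function keeps the sketch neutral about how candidates are merged. For report-style tasks, where no two rollouts produce identical text, an aggregator that synthesizes the candidates into one answer fits better than voting.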
Parallel Thinking Performance Boost
Maximum average score increase with Parallel Thinking: +5.9 points.
Parallel Thinking significantly boosts KARL's performance, with gains of up to 5.9 points on TREC-Biogen, while preserving strong out-of-distribution generalization.
Compression Impact on Performance
RL training significantly improves the model's ability to identify and retain relevant information during context compression, a critical capability for long-horizon search tasks.
| Ablation | Setting | BrowseComp-Plus Score | BrowseComp-Plus Recall |
|---|---|---|---|
| Compression | With | 0.570 | 0.681 |
| Compression | Without | 0.389 | 0.503 |
| Retrieval | Qwen3-Embedding-8B | 0.570 | 0.681 |
| Retrieval | Vector Search (GTE-large + hybrid) | 0.568 | 0.698 |
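For readers unfamiliar with context compression in a search loop, the sketch below shows the general shape of the mechanism under loud assumptions. The whitespace token count, the keyword-overlap `compress` function, and the budget value are all hypothetical stand-ins. In KARL the compression is performed by the trained model itself, not by a keyword filter.

```python
def count_tokens(text: str) -> int:
    # Whitespace word count as a crude proxy for a real tokenizer.
    return len(text.split())


def compress(history: list[str], query: str) -> str:
    # Hypothetical stand-in for a learned compression step: keep only
    # entries that share at least one word with the query.
    query_terms = set(query.lower().split())
    kept = [h for h in history if query_terms & set(h.lower().split())]
    return " | ".join(kept)


def add_observation(
    history: list[str], observation: str, query: str, budget: int
) -> list[str]:
    """Append an observation; compress the history if it exceeds the budget."""
    history = history + [observation]
    if sum(count_tokens(h) for h in history) > budget:
        history = [compress(history, query)]
    return history


query = "governance audit logging"
history: list[str] = []
for doc in [
    "notes on audit logging requirements",
    "lunch menu for the offsite",
    "governance concerns about access controls",
]:
    history = add_observation(history, doc, query, budget=12)
print(history)
```

The third observation pushes the history over the budget, so it is replaced by a single compressed entry that drops the irrelevant note. The quality of that retention step is what the "With" and "Without" compression rows in the table measure.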
RL Generalizes Beyond Sharpening
RL training in KARL does more than sharpen existing capabilities: it develops new problem-solving capabilities. Max@K performance improves for every value of K, which indicates that the model solves problems it previously could not, not just that it concentrates probability mass on solutions it could already reach. This expanded coverage translates directly into larger test-time compute gains.
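The text does not define Max@K precisely. One common reading, assumed here, is the expected best score among K rollouts drawn without replacement from N scored rollouts of the same question. The sketch below computes that quantity exactly; the function name `max_at_k` and the sample scores are illustrative.

```python
from math import comb


def max_at_k(scores: list[float], k: int) -> float:
    """Expected maximum over k rollouts sampled without replacement."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    ordered = sorted(scores)
    # The i-th smallest score (1-indexed) is the maximum of a size-k subset
    # in comb(i - 1, k - 1) of the comb(n, k) possible subsets.
    weighted = sum(s * comb(i - 1, k - 1) for i, s in enumerate(ordered, start=1))
    return weighted / comb(n, k)


# Hypothetical per-rollout scores for a single question.
rollout_scores = [0.0, 0.0, 1.0, 1.0]
for k in range(1, len(rollout_scores) + 1):
    print(k, round(max_at_k(rollout_scores, k), 4))
```

With binary scores this reduces to the familiar pass@k estimate. If RL only sharpened the model, the curve would rise at small K but flatten or fall at large K; improvement at every K is the signature of expanded coverage described above.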
Search Behavior Profiles
KARL's search behavior profile is similar to that of Claude Sonnet 4.5: both are dominated by efficient 'Explore then Commit' trajectories (65% and 58% of cases, against 38% for GLM 4.5 Air). GLM 4.5 Air is 'Confidently Wrong Early' far more often (18%, against 3% for KARL and 2% for Sonnet 4.5). 'Running Out of Context' and 'Exhaustive Search, No Convergence' are rare for all three models.
| Behavior Category | GLM 4.5 Air | KARL | Sonnet 4.5 |
|---|---|---|---|
| Explore then Commit | 38% | 65% | 58% |
| Explore then Verify | 39% | 23% | 27% |
| Giving Up Early | 5% | 7% | 13% |
| Confidently Wrong Early | 18% | 3% | 2% |
| Running Out of Context | 0% | 0% | 0% |
| Exhaustive Search, No Convergence | 0% | 2% | 0% |
Calculate Your Potential AI ROI
Estimate the significant cost and time savings your enterprise could achieve by integrating KARL's advanced AI agents.
Your KARL Implementation Roadmap
A structured approach to integrating KARL into your enterprise, maximizing impact and ensuring a smooth transition to advanced AI capabilities.
Phase 1: Needs Assessment & Data Integration
Understand your specific enterprise search challenges and integrate proprietary data into a secure, optimized vector search infrastructure that KARL can query. Leverage our corpus construction methodology to ensure data readiness without extensive preprocessing.
Phase 2: Agent Customization & Synthetic Data Generation
Tailor KARL agents to your unique tasks using our agentic synthesis pipeline. Generate high-quality, grounded training data through iterative bootstrapping, ensuring your agents learn to reason over your specific knowledge domains.
Phase 3: Multi-task RL Training & Optimization
Train KARL with our iterative large-batch off-policy RL (OAPL) framework, enabling out-of-distribution generalization. Optimize for cost-quality and latency-quality trade-offs, leveraging multi-task learning for robust performance across your diverse search behaviors.
Phase 4: Test-Time Compute Scaling & Deployment
Enhance agent performance with parallel thinking and value-guided search. Deploy KARL agents into your production environment, benefiting from a scalable, efficient architecture that drives state-of-the-art grounded reasoning.
Ready to Transform Your Enterprise AI?
Connect with Databricks AI Research experts to explore how KARL can unlock new levels of efficiency and intelligence for your grounded reasoning tasks.