Enterprise AI Analysis: KARL: Knowledge Agents via Reinforcement Learning

Databricks AI Research

Advancing Knowledge Agents with Reinforcement Learning

Databricks AI Research introduces KARL, a breakthrough system for training enterprise search agents via reinforcement learning. KARL achieves state-of-the-art grounded reasoning across diverse, hard-to-verify search tasks, setting new benchmarks for efficiency and generalization.

Pareto-Optimal Cost-Quality & Latency-Quality Trade-offs
33% Lower Cost per Query vs. Opus 4.6
10x Parallel Rollouts Match Best Models
State-of-the-Art Performance on Diverse Search Tasks

Transforming Enterprise Search & Reasoning

KARL's novel approach addresses critical challenges in grounded reasoning for enterprise applications, offering significant advancements in accuracy, efficiency, and adaptability across various knowledge domains. Our key contributions enable robust, cost-effective knowledge agents that generalize well.

Unrivaled Performance & Generalization

KARL surpasses leading closed models such as Claude 4.6 and GPT 5.2 on KARLBench, achieving Pareto-optimal performance across diverse in-distribution and out-of-distribution tasks, with test-time compute scaling adding complementary gains. This demonstrates superior generalization across heterogeneous search behaviors.

Innovative Training Methodology

Our iterative large-batch off-policy RL method (OAPL) is sample-efficient, robust to discrepancies between the trainer and the inference engine, and naturally extends to multi-task training. Combined with an agentic synthesis pipeline that generates diverse, grounded training data using long-horizon reasoning and tool use, KARL continuously self-improves.

Cost-Efficient Grounded Reasoning

KARL delivers frontier-quality search at a fraction of the cost and latency of alternatives, achieving competitive scores at under $0.10 per query. By learning more efficient search strategies through RL, KARL reduces token overhead and solves tasks in fewer steps, providing simultaneous quality gains and cost savings.

Deep Analysis & Enterprise Applications


KARLBench Task Capabilities

KARLBench is a multi-capability evaluation suite spanning six distinct search regimes, designed to assess grounded reasoning capabilities across diverse structural challenges. Each task isolates a distinct capability, from constraint-driven entity search to procedural reasoning over technical documentation.

Name | Capability | Example Question | Example Answer
BrowseComp-Plus | Constraint-driven entity search | Which Nobel physicist was born in the same city as the author of The Trial and later worked at the Institute for Advanced Study? | Albert Einstein
TREC-Biogen | Cross-document report synthesis | What evidence supports the effectiveness of mRNA vaccines against emerging SARS-CoV-2 variants? | A report integrating findings from clinical studies, observational analyses, and variant-specific evaluations.
FinanceBench | Long-document traversal with tabular numerical reasoning | Based on Company X's 2022 annual report, what was the percent change in operating income from 2021 to 2022? | Operating income increased by 12.4%, computed from $2.10B (2021) to $2.36B (2022).
QAMPARI | Exhaustive entity search over encyclopedic text | Which countries have won at least one FIFA World Cup? | Brazil; Germany; Italy; Argentina; France; Uruguay; England; Spain.
FreshStack | Procedural reasoning over technical documentation | How can a ModuleNotFoundError be resolved when running a Python script inside a virtual environment? | Activate the correct environment, verify installation with pip list, install the missing package using pip install <package>, and ensure the interpreter path matches the environment.
PMBench | Exhaustive fact search over internal company notes | What are the specific concerns raised regarding governance in production environments, and which customers raised them? | XYZ Corp and ABC Financial raised governance concerns around access controls for model updates, audit logging, and environment separation.
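Exhaustive tasks such as QAMPARI and PMBench reward finding every correct item, not just one. The sketch below shows one plausible way to score a semicolon-separated list answer with set precision, recall, and F1 after light normalization. It is an illustrative assumption, not KARLBench's actual grader, and the function names are hypothetical.

```python
import re


def normalize(item: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (a simple assumed normalization)."""
    item = re.sub(r"[^\w\s]", "", item.lower())
    return " ".join(item.split())


def score_list_answer(predicted: str, gold: list[str]) -> dict[str, float]:
    """Score a semicolon-separated answer against a gold item list with set precision/recall/F1."""
    pred_set = {normalize(p) for p in predicted.split(";") if p.strip()}
    gold_set = {normalize(g) for g in gold}
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = ["Brazil", "Germany", "Italy", "Argentina", "France", "Uruguay", "England", "Spain"]
# An incomplete answer: every listed item is correct, but two winners are missing.
print(score_list_answer("Brazil; Germany; Italy; Argentina; France; Uruguay", gold))
# precision 1.0, recall 0.75
```

Recall is the number that exposes an agent that stops searching too early, which is exactly the failure exhaustive-search tasks are designed to catch.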

Agentic Data Synthesis Pipeline

Our two-phase agentic pipeline dynamically explores corpora and iteratively bootstraps from increasingly capable models to generate diverse, grounded, and difficult training data.

Question-Answer Synthesis → Deduplication Agent → Solver Agent Attempts → Pass Rate Filtering → Quality Filter Agent → Off-Policy RL Training Data
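The deduplication and pass-rate filtering stages of the pipeline can be sketched as follows. The sketch assumes that a synthesized question is kept only when the solver agent's pass rate over several attempts lies strictly between a lower and an upper threshold (so questions that are always solved or never solved are dropped), and that deduplication is an exact match on normalized question text. The thresholds, the `QAPair` type, and the function names are hypothetical; in the pipeline described above, deduplication and quality filtering are performed by agents rather than by string rules.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str
    # Per-attempt correctness of the solver agent (True = correct); hypothetical field.
    solver_results: list[bool] = field(default_factory=list)


def deduplicate(pairs: list[QAPair]) -> list[QAPair]:
    """Drop pairs whose normalized question was already seen (stand-in for the dedup agent)."""
    seen: set[str] = set()
    unique = []
    for p in pairs:
        key = " ".join(p.question.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def filter_by_pass_rate(pairs: list[QAPair], low: float = 0.0, high: float = 1.0) -> list[QAPair]:
    """Keep pairs whose solver pass rate is strictly between low and high (assumed thresholds)."""
    kept = []
    for p in pairs:
        if not p.solver_results:
            continue  # no attempts recorded, so difficulty is unknown
        rate = sum(p.solver_results) / len(p.solver_results)
        if low < rate < high:
            kept.append(p)
    return kept


pairs = [
    QAPair("What is X?", "a", [True, True, True, True]),       # always solved: too easy
    QAPair("what is  x?", "a", [True, False, True, False]),    # duplicate of the first
    QAPair("Who wrote Y?", "b", [True, False, False, False]),  # sometimes solved: kept
    QAPair("Where is Z?", "c", [False, False, False, False]),  # never solved: dropped
]
training_data = filter_by_pass_rate(deduplicate(pairs))
print([p.question for p in training_data])  # ['Who wrote Y?']
```

Filtering on pass rate keeps the questions that carry learning signal for RL: if every attempt succeeds or every attempt fails, the rollouts give no contrast between better and worse behavior.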

Cost & Latency Advantage

Cost per query (single call): under $0.10

KARL defines the Pareto frontier for cost-quality and latency-quality trade-offs, demonstrating frontier-quality search at a fraction of the cost and latency of alternatives. With parallel sampling, KARL matches Claude Opus 4.6 quality at roughly 33% lower cost per query.
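A configuration is on the cost-quality Pareto frontier when no other configuration is both cheaper and better. The sketch below computes that frontier for (cost, score) points; the model names and numbers are placeholders for illustration, not results from the paper.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # e.g. USD per query (lower is better)
    score: float  # benchmark quality (higher is better)


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points not dominated by any other point, sorted by cost."""
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; among equal costs, highest score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # On the frontier only if it beats every cheaper (or equal-cost) point's score.
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier


# Placeholder configurations for illustration only.
points = [
    Point("model_a", 0.05, 50.0),
    Point("model_b", 0.08, 48.0),  # dominated by model_a: costs more, scores less
    Point("model_c", 0.20, 60.0),
    Point("model_d", 0.30, 58.0),  # dominated by model_c
]
print([p.name for p in pareto_frontier(points)])  # ['model_a', 'model_c']
```

The same function covers the latency-quality trade-off if the `cost` field holds latency instead of dollars.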

Parallel Thinking: Test-Time Compute

A general-purpose test-time compute (TTC) strategy that generates multiple independent rollouts in parallel and aggregates them into a single final answer, boosting performance across all benchmarks.

Prompt (x) → Solver Agent (π) → Rollouts Y1…YN → Aggregator Agent (π) → Final Answer
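The parallel thinking flow (one prompt, N independent rollouts, one aggregation step) can be sketched with a thread pool. In KARL the aggregator is itself an agent running the same policy π; in this sketch a majority vote over final answers stands in for it, and `solve` is a hypothetical stub in place of a real solver-agent call.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def solve(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one solver-agent rollout; a real one would search and reason."""
    rng = random.Random(seed)
    # Simulate a noisy solver that answers correctly about 60% of the time.
    return "Paris" if rng.random() < 0.6 else rng.choice(["Lyon", "Marseille"])


def aggregate(rollouts: list[str]) -> str:
    """Majority vote over rollout answers; a simple stand-in for KARL's aggregator agent."""
    return Counter(rollouts).most_common(1)[0][0]


def parallel_thinking(prompt: str, n: int = 8) -> str:
    # Generate N independent rollouts Y1..YN concurrently.
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(lambda i: solve(prompt, seed=i), range(n)))
    # Combine them into a single final answer.
    return aggregate(rollouts)


print(parallel_thinking("What is the capital of France?", n=16))
```

Majority voting only works when answers are short and directly comparable. For open-ended outputs such as TREC-Biogen reports, an aggregator agent that reads every rollout and synthesizes a unified answer, as described above, is the more natural design.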

Parallel Thinking Performance Boost

+5.9 points: maximum average score increase with Parallel Thinking

Parallel Thinking substantially boosts KARL's performance, with gains of up to +5.9 points on TREC-Biogen, while preserving its strong out-of-distribution generalization.

Compression Impact on Performance

RL training significantly improves the model's ability to identify and retain relevant information during context compression, a critical capability for long-horizon search tasks.

Ablation | Setting | BrowseComp-Plus Score | BrowseComp-Plus Recall
Compression | With | 0.570 | 0.681
Compression | Without | 0.389 | 0.503
Retrieval | Qwen3-Embedding-8B | 0.570 | 0.681
Retrieval | Vector Search (GTE-large + hybrid) | 0.568 | 0.698
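Context compression replaces older parts of a long search trajectory with a shorter summary so the agent can keep working within its context window. The sketch below shows the mechanics under stated assumptions: tokens are approximated by whitespace-separated words, `summarize` is a hypothetical stub for the model's own compression step (the capability that RL improves in KARL), and the most recent `keep_last` turns are always kept verbatim.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words (an assumption)."""
    return len(text.split())


def summarize(turns: list[str]) -> str:
    """Hypothetical stand-in for the model compressing its own older turns.

    A trained agent would keep only task-relevant facts; this stub keeps each turn's first line.
    """
    return "SUMMARY: " + " | ".join(t.splitlines()[0] for t in turns if t)


def compress_context(history: list[str], budget: int, keep_last: int = 2) -> list[str]:
    """Fold all but the last `keep_last` turns into one summary when over the token budget."""
    total = sum(count_tokens(t) for t in history)
    if total <= budget or len(history) <= keep_last:
        return history  # fits already, or nothing old enough to compress
    split = len(history) - keep_last
    return [summarize(history[:split])] + history[split:]


history = [
    "search: author of The Trial\n" + "filler " * 50,
    "search: Institute for Advanced Study physicists\n" + "filler " * 50,
    "read: candidate list narrowed to three names",
    "think: verify birthplaces next",
]
compressed = compress_context(history, budget=60)
print(len(compressed))  # 3
print(compressed[0])  # SUMMARY: search: author of The Trial | search: Institute for Advanced Study physicists
```

The quality of `summarize` is what the ablation measures: a summary that drops a key fact forces the agent to rediscover it or answer without it.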

RL Generalizes Beyond Sharpening

Rather than merely sharpening existing capabilities, RL training in KARL develops new problem-solving capabilities. Max@K performance improves at every value of K, indicating broader problem-solving coverage rather than probability mass concentrated on solutions the model could already find. This broader coverage translates directly into larger test-time compute gains, showing that RL training expands the model's repertoire.
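Max@K can be read as the expected best score among K attempts at a problem. Assuming that definition (the paper's exact metric may differ), the sketch below estimates it without bias from n ≥ K scored samples: after sorting the scores in ascending order, the i-th smallest score is the maximum of a random K-subset with probability C(i-1, K-1)/C(n, K). For 0/1 scores this reduces to the familiar pass@k estimator 1 - C(n-c, K)/C(n, K), where c is the number of correct samples.

```python
from math import comb


def max_at_k(scores: list[float], k: int) -> float:
    """Unbiased estimate of E[max score over k samples drawn without replacement]."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(scores)")
    ordered = sorted(scores)
    # The i-th smallest score (1-indexed) is the max of a random k-subset exactly
    # when the other k-1 picks all come from the i-1 smaller scores.
    weighted = sum(s * comb(i - 1, k - 1) for i, s in enumerate(ordered, start=1))
    return weighted / comb(n, k)


# Hypothetical per-sample scores for one question (e.g. 8 rollouts graded in [0, 1]).
scores = [0.0, 0.0, 0.2, 0.5, 0.5, 0.7, 1.0, 0.0]
for k in (1, 2, 4, 8):
    print(k, round(max_at_k(scores, k), 3))
```

Averaging this estimate over all problems gives a Max@K curve; comparing the curves before and after RL at every K is the check that distinguishes expanded coverage from sharpening.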

Search Behavior Profiles

KARL's search behavior profile is similar to that of Claude Sonnet 4.5: both most often follow an efficient 'Explore then Commit' strategy (65% and 58%, respectively). This contrasts with GLM 4.5 Air, which explores then commits far less often (38%) and is much more often 'Confidently Wrong Early' (18% vs. 3% for KARL), indicating that KARL more reliably explores before committing to an answer.

Behavior Category | GLM 4.5 Air | KARL | Sonnet 4.5
Explore then Commit | 38% | 65% | 58%
Explore then Verify | 39% | 23% | 27%
Giving Up Early | 5% | 7% | 13%
Confidently Wrong Early | 18% | 3% | 2%
Running Out of Context | 0% | 0% | 0%
Exhaustive Search, No Convergence | 0% | 2% | 0%


Your KARL Implementation Roadmap

A structured approach to integrating KARL into your enterprise, maximizing impact and ensuring a smooth transition to advanced AI capabilities.

Phase 1: Needs Assessment & Data Integration

Understand your specific enterprise search challenges and integrate proprietary data into a secure, optimized vector search index. Leverage our corpus construction methodology to ensure data readiness without extensive preprocessing.

Phase 2: Agent Customization & Synthetic Data Generation

Tailor KARL agents to your unique tasks using our agentic synthesis pipeline. Generate high-quality, grounded training data through iterative bootstrapping, ensuring your agents learn to reason over your specific knowledge domains.

Phase 3: Multi-task RL Training & Optimization

Train KARL with our iterative large-batch off-policy RL (OAPL) framework, enabling out-of-distribution generalization. Optimize for cost-quality and latency-quality trade-offs, leveraging multi-task learning for robust performance across the diverse search behaviors your tasks require.

Phase 4: Test-Time Compute Scaling & Deployment

Enhance agent performance with parallel thinking and value-guided search. Deploy KARL agents into your production environment, benefiting from a scalable, efficient architecture that drives state-of-the-art grounded reasoning.

Ready to Transform Your Enterprise AI?

Connect with Databricks AI Research experts to explore how KARL can unlock new levels of efficiency and intelligence for your grounded reasoning tasks.
