Research & Analysis
Unlock AI's Full Potential for Your Enterprise
This paper introduces On-Policy Expert Corrections (OECs), a novel data generation technique inspired by DAgger, to mitigate covariate shift in multi-turn LM agent training, specifically for software engineering (SWE) tasks. Traditional imitation learning often suffers when a student model deviates from expert trajectories and encounters states unseen during training. OECs address this by rolling out student trajectories and switching to an expert midway, combining on-policy data collection with expert guidance. Experiments on SWE-bench Verified tasks with Qwen2.5-Coder models show OEC trajectories yield relative improvements of 14% and 13% over traditional imitation learning in the 7B and 32B settings, respectively. The research highlights the need to combine expert demonstrations with on-policy data and robust filtering (such as repetition filtering), beyond unit-test-based rejection sampling, for effective, stable multi-turn LM agent training.
Key Metrics & Impact
Direct insights from the research demonstrating the tangible benefits of On-Policy Expert Corrections in multi-turn LM agent performance.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Addressing Covariate Shift in Multi-Turn LLMs
Traditional imitation learning for multi-turn LM agents suffers from covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning.
This is particularly problematic in complex tasks like software engineering (SWE), where errors compound over many turns, leading to environment states and agent outputs that differ vastly from the expert demonstrations.
On-Policy Expert Corrections (OECs)
Inspired by the DAgger algorithm, On-Policy Expert Corrections (OECs) are a novel data generation methodology: rollouts begin with the student model and then switch to an expert model partway through the trajectory. This produces partially on-policy data, so the training data reflects states the student actually encounters while still leveraging expert guidance to complete the task.
OECs combine the strengths of on-policy approaches (like RL) with imitation learning from expert data, and allow verifier rewards (e.g., rejection sampling) to be integrated because trajectories are rolled out to completion.
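As an illustration, here is a minimal sketch of OEC trajectory generation, assuming a generic student policy, expert policy, and environment interface. The function and the `reset()`/`step()` API are hypothetical stand-ins for your agent harness, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    observation: str
    action: str
    from_expert: bool  # recorded so expert-only (on-policy) masking can be applied later


def generate_oec_trajectory(
    env,                                   # hypothetical environment with reset()/step()
    student: Callable[[List[Turn]], str],  # student policy: history -> next action
    expert: Callable[[List[Turn]], str],   # expert policy: history -> next action
    switch_turn: int,                      # turn index at which the expert takes over
    max_turns: int = 50,
) -> List[Turn]:
    """Roll out the student for `switch_turn` turns, then let the expert finish."""
    history: List[Turn] = []
    obs = env.reset()
    for t in range(max_turns):
        use_expert = t >= switch_turn
        action = (expert if use_expert else student)(history)
        history.append(Turn(observation=obs, action=action, from_expert=use_expert))
        obs, done = env.step(action)
        if done:
            break
    return history
```

Because the expert completes the rollout, the resulting trajectory can still be scored end-to-end (e.g., against unit tests) for rejection sampling.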
Performance Gains & Data Quality
Experiments on SWE-bench Verified tasks show that OEC trajectories lead to relative improvements of 14% and 13% over traditional imitation learning in the 7B and 32B settings, respectively.
The study also emphasizes the importance of additional data filtering beyond unit-test rejection sampling, specifically repetition filtering, to prevent performance degradation from low-quality on-policy trajectories (e.g., models getting stuck in loops).
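For example, a simple repetition filter might drop any trajectory in which the agent issues the same action too many times in a row. The heuristic and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
from typing import List


def has_repetition(actions: List[str], max_consecutive_repeats: int = 3) -> bool:
    """Return True if any action repeats more than `max_consecutive_repeats` times in a row."""
    run = 1
    for prev, curr in zip(actions, actions[1:]):
        run = run + 1 if curr == prev else 1
        if run > max_consecutive_repeats:
            return True
    return False


def filter_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Drop trajectories where the agent appears stuck in a loop."""
    return [t for t in trajectories if not has_repetition(t)]
```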
On-policy masking (training only on the expert portions of OEC trajectories) was found to be crucial for stable training, especially in the 32B setting.
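A minimal sketch of on-policy masking is shown below, assuming labels are built per turn and that the trainer ignores the -100 label value (PyTorch's default cross-entropy `ignore_index`). The helper name and data layout are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in PyTorch by default


def build_labels(turn_token_ids: List[List[int]], turn_from_expert: List[bool]) -> List[int]:
    """Concatenate per-turn token ids into labels, masking out student-generated turns."""
    labels: List[int] = []
    for token_ids, from_expert in zip(turn_token_ids, turn_from_expert):
        if from_expert:
            labels.extend(token_ids)                         # supervise expert tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))   # ignore student tokens
    return labels
```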
OEC Trajectory Generation Process
| Feature | Behavioral Cloning (BC) | On-Policy Expert Corrections (OECs) | On-Policy Trajectories (Student Only) |
|---|---|---|---|
| Covariate Shift Mitigation | Limited | High (partially on-policy) | Moderate (relies on student skill) |
| Leverages Expert Data | Fully (from start) | Partially (after switch) | No direct expert guidance |
| Verifier Reward Integration | Difficult (short rollouts) | Yes (full trajectory to completion) | Yes |
| Performance on SWE-bench | Baseline (Good) | Superior (13-14% relative improvement) | Degraded/Unstable |
Real-World Impact: SWE-Agent Debugging
A student agent struggling with a YamlLint bug (failing to localize the issue after 29 turns) was able to complete the task once the expert model took over the trajectory via OECs: the expert localized the bug, wrote a patch, and resolved the problem. This demonstrates OECs' ability to guide students through difficult problem spaces with targeted expert intervention when the student veers off track.
- Student agent fails initial bug localization.
- OEC intervention: Expert model takes over mid-trajectory.
- Expert successfully localizes bug and writes patch.
- Problem resolved, showcasing guided recovery via OECs.
Estimate Your AI Agent ROI
Use our calculator to see the potential time and cost savings from implementing multi-turn LM agents in your enterprise workflows, informed by the efficiency gains demonstrated in this research.
Your Implementation Roadmap
A phased approach to integrating advanced LLM agents using OECs into your enterprise, maximizing efficiency and performance.
Phase 1: Pilot & Data Collection
Identify a critical, multi-turn task suitable for LLM agents. Begin collecting expert demonstration data and initial student rollouts to establish a baseline for OEC generation. Implement basic rejection sampling filters.
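As a starting point, a basic unit-test rejection filter might look like the sketch below; the `run_unit_tests` callable is a hypothetical hook into your own evaluation harness, not a prescribed API.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    trajectories: List[Tuple[str, object]],           # (final_patch, task) pairs
    run_unit_tests: Callable[[str, object], bool],    # applies the patch and reports pass/fail
) -> List[Tuple[str, object]]:
    """Keep only trajectories whose final patch passes the task's unit tests."""
    return [(patch, task) for patch, task in trajectories if run_unit_tests(patch, task)]
```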
Phase 2: OEC Generation & Iterative Fine-Tuning
Systematically generate OEC trajectories by combining student and expert interactions. Iteratively fine-tune student models on this augmented dataset, applying on-policy masking and advanced filters like repetition detection.
Phase 3: Deployment & Continuous Improvement
Deploy the fine-tuned agents in a controlled environment. Monitor performance, collect new on-policy data from deployed agents, and continuously refine OEC strategies for ongoing model improvement and adaptation to new scenarios.
Ready to Transform Your Enterprise with AI Agents?
Our experts are ready to discuss how On-Policy Expert Corrections can enhance your LLM agent initiatives. Book a free consultation today!