Research & Analysis
Unlock AI's Full Potential for Your Enterprise
This paper introduces On-Policy Expert Corrections (OECs), a novel data generation technique inspired by DAgger, to mitigate covariate shift in multi-turn LM agent training, specifically for software engineering (SWE) tasks. Traditional imitation learning often suffers when a student model deviates from expert trajectories and encounters states unseen during training. OECs address this by rolling out student trajectories and switching to an expert midway, combining on-policy data collection with expert guidance. Experiments on SWE-bench Verified tasks with Qwen2.5-Coder models show OEC trajectories yield relative improvements of 14% and 13% over traditional imitation learning in the 7B and 32B settings, respectively. The research highlights the need to combine expert demonstrations with on-policy data and robust filtering (such as repetition filtering), beyond unit-test-based rejection sampling, for effective, stable multi-turn LM agent training.
Key Metrics & Impact
Direct insights from the research demonstrating the tangible benefits of On-Policy Expert Corrections in multi-turn LM agent performance.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Addressing Covariate Shift in Multi-Turn LLMs
Traditional imitation learning for multi-turn LM agents suffers from covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning.
This is particularly problematic in complex tasks like software engineering (SWE), where errors compound over many turns, leading to environment states and agent outputs that differ vastly from the expert demonstrations.
On-Policy Expert Corrections (OECs)
Inspired by the DAgger algorithm, On-Policy Expert Corrections (OECs) are a novel data generation methodology: rollouts begin with the student model and then switch to an expert model partway through the trajectory. This produces partially on-policy data, so the training data reflects states the student actually encounters while still leveraging expert guidance to complete the task.
OECs combine the strengths of on-policy approaches (like RL) with imitation learning from expert data, and allow verifier rewards (e.g., rejection sampling) to be integrated because trajectories are rolled out to completion.
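As an illustration, here is a minimal sketch of OEC trajectory generation, assuming a generic student policy, expert policy, and environment interface. The function and the `reset()`/`step()` API are hypothetical stand-ins for your agent harness, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    observation: str
    action: str
    from_expert: bool  # recorded so expert-only (on-policy) masking can be applied later


def generate_oec_trajectory(
    env,                                   # hypothetical environment with reset()/step()
    student: Callable[[List[Turn]], str],  # student policy: history -> next action
    expert: Callable[[List[Turn]], str],   # expert policy: history -> next action
    switch_turn: int,                      # turn index at which the expert takes over
    max_turns: int = 50,
) -> List[Turn]:
    """Roll out the student for `switch_turn` turns, then let the expert finish."""
    history: List[Turn] = []
    obs = env.reset()
    for t in range(max_turns):
        use_expert = t >= switch_turn
        action = (expert if use_expert else student)(history)
        history.append(Turn(observation=obs, action=action, from_expert=use_expert))
        obs, done = env.step(action)
        if done:
            break
    return history
```

Because the expert completes the rollout, the resulting trajectory can still be scored end-to-end (e.g., against unit tests) for rejection sampling.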
Performance Gains & Data Quality
Experiments on SWE-bench Verified tasks show that OEC trajectories lead to relative improvements of 14% and 13% over traditional imitation learning in the 7B and 32B settings, respectively.
The study also emphasizes the importance of additional data filtering beyond unit-test rejection sampling, specifically repetition filtering, to prevent performance degradation from low-quality on-policy trajectories (e.g., models getting stuck in loops).
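For example, a simple repetition filter might drop any trajectory in which the agent issues the same action too many times in a row. The heuristic and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
from typing import List


def has_repetition(actions: List[str], max_consecutive_repeats: int = 3) -> bool:
    """Return True if any action repeats more than `max_consecutive_repeats` times in a row."""
    run = 1
    for prev, curr in zip(actions, actions[1:]):
        run = run + 1 if curr == prev else 1
        if run > max_consecutive_repeats:
            return True
    return False


def filter_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Drop trajectories where the agent appears stuck in a loop."""
    return [t for t in trajectories if not has_repetition(t)]
```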
On-policy masking (training only on the expert portions of OEC trajectories) was found to be crucial for stable training, especially in the 32B setting.
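A minimal sketch of on-policy masking is shown below, assuming labels are built per turn and that the trainer ignores the -100 label value (PyTorch's default cross-entropy `ignore_index`). The helper name and data layout are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in PyTorch by default


def build_labels(turn_token_ids: List[List[int]], turn_from_expert: List[bool]) -> List[int]:
    """Concatenate per-turn token ids into labels, masking out student-generated turns."""
    labels: List[int] = []
    for token_ids, from_expert in zip(turn_token_ids, turn_from_expert):
        if from_expert:
            labels.extend(token_ids)                         # supervise expert tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))   # ignore student tokens
    return labels
```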
OEC Trajectory Generation Process
| Feature | Behavioral Cloning (BC) | On-Policy Expert Corrections (OECs) | On-Policy Trajectories (Student Only) |
|---|---|---|---|
| Covariate Shift Mitigation | Limited | High (partially on-policy) | Moderate (relies on student skill) |
| Leverages Expert Data | Fully (from start) | Partially (after switch) | No direct expert guidance |
| Verifier Reward Integration | Difficult (short rollouts) | Yes (full trajectory to completion) | Yes |
| Performance on SWE-bench | Baseline (Good) | Superior (13-14% relative improvement) | Degraded/Unstable |
Real-World Impact: SWE-Agent Debugging
A student agent struggling with a YamlLint bug (failing to localize the issue after 29 turns) was able to complete the task once the expert model took over the trajectory via OECs: the expert localized the bug, wrote a patch, and resolved the problem. This demonstrates OECs' ability to guide students through difficult problem spaces with targeted expert intervention when the student veers off track.
- Student agent fails initial bug localization.
- OEC intervention: Expert model takes over mid-trajectory.
- Expert successfully localizes bug and writes patch.
- Problem resolved, showcasing guided recovery via OECs.
Estimate Your AI Agent ROI
Use our calculator to see the potential time and cost savings from implementing multi-turn LM agents in your enterprise workflows, informed by the efficiency gains demonstrated in this research.
Your Implementation Roadmap
A phased approach to integrating advanced LLM agents using OECs into your enterprise, maximizing efficiency and performance.
Phase 1: Pilot & Data Collection
Identify a critical, multi-turn task suitable for LLM agents. Begin collecting expert demonstration data and initial student rollouts to establish a baseline for OEC generation. Implement basic rejection sampling filters.
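As a starting point, a basic unit-test rejection filter might look like the sketch below; the `run_unit_tests` callable is a hypothetical hook into your own evaluation harness, not a prescribed API.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    trajectories: List[Tuple[str, object]],           # (final_patch, task) pairs
    run_unit_tests: Callable[[str, object], bool],    # applies the patch and reports pass/fail
) -> List[Tuple[str, object]]:
    """Keep only trajectories whose final patch passes the task's unit tests."""
    return [(patch, task) for patch, task in trajectories if run_unit_tests(patch, task)]
```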
Phase 2: OEC Generation & Iterative Fine-Tuning
Systematically generate OEC trajectories by combining student and expert interactions. Iteratively fine-tune student models on this augmented dataset, applying on-policy masking and advanced filters like repetition detection.
Phase 3: Deployment & Continuous Improvement
Deploy the fine-tuned agents in a controlled environment. Monitor performance, collect new on-policy data from deployed agents, and continuously refine OEC strategies for ongoing model improvement and adaptation to new scenarios.
Ready to Transform Your Enterprise with AI Agents?
Our experts are ready to discuss how On-Policy Expert Corrections can enhance your LLM agent initiatives. Book a free consultation today!