TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
Unlocking Robust Multi-Turn Agent AI
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student–teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
Key Impact Metrics
Discover the quantifiable improvements TCOD brings to multi-turn autonomous agents.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Vanilla On-Policy Distillation (OPD) faces a critical challenge in multi-turn agent settings: Trajectory-Level KL Instability. It manifests as KL divergence that escalates as turns accumulate, leading to collapsing success rates and unstable training. The root cause is inter-turn error compounding: accumulated errors push the student model outside the teacher's effective support, making the supervision signal unreliable.
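As a toy illustration (not the paper's implementation), the instability can be monitored by tracking per-turn KL between the student's and teacher's action distributions over a trajectory; all function names below are hypothetical:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trajectory_kl(student_turns, teacher_turns):
    """Per-turn KL along a trajectory; in vanilla OPD this tends to grow
    with turn index as the student drifts from the teacher's support."""
    return [kl_divergence(s, t) for s, t in zip(student_turns, teacher_turns)]
```

In practice one would compute this over token-level log-probabilities from both models; the point here is only that the per-turn KL sequence, not just its mean, is the quantity to watch.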
TCOD (Temporal Curriculum On-Policy Distillation) is proposed as a simple yet effective solution. It controls the trajectory depth exposed to the student during training, progressively expanding it from short to long horizons. This curriculum schedule helps mitigate the instability observed in vanilla OPD by providing a more stable learning signal.
Forward-to-Backward (TCOD-F2B) implements a 'shallow-to-deep' curriculum. The student policy rolls out for at most 'k' steps, with 'k' starting small and progressively increasing. This focuses learning on early-turn signals first, building a robust foundation before tackling full trajectories, thereby mitigating compounding errors and preventing KL escalation.
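A minimal sketch of the F2B idea, assuming a generic environment step function and policy (both hypothetical stand-ins, not the paper's code): the rollout is truncated at the current depth cap 'k', and a simple linear schedule expands 'k' toward the full horizon.

```python
def f2b_rollout(step_fn, policy, obs, k_max):
    """Roll the student out for at most k_max turns (F2B truncation).
    step_fn(obs, action) -> (next_obs, done); policy(obs) -> action."""
    trajectory = []
    for _ in range(k_max):
        action = policy(obs)
        obs, done = step_fn(obs, action)
        trajectory.append(action)
        if done:
            break
    return trajectory

def f2b_schedule(total_stages, horizon):
    """Linearly expand the depth cap k from shallow up to the full horizon."""
    return [max(1, round(horizon * (s + 1) / total_stages))
            for s in range(total_stages)]
```

Any monotone schedule works; the linear one is just the simplest choice for illustration.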
Backward-to-Forward (TCOD-B2F) leverages the teacher as a 'navigator'. The teacher executes the initial L-k steps of a successful trajectory, placing the student in a near-terminal state. The student then takes over for the remaining 'k' steps, progressively learning to complete tasks from earlier starting points as 'k' increases. This bypasses early-turn error accumulation by starting the student on teacher-vetted prefixes.
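The B2F handoff can be sketched as follows (an illustrative helper, not the paper's implementation): given a successful teacher trajectory of length L and the current student-owned suffix length 'k', the teacher prefix to replay is the first L-k steps, and 'k' grows over training via the same kind of expanding schedule as in F2B.

```python
def b2f_start_state(teacher_trajectory, k):
    """Split a successful teacher trajectory for B2F: the teacher 'navigates'
    the first L-k steps, and the student takes over for the last k.
    Returns (teacher prefix to replay, number of student-owned steps)."""
    L = len(teacher_trajectory)
    k = min(k, L)  # once k reaches L, the student owns the whole trajectory
    prefix = teacher_trajectory[: L - k]
    return prefix, k
```

Replaying the prefix through the environment places the student in a near-terminal, teacher-vetted state before its own rollout begins.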
TCOD Operational Flow
| Feature | Vanilla OPD | TCOD |
|---|---|---|
| Multi-turn Stability | KL divergence escalates as turns accumulate, destabilizing training | Curriculum over trajectory depth keeps KL stable throughout training |
| Performance | Success rates collapse as inter-turn errors compound | Up to 18 points higher success rate; can even surpass the teacher |
| Training Efficiency | Supervision becomes unreliable once the student leaves the teacher's support | Early learning signals come from short, reliable horizons before expanding |
Generalization Beyond Teacher Capabilities
TCOD-B2F significantly surpasses the teacher's performance on challenging 'Hard' split tasks of ALFWorld. The teacher model itself failed under pass@10 sampling on these tasks (SR 6.61%), while TCOD-B2F achieved an impressive 20.66% success rate, a 14-point gain. This demonstrates TCOD's ability to develop a more robust policy that generalizes beyond the teacher's own capability boundary, rather than merely imitating it.
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could achieve with AI-powered multi-turn agents.
Your AI Implementation Roadmap
A structured approach to integrating TCOD for maximum enterprise impact.
Phase 1: Discovery & Strategy
We collaborate to understand your specific multi-turn agent needs, existing infrastructure, and define clear objectives and success metrics for TCOD implementation.
Phase 2: Pilot & Customization
A tailored TCOD solution is developed and deployed in a pilot environment, focusing on curriculum design (F2B/B2F), teacher model selection, and iterative performance tuning.
Phase 3: Integration & Scaling
Seamless integration with your enterprise systems, comprehensive testing, and phased rollout across your target applications to achieve broad impact and scale efficiently.
Phase 4: Monitoring & Optimization
Continuous monitoring of agent performance, KL stability, and task success rates. Ongoing optimization ensures long-term effectiveness and adaptation to evolving requirements.
Ready to Transform Your Agents?
Leverage Temporal Curriculum On-Policy Distillation to build robust, efficient, and highly performant multi-turn autonomous agents. Book a consultation with our experts today.