Enterprise AI Research Analysis
ADAPTIVE COLLABORATION WITH HUMANS: METACOGNITIVE POLICY OPTIMIZATION FOR MULTI-AGENT LLMS WITH CONTINUAL LEARNING
Authors: Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu
Affiliation: University of Southern California
Published: 9 Mar 2026
Abstract: While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems. The code is available at https://github.com/USC-Melady/HILA.git.
Executive Impact for Your Enterprise
This research introduces Human-In-the-Loop Multi-Agent Collaboration (HILA), a novel framework designed to overcome the limitations of purely autonomous multi-agent systems (MAS). HILA integrates human expertise to enable continuous learning and adaptive problem-solving.
A core component of HILA is its metacognitive policy, which allows agents to strategically decide when to solve problems autonomously and when to defer to human experts. This policy balances the benefits of collective intelligence with the need for external, high-quality guidance.
The framework utilizes Dual-Loop Policy Optimization (DLPO), a training methodology that combines reinforcement learning (for immediate deferral decisions) and continual learning (for long-term capability growth from expert feedback). This ensures agents not only make better decisions but also continually improve their underlying reasoning abilities.
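The description above is conceptual; as a concrete illustration, the Python skeleton below shows one way the two loops could be organized. All names (`collect_episodes`, `grpo_update`, `distill_and_finetune`) and the `Episode` structure are hypothetical placeholders for this sketch, not the HILA codebase's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative skeleton of Dual-Loop Policy Optimization (DLPO).
# Names and signatures are placeholders, not the actual HILA API.

@dataclass
class Episode:
    deferred: bool                              # did the agent invoke the human expert?
    solved: bool                                # was the task ultimately solved?
    trajectory: list = field(default_factory=list)

def dual_loop_training(
    policy,
    tasks: list,
    collect_episodes: Callable[..., List[Episode]],  # rollouts with deferral decisions
    grpo_update: Callable,                           # inner-loop RL step
    distill_and_finetune: Callable,                  # outer-loop SFT step
    n_outer: int = 5,
    n_inner: int = 100,
):
    """Inner loop: GRPO over deferral decisions with a cost-aware reward.
    Outer loop: continual learning from expert-assisted solutions."""
    feedback: List[Episode] = []
    for _ in range(n_outer):
        for _ in range(n_inner):
            episodes = collect_episodes(policy, tasks)
            grpo_update(policy, episodes)
            # Keep only expert-assisted successes as supervision candidates.
            feedback.extend(e for e in episodes if e.deferred and e.solved)
        distill_and_finetune(policy, feedback)  # strengthens base reasoning
        feedback.clear()
    return policy
```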
Experimental results on mathematical and problem-solving benchmarks demonstrate that HILA with DLPO significantly outperforms advanced autonomous MAS, confirming its potential for building robust and continually evolving agentic systems.
Deep Analysis & Enterprise Applications
The HILA Framework
Human-In-the-Loop Multi-Agent Collaboration (HILA) is introduced as a principled paradigm for adaptive human-agent collaboration. It equips agents with a metacognitive policy for deciding when to strategically defer to human expertise, moving MAS beyond their 'closed-world' limitations toward an 'open-world' dynamic of continuous learning and growth.
Dual-Loop Policy Optimization (DLPO)
DLPO is a novel training framework that separates short-term intervention decisions from long-term capability growth. The inner loop, Group Relative Policy Optimization (GRPO), optimizes deferral decisions with cost-aware rewards, while the outer loop implements continual learning by transforming expert feedback into high-quality supervised signals that strengthen reasoning ability.
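To make the inner loop concrete, here is a minimal sketch of a cost-aware reward combined with GRPO-style group-relative advantages, where each reward is normalized against its sampled group as A_i = (r_i - mean(r)) / std(r). The deferral-cost weight and reward values below are illustrative assumptions, not the paper's constants.

```python
import statistics

DEFER_COST = 0.3  # hypothetical penalty for invoking the human expert

def cost_aware_reward(solved: bool, deferred: bool) -> float:
    """Task reward minus a cost term for external intervention."""
    reward = 1.0 if solved else 0.0
    if deferred:
        reward -= DEFER_COST
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each reward against its sampled group:
    A_i = (r_i - mean(r)) / std(r)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 rollouts for one task.
rewards = [cost_aware_reward(s, d)
           for s, d in [(True, False), (True, True), (False, False), (False, True)]]
print(group_relative_advantages(rewards))
```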
Consistent Outperformance
Experiments on challenging mathematical (GSM8K, AIME, AMC) and general problem-solving (HumanEval, MMLU) benchmarks show that HILA, equipped with DLPO, consistently outperforms advanced autonomous multi-agent systems. Absolute improvements range from 3.7 to 15.4 points over the strongest baselines.
| Feature | Autonomous MAS | HILA Framework |
|---|---|---|
| Knowledge Source | Static knowledge horizon of pre-trained models ("closed-world") | Pre-trained knowledge plus on-demand human expertise ("open-world") |
| Adaptability | Brittle on tasks requiring knowledge beyond training data | Strategic deferral and expert guidance on novel challenges |
| Failure Mode | Collective failure under novel challenges | Escalation to a human expert when failure risk outweighs intervention cost |
| Policy Optimization | Fixed collaboration protocols | Dual-Loop Policy Optimization (GRPO inner loop, continual-learning outer loop) |
| Learning Mechanism | None after deployment | Expert feedback distilled into supervised signals for continual learning |
Metacognitive Policy in Action
The metacognitive policy enables agents to reason about their self-competence and peer competence. This guides collaboration by determining when to act autonomously (EVAL, CREATE) and when to invoke external expertise (DEFER), balancing risk of failure against intervention costs.
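As an illustration of this decision space, the sketch below hand-codes a risk-versus-cost rule over the three actions. In HILA the policy is learned via reinforcement learning rather than thresholded by hand; the confidence inputs and threshold here are hypothetical.

```python
from enum import Enum

class Action(Enum):
    EVAL = "eval"      # critique/verify a peer's solution autonomously
    CREATE = "create"  # generate a new solution autonomously
    DEFER = "defer"    # escalate to the human expert

def choose_action(self_confidence: float, peer_confidence: float,
                  defer_cost: float = 0.3) -> Action:
    """Illustrative rule: defer only when the expected failure risk
    outweighs the cost of human intervention."""
    autonomous_best = max(self_confidence, peer_confidence)
    if 1.0 - autonomous_best > defer_cost:
        return Action.DEFER
    return Action.CREATE if self_confidence >= peer_confidence else Action.EVAL

print(choose_action(self_confidence=0.4, peer_confidence=0.5))  # Action.DEFER
```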
Impact of Human Proxy Capability
The strength of the external expert significantly impacts HILA's effectiveness. Stronger language models used as proxies consistently lead to better performance, highlighting that strategic intervention is most valuable when the guidance received is of high quality.
Case Study: Reducing Costly Deferrals
Problem: Initially, the unoptimized policy assigned a non-trivial fraction of decisions to DEFER, indicating substantial reliance on external intervention.
Solution: After applying GRPO, the share of DEFER decreases consistently across datasets, with EVAL and CREATE becoming more frequent. This shows agents learn a cost-aware intervention strategy, becoming more selective about invoking external expertise.
Outcome: With full DLPO training, DEFER rates drop substantially further, accompanied by a marked increase in EVAL. This indicates agents become more capable of resolving tasks internally, suggesting DLPO improves underlying reasoning ability and not just deferral decisions.
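For teams instrumenting a similar rollout, a simple way to monitor this shift is to track the share of each action over evaluation episodes. The sketch below uses made-up numbers purely to illustrate the reported trend (DEFER falling, EVAL rising), not the paper's measured rates.

```python
from collections import Counter

def action_shares(actions: list[str]) -> dict[str, float]:
    """Fraction of episodes assigned to each metacognitive action."""
    counts = Counter(actions)
    total = len(actions)
    return {a: counts[a] / total for a in ("EVAL", "CREATE", "DEFER")}

# Placeholder action logs before and after DLPO training.
before = ["DEFER"] * 40 + ["CREATE"] * 35 + ["EVAL"] * 25   # unoptimized policy
after  = ["DEFER"] * 12 + ["CREATE"] * 38 + ["EVAL"] * 50   # after full DLPO

print("before:", action_shares(before))  # high DEFER share
print("after: ", action_shares(after))   # DEFER drops, EVAL rises
```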
Calculate Your Potential ROI
Estimate the transformative impact of Human-In-the-Loop AI on your operations with our interactive ROI calculator.
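As a transparent starting point, the back-of-the-envelope sketch below compares expected failure costs with and without strategic deferral. Every input is a hypothetical placeholder to be replaced with your own operational figures.

```python
# Back-of-the-envelope ROI sketch for a human-in-the-loop deployment.
# All inputs are hypothetical placeholders; substitute your own numbers.

tasks_per_month = 10_000
autonomous_accuracy = 0.78      # assumed baseline MAS accuracy
hila_accuracy = 0.90            # assumed accuracy with HILA-style deferral
defer_rate = 0.12               # fraction of tasks escalated to an expert
cost_per_failure = 50.0         # dollars lost per failed task
cost_per_expert_review = 8.0    # dollars per human intervention

baseline_loss = tasks_per_month * (1 - autonomous_accuracy) * cost_per_failure
hila_loss = tasks_per_month * (1 - hila_accuracy) * cost_per_failure
expert_cost = tasks_per_month * defer_rate * cost_per_expert_review

monthly_savings = baseline_loss - (hila_loss + expert_cost)
print(f"Estimated monthly savings: ${monthly_savings:,.2f}")
```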
Your Adaptive AI Implementation Roadmap
Our phased approach ensures a smooth, effective, and continually optimizing integration of HILA into your enterprise workflows.
Phase 1: Discovery & Strategy Alignment
Understand your unique challenges, identify key use cases, and define clear objectives for HILA implementation. This includes data assessment and initial policy design.
Phase 2: Pilot Deployment & Metacognitive Training
Deploy HILA in a controlled environment, train agents using DLPO with proxy experts, and fine-tune metacognitive policies for optimal deferral and autonomous action.
Phase 3: Human Integration & Continual Learning
Integrate real human experts for targeted interventions, establish feedback loops for data collection, and activate the outer-loop continual learning for sustained capability growth.
Phase 4: Scaling & Advanced Optimization
Expand HILA to broader enterprise functions, implement dynamic collaboration mechanisms, and continuously monitor performance for further optimization and evolution.
Ready to Empower Your Enterprise with Adaptive AI?
Connect with our experts to explore how HILA can transform your multi-agent systems, drive continuous improvement, and unlock new levels of intelligence for your business.