Enterprise AI Analysis
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
By Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu
Abstract: Failure attribution in LLM multi-agent systems—identifying the agent and step responsible for task failures—provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available in the public repository.
Executive Impact & Key Findings
This research pioneers automated failure attribution for LLM multi-agent systems, addressing a critical gap in debugging and refinement. By introducing the Who&When dataset and evaluating novel attribution methods, it reveals the complexities and limitations of current AI in diagnosing multi-agent system failures, paving the way for targeted improvements and more robust AI development.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Problem Formulation
Details the definition of decisive errors and the formal problem of automated failure attribution in LLM multi-agent cooperation.
Dataset & Annotation
Introduces the Who&When dataset, its construction, meticulous annotation process, and an analysis of the challenges involved in manual failure attribution.
Attribution Methods
Explores three LLM-based failure attribution methods (All-at-once, Step-by-step, Binary Search) and their performance under various conditions, including context length and tolerance.
Model Performance
Examines the performance of different LLMs, including SOTA reasoning models, on the failure attribution task, highlighting the task's complexity and the impact of reasoning prompts.
Automated Failure Attribution: A New Research Area
The paper formalizes automated failure attribution for LLM multi-agent systems, highlighting its potential to replace labor-intensive manual debugging. This new research problem aims to automatically identify the components responsible for task failures without human intervention, which is crucial for scalable AI system development.
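To make the formulation concrete, here is a minimal sketch in Python of the interface such a system would expose, assuming a failure log is a sequence of (agent, content) steps. The type names and the `attribute_failure` signature are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int    # position of this step in the failure log
    agent: str    # name of the agent that produced it
    content: str  # the agent's message or action at this step

@dataclass
class Attribution:
    agent: str    # "who": the agent responsible for the failure
    step: int     # "when": the decisive error step

def attribute_failure(query: str, log: list[Step]) -> Attribution:
    """Given the task query and its failure log, return the responsible
    agent and the decisive error step, with no human in the loop.
    Concrete judging strategies are sketched later in this analysis."""
    raise NotImplementedError
```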
Who&When Dataset: A Foundational Resource
The Who&When dataset, comprising 127 LLM multi-agent systems with 184 failure logs and fine-grained annotations, is introduced as a critical resource. It captures the 'who' (responsible agent) and 'when' (decisive error step) of failures, providing a rich benchmark for automated attribution research. The meticulous, multi-round human annotation process underscores the complexity of the task itself, demanding significant annotator labor and still leaving a notable rate of uncertain labels.
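As a quick orientation, the sketch below shows one plausible way to load and inspect such annotated logs. The directory layout and field names (`question`, `mistake_agent`, `mistake_step`) are assumptions about the schema, so check the released files before relying on them.

```python
import json
from pathlib import Path

def load_failure_logs(path: str) -> list[dict]:
    """Load annotated failure logs from a directory of JSON files.
    Each record is assumed to carry the task query, the agent
    conversation, and the human labels for the responsible agent
    ("who") and the decisive error step ("when")."""
    return [json.loads(p.read_text()) for p in sorted(Path(path).glob("*.json"))]

# Hypothetical usage: print the who/when labels for each failure.
for rec in load_failure_logs("who_and_when/annotations"):
    print(rec["mistake_agent"], rec["mistake_step"], rec["question"][:60])
```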
| Method | Agent-Level Accuracy | Step-Level Accuracy | Key Advantage |
|---|---|---|---|
| All-at-Once | 54.33% (best) | 12.50% | Single call over the full log; cheapest, and strongest at naming the responsible agent |
| Step-by-Step | 35.20% | 25.51% (best) | Incremental judging localizes the decisive error step most precisely |
| Binary Search | 44.13% | 23.98% | Logarithmic number of judge calls; balances cost against accuracy |
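Of the three methods, binary search is the least self-explanatory, so here is a minimal sketch of it, reusing the `Step` and `Attribution` types from the formulation sketch above. `llm_judge(prompt) -> str` stands in for any chat-completion call, and the prompt wording is illustrative rather than the paper's.

```python
def binary_search_attribution(query: str, log: list[Step], llm_judge) -> Attribution:
    """Halve the search range each round by asking the judge whether the
    decisive error occurs at or before the midpoint of the current span.
    Needs roughly log2(n) judge calls instead of n."""
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        segment = "\n".join(f"[{s.index}] {s.agent}: {s.content}" for s in log[lo:hi + 1])
        prompt = (
            f"Task: {query}\nSteps {lo}-{hi} of a failed multi-agent run:\n{segment}\n"
            f"Does the decisive error occur at step {mid} or earlier? Answer YES or NO."
        )
        if "YES" in llm_judge(prompt).upper():
            hi = mid        # keep the first half (midpoint included)
        else:
            lo = mid + 1    # keep the second half
    return Attribution(agent=log[lo].agent, step=log[lo].index)
```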
Impact of Context Length on Performance
Analysis shows that failure attribution performance declines as context length increases, with step-level accuracy being more sensitive. This reflects the well-known 'needle-in-a-haystack' problem, where LLMs struggle to retrieve specific information from long contexts. The step-by-step method mitigates this by processing the context incrementally, as in the sketch after the takeaway below.
Key Takeaway: Longer contexts severely degrade fine-grained error detection; incremental processing offers a partial solution.
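A sketch of that incremental strategy, under the same assumptions as the earlier snippets (`llm_judge` is a stand-in for any LLM call): the judge sees only the steps so far and stops at the first flagged error, so most verdicts are rendered on short contexts.

```python
def step_by_step_attribution(query: str, log: list[Step], llm_judge) -> Attribution:
    """Walk the log one step at a time and stop at the first step the
    judge flags as the decisive error. Early stopping means the judge
    rarely has to reason over the full log at once."""
    seen: list[str] = []
    for step in log:
        seen.append(f"[{step.index}] {step.agent}: {step.content}")
        prompt = (
            f"Task: {query}\nConversation so far:\n" + "\n".join(seen) +
            f"\nIs step {step.index} the decisive error that dooms the task? Answer YES or NO."
        )
        if "YES" in llm_judge(prompt).upper():
            return Attribution(agent=step.agent, step=step.index)
    last = log[-1]  # fallback: if nothing is flagged, blame the final step
    return Attribution(agent=last.agent, step=last.index)
```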
Challenges with SOTA Reasoning Models
Even advanced reasoning models like OpenAI o1 and DeepSeek R1 fail to achieve practical usability in automated failure attribution, often underperforming standard GPT-4o on specific metrics. This suggests that simply swapping in a stronger model is not enough; making reasoning explicit in the prompts drives more of the improvement than raw model capability alone.
Hybrid Methods for Enhanced Attribution
Combining different attribution methods, such as using 'all-at-once' for agent identification and then 'step-by-step' for precise step detection, can leverage their respective strengths. A hybrid approach improved step-level accuracy (from 7.90% to 12.28%) but incurred higher computational costs due to sequential execution; a sketch of the two-stage idea follows below. This trade-off suggests avenues for optimization in practical implementations.
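A compact sketch of that two-stage idea, building on the snippets above. `all_at_once_agent` is a hypothetical single-call judge for the 'who' stage, and restricting stage two to that agent's steps is an illustrative simplification of the hybrid described here.

```python
def all_at_once_agent(query: str, log: list[Step], llm_judge) -> str:
    """Stage 1: one call over the full log, asking only for the agent."""
    transcript = "\n".join(f"[{s.index}] {s.agent}: {s.content}" for s in log)
    prompt = (f"Task: {query}\nFull failure log:\n{transcript}\n"
              "Which agent caused the failure? Reply with the agent name only.")
    return llm_judge(prompt).strip()

def hybrid_attribution(query: str, log: list[Step], llm_judge) -> Attribution:
    """Stage 2: rerun the step-by-step judge, but only over the steps
    produced by the agent named in stage 1. The extra sequential calls
    are the computational cost noted above."""
    who = all_at_once_agent(query, log, llm_judge)
    agent_steps = [s for s in log if s.agent == who] or log  # fall back to full log
    return step_by_step_attribution(query, agent_steps, llm_judge)
```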
Calculate Your Potential AI ROI
Estimate the significant time and cost savings your enterprise could achieve by automating failure attribution and optimizing multi-agent systems.
Your AI Implementation Roadmap
A strategic, phased approach to integrating automated failure attribution into your enterprise AI systems for maximum impact.
Phase 1: Diagnostic Integration
Integrate automated failure attribution modules directly into your existing LLM multi-agent system's monitoring pipeline. This involves configuring the 'Who&When' dataset for tailored evaluations and setting up initial 'all-at-once' and 'step-by-step' judgment protocols.
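One way Phase 1 might look as configuration, purely as a hypothetical sketch building on the snippets above: every name here (the hook, the config keys, the model string, the dataset path) is invented for illustration, not the API of any particular framework.

```python
ATTRIBUTION_CONFIG = {
    "eval_dataset": "who_and_when/annotations",   # calibration data (path is hypothetical)
    "methods": ["all_at_once", "step_by_step"],   # initial judgment protocols
    "judge_model": "gpt-4o",                      # any capable LLM judge
    "trigger": "task_failure",                    # attribute only failed runs
}

def on_task_failure(query: str, log: list[Step], llm_judge) -> dict:
    """Hypothetical monitoring hook: run each configured protocol on a
    failed run and emit a report for the debugging dashboard."""
    methods = {
        "all_at_once": lambda: all_at_once_agent(query, log, llm_judge),
        "step_by_step": lambda: step_by_step_attribution(query, log, llm_judge),
    }
    return {name: methods[name]() for name in ATTRIBUTION_CONFIG["methods"]}
```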
Phase 2: Performance Baseline & Refinement
Establish baseline accuracy for agent-level and step-level attribution using initial configurations. Identify common failure patterns and iteratively refine LLM prompts and attribution logic based on 'Who&When' insights. Experiment with hybrid attribution methods to balance accuracy and computational cost.
Phase 3: Scalable Deployment & Continuous Learning
Deploy the refined automated attribution system to monitor live multi-agent operations. Implement feedback loops to continuously improve attribution models with new failure data. Explore advanced techniques like reasoning mechanisms and context handling strategies to enhance long-term performance and reduce manual debugging overhead.
Ready to Automate Your AI Debugging?
Unlock the full potential of your LLM multi-agent systems with advanced failure attribution. Schedule a free, no-obligation consultation with our AI experts today.