Enterprise AI Analysis: Explaining the Reasoning of Large Language Models Using Attribution Graphs

AI INTERPRETABILITY


This paper introduces CAGE, a novel framework that improves LLM interpretability by explaining reasoning chains through attribution graphs, offering more faithful and complete insights than existing methods.

Executive Impact: Enhancing Trust and Performance in LLMs

The CAGE framework significantly boosts the interpretability of Large Language Models, leading to measurable improvements in trust, safety, and operational efficiency across various enterprise applications.

40% Average Faithfulness Gain
134% Maximum Faithfulness Gain
85% Win Rate Across Metrics
Enhanced Transparency

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

The CAGE Framework
Attribution Graphs
Qualitative Insights
Quantitative Results

Context Attribution via Graph Explanations (CAGE)

The CAGE framework is a novel approach for explaining the reasoning of autoregressive Large Language Models (LLMs). It addresses a key limitation of existing context attribution methods, which often discard inter-generational influences and therefore produce incomplete or misleading explanations. CAGE constructs an **attribution graph** that faithfully models LLM reasoning chains, preserving causality and propagating influence from the prompt through prior generations to the generation(s) of interest. This consistently improves explanation quality, making LLMs more transparent and trustworthy.

Constructing & Utilizing Attribution Graphs

At the core of CAGE is the **attribution graph**, a directed graph whose vertices represent prompt and generated tokens and whose edges quantify prediction influence. The graph satisfies two critical properties: **Causality**, meaning edges only point forward in time, and **Row Stochasticity**, meaning each token's incoming edge weights are non-negative and sum to 1. This construction allows intermediate contributions to be marginalized along causal paths, yielding a complete and faithful context attribution. The graph also visualizes prompt-level explanations and the reasoning pathways within chain-of-thought processes.
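As a concrete illustration, both properties can be enforced at the moment each generation step is added to the graph. The sketch below is our own minimal numpy rendering of the idea, not code from the paper; the function name and the uniform fallback for all-zero scores are assumptions:

```python
import numpy as np

def add_generation_step(A, raw_scores):
    """Append one generation step to the attribution graph.

    A:          (t, t) adjacency matrix over the existing t tokens.
    raw_scores: influence of each prior token on the new token, as
                produced by some base attribution method M; may be negative.
    Returns the (t + 1, t + 1) adjacency matrix.
    """
    t = A.shape[0]
    # Non-negativity: clip negative raw scores to zero.
    w = np.clip(np.asarray(raw_scores, dtype=float), 0.0, None)
    # Row stochasticity: incoming weights of the new token sum to 1
    # (uniform fallback if every score was clipped away -- an assumption).
    s = w.sum()
    w = w / s if s > 0 else np.full(t, 1.0 / t)
    A_new = np.zeros((t + 1, t + 1))
    A_new[:t, :t] = A
    # Causality: the new row only points backward in time; the new
    # column stays zero because earlier tokens cannot depend on it.
    A_new[t, :t] = w
    return A_new
```

Clipping and dividing by the row sum guarantee that each generated token distributes exactly one unit of influence over the tokens that preceded it, which is what keeps later path marginalization stable.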

Qualitative Improvements in LLM Explanations

Qualitative analysis, demonstrated through examples from datasets like Facts and Math, showcases CAGE's superior ability to capture causal influence compared to traditional row attribution methods. For instance, in tasks requiring fact reuse tracking, CAGE successfully attributes importance to previously generated sentences, preventing redundant information. In complex math reasoning, it attributes all critical prompt sentences, ensuring that vital context for answering questions is not ignored, which is a common failure mode for existing methods. This visual fidelity helps in understanding how LLMs truly reason.

Quantitative Validation of CAGE's Performance

CAGE's effectiveness is rigorously validated through quantitative evaluations using metrics such as Attribution Coverage (AC) and Faithfulness (RISE, MAS). Across various models (Llama 3, Qwen 3) and datasets (Facts, Math, MorehopQA), CAGE consistently demonstrates significant improvements. It achieves an average gain of up to **40%** in faithfulness and a maximum gain of **134%**, along with an **85% win rate** against five leading row attribution methods. These results underscore CAGE's ability to produce more faithful and complete explanations, solidifying its position as a robust framework for LLM interpretability.

Enterprise Process Flow: CAGE Framework

Input Prompt & Prior Generations
LLM Generates Next Token
Apply Base Attribution Method M
Construct Attribution Graph (Causality, Row Stochasticity)
Marginalize Contributions along Causal Paths
Output Context Attribution
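The flow above can be sketched end to end. This is a hypothetical interface, not the paper's implementation: the base attribution method M is assumed to run elsewhere and supply raw per-step scores, and influence is assumed to compose multiplicatively along edges and additively across paths.

```python
import numpy as np

def cage_attribution(prompt_len, step_scores):
    """Sketch of the full pipeline for one generation sequence.

    prompt_len:  number of prompt tokens.
    step_scores: step_scores[t] holds the base method M's raw influence
                 scores of all prior tokens on generated token t
                 (a sequence of length prompt_len + t).
    Returns prompt-level attributions for the final generated token.
    """
    n = prompt_len + len(step_scores)
    A = np.zeros((n, n))
    for t, scores in enumerate(step_scores):
        row = prompt_len + t
        w = np.clip(np.asarray(scores, dtype=float), 0.0, None)
        # Non-negativity + row stochasticity; causality holds by
        # construction since only columns < row are filled.
        A[row, :row] = w / w.sum() if w.sum() > 0 else 1.0 / row
    # Marginalize along causal paths: express each generated token's
    # influence purely over prompt tokens, in causal order.
    P = np.zeros((n, prompt_len))
    P[:prompt_len] = np.eye(prompt_len)  # a prompt token attributes to itself
    for i in range(prompt_len, n):
        # direct prompt influence + influence routed through prior generations
        P[i] = A[i, :prompt_len] + A[i, prompt_len:i] @ P[prompt_len:i]
    return P[-1]
```

Because every row of A sums to 1, each marginalized attribution row also sums to 1, so no influence mass is created or lost along the chain.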
40%
Average improvement in attribution faithfulness across diverse models and datasets.

CAGE vs. Traditional Row Attribution

Causal Influence Tracking
  • CAGE Framework: traces influence from the prompt through prior generations, modeling inter-generational effects.
  • Traditional Row Attribution: captures only direct prompt influence, discarding inter-generational effects.

Explanation Completeness
  • CAGE Framework: provides complete, causality-respecting explanations and visualizes reasoning pathways.
  • Traditional Row Attribution: yields incomplete, potentially misleading explanations that assume equal contribution of intermediate tokens.

Graph Properties
  • CAGE Framework: enforces causality and row stochasticity, stabilizing influence propagation.
  • Traditional Row Attribution: no explicit graph structure for causal flow; relies on summation-based aggregation.

Faithfulness & Coverage
  • CAGE Framework: significantly higher faithfulness (gains up to 134%) and better coverage of critical prompt sentences.
  • Traditional Row Attribution: lower faithfulness scores; misses critical context in multi-step reasoning.

Case Study: Chain-of-Thought Reasoning in Math Problems

In complex Math word problems requiring **chain-of-thought reasoning**, LLMs often generate multiple intermediate steps to arrive at the final answer. Traditional row attribution methods struggle here, as they typically only attribute the final answer directly to the prompt, missing the crucial influence of these intermediate steps. **CAGE, however, constructs an attribution graph that explicitly links each generated step to prior steps and the initial prompt, ensuring that the full causal reasoning path is captured.** This leads to significantly more faithful explanations, accurately highlighting which parts of the prompt and which intermediate calculations were critical for the final correct answer, preventing scenarios where vital context is ignored.
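A toy example (with made-up numbers, not values from the paper) shows the failure mode and how path marginalization repairs it: the answer depends on prompt sentence S2 only through an intermediate reasoning step, so direct attribution misses S2, while routing influence through the step recovers it.

```python
import numpy as np

# Tokens: [S1, S2, intermediate step, final answer].
# Row i holds the (row-stochastic) influence of prior tokens on token i.
A = np.zeros((4, 4))
A[2] = [0.1, 0.9, 0.0, 0.0]  # intermediate step draws mostly on S2
A[3] = [0.1, 0.0, 0.9, 0.0]  # answer draws mostly on the intermediate step

# Traditional row attribution: direct prompt influence only.
direct = A[3, :2]                      # S2 gets weight 0.0 -- looks irrelevant
# Path marginalization: also route influence through the step.
cage = A[3, :2] + A[3, 2] * A[2, :2]   # S2 recovers weight 0.81
```

Here `direct` assigns S2 no importance at all, while the marginalized attribution `[0.19, 0.81]` correctly identifies S2 as the dominant cause of the answer.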

Attribution Graph Example from Paper Figure 1

Figure 1: Context attributions explain an autoregressive LLM by identifying how prompt tokens causally influence its output. Current row attribution approaches (middle row) apply a base attribution method M at each generation step, summing only direct prompt influence and discarding inter-generational effects, thus missing causal reasoning. CAGE (bottom row) instead constructs an attribution graph that captures both prompt and inter-generational influence, then marginalizes influence along its paths to produce faithful, causality-respecting context attributions.

85%
Win rate of CAGE across diverse models, datasets, and attribution methods.

Ablation Studies: Validating CAGE's Design Choices

Ablation studies confirm the necessity of CAGE's non-negativity and row-stochasticity constraints. Removing these properties leads to significant degradation in faithfulness and interpretability. For instance, removing row-normalization can cause **value explosions**, where influence on non-target sentences overwhelms relevant attributions. Similarly, allowing negative attributions without proper handling can result in **recurrent sign flips and information cancellation**, making explanations unstable and misleading. These studies underscore that CAGE's design choices are critical for producing stable, interpretable, and faithful influence graphs, ensuring its robust performance across different model scales and tasks.
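The value-explosion effect is easy to demonstrate with toy numbers (ours, not the paper's): raw influence magnitudes compound multiplicatively across a reasoning chain, while row-normalized weights conserve influence mass regardless of chain length.

```python
import numpy as np

# Hypothetical raw influence scores of two prior tokens on each new token.
raw = np.array([3.0, 2.0])
norm = raw / raw.sum()  # row-stochastic version: [0.6, 0.4]

depth = 10  # length of the reasoning chain
# Influence composes multiplicatively along a chain, so the total
# magnitude scales like (row sum) ** depth.
unnormalized = raw.sum() ** depth  # 5.0 ** 10 = 9765625.0: "value explosion"
normalized = norm.sum() ** depth   # ~1.0: influence mass is conserved
```

Negative weights cause the analogous instability in sign: a weight of -1 along a chain flips the sign of propagated influence at every hop, producing the recurrent sign flips and cancellation noted above.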

Attribution Graph Construction from Paper Figure 2

Figure 2: We illustrate the construction of the attribution graph. At each LLM generation step (a), we apply a base attribution method M to measure the influence of the current input on the generation. We perform a non-negative normalization of the influence values and add them to the adjacency matrix (b) of the attribution graph (c), which captures the causal influence of the generation process.

Causal influence traced through all generations, not just direct prompt influence.

Calculate Your Potential ROI with Explainable AI

Estimate the impact of enhanced LLM interpretability on your operational efficiency and cost savings.


Your Roadmap to Enhanced LLM Transparency

A structured approach to integrating CAGE and advanced interpretability into your enterprise AI initiatives.

Phase 01: Initial Assessment & Strategy

Evaluate current LLM usage, identify key interpretability challenges, and define specific goals for transparency and trust. Develop a tailored strategy for CAGE integration.

Phase 02: Pilot Implementation & Validation

Deploy CAGE on a pilot project, integrating attribution graph generation and context attribution. Validate improved explanation quality against internal benchmarks and user feedback.

Phase 03: Scaled Integration & Training

Roll out CAGE across relevant LLM applications, providing comprehensive training to data scientists, developers, and end-users on interpreting attribution graphs and leveraging insights.

Phase 04: Continuous Improvement & Monitoring

Establish monitoring frameworks to track ongoing interpretability performance. Continuously refine CAGE implementation based on evolving LLM capabilities and business needs.

Ready to Unlock the Full Potential of Your LLMs?

Schedule a free consultation to explore how CAGE can transform your AI initiatives with unparalleled transparency and trustworthiness.
