AI INTERPRETABILITY
Explaining the Reasoning of Large Language Models Using Attribution Graphs
This paper introduces CAGE, a novel framework that improves LLM interpretability by explaining reasoning chains through attribution graphs, offering more faithful and complete insights than existing methods.
Executive Impact: Enhancing Trust and Performance in LLMs
The CAGE framework significantly boosts the interpretability of Large Language Models, leading to measurable improvements in trust, safety, and operational efficiency across various enterprise applications.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Context Attribution via Graph Explanations (CAGE)
The CAGE framework is a novel approach for explaining the reasoning of autoregressive Large Language Models (LLMs). It addresses the limitations of existing context attribution methods which often provide incomplete or misleading explanations by discarding inter-generational influences. CAGE constructs an **attribution graph** that faithfully models LLM reasoning chains, preserving causality and ensuring proper influence propagation from the prompt through prior generations to the generation(s) of interest. This framework consistently enhances the quality of explanations, making LLMs more transparent and trustworthy.
Constructing & Utilizing Attribution Graphs
At the core of CAGE is the **attribution graph**, a directed graph where vertices represent prompt and generated tokens, and edges quantify prediction influence. This graph adheres to two critical properties: **Causality**, ensuring edges point forward in time, and **Row Stochasticity**, where incoming edge weights are non-negative and sum to 1. This construction allows for the marginalization of intermediate contributions along causal paths, providing a complete and faithful context attribution. The graph also visualizes prompt-level explanations and the intricate reasoning pathways within chain-of-thought processes.
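The two graph properties and the path marginalization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the influence matrix is a toy example, and the helper names `normalize_rows` and `marginalize` are hypothetical.

```python
import numpy as np

def normalize_rows(raw):
    """Enforce row stochasticity: clip edge weights to be non-negative,
    then scale each row to sum to 1 (rows with no incoming edges stay zero)."""
    a = np.clip(raw, 0.0, None)
    sums = a.sum(axis=1, keepdims=True)
    return np.divide(a, sums, out=np.zeros_like(a), where=sums > 0)

def marginalize(A, n_prompt):
    """Marginalize intermediate contributions along causal paths so each
    generated token's attribution is expressed purely over prompt tokens."""
    n = A.shape[0]
    T = np.zeros((n, n_prompt))
    for i in range(n_prompt, n):            # causality: only earlier tokens matter
        T[i] = A[i, :n_prompt].copy()       # direct prompt influence
        for k in range(n_prompt, i):        # influence routed via prior generations
            T[i] += A[i, k] * T[k]
    return T

# Toy graph: tokens 0-2 are the prompt, tokens 3-4 are generated.
raw = np.zeros((5, 5))
raw[3, :3] = [0.2, 0.5, 0.3]        # generation 3 draws directly on the prompt
raw[4, :4] = [0.1, 0.1, 0.0, 0.8]   # generation 4 relies mostly on generation 3
A = normalize_rows(raw)
T = marginalize(A, n_prompt=3)
print(T[4])   # token 3's prompt influence is folded into token 4's attribution
```

Because every row of `A` is stochastic, each row of the marginalized attribution `T` also sums to 1, so the explanation remains a complete distribution over prompt tokens even when most influence flows through intermediate generations.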
Qualitative Improvements in LLM Explanations
Qualitative analysis, demonstrated through examples from datasets like Facts and Math, showcases CAGE's superior ability to capture causal influence compared to traditional row attribution methods. For instance, in tasks requiring fact-reuse tracking, CAGE successfully attributes importance to previously generated sentences, avoiding redundant restatement of information. In complex math reasoning, it attributes all critical prompt sentences, ensuring that vital context for answering questions is not ignored, which is a common failure mode for existing methods. This visual fidelity helps in understanding how LLMs truly reason.
Quantitative Validation of CAGE's Performance
CAGE's effectiveness is rigorously validated through quantitative evaluations using metrics such as Attribution Coverage (AC) and Faithfulness (RISE, MAS). Across various models (Llama 3, Qwen 3) and datasets (Facts, Math, MorehopQA), CAGE consistently demonstrates significant improvements. It achieves an average gain of up to **40%** in faithfulness and a maximum gain of **134%**, along with an **85% win rate** against five leading row attribution methods. These results underscore CAGE's ability to produce more faithful and complete explanations, solidifying its position as a robust framework for LLM interpretability.
Enterprise Process Flow: CAGE Framework
CAGE vs. Traditional Row Attribution
| Feature | CAGE Framework | Traditional Row Attribution |
|---|---|---|
| Causal Influence Tracking | Captures both prompt and inter-generational influence along causal paths | Sums only direct prompt influence, discarding inter-generational effects |
| Explanation Completeness | Complete attributions via marginalization of intermediate contributions | Incomplete or misleading; influence of prior generations is lost |
| Graph Properties | Causality (edges point forward in time) and row stochasticity (non-negative incoming weights summing to 1) | No explicit graph structure or constraints |
| Faithfulness & Coverage | Up to 40% average and 134% maximum faithfulness gains; 85% win rate | Lower faithfulness and attribution coverage across models and datasets |
Case Study: Chain-of-Thought Reasoning in Math Problems
In complex Math word problems requiring **chain-of-thought reasoning**, LLMs often generate multiple intermediate steps to arrive at the final answer. Traditional row attribution methods struggle here, as they typically only attribute the final answer directly to the prompt, missing the crucial influence of these intermediate steps. **CAGE, however, constructs an attribution graph that explicitly links each generated step to prior steps and the initial prompt, ensuring that the full causal reasoning path is captured.** This leads to significantly more faithful explanations, accurately highlighting which parts of the prompt and which intermediate calculations were critical for the final correct answer, preventing scenarios where vital context is ignored.
Figure 1: Context attributions explain an autoregressive LLM by identifying how prompt tokens causally influence its output. Current row attribution approaches (middle row) apply a base attribution method M at each generation step, summing only direct prompt influence and discarding inter-generational effects, thus missing causal reasoning. CAGE (bottom row) instead constructs an attribution graph that captures both prompt and inter-generational influence, then marginalizes influence along its paths to produce faithful, causality-respecting context attributions.
Ablation Studies: Validating CAGE's Design Choices
Ablation studies confirm the necessity of CAGE's non-negativity and row-stochasticity constraints. Removing these properties leads to significant degradation in faithfulness and interpretability. For instance, removing row-normalization can cause **value explosions**, where influence on non-target sentences overwhelms relevant attributions. Similarly, allowing negative attributions without proper handling can result in **recurrent sign flips and information cancellation**, making explanations unstable and misleading. These studies underscore that CAGE's design choices are critical for producing stable, interpretable, and faithful influence graphs, ensuring its robust performance across different model scales and tasks.
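The value-explosion failure mode is easy to reproduce numerically. The hypothetical sketch below propagates influence through a chain of generations twice: once with rows normalized to sum to 1, and once with raw row sums greater than 1. The constants and chain length are arbitrary illustrations, not values from the paper.

```python
import numpy as np

N_PROMPT, N_GEN = 2, 8
N = N_PROMPT + N_GEN
raw = np.zeros((N, N))
for i in range(N_PROMPT, N):
    raw[i, :i] = 0.4               # unnormalized: row i sums to 0.4 * i, not 1

def prompt_mass(A):
    """Total attribution mass each generated token assigns to the prompt
    after propagating influence along the graph's causal paths."""
    T = np.zeros((N, N_PROMPT))
    for i in range(N_PROMPT, N):
        T[i] = A[i, :N_PROMPT] + sum(A[i, k] * T[k] for k in range(N_PROMPT, i))
    return T.sum(axis=1)

row_sums = raw.sum(axis=1, keepdims=True)
normalized = np.divide(raw, row_sums, out=np.zeros_like(raw), where=row_sums > 0)

print(prompt_mass(normalized)[-1])  # stays at (numerically) 1.0 at every depth
print(prompt_mass(raw)[-1])         # grows with chain length: a value explosion
```

With row normalization, the total attribution mass of every generated token is conserved at 1; without it, mass compounds at each hop, so attributions for late tokens in a long chain dwarf everything else, mirroring the instability the ablation describes.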
Figure 2: We illustrate the construction of the attribution graph. At each LLM generation step (a), we apply a base attribution method M to measure the influence of the current input on the generation. We perform a non-negative normalization of the influence values and add them to the adjacency matrix (b) of the attribution graph (c), which captures the causal influence of the generation process.
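The per-step construction in Figure 2 can be sketched as follows. This is an illustrative reading, not the paper's code: the base-attribution scores are made up, and clipping negatives before rescaling is one simple choice of non-negative normalization; the paper's exact normalization may differ.

```python
import numpy as np

def add_generation_step(adj, t, base_attr):
    """Insert one generation step into the attribution graph: clip the base
    attribution scores to be non-negative, normalize them to sum to 1, and
    write them as incoming edges for token t (edges only point backward)."""
    w = np.clip(np.asarray(base_attr, dtype=float), 0.0, None)
    if w.sum() > 0:
        w = w / w.sum()
    adj[t, :t] = w
    return adj

# Hypothetical base-attribution scores over a 3-token prompt; in practice these
# would come from the base attribution method M applied at each step.
adj = np.zeros((5, 5))
add_generation_step(adj, 3, [0.3, -0.1, 0.6])        # the negative score is clipped
add_generation_step(adj, 4, [0.0, 0.2, 0.0, 0.6])
print(adj[3])  # row-stochastic incoming edges for generated token 3
```

Writing each step into row `t` with edges only toward indices below `t` keeps the adjacency matrix strictly lower-triangular, which is exactly the causality property: no edge can point forward in time.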
Calculate Your Potential ROI with Explainable AI
Estimate the impact of enhanced LLM interpretability on your operational efficiency and cost savings.
Your Roadmap to Enhanced LLM Transparency
A structured approach to integrating CAGE and advanced interpretability into your enterprise AI initiatives.
Phase 01: Initial Assessment & Strategy
Evaluate current LLM usage, identify key interpretability challenges, and define specific goals for transparency and trust. Develop a tailored strategy for CAGE integration.
Phase 02: Pilot Implementation & Validation
Deploy CAGE on a pilot project, integrating attribution graph generation and context attribution. Validate improved explanation quality against internal benchmarks and user feedback.
Phase 03: Scaled Integration & Training
Roll out CAGE across relevant LLM applications, providing comprehensive training to data scientists, developers, and end-users on interpreting attribution graphs and leveraging insights.
Phase 04: Continuous Improvement & Monitoring
Establish monitoring frameworks to track ongoing interpretability performance. Continuously refine CAGE implementation based on evolving LLM capabilities and business needs.
Ready to Unlock the Full Potential of Your LLMs?
Schedule a free consultation to explore how CAGE can transform your AI initiatives with unparalleled transparency and trustworthiness.