Benchmarking LLM Judges for Faithfulness
C2-Faith: Evaluating Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from the prior context?) and coverage (are the essential intermediate inferences present?). Using controlled perturbations, we construct examples with known causal-error positions and graded coverage deletions.
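The two perturbation families can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: the `PerturbedChain` container and helper names are assumptions, and `corrupt` stands in for whatever step-corruption procedure is used.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class PerturbedChain:
    steps: list                      # reasoning steps after perturbation
    causal_error_pos: Optional[int]  # index of the injected inconsistent step, if any
    coverage_ratio: float            # fraction of the original steps retained

def inject_causal_error(steps, corrupt, rng=None):
    """Replace one step with an inconsistent variant and record its position."""
    rng = rng or random.Random(0)
    pos = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = corrupt(steps[pos])
    return PerturbedChain(perturbed, causal_error_pos=pos, coverage_ratio=1.0)

def delete_for_coverage(steps, retain_ratio, rng=None):
    """Randomly drop steps until only `retain_ratio` of the chain remains."""
    rng = rng or random.Random(0)
    keep = max(1, round(len(steps) * retain_ratio))
    kept = sorted(rng.sample(range(len(steps)), keep))
    return PerturbedChain([steps[i] for i in kept],
                          causal_error_pos=None,
                          coverage_ratio=keep / len(steps))
```

Because the error position and retention ratio are fixed at construction time, judge outputs can later be scored against exact ground truth.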
Executive Impact
LLM judges exhibit varying strengths: DeepSeek-V3.1 leads in binary causal detection, while o4-mini excels at precise error localization. A 26-33 percentage-point gap separates detecting that an error exists from localizing it. Coverage judgments are systematically inflated for incomplete reasoning. Our findings provide practical guidance for selecting judges in process-level evaluation.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Causality: Does each reasoning step logically follow from the steps that precede it? A chain that contains a step inconsistent with its context is causally unfaithful, even if the final answer happens to be correct.
Coverage: Are the critical intermediate inferences actually present? A chain that jumps from problem statement to conclusion while omitting the reasoning that bridges them is incomplete, regardless of surface coherence.
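Operationally, the two dimensions map onto two distinct judge queries. A minimal sketch of what those queries might look like follows; the prompt wording here is an illustrative assumption, not the benchmark's actual prompts:

```python
def _format_chain(steps):
    """Number the steps so the judge can reference a specific position."""
    return "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))

def causality_prompt(steps):
    """Binary detection plus localization of the first inconsistent step."""
    return (
        "Below is a chain-of-thought solution.\n"
        f"{_format_chain(steps)}\n\n"
        "1) Does every step logically follow from the problem and the steps "
        "before it? Answer YES or NO.\n"
        "2) If NO, give the number of the first inconsistent step."
    )

def coverage_prompt(steps):
    """Scalar judgment of how complete the intermediate reasoning is."""
    return (
        "Below is a chain-of-thought solution.\n"
        f"{_format_chain(steps)}\n\n"
        "Rate from 0 to 1 how completely the chain includes the intermediate "
        "inferences needed to reach the conclusion."
    )
```

Keeping the two queries separate is what lets the benchmark score detection, localization, and coverage calibration independently.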
Enterprise Process Flow
| Task | GPT-4.1 | DeepSeek-V3.1 | o4-mini |
|---|---|---|---|
| Exp. 1: error detection | 82.7% | 94.7% | 92.0% |
| Exp. 2: exact-match localization | 57.6% | 55.8% | 68.0% |
| Coverage ρ @70% | 0.331 | 0.149 | 0.331 |
| Coverage bias @10% | +0.35 | +0.43 | +0.28 |
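The table's metrics can be computed from per-example judge outputs roughly as follows. The record layout and the use of Spearman ρ for the coverage correlation are assumptions; the paper's exact aggregation may differ:

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def causal_metrics(examples):
    """examples: (true_error_pos, judge_flagged_error, judge_pos) triples."""
    detect = sum(flag for _, flag, _ in examples) / len(examples)
    exact = sum(flag and pos == true
                for true, flag, pos in examples) / len(examples)
    return detect, exact, detect - exact   # third value is the gap

def coverage_metrics(true_ratios, judged_scores):
    """Rank correlation with true retention, plus mean over-estimation bias."""
    rho = spearman(true_ratios, judged_scores)
    bias = sum(j - t for t, j in zip(true_ratios, judged_scores)) / len(true_ratios)
    return rho, bias
```

A positive coverage bias at low retention is exactly the "systematic inflation" pattern reported above: judges score heavily truncated chains as more complete than they are.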
The Detection-Localization Gap
A consistent pattern across all models is that detection rates substantially exceed exact-match accuracy. For GPT-4.1, the gap is 31.7 percentage points (89.3% vs. 57.6%). Judges often detect that "something is wrong," but pinpointing the exact inconsistent step is far harder.
This gap matters for applications such as targeted chain correction: approximate localization may suffice for some uses, but precise error identification remains a significant challenge.
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could realize by implementing advanced AI solutions for process evaluation.
Your AI Implementation Roadmap
A typical journey to leveraging advanced AI for enhanced operational intelligence and process faithfulness.
Phase 01: Discovery & Strategy
Assess current LLM evaluation gaps, define faithfulness metrics, and align AI strategy with business goals. Identify key reasoning processes to target for improvement.
Phase 02: Pilot & Benchmarking
Deploy C2-Faith benchmark or similar diagnostic tools. Run pilot evaluations with frontier LLM judges on critical reasoning tasks. Measure baseline performance against established faithfulness dimensions.
Phase 03: Customization & Integration
Tailor LLM judge prompts and rubrics based on pilot findings. Integrate feedback mechanisms into your AI development pipeline. Train or fine-tune models to improve causal and coverage faithfulness.
Phase 04: Scalable Deployment & Monitoring
Roll out enhanced LLM evaluation across your organization. Continuously monitor judge performance and reasoning quality. Iterate on models and evaluation methods for sustained improvement.
Ready to Enhance Your AI's Reliability?
Book a personalized consultation to discuss how C2-Faith principles can be applied to your enterprise's AI initiatives.