Enterprise AI Analysis
How Much Do LLMs Hallucinate in Document Q&A Scenarios?
This study reveals significant hallucination rates in LLMs for document Q&A. Even top models fabricate answers at a non-trivial rate (1.19% at 32K context), and the rate climbs steeply with context length, exceeding 10% at 200K. Model selection is paramount: some model families resist fabrication far better than others, regardless of size. Temperature tuning is nuanced: T=0.0 is sometimes best for accuracy but dramatically increases coherence loss (infinite repetition loops) at longer contexts. Grounding and fabrication resistance are distinct capabilities, so a model that is good at finding facts may still invent them. Hardware platforms do not significantly affect model behavior. These findings underscore the need for targeted training and robust safeguards in enterprise AI deployments.
Executive Impact: Key Findings
Our 172-billion-token study reveals critical insights for enterprise AI deployment.
Deep Analysis & Enterprise Applications
The modules below explore the specific findings from the research with an enterprise focus.
Even under optimal conditions (32K context, optimal temperature, best hardware), the single best model (GLM 4.5) still fabricates answers (1.19% at 32K). Top-tier models typically show 5-7% fabrication, and the median model roughly 25%.
| Aspect | Traditional Approaches | RIKER Methodology |
|---|---|---|
| Ground Truth | Expensive human annotation, static | Generated from known ground truth, regenerable |
| Contamination Risk | High, vulnerable to models seeing test data | Low, fresh instances prevent contamination |
| Scoring | LLM-as-judge (biased, unreliable) or human | Deterministic, no human annotation/LLM judges |
| Scale | Limited, small datasets | Arbitrary scale, high statistical confidence |
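To make the contrast concrete, here is a minimal sketch of RIKER-style deterministic scoring, assuming the document was generated from known ground truth. The normalization rules, the refusal-marker list, and the `score_answer` helper are illustrative assumptions, not the study's actual implementation.

```python
import re

# Illustrative refusal markers; the study's actual detection rules may differ.
REFUSAL_MARKERS = ("not mentioned", "no information", "cannot find", "does not appear")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for exact comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def score_answer(model_answer: str, ground_truth: str | None) -> str:
    """Deterministically score one answer against generated ground truth.

    ground_truth is None for questions about entities that do not exist
    in the document; the only correct behavior there is a refusal.
    """
    answer = normalize(model_answer)
    if ground_truth is None:
        refused = any(marker in answer for marker in REFUSAL_MARKERS)
        return "correct_refusal" if refused else "fabrication"
    return "correct" if normalize(ground_truth) in answer else "incorrect"
```

Because the verdict is a pure function of strings, the same inputs always yield the same score, with no judge model or human annotator in the loop.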
At 200K context, no model achieves a fabrication rate below 10%; even the best performer (Qwen3 Next 80B-A3B) fabricates answers to 10.25% of questions about non-existent entities.
GLM 4.6 Catastrophic Collapse
GLM 4.6, ranking 6th at 32K (93.26%) and 5th at 128K (85.81%), collapses to last place at 200K (37.65%). Its fabrication rate explodes from 7.04% at 32K to 71.62% at 200K. This demonstrates that advertised context length often significantly exceeds usable capacity.
Key Takeaway: Advertised context length is a poor proxy for usable capacity. Models require testing at actual deployment context lengths.
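Because this kind of degradation only appears at the lengths you actually run, deployment testing can be a simple sweep over candidate context sizes. A minimal sketch, assuming you supply your own benchmark as a callable from context length to fabrication rate:

```python
from typing import Callable

def test_at_deployment_lengths(
    evaluate: Callable[[int], float],  # your benchmark: context tokens -> fabrication rate
    lengths=(32_000, 128_000, 200_000),
) -> dict[int, float]:
    """Run the same fabrication benchmark at each context length you plan to deploy."""
    results = {n: evaluate(n) for n in lengths}
    for n, rate in results.items():
        print(f"{n:>7,} tokens: {rate:.2%} fabrication")
    return results
```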
GLM 4.5 Air and Llama 3.1 70B achieve nearly identical grounding scores (91.47% vs. 90.18%), yet their fabrication rates differ by 46 percentage points (3.37% vs. 49.50%). This highlights that grounding ability and fabrication resistance are distinct capabilities.
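The practical consequence is that both capabilities should be reported separately from the same evaluation run. A minimal sketch, assuming each result record carries a question type and a deterministic verdict like those produced by the scorer sketched earlier:

```python
def grounding_and_fabrication(results: list[dict]) -> tuple[float, float]:
    """Compute grounding score and fabrication rate as independent metrics.

    results: [{"question_type": "answerable" | "nonexistent", "verdict": str}, ...]
    Assumes both question types are present in the run.
    """
    answerable = [r for r in results if r["question_type"] == "answerable"]
    nonexistent = [r for r in results if r["question_type"] == "nonexistent"]
    grounding = sum(r["verdict"] == "correct" for r in answerable) / len(answerable)
    fabrication = sum(r["verdict"] == "fabrication" for r in nonexistent) / len(nonexistent)
    return grounding, fabrication
```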
Estimate Your AI ROI
Estimate the annual savings and hours you could reclaim by deploying enterprise AI solutions with reduced hallucination rates.
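As a rough illustration of the arithmetic behind such an estimate, here is a sketch in Python. Every parameter and default value (query volume, per-error cost, review minutes) is an assumption to replace with your own figures; the fabrication rates echo the top-tier and median figures above.

```python
def estimate_roi(
    queries_per_year: int = 500_000,        # assumed workload
    baseline_fabrication: float = 0.25,     # e.g. a median model
    improved_fabrication: float = 0.05,     # e.g. a top-tier model
    cost_per_fabrication: float = 12.0,     # assumed cost of catching/fixing one bad answer ($)
    review_minutes_per_fabrication: float = 6.0,
) -> tuple[float, float]:
    """Return (annual dollar savings, reclaimed hours) from reduced hallucination."""
    avoided = queries_per_year * (baseline_fabrication - improved_fabrication)
    savings = avoided * cost_per_fabrication
    hours = avoided * review_minutes_per_fabrication / 60
    return savings, hours

savings, hours = estimate_roi()
print(f"Estimated annual savings: ${savings:,.0f}; reclaimed hours: {hours:,.0f}")
```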
Your AI Implementation Roadmap
A structured approach to integrating reliable LLMs into your enterprise workflows.
Phase 1: Hallucination Audit
Assess current LLM fabrication rates and coherence loss in your specific Q&A scenarios.
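At T=0.0 and long contexts, coherence loss typically surfaces as the model repeating the same span indefinitely, so an audit can flag it with a simple repetition heuristic. A minimal sketch; the span length and repeat threshold are assumptions:

```python
def has_repetition_loop(text: str, min_span: int = 20, min_repeats: int = 3) -> bool:
    """Heuristic: flag output whose tail is the same span repeated several times."""
    for span_len in range(min_span, len(text) // min_repeats + 1):
        span = text[-span_len:]
        if text.endswith(span * min_repeats):
            return True
    return False

assert has_repetition_loop("The answer is " + "paragraph 4, " * 40)
assert not has_repetition_loop("The contract term is 36 months.")
```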
Phase 2: Model Selection & Tuning
Select models from families with proven low fabrication, and fine-tune temperature for optimal balance of accuracy and coherence.
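Temperature selection then becomes a constrained search: maximize accuracy among settings whose coherence-loss rate stays acceptable. A sketch, assuming a `run_eval` callable you provide and an illustrative 1% ceiling on loop rate:

```python
from typing import Callable

def pick_temperature(
    run_eval: Callable[[float], tuple[float, float]],  # temperature -> (accuracy, loop_rate)
    candidates=(0.0, 0.2, 0.4, 0.7, 1.0),
    max_loop_rate: float = 0.01,  # assumed ceiling on coherence loss
) -> float:
    """Choose the most accurate temperature whose loop rate stays under the ceiling."""
    viable = []
    for t in candidates:
        accuracy, loop_rate = run_eval(t)
        if loop_rate <= max_loop_rate:
            viable.append((accuracy, t))
    if not viable:
        raise ValueError("no temperature met the coherence constraint; retest or raise the ceiling")
    return max(viable)[1]
```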
Phase 3: Safeguard Integration
Implement post-processing and human-in-the-loop safeguards to detect and mitigate fabricated answers.
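One inexpensive post-processing safeguard is to measure how much of an answer's content actually appears in the source document and escalate low-overlap answers to a human reviewer. The token-overlap heuristic and the 0.6 threshold below are assumptions, not a method from the study:

```python
import re

def content_overlap(answer: str, document: str) -> float:
    """Fraction of the answer's word tokens that also occur in the document."""
    doc_tokens = set(re.findall(r"\w+", document.lower()))
    ans_tokens = re.findall(r"\w+", answer.lower())
    if not ans_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in ans_tokens) / len(ans_tokens)

def route_answer(answer: str, document: str, threshold: float = 0.6) -> str:
    """Auto-approve well-grounded answers; escalate the rest to a human reviewer."""
    return "auto_approve" if content_overlap(answer, document) >= threshold else "human_review"
```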
Phase 4: Continuous Monitoring
Regularly re-evaluate model performance and fabrication rates as context lengths and models evolve.
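Monitoring can be as simple as re-running the audit on a schedule and alerting when fabrication regresses past a tolerance. A sketch, with the two-point tolerance as an illustrative default:

```python
def check_regression(
    current_fabrication: float,
    baseline_fabrication: float,
    tolerance: float = 0.02,  # assumed: alert if the rate worsens by more than 2 points
) -> bool:
    """Return True (and alert) when fabrication has regressed beyond tolerance."""
    regressed = current_fabrication > baseline_fabrication + tolerance
    if regressed:
        print(f"ALERT: fabrication {current_fabrication:.2%} vs baseline {baseline_fabrication:.2%}")
    return regressed
```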
Ready to Deploy Hallucination-Resistant AI?
Schedule a complimentary strategy session with our AI experts to discuss how to integrate reliable LLMs into your enterprise Q&A workflows.