Enterprise AI Analysis
How Much Do LLMs Hallucinate in Document Q&A Scenarios?
This study reveals significant hallucination rates in LLMs for document Q&A. Even top models fabricate answers at a non-trivial rate (1.19% at 32K context), and the rate climbs steeply with context length, exceeding 10% at 200K. Model selection is paramount: some model families resist fabrication far better than others, regardless of size. Temperature tuning is nuanced: T=0.0 is sometimes best for accuracy but dramatically increases coherence loss (infinite repetition loops) at longer contexts. Grounding and fabrication resistance are distinct capabilities, so a model that is good at finding facts may still invent them. Hardware platforms do not significantly affect model behavior. These findings underscore the need for targeted training and robust safeguards in enterprise AI deployments.
Executive Impact: Key Findings
Our 172-billion-token study reveals critical insights for enterprise AI deployment.
Deep Analysis & Enterprise Applications
The modules below explore the specific findings from the research with an enterprise focus.
Even under optimal conditions (32K context, optimal temperature, best hardware), the single best model (GLM 4.5) still fabricates answers (1.19% at 32K). Top-tier models typically show 5-7% fabrication, and the median model roughly 25%.
| Aspect | Traditional Approaches | RIKER Methodology |
|---|---|---|
| Ground Truth | Expensive human annotation, static | Generated from known ground truth, regenerable |
| Contamination Risk | High, vulnerable to models seeing test data | Low, fresh instances prevent contamination |
| Scoring | LLM-as-judge (biased, unreliable) or human | Deterministic, no human annotation/LLM judges |
| Scale | Limited, small datasets | Arbitrary scale, high statistical confidence |
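To make the contrast concrete, here is a minimal sketch of RIKER-style deterministic scoring, assuming the document was generated from known ground truth. The normalization rules, the refusal-marker list, and the `score_answer` helper are illustrative assumptions, not the study's actual implementation.

```python
import re

# Illustrative refusal markers; the study's actual detection rules may differ.
REFUSAL_MARKERS = ("not mentioned", "no information", "cannot find", "does not appear")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for exact comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def score_answer(model_answer: str, ground_truth: str | None) -> str:
    """Deterministically score one answer against generated ground truth.

    ground_truth is None for questions about entities that do not exist
    in the document; the only correct behavior there is a refusal.
    """
    answer = normalize(model_answer)
    if ground_truth is None:
        refused = any(marker in answer for marker in REFUSAL_MARKERS)
        return "correct_refusal" if refused else "fabrication"
    return "correct" if normalize(ground_truth) in answer else "incorrect"
```

Because the verdict is a pure function of strings, the same inputs always yield the same score, with no judge model or human annotator in the loop.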
At 200K context, no model achieves a fabrication rate below 10%; even the best performer (Qwen3 Next 80B-A3B) fabricates answers to 10.25% of questions about non-existent entities.
GLM 4.6 Catastrophic Collapse
GLM 4.6, ranking 6th at 32K (93.26%) and 5th at 128K (85.81%), collapses to last place at 200K (37.65%). Its fabrication rate explodes from 7.04% at 32K to 71.62% at 200K. This demonstrates that advertised context length often significantly exceeds usable capacity.
Key Takeaway: Advertised context length is a poor proxy for usable capacity. Models require testing at actual deployment context lengths.
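Because this kind of degradation only appears at the lengths you actually run, deployment testing can be a simple sweep over candidate context sizes. A minimal sketch, assuming you supply your own benchmark as a callable from context length to fabrication rate:

```python
from typing import Callable

def test_at_deployment_lengths(
    evaluate: Callable[[int], float],  # your benchmark: context tokens -> fabrication rate
    lengths=(32_000, 128_000, 200_000),
) -> dict[int, float]:
    """Run the same fabrication benchmark at each context length you plan to deploy."""
    results = {n: evaluate(n) for n in lengths}
    for n, rate in results.items():
        print(f"{n:>7,} tokens: {rate:.2%} fabrication")
    return results
```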
GLM 4.5 Air and Llama 3.1 70B achieve nearly identical grounding scores (91.47% vs. 90.18%), yet their fabrication rates differ by 46 percentage points (3.37% vs. 49.50%). This highlights that grounding ability and fabrication resistance are distinct capabilities.
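The practical consequence is that both capabilities should be reported separately from the same evaluation run. A minimal sketch, assuming each result record carries a question type and a deterministic verdict like those produced by the scorer sketched earlier:

```python
def grounding_and_fabrication(results: list[dict]) -> tuple[float, float]:
    """Compute grounding score and fabrication rate as independent metrics.

    results: [{"question_type": "answerable" | "nonexistent", "verdict": str}, ...]
    Assumes both question types are present in the run.
    """
    answerable = [r for r in results if r["question_type"] == "answerable"]
    nonexistent = [r for r in results if r["question_type"] == "nonexistent"]
    grounding = sum(r["verdict"] == "correct" for r in answerable) / len(answerable)
    fabrication = sum(r["verdict"] == "fabrication" for r in nonexistent) / len(nonexistent)
    return grounding, fabrication
```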
Estimate Your AI ROI
Estimate the annual savings and hours you could reclaim by deploying enterprise AI solutions with reduced hallucination rates.
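As a rough illustration of the arithmetic behind such an estimate, here is a sketch in Python. Every parameter and default value (query volume, per-error cost, review minutes) is an assumption to replace with your own figures; the fabrication rates echo the top-tier and median figures above.

```python
def estimate_roi(
    queries_per_year: int = 500_000,        # assumed workload
    baseline_fabrication: float = 0.25,     # e.g. a median model
    improved_fabrication: float = 0.05,     # e.g. a top-tier model
    cost_per_fabrication: float = 12.0,     # assumed cost of catching/fixing one bad answer ($)
    review_minutes_per_fabrication: float = 6.0,
) -> tuple[float, float]:
    """Return (annual dollar savings, reclaimed hours) from reduced hallucination."""
    avoided = queries_per_year * (baseline_fabrication - improved_fabrication)
    savings = avoided * cost_per_fabrication
    hours = avoided * review_minutes_per_fabrication / 60
    return savings, hours

savings, hours = estimate_roi()
print(f"Estimated annual savings: ${savings:,.0f}; reclaimed hours: {hours:,.0f}")
```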
Your AI Implementation Roadmap
A structured approach to integrating reliable LLMs into your enterprise workflows.
Phase 1: Hallucination Audit
Assess current LLM fabrication rates and coherence loss in your specific Q&A scenarios.
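At T=0.0 and long contexts, coherence loss typically surfaces as the model repeating the same span indefinitely, so an audit can flag it with a simple repetition heuristic. A minimal sketch; the span length and repeat threshold are assumptions:

```python
def has_repetition_loop(text: str, min_span: int = 20, min_repeats: int = 3) -> bool:
    """Heuristic: flag output whose tail is the same span repeated several times."""
    for span_len in range(min_span, len(text) // min_repeats + 1):
        span = text[-span_len:]
        if text.endswith(span * min_repeats):
            return True
    return False

assert has_repetition_loop("The answer is " + "paragraph 4, " * 40)
assert not has_repetition_loop("The contract term is 36 months.")
```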
Phase 2: Model Selection & Tuning
Select models from families with proven low fabrication, and fine-tune temperature for optimal balance of accuracy and coherence.
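Temperature selection then becomes a constrained search: maximize accuracy among settings whose coherence-loss rate stays acceptable. A sketch, assuming a `run_eval` callable you provide and an illustrative 1% ceiling on loop rate:

```python
from typing import Callable

def pick_temperature(
    run_eval: Callable[[float], tuple[float, float]],  # temperature -> (accuracy, loop_rate)
    candidates=(0.0, 0.2, 0.4, 0.7, 1.0),
    max_loop_rate: float = 0.01,  # assumed ceiling on coherence loss
) -> float:
    """Choose the most accurate temperature whose loop rate stays under the ceiling."""
    viable = []
    for t in candidates:
        accuracy, loop_rate = run_eval(t)
        if loop_rate <= max_loop_rate:
            viable.append((accuracy, t))
    if not viable:
        raise ValueError("no temperature met the coherence constraint; retest or raise the ceiling")
    return max(viable)[1]
```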
Phase 3: Safeguard Integration
Implement post-processing and human-in-the-loop safeguards to detect and mitigate fabricated answers.
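One inexpensive post-processing safeguard is to measure how much of an answer's content actually appears in the source document and escalate low-overlap answers to a human reviewer. The token-overlap heuristic and the 0.6 threshold below are assumptions, not a method from the study:

```python
import re

def content_overlap(answer: str, document: str) -> float:
    """Fraction of the answer's word tokens that also occur in the document."""
    doc_tokens = set(re.findall(r"\w+", document.lower()))
    ans_tokens = re.findall(r"\w+", answer.lower())
    if not ans_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in ans_tokens) / len(ans_tokens)

def route_answer(answer: str, document: str, threshold: float = 0.6) -> str:
    """Auto-approve well-grounded answers; escalate the rest to a human reviewer."""
    return "auto_approve" if content_overlap(answer, document) >= threshold else "human_review"
```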
Phase 4: Continuous Monitoring
Regularly re-evaluate model performance and fabrication rates as context lengths and models evolve.
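Monitoring can be as simple as re-running the audit on a schedule and alerting when fabrication regresses past a tolerance. A sketch, with the two-point tolerance as an illustrative default:

```python
def check_regression(
    current_fabrication: float,
    baseline_fabrication: float,
    tolerance: float = 0.02,  # assumed: alert if the rate worsens by more than 2 points
) -> bool:
    """Return True (and alert) when fabrication has regressed beyond tolerance."""
    regressed = current_fabrication > baseline_fabrication + tolerance
    if regressed:
        print(f"ALERT: fabrication {current_fabrication:.2%} vs baseline {baseline_fabrication:.2%}")
    return regressed
```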
Ready to Deploy Hallucination-Resistant AI?
Schedule a complimentary strategy session with our AI experts to discuss how to integrate reliable LLMs into your enterprise Q&A workflows.