Enterprise AI Analysis
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
Matthew L. Smith*, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé
While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references, each checked by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.
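The functional form described above can be sketched in a few lines. This is a minimal illustration, not the authors' fitted model: the function name `recall_quality`, the coefficients `a`, `b`, `c` and their default values are placeholders chosen only to show the shape of the curve, and the use of base-10 logarithms is an assumption.

```python
import math

def recall_quality(n_params: float, topic_works: float,
                   a: float = 1.0, b: float = 1.0, c: float = -14.0) -> float:
    """Sigmoid of a log-linear combination of model size and topic frequency.

    a, b and c are hypothetical placeholder coefficients, not values
    reported in the study.
    """
    z = a * math.log10(n_params) + b * math.log10(topic_works) + c
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs: one 8B-parameter model on a broad and a niche topic.
broad = recall_quality(8e9, 767_282)
niche = recall_quality(8e9, 1_341)
print(f"broad={broad:.3f} niche={niche:.3f}")
```

With any positive `a` and `b`, the score rises with either a larger model or a more frequent topic, which is the qualitative behavior the abstract describes.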
Executive Impact
Factual recall in LLMs is not random but follows predictable scaling laws tied to model size and topic frequency, revealing a structural inequality in knowledge encoding.
Optimize LLM deployment and fine-tuning strategies by understanding the predictable relationship between model capacity, training data composition, and factual recall quality, mitigating confabulations for business-critical applications.
Deep Analysis & Enterprise Applications
Economics Insights for LLM Factual Recall
In economics-related topics, LLMs demonstrate varying factual recall capabilities depending on the specificity and frequency of the concept in training data. Highly represented topics like "Economics" (767,282 works) show higher recall, while niche areas such as "Microfinance loan repayment" (1,341 works) are more susceptible to confabulation. This implies that for enterprise applications relying on specific financial or economic data, model scaling and targeted fine-tuning on relevant datasets are crucial.
Education Insights for LLM Factual Recall
Educational topics highlight the "long-tail" challenge. While general concepts like "Education" (5,921,253 works) are well-recalled, specialized topics like "School dropout prevention in rural areas" (32 works) fall into the "floor regime" of confabulations. Enterprises building educational AI tools must consider supplemental retrieval systems for highly specific or underrepresented domains to ensure accuracy and avoid generating misleading information.
Energy Insights for LLM Factual Recall
Energy sector information in LLMs also follows the frequency-recall pattern. Broad topics like "Energy" (8,631,334 works) are robustly recalled, but granular ones such as "Mini-grid electrification" (747 works) exhibit lower recall quality. This finding is critical for energy enterprises using LLMs for analysis or content generation, suggesting that bespoke models or advanced RAG are necessary for reliable information on niche energy technologies and policies.
Environmental Science Insights for LLM Factual Recall
Environmental science topics, ranging from "Climate change" (1,222,665 works) to "Climate-smart agriculture for smallholders" (1,406 works), showcase the importance of training data diversity and volume. LLMs perform better on broadly discussed environmental issues but struggle with highly specific, localized, or emerging concepts. Enterprises in sustainability and agriculture must invest in targeted data strategies to ensure their AI systems provide accurate, detailed information.
Health Insights for LLM Factual Recall
In the health domain, recall quality is high for widespread topics like "Health" (8,504,910 works) and "Infectious disease" (469,401 works). However, very specific public health interventions such as "Insecticide-treated bed nets for malaria" (1,331 works) show reduced recall. For healthcare enterprises, this means LLMs can provide general medical information reliably, but for critical, highly specific treatment details or rare conditions, human verification and/or specialized knowledge bases are indispensable.
Political Science Insights for LLM Factual Recall
Political science topics reveal a similar pattern, with "Political science" (285,942 works) demonstrating better recall than "Biometric voter registration" (171 works). This highlights potential biases in LLM knowledge based on global discourse frequency. Enterprises using AI for policy analysis or public sector applications must be aware of these disparities, especially when dealing with nuanced or less-covered political and social topics, requiring robust validation or expert input.
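To make the frequency gaps in the six domains above concrete, the following sketch computes how many orders of magnitude separate each broad topic from its niche counterpart. The work counts are the ones quoted in the text; the pairing of one broad and one niche topic per domain, and the names `TOPIC_PAIRS` and `log_gap`, are illustrative choices.

```python
import math

# domain -> (works for the broad topic, works for the niche topic),
# using the counts quoted in the text above.
TOPIC_PAIRS = {
    "Economics": (767_282, 1_341),
    "Education": (5_921_253, 32),
    "Energy": (8_631_334, 747),
    "Environmental science": (1_222_665, 1_406),
    "Health": (8_504_910, 1_331),
    "Political science": (285_942, 171),
}

def log_gap(broad_works: int, niche_works: int) -> float:
    """Orders of magnitude between a broad and a niche topic."""
    return math.log10(broad_works / niche_works)

gaps = {domain: log_gap(b, n) for domain, (b, n) in TOPIC_PAIRS.items()}
for domain, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<22} {gap:.1f} orders of magnitude")
```

Because the scaling relationship operates on the logarithm of topic frequency, these gaps, rather than the raw counts, are the quantity that matters when comparing topics.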
Enterprise Process Flow: Scaling Law Derivation
The sigmoid functional form, which combines model parameter count and topic frequency, accounts for about 60% of the observed variance in factual recall quality across the 16 dense models studied, and 74-94% within individual model families, indicating a predictable scaling relationship.
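"Variance explained" here refers to the coefficient of determination. As a minimal sketch of how such a figure is computed from observed and predicted recall scores, assuming the standard R² definition (the numbers below are invented for illustration and are not data from the study):

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

# Toy values, invented for illustration only.
observed = [0.10, 0.30, 0.55, 0.80, 0.90]
predicted = [0.15, 0.25, 0.60, 0.70, 0.95]
print(round(r_squared(observed, predicted), 3))
```

A value of 0.60 would mean the fitted curve removes 60% of the spread that remains when every observation is predicted by the overall mean.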
| Family | Key Characteristics | Recall Quality Trend |
|---|---|---|
| Llama | | Systematic offsets above baseline, suggests superior training/data. |
| Gemma & Mistral | | Systematic offsets below baseline, indicates potential for optimization. |
| Qwen3 (Outlier) | | Outlier behavior in smaller models, hints at floor regime limitations. |
| MoE Architectures | | Total parameters (not active) dominate noise floor for MoE models. |
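One way to quantify a family-level offset like those in the table is the mean residual between observed recall and the recall predicted by a shared curve. The sketch below assumes that definition; the function name `family_offsets`, the family labels, and all numbers are hypothetical.

```python
from collections import defaultdict

def family_offsets(rows):
    """rows: iterable of (family, observed, predicted) tuples.

    Returns the mean residual (observed - predicted) per family.
    A positive value means the family sits above the shared curve.
    """
    sums = defaultdict(lambda: [0.0, 0])  # family -> [residual sum, count]
    for family, observed, predicted in rows:
        sums[family][0] += observed - predicted
        sums[family][1] += 1
    return {family: total / n for family, (total, n) in sums.items()}

# Placeholder families and invented scores, for illustration only.
rows = [
    ("family_a", 0.60, 0.50),
    ("family_a", 0.80, 0.70),
    ("family_b", 0.30, 0.40),
]
offsets = family_offsets(rows)
print(offsets)
```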
Case Study: The Floor Regime of Factual Recall
At the lowest end of the sigmoid curve, models exhibit a 'floor regime' where they cease to produce verifiable references and instead generate templated fabrications. For example, Llama 3.2 1B produced only 22 verifiable references out of 215, often repeating the same work (Dahl's Polyarchy) with different publication years. Similarly, Qwen3 8B relied on a small pool of 62 unique first-author surnames, with 'Smith' appearing across all 24 topics, indicating slot-filling rather than true recall. This highlights the critical threshold where model capacity and topic frequency are insufficient for reliable factual generation, leading to predictable confabulations.
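The two symptoms described in the case study, a low share of verifiable references and a surname that recurs across every topic, can be checked with simple counts. This is a hedged sketch rather than the study's method: the function names and the toy reference list are hypothetical, and only the 22-of-215 figure comes from the text.

```python
from collections import defaultdict

def surname_diagnostics(references):
    """references: iterable of (topic, first_author_surname) pairs."""
    topics_by_surname = defaultdict(set)
    all_topics = set()
    for topic, surname in references:
        topics_by_surname[surname].add(topic)
        all_topics.add(topic)
    # Surnames that appear in every topic hint at slot-filling.
    ubiquitous = sorted(s for s, t in topics_by_surname.items() if t == all_topics)
    return {
        "unique_surnames": len(topics_by_surname),
        "n_topics": len(all_topics),
        "in_every_topic": ubiquitous,
    }

def verifiable_rate(n_verified: int, n_total: int) -> float:
    """Share of generated references that could be verified."""
    return n_verified / n_total

# Toy reference list, invented for illustration.
refs = [
    ("Economics", "Smith"), ("Economics", "Jones"),
    ("Health", "Smith"),
    ("Energy", "Smith"), ("Energy", "Lee"),
]
print(surname_diagnostics(refs))
# Figures quoted above for Llama 3.2 1B: 22 verifiable out of 215.
print(f"{verifiable_rate(22, 215):.1%}")
```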
Estimate Your ROI
Unlock Productivity: Calculate Your Potential AI Savings
Project the tangible benefits of optimizing LLM factual recall within your enterprise. Understand how improved accuracy and reduced confabulations translate into significant time and cost savings.
Implementation Blueprint
Your Roadmap to Reliable LLM Factual Recall
Based on the predictable scaling of factual recall, here's a strategic roadmap to integrate these insights into your enterprise AI initiatives and minimize confabulations.
Phase 1: Baseline Assessment & Gap Analysis
Evaluate existing LLM factual recall performance against topic frequency and model size to identify knowledge gaps and confabulation hotspots across critical enterprise domains.
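A baseline assessment of this kind could start by grouping verification results by topic-frequency decade. The sketch below assumes each record carries a topic work count and a verified/unverified flag; the record layout, the function name, and the sample records are hypothetical.

```python
import math
from collections import defaultdict

def verified_rate_by_decade(records):
    """records: iterable of (topic_works, is_verified) pairs.

    Groups references by floor(log10(topic_works)) and returns the
    share of verified references in each group.
    """
    counts = defaultdict(lambda: [0, 0])  # decade -> [verified, total]
    for topic_works, is_verified in records:
        decade = math.floor(math.log10(topic_works))
        counts[decade][0] += int(is_verified)
        counts[decade][1] += 1
    return {d: v / n for d, (v, n) in sorted(counts.items())}

# Invented verification outcomes, for illustration only.
records = [
    (32, False), (50, False),
    (1_341, True), (1_406, False),
    (767_282, True), (5_921_253, True),
]
print(verified_rate_by_decade(records))
```

Decades with a low verified share mark the confabulation hotspots that later phases would need to address.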
Phase 2: Targeted Data Curation & Fine-Tuning
Implement targeted pre-training and fine-tuning strategies focusing on low-frequency, high-value concepts to raise signal strength above the interference floor and improve recall.
Phase 3: Retrieval Augmented Generation (RAG) Integration
For very low-frequency or critical concepts below the recall floor, integrate robust RAG systems to bypass parametric recall and ensure accuracy, complementing LLM capabilities.
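The routing decision in this phase can be expressed as a small rule. This is a minimal sketch under stated assumptions: the predicted recall score is taken as an input from whatever fitted curve is available, the `recall_floor` threshold of 0.5 is an arbitrary placeholder, and the `Route` and `choose_route` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    use_retrieval: bool
    reason: str

def choose_route(predicted_recall: float, is_critical: bool,
                 recall_floor: float = 0.5) -> Route:
    """Decide whether a query should bypass parametric recall.

    recall_floor is a placeholder threshold, not a value from the study.
    """
    if is_critical:
        return Route(True, "critical concept: ground in retrieved sources")
    if predicted_recall < recall_floor:
        return Route(True, "predicted recall below floor")
    return Route(False, "parametric recall expected to suffice")

print(choose_route(0.2, is_critical=False))
print(choose_route(0.9, is_critical=False))
```

Treating critical concepts as always retrieval-backed is a conservative design choice; a deployment could instead apply a stricter threshold to them.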
Phase 4: Continuous Monitoring & Adaptive Scaling
Establish ongoing monitoring of factual recall quality across diverse topics and model scales, using the sigmoid framework to adaptively optimize model deployments and resource allocation.
Ready to Transform Your AI Strategy?
Don't let unpredictable confabulations hinder your enterprise AI. Our experts are ready to help you implement a data-driven approach to LLM deployment, ensuring factual accuracy and maximizing ROI.