Enterprise AI Analysis: NanoKnow: How to Know What Your Language Model Knows

AI KNOWLEDGE DISENTANGLEMENT

NanoKnow: Unveiling How Your Language Model Learns

How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box": unknown or inaccessible. The recent release of nanochat, a family of small LLMs with fully open pre-training data, addresses this by providing a transparent view into where a model's parametric knowledge comes from. Toward the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can properly disentangle the sources of knowledge that LLMs rely on when producing an output.

Executive Impact: Key Findings on LLM Knowledge

NanoKnow provides critical insights into LLM behavior, revealing how pre-training data frequency, external evidence, and context influence accuracy and knowledge retention.


Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Methodology
Key Findings
Research Impact

NanoKnow Generation Process Flow

BM25 Retrieval
Answer String Matching
LLM-Based Verification

NanoKnow partitions NQ and SQuAD questions into 'supported' (answer exists in pre-training data) and 'unsupported' splits, enabling controlled evaluation of LLMs.
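The three-stage flow above can be sketched as a simple filter pipeline. This is a minimal illustration, not the paper's actual tooling: the names `retrieve_top_k`, `llm_verify`, and `partition_questions` are hypothetical placeholders for the BM25 retriever, the LLM-based verifier, and the split logic.

```python
def naive_contains(answer: str, document: str) -> bool:
    # Stage 2: case-insensitive answer string matching (illustrative only).
    return answer.lower() in document.lower()

def partition_questions(questions, retrieve_top_k, llm_verify, k=10):
    """Split QA pairs into 'supported' / 'unsupported' w.r.t. a corpus.

    retrieve_top_k(question, k) -> list[str]   # Stage 1: BM25 retrieval
    llm_verify(question, answer, doc) -> bool  # Stage 3: LLM judgment
    """
    supported, unsupported = [], []
    for qa in questions:
        docs = retrieve_top_k(qa["question"], k)
        # A question is 'supported' if some retrieved document both contains
        # the answer string and passes LLM-based verification.
        hit = any(
            naive_contains(qa["answer"], d)
            and llm_verify(qa["question"], qa["answer"], d)
            for d in docs
        )
        (supported if hit else unsupported).append(qa)
    return supported, unsupported
```

In practice the retriever would run over the full pre-training corpus and the verifier would be a prompted LLM; both are stubbed here to show only the control flow.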

Closed-book QA accuracy more than doubles for questions whose answers appear frequently in the pre-training data, compared with questions whose answers are rare.
Aspect: Model Size Impact
  Closed-Book QA: Larger models (d34) show significant accuracy increases over smaller models (d20), suggesting increased memorization with scale.
  Open-Book QA (w/ FineWeb Context): The relative improvement from external evidence decreases as model size increases; smaller models benefit more from it.

Aspect: Frequency Dependence
  Closed-Book QA: Strong correlation; accuracy more than doubles for high-frequency answers, and LLMs struggle significantly with rare knowledge.
  Open-Book QA (w/ FineWeb Context): Context mitigates frequency dependence but does not eliminate it; models are still more effective on questions with higher answer frequency.

Aspect: Knowledge Type
  Closed-Book QA: Relies solely on parametric knowledge acquired during pre-training, demonstrating what LLMs *know* inherently.
  Open-Book QA (w/ FineWeb Context): Parametric and external knowledge are complementary; LLMs perform better on questions they have seen before, even with external context.

These findings demonstrate the complex interplay between parametric and external knowledge, and how model scale and pre-training data characteristics influence LLM performance.

NanoKnow: A Foundation for LLM Transparency

NanoKnow addresses a critical gap in LLM research by providing a transparent view into a model's parametric knowledge. By mapping QA benchmarks onto an entirely open pre-training corpus like FineWeb-Edu, researchers can confidently disentangle the sources of knowledge LLMs rely on.

Our experiments confirm a range of results across the literature, highlighting NanoKnow's reliability. This benchmark not only deepens our understanding of how pre-training data shapes what LLMs know but also provides a framework for future explorations into data curation, topical composition, and the nuanced interactions between different knowledge sources.

NanoKnow establishes a new standard for controlled and reproducible studies on LLM knowledge, paving the way for more informed AI development.

Calculate Your Potential AI Impact

Estimate the tangible benefits of integrating advanced AI capabilities into your enterprise operations. See how NanoKnow's insights can translate into real-world efficiency gains and cost savings.


Your Path to Transparent AI Knowledge

Deploying advanced LLM analysis requires a structured approach. Our roadmap outlines the key phases to integrate NanoKnow's insights and establish robust AI knowledge governance within your enterprise.

Phase 1: Corpus Indexing & Data Prep

Build searchable index over your proprietary or chosen open-source LLM pre-training corpus to establish data traceability.
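As a sketch of this phase, the snippet below implements a minimal in-memory BM25 scorer in pure Python. The class name `BM25Index` and its whitespace tokenization are illustrative assumptions; a real deployment would use a production search engine over the full corpus.

```python
import math
from collections import Counter

class BM25Index:
    """Minimal in-memory BM25 index (a sketch, not a production engine)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.tokenized = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.tokenized) / len(self.tokenized)
        self.tfs = [Counter(d) for d in self.tokenized]
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())  # document frequency per term
        n = len(self.tokenized)
        # Standard BM25 idf with +0.5 smoothing.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.tokenized[i])
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            freq = tf[t]
            s += self.idf[t] * freq * (self.k1 + 1) / (
                freq + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top_k(self, query, k=5):
        # Return indices of the k highest-scoring documents.
        ranked = sorted(range(len(self.tokenized)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

The design choice here is simplicity over scale: everything lives in memory, which is fine for prototyping traceability on a sample of the corpus before committing to a full index.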

Phase 2: Question-Answer Projection

Map relevant enterprise QA pairs (or public benchmarks) onto the indexed corpus to identify 'supported' and 'unsupported' knowledge.

Phase 3: Controlled Experimentation

Conduct targeted evaluations of your LLMs across various closed-book and open-book QA scenarios, leveraging NanoKnow's splits.
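A closed-book versus open-book evaluation can be sketched as below. The helper names `evaluate_split`, `generate`, and `retrieve`, along with the prompt template and the normalized exact-match metric, are illustrative assumptions rather than the benchmark's prescribed setup.

```python
def exact_match(pred: str, gold: str) -> bool:
    # Simple normalized exact-match metric.
    return pred.strip().lower() == gold.strip().lower()

def evaluate_split(qa_pairs, generate, retrieve=None):
    """Accuracy on one NanoKnow-style split.

    generate(prompt) -> str           # the model under test
    retrieve(question) -> str         # optional evidence; None = closed-book
    """
    correct = 0
    for qa in qa_pairs:
        if retrieve is not None:
            # Open-book: prepend retrieved evidence to the question.
            prompt = f"Context: {retrieve(qa['question'])}\nQ: {qa['question']}\nA:"
        else:
            # Closed-book: the model must rely on parametric knowledge alone.
            prompt = f"Q: {qa['question']}\nA:"
        if exact_match(generate(prompt), qa["answer"]):
            correct += 1
    return correct / max(len(qa_pairs), 1)
```

Running this once per split (supported vs. unsupported) and per condition (with and without `retrieve`) yields the four accuracy numbers needed to compare parametric and external knowledge.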

Phase 4: Knowledge Source Disentanglement

Analyze results to understand the interplay of parametric and external knowledge, quantify frequency dependence, and identify impactful factors like distractors.

Ready to Disentangle Your LLM's Knowledge?

Don't let your LLM's knowledge remain a black box. Schedule a consultation with our AI experts to explore how NanoKnow can bring transparency and strategic advantage to your enterprise AI initiatives.
