AI KNOWLEDGE DISENTANGLEMENT
NanoKnow: Unveiling How Your Language Model Learns
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box": unknown or inaccessible. The recent release of nanochat, a family of small LLMs with fully open pre-training data, addresses this by providing a transparent view into where a model's parametric knowledge comes from. Toward the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. With these splits, we can properly disentangle the sources of knowledge that LLMs rely on when producing an output.
Executive Impact: Key Findings on LLM Knowledge
NanoKnow provides critical insights into LLM behavior, revealing how pre-training data frequency, external evidence, and context influence accuracy and knowledge retention.
Deep Analysis & Enterprise Applications
NanoKnow Generation Process Flow
NanoKnow partitions NQ and SQuAD questions into 'supported' (the answer appears in the pre-training data) and 'unsupported' (it does not) splits, enabling controlled evaluation of LLMs.
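The split described above can be sketched as a simple string-containment check over the corpus. This is a minimal illustration with hypothetical helper names, not the actual NanoKnow pipeline, which would operate at corpus scale with an index rather than a linear scan.

```python
def partition_questions(qa_pairs, corpus_docs):
    """Split QA pairs by whether any gold answer string appears
    verbatim in the pre-training corpus (supported vs. unsupported)."""
    supported, unsupported = [], []
    lowered_docs = [doc.lower() for doc in corpus_docs]
    for question, answers in qa_pairs:
        found = any(ans.lower() in doc
                    for ans in answers for doc in lowered_docs)
        (supported if found else unsupported).append((question, answers))
    return supported, unsupported

# Toy example data (not from the actual benchmark).
qa = [
    ("Who wrote Hamlet?", ["William Shakespeare"]),
    ("What is the airspeed of an unladen swallow?", ["24 miles per hour"]),
]
corpus = ["Hamlet is a tragedy written by William Shakespeare around 1600."]
sup, unsup = partition_questions(qa, corpus)
print(len(sup), len(unsup))  # one supported question, one unsupported
```

Exact-string matching is the simplest membership criterion; a production pipeline might also normalize punctuation or match answer aliases.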
| Aspect | Closed-Book QA | Open-Book QA (w/ FineWeb Context) |
|---|---|---|
| Model Size Impact | | |
| Frequency Dependence | | |
| Knowledge Type | | |
These findings demonstrate the complex interplay between parametric and external knowledge, and how model scale and pre-training data characteristics influence LLM performance.
NanoKnow: A Foundation for LLM Transparency
NanoKnow addresses a critical gap in LLM research by providing a transparent view into a model's parametric knowledge. By mapping QA benchmarks onto an entirely open pre-training corpus like FineWeb-Edu, researchers can confidently disentangle the sources of knowledge LLMs rely on.
Our experiments confirm a range of results across the literature, highlighting NanoKnow's reliability. This benchmark not only deepens our understanding of how pre-training data shapes what LLMs know but also provides a framework for future explorations into data curation, topical composition, and the nuanced interactions between different knowledge sources.
NanoKnow establishes a new standard for controlled and reproducible studies on LLM knowledge, paving the way for more informed AI development.
Your Path to Transparent AI Knowledge
Deploying advanced LLM analysis requires a structured approach. Our roadmap outlines the key phases to integrate NanoKnow's insights and establish robust AI knowledge governance within your enterprise.
Phase 1: Corpus Indexing & Data Prep
Build a searchable index over your proprietary or chosen open-source LLM pre-training corpus to establish data traceability.
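A minimal sketch of such an index, assuming a token-level inverted index is sufficient for candidate lookup (a real deployment would likely use a search engine or n-gram index over billions of documents):

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each token to the set of document ids containing it,
    enabling fast candidate retrieval before exact-string verification."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in set(re.findall(r"\w+", text.lower())):
            index[token].add(doc_id)
    return index

# Toy two-document corpus.
docs = ["The Eiffel Tower is in Paris.", "Paris is the capital of France."]
index = build_index(docs)
print(sorted(index["paris"]))  # both documents mention 'paris'
```

Intersecting the posting sets for an answer's tokens narrows the corpus to a handful of candidate documents, which can then be checked for the full answer string.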
Phase 2: Question-Answer Projection
Map relevant enterprise QA pairs (or public benchmarks) onto the indexed corpus to identify 'supported' and 'unsupported' knowledge.
Phase 3: Controlled Experimentation
Conduct targeted evaluations of your LLMs across various closed-book and open-book QA scenarios, leveraging NanoKnow's splits.
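One way to sketch such an evaluation harness: the same scoring loop runs closed-book (question only) or open-book (question plus retrieved context). The `generate` callable and `toy_generate` stand-in below are hypothetical placeholders for an actual model API.

```python
def exact_match(prediction, answers):
    """Case-insensitive exact-match scoring against gold answers."""
    p = prediction.strip().lower()
    return any(p == a.strip().lower() for a in answers)

def evaluate(generate, qa_pairs, context_for=None):
    """Score a model closed-book (no context) or open-book
    (context_for returns a retrieved passage per question)."""
    correct = 0
    for question, answers in qa_pairs:
        if context_for is not None:
            prompt = f"Context: {context_for(question)}\nQuestion: {question}\nAnswer:"
        else:
            prompt = f"Question: {question}\nAnswer:"
        if exact_match(generate(prompt), answers):
            correct += 1
    return correct / len(qa_pairs)

# Toy stand-in model that "knows" exactly one fact.
def toy_generate(prompt):
    return "William Shakespeare" if "Hamlet" in prompt else "unknown"

qa = [("Who wrote Hamlet?", ["William Shakespeare"]),
      ("Capital of France?", ["Paris"])]
print(evaluate(toy_generate, qa))  # closed-book accuracy on the toy set
```

Running the same `qa_pairs` through both modes, restricted to NanoKnow's supported and unsupported splits, yields the controlled comparisons described above.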
Phase 4: Knowledge Source Disentanglement
Analyze results to understand the interplay of parametric and external knowledge, quantify frequency dependence, and identify impactful factors like distractors.
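The aggregation step can be sketched as grouping per-question outcomes by split and answer frequency; the frequency threshold and field names here are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict

def accuracy_by_bucket(results, rare_threshold=10):
    """Aggregate (split, answer_frequency, correct) records into
    per-(split, frequency-bucket) accuracies, exposing how often the
    model succeeds on rare vs. frequent facts in each split."""
    totals, hits = defaultdict(int), defaultdict(int)
    for split, answer_freq, correct in results:
        bucket = "rare" if answer_freq < rare_threshold else "frequent"
        totals[(split, bucket)] += 1
        hits[(split, bucket)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}

# Toy per-question records: (split, corpus frequency of answer, correct?).
records = [
    ("supported", 500, True),
    ("supported", 3, False),
    ("unsupported", 0, False),
    ("supported", 120, True),
]
print(accuracy_by_bucket(records))
```

A sharp accuracy gap between the frequent and rare buckets within the supported split is the kind of frequency dependence this phase is designed to quantify.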
Ready to Disentangle Your LLM's Knowledge?
Don't let your LLM's knowledge remain a black box. Schedule a consultation with our AI experts to explore how NanoKnow can bring transparency and strategic advantage to your enterprise AI initiatives.