AI KNOWLEDGE DISENTANGLEMENT
NanoKnow: Unveiling How Your Language Model Learns
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box": unknown or inaccessible. The recent release of nanochat, a family of small LLMs with fully open pre-training data, addresses this by providing a transparent view into where a model's parametric knowledge comes from. Toward the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. With these splits, we can properly disentangle the sources of knowledge that LLMs rely on when producing an output.
Executive Impact: Key Findings on LLM Knowledge
NanoKnow provides critical insights into LLM behavior, revealing how pre-training data frequency, external evidence, and context influence accuracy and knowledge retention.
Deep Analysis & Enterprise Applications
NanoKnow Generation Process Flow
NanoKnow partitions NQ and SQuAD questions into 'supported' (the answer appears in the pre-training data) and 'unsupported' (it does not) splits, enabling controlled evaluation of LLMs.
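The split described above can be sketched as a simple string-containment check over the corpus. This is a minimal illustration with hypothetical helper names, not the actual NanoKnow pipeline, which would operate at corpus scale with an index rather than a linear scan.

```python
def partition_questions(qa_pairs, corpus_docs):
    """Split QA pairs by whether any gold answer string appears
    verbatim in the pre-training corpus (supported vs. unsupported)."""
    supported, unsupported = [], []
    lowered_docs = [doc.lower() for doc in corpus_docs]
    for question, answers in qa_pairs:
        found = any(ans.lower() in doc
                    for ans in answers for doc in lowered_docs)
        (supported if found else unsupported).append((question, answers))
    return supported, unsupported

# Toy example data (not from the actual benchmark).
qa = [
    ("Who wrote Hamlet?", ["William Shakespeare"]),
    ("What is the airspeed of an unladen swallow?", ["24 miles per hour"]),
]
corpus = ["Hamlet is a tragedy written by William Shakespeare around 1600."]
sup, unsup = partition_questions(qa, corpus)
print(len(sup), len(unsup))  # one supported question, one unsupported
```

Exact-string matching is the simplest membership criterion; a production pipeline might also normalize punctuation or match answer aliases.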
| Aspect | Closed-Book QA | Open-Book QA (w/ FineWeb Context) |
|---|---|---|
| Model Size Impact | | |
| Frequency Dependence | | |
| Knowledge Type | | |
These findings demonstrate the complex interplay between parametric and external knowledge, and how model scale and pre-training data characteristics influence LLM performance.
NanoKnow: A Foundation for LLM Transparency
NanoKnow addresses a critical gap in LLM research by providing a transparent view into a model's parametric knowledge. By mapping QA benchmarks onto an entirely open pre-training corpus like FineWeb-Edu, researchers can confidently disentangle the sources of knowledge LLMs rely on.
Our experiments confirm a range of results across the literature, highlighting NanoKnow's reliability. This benchmark not only deepens our understanding of how pre-training data shapes what LLMs know but also provides a framework for future explorations into data curation, topical composition, and the nuanced interactions between different knowledge sources.
NanoKnow establishes a new standard for controlled and reproducible studies on LLM knowledge, paving the way for more informed AI development.
Your Path to Transparent AI Knowledge
Deploying advanced LLM analysis requires a structured approach. Our roadmap outlines the key phases to integrate NanoKnow's insights and establish robust AI knowledge governance within your enterprise.
Phase 1: Corpus Indexing & Data Prep
Build a searchable index over your proprietary or chosen open-source LLM pre-training corpus to establish data traceability.
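A minimal sketch of such an index, assuming a token-level inverted index is sufficient for candidate lookup (a real deployment would likely use a search engine or n-gram index over billions of documents):

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each token to the set of document ids containing it,
    enabling fast candidate retrieval before exact-string verification."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in set(re.findall(r"\w+", text.lower())):
            index[token].add(doc_id)
    return index

# Toy two-document corpus.
docs = ["The Eiffel Tower is in Paris.", "Paris is the capital of France."]
index = build_index(docs)
print(sorted(index["paris"]))  # both documents mention 'paris'
```

Intersecting the posting sets for an answer's tokens narrows the corpus to a handful of candidate documents, which can then be checked for the full answer string.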
Phase 2: Question-Answer Projection
Map relevant enterprise QA pairs (or public benchmarks) onto the indexed corpus to identify 'supported' and 'unsupported' knowledge.
Phase 3: Controlled Experimentation
Conduct targeted evaluations of your LLMs across various closed-book and open-book QA scenarios, leveraging NanoKnow's splits.
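One way to sketch such an evaluation harness: the same scoring loop runs closed-book (question only) or open-book (question plus retrieved context). The `generate` callable and `toy_generate` stand-in below are hypothetical placeholders for an actual model API.

```python
def exact_match(prediction, answers):
    """Case-insensitive exact-match scoring against gold answers."""
    p = prediction.strip().lower()
    return any(p == a.strip().lower() for a in answers)

def evaluate(generate, qa_pairs, context_for=None):
    """Score a model closed-book (no context) or open-book
    (context_for returns a retrieved passage per question)."""
    correct = 0
    for question, answers in qa_pairs:
        if context_for is not None:
            prompt = f"Context: {context_for(question)}\nQuestion: {question}\nAnswer:"
        else:
            prompt = f"Question: {question}\nAnswer:"
        if exact_match(generate(prompt), answers):
            correct += 1
    return correct / len(qa_pairs)

# Toy stand-in model that "knows" exactly one fact.
def toy_generate(prompt):
    return "William Shakespeare" if "Hamlet" in prompt else "unknown"

qa = [("Who wrote Hamlet?", ["William Shakespeare"]),
      ("Capital of France?", ["Paris"])]
print(evaluate(toy_generate, qa))  # closed-book accuracy on the toy set
```

Running the same `qa_pairs` through both modes, restricted to NanoKnow's supported and unsupported splits, yields the controlled comparisons described above.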
Phase 4: Knowledge Source Disentanglement
Analyze results to understand the interplay of parametric and external knowledge, quantify frequency dependence, and identify impactful factors like distractors.
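The aggregation step can be sketched as grouping per-question outcomes by split and answer frequency; the frequency threshold and field names here are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict

def accuracy_by_bucket(results, rare_threshold=10):
    """Aggregate (split, answer_frequency, correct) records into
    per-(split, frequency-bucket) accuracies, exposing how often the
    model succeeds on rare vs. frequent facts in each split."""
    totals, hits = defaultdict(int), defaultdict(int)
    for split, answer_freq, correct in results:
        bucket = "rare" if answer_freq < rare_threshold else "frequent"
        totals[(split, bucket)] += 1
        hits[(split, bucket)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}

# Toy per-question records: (split, corpus frequency of answer, correct?).
records = [
    ("supported", 500, True),
    ("supported", 3, False),
    ("unsupported", 0, False),
    ("supported", 120, True),
]
print(accuracy_by_bucket(records))
```

A sharp accuracy gap between the frequent and rare buckets within the supported split is the kind of frequency dependence this phase is designed to quantify.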
Ready to Disentangle Your LLM's Knowledge?
Don't let your LLM's knowledge remain a black box. Schedule a consultation with our AI experts to explore how NanoKnow can bring transparency and strategic advantage to your enterprise AI initiatives.