Enterprise AI Analysis
TRAPPED IN THE PAST? Disentangling Fluid and Crystallized Intelligence of Large Language Models Using Chess
This research investigates the fundamental question of whether Large Language Models (LLMs) rely on sophisticated recall (crystallized intelligence) or genuine reasoning ability (fluid intelligence). By leveraging the structured domain of chess, the study systematically evaluates multiple GPT generations across a spectrum of training corpus proximity, from common, memorizable states to novel positions requiring first-principles reasoning.
The findings reveal a consistent degradation in performance as fluid intelligence demands increase, with performance collapsing to random levels in out-of-distribution tasks. While newer models show improvement, progress significantly slows for tasks outside the training distribution. Reasoning-augmented inference offers benefits, but its marginal utility diminishes with decreased distributional proximity. These results underscore current architectures' limitations in systematic generalization, highlighting the critical need for advancements beyond mere scale to achieve robust fluid intelligence.
Authored by Leonard S. Pleiss, Maximilian Schiffer, and Robert K. von Weizsäcker.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The Intelligence Debate: Crystallized vs. Fluid
Large Language Models (LLMs) have demonstrated impressive capabilities, yet it remains debated whether these stem from enhanced memorization (crystallized intelligence) or true problem-solving and reasoning (fluid intelligence). This distinction is vital for understanding LLMs' ability to generalize to novel problems.
Current benchmarks often conflate retrieval and reasoning, making it difficult to quantify how close a given task lies to the training distribution. This study addresses that gap by using chess as a controlled, verifiable environment to systematically analyze LLM performance across tasks that require varying degrees of each form of intelligence.
Structured Testbed: Chess as a Controlled Environment
We employed chess as a controlled testbed to disentangle fluid and crystallized intelligence. Its clear rules, deep combinatorial structure, and computationally verifiable metrics via chess engines provide a robust framework. We categorized chess positions into Within-Distribution (WD), Near-Distribution (ND), and Out-of-Distribution (OOD) based on their likelihood of appearing in training data.
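The categorization above can be sketched as a simple frequency heuristic. This is a hypothetical illustration, not the paper's procedure: the corpus counts, thresholds, and FEN strings below are illustrative assumptions.

```python
# Hypothetical sketch: bucket chess positions into WD / ND / OOD by how often
# their FEN strings appear in an assumed corpus of training games.
from collections import Counter

def categorize(fen: str, corpus_counts: Counter,
               wd_threshold: int = 100, nd_threshold: int = 1) -> str:
    """Assign a distributional-proximity bucket to a position."""
    n = corpus_counts[fen]
    if n >= wd_threshold:
        return "WD"   # common, likely memorizable
    if n >= nd_threshold:
        return "ND"   # rare but structurally familiar
    return "OOD"      # unseen: requires first-principles play

# Toy corpus: the starting position appears often; an endgame study never.
corpus = Counter({"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1": 5000})

print(categorize("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", corpus))  # WD
print(categorize("8/8/8/3k4/8/3K4/8/8 w - - 0 1", corpus))  # OOD
```

In practice the paper's point is that such proximity is hard to measure in open-domain benchmarks; chess makes the buckets computable.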
Performance was quantified using Centipawn Loss (CPL) to measure decision regret, and illegal-move rates to assess syntactic validity. Evaluations spanned GPT-3.5, GPT-4, and GPT-5, including GPT-5 at several reasoning-effort levels, to trace developmental trajectories.
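Centipawn loss has a simple definition: the engine evaluation of the best available move minus the evaluation of the move actually played. A minimal sketch, with hypothetical stand-in scores in place of real engine (e.g. Stockfish) output:

```python
# Minimal sketch of centipawn loss (CPL): the regret of a played move versus
# the engine-best move, measured in hundredths of a pawn.
def centipawn_loss(engine_evals: dict, played_move: str) -> int:
    """CPL = eval(best move) - eval(played move), from the mover's perspective."""
    best = max(engine_evals.values())
    return best - engine_evals[played_move]

# Hypothetical engine scores (centipawns) for three candidate moves.
evals = {"e2e4": 35, "d2d4": 30, "g1f3": 25}

print(centipawn_loss(evals, "e2e4"))  # 0  -> best move, no regret
print(centipawn_loss(evals, "g1f3"))  # 10 -> 10 centipawns of regret
```

Averaging this quantity over many positions yields the ACPL figures reported below; lower is better.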
Performance Gradient: Decreasing Accuracy with Novelty
Our analysis revealed a consistent performance gradient: model accuracy declines significantly as the demand for fluid intelligence increases. In OOD positions, performance often collapses to levels comparable to random play, suggesting severe limitations in generalizing beyond memorized patterns.
While newer GPT generations show continuous improvement in overall performance, the rate of progress slows significantly for tasks further from the training distribution. This suggests that simply scaling current architectures may not be sufficient for robust fluid intelligence.
Reasoning-augmented inference, such as chain-of-thought, notably improves performance across all conditions. However, the marginal benefit of reasoning per token decreases as distributional proximity diminishes, indicating that reasoning primarily amplifies existing knowledge rather than generating novel insights.
Approximate GPT-5 results by condition (ACPL = average centipawn loss; lower is better):

| Condition | GPT-5 ACPL | GPT-5 Legal Moves (%) | GPT-5 w/ Reasoning ACPL | GPT-5 w/ Reasoning Legal Moves (%) |
|---|---|---|---|---|
| Within-Distribution (WD) | ~50 | ~95 | ~10 | ~99.5 |
| Near-Distribution (ND) | ~400 | ~85 | ~70 | ~97 |
| Out-of-Distribution (OOD) | ~450 | ~65 | ~300 | ~90 |
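Working the table's rounded figures makes the pattern concrete: reasoning cuts ACPL sharply in-distribution, but the relative reduction is far smaller out-of-distribution. The values below are the approximate numbers from the table, not additional measurements.

```python
# Relative ACPL reduction from reasoning, per condition, using the rounded
# figures reported in the table above.
acpl = {
    "WD":  {"base": 50,  "reasoning": 10},
    "ND":  {"base": 400, "reasoning": 70},
    "OOD": {"base": 450, "reasoning": 300},
}

for cond, v in acpl.items():
    reduction = 1 - v["reasoning"] / v["base"]
    print(f"{cond}: ACPL reduced by {reduction:.0%}")
```

Out-of-distribution, reasoning recovers only about a third of the regret, versus roughly four-fifths in-distribution, which is the diminishing-marginal-utility effect described above.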
Architectural Bottlenecks & Future Trajectory
The consistent performance gradient and the diminishing returns on fluid-intelligence gains suggest fundamental architectural limitations in current LLMs. They excel at crystallized intelligence (recall) but struggle with novel syntax and first-principles reasoning in OOD contexts.
The Chess Challenge: Beyond Memorization
Our findings from the chess testbed show that LLMs, despite their scale and reasoning capabilities, still fall short of genuine fluid intelligence. This has direct implications for their deployment in safety-critical formal systems.
- Brittle Generalization: Current models rely on surface-level heuristics learned from training data, becoming brittle when structural patterns of standard play are removed in novel situations.
- Diminishing Returns: Scaling alone increasingly yields smaller gains in fluid intelligence, implying an architectural bottleneck that limits extrapolation to novel state spaces.
- Reasoning as an Amplifier: Chain-of-thought mostly amplifies existing crystallized knowledge rather than inducing new conceptual representations or enabling true fluid discovery.
- Caution for Formal Systems: Without advances in capturing latent causal and relational structure, deploying LLMs in domains like mathematics or software synthesis requires caution due to potential brittleness in novel scenarios.
Overcoming these limitations will require innovations beyond mere architectural scale or explicit reasoning. New forms of representation and inference capable of truly capturing and composing latent causal and relational structure are essential for achieving robust fluid intelligence and generalization.
Ready to Transform Your Enterprise with AI?
Our expert team can help you navigate the complexities of AI integration, ensuring scalable solutions and measurable ROI. Schedule a personalized consultation to explore how these insights apply to your unique business challenges.
Advanced ROI Calculator
Estimate the potential return on investment for integrating advanced AI solutions into your enterprise operations.
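The arithmetic behind such an estimate can be sketched in a few lines. All inputs here (hours saved, rates, costs) are illustrative placeholders, not benchmarks or recommendations.

```python
# Hypothetical sketch of a first-year ROI estimate for an AI deployment.
def simple_roi(annual_hours_saved: float, hourly_rate: float,
               implementation_cost: float, annual_running_cost: float) -> float:
    """First-year ROI as a fraction: (benefit - cost) / cost."""
    benefit = annual_hours_saved * hourly_rate
    cost = implementation_cost + annual_running_cost
    return (benefit - cost) / cost

# Example: 2,000 hours saved at $60/h against $80k total first-year cost.
print(f"{simple_roi(2000, 60, 60000, 20000):.0%}")  # 50%
```

A real estimate would also discount multi-year cash flows and account for adoption risk; this sketch only shows the core ratio.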
Your AI Implementation Roadmap
A typical phased approach to integrating advanced AI capabilities into an enterprise, ensuring sustainable growth and impact.
Phase 1: Discovery & Strategy
Comprehensive assessment of current workflows, identification of AI opportunities, and development of a tailored AI strategy aligned with business objectives.
Phase 2: Pilot & Proof of Concept
Deployment of AI solutions in a controlled environment to validate effectiveness, measure ROI, and gather user feedback for iterative refinement.
Phase 3: Scaled Implementation
Full-scale integration of validated AI solutions across relevant departments, ensuring seamless adoption and maximizing enterprise-wide benefits.
Phase 4: Optimization & Future-Proofing
Continuous monitoring, performance tuning, and exploration of emerging AI technologies to maintain competitive advantage and adapt to evolving needs.