RESEARCH INSIGHTS
Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation
Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu - Wuhan Donghu University & University of Manchester
The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework...
Executive Summary: The Accuracy-Stability Paradox
Our study reveals a critical paradox in LLM performance: high accuracy on standard tasks does not guarantee robust conceptual understanding. The sections below summarize the key findings and their implications for enterprise AI deployment.
Deep Analysis & Enterprise Applications
The following modules present the specific findings from the research, reframed for enterprise applications.
The Breakdown of G-Invariance
Consistency Failures represent the inability of LLMs to maintain invariant evaluations under semantically equivalent transformations of the board state (e.g., rotation, mirroring, color-swapping). Our results show that this is the dominant error mode, particularly for GPT-5.1 (73.8% of its errors), indicating a heavy reliance on memorized FEN patterns rather than robust spatial reasoning.
Models like Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate significantly higher consistency, suggesting a better internal representation of board geometry, potentially due to data augmentation strategies during training.
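To make the consistency probe concrete, below is a minimal sketch (not the authors' published code) of how such invariance checks can be generated with the python-chess library. The `evaluate_position` callable stands in for whatever LLM call returns a centipawn score; sign conventions for color-swapped positions, and non-standard castling rights produced by some flips, are caveats any real framework must handle.

```python
# Minimal consistency probe: compare a model's evaluation of a position
# against its evaluations of geometrically equivalent transformations.
# `evaluate_position` is a hypothetical stand-in for an LLM API call
# returning a centipawn score from White's perspective.
import chess

def transformed_fens(fen: str) -> dict[str, str]:
    """Generate transformed variants of a position using python-chess."""
    board = chess.Board(fen)
    return {
        "original": board.fen(),
        # Horizontal mirror: reflect files a<->h (kingside <-> queenside).
        "mirror": board.transform(chess.flip_horizontal).fen(),
        # Color swap: flip vertically and swap piece colors / side to move.
        "color_swap": board.mirror().fen(),
        # 180-degree rotation: vertical flip followed by horizontal flip.
        "rotation": board.transform(chess.flip_vertical)
                         .transform(chess.flip_horizontal).fen(),
    }

def consistency_error(fen: str, evaluate_position) -> float:
    """Mean absolute deviation (cp) between the original evaluation and the
    evaluations of the transformed boards. Caveat: a color-swapped board
    negates the score from White's perspective; handling that sign is left
    to the caller in this sketch."""
    variants = transformed_fens(fen)
    base = evaluate_position(variants["original"])
    deviations = [abs(evaluate_position(f) - base)
                  for name, f in variants.items() if name != "original"]
    return sum(deviations) / len(deviations)
```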
Lipschitz Regularity vs. Singularities
Tactical Blindness occurs when LLMs fail to identify forced winning lines or mate sequences, predicting a balanced game state instead. This is often due to the model's inductive bias towards smoothness (Lipschitz regularity), which struggles to represent the sharp discontinuities of tactical singularities in chess.
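As a worked illustration of this tension (the notation is ours, not the paper's): a Lipschitz-regular evaluator f bounds how quickly its output can change with the distance d between positions, while the ground-truth evaluation E can jump from near-equality to a mate score across a single ply.

```latex
% Smoothness prior vs. tactical singularity (illustrative notation):
% f = LLM evaluator, E = ground-truth (engine) evaluation,
% d = a distance between positions, K = Lipschitz constant,
% M_mate = mate sentinel score.
\[
  |f(x) - f(y)| \;\le\; K\, d(x, y) \qquad \text{(smoothness prior)}
\]
\[
  \exists\, x, y:\; d(x, y) = 1 \text{ ply}
  \quad\text{yet}\quad
  |E(x) - E(y)| \approx M_{\text{mate}} \gg K
  \qquad \text{(tactical singularity)}
\]
```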
Gemini 2.5 Flash exhibits significantly lower tactical error rates (133 compared to GPT-5.1's 495), implying an emergent capability for implicit Chain-of-Thought (CoT) reasoning or a deeper look-ahead search that approximates the true tactical landscape.
Manifold Thickness and Energy-Based Models
Hallucination Errors quantify the model's tendency to output valid numerical evaluations for theoretically illegal board states. This indicates a disconnect between the model's output formatting layer and its semantic verification layer, effectively "hallucinating" a plausible state.
Models like Gemini 2.5 Flash (7 errors) and Claude Sonnet 4.5 (26 errors) demonstrate near-zero hallucination rates, suggesting rigorous safety alignment and Reinforcement Learning from Human Feedback (RLHF) that penalizes generating outputs on invalid inputs. This behavior is akin to an Energy-Based Model (EBM) learning a sharp "Rejection Boundary" for illegal states.
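One way to picture this rejection boundary as explicit code (an illustration of the behavior, not the authors' pipeline or any model's internals) is a legality guard placed in front of the evaluator. Here python-chess's `Board.is_valid()` supplies basic legality checks, `evaluate_position` is a hypothetical LLM call, and the mate sentinel value is an assumption.

```python
# Illustrative "rejection boundary": refuse to score positions that fail
# basic legality checks instead of hallucinating a plausible evaluation.
import chess

MATE_SCORE = 10_000  # assumed centipawn sentinel for mate, not from the paper

def guarded_evaluate(fen: str, evaluate_position) -> float | None:
    try:
        board = chess.Board(fen)
    except ValueError:
        return None  # malformed FEN: outside the data manifold, reject
    # is_valid() checks king counts, pawn placement, check consistency, etc.
    if not board.is_valid():
        return None  # theoretically illegal state: reject rather than score
    if board.is_checkmate():
        return -MATE_SCORE if board.turn == chess.WHITE else MATE_SCORE
    return evaluate_position(board.fen())
```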
Geometric Stability Analysis: Error by Transformation Type
The table below reports the mean evaluation error (in centipawns; lower is better) for each transformation type. It highlights the extreme sensitivity of GPT-5.1 to board rotations, suggesting a reliance on absolute coordinate features rather than relative piece configurations: its learned heuristics collapse under non-canonical orientations.
| Model | Color Swap | Rotation | Mirror | Similarity | Format | Avg. Error |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 198.78 | 626.56 | 195.85 | 318.25 | 62.41 | 280.37 |
| Kimi K2 Turbo | 367.67 | 726.21 | 94.94 | 329.38 | 36.22 | 310.88 |
| DeepSeek Chat | 254.46 | 996.88 | 235.18 | 399.58 | 121.48 | 401.52 |
| Gemini 2.5 Flash | 776.80 | 854.60 | 747.48 | 559.34 | 260.56 | 639.75 |
| GPT-5.1 | 217.49 | 2531.28 | 181.08 | 297.99 | 48.11 | 655.19 |
| Grok 4-1-Fast | 785.95 | 1046.98 | 516.88 | 767.25 | 476.37 | 718.69 |
Case Study: The Brittle Specialist (GPT-5.1 Paradox)
GPT-5.1, despite being the second most accurate model with 362.17 cp error against Stockfish, is paradoxically the second least stable, showing an alarming 655.19 cp error across transformations.
This behavior is characteristic of overfitting to canonical training data. While GPT-5.1 excels at memorized patterns for standard openings and puzzles, it severely lacks the underlying geometric logic to handle rotated or transformed board states. This suggests a critical vulnerability where its learned heuristics collapse under novel, yet semantically identical, input representations.
In contrast, models like Claude Sonnet 4.5 and Kimi K2 Turbo exhibit "Robust Reasoner" characteristics, combining high accuracy with high stability, indicating a more generalized understanding of chess mechanics.
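For readers who want to reproduce the two axes of this comparison from their own evaluation logs, the sketch below (field names are illustrative, and the paper's exact aggregation may differ) computes an accuracy score against the oracle and a stability score across transformations.

```python
# Sketch of the two axes behind the "Brittle Specialist" paradox:
# accuracy  = mean |model - Stockfish| on canonical positions,
# stability = mean |model(original) - model(transformed)| across variants.
# Record field names are assumptions, not the paper's schema.
from statistics import mean

def accuracy_cp(records):
    """records: iterable of dicts with 'model_eval' and 'stockfish_eval' (cp)."""
    return mean(abs(r["model_eval"] - r["stockfish_eval"]) for r in records)

def stability_cp(records):
    """records: iterable of dicts with 'original_eval' and a list
    'transformed_evals' holding the model's scores for each variant."""
    return mean(
        abs(t - r["original_eval"])
        for r in records
        for t in r["transformed_evals"]
    )

# A model can score well on accuracy_cp (e.g., ~362 cp for GPT-5.1) while
# scoring poorly on stability_cp (~655 cp), which is the reported paradox.
```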
Your Path to Robust AI Implementation
A structured approach to integrate AI systems that prioritize both accuracy and geometric stability.
Phase 1: Deep Dive & Assessment
Conduct a comprehensive audit of existing data pipelines and reasoning tasks. Identify areas where geometric stability and logical consistency are critical for robust AI performance.
Phase 2: Custom Framework Development
Design and implement a tailored Geometric Stability Framework. This includes custom data augmentation strategies and architectural optimizations to ensure invariant internal representations.
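One plausible way to operationalize such invariance during fine-tuning, stated in our own notation rather than the paper's, is a consistency-regularized objective that penalizes disagreement between a position and its transformed variants (with an appropriate sign correction for color swaps):

```latex
% Illustrative objective: f_theta = model evaluator, E = oracle evaluation,
% G = group of board transformations, lambda = consistency weight.
\[
  \mathcal{L}(\theta) =
  \underbrace{\mathbb{E}_{x}\big[\,\lvert f_\theta(x) - E(x) \rvert\,\big]}_{\text{accuracy term}}
  \;+\;
  \lambda\,
  \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{T \sim G}\big[\,\lvert f_\theta(x) - f_\theta(T(x)) \rvert\,\big]}_{\text{geometric consistency term}}
\]
```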
Phase 3: Robust Model Training & Evaluation
Train and fine-tune LLMs with an emphasis on consistency, tactical reasoning, and hallucination rejection. Rigorously evaluate against adversarial examples and semantically equivalent input transformations.
Phase 4: Deployment & Continuous Monitoring
Deploy robust AI systems into production environments. Establish continuous monitoring for performance, consistency drift, and new adversarial patterns to maintain long-term reliability.
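As a minimal sketch of what consistency-drift monitoring could look like in practice (thresholds, probe sets, and function names here are deployment assumptions, not values from the paper), a fixed probe set can be re-scored on a schedule and compared against a recorded baseline:

```python
# Illustrative drift check: re-run a fixed probe set of positions through the
# deployed model and flag when the mean transformation error exceeds the
# baseline by a chosen margin. `position_error_cp` could be a per-position
# consistency metric such as the consistency_error helper sketched earlier.

def check_consistency_drift(probe_fens, position_error_cp,
                            baseline_cp, margin_cp=50.0):
    """Return (current mean error in cp, whether it drifted past the margin)."""
    errors = [position_error_cp(fen) for fen in probe_fens]
    current = sum(errors) / len(errors)
    return current, current > baseline_cp + margin_cp
```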
Ready to Build Robust AI?
Our experts are ready to help you navigate the complexities of AI, ensuring your systems are not just accurate, but also stable, logical, and safe for enterprise deployment.