RESEARCH INSIGHTS
Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation
Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu - Wuhan Donghu University & University of Manchester
The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework...
Executive Summary: The Accuracy-Stability Paradox
Our study reveals a critical paradox in LLM performance: high accuracy on standard tasks does not guarantee robust conceptual understanding. The sections below summarize the key findings and their implications for enterprise AI deployment.
Deep Analysis & Enterprise Applications
The following modules present the specific findings from the research, reframed for enterprise applications.
The Breakdown of G-Invariance
Consistency Failures represent the inability of LLMs to maintain invariant evaluations under semantically equivalent transformations of the board state (e.g., rotation, mirroring, color-swapping). Our results show that this is the dominant error mode, particularly for GPT-5.1 (73.8% of its errors), indicating a heavy reliance on memorized FEN patterns rather than robust spatial reasoning.
Models like Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate significantly higher consistency, suggesting a better internal representation of board geometry, potentially due to data augmentation strategies during training.
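To make the consistency probe concrete, below is a minimal sketch (not the authors' published code) of how such invariance checks can be generated with the python-chess library. The `evaluate_position` callable stands in for whatever LLM call returns a centipawn score; sign conventions for color-swapped positions, and non-standard castling rights produced by some flips, are caveats any real framework must handle.

```python
# Minimal consistency probe: compare a model's evaluation of a position
# against its evaluations of geometrically equivalent transformations.
# `evaluate_position` is a hypothetical stand-in for an LLM API call
# returning a centipawn score from White's perspective.
import chess

def transformed_fens(fen: str) -> dict[str, str]:
    """Generate transformed variants of a position using python-chess."""
    board = chess.Board(fen)
    return {
        "original": board.fen(),
        # Horizontal mirror: reflect files a<->h (kingside <-> queenside).
        "mirror": board.transform(chess.flip_horizontal).fen(),
        # Color swap: flip vertically and swap piece colors / side to move.
        "color_swap": board.mirror().fen(),
        # 180-degree rotation: vertical flip followed by horizontal flip.
        "rotation": board.transform(chess.flip_vertical)
                         .transform(chess.flip_horizontal).fen(),
    }

def consistency_error(fen: str, evaluate_position) -> float:
    """Mean absolute deviation (cp) between the original evaluation and the
    evaluations of the transformed boards. Caveat: a color-swapped board
    negates the score from White's perspective; handling that sign is left
    to the caller in this sketch."""
    variants = transformed_fens(fen)
    base = evaluate_position(variants["original"])
    deviations = [abs(evaluate_position(f) - base)
                  for name, f in variants.items() if name != "original"]
    return sum(deviations) / len(deviations)
```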
Lipschitz Regularity vs. Singularities
Tactical Blindness occurs when LLMs fail to identify forced winning lines or mate sequences, predicting a balanced game state instead. This is often due to the model's inductive bias towards smoothness (Lipschitz regularity), which struggles to represent the sharp discontinuities of tactical singularities in chess.
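As a worked illustration of this tension (the notation is ours, not the paper's): a Lipschitz-regular evaluator f bounds how quickly its output can change with the distance d between positions, while the ground-truth evaluation E can jump from near-equality to a mate score across a single ply.

```latex
% Smoothness prior vs. tactical singularity (illustrative notation):
% f = LLM evaluator, E = ground-truth (engine) evaluation,
% d = a distance between positions, K = Lipschitz constant,
% M_mate = mate sentinel score.
\[
  |f(x) - f(y)| \;\le\; K\, d(x, y) \qquad \text{(smoothness prior)}
\]
\[
  \exists\, x, y:\; d(x, y) = 1 \text{ ply}
  \quad\text{yet}\quad
  |E(x) - E(y)| \approx M_{\text{mate}} \gg K
  \qquad \text{(tactical singularity)}
\]
```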
Gemini 2.5 Flash exhibits significantly lower tactical error rates (133 compared to GPT-5.1's 495), implying an emergent capability for implicit Chain-of-Thought (CoT) reasoning or a deeper look-ahead search that approximates the true tactical landscape.
Manifold Thickness and Energy-Based Models
Hallucination Errors quantify the model's tendency to output valid numerical evaluations for theoretically illegal board states. This indicates a disconnect between the model's output formatting layer and its semantic verification layer, effectively "hallucinating" a plausible state.
Models like Gemini 2.5 Flash (7 errors) and Claude Sonnet 4.5 (26 errors) demonstrate near-zero hallucination rates, suggesting rigorous safety alignment and Reinforcement Learning from Human Feedback (RLHF) that penalizes generating outputs on invalid inputs. This behavior is akin to an Energy-Based Model (EBM) learning a sharp "Rejection Boundary" for illegal states.
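One way to picture this rejection boundary as explicit code (an illustration of the behavior, not the authors' pipeline or any model's internals) is a legality guard placed in front of the evaluator. Here python-chess's `Board.is_valid()` supplies basic legality checks, `evaluate_position` is a hypothetical LLM call, and the mate sentinel value is an assumption.

```python
# Illustrative "rejection boundary": refuse to score positions that fail
# basic legality checks instead of hallucinating a plausible evaluation.
import chess

MATE_SCORE = 10_000  # assumed centipawn sentinel for mate, not from the paper

def guarded_evaluate(fen: str, evaluate_position) -> float | None:
    try:
        board = chess.Board(fen)
    except ValueError:
        return None  # malformed FEN: outside the data manifold, reject
    # is_valid() checks king counts, pawn placement, check consistency, etc.
    if not board.is_valid():
        return None  # theoretically illegal state: reject rather than score
    if board.is_checkmate():
        return -MATE_SCORE if board.turn == chess.WHITE else MATE_SCORE
    return evaluate_position(board.fen())
```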
Geometric Stability Analysis: Error by Transformation Type
The table below reports the mean evaluation error (in centipawns; lower is better) for each transformation type. It highlights the extreme sensitivity of GPT-5.1 to board rotations, suggesting a reliance on absolute coordinate features rather than relative piece configurations: its learned heuristics collapse under non-canonical orientations.
| Model | Color Swap | Rotation | Mirror | Similarity | Format | Avg. Error |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 198.78 | 626.56 | 195.85 | 318.25 | 62.41 | 280.37 |
| Kimi K2 Turbo | 367.67 | 726.21 | 94.94 | 329.38 | 36.22 | 310.88 |
| DeepSeek Chat | 254.46 | 996.88 | 235.18 | 399.58 | 121.48 | 401.52 |
| Gemini 2.5 Flash | 776.80 | 854.60 | 747.48 | 559.34 | 260.56 | 639.75 |
| GPT-5.1 | 217.49 | 2531.28 | 181.08 | 297.99 | 48.11 | 655.19 |
| Grok 4-1-Fast | 785.95 | 1046.98 | 516.88 | 767.25 | 476.37 | 718.69 |
Case Study: The Brittle Specialist (GPT-5.1 Paradox)
GPT-5.1, despite being the second most accurate model with 362.17 cp error against Stockfish, is paradoxically the second least stable, showing an alarming 655.19 cp error across transformations.
This behavior is characteristic of overfitting to canonical training data. While GPT-5.1 excels at memorized patterns for standard openings and puzzles, it severely lacks the underlying geometric logic to handle rotated or transformed board states. This suggests a critical vulnerability where its learned heuristics collapse under novel, yet semantically identical, input representations.
In contrast, models like Claude Sonnet 4.5 and Kimi K2 Turbo exhibit "Robust Reasoner" characteristics, combining high accuracy with high stability, indicating a more generalized understanding of chess mechanics.
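For readers who want to reproduce the two axes of this comparison from their own evaluation logs, the sketch below (field names are illustrative, and the paper's exact aggregation may differ) computes an accuracy score against the oracle and a stability score across transformations.

```python
# Sketch of the two axes behind the "Brittle Specialist" paradox:
# accuracy  = mean |model - Stockfish| on canonical positions,
# stability = mean |model(original) - model(transformed)| across variants.
# Record field names are assumptions, not the paper's schema.
from statistics import mean

def accuracy_cp(records):
    """records: iterable of dicts with 'model_eval' and 'stockfish_eval' (cp)."""
    return mean(abs(r["model_eval"] - r["stockfish_eval"]) for r in records)

def stability_cp(records):
    """records: iterable of dicts with 'original_eval' and a list
    'transformed_evals' holding the model's scores for each variant."""
    return mean(
        abs(t - r["original_eval"])
        for r in records
        for t in r["transformed_evals"]
    )

# A model can score well on accuracy_cp (e.g., ~362 cp for GPT-5.1) while
# scoring poorly on stability_cp (~655 cp), which is the reported paradox.
```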
Your Path to Robust AI Implementation
A structured approach to integrate AI systems that prioritize both accuracy and geometric stability.
Phase 1: Deep Dive & Assessment
Conduct a comprehensive audit of existing data pipelines and reasoning tasks. Identify areas where geometric stability and logical consistency are critical for robust AI performance.
Phase 2: Custom Framework Development
Design and implement a tailored Geometric Stability Framework. This includes custom data augmentation strategies and architectural optimizations to ensure invariant internal representations.
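One plausible way to operationalize such invariance during fine-tuning, stated in our own notation rather than the paper's, is a consistency-regularized objective that penalizes disagreement between a position and its transformed variants (with an appropriate sign correction for color swaps):

```latex
% Illustrative objective: f_theta = model evaluator, E = oracle evaluation,
% G = group of board transformations, lambda = consistency weight.
\[
  \mathcal{L}(\theta) =
  \underbrace{\mathbb{E}_{x}\big[\,\lvert f_\theta(x) - E(x) \rvert\,\big]}_{\text{accuracy term}}
  \;+\;
  \lambda\,
  \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{T \sim G}\big[\,\lvert f_\theta(x) - f_\theta(T(x)) \rvert\,\big]}_{\text{geometric consistency term}}
\]
```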
Phase 3: Robust Model Training & Evaluation
Train and fine-tune LLMs with an emphasis on consistency, tactical reasoning, and hallucination rejection. Rigorously evaluate against adversarial examples and semantically equivalent input transformations.
Phase 4: Deployment & Continuous Monitoring
Deploy robust AI systems into production environments. Establish continuous monitoring for performance, consistency drift, and new adversarial patterns to maintain long-term reliability.
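As a minimal sketch of what consistency-drift monitoring could look like in practice (thresholds, probe sets, and function names here are deployment assumptions, not values from the paper), a fixed probe set can be re-scored on a schedule and compared against a recorded baseline:

```python
# Illustrative drift check: re-run a fixed probe set of positions through the
# deployed model and flag when the mean transformation error exceeds the
# baseline by a chosen margin. `position_error_cp` could be a per-position
# consistency metric such as the consistency_error helper sketched earlier.

def check_consistency_drift(probe_fens, position_error_cp,
                            baseline_cp, margin_cp=50.0):
    """Return (current mean error in cp, whether it drifted past the margin)."""
    errors = [position_error_cp(fen) for fen in probe_fens]
    current = sum(errors) / len(errors)
    return current, current > baseline_cp + margin_cp
```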
Ready to Build Robust AI?
Our experts are ready to help you navigate the complexities of AI, ensuring your systems are not just accurate, but also stable, logical, and safe for enterprise deployment.