
X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

Unlocking the True Reasoning Capacity of LLMs with Structured, Verified Probes

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.

Key Executive Takeaways

X-RAY provides unprecedented clarity into LLM reasoning capabilities, moving beyond surface-level metrics to offer actionable insights for enterprise AI adoption.

  • Reasoning Accuracy Boost
  • Contamination Resistance
  • Interpretable Failure Modes
  • Structural Robustness Gains

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Structured Reasoning
Formal Verification
Probing Methodology
Experimental Results

Unpacking LLM Reasoning

Traditional LLM evaluations often conflate pattern matching with true reasoning. X-RAY redefines reasoning capacity in terms of "extractable structure": how well an LLM can understand and manipulate underlying problem constraints, dependencies, and solution-space geometry, rather than merely recognizing familiar patterns. This distinction is crucial for tasks requiring robustness to novel conditions and multi-step transformations, where simple pattern matching fails.
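The refinement-versus-restructuring distinction can be pictured with a toy constraint system over a small integer grid (an illustrative sketch, not the paper's actual probe generator): refinement adds a condition that only filters the existing solution set, while restructuring changes the constraint's form and reshapes the set entirely.

```python
from itertools import product

def solutions(constraints, grid_range=range(-5, 6)):
    """Enumerate integer points satisfying every constraint on a small 2D grid."""
    return {(x, y) for x, y in product(grid_range, repeat=2)
            if all(c(x, y) for c in constraints)}

# Base problem: points on the line x + y = 4.
base = [lambda x, y: x + y == 4]

# Constraint refinement: an extra condition shrinks the same solution set.
refined = base + [lambda x, y: x >= 0 and y >= 0]

# Solution-space restructuring: the constraint's form changes (line -> circle).
restructured = [lambda x, y: x * x + y * y == 25]

assert solutions(refined) <= solutions(base)             # refinement only filters
assert not (solutions(restructured) <= solutions(base))  # restructuring reshapes
```

In this miniature, a model that merely pattern-matched the base problem could still handle the refined variant, but the restructured one demands re-deriving the solution set from scratch, which mirrors the asymmetry the paper reports.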

The Power of Formal Verification

X-RAY utilizes formal verification to ensure that all generated reasoning probes have unambiguous semantics and reliable ground truth. This eliminates issues like annotation noise, latent ambiguities, and uncontrolled surface cues common in existing benchmarks. Formal verification guarantees correctness and well-posedness, making measurements truly reflective of reasoning capacity, not dataset artifacts or contamination.
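What "well-posedness" buys can be sketched with a brute-force checker over a finite domain (the paper relies on formal solvers; this hypothetical `verify_probe` helper is only illustrative): a probe is accepted only if it has exactly one solution, so the ground truth is guaranteed rather than annotated.

```python
def verify_probe(constraints, domain):
    """Accept a probe only if it is well-posed: exactly one solution in the domain."""
    sols = [v for v in domain if all(c(v) for c in constraints)]
    if len(sols) != 1:
        raise ValueError(f"ill-posed probe: {len(sols)} solutions")
    return sols[0]  # verified ground truth

# A probe asking for an integer x with 2x + 3 = 11 and x > 0: unique, so accepted.
ground_truth = verify_probe([lambda x: 2 * x + 3 == 11, lambda x: x > 0],
                            range(-100, 101))
assert ground_truth == 4

# An ambiguous probe (x^2 = 9 has two roots) is rejected before it reaches a model.
try:
    verify_probe([lambda x: x * x == 9], range(-10, 11))
except ValueError:
    pass  # ill-posed probes never enter the benchmark
```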

X-RAY's Probing Methodology

The X-RAY framework employs a sophisticated pipeline including autoformalization, difficulty quantification, controlled calibration, and online evaluation. Probes are generated via formal structural transformations, systematically increasing complexity along dimensions like constraint interaction depth or solution-space transformation. This controlled environment allows observed performance changes to be directly attributed to reasoning demands, offering fine-grained analysis of failure modes.
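Controlled calibration can be pictured as a probe family parameterized by reasoning depth, where each added step introduces a dependent constraint and the ground truth is computed alongside the statement. This is a hypothetical toy generator, not X-RAY's actual tooling, but it shows how difficulty becomes a dial rather than an annotation:

```python
def chained_probe(depth, seed=2):
    """Build a probe whose answer requires composing `depth` dependent steps:
    x1 = x0 + 1, x2 = x1 + 2, ..., x_depth = x_{depth-1} + depth."""
    value = seed
    for i in range(1, depth + 1):
        value += i  # ground truth tracked step by step
    steps = "; ".join(f"x{i} = x{i-1} + {i}" for i in range(1, depth + 1))
    statement = f"x0 = {seed}; {steps}; find x{depth}"
    return statement, value

# Increasing `depth` increases constraint-interaction depth while holding
# everything else fixed, so performance drops are attributable to depth alone.
stmt, answer = chained_probe(3)  # ground truth: 2 + 1 + 2 + 3 = 8
```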

Key Findings & Model Behaviors

Experiments with state-of-the-art LLMs reveal a systematic asymmetry: models are robust to constraint refinement but degrade sharply under solution-space restructuring. GPT-5 demonstrates strong cross-domain robustness, while models such as o4-mini show high variance and task-dependent instability. Calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and expose structurally interpretable failure modes.

Enterprise Process Flow: X-RAY Methodology

Natural Language Problem
Autoformalization (PCode)
Difficulty Quantification
Controlled Calibration
Formal Verification
Online Evaluation & Mapping
GLM-4.1V-9B-Thinking Success Rate on GSM8K (after CoT Training)

Formalized Chain-of-Thought (CoT) training significantly boosts model performance on structured reasoning tasks, demonstrating that verified supervision enhances generalizable reasoning mechanisms.

Structured Reasoning vs. Pattern Matching

Structured Reasoning (X-RAY's Focus)
  • Robust to novel combinations of conditions and dependencies.
  • Extracts and recomposes latent constraints.
  • Generalizes effectively beyond seen instances.
  • Failure modes are structurally interpretable.
  • Enabled by formal verification and calibrated probes.

Pattern Matching (Traditional LLMs)
  • Succeeds by matching familiar templates and surface forms.
  • Limited insight into underlying problem structure.
  • Breaks down under novel or complex conditions.
  • Conflates structural ability with surface cues.
  • Often leads to opaque failure modes.

Case Study: 1D Ice-Puck Collision Problem

The paper uses a 1D ice-puck collision problem to demonstrate stepwise reasoning, where LLMs must enforce global structural consistency. Asked directly, even a powerful LLM like GPT-5 produces an incorrect final answer of {-4.146, 2.073} N·s, violating momentum conservation: a failure to enforce a global conservation constraint when recomposing intermediate results. Under X-RAY's structured regime with formalized code, however, GPT-4o answers each sub-question correctly and produces a globally consistent solution. This highlights the framework's ability to detect hidden reasoning failures and to provide verified intermediate supervision for robust problem-solving.
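The global constraint that failed here can be checked mechanically. A short sketch, using made-up puck masses and velocities (not the paper's numbers) and assuming a perfectly elastic collision, shows how a verifier enforces that the two impulses cancel and total momentum is conserved:

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Final velocities of a 1D perfectly elastic collision."""
    v1f = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    v2f = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return v1f, v2f

# Hypothetical pucks: 0.5 kg at +3 m/s strikes an identical 0.5 kg puck at rest.
m1, v1, m2, v2 = 0.5, 3.0, 0.5, 0.0
v1f, v2f = elastic_collision_1d(m1, v1, m2, v2)

J1 = m1 * (v1f - v1)  # impulse on puck 1, in N·s
J2 = m2 * (v2f - v2)  # impulse on puck 2, in N·s

# The global checks a formal verifier applies to any candidate answer:
assert abs(J1 + J2) < 1e-9                                    # impulses cancel
assert abs((m1 * v1f + m2 * v2f) - (m1 * v1 + m2 * v2)) < 1e-9  # momentum conserved
```

An answer pair like {-4.146, 2.073} N·s fails the first check immediately, which is exactly the kind of hidden inconsistency the structured regime surfaces.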

Calculate Your Potential AI ROI

See how leveraging structured AI reasoning can translate into tangible operational savings for your enterprise.


Your AI Reasoning Roadmap

A structured approach to integrating X-RAY's insights into your enterprise AI strategy.

Phase 01: Reasoning Capability Audit

Utilize X-RAY's framework to conduct a deep audit of your existing LLM's reasoning strengths and weaknesses across critical structural dimensions.

Phase 02: Targeted Model Fine-tuning

Leverage solver-verified Chain-of-Thought (CoT) traces for targeted fine-tuning, strengthening brittle reasoning operations and improving generalizable mechanisms.

Phase 03: Continuous, Verified Evaluation

Implement X-RAY's online evaluation to monitor model performance with calibrated probes, ensuring robustness and preventing performance degradation over time.

Phase 04: Strategic AI Deployment

Deploy LLM solutions with confidence, knowing their reasoning capacities are precisely mapped and continuously verified for mission-critical tasks.

Ready to Transform Your Enterprise AI?

Book a personalized consultation with our AI specialists to explore how X-RAY's insights can drive your strategic initiatives.
