AI Benchmarking & LLM Evaluation
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
This paper introduces QSTRBench, a benchmark for evaluating how well Large Language Models (LLMs) reason with qualitative spatial and temporal calculi (QSTR). The benchmark covers several reasoning tasks (composition, converse, conceptual neighborhoods) across nine QSTR calculi, including Point Algebra (PA), Allen's Interval Algebra (IA), RCC-8, and RCC-22, whose conceptual neighborhood is published here for the first time. Experiments with 32 LLMs show that while models perform better than guessing, none consistently achieves full accuracy. Performance varies significantly by calculus, with PA the easiest and RCC-22 the most difficult. The benchmark uses diverse prompt variations (prefix/infix notation; words, symbols, or nonce terms; schematic diagrams) to test model robustness, and the results reveal inconsistencies and a reliance on training data rather than genuine reasoning. The benchmark and results are openly released to facilitate further research.
Executive Impact: Key Findings & Metrics
Our in-depth analysis of the QSTRBench study reveals critical performance indicators for current LLM capabilities in qualitative reasoning.
Deep Analysis & Enterprise Applications
Qualitative Spatial & Temporal Reasoning (QSTR)
QSTR is the field concerned with representing and reasoning about qualitative spatial and temporal information. It uses symbolic relations (e.g., 'on', 'before', 'part of') rather than precise numerical coordinates or timestamps, and it defines relation algebras for reasoning formally about configurations of entities. The core inference tasks are converse relations, composition of relations, and conceptual neighborhoods, and the benchmark evaluates LLMs on each of them.
Composition Tables (CT)
Composition in QSTR means inferring the relation R3(x,z) from two given relations R1(x,y) and R2(y,z). Because the premises do not always determine a single base relation, the result may be a disjunction of base relations. A composition table (CT) records the results for all combinations of base relations within a particular calculus. Evaluating LLMs on CT tasks tests their ability to perform multi-step relational inference based on the defined algebra.
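As a small illustration of what a composition table contains, the sketch below writes out the table for Point Algebra, the simplest calculus mentioned above, and cross-checks it by brute force over a three-element integer domain. The names `PA_COMPOSITION`, `relation`, `derive_table`, and `compose` are chosen for this example and are not taken from QSTRBench.

```python
from itertools import product

BASE = ("<", "=", ">")
ALL = frozenset(BASE)

# Hand-written Point Algebra composition table: (R1, R2) -> possible R3.
PA_COMPOSITION = {
    ("<", "<"): frozenset({"<"}),
    ("<", "="): frozenset({"<"}),
    ("<", ">"): ALL,  # x < y and y > z leaves x vs. z undetermined
    ("=", "<"): frozenset({"<"}),
    ("=", "="): frozenset({"="}),
    ("=", ">"): frozenset({">"}),
    (">", "<"): ALL,  # x > y and y < z leaves x vs. z undetermined
    (">", "="): frozenset({">"}),
    (">", ">"): frozenset({">"}),
}


def relation(a, b):
    """Return the PA base relation that holds between two numbers."""
    if a < b:
        return "<"
    if a > b:
        return ">"
    return "="


def derive_table(domain=range(3)):
    """Rebuild the table by enumerating every triple (x, y, z) in a small domain."""
    table = {pair: set() for pair in product(BASE, repeat=2)}
    for x, y, z in product(domain, repeat=3):
        table[(relation(x, y), relation(y, z))].add(relation(x, z))
    return {pair: frozenset(rels) for pair, rels in table.items()}


def compose(r1, r2):
    """Look up the set of relations R3(x, z) allowed by R1(x, y) and R2(y, z)."""
    return PA_COMPOSITION[(r1, r2)]


# The enumerated table agrees with the hand-written one.
assert derive_table() == PA_COMPOSITION
```

A composition question then amounts to checking whether a model's answer equals `compose(r1, r2)`. For richer calculi such as IA or RCC-8 the table is larger and is normally taken from the literature rather than enumerated this way.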
Conceptual Neighborhoods (CN)
The conceptual neighborhood (CN) of a relation defines a set of relations that can be directly reached from it through continuous deformation or translation of the entities involved. This concept is crucial for understanding how qualitative relations evolve and provides insights into the 'closeness' of different spatial or temporal configurations. The RCC-22 CN is published for the first time in this paper.
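A conceptual neighborhood can be represented as an undirected graph whose edges connect relations that can change directly into one another. The sketch below uses Point Algebra, where a point moving continuously along a line cannot go from `<` to `>` without passing through `=`. It is an illustration only: the graph and the helper names `are_neighbors` and `neighborhood_distance` are assumptions of this example, and the RCC-22 neighborhood from the paper is not reproduced here.

```python
from collections import deque

# Conceptual neighborhood graph for Point Algebra: "<" and ">" are not
# direct neighbors because any continuous change between them passes "=".
PA_NEIGHBORS = {
    "<": {"="},
    "=": {"<", ">"},
    ">": {"="},
}


def are_neighbors(r1, r2, graph=PA_NEIGHBORS):
    """True when r2 can be reached from r1 in a single continuous change."""
    return r2 in graph[r1]


def neighborhood_distance(start, goal, graph=PA_NEIGHBORS):
    """Breadth-first search for the fewest neighborhood steps from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        rel, dist = queue.popleft()
        if rel == goal:
            return dist
        for nxt in graph[rel]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is unreachable from start
```

The path length gives one simple notion of the "closeness" of two relations: in this graph `<` and `=` are one step apart, while `<` and `>` are two steps apart.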
Converse Relations
The converse of a relation R(x,y) is the relation R(y,x) obtained by reversing the arguments. It captures the same spatial configuration or temporal ordering from the opposite perspective. For example, in RCC-8, TPPi is the converse of TPP. Testing converse relations is a fundamental task in relational reasoning and a baseline for understanding an LLM's grasp of relational symmetry and asymmetry.
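The converse operation can be sketched as a lookup table over the eight RCC-8 base relations. In the example below, `RCC8_CONVERSE`, `converse`, and `converse_of_set` are illustrative names, not part of the benchmark. The final assertion checks that applying the converse twice returns the original relation.

```python
# Converse of each RCC-8 base relation. DC, EC, PO and EQ are symmetric;
# the proper-part relations swap with their inverses.
RCC8_CONVERSE = {
    "DC": "DC",
    "EC": "EC",
    "PO": "PO",
    "EQ": "EQ",
    "TPP": "TPPi",
    "TPPi": "TPP",
    "NTPP": "NTPPi",
    "NTPPi": "NTPP",
}


def converse(rel):
    """Return the relation R(y, x) corresponding to R(x, y)."""
    return RCC8_CONVERSE[rel]


def converse_of_set(rels):
    """Converse of a disjunction of base relations, taken element by element."""
    return frozenset(converse(r) for r in rels)


# Relations that are their own converse.
SYMMETRIC = sorted(r for r in RCC8_CONVERSE if converse(r) == r)

# Taking the converse twice gives back the original relation.
assert all(converse(converse(r)) == r for r in RCC8_CONVERSE)
```

Distinguishing the symmetric relations in `SYMMETRIC` from the asymmetric pairs is exactly the symmetry-versus-asymmetry judgment that converse questions probe.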
LLM Robustness & Reasoning vs. Pattern Matching
The benchmark includes extensive prompt variations (prefix/infix, words/symbols/nonce terms, schematic diagrams, anonymized relation names, swapped descriptions) to probe whether LLMs genuinely reason or merely replicate patterns from their training data. Inconsistent performance across these variations, or when relations are anonymized, suggests a reliance on surface-level patterns rather than deep conceptual understanding. The study highlights that even frontier models exhibit non-deterministic and inconsistent reasoning behavior, often failing on unseen or subtly varied problems.
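To make the idea of prompt variation concrete, the sketch below renders one Point Algebra composition question in each combination of vocabulary (symbols, words, nonce terms) and notation (prefix, infix), then checks whether a set of answers agrees across variants. The templates, the nonce words, and the function names are hypothetical and chosen for this example; they are not the benchmark's actual prompts. A real prompt would also need to define the nonce vocabulary for the model.

```python
from itertools import product

# Hypothetical surface forms for the three PA base relations.
SURFACE_FORMS = {
    "symbols": {"<": "<", "=": "=", ">": ">"},
    "words": {"<": "before", "=": "equal to", ">": "after"},
    "nonce": {"<": "blick", "=": "dax", ">": "wug"},  # made-up terms
}


def render(rel, a, b, vocabulary, notation):
    """Render the statement rel(a, b) in the requested vocabulary and notation."""
    name = SURFACE_FORMS[vocabulary][rel]
    if notation == "prefix":
        return f"{name}({a}, {b})"
    return f"{a} {name} {b}"


def prompt_variants(r1, r2):
    """Build one composition question per (vocabulary, notation) pair."""
    variants = {}
    for vocabulary, notation in product(SURFACE_FORMS, ("prefix", "infix")):
        first = render(r1, "x", "y", vocabulary, notation)
        second = render(r2, "y", "z", vocabulary, notation)
        variants[(vocabulary, notation)] = (
            f"Given {first} and {second}, "
            "which relations can hold between x and z?"
        )
    return variants


def is_consistent(answers):
    """True when every variant yielded the same set of relations."""
    return len({frozenset(a) for a in answers.values()}) <= 1
```

All variants of a question share the same correct answer, so any disagreement flagged by `is_consistent` points to sensitivity to surface form rather than to the underlying logic.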
Enterprise Process Flow
| Features | Our Solution | Traditional Approaches |
|---|---|---|
| Consistency | | |
| Accuracy | | |
| Adaptability to new phrasing | | |
| Handling of 'Nonce' terms | | |
| Interpretation of Schematic Diagrams | | |
| Cost & Latency | | |
| Reliability | | |
The RCC-22 Challenge
Problem: RCC-22, despite having fewer base relations than INDU, proved to be the most difficult calculus for LLMs. Its conceptual neighborhood (CN) was published for the first time in this paper, meaning LLMs had no prior training data on it.
Solution: The benchmark rigorously tested RCC-22 across composition, converse, and CN tasks with various prompt styles.
Result: Even top models like GPT-5.2 struggled significantly with RCC-22, especially on CN questions, suggesting they lack a deep understanding of its complex relations involving concave regions. GPT-5.2 used an order of magnitude more tokens for RCC-22 questions than for simpler calculi, yet still answered inconsistently and failed to identify specific CN relations correctly.
Your Enterprise AI Implementation Roadmap
A phased approach to integrate advanced qualitative reasoning capabilities into your business processes.
Phase 1: Discovery & Strategy
Conduct a comprehensive assessment of existing qualitative reasoning needs and data sources. Develop a tailored strategy for QSTRBench integration and LLM fine-tuning.
Phase 2: Integration & Customization
Implement QSTRBench-derived LLM solutions, integrating with enterprise systems. Customize reasoning models for specific domain ontologies and data types.
Phase 3: Optimization & Scaling
Monitor LLM performance on QSTR tasks, continuously optimizing for accuracy and efficiency. Scale successful deployments across relevant departments and use cases.
Ready to Transform Your Enterprise?
Unlock the full potential of AI-driven qualitative reasoning. Schedule a personalized consultation to see how QSTRBench insights can propel your business forward.