Enterprise AI Analysis: QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

AI Benchmarking & LLM Evaluation

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

This paper introduces QSTRBench, a benchmark for evaluating how well large language models (LLMs) reason with qualitative spatial and temporal calculi (QSTR). It covers three reasoning tasks (composition, converse, and conceptual neighborhoods) across nine QSTR calculi, including Point Algebra (PA), Allen's Interval Algebra (IA), RCC-8, and RCC-22, whose conceptual neighborhood is published here for the first time. Experiments with 32 LLMs show that models perform better than guessing, but none consistently achieves full accuracy. Performance varies significantly by calculus, with PA the easiest and RCC-22 the most difficult. To test robustness and separate reasoning from recall, the benchmark varies its prompts (prefix/infix notation, words/symbols/nonce terms, schematic diagrams). The results reveal inconsistencies across these variations and a reliance on training data rather than genuine reasoning. The benchmark and results are openly released to facilitate further research.

Executive Impact: Key Findings & Metrics

Our in-depth analysis of the QSTRBench study reveals critical performance indicators for current LLM capabilities in qualitative reasoning.

9 QSTR Calculi Tested

Deep Analysis & Enterprise Applications


Qualitative Spatial & Temporal Reasoning (QSTR)

QSTR is a field concerned with representing and reasoning about qualitative spatial and temporal information. It uses symbolic relations (e.g., 'on', 'before', 'part of') rather than precise numerical coordinates or timestamps. Each calculus defines a relation algebra for reasoning formally about configurations of entities, and its core inference tasks are computing converse relations, composing relations, and identifying conceptual neighborhoods. The benchmark evaluates LLMs on these three tasks.
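To make the contrast between numerical and qualitative information concrete, the sketch below maps two numeric intervals to one of the thirteen base relations of Allen's Interval Algebra. The function name `allen_relation` and the relation labels are assumptions chosen for illustration; the benchmark may label the relations differently.

```python
def allen_relation(a, b):
    """Return the Allen base relation that holds between intervals a and b.

    Each interval is a (start, end) pair with start < end.
    The labels are illustrative; other sources abbreviate them (b, m, o, ...).
    """
    (a1, a2), (b1, b2) = a, b
    if a1 >= a2 or b1 >= b2:
        raise ValueError("intervals must satisfy start < end")

    # Disjoint or touching at a single endpoint.
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"

    # From here on the intervals share more than a single point.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


print(allen_relation((1, 3), (3, 5)))  # meets
print(allen_relation((2, 3), (1, 5)))  # during
```

Once the relation is extracted, the exact endpoints no longer matter: any two intervals in the "meets" configuration are treated identically by the calculus.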

Composition Tables (CT)

Composition in QSTR refers to inferring a relationship R3(x,z) from two given relationships R1(x,y) and R2(y,z). A composition table (CT) records the results for all combinations of base relations within a particular calculus. Evaluating LLMs on CT tasks tests their ability to perform multi-step relational inference based on the defined algebra.
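As a minimal sketch of a composition table, consider the Point Algebra with its three base relations `<`, `=`, and `>`. The names `COMPOSE` and `compose` are illustrative and not taken from the benchmark. Note that a composition can yield more than one possible base relation: knowing x < y and y > z says nothing about how x and z compare.

```python
from itertools import product

BASE = ("<", "=", ">")
ALL = frozenset(BASE)

# Composition table for the Point Algebra: COMPOSE[(r1, r2)] is the set of
# relations that may hold between x and z, given x r1 y and y r2 z.
COMPOSE = {
    ("<", "<"): frozenset({"<"}),
    ("<", "="): frozenset({"<"}),
    ("<", ">"): ALL,
    ("=", "<"): frozenset({"<"}),
    ("=", "="): frozenset({"="}),
    ("=", ">"): frozenset({">"}),
    (">", "<"): ALL,
    (">", "="): frozenset({">"}),
    (">", ">"): frozenset({">"}),
}


def compose(rs1, rs2):
    """Compose two sets of base relations: union of the table entries over all pairs."""
    result = set()
    for r1, r2 in product(rs1, rs2):
        result |= COMPOSE[(r1, r2)]
    return frozenset(result)


print(sorted(compose({"<"}, {"<"})))  # ['<']
print(sorted(compose({"<"}, {">"})))  # ['<', '=', '>']
```

A CT question in this setting amounts to asking for one cell of the table, and a strict grader would require the full set of possible relations, not just one member of it.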

Conceptual Neighborhoods (CN)

The conceptual neighborhood (CN) of a relation defines a set of relations that can be directly reached from it through continuous deformation or translation of the entities involved. This concept is crucial for understanding how qualitative relations evolve and provides insights into the 'closeness' of different spatial or temporal configurations. The RCC-22 CN is published for the first time in this paper.
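The idea can be illustrated with the Point Algebra, whose neighborhood graph is small enough to state with confidence: a point that moves continuously from before another point to after it must pass through equality. The sketch below encodes that graph and measures how many direct transitions separate two relations. The names `NEIGHBORS` and `neighborhood_distance` are hypothetical, and the much larger RCC-22 graph from the paper is not reproduced here.

```python
from collections import deque

# Conceptual neighborhood graph of the Point Algebra. "<" and ">" are not
# direct neighbors because any continuous change between them crosses "=".
NEIGHBORS = {
    "<": {"="},
    "=": {"<", ">"},
    ">": {"="},
}


def are_neighbors(r1, r2):
    """True when r2 can be reached directly from r1."""
    return r2 in NEIGHBORS[r1]


def neighborhood_distance(start, goal):
    """Breadth-first search for the fewest direct transitions from start to goal."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        relation, steps = queue.popleft()
        if relation == goal:
            return steps
        for nxt in NEIGHBORS[relation]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None  # goal is unreachable from start


print(are_neighbors("<", ">"))          # False
print(neighborhood_distance("<", ">"))  # 2
```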

Converse Relations

The converse of a relation R(x,y) is the relation R(y,x) obtained by reversing the arguments. It captures the same spatial configuration or temporal ordering from the opposite perspective. For example, in RCC-8, TPPi is the converse of TPP. Testing converse relations is a fundamental task in relational reasoning and a baseline for understanding an LLM's grasp of relational symmetry and asymmetry.
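For RCC-8 the converse operation is a simple lookup, sketched below. The dictionary name `CONVERSE` and the helper `converse` are illustrative. Four of the eight base relations are their own converse, which is the symmetry an LLM is expected to recognize, while the proper-part relations swap with their inverses.

```python
# Converse of each RCC-8 base relation. DC, EC, PO and EQ are symmetric;
# TPP/TPPi and NTPP/NTPPi are converse pairs.
CONVERSE = {
    "DC": "DC",
    "EC": "EC",
    "PO": "PO",
    "EQ": "EQ",
    "TPP": "TPPi",
    "TPPi": "TPP",
    "NTPP": "NTPPi",
    "NTPPi": "NTPP",
}


def converse(relations):
    """Converse of a set of base relations, taken element by element."""
    return frozenset(CONVERSE[r] for r in relations)


# Relations that read the same from either argument order.
SYMMETRIC = sorted(r for r, c in CONVERSE.items() if r == c)

print(sorted(converse({"TPP", "DC"})))  # ['DC', 'TPPi']
print(SYMMETRIC)                        # ['DC', 'EC', 'EQ', 'PO']
```

Applying `converse` twice returns the original set, a property that can serve as a quick sanity check on any converse table.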

LLM Robustness & Reasoning vs. Pattern Matching

The benchmark includes extensive prompt variations (prefix/infix, words/symbols/nonce terms, schematic diagrams, anonymized relation names, swapped descriptions) to probe whether LLMs genuinely reason or merely replicate patterns from their training data. Inconsistent performance across these variations, or when relations are anonymized, suggests a reliance on surface-level patterns rather than deep conceptual understanding. The study highlights that even frontier models exhibit non-deterministic and inconsistent reasoning behavior, often failing on unseen or subtly varied problems.
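The combination of notation and vocabulary can be sketched as a small generator. Everything in the example is hypothetical: the surface strings, the nonce word "blick", and the helper names are invented for illustration and are not the benchmark's actual prompt wording.

```python
from itertools import product

# Hypothetical surface forms for the RCC-8 relation TPP. These strings are
# illustrative only and are not the benchmark's actual prompt wording.
SURFACE_FORMS = {
    "words": "tangential-proper-part-of",
    "symbols": "TPP",
    "nonce": "blick",  # made-up word with no prior meaning
}


def render(style, notation, x="x", y="y"):
    """Render one statement in the requested vocabulary style and notation."""
    rel = SURFACE_FORMS[style]
    if notation == "prefix":
        return f"{rel}({x}, {y})"
    if notation == "infix":
        return f"{x} {rel} {y}"
    raise ValueError(f"unknown notation: {notation}")


def all_variants(x="x", y="y"):
    """Every (style, notation) pair mapped to its rendered statement."""
    return {
        (style, notation): render(style, notation, x, y)
        for style, notation in product(SURFACE_FORMS, ("prefix", "infix"))
    }


def is_robust(answers):
    """True when a model gave the same answer for every variant it was asked."""
    return len(set(answers.values())) <= 1


for key, text in all_variants().items():
    print(key, "->", text)
```

A model that truly reasons over the relation should give the same answer for all six renderings, so a check like `is_robust` over its per-variant answers is one simple way to surface the inconsistency described above.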

RCC-22 Most Challenging Calculus for LLMs

Enterprise Process Flow

Define QSTR Calculi & Tasks (CT, CN, Converse)
Generate Diverse Prompts (Natural Lang, Symbols, Nonce, Schematic)
Test LLMs (32 Models, Commercial & Open-Weight)
Evaluate Accuracy (Strict & Guess Rates)
Analyze Performance by Calculus & Prompt Style
Publish Benchmark & Results
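
The accuracy evaluation step in the flow above could be sketched as follows. This summary does not define the strict and guess rates, so the sketch rests on two loudly stated assumptions: "strict" means the predicted set of relations matches the expected set exactly, and the chance baseline is a uniform random pick among all non-empty subsets of the base relations. The paper's actual definitions may differ.

```python
def strict_accuracy(predictions, gold):
    """Fraction of questions whose predicted relation set equals the gold set.

    ASSUMPTION: "strict" is taken to mean exact set equality; partial
    overlap earns no credit.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(set(p) == set(g) for p, g in zip(predictions, gold))
    return correct / len(gold)


def chance_rate(n_base_relations):
    """Probability that a uniform random non-empty subset is exactly right.

    ASSUMPTION: this is one possible guessing baseline, not necessarily
    the one used in the paper.
    """
    return 1 / (2 ** n_base_relations - 1)


gold = [{"<"}, {"<"}]
predictions = [{"<"}, {"<", "="}]  # second answer is a superset, so it fails
print(strict_accuracy(predictions, gold))  # 0.5
print(chance_rate(3))                      # 1/7 for the Point Algebra
```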

LLM Reasoning vs. Symbolic Reasoners

Feature comparison: LLMs vs. symbolic reasoners
Consistency
  • Symbolic reasoners: highly consistent across queries
Accuracy
  • Symbolic reasoners: 100% accurate for well-defined problems
Adaptability to new phrasing
  • LLMs: excel at interpreting varied natural language inputs
  • LLMs: can generalize to some extent with different descriptions
  • Symbolic reasoners: require precise symbolic input (autoformalization needed)
Handling of 'nonce' terms
  • LLMs: can perform better when forced to reason without relying on known terms (RCC-22 example)
Interpretation of schematic diagrams
  • LLMs: struggle with schematic diagrams compared to natural language
  • Symbolic reasoners: diagrammatic reasoners excel
Cost & latency
  • LLMs: high cost and latency for large reasoning models
  • Symbolic reasoners: fast and low-cost for well-defined problems
Reliability
  • LLMs: non-deterministic and inconsistent results, even with temperature=0
  • Symbolic reasoners: deterministic and reliable results
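
To illustrate why a symbolic reasoner returns the same answer on every run, here is a minimal sketch of algebraic closure (path consistency) over a Point Algebra constraint network. It is a textbook-style illustration under stated assumptions, not the reasoner used in the study, and the names `algebraic_closure`, `COMP`, and `CONV` are hypothetical.

```python
from itertools import product

ALL = frozenset("<=>")
CONV = {"<": ">", "=": "=", ">": "<"}
# Point Algebra composition table: COMP[(r1, r2)] for x r1 y and y r2 z.
COMP = {
    ("<", "<"): frozenset("<"), ("<", "="): frozenset("<"), ("<", ">"): ALL,
    ("=", "<"): frozenset("<"), ("=", "="): frozenset("="), ("=", ">"): frozenset(">"),
    (">", "<"): ALL, (">", "="): frozenset(">"), (">", ">"): frozenset(">"),
}


def compose(rs1, rs2):
    """Union of table entries over every pair of base relations."""
    out = set()
    for r1, r2 in product(rs1, rs2):
        out |= COMP[(r1, r2)]
    return frozenset(out)


def converse(rs):
    return frozenset(CONV[r] for r in rs)


def algebraic_closure(n, constraints):
    """Refine an n-variable network until no composition can prune further.

    constraints maps (i, j) to a set of allowed base relations.
    Returns the refined matrix, or None when an empty relation is derived.
    """
    net = [[ALL for _ in range(n)] for _ in range(n)]
    for i in range(n):
        net[i][i] = frozenset("=")
    for (i, j), rs in constraints.items():
        net[i][j] = net[i][j] & frozenset(rs)
        net[j][i] = net[j][i] & converse(rs)
    if any(not cell for row in net for cell in row):
        return None  # the stated constraints already contradict each other

    changed = True
    while changed:
        changed = False
        for i, k, j in product(range(n), repeat=3):
            refined = net[i][j] & compose(net[i][k], net[k][j])
            if refined != net[i][j]:
                if not refined:
                    return None
                net[i][j] = refined
                net[j][i] = converse(refined)
                changed = True
    return net


net = algebraic_closure(3, {(0, 1): {"<"}, (1, 2): {"<"}})
print(sorted(net[0][2]))  # ['<']
print(algebraic_closure(3, {(0, 1): {"<"}, (1, 2): {"<"}, (2, 0): {"<"}}))  # None
```

The procedure is pure table lookup and set intersection, so identical input always yields identical output. The trade-off noted in the comparison also shows here: the input must already be in symbolic form.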

The RCC-22 Challenge

Problem: RCC-22, despite having fewer base relations than INDU, proved to be the most difficult calculus for LLMs. Its conceptual neighborhood (CN) was published for the first time in this paper, meaning LLMs had no prior training data on it.

Solution: The benchmark rigorously tested RCC-22 across composition, converse, and CN tasks with various prompt styles.

Result: Even top models such as GPT-5.2 struggled significantly with RCC-22, especially on CN questions, suggesting they lack a deep understanding of its complex relations involving concave regions. GPT-5.2 used an order of magnitude more tokens on RCC-22 questions than on simpler calculi, yet it was still inconsistent and failed to identify specific CN relations correctly.

$77.28 Cost per run for GPT-5.2 (High Reasoning)


Your Enterprise AI Implementation Roadmap

A phased approach to integrate advanced qualitative reasoning capabilities into your business processes.

Phase 1: Discovery & Strategy

Conduct a comprehensive assessment of existing qualitative reasoning needs and data sources. Develop a tailored strategy for QSTRBench integration and LLM fine-tuning.

Phase 2: Integration & Customization

Implement QSTRBench-derived LLM solutions, integrating with enterprise systems. Customize reasoning models for specific domain ontologies and data types.

Phase 3: Optimization & Scaling

Monitor LLM performance on QSTR tasks, continuously optimizing for accuracy and efficiency. Scale successful deployments across relevant departments and use cases.
