ENTERPRISE AI ANALYSIS
Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning
This study developed and populated a novel reasoning benchmark (WnH) based on the Watson & Holmes tabletop detective game, revealing rapid improvement in AI reasoning capability relative to human performance.
Executive Impact
Leveraging naturalistic benchmarks to understand and enhance LLM reasoning for critical enterprise applications.
Deep Analysis & Enterprise Applications
The sections below present the specific findings from the research as enterprise-focused modules.
Introduction
Reasoning has long been regarded as a defining feature of human intelligence and, more recently, of artificial intelligence (AI) [1]. From early symbolic and rule-based systems to today's large language models (LLMs), progress in AI has often been measured by the extent to which machines can replicate or approximate human reasoning [2].
Unlike earlier engineered reasoning systems, which follow explicit logical rules, contemporary AI models acquire reasoning capabilities implicitly through exposure to massive text corpora, with reasoning emerging [2] as a by-product of language prediction [3]. Exploring how such reasoning arises, and how it can be reliably elicited or improved, has become essential for understanding the boundaries of LLM intelligence [4].
Background
This section outlines key perspectives and developments in reasoning and its evaluation. It reviews major theoretical accounts of human and machine reasoning, surveys existing reasoning benchmarks used to assess AI models' reasoning capabilities across domains, examines approaches to autograding open-ended reasoning performance, and concludes with a discussion of current benchmark limitations and gaps.
Philosophical approaches to reasoning typically present idealized accounts of how reasoning should operate to draw conclusions. A standard account classifies reasoning into three modes: deduction, induction, and abduction. Deduction and induction were already recognized and distinguished in ancient Greek philosophy, particularly in Aristotle's logic [6]; much later, in the 19th century, Charles Sanders Peirce [7] added abduction to this framework. The modes can be defined as follows (a toy illustration follows the list):
- Deduction: Reasoning following established canonical logical patterns.
- Induction: Inferences based on generalization from previous, ideally repeated, observations.
- Abduction: Inference to the best available explanation for the given facts.
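To make the three modes concrete, here is a minimal, hypothetical Python sketch; the facts, rules, and function names are illustrative inventions, not material from the study or the game.

```python
# Toy illustration of the three reasoning modes on detective-style facts.
# All facts, rules, and names here are hypothetical.

# Deduction: apply an established rule to a known fact.
def deduce(rule: dict, fact: str) -> str | None:
    """If the fact matches the rule's premise, the conclusion follows with certainty."""
    return rule["conclusion"] if fact == rule["premise"] else None

# Induction: generalize from repeated observations.
def induce(observations: list[str]) -> str | None:
    """If every observation agrees, propose it as a general (but fallible) rule."""
    if observations and all(o == observations[0] for o in observations):
        return f"Tentative general rule: {observations[0]}"
    return None

# Abduction: pick the hypothesis that best explains the evidence.
def abduce(evidence: set[str], hypotheses: dict[str, set[str]]) -> str:
    """Choose the hypothesis whose predicted traces overlap most with the evidence."""
    return max(hypotheses, key=lambda h: len(hypotheses[h] & evidence))

rule = {"premise": "suspect was at the scene", "conclusion": "suspect had opportunity"}
print(deduce(rule, "suspect was at the scene"))            # -> 'suspect had opportunity'
print(induce(["the butler lies under pressure"] * 3))      # -> tentative rule
print(abduce({"muddy boots", "open window"},
             {"burglar": {"open window", "muddy boots"},
              "butler": {"muddy boots"}}))                 # -> 'burglar'
```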
Experimental Methods
The Watson & Holmes (WnH) tabletop game [5] was adapted to assess the reasoning abilities of LLMs. We first describe the original form of the game, then its adaptation.
Watson & Holmes is a whodunnit game in which players compete to be the first to solve a mystery using reasoning skills such as deduction, induction, and abduction. Each play of the game involves a fresh case, which can be attempted only once.
The eleven cases used for performance evaluation had a mean of 15 locations (SD = 2). The mean word count per case (introduction plus location texts) was 2,900 (SD = 600). The number of questions per case ranged from two to five (mean = 3.4), for a total of 37 questions across the eleven cases.
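As a rough sketch of how such per-case statistics could be computed, assuming a simple `Case` record whose fields are our invention rather than the study's actual data format:

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Case:
    """Hypothetical representation of one WnH case; field names are assumptions."""
    introduction: str
    location_texts: list[str]   # one entry per visitable location
    questions: list[str]        # end-of-case questions

def word_count(case: Case) -> int:
    # Introduction plus location texts, matching the word-count definition above.
    return sum(len(t.split()) for t in [case.introduction, *case.location_texts])

def summarize(cases: list[Case]) -> dict:
    locations = [len(c.location_texts) for c in cases]
    words = [word_count(c) for c in cases]
    questions = [len(c.questions) for c in cases]
    return {
        "locations_mean": mean(locations), "locations_sd": stdev(locations),
        "words_mean": mean(words), "words_sd": stdev(words),
        "questions_mean": mean(questions), "questions_total": sum(questions),
    }
```

Run over the eleven evaluation cases, such a summary would reproduce the figures above (15 ± 2 locations, 2,900 ± 600 words, 37 questions in total).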
Results
Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with reasoning-oriented model architectures.
Systematic differences between AI models and humans that depended on features of the specific detection puzzle were largely absent, with two exceptions: model performance fell on longer cases (case lengths ranged from 1,900 to 4,000 words), and reasoning models showed an advantage at inductive reasoning in the early stages of case solving, when evidence was scant.
Discussion & Conclusion
This study developed and populated a novel reasoning benchmark (WnH) based on the Watson & Holmes tabletop detective game. Several key findings emerge.
First, expected AI performance on the benchmark improved from the lower quartile of our human comparison population (Computer Science undergraduates at a leading university) to the top 14% (at 95% confidence) over nine months of 2025. Roughly half of that improvement is incremental with model release date, and half can be attributed to a step change when reasoning-oriented models (rModels) were introduced.
Finally, the benchmark's utility for frontier LLMs is nearing its end, with saturated performance expected by the end of 2026. It should, however, remain useful for assessing small, cost-effective AI models.
Adapted Gameplay Flow for AI Reasoning Evaluation
| Model/Human Group | Overall Score | Estimated Human Percentile |
|---|---|---|
| Top rModel (o3-pro) | 2.14 | 99% |
| Top cModel (GPT-4.1) | 1.45 | 53% |
| Average Human | 1.43 | 50% |
| Worst Human (Player2) | 0.89 | 4% |
Notes: rModels show a significant lead over humans; cModels perform at the average human level.
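One plausible reading of the "Estimated Human Percentile" column, our assumption rather than a confirmed detail of the paper, is an empirical percentile against the human comparison group: the share of human scores at or below a given overall score.

```python
def human_percentile(score: float, human_scores: list[float]) -> float:
    """Empirical percentile: percentage of human scores at or below `score`.

    A sketch assumption; the study may instead use a fitted distribution.
    """
    if not human_scores:
        raise ValueError("need at least one human score")
    return 100.0 * sum(s <= score for s in human_scores) / len(human_scores)

# Hypothetical usage with made-up human scores:
humans = [0.89, 1.10, 1.43, 1.45, 1.60]
print(human_percentile(2.14, humans))  # -> 100.0 (above every sampled human)
```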
Impact of Case Length on LLM Reasoning
The study found that LLM reasoning performance decreases on longer cases (1,900-4,000 words), even well within the models' claimed context lengths. This suggests that claims of indefinitely expanding model contexts without performance degradation should be treated with caution: increased length already appears to degrade LLM reasoning.
Key Takeaway: Longer contexts adversely impact LLM reasoning, despite ever-larger claimed context windows.
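A minimal sketch of how one might probe this effect in an evaluation pipeline, bucketing cases by word count and comparing mean scores per bucket; the bucket edges and data shapes are illustrative assumptions:

```python
from collections import defaultdict
from statistics import mean

def score_by_length(results: list[tuple[int, float]],
                    edges: tuple[int, ...] = (1900, 2600, 3300, 4000)) -> dict[str, float]:
    """Group (case_word_count, score) pairs into length buckets; report mean score.

    Bucket edges here are illustrative; the study reports cases of 1,900-4,000 words.
    """
    buckets: dict[str, list[float]] = defaultdict(list)
    for words, score in results:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= words < hi or (hi == edges[-1] and words == hi):
                buckets[f"{lo}-{hi}"].append(score)
                break
    return {label: mean(scores) for label, scores in sorted(buckets.items())}

# Hypothetical results: a downward trend with case length would mirror the finding.
print(score_by_length([(2000, 1.9), (2500, 1.7), (3000, 1.4), (3900, 1.1)]))
```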
Calculate Your Potential AI ROI
Estimate the operational savings and reclaimed human hours from implementing advanced AI reasoning in your enterprise.
Your AI Reasoning Implementation Roadmap
A structured approach to integrating advanced LLM reasoning into your business operations.
Phase 1: Discovery & Strategy
Deep dive into your existing workflows, identify reasoning bottlenecks, and define key performance indicators (KPIs) for AI integration.
Phase 2: Data Preparation & Model Selection
Curate and structure your narrative data, select optimal LLM architectures (cModel vs. rModel) based on reasoning requirements, and configure prompts for WnH-style tasks.
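A hedged sketch of what such a Phase 2 configuration might look like; the field names, word cap, and prompt template below are placeholders, not recommendations from the study:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTaskConfig:
    """Hypothetical configuration for a WnH-style reasoning task."""
    model_family: str           # "cModel" (conventional) or "rModel" (reasoning-oriented)
    model_name: str             # placeholder identifier for your chosen model
    max_case_words: int = 4000  # cap informed by the length-degradation finding above
    prompt_template: str = (
        "You are a detective. Case so far:\n{case_text}\n\n"
        "Question: {question}\nExplain your reasoning step by step, then answer."
    )

config = ReasoningTaskConfig(model_family="rModel", model_name="example-reasoning-model")
prompt = config.prompt_template.format(case_text="...", question="Who had opportunity?")
```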
Phase 3: Prototype & Iteration
Develop initial AI-driven reasoning prototypes, conduct small-scale WnH benchmark runs, and iterate on prompt engineering and model fine-tuning for improved accuracy and nuance.
Phase 4: Scalable Deployment & Monitoring
Implement the AI reasoning system within your enterprise, establish automated grading and evaluation pipelines, and continuously monitor performance against human baselines and evolving benchmarks.
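As a sketch of the automated grading step, with the judge abstracted as a callable because neither the study's autograder nor a specific LLM API is detailed here; all names are assumptions, and the keyword judge is only a stand-in for an LLM-based rubric grader:

```python
from typing import Callable
from statistics import mean

# A "judge" maps (question, reference_answer, model_answer) to a score in [0, 1].
# In practice this could be a rubric-following LLM call; here it is abstracted.
Judge = Callable[[str, str, str], float]

def grade_case(qa_pairs: list[tuple[str, str]],
               model_answers: list[str],
               judge: Judge) -> float:
    """Mean judged score across a case's questions."""
    return mean(judge(q, ref, ans)
                for (q, ref), ans in zip(qa_pairs, model_answers, strict=True))

# Trivial keyword-overlap judge, purely as a stand-in for an LLM autograder:
def keyword_judge(question: str, reference: str, answer: str) -> float:
    ref, ans = set(reference.lower().split()), set(answer.lower().split())
    return len(ref & ans) / max(len(ref), 1)

print(grade_case([("Who did it?", "the butler")], ["It was the butler"], keyword_judge))
```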
Ready to Enhance Your Enterprise Reasoning with AI?
Book a personalized consultation with our AI experts to discuss how these insights apply to your unique business challenges.