
Enterprise AI Research Analysis

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

Authors: Shubham Parashar*, Blake Olson*, Sambhav Khurana*, Eric Li*, Hongyi Ling, James Caverlee, Shuiwang Ji

Publication Date: 18 Feb 2025

This analysis summarizes key findings from recent research on enhancing Large Language Model (LLM) capabilities through inference-time techniques for complex reasoning and planning. It critically evaluates their performance, computational costs, and inherent limitations across diverse tasks.

Executive Impact & Key Findings

Large Language Models (LLMs) face challenges in complex reasoning and planning tasks, despite their success in NLP. Inference-time techniques such as Chain-of-Thought (CoT), Self-Consistency (SC), Tree-of-Thought (ToT), and Reasoning as Planning (RAP) have emerged to enhance LLM capabilities without additional training by exploring intermediate steps. The paper introduces Sys2Bench, a comprehensive benchmark that evaluates existing inference-time techniques on 11 diverse tasks across 5 reasoning and planning categories, examining the tradeoff between computational cost and performance.

Key findings: simply scaling inference-time computation has limits, and no single inference-time technique performs consistently well across all reasoning and planning tasks. LLMs exhibit inherent biases and struggle with self-verification, so tree-search methods degrade as task complexity increases. A more strategic approach is needed to improve LLM reasoning, potentially combining reinforcement learning with inference-time methods, as hinted by models like DeepSeek-R1; diverse approaches are essential for holistic reasoning capabilities.


Deep Analysis & Enterprise Applications


Arithmetic Reasoning Insights

LLMs demonstrate strong results on multi-step arithmetic problems (GSM8K, AQUA) with CoT and SC, where aggregating multiple sampled chains reduces the effect of randomness. However, tree search methods like ToT often underperform because LLMs struggle to self-verify intermediate steps. Large Reasoning Models (LRMs) show exceptional performance in this domain.

98.0% O1-mini Accuracy on GSM8K
Method                    GSM8K Accuracy   AQUA Accuracy
Chain-of-Thought (CoT)    97.0%            79.9%
Self-Consistency (SC)     97.5%            83.9%
Tree-of-Thought (ToT)     96.0%            78.0%
Notes: SC often improves over CoT by aggregating responses, while ToT struggles with self-verification for complex arithmetic paths.
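A minimal sketch of the SC aggregation step described above; `sample_chain` is a hypothetical stand-in for an LLM call that returns one reasoning chain's final answer:

```python
from collections import Counter

def self_consistency(sample_chain, prompt, n_samples=5):
    """Self-Consistency: sample several reasoning chains and return
    the most common final answer (majority vote over chains)."""
    answers = [sample_chain(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in sampler: each call yields one chain's final answer.
# Three of five imagined chains reach 42; two make an arithmetic slip.
chains = iter([42, 41, 42, 43, 42])
print(self_consistency(lambda prompt: next(chains), "What is 6 * 7?"))  # → 42
```

The vote filters out uncorrelated slips, which is why SC edges out CoT on GSM8K and AQUA above; it only helps while correct chains remain the plurality.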

Challenges of Tree Search in Arithmetic Reasoning

Despite the theoretical benefits of exploring multiple reasoning paths, tree search methods like Tree-of-Thought (ToT) often underperform for arithmetic tasks. This is primarily because LLMs struggle with self-verification, meaning they frequently select incorrect intermediate arithmetic steps.

Because LLMs fail to reliably identify correct intermediate computations, errors propagate into wrong final answers, and scaling inference-time computation through more extensive path exploration does not consistently translate into performance gains on arithmetic reasoning.
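The failure mode can be illustrated with a toy expansion step of a ToT-style search; `greedy_tree_step` and the miscalibrated verifier below are illustrative sketches, not the paper's implementation:

```python
def greedy_tree_step(candidates, verifier):
    """One expansion step of a ToT-style search: keep the candidate
    intermediate step that the model's own verifier scores highest.
    If the verifier is miscalibrated, a wrong step wins the beam."""
    return max(candidates, key=verifier)

# Toy example: the arithmetically wrong step gets the higher self-score,
# so the search commits to it and the error propagates downstream.
candidates = ["4 + 9 = 13", "4 + 9 = 14"]
miscalibrated = {"4 + 9 = 13": 0.4, "4 + 9 = 14": 0.7}.get
print(greedy_tree_step(candidates, miscalibrated))  # → "4 + 9 = 14"
```

Widening the search multiplies the number of such verifier decisions, which is why more exploration does not reliably buy accuracy here.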

Logical Reasoning Insights

Evaluating LLMs' ability to derive conclusions from structured rules (ProntoQA) reveals nuanced performance. While Self-Consistency (SC) can improve performance for smaller models, it may degrade performance for larger LLMs by increasing the likelihood of generating multiple incorrect reasoning chains. Tree search methods similarly show limitations.

88.4% LLaMa 3.1 70B SC on ProntoQA
Method                    ProntoQA Accuracy
Chain-of-Thought (CoT)    91.8%
Self-Consistency (SC)     91.4%
Tree-of-Thought (ToT)     32.8%
Notes: For larger models, SC can introduce more incorrect chains, and tree search methods generally underperform in complex logical deduction.

Impact of SC on Logical Reasoning Chains

For large language models such as LLaMa 3.1 405B and GPT-based models, Self-Consistency (SC) on ProntoQA can lead to a performance drop. This is because SC, by generating multiple reasoning paths, increases the likelihood of producing various incorrect logical chains.

When evaluation focuses on the accuracy of these chains, majority voting, a core component of SC, does not effectively filter out the numerous wrong reasoning paths, thus failing to improve or even degrading overall performance compared to single-path CoT.
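Under the simplifying assumption of independent chains with a binary outcome, the effect is easy to quantify: once per-chain accuracy drops below 50%, majority voting over more samples makes things worse, not better. This is a back-of-the-envelope model, not a result from the paper:

```python
from math import comb

def sc_accuracy(p_correct, n_samples):
    """Probability that majority vote over n independent binary chains
    is correct, given per-chain accuracy p_correct (n odd)."""
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct)**(n_samples - k)
        for k in range(n_samples // 2 + 1, n_samples + 1)
    )

# When each chain is right less than half the time, sampling more chains
# concentrates the vote on wrong answers:
print(sc_accuracy(0.45, 1))            # 0.45 — single-chain (CoT) baseline
print(round(sc_accuracy(0.45, 9), 3))  # 0.379 — SC over 9 chains is worse
```

This matches the observation above: for large models whose individual chains are often wrong in correlated ways, aggregation amplifies rather than filters the errors.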

Common Sense Reasoning Insights

Common Sense Reasoning, which involves drawing conclusions from implicit everyday knowledge (StrategyQA, HotPotQA), shows that CoT and SC performance improves with increasing LLM size. However, the effectiveness of tree search methods varies significantly, being more effective for binary outputs (e.g., StrategyQA) but less so for short-answer generation (e.g., HotPotQA) due to hallucination risks.

82.0% LLaMa 3.1 405B SC on StrategyQA
Method                    HotPotQA Accuracy   StrategyQA Accuracy
Chain-of-Thought (CoT)    41.0%               79.2%
Self-Consistency (SC)     45.6%               79.8%
Tree-of-Thought (ToT)     31.5%               73.5%
Notes: Tree search helps when the LLM can use its generated facts to decide a binary output, but is less effective for open-ended questions.

Tree Search Limitations in Common Sense Reasoning

For tasks like HotPotQA, where LLMs must produce short answers based on provided facts, tree search methods prove ineffective. The generation of additional 'supporting facts' by the LLM during tree exploration often leads to hallucinations and increased error rates.

Unlike tasks requiring binary outputs, the open-ended nature of many common sense questions makes it challenging for tree search to guide LLMs toward accurate and concise answers, ultimately limiting performance gains.

Algorithmic Reasoning Insights

Algorithmic Reasoning tasks, including Game of 24 and Binpacking, require solving complex NP-hard/NP-complete problems. Chain-of-Thought (CoT) and Self-Consistency (SC) generally underperform. In contrast, tree search methods like ToT and Reasoning as Planning (RAP) perform well on most models, benefiting from extensive search capabilities, though smaller models may struggle.

90.0% O1-mini RAP on Binpacking
Method                    Game of 24 Accuracy   Binpacking Accuracy
Chain-of-Thought (CoT)    7.0%                  31.0%
Self-Consistency (SC)     6.0%                  41.0%
Tree-of-Thought (ToT)     69.0%                 53.0%
Notes: Tree search methods generally outperform CoT/SC due to their ability to perform extensive combinatorial search, crucial for NP-hard problems.

Why CoT/SC Struggle with Algorithmic Tasks

Chain-of-Thought (CoT) and Self-Consistency (SC) largely underperform on algorithmic reasoning tasks like Game of 24 and Binpacking. These problems, being combinatorial optimization challenges, demand extensive search and precise evaluation of intermediate steps.

LLMs, when prompted with CoT/SC, struggle to accurately generate and verify the complex sequences of operations needed for optimal solutions. Their linear or simple aggregated reasoning paths are insufficient for the exhaustive exploration required by such NP-hard problems, leading to poor performance.
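The kind of exhaustive search these tasks demand can be made concrete with a brute-force Game of 24 solver (a plain program, not one of the paper's LLM techniques): it tries every pairing, operator, and grouping of the four numbers, exactly the branching exploration a single linear CoT pass cannot perform.

```python
def solve_24(nums, target=24, eps=1e-6):
    """Exhaustive search for Game of 24: repeatedly replace any two
    values with the result of an arithmetic operation, recursing
    until one value remains; succeed if it equals the target."""
    def combine(vals):
        if len(vals) == 1:
            return abs(vals[0] - target) < eps
        for i in range(len(vals)):
            for j in range(len(vals)):
                if i == j:
                    continue
                rest = [vals[k] for k in range(len(vals)) if k not in (i, j)]
                a, b = vals[i], vals[j]
                results = [a + b, a - b, a * b]
                if abs(b) > eps:
                    results.append(a / b)
                if any(combine(rest + [r]) for r in results):
                    return True
        return False
    return combine([float(n) for n in nums])

print(solve_24([4, 6, 8, 2]))  # True: e.g. 8 * 4 - 6 - 2 = 24
print(solve_24([1, 1, 1, 1]))  # False: no expression reaches 24
```

Tree-search prompting approximates this exploration with the LLM proposing and pruning branches, which is why ToT's 69.0% far exceeds CoT's 7.0% on this task.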

Planning Insights

Planning tasks (Blocksworld, Rubik's Cube, TripPlan, CalendarPlan) are identified as the most challenging. While CoT and SC performance generally improves with larger model sizes, tree search methods yield mixed results. Reasoning as Planning (RAP) shows exceptional performance on Blocksworld by leveraging a world model, but Rubik's Cube remains a significant challenge for all current methods due to its high demands on spatial reasoning.

46.8% LLaMa 3.1 8B RAP on Blocksworld

Enterprise Process Flow

Planning formalism: an initial state (S0), a set of actions (A), and a goal (G); a plan is a sequence of actions that transforms S0 into a state satisfying G.
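A minimal sketch of this (S0, A, G) formalism as a breadth-first search; the hashable-state representation and the `actions(state)` generator are illustrative assumptions, not the paper's setup:

```python
from collections import deque

def bfs_plan(start, goal, actions):
    """Breadth-first planner: states are hashable, actions(state)
    yields (name, next_state) pairs, and the returned plan is the
    shortest action sequence reaching the goal state."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for name, nxt in actions(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [name]))
    return None  # goal unreachable

# Toy 1-D world: move a token from position 0 to position 3.
def moves(pos):
    yield ("right", pos + 1)
    if pos > 0:
        yield ("left", pos - 1)

print(bfs_plan(0, 3, moves))  # → ['right', 'right', 'right']
```

RAP's advantage on Blocksworld comes from pairing this style of search with an LLM-based world model that predicts each action's next state; Rubik's Cube defeats current models precisely because those state-transition predictions are unreliable.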

Rubik's Cube: A Persistent Challenge for LLMs

The Rubik's Cube task remains uniquely challenging for all evaluated methods and models, including advanced LRMs like O1. This is primarily because it requires advanced spatial reasoning and precise prediction of the consequences of each rotation.

Current LLMs and LRMs lack the sophisticated spatial understanding and state-transition modeling capabilities necessary to effectively plan and execute the complex sequence of actions required to solve a scrambled Rubik's Cube. The task is considered 'out-of-distribution' (OOD) for their existing reasoning architectures.

Calculate Your Potential AI ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by strategically implementing AI solutions, informed by the latest research.


Your AI Implementation Roadmap

A phased approach ensures seamless integration and maximum impact, leveraging the insights from this analysis for robust, verifiable AI solutions.

Phase 1: Strategic Assessment & Data Preparation

Conduct a deep dive into your existing workflows and data infrastructure. Identify key areas where LLM reasoning and planning can deliver the highest ROI. Prepare and clean relevant datasets for optimal model performance.

Phase 2: Pilot Program & Technique Selection

Implement a pilot program using an appropriate LLM and select inference-time techniques (e.g., CoT, SC, ToT, RAP) based on task-specific requirements highlighted in the research. Focus on small, manageable tasks to demonstrate initial value.

Phase 3: Performance Tuning & Validation

Rigorously evaluate the chosen techniques, iterating on prompts and model configurations. Implement robust verification mechanisms, potentially including external tools or human-in-the-loop processes, to counter LLM biases and hallucinations.

Phase 4: Scaled Deployment & Continuous Improvement

Roll out the validated AI solutions across the enterprise, focusing on gradual scaling. Establish monitoring and feedback loops for continuous improvement, adapting techniques and models as new research emerges and business needs evolve.

Ready to Transform Your Enterprise with AI?

Leverage cutting-edge research to build intelligent, efficient, and reliable AI systems. Schedule a free consultation with our AI strategists today.
