
Research Paper Analysis

TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

This paper introduces TangramPuzzle, a new benchmark for evaluating the compositional spatial reasoning abilities of Multimodal Large Language Models (MLLMs). Inspired by the classic Tangram game, the benchmark uses a symbolic geometric framework, the Tangram Construction Expression (TCE), to ground assemblies in precise, machine-verifiable coordinate specifications. Two tasks are proposed: Outline Prediction, which assesses global shape inference, and End-to-End Code Generation, which assesses inverse geometric assembly. Evaluations reveal that MLLMs struggle to satisfy geometric constraints, often prioritizing visual matching over geometric accuracy, highlighting limitations in complex spatial reasoning.

Executive Impact: Key Findings for Your Enterprise

TangramPuzzle reveals critical gaps in MLLM spatial reasoning, impacting enterprise applications requiring precise geometric understanding. While MLLMs show strong visual recognition, their inability to maintain geometric integrity (e.g., non-overlap, rigid shape preservation) leads to unreliable outputs. This benchmark is vital for developing robust AI for design, robotics, and manufacturing, emphasizing the need for advanced compositional intelligence.


Deep Analysis & Enterprise Applications

The topics below unpack the specific findings from the research, reframed as enterprise-focused analysis.

Introduction & Motivation

Multimodal Large Language Models (MLLMs) have made significant strides in visual recognition and semantic understanding. However, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often use simple tasks, semantic approximations, or coarse relative positioning, with limited and imprecise evaluation metrics. This paper addresses these limitations by introducing TangramPuzzle, a geometry-grounded benchmark for evaluating compositional spatial reasoning through the classic Tangram game.

Tangram Construction Expression (TCE)

The paper proposes the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. This mitigates the ambiguity of visual approximation and ensures geometric rigor. Each TCE instance includes components such as instance_id, target_outline, initial_state, final_state, and adjacency_graph. Quantities are encoded as exact algebraic expressions in LaTeX to prevent precision loss from floating-point representations, and the final_state records a transform_matrix capturing the rigid motion (translation, rotation, or reflection) applied to each piece.
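To make the schema concrete, here is a minimal sketch of a TCE instance as a Python dict mirroring the JSON layout. The field names follow the paper; the piece names, coordinates, and matrix entries are illustrative assumptions, not values from the dataset.

```python
# Minimal TCE sketch. Field names follow the paper's schema; all
# concrete values (piece names, coordinates, matrix entries) are
# hypothetical. Exact quantities are stored as LaTeX strings.
tce_instance = {
    "instance_id": "tangram_0001",
    # Target silhouette as an ordered list of vertices.
    "target_outline": [["0", "0"], ["4", "0"], ["4", "4"], ["0", "4"]],
    # Canonical (untransformed) piece geometry.
    "initial_state": {
        "large_triangle_1": [["0", "0"], ["4", "0"], ["2", "2"]],
    },
    # Rigid motion placing each piece: a 3x3 homogeneous matrix whose
    # 2x2 block encodes rotation/reflection and whose last column
    # encodes translation.
    "final_state": {
        "large_triangle_1": {
            "transform_matrix": [
                ["\\frac{\\sqrt{2}}{2}", "-\\frac{\\sqrt{2}}{2}", "1"],
                ["\\frac{\\sqrt{2}}{2}", "\\frac{\\sqrt{2}}{2}", "0"],
                ["0", "0", "1"],
            ],
        },
    },
    # Pairs of pieces that touch in the final assembly.
    "adjacency_graph": [["large_triangle_1", "small_triangle_1"]],
}
```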

Tasks & Evaluation

TangramPuzzle features two complementary tasks. Outline Prediction (Task 1) requires models to infer the global shape from local components, selecting the correct silhouette from a set of candidates. End-to-End Code Generation (Task 2) demands solving the inverse geometric assembly problem, outputting a complete TCE in JSON format. Task 2 is scored by a hierarchical Constraint-based Evaluation Framework that checks for Syntax Error (TSE), Rigid Geometry Error (RGE), and Physical Error (PE); shape similarity is measured by IoU and Hausdorff distance. Human performance is also benchmarked to provide a realistic ceiling.
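A sketch of how such a pipeline might be implemented is shown below. The check order (TSE, then RGE, then PE) and the metric names come from the paper; the code itself, including the hypothetical "vertices" field for transformed piece corners, the assumption that LaTeX quantities have been evaluated to floats, and the use of numpy and shapely, is our own illustration.

```python
import json

import numpy as np
from shapely.geometry import Polygon  # pip install numpy shapely
from shapely.ops import unary_union

def validate_tce(tce_json: str, tol: float = 1e-9) -> str:
    """Hierarchical checks in the paper's order: TSE -> RGE -> PE.
    Assumes LaTeX quantities were already evaluated to floats and that
    each final_state entry carries a hypothetical 'vertices' field
    holding the piece's transformed corners."""
    # 1) Syntax Error (TSE): the output must parse as TCE JSON.
    try:
        tce = json.loads(tce_json)
    except json.JSONDecodeError:
        return "TSE"
    # 2) Rigid Geometry Error (RGE): the 2x2 linear block of each
    #    homogeneous transform must be orthogonal (det +1 = rotation,
    #    det -1 = reflection); scaling or shear deforms the piece.
    polygons = []
    for state in tce["final_state"].values():
        m = np.asarray(state["transform_matrix"], dtype=float)
        if not np.allclose(m[:2, :2].T @ m[:2, :2], np.eye(2), atol=tol):
            return "RGE"
        polygons.append(Polygon(state["vertices"]))
    # 3) Physical Error (PE): pieces may touch along edges but must not
    #    overlap with positive area.
    for i, a in enumerate(polygons):
        for b in polygons[i + 1:]:
            if a.intersection(b).area > tol:
                return "PE"
    return "PASS"

def shape_similarity(pieces: list[Polygon], target: Polygon) -> dict:
    """The paper's shape-similarity metrics: IoU and Hausdorff distance
    between the assembled silhouette and the target outline."""
    assembly = unary_union(pieces)
    iou = assembly.intersection(target).area / assembly.union(target).area
    return {"IoU": iou, "Hausdorff": assembly.hausdorff_distance(target)}
```

Ordering the checks this way mirrors the paper's hierarchy: a syntactically invalid output never reaches geometric scrutiny, and only assemblies that pass all three gates are worth scoring for shape similarity.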

Key Findings

Experiments on a range of open-source and proprietary MLLMs reveal that models prioritize matching the target silhouette visually while neglecting geometric constraints, producing distorted or deformed pieces and impermissible overlaps. Even top-tier models with high IoU scores often fail constraint validation (0% success on Task 2 in many cases), exposing a fundamental limitation in true compositional spatial reasoning. Gemini3-Pro stands out as an exception, showing comparatively robust geometric reasoning.

MLLM vs. Human Performance on Task 2

Human participants achieve an average success rate of 72.67% on Task 2, while the best MLLMs fall far below this ceiling, with many models failing constraint validation entirely.

Enterprise Process Flow

Raw Data Collection & Filtering
Interactive Annotation
Geometric Normalization (sketched in code below)
Validation
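The Geometric Normalization stage plausibly snaps hand-annotated floating-point coordinates back to the exact algebraic values the TCE format requires. Below is a sketch of that idea using sympy; the choice of nsimplify and the sqrt(2) basis are our assumptions about how this stage could work, not the paper's stated method.

```python
from sympy import latex, nsimplify, sqrt

def normalize_coordinate(value: float):
    """Snap a float coordinate to an exact algebraic expression over
    sqrt(2), then render it as LaTeX for storage in a TCE. This is an
    illustrative guess at the 'Geometric Normalization' stage."""
    exact = nsimplify(value, [sqrt(2)])
    return exact, latex(exact)

# Hypothetical annotated vertex coordinate 1 + sqrt(2)/2 ~= 1.7071:
print(normalize_coordinate(1.7071067811865476))
# -> (1 + sqrt(2)/2, '1 + \\frac{\\sqrt{2}}{2}') or an equivalent form
```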

Spatial Reasoning Limitations in MLLMs

Geometric Constraints
  • MLLM performance: frequently violated (e.g., RGE, PE); models prioritize visual fit over geometric rules
  • Ideal performance: strictly adhered to, maintaining physical feasibility

Compositional Reasoning
  • MLLM performance: struggles with inverse assembly; relies on textual cues as crutches
  • Ideal performance: systematic decomposition with direct visual grounding

Solution Validity
  • MLLM performance: low Validation Pass Rate (VPR); many solutions geometrically invalid
  • Ideal performance: high VPR, with geometrically valid and executable solutions

Case Study: U-shaped Target Assembly

A visualization for a U-shaped target (Task 2) showed Gemini3-Pro achieving a perfect assembly, demonstrating its ability to effectively ground coordinate constraints. In contrast, Claude-Sonnet-4.5 attempted non-rigid elongations, InternVL3-78B generated severe overlaps, and GPT-5.2 used an incorrect piece inventory (e.g., hallucinating an extra square). This highlights the common failure mode where MLLMs prioritize visual silhouette matching at the expense of geometric constraints.
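Of these failure modes, the piece-inventory error is the simplest to check mechanically: a valid assembly must use exactly the seven canonical tangram pieces. A minimal sketch of such a check follows; the piece labels are our own, not identifiers from the dataset.

```python
from collections import Counter

# The seven canonical tangram pieces; labels are our own naming.
TANGRAM_INVENTORY = Counter({
    "large_triangle": 2,
    "medium_triangle": 1,
    "small_triangle": 2,
    "square": 1,
    "parallelogram": 1,
})

def inventory_errors(pieces: list[str]) -> Counter:
    """Return piece-count deviations from the canonical tangram set;
    a hallucinated extra square shows up as {'square': 1}."""
    diff = Counter(pieces)
    diff.subtract(TANGRAM_INVENTORY)
    return Counter({name: n for name, n in diff.items() if n != 0})

# Hypothetical failure mirroring the case study: one square too many.
print(inventory_errors(
    ["large_triangle"] * 2 + ["medium_triangle"]
    + ["small_triangle"] * 2 + ["square", "square", "parallelogram"]
))  # -> Counter({'square': 1})
```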


Implementation Roadmap

A strategic outline for integrating advanced AI capabilities, from foundational setup to validated deployment.

Phase 1: Foundation & Data Integration

Integrate TangramPuzzle benchmark data and TCE framework. Establish baseline MLLM performance for both Outline Prediction and End-to-End Code Generation tasks.

Phase 2: Model Adaptation & Fine-tuning

Develop and fine-tune MLLM architectures to enhance geometric reasoning capabilities. Focus on constraint satisfaction (RGE, PE) and precise coordinate generation.

Phase 3: Advanced Spatial Reasoning Modules

Implement specialized modules for rigid body transformations, non-overlap checking, and topological validity. Explore symbolic reasoning integration to improve geometric fidelity.

Phase 4: Comprehensive Validation & Deployment

Rigorously validate model outputs against TCE constraints and human performance. Prepare for deployment in real-world applications requiring high-precision spatial intelligence.

Unlock Precision in Your Enterprise AI

Discuss how compositional spatial reasoning can transform your AI applications from approximation to exact execution.

Ready to Get Started?

Book Your Free Consultation.
