
Research Paper Analysis

TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

This paper introduces TangramPuzzle, a new benchmark for evaluating the compositional spatial reasoning abilities of Multimodal Large Language Models (MLLMs). Inspired by the classic Tangram game, the benchmark uses a symbolic geometric framework, the Tangram Construction Expression (TCE), to ground assemblies in precise, machine-verifiable coordinate specifications. Two tasks are proposed: Outline Prediction, which assesses global shape inference, and End-to-End Code Generation, which assesses inverse geometric assembly. Evaluations reveal that MLLMs struggle to satisfy geometric constraints, often prioritizing visual matching over geometric accuracy, highlighting limitations in complex spatial reasoning.

Executive Impact: Key Findings for Your Enterprise

TangramPuzzle reveals critical gaps in MLLM spatial reasoning, impacting enterprise applications requiring precise geometric understanding. While MLLMs show strong visual recognition, their inability to maintain geometric integrity (e.g., non-overlap, rigid shape preservation) leads to unreliable outputs. This benchmark is vital for developing robust AI for design, robotics, and manufacturing, emphasizing the need for advanced compositional intelligence.


Deep Analysis & Enterprise Applications

The topics below unpack the specific findings from the research, reframed as enterprise-focused analysis.

Introduction & Motivation

Multimodal Large Language Models (MLLMs) have made significant strides in visual recognition and semantic understanding. However, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often use simple tasks, semantic approximations, or coarse relative positioning, with limited and imprecise evaluation metrics. This paper addresses these limitations by introducing TangramPuzzle, a geometry-grounded benchmark for evaluating compositional spatial reasoning through the classic Tangram game.

Tangram Construction Expression (TCE)

The paper proposes the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. This mitigates the ambiguity of visual approximation and ensures geometric rigor. Each TCE instance includes components such as instance_id, target_outline, initial_state, final_state, and adjacency_graph. Quantities are encoded as exact algebraic expressions in LaTeX to prevent precision loss from floating-point representations, and the final_state records a transform_matrix capturing the rigid motion (translation, rotation, or reflection) applied to each piece.
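To make the schema concrete, here is a minimal sketch of a TCE instance as a Python dict mirroring the JSON layout. The field names follow the paper; the piece names, coordinates, and matrix entries are illustrative assumptions, not values from the dataset.

```python
# Minimal TCE sketch. Field names follow the paper's schema; all
# concrete values (piece names, coordinates, matrix entries) are
# hypothetical. Exact quantities are stored as LaTeX strings.
tce_instance = {
    "instance_id": "tangram_0001",
    # Target silhouette as an ordered list of vertices.
    "target_outline": [["0", "0"], ["4", "0"], ["4", "4"], ["0", "4"]],
    # Canonical (untransformed) piece geometry.
    "initial_state": {
        "large_triangle_1": [["0", "0"], ["4", "0"], ["2", "2"]],
    },
    # Rigid motion placing each piece: a 3x3 homogeneous matrix whose
    # 2x2 block encodes rotation/reflection and whose last column
    # encodes translation.
    "final_state": {
        "large_triangle_1": {
            "transform_matrix": [
                ["\\frac{\\sqrt{2}}{2}", "-\\frac{\\sqrt{2}}{2}", "1"],
                ["\\frac{\\sqrt{2}}{2}", "\\frac{\\sqrt{2}}{2}", "0"],
                ["0", "0", "1"],
            ],
        },
    },
    # Pairs of pieces that touch in the final assembly.
    "adjacency_graph": [["large_triangle_1", "small_triangle_1"]],
}
```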

Tasks & Evaluation

TangramPuzzle features two complementary tasks. Outline Prediction (Task 1) requires models to infer the global shape from local components, selecting the correct silhouette from a set of candidates. End-to-End Code Generation (Task 2) demands solving the inverse geometric assembly problem, outputting a complete TCE in JSON format. Task 2 is scored by a hierarchical Constraint-based Evaluation Framework that checks for Syntax Error (TSE), Rigid Geometry Error (RGE), and Physical Error (PE); shape similarity is measured by IoU and Hausdorff distance. Human performance is also benchmarked to provide a realistic ceiling.
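A sketch of how such a pipeline might be implemented is shown below. The check order (TSE, then RGE, then PE) and the metric names come from the paper; the code itself, including the hypothetical "vertices" field for transformed piece corners, the assumption that LaTeX quantities have been evaluated to floats, and the use of numpy and shapely, is our own illustration.

```python
import json

import numpy as np
from shapely.geometry import Polygon  # pip install numpy shapely
from shapely.ops import unary_union

def validate_tce(tce_json: str, tol: float = 1e-9) -> str:
    """Hierarchical checks in the paper's order: TSE -> RGE -> PE.
    Assumes LaTeX quantities were already evaluated to floats and that
    each final_state entry carries a hypothetical 'vertices' field
    holding the piece's transformed corners."""
    # 1) Syntax Error (TSE): the output must parse as TCE JSON.
    try:
        tce = json.loads(tce_json)
    except json.JSONDecodeError:
        return "TSE"
    # 2) Rigid Geometry Error (RGE): the 2x2 linear block of each
    #    homogeneous transform must be orthogonal (det +1 = rotation,
    #    det -1 = reflection); scaling or shear deforms the piece.
    polygons = []
    for state in tce["final_state"].values():
        m = np.asarray(state["transform_matrix"], dtype=float)
        if not np.allclose(m[:2, :2].T @ m[:2, :2], np.eye(2), atol=tol):
            return "RGE"
        polygons.append(Polygon(state["vertices"]))
    # 3) Physical Error (PE): pieces may touch along edges but must not
    #    overlap with positive area.
    for i, a in enumerate(polygons):
        for b in polygons[i + 1:]:
            if a.intersection(b).area > tol:
                return "PE"
    return "PASS"

def shape_similarity(pieces: list[Polygon], target: Polygon) -> dict:
    """The paper's shape-similarity metrics: IoU and Hausdorff distance
    between the assembled silhouette and the target outline."""
    assembly = unary_union(pieces)
    iou = assembly.intersection(target).area / assembly.union(target).area
    return {"IoU": iou, "Hausdorff": assembly.hausdorff_distance(target)}
```

Ordering the checks this way mirrors the paper's hierarchy: a syntactically invalid output never reaches geometric scrutiny, and only assemblies that pass all three gates are worth scoring for shape similarity.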

Key Findings

Experiments on a range of open-source and proprietary MLLMs reveal that models prioritize matching the target silhouette visually while neglecting geometric constraints, producing distorted or deformed pieces and impermissible overlaps. Even top-tier models with high IoU scores often fail constraint validation (0% success on Task 2 in many cases), exposing a fundamental limitation in true compositional spatial reasoning. Gemini3-Pro stands out as an exception, showing comparatively robust geometric reasoning.

MLLM vs. Human Performance on Task 2

Human participants achieve an average success rate of 72.67% on Task 2, while the best MLLMs fall far below this ceiling, with many models failing constraint validation entirely.

Enterprise Process Flow

Raw Data Collection & Filtering
Interactive Annotation
Geometric Normalization (sketched in code below)
Validation
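The Geometric Normalization stage plausibly snaps hand-annotated floating-point coordinates back to the exact algebraic values the TCE format requires. Below is a sketch of that idea using sympy; the choice of nsimplify and the sqrt(2) basis are our assumptions about how this stage could work, not the paper's stated method.

```python
from sympy import latex, nsimplify, sqrt

def normalize_coordinate(value: float):
    """Snap a float coordinate to an exact algebraic expression over
    sqrt(2), then render it as LaTeX for storage in a TCE. This is an
    illustrative guess at the 'Geometric Normalization' stage."""
    exact = nsimplify(value, [sqrt(2)])
    return exact, latex(exact)

# Hypothetical annotated vertex coordinate 1 + sqrt(2)/2 ~= 1.7071:
print(normalize_coordinate(1.7071067811865476))
# -> (1 + sqrt(2)/2, '1 + \\frac{\\sqrt{2}}{2}') or an equivalent form
```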

Spatial Reasoning Limitations in MLLMs

Geometric Constraints
  • MLLM performance: frequently violated (e.g., RGE, PE); models prioritize visual fit over geometric rules
  • Ideal performance: strictly adhered to, maintaining physical feasibility

Compositional Reasoning
  • MLLM performance: struggles with inverse assembly; relies on textual cues as crutches
  • Ideal performance: systematic decomposition with direct visual grounding

Solution Validity
  • MLLM performance: low Validation Pass Rate (VPR); many solutions geometrically invalid
  • Ideal performance: high VPR, with geometrically valid and executable solutions

Case Study: U-shaped Target Assembly

A visualization for a U-shaped target (Task 2) showed Gemini3-Pro achieving a perfect assembly, demonstrating its ability to effectively ground coordinate constraints. In contrast, Claude-Sonnet-4.5 attempted non-rigid elongations, InternVL3-78B generated severe overlaps, and GPT-5.2 used an incorrect piece inventory (e.g., hallucinating an extra square). This highlights the common failure mode where MLLMs prioritize visual silhouette matching at the expense of geometric constraints.
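Of these failure modes, the piece-inventory error is the simplest to check mechanically: a valid assembly must use exactly the seven canonical tangram pieces. A minimal sketch of such a check follows; the piece labels are our own, not identifiers from the dataset.

```python
from collections import Counter

# The seven canonical tangram pieces; labels are our own naming.
TANGRAM_INVENTORY = Counter({
    "large_triangle": 2,
    "medium_triangle": 1,
    "small_triangle": 2,
    "square": 1,
    "parallelogram": 1,
})

def inventory_errors(pieces: list[str]) -> Counter:
    """Return piece-count deviations from the canonical tangram set;
    a hallucinated extra square shows up as {'square': 1}."""
    diff = Counter(pieces)
    diff.subtract(TANGRAM_INVENTORY)
    return Counter({name: n for name, n in diff.items() if n != 0})

# Hypothetical failure mirroring the case study: one square too many.
print(inventory_errors(
    ["large_triangle"] * 2 + ["medium_triangle"]
    + ["small_triangle"] * 2 + ["square", "square", "parallelogram"]
))  # -> Counter({'square': 1})
```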


Implementation Roadmap

A strategic outline for integrating advanced AI capabilities, from foundational setup to validated deployment.

Phase 1: Foundation & Data Integration

Integrate TangramPuzzle benchmark data and TCE framework. Establish baseline MLLM performance for both Outline Prediction and End-to-End Code Generation tasks.

Phase 2: Model Adaptation & Fine-tuning

Develop and fine-tune MLLM architectures to enhance geometric reasoning capabilities. Focus on constraint satisfaction (RGE, PE) and precise coordinate generation.

Phase 3: Advanced Spatial Reasoning Modules

Implement specialized modules for rigid body transformations, non-overlap checking, and topological validity. Explore symbolic reasoning integration to improve geometric fidelity.

Phase 4: Comprehensive Validation & Deployment

Rigorously validate model outputs against TCE constraints and human performance. Prepare for deployment in real-world applications requiring high-precision spatial intelligence.

Unlock Precision in Your Enterprise AI

Discuss how compositional spatial reasoning can transform your AI applications from approximation to exact execution.

Ready to Get Started?

Book Your Free Consultation.
