AI RESEARCH ANALYSIS
Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos
Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, i.e., visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, an instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a meta-information-guided prompting strategy that uses gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark of 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5-VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
Authors: Meng Cao¹*, Haoran Tang², Haoze Zhao¹, Mingfei Han¹, Ruyang Liu², Qiang Sun¹³, Xiaojun Chang¹, Ian Reid¹, Xiaodan Liang¹⁵
Executive Impact: Key Findings for Enterprise AI
This research presents a novel, scalable approach to physical world understanding for AI models, leveraging 'glitches' in gameplay videos. This method significantly enhances multimodal AI's ability to reason about physics, bridging the gap between simulated environments and real-world applications. Enterprises can harness this for more robust vision-language models in domains requiring advanced physical comprehension.
Deep Analysis & Enterprise Applications
What is PhysGame?
PhysGame is an instruction-tuning dataset constructed from gameplay videos with glitch-induced question-answer pairs. It contains 140,057 QA pairs across five physical domains and sixteen fine-grained categories, leveraging visual anomalies that violate predefined physical laws as a rich and scalable supervision source for physical world understanding. This approach overcomes the limitations of high annotation costs in real-world videos and limited realism in synthetic simulations.
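For intuition, here is a minimal sketch of how a single PhysGame instruction-tuning record could be represented; the field names and example values are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class PhysGameRecord:
    """One glitch-centric QA pair (illustrative schema, not the official release format)."""
    video_id: str          # identifier of the source gameplay clip
    domain: str            # one of the five physical domains, e.g. "Mechanics"
    category: str          # one of the sixteen fine-grained categories, e.g. "gravity"
    question: str          # instruction-tuning question about the clip
    answer: str            # reference answer describing the physical glitch
    meta_title: str = ""   # gameplay metadata used to guide QA generation
    meta_description: str = ""

# Hypothetical example record:
record = PhysGameRecord(
    video_id="clip_000001",
    domain="Mechanics",
    category="gravity",
    question="Which physical law does the floating vehicle violate?",
    answer="The vehicle rises with no supporting force, violating gravity.",
    meta_title="Car suddenly launches into the sky",
)
```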
Taxonomy of Physical Domains
PhysGame categorizes physical understanding into five primary domains and sixteen fine-grained categories, enabling comprehensive coverage of physical phenomena (see the sketch after this list):
- Mechanics: Deals with forces, torques, and motion effects (gravity, elasticity, friction, velocity, acceleration).
- Optics: Focuses on light behavior and interaction with matter (reflection, refraction, absorption & transmission).
- Material Properties: Refers to inherent material characteristics (color, rigidity, object shape, human body gesture).
- Thermodynamics: Involves heat, temperature, and energy transfer (heat conduction, phase change).
- Electromagnetism: Encompasses phenomena related to electric and magnetic fields and their interactions.
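The taxonomy above maps directly to a lookup table; the sketch below mirrors the categories as listed here, with Electromagnetism left empty because its fine-grained categories are not enumerated in this summary.

```python
# Five physical domains with the fine-grained categories listed above
# (Electromagnetism's subcategories are not enumerated in this summary).
PHYSICS_TAXONOMY = {
    "Mechanics": ["gravity", "elasticity", "friction", "velocity", "acceleration"],
    "Optics": ["reflection", "refraction", "absorption & transmission"],
    "Material Properties": ["color", "rigidity", "object shape", "human body gesture"],
    "Thermodynamics": ["heat conduction", "phase change"],
    "Electromagnetism": [],
}

def domain_of(category: str) -> str:
    """Return the physical domain a fine-grained category belongs to."""
    for domain, categories in PHYSICS_TAXONOMY.items():
        if category in categories:
            return domain
    raise KeyError(f"Unknown category: {category}")
```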
Video Collection and Quality Control
Videos are curated from the GamePhysics subreddit, a community sharing gameplay videos with unusual events and glitches. A meta-information-guided prompting strategy, utilizing video titles and descriptions, is used to ensure high-quality QA generation. Manual verification on a sample of 2,000 instances confirmed high accuracy rates.
| Dataset / Benchmark | Photo-realistic | Scalable to Collect | Instruct Tuning | Video-Based | Multi-modality | Bug/Glitch Focus |
|---|---|---|---|---|---|---|
| PhysBench (Taesiri & Bezemer, 2024) | No | No | No | Yes | Yes | No |
| MMVU (Taesiri et al., 2022a) | Yes | No | No | Yes | Yes | No |
| CLEVER (Taesiri et al., 2022b) | No | Yes | No | No | No | No |
| Comphy (Chen et al., 2024e) | No | Yes | No | Yes | Yes | No |
| GameBunny (Taesiri & Bezemer, 2024) | No | Yes | Yes | Yes | Yes | Yes |
| GameBug Descript (Taesiri et al., 2022b) | No | Yes | No | Yes | No | Yes |
| GlitchBench (Taesiri et al., 2024a) | No | Yes | Yes | Yes | Yes | Yes |
| PhysGame (Ours) | Yes | Yes | Yes | Yes | Yes | Yes |
Scalable Instruction-Following Generation
PhysGame employs a self-instruction paradigm using GPT-4o to generate diverse question-answer pairs. Questions are designed to be as varied as possible and fall into three types: questions that explicitly inquire about glitches, questions that probe anomalies, and straightforward content questions. This variety allows for a comprehensive assessment of physical understanding.
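A minimal sketch of the self-instruction loop under stated assumptions: `call_gpt4o` is a placeholder for whatever chat-completion client the pipeline actually uses, and the three prompt stems paraphrase the question types named above.

```python
import json

def call_gpt4o(prompt: str) -> str:
    """Placeholder for the GPT-4o call used in the pipeline (assumption, not the paper's code)."""
    raise NotImplementedError("wire this to your chat-completion client")

# The three question types described above, paraphrased as prompt stems.
QUESTION_STYLES = [
    "Ask a question that explicitly inquires which physical glitch occurs in the clip.",
    "Ask a question that probes the anomaly in the clip.",
    "Ask a straightforward question about the clip's content.",
]

def self_instruct_qa(video_context: str) -> list:
    """Generate one QA pair per question style for a single clip (illustrative sketch)."""
    pairs = []
    for style in QUESTION_STYLES:
        prompt = (
            f"{video_context}\n{style}\n"
            'Return JSON exactly as {"question": "...", "answer": "..."}'
        )
        pairs.append(json.loads(call_gpt4o(prompt)))
    return pairs
```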
Meta-Information Guided Prompting
To mitigate errors from naive prompting, a meta-information guided strategy is used. Video-associated metadata, such as titles and descriptions, is incorporated into the prompt. This meta-information significantly improves the accuracy of generated responses by guiding the MLLM to detect specific physical glitches and preventing hallucinated content.
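A minimal sketch of meta-information-guided prompt assembly; the exact prompt wording in the paper is not reproduced here, so the template below is an assumption that only illustrates where the title and description are injected.

```python
def build_prompt(instruction: str, meta_title: str = "", meta_description: str = "") -> str:
    """Assemble a QA-generation prompt, optionally guided by gameplay metadata (illustrative)."""
    lines = [
        "You will watch a gameplay clip that may contain a physical glitch.",
        instruction,
    ]
    # The title/description anchor the model on the specific glitch and discourage
    # hallucinated content; omitting them reproduces the naive prompting baseline.
    if meta_title:
        lines.append(f"Video title: {meta_title}")
    if meta_description:
        lines.append(f"Video description: {meta_description}")
    lines.append('Answer as JSON: {"question": "...", "answer": "..."}')
    return "\n".join(lines)

# With the title hint from the paper's GTA IV example, generation is steered toward the
# actual water-physics glitch rather than hallucinated helicopter behavior.
prompt = build_prompt(
    "Ask a question that explicitly inquires which physical glitch occurs.",
    meta_title="[GTA IV] I hope they bring back the water physics",
)
```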
Enterprise Process Flow: Instruction Generation
Impact of Meta-Information on QA Quality
Figure 4 in the paper illustrates the critical role of meta-information. Without the video title '[GTA IV] I hope they bring back the water physics', the model hallucinates unrealistic helicopter behavior. With the title hint, the model accurately identifies the glitch as 'No realistic wave propagation', demonstrating how metadata guides precise physical commonsense understanding. The quality validation shows 91% QA accuracy with meta-information versus 64% without it.
Introducing GameBench
Complementing PhysGame, GameBench is an expert-annotated evaluation benchmark comprising 880 glitch-identified gameplay videos. Each video is accompanied by expert-curated multiple-choice questions explicitly probing glitch characteristics, designed to evaluate physical reasoning capabilities within gaming contexts.
Robust Annotation and Quality Control
The dataset adheres to rigorous quality control, including duplicate checks and content filters to ensure only relevant gameplay glitches are included. Annotation involves a four-way multiple-choice format, where correct options describe video-specific glitches contravening physical commonsense. Distractor options are designed to be highly correlated with observed individuals or actions to prevent easy identification. All questions undergo LLM-assisted and human cross-inspection to ensure accuracy and prevent biases.
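A minimal sketch of scoring the four-way multiple-choice format; the record fields are assumptions, and accuracy is simply the fraction of questions whose predicted option letter matches the annotated answer, reported overall and per physical domain as in the table below.

```python
from typing import Dict, List

def accuracy(items: List[Dict], predictions: Dict[str, str]) -> float:
    """Fraction of items whose predicted option letter matches the gold answer (illustrative).

    Each item is assumed to look like:
      {"qid": "...", "domain": "Mechanics",
       "options": {"A": "...", "B": "...", "C": "...", "D": "..."}, "answer": "B"}
    `predictions` maps qid -> predicted option letter.
    """
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if predictions.get(item["qid"], "").strip().upper() == item["answer"]
    )
    return correct / len(items)

def per_domain_accuracy(items: List[Dict], predictions: Dict[str, str]) -> Dict[str, float]:
    """Break accuracy down by physical domain (Mechanics, Optics, Material, Thermo, Electro)."""
    by_domain: Dict[str, List[Dict]] = {}
    for item in items:
        by_domain.setdefault(item["domain"], []).append(item)
    return {domain: accuracy(group, predictions) for domain, group in by_domain.items()}
```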
| Models | AVG (%) | Mechanics (%) | Optics (%) | Material (%) | Thermo (%) | Electro (%) |
|---|---|---|---|---|---|---|
| Claude3.5-Sonnet Anthropic (2024) | 54.3 | 53.7 | 49.6 | 54.5 | 71.4 | 55.3 |
| Gemini-1.5-pro Team et al. (2024) | 55.2 | 55.0 | 49.6 | 58.7 | 63.3 | 51.1 |
| GPT-4o OpenAI (2024) | 56.1 | 55.4 | 48.9 | 60.3 | 67.3 | 55.3 |
| Qwen2-VL-7B Wang et al. (2024a) | 37.5 | 35.5 | 34.6 | 37.0 | 34.7 | 34.0 |
| w/ PhysGame | 43.8 | 45.5 | 47.4 | 48.7 | 40.8 | 36.2 |
| Qwen2.5-VL-7B Bai et al. (2025) | 44.4 | 46.5 | 40.6 | 46.0 | 34.7 | 38.3 |
| w/ PhysGame | 48.1 | 51.5 | 42.1 | 48.1 | 34.7 | 44.7 |
| InternVL2.5-8B Chen et al. (2024c) | 38.6 | 35.9 | 33.1 | 43.4 | 59.2 | 40.4 |
| w/ PhysGame | 48.0 | 44.4 | 43.6 | 56.6 | 67.3 | 40.4 |
Bridging Simulation to Reality
The PhysGame dataset demonstrates strong Game2Real transferability, significantly improving performance on real-world physical understanding benchmarks like PhysBench and MMVU. This indicates that physics knowledge learned from simulated gameplay glitches generalizes effectively to real-world scenarios.
PhysBench evaluates MLLMs' understanding of physical properties, relationships, scene, and dynamics. MMVU assesses knowledge-intensive reasoning on specialized-domain videos. Fine-tuning models on PhysGame consistently improves performance across these benchmarks, confirming its effectiveness in enhancing physical understanding and generalization.
| Models | PhysBench Avg (%) | Property (%) | Relationships (%) | Scene (%) | Dynamics (%) | MMVU Val (%) |
|---|---|---|---|---|---|---|
| GPT-4o OpenAI (2024) | 49.5 | 56.9 | 64.8 | 30.2 | 47.0 | 67.4 |
| Gemini-1.5-pro Team et al. (2024) | 49.1 | 57.3 | 63.6 | 36.5 | 41.6 | 65.4 |
| Qwen2-VL-7B Wang et al. (2024a) | 46.6 | 54.3 | 60.4 | 39.4 | 47.0 | 42.1 |
| w/ PhysGame | 49.1 | 57.5 | 63.2 | 41.8 | 49.1 | 48.6 |
| Qwen2.5-VL-7B Bai et al. (2025) | 45.9 | 54.6 | 63.2 | 37.0 | 46.9 | 48.8 |
| w/ PhysGame | 47.6 | 57.7 | 64.1 | 38.3 | 48.2 | 50.5 |
| InternVL2.5-8B Chen et al. (2024c) | 43.9 | 55.9 | 48.7 | 29.4 | 41.2 | 41.1 |
| w/ PhysGame | 46.0 | 57.9 | 66.6 | 30.7 | 51.2 | 49.6 |
Generalizing Physical Knowledge
Beyond physical understanding, PhysGame also enhances general-purpose video understanding. The physical knowledge acquired from gameplay videos proves transferable to broader video understanding tasks, as evidenced by consistent performance boosts on MVBench, Video-MME, and LongVideoBench benchmarks.
MVBench focuses on multifaceted video understanding in short videos, while Video-MME and LongVideoBench emphasize long-form video comprehension. The improvements across these diverse benchmarks confirm that physics-oriented pretraining via PhysGame not only improves physical reasoning but also enhances the generalization capabilities of video-language models.
| Models | MVBench AVG (%) | Video-MME AVG (%) | Video-MME Short (%) | Video-MME Medium (%) | Video-MME Long (%) | LongVideoBench Val (%) |
|---|---|---|---|---|---|---|
| GPT-4V Achiam et al. (2023) | 43.5 | 59.9 | 70.5 | 55.8 | 53.5 | - |
| NVILA Liu et al. (2024e) | 68.1 | 64.2 | 75.7 | 62.2 | 54.8 | 57.7 |
| Qwen2-VL-7B Wang et al. (2024a) | 64.5 | 58.0 | 70.2 | 55.8 | 48.1 | 55.3 |
| w/ PhysGame | 66.4 | 59.3 | 70.7 | 57.6 | 49.7 | 56.1 |
| Qwen2.5-VL-7B Bai et al. (2025) | 66.0 | 64.7 | 76.2 | 63.3 | 54.6 | 56.0 |
| w/ PhysGame | 67.3 | 65.8 | 76.6 | 65.7 | 55.2 | 58.3 |
| InternVL2.5-8B Chen et al. (2024c) | 72.0 | 64.0 | 75.8 | 62.8 | 53.4 | 57.2 |
| w/ PhysGame | 72.8 | 64.2 | 76.8 | 62.6 | 53.3 | 59.5 |
Qualitative Improvements in General Video Understanding
Figure 6 in the paper showcases how PhysGame-enhanced models provide more physically consistent interpretations. In one example, the baseline model misattributes a landslide to 'dynamite' based on superficial cues, while the PhysGame-tuned model correctly identifies 'gravity' as the underlying cause. Another case highlights accurate inference of degrees of freedom in a linkage mechanism by the PhysGame model, contrasting with the baseline's incorrect output. This demonstrates PhysGame's ability to bridge visual perception and intuitive physics reasoning for both natural and synthetic scenes.
Monotonic Performance Gains with Data Scaling
Experiments with varying amounts of PhysGame data (0K, 40K, 80K, 140K) show a consistent improvement in performance across all real-world and general-video tasks (PhysBench, MMVU, MVBench, Video-MME, LongVideoBench). The gains are monotonic and do not plateau even at 140K samples, indicating that the dataset's scaling potential has not yet been saturated.
Optimizing Training Data Composition
Ablation studies reveal that a mixture of PhysGame data with general video instruction tuning data (LLaVA-Hound) is crucial for optimal performance and generalization. Pure PhysGame training, while effective, can induce a mild domain bias towards gameplay scenes, highlighting the importance of domain balance.
| Method | PhysBench (%) | MMVU (%) | MVBench (%) | Video-MME (%) | LongVideoBench (%) |
|---|---|---|---|---|---|
| Baseline | 45.9 | 48.8 | 66.0 | 64.7 | 56.0 |
| Full Model (w/ meta-data & mixture) | 47.6 | 50.5 | 67.3 | 65.8 | 58.3 |
| w/o meta-data | 42.9 | 46.7 | 65.2 | 63.5 | 55.2 |
| w/o data mixture | 45.1 | 49.2 | 66.7 | 65.1 | 57.4 |
| PhysGame / LLaVA-Hound (PG/LH) Mixture | 0K / 160K | 40K / 120K | 80K / 80K | 140K / 20K | 140K / 0K |
|---|---|---|---|---|---|
| Avg. on PhysBench (%) | 45.9 | 46.9 | 47.3 | 47.6 | 45.1 |
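A minimal sketch of realizing the PhysGame / LLaVA-Hound (PG/LH) mixture schedule from the table above; the sampling below is an illustration under assumed in-memory lists and makes no claim about the paper's actual dataloader.

```python
import random

def build_training_mix(physgame, llava_hound, n_pg, n_lh, seed=0):
    """Subsample and interleave the two instruction-tuning sources (illustrative sketch)."""
    rng = random.Random(seed)
    mix = (
        rng.sample(physgame, min(n_pg, len(physgame)))
        + rng.sample(llava_hound, min(n_lh, len(llava_hound)))
    )
    rng.shuffle(mix)
    return mix

# PG/LH configurations from the table above, in thousands of samples.
CONFIGS = [(0, 160), (40, 120), (80, 80), (140, 20), (140, 0)]
# e.g. the best-performing setting: build_training_mix(physgame, llava_hound, 140_000, 20_000)
```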
Identifying Bottlenecks in Physical World Understanding
An error analysis of top-performing MLLMs (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet) on the GameBench benchmark categorizes errors into four types: Perception Error, Lack of Knowledge, Reasoning Error, and Refusal to Answer. This categorization clarifies the current limitations of MLLMs in physical world understanding.
The dominant error type is 'Lack of Knowledge' (79-81% of total errors), indicating factual incompleteness or outdated knowledge as the primary bottleneck. Reasoning Errors contribute 7-12%, and Perception Errors are minor (4-7%), reflecting general maturity in visual grounding. Refusal to Answer is least frequent, suggesting that while perception and reasoning are improving, expanding the knowledge base remains critical for multimodal model reliability.
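A minimal sketch of tallying the four error types from per-question error labels; the label strings are assumptions that simply mirror the categories named above.

```python
from collections import Counter

ERROR_TYPES = ("Perception Error", "Lack of Knowledge", "Reasoning Error", "Refusal to Answer")

def error_distribution(error_labels):
    """Share of each error type among all labeled errors, in percent (illustrative)."""
    counts = Counter(label for label in error_labels if label in ERROR_TYPES)
    total = sum(counts.values()) or 1
    return {etype: 100.0 * counts[etype] / total for etype in ERROR_TYPES}

# For GPT-4o, the paper reports roughly: Lack of Knowledge 79%, Reasoning Error 12%,
# Refusal to Answer 5%, Perception Error 4%.
```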
Error Distribution Across Leading MLLMs
Figure 7 in the paper presents pie charts showing the error distribution for Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. All models consistently show 'Lack of Knowledge' as the most common error. For example, GPT-4o shows 79% 'Lack of Knowledge', 12% 'Reasoning Error', 5% 'Refusal to Answer', and 4% 'Perception Error'. This analysis underscores the need for more comprehensive physical knowledge integration in MLLMs to achieve human-level physical understanding.
Your AI Implementation Roadmap
Our structured approach ensures a seamless integration of advanced AI, from initial assessment to ongoing optimization, maximizing your enterprise's benefit.
Phase 1: Discovery & Strategy
Comprehensive assessment of current workflows, identification of high-impact AI opportunities, and development of a tailored implementation strategy leveraging PhysGame's physical reasoning capabilities.
Phase 2: Pilot & Integration
Rapid prototyping and pilot deployment of AI solutions in a controlled environment. Seamless integration with existing systems, prioritizing minimal disruption and maximum learning.
Phase 3: Scaling & Optimization
Full-scale deployment across your enterprise. Continuous monitoring, performance optimization, and iterative improvements to ensure long-term value and adaptability to evolving business needs.
Phase 4: Advanced Capabilities & R&D
Explore cutting-edge applications, integration with multimodal vision-language models for complex scenarios, and continuous research and development to maintain your competitive edge.
Ready to Transform Your Enterprise with AI?
Leverage the power of advanced AI for physical world understanding and general video intelligence. Schedule a personalized consultation to explore how these breakthroughs can drive efficiency and innovation in your organization.