Enterprise AI Analysis
Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
Bridget Leonard and Scott O. Murray | Published: 23 Jan 2026
This paper introduces 'perspective tokens'—specialized embeddings inspired by human spatial cognition—to enable multimodal language models (MLMs) to overcome egocentric bias and perform level-2 visual perspective-taking (VPT) tasks. By encoding orientation through embodied body-keypoint cues or abstract mental rotation representations, these tokens significantly improve MLM accuracy on spatial reasoning benchmarks like Isle Bricks V2, COCO, and 3DSRBench. The study highlights that directly embedding cognitively grounded spatial structure into token space provides a lightweight, model-agnostic mechanism for more human-like spatial reasoning, suggesting that MLMs need structured spatial encodings to support viewpoint transformations, rather than just more training data.
Executive Impact & Key Findings
The Challenge: Egocentric Bias in MLMs
Multimodal Language Models (MLMs) consistently fail at level-2 visual perspective-taking (VPT) tasks, defaulting to an egocentric perspective despite achieving high performance on other vision-language tasks. This egocentric bias highlights a critical representational gap in current MLMs regarding allocentric spatial reasoning.
The Solution: Cognitively-Inspired Perspective Tokens
The paper introduces 'perspective tokens' into MLMs: specialized embeddings that encode orientation information through two main mechanisms (a minimal construction sketch follows this list):
- Embodiment Tokens: Derived from body keypoint coordinates (e.g., shoulders, hips) of a reference agent, explicitly linking body pose to orientation. These allow the model to represent and reason about the reference's alignment (aligned/unaligned with viewer).
- Rotation Tokens: Encode abstract scene information with explicit orientation labels for both reference and query objects (x, y coordinates and azimuth values). These are body-agnostic and support mental rotation-like transformations, enabling generalization to non-human entities.
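A minimal sketch of how such tokens might be constructed, assuming a small MLP projector that maps raw spatial cues (body-keypoint coordinates, or x/y/azimuth values for reference and query objects) into the language model's embedding space. The `PerspectiveTokenizer` class, the projector design, the keypoint count, and the input layout are illustrative assumptions, not the authors' exact architecture; 5120 is LLaVA-1.5-13B's hidden size.

```python
import torch
import torch.nn as nn

class PerspectiveTokenizer(nn.Module):
    """Illustrative sketch: project raw spatial cues into the LM embedding space.

    Assumptions (not from the paper): a 2-layer MLP projector, 4 body keypoints
    (shoulders and hips) for the embodiment token, and (x, y, azimuth) values for
    the reference and query objects for the rotation token. Only the general idea,
    spatial cues becoming dedicated embeddings, follows the paper.
    """

    def __init__(self, lm_hidden_size: int = 5120):  # 5120 = LLaVA-1.5-13B hidden size
        super().__init__()
        # Embodiment token: (x, y) for shoulders and hips of the reference agent -> 8 values.
        self.embodiment_proj = nn.Sequential(
            nn.Linear(8, 256), nn.GELU(), nn.Linear(256, lm_hidden_size)
        )
        # Rotation token: (x, y, azimuth) for the reference and the query object -> 6 values.
        self.rotation_proj = nn.Sequential(
            nn.Linear(6, 256), nn.GELU(), nn.Linear(256, lm_hidden_size)
        )

    def embodiment_token(self, keypoints_xy: torch.Tensor) -> torch.Tensor:
        # keypoints_xy: (batch, 4, 2) normalized image coordinates.
        return self.embodiment_proj(keypoints_xy.flatten(1))  # (batch, lm_hidden_size)

    def rotation_token(self, scene_state: torch.Tensor) -> torch.Tensor:
        # scene_state: (batch, 6) = [ref_x, ref_y, ref_azimuth, query_x, query_y, query_azimuth].
        return self.rotation_proj(scene_state)  # (batch, lm_hidden_size)
```

The key design point is that the spatial cues never pass through the text tokenizer: they receive their own embedding slot, which is what gives the model a dedicated representational space for orientation.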
Key Takeaways for Your Enterprise:
- MLMs struggle with spatial reasoning that requires adopting another agent's visual perspective, exhibiting a persistent egocentric bias.
- Perspective tokens, inspired by human spatial cognition, encode orientation via embodied body-keypoint cues or abstract mental rotation representations.
- These tokens enable LLaVA-1.5-13B to achieve level-2 visual perspective-taking, improving accuracy across synthetic and naturalistic benchmarks (a minimal integration sketch follows this list).
- Rotation-based tokens generalize to non-human reference agents, overcoming a limitation of embodiment-based representations.
- Fine-tuning with these tokens amplifies latent orientation sensitivity already present in base models, indicating that MLMs possess precursors of allocentric reasoning but lack the internal structure needed to act on them.
- The findings suggest that overcoming egocentric bias requires embedding structured spatial encodings that support explicit viewpoint transformations, rather than just larger-scale training.
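As a rough illustration of how such a token could enter a LLaVA-style model, the sketch below replaces a reserved placeholder position in the embedded prompt with the perspective embedding, mirroring how LLaVA substitutes visual features for an image placeholder. The function name, the placeholder mechanism, and the shapes are assumptions for illustration, not the paper's exact integration or fine-tuning recipe.

```python
import torch

def splice_perspective_token(text_embeds: torch.Tensor,
                             perspective_embed: torch.Tensor,
                             placeholder_index: int) -> torch.Tensor:
    """Replace a reserved placeholder slot in the embedded prompt with the
    perspective token (assumed mechanism, analogous to LLaVA's <image> handling).

    text_embeds:       (seq_len, hidden) embedded prompt tokens
    perspective_embed: (hidden,) output of a perspective projector
    placeholder_index: position of the reserved perspective placeholder
    """
    spliced = text_embeds.clone()
    spliced[placeholder_index] = perspective_embed
    return spliced

# Toy example with LLaVA-1.5-13B's hidden size (5120); values are random stand-ins.
prompt_embeds = torch.randn(32, 5120)   # embedded prompt, including one placeholder slot
persp = torch.randn(5120)               # e.g. from the PerspectiveTokenizer sketch above
inputs_embeds = splice_perspective_token(prompt_embeds, persp, placeholder_index=5)
# The resulting sequence would be fed to the language model as input embeddings.
```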
Deep Analysis & Enterprise Applications
The sections below unpack specific findings from the research and their practical implications for your enterprise AI strategy.
Encoding Comparison: Text-Based Control vs. Perspective Tokens
| Feature | Text-Based Control | Token-Based (Embodiment/Rotation) |
|---|---|---|
| Information Encoding | Natural language strings | Specialized embeddings with explicit spatial structure |
| Performance on Perspective-Taking | Improved over baseline but fell short of token-based | Achieved near-human performance on unaligned tasks |
| Flexibility/Generalization | Increased flexibility in complex scenes (e.g., COCO/3DSRBench) | Directly supports viewpoint transformation, generalizes (rotation tokens) |
| Representational Depth | Beneficial but lacks dedicated embedding space for precise spatial reasoning | Embeds cognitively grounded spatial structure, enabling allocentric reasoning |
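To make the table's first row concrete, the text-based control flattens the same spatial cues into the prompt string, whereas the token-based approach routes them through a projector (as in the earlier sketch) and injects a dedicated embedding. The prompt template and field names below are illustrative assumptions, not the paper's exact wording.

```python
# Text-based control: spatial cues serialized into natural language and tokenized as words.
def text_control_prompt(ref_xy, ref_azimuth, obj_xy) -> str:
    return (
        f"A person stands at ({ref_xy[0]:.2f}, {ref_xy[1]:.2f}) facing {ref_azimuth:.0f} degrees. "
        f"An object is at ({obj_xy[0]:.2f}, {obj_xy[1]:.2f}). "
        "From the person's point of view, is the object on their left or right?"
    )

print(text_control_prompt(ref_xy=(0.40, 0.55), ref_azimuth=180.0, obj_xy=(0.62, 0.50)))
# The token-based condition would instead pass the same numbers through a projector
# and splice the resulting embedding into the input sequence, bypassing the tokenizer.
```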
Impact on Autonomous Systems Navigation
Imagine an autonomous delivery robot navigating a crowded urban environment. With standard MLMs, the robot may struggle to infer the intent of pedestrians or other vehicles from their orientation, leading to inefficient or unsafe pathing. By integrating cognitively-inspired perspective tokens, the robot can build an allocentric understanding of its surroundings: it can effectively 'mentally rotate' its viewpoint to understand how a pedestrian sees a barrier, or how another vehicle is positioned relative to an obstacle from that vehicle's point of view. In this illustrative scenario, more accurate action prediction could reduce collision risk by an estimated 15-20% and support more fluid navigation, improving both efficiency and safety in real-world deployments, which translates to operational savings and greater public trust in AI-powered logistics.
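To make the 'mental rotation' step tangible, the snippet below re-expresses an obstacle's position in a reference agent's frame given that agent's position and heading. This is standard 2D frame transformation shown only as a worked example of allocentric coordinates; the conventions are chosen for illustration and are not code from the paper.

```python
import math

def to_allocentric(obstacle_xy, agent_xy, agent_azimuth_deg):
    """Express obstacle_xy in the reference agent's frame.

    Conventions (chosen for illustration): azimuth is measured counter-clockwise
    from the world +x axis, and the returned pair is (forward, right) relative
    to the agent's heading.
    """
    dx = obstacle_xy[0] - agent_xy[0]
    dy = obstacle_xy[1] - agent_xy[1]
    theta = math.radians(agent_azimuth_deg)
    forward = math.cos(theta) * dx + math.sin(theta) * dy   # along the agent's heading
    right = math.sin(theta) * dx - math.cos(theta) * dy     # to the agent's right
    return forward, right

# A barrier 2 m east of a pedestrian facing north (azimuth 90 degrees) sits
# 0 m ahead and 2 m to the pedestrian's right:
fwd, rgt = to_allocentric(obstacle_xy=(2.0, 0.0), agent_xy=(0.0, 0.0), agent_azimuth_deg=90.0)
print(round(fwd, 6), round(rgt, 6))  # 0.0 2.0
```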
Your Strategic AI Implementation Roadmap
Integrating advanced spatial reasoning into your AI systems is a journey. Here’s a typical roadmap for enterprise adoption:
Phase 01: Discovery & Assessment
Evaluate current MLM limitations, identify key spatial reasoning challenges, and assess data readiness for token integration. Define specific use cases and success metrics.
Phase 02: Pilot Program & Token Integration
Develop a pilot project: integrate perspective tokens into a selected MLM, fine-tune with custom data, and validate performance against clearly defined improvement benchmarks.
Phase 03: Scaled Deployment & Optimization
Expand token-enhanced MLMs across enterprise applications. Continuously monitor performance, refine token types, and optimize for broader scenarios, including 3D spatial reasoning.
Future Implications: Advanced 3D Spatial Reasoning
The work opens avenues for integrating perspective tokens with depth-based perception tokens to create comprehensive three-dimensional spatial representations. This would enable more sophisticated spatial reasoning tasks, such as understanding occlusion from different perspectives and navigating complex 3D environments, aligning artificial intelligence more closely with human spatial cognition.
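A purely speculative sketch of that direction: fusing a depth-derived embedding with a perspective embedding by concatenation and a shared projector. None of this comes from the paper; the `SpatialTokenFusion` module, its dimensions, and the concat-then-project design are assumptions meant only to show what combining token types could look like architecturally.

```python
import torch
import torch.nn as nn

class SpatialTokenFusion(nn.Module):
    """Speculative: fuse a depth-based perception embedding with a perspective
    embedding into a single 3D-aware spatial token. Design is an assumption."""

    def __init__(self, lm_hidden_size: int = 5120):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * lm_hidden_size, lm_hidden_size),
            nn.GELU(),
            nn.Linear(lm_hidden_size, lm_hidden_size),
        )

    def forward(self, depth_embed: torch.Tensor, perspective_embed: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, lm_hidden_size); output: one fused spatial token per example.
        return self.fuse(torch.cat([depth_embed, perspective_embed], dim=-1))
```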
Ready to Overcome Egocentric Bias in Your AI?
Unlock more human-like spatial reasoning capabilities in your multimodal models. Let’s discuss how cognitively-inspired tokens can transform your enterprise AI applications.