Foundation Model for Multimodal AI Agents

Magma: Bridging Verbal and Spatial Intelligence for Autonomous AI

Magma is a foundation model that interprets multimodal inputs and grounds them in its environment. Given a goal, it formulates a plan and executes actions to achieve it in both digital and physical worlds. By using Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning, Magma achieves state-of-the-art performance in UI navigation and robotic manipulation, a significant step forward for generalist AI agents.

By Jianwei Yang et al. | arXiv:2502.13130v1 [cs.CV] 18 Feb 2025

Executive Impact: Unlocking New AI Agent Capabilities

Magma combines verbal and spatial intelligence in a single model, so one agent can both understand instructions and act on them across diverse environments. For enterprise applications, this opens the way to automating workflows that span software interfaces and physical operations.

New SOTA UI Nav & Robotics Performance

First True Foundation AI Agent

+19.6% Robotics Success Rate Gain (vs. OpenVLA)

Deep Analysis & Enterprise Applications

Unlocking Autonomous Action Across Digital and Physical Realms

Magma is designed as a generalist multimodal AI agent, proficient in both multimodal understanding and action-taking. It seamlessly navigates complex tasks in digital interfaces, like UI navigation, and performs precise manipulations in physical environments, such as robotic arm control. This dual capability is a hallmark of true agentic intelligence, enabling adaptive responses to diverse real-world scenarios.

Magma in Action: Bridging Digital and Physical Environments

In UI navigation, Magma achieves new state-of-the-art results on benchmarks such as Mind2Web and AITW (Android in the Wild). It grounds actions accurately in UI screenshots, identifying clickable elements and the intended outcome, and significantly outperforms previous vision-based and domain-specific models. For instance, in mobile UI tasks, Magma can complete multi-step workflows that run from the home screen to installing an app, demonstrating robust zero-shot transfer.

For robotic manipulation, Magma excels in 3D spatial intelligence, achieving new SOTA results on tasks from the Bridge and LIBERO benchmarks. When deployed on a WidowX robot for tasks like 'Put Object in Drawer' or 'Pick up Mushroom to Pot,' Magma shows higher success rates, often nearly doubling those of leading models like OpenVLA. This performance is attributed to its precise spatial understanding and grounding, which supports object manipulation even in tasks that require fine-grained control.

Magma's Agentic Process Flow

Perceive Multimodal Input → Understand Goal & Context → Formulate Action Plan → Ground Actions Spatially → Execute Actions in Environment → Achieve Task
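
The process flow above can be read as a simple control loop. The sketch below is only an illustration under stated assumptions: `run_agent`, `Action`, and the toy cursor-and-button environment are hypothetical names invented for this example, not part of Magma or its released code, and the real model replaces the hand-written `plan` and `ground` callables with learned predictions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    kind: str                # hypothetical action label, e.g. "move_right"
    target: Tuple[int, int]  # coordinates the action is grounded to


def run_agent(
    observe: Callable[[], Dict[str, int]],
    plan: Callable[[Dict[str, int]], str],
    ground: Callable[[Dict[str, int], str], Action],
    execute: Callable[[Action], None],
    goal_reached: Callable[[Dict[str, int]], bool],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat perceive -> plan -> ground -> execute until the goal is met."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = observe()             # perceive the current state
        if goal_reached(obs):       # stop once the task is achieved
            break
        step = plan(obs)            # formulate the next step
        action = ground(obs, step)  # tie the step to a spatial target
        execute(action)             # act in the environment
        history.append(action)
    return history


# Toy environment (an assumption for illustration): a cursor at x=0
# must reach a button at x=3 by moving one unit per step.
state = {"cursor": 0, "button": 3}

history = run_agent(
    observe=lambda: dict(state),
    plan=lambda obs: "move_right" if obs["cursor"] < obs["button"] else "move_left",
    ground=lambda obs, step: Action(
        step, (obs["cursor"] + (1 if step == "move_right" else -1), 0)
    ),
    execute=lambda action: state.update(cursor=action.target[0]),
    goal_reached=lambda obs: obs["cursor"] == obs["button"],
)

print(len(history), state["cursor"])  # 3 3
```

Keeping planning and grounding as separate callables mirrors the split in the flow: the plan says what to do next, while grounding decides where in the observation to do it.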

Set-of-Mark & Trace-of-Mark: The Foundation of Spatial-Temporal Intelligence

Magma's core innovation lies in two pretraining tasks: Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning. These techniques turn diverse datasets, from static images to dynamic videos, into 'vision-language-action' data, bridging the semantic gap between verbal intelligence and spatial-temporal actions.

SoM enables the model to identify actionable visual objects by overlaying numeric marks on them in the image (e.g., on clickable buttons or robot arms), which significantly eases action grounding. ToM extends this to video by having the model predict the future movements and trajectories of marked objects, which forces it to learn longer temporal horizons and action-related dynamics while making efficient use of vast amounts of unlabeled video data.
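
To make the two ideas concrete, here is a minimal sketch of how SoM-style marks and ToM-style targets could be represented as data. It is an assumption-laden illustration, not the paper's pipeline: `set_of_mark` and `trace_of_mark` are hypothetical helper names, and the candidate boxes and per-frame tracks are given by hand here, whereas a real system would obtain them from detectors or point trackers.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[int, int]


def set_of_mark(boxes: List[Box]) -> Dict[int, Point]:
    """Assign numeric marks 1..N to candidate boxes, placed at each box center."""
    return {
        mark: ((x0 + x1) // 2, (y0 + y1) // 2)
        for mark, (x0, y0, x1, y1) in enumerate(boxes, start=1)
    }


def trace_of_mark(
    tracks: Dict[int, List[Point]], t: int, horizon: int
) -> Dict[int, List[Point]]:
    """For each mark, return its positions in the `horizon` frames after frame t."""
    return {mark: pts[t + 1 : t + 1 + horizon] for mark, pts in tracks.items()}


# Two hypothetical clickable elements on a screenshot.
marks = set_of_mark([(10, 10, 50, 30), (100, 40, 140, 80)])
print(marks)  # {1: (30, 20), 2: (120, 60)}

# Hypothetical per-frame positions of two marks in a four-frame clip:
# mark 1 moves, mark 2 stays still.
tracks = {
    1: [(0, 0), (1, 0), (2, 1), (3, 1)],
    2: [(5, 5), (5, 5), (5, 5), (5, 5)],
}
future = trace_of_mark(tracks, t=0, horizon=2)
print(future)  # {1: [(1, 0), (2, 1)], 2: [(5, 5), (5, 5)]}
```

In this framing, grounding reduces to choosing a mark number, and planning reduces to predicting the future positions of the chosen marks.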

Critical Synergy: SoM & ToM Enhance Spatial-Temporal Intelligence

Magma's Action Grounding & Planning Workflow

Visual Observations (Image/Video) → Identify Actionable Objects (SoM) → Predict Future Trajectories (ToM) → Generate Action Sequence → Refine Spatial-Temporal Understanding

State-of-the-Art Performance Across Diverse Tasks

Magma consistently sets new state-of-the-art results across a wide range of agentic AI tasks, demonstrating strong generalization. From complex UI navigation to intricate robotic manipulation and general multimodal understanding (VQA, VideoQA), Magma's unified architecture outperforms models specifically tailored to individual domains.

Model	Key Feature	UI Navigation (Avg SR)	Robotics Manipulation (Avg SR)	Multimodal Understanding (Avg Score)
Magma-8B (Ours)	Unified VL + SoM/ToM	71.8% (Mind2Web)	52.3% (SimplerEnv)	80.0% (VQAv2)
OpenVLA [54]	Robotics-focused VLA	Not Supported	31.7% (SimplerEnv)	N/A
GPT-4V-OmniParser [83]	Vision-based UI Agent	77.3% (ScreenSpot Web)	Not Supported	N/A
LLaVA-1.5 [71]	General-domain LMM	Not Supported	Not Supported	78.5% (VQAv2)
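
When reading the robotics column, it helps to separate the gap in percentage points from the gain relative to the baseline. The short calculation below uses only the two SimplerEnv figures from the table; the variable names are chosen for illustration.

```python
magma_sr = 52.3    # Magma-8B, SimplerEnv average success rate (%)
openvla_sr = 31.7  # OpenVLA, SimplerEnv average success rate (%)

# Gap in percentage points.
absolute_gain = round(magma_sr - openvla_sr, 1)

# Gain as a percentage of the OpenVLA baseline.
relative_gain = round(100 * (magma_sr - openvla_sr) / openvla_sr, 1)

print(absolute_gain, relative_gain)  # 20.6 65.0
```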

28% Gain: Video QA Performance (vs. Leading Models)

Your AI Agent Implementation Roadmap

A phased approach ensures successful integration and maximum impact. Our experts guide you through each step, from initial assessment to full-scale deployment and ongoing optimization.

Phase 01: Strategic Assessment & Planning

Evaluate existing workflows, identify high-impact automation opportunities, and define clear objectives and success metrics for AI agent deployment.

Phase 02: Pilot Program & Customization

Implement a pilot program with Magma, customizing it to your specific data and environment. Validate performance and refine configurations based on real-world feedback.

Phase 03: Full-Scale Deployment & Integration

Seamlessly integrate Magma into your enterprise systems. Scale operations, ensure robust performance, and provide training for your teams.

Phase 04: Continuous Optimization & Support

Monitor agent performance, identify areas for further improvement, and receive ongoing support and updates to ensure sustained value and adaptability.

Ready to Transform Your Enterprise with AI Agents?

Connect with our AI specialists to explore how Magma can empower your operations, enhance efficiency, and drive innovation across your digital and physical workflows.
