M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
Revolutionizing Robotic Action Segmentation with Multimodal AI
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
Executive Impact: Key Performance & Strategic Advantages
Highly Positive: M2R2 represents a significant leap forward in robotic temporal action segmentation, offering robust and generalizable performance across diverse tasks and embodiments. The modular architecture facilitates easier integration with advanced TAS models and ensures learned features are reusable, accelerating AI adoption in complex manipulation tasks.
Deep Analysis & Enterprise Applications
Temporal Action Segmentation (TAS) is critical in robotics and computer vision for breaking down complex tasks. Robotics traditionally uses proprioceptive data, while computer vision uses exteroceptive sensors. Current multimodal approaches in robotics are often end-to-end, limiting feature reusability, and vision-only methods struggle with occlusion. M2R2 addresses these by offering a modular, multimodal feature extractor.
M2R2 is a multimodal deep learning feature extractor that fuses proprioceptive (force, torque, joint positions, end-effector pose, gripper width) and exteroceptive (RGB cameras, audio) data. It employs a late fusion strategy with independent processing per modality, followed by a Transformer-based model for fusion. This modularity allows reuse with various state-of-the-art TAS models.
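The late-fusion design can be illustrated with a short sketch. The PyTorch module below is a minimal approximation, not the authors' implementation; all encoder choices, dimensions, and names are illustrative assumptions. Each modality is encoded independently, and the per-modality embeddings are then fused as tokens by a Transformer encoder.

```python
import torch
import torch.nn as nn


class LateFusionExtractor(nn.Module):
    """Minimal late-fusion sketch: one encoder per modality, fused by a
    Transformer encoder. Module names and sizes are illustrative."""

    def __init__(self, encoders: dict, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # independent per-modality encoders
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Encode each modality independently: each encoder maps its input to (batch, d_model)
        tokens = torch.stack([self.encoders[m](x) for m, x in inputs.items()], dim=1)
        fused = self.fusion(tokens)   # (batch, n_modalities, d_model)
        return fused.mean(dim=1)      # pooled window-level representation


# Toy per-modality encoders; real backbones (image network for RGB, audio
# network, MLP for proprioception) would replace these in practice.
extractor = LateFusionExtractor({
    "rgb": nn.Sequential(nn.Flatten(), nn.LazyLinear(256)),
    "audio": nn.Sequential(nn.Flatten(), nn.LazyLinear(256)),
    "proprio": nn.Sequential(nn.Flatten(), nn.LazyLinear(256)),
})

feature = extractor({
    "rgb": torch.randn(4, 3, 64, 64),
    "audio": torch.randn(4, 1, 400),
    "proprio": torch.randn(4, 20),
})  # -> (4, 256) window-level features
```

Because fusion happens on top of independently encoded modalities, the resulting window features can be cached and reused by any downstream TAS model without retraining the extractor.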
The training strategy for M2R2 focuses on learning temporal dependencies and accurate boundary detection. It uses a multi-head self-attention Transformer encoder (the Fusion Transformer) and a Boundary Regression Network. The training objectives maximize the similarity between window representations and textual action embeddings (L_action) and minimize the mean squared error of boundary predictions (L_boundary).
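As a rough illustration of these objectives, the sketch below pairs a cosine-similarity alignment term with an MSE boundary term. The exact formulation, weighting, and text encoder used in the paper may differ; function names and the weighting factor are assumptions.

```python
import torch
import torch.nn.functional as F


def action_loss(window_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # L_action: encourage each window representation to be similar to the text
    # embedding of its action label (cosine-similarity form, illustrative).
    window_feats = F.normalize(window_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return (1.0 - (window_feats * text_embeds).sum(dim=-1)).mean()


def boundary_loss(pred_boundaries: torch.Tensor, gt_boundaries: torch.Tensor) -> torch.Tensor:
    # L_boundary: mean squared error between the Boundary Regression Network's
    # predictions and the ground-truth boundary signal.
    return F.mse_loss(pred_boundaries, gt_boundaries)


def total_loss(window_feats, text_embeds, pred_boundaries, gt_boundaries, lam: float = 1.0):
    # Combined objective; equal weighting (lam = 1.0) is an assumption here.
    return action_loss(window_feats, text_embeds) + lam * boundary_loss(pred_boundaries, gt_boundaries)
```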
M2R2 achieves state-of-the-art performance on REASSEMBLE, (Im)PerfectPour, and JIGSAWS datasets. It outperforms vision-only and proprioception-only baselines, demonstrating strong generalization across task domains and embodiments. Ablation studies show that combining all modalities yields the highest performance, with vision alone performing the worst.
Ablation studies reveal the significant impact of different modalities. Vision-only features struggle with object visibility. Audio improves boundary detection, especially when distinct sounds are present. Proprioceptive data alone performs comparably to all modalities on REASSEMBLE, highlighting the dataset's specificity. Gripper information is particularly crucial for object identification based on size.
M2R2 Feature Extraction Process
| Method | F1@50 (Fine-grain) | Key Advantage |
|---|---|---|
| BRP+ASRF (Vision-only) | 6.4% | Vision features only |
| BOCPD (FT) | 12.8% | Force/torque signals only |
| AWE (P) | 35.8% | Proprioception only |
| M2R2+ASRF (All Modalities) | 82.4% | Fuses vision, audio, and proprioception |
Impact of Multimodal Fusion in Robotic Assembly
Challenge: Distinguishing visually similar small objects and accurately segmenting contact-rich manipulation tasks are difficult with unimodal data.
Solution: M2R2 leverages vision, audio, and proprioceptive data, processing each independently before fusing them with a Transformer. This allows for rich contextual understanding, differentiating objects by size, contact sounds, and force profiles.
Result: F1@50 improves by 46.6 percentage points over unsupervised robotic TAS models on the REASSEMBLE dataset, with strong generalization to new task domains such as (Im)PerfectPour.
Quantify Your Robotic TAS ROI
Estimate the efficiency gains and cost savings for your enterprise by implementing M2R2's multimodal temporal action segmentation.
Your M2R2 Implementation Roadmap
A clear path to leveraging multimodal AI for enhanced robotic task segmentation and efficiency.
Phase 1: Data Collection & M2R2 Feature Extraction Setup
Configure existing sensors (cameras, microphones, robot proprioception) and integrate M2R2 feature extractor. Initial data collection and preprocessing for target tasks.
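A minimal sketch of the kind of preprocessing this phase involves: slicing time-aligned sensor streams into overlapping windows for feature extraction. The helper below is hypothetical (NumPy assumed); window length and hop size should be chosen to match your sensors and task granularity.

```python
import numpy as np


def make_windows(streams: dict, timestamps: dict, window_s: float = 1.0, hop_s: float = 0.5):
    """Slice time-aligned sensor streams (camera frames, audio, proprioception)
    into overlapping windows. Hypothetical preprocessing helper."""
    t_start = max(ts[0] for ts in timestamps.values())   # latest common start time
    t_end = min(ts[-1] for ts in timestamps.values())    # earliest common end time
    windows, t = [], t_start
    while t + window_s <= t_end:
        windows.append({
            name: data[(timestamps[name] >= t) & (timestamps[name] < t + window_s)]
            for name, data in streams.items()
        })
        t += hop_s
    return windows


# Example: a 30 Hz camera and 100 Hz proprioception, both 10 s long
cam_t = np.arange(0, 10, 1 / 30)
prop_t = np.arange(0, 10, 1 / 100)
windows = make_windows(
    streams={"rgb": np.zeros((len(cam_t), 64, 64, 3)), "proprio": np.zeros((len(prop_t), 20))},
    timestamps={"rgb": cam_t, "proprio": prop_t},
)
```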
Phase 2: M2R2 Training & Fine-tuning
Train M2R2 feature extractor on custom datasets (or fine-tune on general datasets). Optimize for specific task domains and robotic embodiments.
Phase 3: Integration with TAS Models & Deployment
Integrate M2R2 features with state-of-the-art TAS models (e.g., ASRF, DiffAct). Validate performance in real-world robotic environments and deploy for automated task segmentation.
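Conceptually, integration means feeding saved M2R2 window features to a window-level segmentation model. The snippet below is a stand-in sketch showing the expected tensor shapes; random features and a 1x1 convolution head take the place of real M2R2 outputs and a real ASRF or DiffAct model.

```python
import torch
import torch.nn as nn

# Stand-in shapes: d_model-dimensional features for T windows of one demonstration.
d_model, T, n_actions = 256, 120, 10
features = torch.randn(1, d_model, T)    # replace with saved M2R2 features

# Stand-in segmentation head; in practice load an ASRF- or DiffAct-style model
# that consumes per-window feature sequences.
tas_head = nn.Conv1d(d_model, n_actions, kernel_size=1)

with torch.no_grad():
    logits = tas_head(features)          # (1, n_actions, T)
segments = logits.argmax(dim=1)          # predicted action label per window
```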
Ready to Transform Your Operations?
Unlock the full potential of your robotic systems with M2R2's advanced multimodal action segmentation. Our experts are ready to guide you.