Enterprise AI Analysis: DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking



This paper introduces DRMOT, a novel task for RGBD Referring Multi-Object Tracking, overcoming the limitations of 2D RGB-only methods in handling complex spatial semantics and occlusion by integrating RGB, Depth, and Language modalities. It presents DRSet, a new dataset tailored for 3D-aware tracking, and DRTrack, an MLLM-guided framework demonstrating state-of-the-art performance.

Authors: Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao

Affiliation: Huazhong University of Science and Technology, South-Central Minzu University

Publication Date: February 5, 2026

Executive Impact: Key Findings for Enterprise AI

This research enhances AI's ability to understand and track objects in complex real-world environments, a critical advancement for robotics, autonomous systems, and interactive AI. By integrating 3D depth information with visual and linguistic cues, it substantially improves precision and robustness, paving the way for more reliable and intelligent enterprise applications.

1st RGBD Referring Multi-Object Tracking Task
187 Real-world Scenes in DRSet
56 Depth-Related Language Descriptions
~120% HOTA Score Increase (over MLLM baseline)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

The Critical Need for 3D Spatial Understanding in RMOT

Traditional Referring Multi-Object Tracking (RMOT) systems, relying solely on 2D RGB data, face significant challenges in enterprise applications such as autonomous navigation and robotics. These limitations become apparent when dealing with complex spatial semantics in language descriptions (e.g., 'the person closest to the camera') or severe object occlusion. Without explicit 3D spatial information, models struggle to accurately ground targets and maintain identity consistency, leading to unreliable performance in real-world dynamic environments. This gap necessitates a paradigm shift towards integrating depth information for robust AI perception.

Introducing RGBD Referring Multi-Object Tracking (DRMOT)

The proposed DRMOT task addresses these limitations by explicitly requiring models to fuse RGB, Depth (D), and Language (L) modalities for 3D-aware tracking. To facilitate research, the authors constructed DRSet, a tailored RGBD referring multi-object tracking dataset. DRSet includes 187 diverse scenes with paired RGB images and depth maps, along with 240 language descriptions. Crucially, 56 of these descriptions incorporate explicit depth-related information, making DRSet a comprehensive resource for evaluating models' spatial-semantic grounding and tracking capabilities under real-world conditions. This enables the development of models that can truly understand and utilize 3D spatial cues from natural language.
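The paper does not publish DRSet's on-disk schema, but the description above makes clear what each sample must pair together. As a rough illustration only, a sample in an RGBD referring-tracking dataset could be modeled like this (all field names are hypothetical, not the dataset's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class DRSetSample:
    """One annotated sequence in an RGBD referring-MOT dataset.

    Field names are illustrative; DRSet's real schema may differ.
    """
    scene_id: str          # one of the 187 scenes
    rgb_paths: list        # per-frame RGB image paths
    depth_paths: list      # per-frame aligned depth-map paths
    description: str       # natural-language referring expression
    depth_related: bool    # True for the 56 depth-aware descriptions
    boxes: dict = field(default_factory=dict)  # frame idx -> {track_id: [x1, y1, x2, y2]}

sample = DRSetSample(
    scene_id="scene_001",
    rgb_paths=["frames/000000.png"],
    depth_paths=["depth/000000.png"],
    description="the person closest to the camera",
    depth_related=True,
)
```

The key point the sketch captures is that each language description is grounded against paired RGB and depth streams, so a model can be evaluated on whether it actually exploits the depth channel when the expression demands it.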

DRTrack: An MLLM-Guided Depth-Referring Tracking Framework

To tackle the DRMOT task, the paper introduces DRTrack, a framework designed for robust grounding and stable cross-frame association. DRTrack leverages a Multimodal Large Language Model (MLLM), specifically Qwen2.5-VL-3B-Instruct, for its 'Depth-Promoted Language Grounding' stage. This MLLM takes RGB images, depth maps, and language descriptions as joint inputs, allowing it to perform depth-aware target grounding and eliminate 2D ambiguities. Subsequently, the 'Depth-Enhanced OC-SORT Association' stage uses the MLLM's output bounding boxes and depth-weighted IoU constraints for precise and stable trajectory association, effectively resolving identity confusion, especially under occlusion. The framework also employs Geometric-Aware GRPO Fine-Tuning to align the MLLM's output policy with 3D geometric constraints.
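The paper's association stage relies on depth-weighted IoU constraints, but the exact weighting function is not given here. A minimal sketch of the idea, assuming (our assumption, not the paper's formula) that 2D overlap is scaled by an exponential similarity of the two targets' depths:

```python
import math

def iou(box_a, box_b):
    """Standard IoU between two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def depth_weighted_iou(box_a, depth_a, box_b, depth_b, scale=1.0):
    """Down-weight 2D overlap when the two targets sit at different depths,
    so an occluder in front is not matched to the occluded track behind it.
    The exponential form of the depth term is a hypothetical choice."""
    depth_sim = math.exp(-abs(depth_a - depth_b) / scale)  # in (0, 1]
    return iou(box_a, box_b) * depth_sim
```

Two detections with identical boxes but very different depths score well below plain IoU, which is precisely the signal OC-SORT-style matching can use to keep identities stable under occlusion.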

Superior Performance through RGBD-L Fusion

Extensive experiments on the DRSet dataset demonstrate DRTrack's superiority: it achieves a HOTA score of 33.24%, a nearly 120% relative improvement over the strongest zero-shot MLLM baseline (HOTA: 15.13%). Ablation studies confirm the critical contribution of the depth modality, which alone lifts HOTA from 15.13% to 32.68%, and of Geometric-Aware GRPO fine-tuning, which further refines the score to 33.24%. DRTrack also attains the highest scores on the other metrics, DetA (32.35%) and AssA (34.97%), validating its robust spatial-semantic grounding and association capabilities for RGBD referring multi-object tracking.
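The reported gains can be checked directly from the HOTA numbers quoted in the text:

```python
# HOTA scores (%) as reported in the text
baseline, with_depth, full = 15.13, 32.68, 33.24

rel_gain = (full - baseline) / baseline * 100  # improvement over zero-shot MLLM baseline
depth_gain = with_depth - baseline             # absolute gain from the depth modality alone
grpo_gain = full - with_depth                  # further gain from GRPO fine-tuning

print(f"relative gain: {rel_gain:.1f}%")       # ~119.7%, i.e. the reported ~120%
print(f"depth modality: +{depth_gain:.2f} HOTA, GRPO fine-tuning: +{grpo_gain:.2f} HOTA")
```

Note that most of the improvement comes from adding depth (+17.55 HOTA), with GRPO fine-tuning contributing a smaller final refinement (+0.56 HOTA).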

2D RGB-only RMOT vs. DRMOT (RGBD+L)

Feature              | 2D RGB-only RMOT                                     | DRMOT (RGBD+L)
Input Modalities     | RGB, Language                                        | RGB, Depth, Language
Spatial Reasoning    | Limited; struggles with depth-dependent descriptions | Enhanced; leverages depth for 3D relationships
Occlusion Robustness | Poor; unstable identities                            | Improved; uses depth cues for ID consistency
Target Grounding     | Ambiguous for spatial semantics                      | Accurate for complex spatial semantics
56 Depth-Related Language Descriptions

This unique feature of the DRSet dataset enables rigorous evaluation of models' ability to understand and utilize 3D spatial cues embedded in natural language, a critical advancement for robust referring multi-object tracking.

DRTrack Framework Pipeline

RGB, Depth, Language Input
Multimodal Large Language Model (MLLM)
Geometric-Aware GRPO Fine-Tuning
Precise Bounding Box Output
Depth-Enhanced OC-SORT Association
Robust Trajectories
33.24% HOTA Score Achieved by DRTrack

DRTrack demonstrates state-of-the-art performance, significantly outperforming previous RGB-only methods and baseline MLLMs on the DRSet dataset, validating the power of fusing RGB, Depth, and Language modalities.

Quantify Your Enterprise AI Advantage

Use our interactive calculator to estimate the potential annual savings and reclaimed employee hours by implementing advanced AI solutions, leveraging insights from this cutting-edge research.


Your AI Implementation Roadmap

Implementing cutting-edge capabilities like DRMOT-style tracking requires a strategic approach. Our roadmap guides enterprises through the essential phases to ensure successful integration and maximum impact.

Phase 1: Discovery & Strategy

Assess current infrastructure, identify key use cases for RGBD Referring Multi-Object Tracking, and define clear business objectives. Develop a tailored strategy aligning with organizational goals and technical capabilities.

Phase 2: Data Integration & Model Adaptation

Integrate existing RGB and depth data streams. Adapt DRTrack or similar MLLM-guided frameworks using transfer learning and fine-tuning on proprietary datasets to optimize for specific enterprise environments.

Phase 3: Pilot Deployment & Evaluation

Deploy the AI solution in a controlled pilot environment. Rigorously evaluate performance against defined KPIs, focusing on accuracy, robustness, and real-time capabilities. Gather feedback for iterative refinement.

Phase 4: Full-Scale Integration & Monitoring

Seamlessly integrate the refined AI system into existing operational workflows. Establish continuous monitoring and maintenance protocols to ensure ongoing performance, scalability, and security.

Ready to Transform Your Enterprise with Advanced AI?

The insights from DRMOT highlight the immense potential of 3D-aware AI in complex tracking scenarios. Let's discuss how these innovations can be leveraged to create intelligent, efficient, and robust solutions for your business challenges.
