Enterprise AI Analysis
DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking
This paper introduces DRMOT, a novel task for RGBD Referring Multi-Object Tracking, overcoming the limitations of 2D RGB-only methods in handling complex spatial semantics and occlusion by integrating RGB, Depth, and Language modalities. It presents DRSet, a new dataset tailored for 3D-aware tracking, and DRTrack, an MLLM-guided framework demonstrating state-of-the-art performance.
Authors: Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao
Affiliation: Huazhong University of Science and Technology, South-Central Minzu University
Publication Date: February 5, 2026
Executive Impact: Key Findings for Enterprise AI
This research fundamentally enhances AI's ability to understand and track objects in complex real-world environments, a critical advancement for robotics, autonomous systems, and interactive AI. By integrating 3D depth information with visual and linguistic cues, it provides unprecedented precision and robustness, paving the way for more reliable and intelligent enterprise applications.
Deep Analysis & Enterprise Applications
The Critical Need for 3D Spatial Understanding in RMOT
Traditional Referring Multi-Object Tracking (RMOT) systems, relying solely on 2D RGB data, face significant challenges in enterprise applications such as autonomous navigation and robotics. These limitations become apparent when dealing with complex spatial semantics in language descriptions (e.g., 'the person closest to the camera') or severe object occlusion. Without explicit 3D spatial information, models struggle to accurately ground targets and maintain identity consistency, leading to unreliable performance in real-world dynamic environments. This gap necessitates a paradigm shift towards integrating depth information for robust AI perception.
Introducing RGBD Referring Multi-Object Tracking (DRMOT)
The proposed DRMOT task addresses these limitations by explicitly requiring models to fuse RGB, Depth (D), and Language (L) modalities for 3D-aware tracking. To facilitate research, the authors constructed DRSet, a tailored RGBD referring multi-object tracking dataset. DRSet includes 187 diverse scenes with paired RGB images and depth maps, along with 240 language descriptions. Crucially, 56 of these descriptions incorporate explicit depth-related information, making DRSet a comprehensive resource for evaluating models' spatial-semantic grounding and tracking capabilities under real-world conditions. This enables the development of models that can truly understand and utilize 3D spatial cues from natural language.
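To make the dataset's structure concrete, here is a minimal sketch of what a single DRSet-style sample could look like in code. The schema and field names are illustrative assumptions; the paper specifies the modalities and scale (187 scenes, 240 expressions, 56 of them depth-dependent) but not an on-disk format.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one DRSet-style sequence. The field
# names are illustrative assumptions, not the dataset's actual schema.

@dataclass
class BoxAnnotation:
    frame_idx: int
    track_id: int                                 # identity to preserve across frames
    bbox_xywh: tuple[float, float, float, float]  # 2D box in pixels

@dataclass
class DRSetSequence:
    scene_id: str            # one of the 187 scenes
    rgb_frames: list[str]    # paths to RGB images
    depth_frames: list[str]  # paths to per-frame aligned depth maps
    expression: str          # one of the 240 language descriptions
    depth_related: bool      # True for the 56 explicitly depth-dependent expressions
    annotations: list[BoxAnnotation] = field(default_factory=list)

sample = DRSetSequence(
    scene_id="scene_001",
    rgb_frames=["scene_001/rgb/000001.png"],
    depth_frames=["scene_001/depth/000001.png"],
    expression="the person closest to the camera",
    depth_related=True,
)
```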
DRTrack: An MLLM-Guided Depth-Referring Tracking Framework
To tackle the DRMOT task, the paper introduces DRTrack, a framework designed for robust grounding and stable cross-frame association. DRTrack leverages a Multimodal Large Language Model (MLLM), specifically Qwen2.5-VL-3B-Instruct, for its 'Depth-Promoted Language Grounding' stage. This MLLM takes RGB images, depth maps, and language descriptions as joint inputs, allowing it to perform depth-aware target grounding and eliminate 2D ambiguities. Subsequently, the 'Depth-Enhanced OC-SORT Association' stage uses the MLLM's output bounding boxes and depth-weighted IoU constraints for precise and stable trajectory association, effectively resolving identity confusion, especially under occlusion. The framework also employs Geometric-Aware GRPO Fine-Tuning to align the MLLM's output policy with 3D geometric constraints.
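The paper's exact depth-weighted IoU formulation is not reproduced in this analysis, so the sketch below shows one plausible instantiation of the idea: standard 2D IoU between a detection and a track prediction, scaled by a depth-similarity term so that boxes which overlap in the image plane but lie at very different depths score low. The exponential weighting and the `sigma` scale are assumptions for illustration.

```python
import numpy as np

def iou_xyxy(a: np.ndarray, b: np.ndarray) -> float:
    """Standard 2D IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def depth_weighted_iou(box_det, box_trk, depth_det, depth_trk, sigma=0.5):
    """One plausible depth-weighted IoU: 2D overlap scaled by depth
    similarity, so boxes that overlap in 2D but lie at very different
    depths (e.g. an occluder in front of a target) score low.
    `sigma` (metres) is an illustrative scale, not a value from the paper."""
    depth_sim = np.exp(-abs(depth_det - depth_trk) / sigma)
    return iou_xyxy(np.asarray(box_det), np.asarray(box_trk)) * depth_sim

# Two boxes that overlap heavily in 2D but sit ~2 m apart in depth
# are kept from matching by the depth term (score ~0.016 vs ~0.86 plain IoU):
print(depth_weighted_iou([10, 10, 50, 80], [12, 12, 52, 82], 3.0, 5.0))
```

In an OC-SORT-style association step, a score like this would replace plain IoU in the cost matrix passed to the matcher, which is how depth cues keep an occluder passing in front of a target from hijacking its identity.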
Superior Performance through RGBD-L Fusion
Extensive experiments on DRSet demonstrate DRTrack's clear superiority: it achieves a HOTA score of 33.24%, a relative improvement of nearly 120% over the strongest zero-shot MLLM baseline (HOTA: 15.13%). Ablation studies confirm the critical contribution of the depth modality, which alone lifts HOTA from 15.13% to 32.68%, and of Geometric-Aware GRPO fine-tuning, which further refines the score to 33.24%. DRTrack also attains the highest scores on the remaining metrics, DetA (32.35%) and AssA (34.97%), validating its robust spatial-semantic grounding and association for RGBD Referring Multi-Object Tracking.
| Feature | 2D RGB-only RMOT | DRMOT (RGBD+L) |
|---|---|---|
| Input Modalities | RGB, Language | RGB, Depth, Language |
| Spatial Reasoning | Limited, struggles with depth-dependent descriptions | Enhanced, leverages depth for 3D relationships |
| Occlusion Robustness | Poor, unstable identities | Improved, uses depth cues for ID consistency |
| Target Grounding | Ambiguous for spatial semantics | Accurate for complex spatial semantics |
The depth-dependent language descriptions in DRSet enable rigorous evaluation of a model's ability to understand and utilize 3D spatial cues embedded in natural language, a critical requirement for robust referring multi-object tracking that 2D RGB-only benchmarks cannot measure.
DRTrack Framework Pipeline
DRTrack demonstrates state-of-the-art performance, significantly outperforming previous RGB-only methods and baseline MLLMs on the DRSet dataset, validating the power of fusing RGB, Depth, and Language modalities.
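At a high level, the two stages can be summarized as a simple per-frame loop. This is a minimal sketch, not the authors' code: `grounder` and `tracker` are hypothetical objects standing in for the Depth-Promoted Language Grounding and Depth-Enhanced OC-SORT Association stages, respectively.

```python
# Minimal per-frame sketch of a DRTrack-style pipeline. Both helper
# objects are hypothetical placeholders for the paper's two stages.

def track_sequence(rgb_frames, depth_frames, expression, grounder, tracker):
    """Stage 1: the MLLM grounds the referred targets from RGB + depth +
    language. Stage 2: a depth-enhanced OC-SORT-style tracker links the
    grounded boxes into identity-consistent trajectories."""
    trajectories = []
    for rgb, depth in zip(rgb_frames, depth_frames):
        # Depth-promoted language grounding (MLLM with joint RGBD-L input).
        boxes = grounder.ground(rgb=rgb, depth=depth, text=expression)
        # Depth-enhanced association (e.g. depth-weighted IoU cost, see above).
        tracks = tracker.update(boxes, depth)
        trajectories.append(tracks)
    return trajectories
```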
Quantify Your Enterprise AI Advantage
Use our interactive calculator to estimate the potential annual savings and reclaimed employee hours by implementing advanced AI solutions, leveraging insights from this cutting-edge research.
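For readers of this static version of the page, the arithmetic behind such an estimate is straightforward. The sketch below implements a generic version; every input (task volume, minutes saved, working days, hourly rate) is a hypothetical placeholder to be replaced with your own operational numbers, not a figure from the research.

```python
def estimate_annual_savings(tasks_per_day: float,
                            minutes_saved_per_task: float,
                            working_days: int = 250,
                            hourly_rate: float = 45.0) -> tuple[float, float]:
    """Generic automation-ROI estimate. All inputs are hypothetical
    placeholders; plug in your own operational numbers."""
    hours_reclaimed = tasks_per_day * minutes_saved_per_task / 60 * working_days
    return hours_reclaimed, hours_reclaimed * hourly_rate

hours, dollars = estimate_annual_savings(tasks_per_day=120, minutes_saved_per_task=2)
print(f"{hours:.0f} hours reclaimed ≈ ${dollars:,.0f} per year")
```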
Your AI Implementation Roadmap
Implementing a cutting-edge capability such as RGBD Referring Multi-Object Tracking requires a strategic approach. Our roadmap guides enterprises through the essential phases to ensure successful integration and maximum impact.
Phase 1: Discovery & Strategy
Assess current infrastructure, identify key use cases for RGBD Referring Multi-Object Tracking, and define clear business objectives. Develop a tailored strategy aligning with organizational goals and technical capabilities.
Phase 2: Data Integration & Model Adaptation
Integrate existing RGB and depth data streams. Adapt DRTrack or similar MLLM-guided frameworks using transfer learning and fine-tuning on proprietary datasets to optimize for specific enterprise environments; a minimal adaptation sketch follows the roadmap below.
Phase 3: Pilot Deployment & Evaluation
Deploy the AI solution in a controlled pilot environment. Rigorously evaluate performance against defined KPIs, focusing on accuracy, robustness, and real-time capabilities. Gather feedback for iterative refinement.
Phase 4: Full-Scale Integration & Monitoring
Seamlessly integrate the refined AI system into existing operational workflows. Establish continuous monitoring and maintenance protocols to ensure ongoing performance, scalability, and security.
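As a concrete illustration of Phase 2, the sketch below applies parameter-efficient (LoRA) adaptation to the Qwen2.5-VL-3B-Instruct checkpoint that DRTrack builds on, using the Hugging Face transformers and peft libraries. The LoRA rank, alpha, and target modules are illustrative defaults rather than values from the paper, and the training loop itself is omitted.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base MLLM used by DRTrack's grounding stage.
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Attach LoRA adapters for parameter-efficient fine-tuning on a
# proprietary RGBD referring dataset. Rank, alpha, and target modules
# are illustrative; tune them for your own data and hardware.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Data collation (RGB and depth rendered as model inputs, expressions as
# prompts) and the optimization loop, including any GRPO-style reward
# shaping, would follow here and are omitted from this sketch.
```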
Ready to Transform Your Enterprise with Advanced AI?
The insights from DRMOT highlight the immense potential of 3D-aware AI in complex tracking scenarios. Let's discuss how these innovations can be leveraged to create intelligent, efficient, and robust solutions for your business challenges.