Language-Referred Audio-Visual Segmentation
Audit After Segmentation: Reference-Free Mask Quality Assessment
This research introduces a critical yet overlooked problem in language-referred audio-visual segmentation (Ref-AVS): Mask Quality Assessment (MQA). Current methods focus on generating masks but lack tools for reference-free evaluation and actionable quality diagnosis in real-world deployments where ground truth is unavailable. We propose MQA-RefAVS, a novel task to automatically assess mask quality, classify error types, and recommend corrective actions. To support this, we build MQ-RAVSBench, the first benchmark with diverse, representative mask error modes (26,061 instances across six types: Perfect, Cutout, Dilate, Erode, Merge, Full_neg), each annotated with IoU and a recommended action. We also develop MQ-Auditor, a multimodal large language model (MLLM)-based system that reasons over audio, video, language, and mask information to provide quantitative (IoU) and qualitative (error type, action) assessments. Extensive experiments show that MQ-Auditor significantly outperforms strong open-source and commercial MLLMs (e.g., Gemini-3-Flash) in accuracy and reliability. Crucially, MQ-Auditor integrates seamlessly with existing Ref-AVS models to detect segmentation failures and improve performance by guiding mask refinement. This work shifts the paradigm beyond mere mask generation towards robust, interpretable, and self-correcting segmentation systems.
Executive Impact: Key Performance Indicators
MQ-Auditor delivers measurable improvements in mask quality assessment, enabling more reliable and efficient AI-driven segmentation workflows.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction
This section introduces the novel task of Mask Quality Assessment (MQA) in the context of Language-Referred Audio-Visual Segmentation (Ref-AVS). It highlights the limitations of current Ref-AVS research, which primarily focuses on mask generation and IoU evaluation against ground truth, often unavailable in deployment. The proposed MQA-RefAVS task aims to automatically infer mask quality, identify error types, and recommend actions without ground truth. The paper also introduces MQ-RAVSBench, a new benchmark dataset for MQA in Ref-AVS, and MQ-Auditor, an MLLM-based auditor for assessment.
Task: MQA-RefAVS
This section formally defines the Mask Quality Assessment for Language-Referred Audio-Visual Segmentation (MQA-RefAVS) task. Given a video (V) with synchronized audio (A), a referring expression (R) for a target object, a key video frame (Vt), and a candidate binary mask (Mt), an auditor model (Φ) must predict: (1) Intersection over Union (IoU) (s ∈ [0,1]), (2) mask type (m) from predefined categories (perfect, full_neg, cutout, dilate, erode, merge), and (3) recommended action (a) for quality control (accept, minor revision, major revision, reject). The task requires joint multimodal perception and reasoning.
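The task's input/output contract can be captured as a small Python sketch. This is purely illustrative of the definition above (the class and field names are ours, not from the paper's code): the auditor Φ maps the multimodal input to an IoU estimate s ∈ [0,1], a mask type m, and a recommended action a.

```python
from dataclasses import dataclass
from enum import Enum

class MaskType(Enum):
    """The six predefined mask categories m."""
    PERFECT = "perfect"
    FULL_NEG = "full_neg"
    CUTOUT = "cutout"
    DILATE = "dilate"
    ERODE = "erode"
    MERGE = "merge"

class Action(Enum):
    """The four quality-control actions a."""
    ACCEPT = "accept"
    MINOR_REVISION = "minor revision"
    MAJOR_REVISION = "major revision"
    REJECT = "reject"

@dataclass
class AuditResult:
    """Output of the auditor Phi for one (V, A, R, V_t, M_t) input."""
    iou: float           # s, predicted Intersection over Union in [0, 1]
    mask_type: MaskType  # m, one of the six predefined categories
    action: Action       # a, recommended quality-control action

    def __post_init__(self):
        if not 0.0 <= self.iou <= 1.0:
            raise ValueError("IoU estimate must lie in [0, 1]")
```

A downstream pipeline can then branch on `AuditResult.action` without ever touching ground-truth masks.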
Dataset: MQ-RAVSBench
MQ-RAVSBench is presented as the first benchmark for mask quality assessment in Ref-AVS. Built on Ref-AVSBench, it comprises 1,840 videos and 2,046 reference texts, generating 26,061 mask instances. It includes six mask types: Perfect, Cutout, Dilate, Erode, Merge, and Full_neg, covering geometric and semantic errors. Each mask is annotated with IoU and a recommended action. The dataset is split into training (1,306 videos) and testing (534 videos, further divided into Seen/Unseen categories for open-vocabulary evaluation). The masks are generated using automated pipelines involving object detection models and MLLMs.
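To make the geometric error modes concrete, here is a minimal NumPy sketch of how a ground-truth mask could be perturbed into Dilate, Erode, Cutout, and Full_neg variants, with each variant's IoU computed against the original. This is an assumption-laden illustration, not the benchmark's actual construction pipeline, which the paper says relies on object detectors and MLLMs (notably for the semantic modes such as Merge).

```python
import numpy as np

def iou(mask, gt):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return inter / union if union else 1.0

def dilate(m):
    """One step of 4-neighbour binary dilation."""
    out = m.copy()
    out[1:, :] |= m[:-1, :]
    out[:-1, :] |= m[1:, :]
    out[:, 1:] |= m[:, :-1]
    out[:, :-1] |= m[:, 1:]
    return out

def erode(m):
    """One step of 4-neighbour binary erosion (dual of dilation)."""
    return ~dilate(~m)

def perturb(gt, mode):
    """Derive one geometric error mask from a ground-truth mask.
    Illustrative only: the benchmark's semantic modes (e.g. Merge)
    are produced with object detectors and MLLMs, not morphology."""
    if mode == "dilate":
        return dilate(dilate(gt))          # over-segmentation
    if mode == "erode":
        return erode(erode(gt))            # under-segmentation
    if mode == "cutout":                   # a chunk of the object removed
        out = gt.copy()
        ys = np.nonzero(gt)[0]
        out[ys.min():(ys.min() + ys.max() + 1) // 2, :] = False
        return out
    if mode == "full_neg":                 # misses the target entirely
        return np.zeros_like(gt)
    raise ValueError(f"unknown mask error mode: {mode}")
```

Annotating each perturbed mask with `iou(perturbed, gt)` yields exactly the kind of (mask, IoU) pairs the benchmark pairs with a recommended action.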
Method: MQ-Auditor
MQ-Auditor is a multimodal large language model (MLLM)-based auditor designed for the MQA-RefAVS task. It processes audio (A) and video (V) via modality-specific encoders (BEATs, CLIP-ViT-L/14) and Q-Former modules to generate latent embeddings. The referring expression (R) is tokenized. A key innovation is generating a masked frame (V'_t) by multiplying V_t with candidate mask M_t, which explicitly highlights the masked regions. The binary mask M_t is also converted to pseudo-RGB. All multimodal embeddings are then fed into an LLaMA-2-7B-Chat backbone, trained via supervised instruction tuning, to output natural-language quality analysis, IoU estimation, and mask type/action predictions.
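The two mask representations described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, assuming uint8 RGB frames; the paper's exact preprocessing (resolution, normalization, encoder-specific transforms) may differ.

```python
import numpy as np

def build_mask_inputs(frame, mask):
    """Given key frame V_t (H, W, 3, uint8) and binary mask M_t (H, W),
    produce the two mask views fed alongside the other modalities:
    the masked frame V'_t = V_t * M_t, and the mask as pseudo-RGB."""
    m = mask.astype(frame.dtype)
    masked_frame = frame * m[..., None]   # background zeroed, masked region kept
    pseudo_rgb = np.repeat((mask.astype(np.uint8) * 255)[..., None], 3, axis=-1)
    return masked_frame, pseudo_rgb
```

Multiplying the frame by the mask makes the candidate region visually explicit, while the pseudo-RGB copy lets the same vision encoder ingest the raw mask geometry.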
Experiments
This section details the experimental setup and results. MQ-Auditor is benchmarked against state-of-the-art MLLMs (Video-LLaMA3-7B, Qwen2.5-Omni-7B, Ming-Flash-Omni, Gemini-3-Flash-Preview) using image-based and video-based evaluation protocols. Metrics include RMSE for IoU estimation and F2-score for mask type and action predictions. MQ-Auditor consistently outperforms competitors, demonstrating superior accuracy and reliability across various mask types and settings (Seen/Unseen). Ablation studies highlight the importance of combining raw mask and masked frame representations and balancing positive/negative samples during training. MQ-Auditor also effectively improves existing Ref-AVS models by identifying and helping correct segmentation failures.
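The two evaluation metrics are standard and easy to state precisely. Below is a small NumPy sketch of RMSE over IoU estimates and a one-vs-rest F2-score (F-beta with beta = 2, which weights recall over precision); the aggregation across classes used in the paper (e.g., macro-averaging) is not specified here and would be an additional choice.

```python
import numpy as np

def rmse(pred_iou, true_iou):
    """Root-mean-square error of IoU estimates."""
    p, t = np.asarray(pred_iou, float), np.asarray(true_iou, float)
    return float(np.sqrt(np.mean((p - t) ** 2)))

def f2_score(pred, true, label):
    """One-vs-rest F-beta with beta=2 for one class label:
    F2 = 5*TP / (5*TP + 4*FN + FP)."""
    pred, true = np.asarray(pred), np.asarray(true)
    tp = np.sum((pred == label) & (true == label))
    fp = np.sum((pred == label) & (true != label))
    fn = np.sum((pred != label) & (true == label))
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0
```

The heavier weight on false negatives fits the auditing setting: failing to flag a bad mask is costlier than flagging a good one.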
Conclusion
The paper concludes by summarizing the contributions: identifying Mask Quality Assessment (MQA) as a crucial, overlooked problem in Ref-AVS, establishing the MQ-RAVSBench dataset, and proposing MQ-Auditor. It reiterates that MQ-Auditor significantly outperforms other MLLMs and can integrate with existing Ref-AVS systems to diagnose errors and facilitate improvements without ground truth. The work encourages future research in mask quality assessment beyond generation, focusing on explicit error diagnosis and automatic correction, while also acknowledging the potential for inherited biases from MLLMs.
MQ-RAVSBench Mask Construction Pipeline
| Feature | Traditional Ref-AVS Evaluation | MQ-Auditor (Proposed) |
|---|---|---|
| Ground Truth Dependency | Requires ground-truth masks for IoU calculation. | Reference-free; assesses quality without ground-truth masks at inference. |
| Output Metrics | Scalar IoU score. | Quantitative IoU estimate, qualitative mask type, and recommended action. |
| Error Diagnosis | No explicit diagnosis of error types. | Identifies specific error types (e.g., cutout, full_neg, merge). |
| Actionable Feedback | Provides no direct guidance for improvement. | Recommends actions like 'accept', 'minor revision', 'reject' for quality control. |
| Integration with SOTA Ref-AVS | Primarily for post-hoc model evaluation. | Seamlessly integrates to detect failures and guide mask refinement. |
Impact on Segmentation Performance: EEMC Baseline
When integrated with the EEMC baseline, MQ-Auditor identifies 'Full_neg' segmentation failures and supplies corrected target-object cues. These cues allow masks to be re-generated with Grounded-SAM2, yielding significant improvements in segmentation performance.
- Jaccard index (J) improved by 40% for Unseen test samples.
- F-score (F) improved by 33.5% for Unseen test samples.
- Demonstrates MQ-Auditor's role as a reflection agent providing actionable guidance.
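The reflection loop described above can be sketched as a simple triage function. This is our illustrative glue code, not the paper's implementation: `regenerate` stands in for a secondary segmenter such as Grounded-SAM2, and `target_cue` is a hypothetical field carrying the auditor's corrected object description.

```python
def triage(audit, frame, referring_expression, regenerate):
    """Route a candidate mask according to the auditor's recommended action.
    `regenerate(frame, cue)` is a stand-in for a secondary segmenter
    (e.g., Grounded-SAM2) prompted with the auditor's corrected cue."""
    action = audit["action"]
    if action == "accept":
        return audit["mask"]
    if action == "minor revision":
        return audit["mask"]  # keep, but flag for lightweight human review
    # 'major revision' / 'reject': re-generate from the diagnostic cue,
    # falling back to the original referring expression if no cue is given
    cue = audit.get("target_cue", referring_expression)
    return regenerate(frame, cue)
```

In the Full_neg case this reproduces the EEMC experiment's pattern: the rejected mask is discarded and the corrected cue drives a fresh segmentation pass.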
Calculate Your Potential AI Audit ROI
Estimate the cost savings and efficiency gains by integrating automated mask quality assessment into your enterprise.
Implementation Roadmap
A phased approach to integrating MQ-Auditor into your enterprise workflows for maximum impact.
Phase 1: Initial Integration & Mask Audit
Integrate MQ-Auditor with your existing Ref-AVS segmentation pipeline. Deploy MQ-Auditor to automatically assess generated masks, identifying error types and estimated IoU without ground truth.
Phase 2: Actionable Feedback Loop
Utilize MQ-Auditor's recommended actions ('accept', 'minor revision', 'major revision', 'reject') to triage segmentation outputs. Implement automated flagging for masks requiring human review or re-generation.
Phase 3: Automated Refinement & Performance Boost
For 'reject' or 'major revision' masks, leverage MQ-Auditor's diagnostic output (e.g., corrected target object cues) to prompt a secondary segmentation model (e.g., Grounded-SAM2) for re-generation, significantly improving overall segmentation accuracy.
Phase 4: Continuous Monitoring & Customization
Continuously monitor MQ-Auditor's performance and adapt its instruction-tuning prompts to your specific domain and failure modes. Explore reinforcement learning to further refine its accuracy and generalization to novel error patterns.
Ready to Enhance Your Segmentation Pipeline?
Book a strategic consultation to discover how MQ-Auditor can transform your enterprise's audio-visual segmentation workflows, ensuring precision and reliability.