Language-Referred Audio-Visual Segmentation
Audit After Segmentation: Reference-Free Mask Quality Assessment
This research introduces a critical yet overlooked problem in language-referred audio-visual segmentation (Ref-AVS): Mask Quality Assessment (MQA). Current methods focus on generating masks but lack tools for reference-free evaluation and actionable quality diagnosis in real-world deployments where ground truth is unavailable. We propose MQA-RefAVS, a novel task to automatically assess mask quality, classify error types, and recommend corrective actions. To support this, we build MQ-RAVSBench, the first benchmark with diverse, representative mask error modes (26,061 instances across six types: Perfect, Cutout, Dilate, Erode, Merge, Full_neg), each annotated with IoU and a recommended action. We also develop MQ-Auditor, a multimodal large language model (MLLM)-based system that reasons over audio, video, language, and mask information to provide quantitative (IoU) and qualitative (error type, action) assessments. Extensive experiments show that MQ-Auditor significantly outperforms strong open-source and commercial MLLMs (e.g., Gemini-3-Flash) in accuracy and reliability. Crucially, MQ-Auditor integrates seamlessly with existing Ref-AVS models to detect segmentation failures and improve performance by guiding mask refinement. This work shifts the paradigm beyond mere mask generation towards robust, interpretable, and self-correcting segmentation systems.
Executive Impact: Key Performance Indicators
MQ-Auditor delivers measurable improvements in mask quality assessment, enabling more reliable and efficient AI-driven segmentation workflows.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction
This section introduces the novel task of Mask Quality Assessment (MQA) in the context of Language-Referred Audio-Visual Segmentation (Ref-AVS). It highlights the limitations of current Ref-AVS research, which primarily focuses on mask generation and IoU evaluation against ground truth, often unavailable in deployment. The proposed MQA-RefAVS task aims to automatically infer mask quality, identify error types, and recommend actions without ground truth. The paper also introduces MQ-RAVSBench, a new benchmark dataset for MQA in Ref-AVS, and MQ-Auditor, an MLLM-based auditor for assessment.
Task: MQA-RefAVS
This section formally defines the Mask Quality Assessment for Language-Referred Audio-Visual Segmentation (MQA-RefAVS) task. Given a video (V) with synchronized audio (A), a referring expression (R) for a target object, a key video frame (Vt), and a candidate binary mask (Mt), an auditor model (Φ) must predict: (1) Intersection over Union (IoU) (s ∈ [0,1]), (2) mask type (m) from predefined categories (perfect, full_neg, cutout, dilate, erode, merge), and (3) recommended action (a) for quality control (accept, minor revision, major revision, reject). The task requires joint multimodal perception and reasoning.
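The task's input/output contract can be captured as a small Python sketch. This is purely illustrative of the definition above (the class and field names are ours, not from the paper's code): the auditor Φ maps the multimodal input to an IoU estimate s ∈ [0,1], a mask type m, and a recommended action a.

```python
from dataclasses import dataclass
from enum import Enum

class MaskType(Enum):
    """The six predefined mask categories m."""
    PERFECT = "perfect"
    FULL_NEG = "full_neg"
    CUTOUT = "cutout"
    DILATE = "dilate"
    ERODE = "erode"
    MERGE = "merge"

class Action(Enum):
    """The four quality-control actions a."""
    ACCEPT = "accept"
    MINOR_REVISION = "minor revision"
    MAJOR_REVISION = "major revision"
    REJECT = "reject"

@dataclass
class AuditResult:
    """Output of the auditor Phi for one (V, A, R, V_t, M_t) input."""
    iou: float           # s, predicted Intersection over Union in [0, 1]
    mask_type: MaskType  # m, one of the six predefined categories
    action: Action       # a, recommended quality-control action

    def __post_init__(self):
        if not 0.0 <= self.iou <= 1.0:
            raise ValueError("IoU estimate must lie in [0, 1]")
```

A downstream pipeline can then branch on `AuditResult.action` without ever touching ground-truth masks.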
Dataset: MQ-RAVSBench
MQ-RAVSBench is presented as the first benchmark for mask quality assessment in Ref-AVS. Built on Ref-AVSBench, it comprises 1,840 videos and 2,046 reference texts, generating 26,061 mask instances. It includes six mask types: Perfect, Cutout, Dilate, Erode, Merge, and Full_neg, covering geometric and semantic errors. Each mask is annotated with IoU and a recommended action. The dataset is split into training (1,306 videos) and testing (534 videos, further divided into Seen/Unseen categories for open-vocabulary evaluation). The masks are generated using automated pipelines involving object detection models and MLLMs.
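To make the geometric error modes concrete, here is a minimal NumPy sketch of how a ground-truth mask could be perturbed into Dilate, Erode, Cutout, and Full_neg variants, with each variant's IoU computed against the original. This is an assumption-laden illustration, not the benchmark's actual construction pipeline, which the paper says relies on object detectors and MLLMs (notably for the semantic modes such as Merge).

```python
import numpy as np

def iou(mask, gt):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return inter / union if union else 1.0

def dilate(m):
    """One step of 4-neighbour binary dilation."""
    out = m.copy()
    out[1:, :] |= m[:-1, :]
    out[:-1, :] |= m[1:, :]
    out[:, 1:] |= m[:, :-1]
    out[:, :-1] |= m[:, 1:]
    return out

def erode(m):
    """One step of 4-neighbour binary erosion (dual of dilation)."""
    return ~dilate(~m)

def perturb(gt, mode):
    """Derive one geometric error mask from a ground-truth mask.
    Illustrative only: the benchmark's semantic modes (e.g. Merge)
    are produced with object detectors and MLLMs, not morphology."""
    if mode == "dilate":
        return dilate(dilate(gt))          # over-segmentation
    if mode == "erode":
        return erode(erode(gt))            # under-segmentation
    if mode == "cutout":                   # a chunk of the object removed
        out = gt.copy()
        ys = np.nonzero(gt)[0]
        out[ys.min():(ys.min() + ys.max() + 1) // 2, :] = False
        return out
    if mode == "full_neg":                 # misses the target entirely
        return np.zeros_like(gt)
    raise ValueError(f"unknown mask error mode: {mode}")
```

Annotating each perturbed mask with `iou(perturbed, gt)` yields exactly the kind of (mask, IoU) pairs the benchmark pairs with a recommended action.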
Method: MQ-Auditor
MQ-Auditor is a multimodal large language model (MLLM)-based auditor designed for the MQA-RefAVS task. It processes audio (A) and video (V) via modality-specific encoders (BEATs, CLIP-ViT-L/14) and Q-Former modules to generate latent embeddings. The referring expression (R) is tokenized. A key innovation is generating a masked frame (V'_t) by multiplying V_t with candidate mask M_t, which explicitly highlights the masked regions. The binary mask M_t is also converted to pseudo-RGB. All multimodal embeddings are then fed into an LLaMA-2-7B-Chat backbone, trained via supervised instruction tuning, to output natural-language quality analysis, IoU estimation, and mask type/action predictions.
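The two mask representations described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, assuming uint8 RGB frames; the paper's exact preprocessing (resolution, normalization, encoder-specific transforms) may differ.

```python
import numpy as np

def build_mask_inputs(frame, mask):
    """Given key frame V_t (H, W, 3, uint8) and binary mask M_t (H, W),
    produce the two mask views fed alongside the other modalities:
    the masked frame V'_t = V_t * M_t, and the mask as pseudo-RGB."""
    m = mask.astype(frame.dtype)
    masked_frame = frame * m[..., None]   # background zeroed, masked region kept
    pseudo_rgb = np.repeat((mask.astype(np.uint8) * 255)[..., None], 3, axis=-1)
    return masked_frame, pseudo_rgb
```

Multiplying the frame by the mask makes the candidate region visually explicit, while the pseudo-RGB copy lets the same vision encoder ingest the raw mask geometry.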
Experiments
This section details the experimental setup and results. MQ-Auditor is benchmarked against state-of-the-art MLLMs (Video-LLaMA3-7B, Qwen2.5-Omni-7B, Ming-Flash-Omni, Gemini-3-Flash-Preview) using image-based and video-based evaluation protocols. Metrics include RMSE for IoU estimation and F2-score for mask type and action predictions. MQ-Auditor consistently outperforms competitors, demonstrating superior accuracy and reliability across various mask types and settings (Seen/Unseen). Ablation studies highlight the importance of combining raw mask and masked frame representations and balancing positive/negative samples during training. MQ-Auditor also effectively improves existing Ref-AVS models by identifying and helping correct segmentation failures.
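The two evaluation metrics are standard and easy to state precisely. Below is a small NumPy sketch of RMSE over IoU estimates and a one-vs-rest F2-score (F-beta with beta = 2, which weights recall over precision); the aggregation across classes used in the paper (e.g., macro-averaging) is not specified here and would be an additional choice.

```python
import numpy as np

def rmse(pred_iou, true_iou):
    """Root-mean-square error of IoU estimates."""
    p, t = np.asarray(pred_iou, float), np.asarray(true_iou, float)
    return float(np.sqrt(np.mean((p - t) ** 2)))

def f2_score(pred, true, label):
    """One-vs-rest F-beta with beta=2 for one class label:
    F2 = 5*TP / (5*TP + 4*FN + FP)."""
    pred, true = np.asarray(pred), np.asarray(true)
    tp = np.sum((pred == label) & (true == label))
    fp = np.sum((pred == label) & (true != label))
    fn = np.sum((pred != label) & (true == label))
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0
```

The heavier weight on false negatives fits the auditing setting: failing to flag a bad mask is costlier than flagging a good one.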
Conclusion
The paper concludes by summarizing the contributions: identifying Mask Quality Assessment (MQA) as a crucial, overlooked problem in Ref-AVS, establishing the MQ-RAVSBench dataset, and proposing MQ-Auditor. It reiterates that MQ-Auditor significantly outperforms other MLLMs and can integrate with existing Ref-AVS systems to diagnose errors and facilitate improvements without ground truth. The work encourages future research in mask quality assessment beyond generation, focusing on explicit error diagnosis and automatic correction, while also acknowledging the potential for inherited biases from MLLMs.
MQ-RAVSBench Mask Construction Pipeline
| Feature | Traditional Ref-AVS Evaluation | MQ-Auditor (Proposed) |
|---|---|---|
| Ground Truth Dependency | Requires ground-truth masks for IoU calculation. | Reference-free; assesses quality without ground-truth masks at inference. |
| Output Metrics | Scalar IoU score. | Quantitative IoU estimate, qualitative mask type, and recommended action. |
| Error Diagnosis | No explicit diagnosis of error types. | Identifies specific error types (e.g., cutout, full_neg, merge). |
| Actionable Feedback | Provides no direct guidance for improvement. | Recommends actions like 'accept', 'minor revision', 'reject' for quality control. |
| Integration with SOTA Ref-AVS | Primarily for post-hoc model evaluation. | Seamlessly integrates to detect failures and guide mask refinement. |
Impact on Segmentation Performance: EEMC Baseline
When integrated with the EEMC baseline, MQ-Auditor identifies 'Full_neg' segmentation failures and supplies corrected target-object cues. These cues allow masks to be re-generated with Grounded-SAM2, yielding significant improvements in segmentation performance.
- Jaccard index (J) improved by 40% for Unseen test samples.
- F-score (F) improved by 33.5% for Unseen test samples.
- Demonstrates MQ-Auditor's role as a reflection agent providing actionable guidance.
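The reflection loop described above can be sketched as a simple triage function. This is our illustrative glue code, not the paper's implementation: `regenerate` stands in for a secondary segmenter such as Grounded-SAM2, and `target_cue` is a hypothetical field carrying the auditor's corrected object description.

```python
def triage(audit, frame, referring_expression, regenerate):
    """Route a candidate mask according to the auditor's recommended action.
    `regenerate(frame, cue)` is a stand-in for a secondary segmenter
    (e.g., Grounded-SAM2) prompted with the auditor's corrected cue."""
    action = audit["action"]
    if action == "accept":
        return audit["mask"]
    if action == "minor revision":
        return audit["mask"]  # keep, but flag for lightweight human review
    # 'major revision' / 'reject': re-generate from the diagnostic cue,
    # falling back to the original referring expression if no cue is given
    cue = audit.get("target_cue", referring_expression)
    return regenerate(frame, cue)
```

In the Full_neg case this reproduces the EEMC experiment's pattern: the rejected mask is discarded and the corrected cue drives a fresh segmentation pass.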
Calculate Your Potential AI Audit ROI
Estimate the cost savings and efficiency gains by integrating automated mask quality assessment into your enterprise.
Implementation Roadmap
A phased approach to integrating MQ-Auditor into your enterprise workflows for maximum impact.
Phase 1: Initial Integration & Mask Audit
Integrate MQ-Auditor with your existing Ref-AVS segmentation pipeline. Deploy MQ-Auditor to automatically assess generated masks, identifying error types and estimated IoU without ground truth.
Phase 2: Actionable Feedback Loop
Utilize MQ-Auditor's recommended actions ('accept', 'minor revision', 'major revision', 'reject') to triage segmentation outputs. Implement automated flagging for masks requiring human review or re-generation.
Phase 3: Automated Refinement & Performance Boost
For 'reject' or 'major revision' masks, leverage MQ-Auditor's diagnostic output (e.g., corrected target object cues) to prompt a secondary segmentation model (e.g., Grounded-SAM2) for re-generation, significantly improving overall segmentation accuracy.
Phase 4: Continuous Monitoring & Customization
Continuously monitor MQ-Auditor's performance and adapt its instruction-tuning prompts to your specific domain and failure modes. Explore reinforcement learning to further refine its accuracy and generalization to novel error patterns.
Ready to Enhance Your Segmentation Pipeline?
Book a strategic consultation to discover how MQ-Auditor can transform your enterprise's audio-visual segmentation workflows, ensuring precision and reliability.