PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
Redefining Realism: The PhyAVBench Advantage
Current text-to-audio-video (T2AV) models frequently fail to produce physically plausible sounds, exhibiting unrealistic audio or audio-visual asynchrony. Existing benchmarks primarily focus on high-level audio-visual alignment and overlook explicit evaluation of audio-physics grounding.
PhyAVBench introduces the first benchmark to systematically evaluate audio-physics grounding in text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio (V2A) models. It features PhyAV-Sound-11K, a new dataset of 11,605 audible videos with controlled physical variations, and proposes the Audio-Physics Sensitivity Test (APST) with the Contrastive Physical Response Score (CPRS) metric for quantitative assessment.
This work opens new avenues for physically grounded audio-visual generation, guiding the development of physics-aware generative AI that is crucial for realistic filmmaking, advertising, and world-modeling applications.
Unlocking Physically Grounded Audio-Visual AI
PhyAVBench's innovations address critical gaps in current T2AV, I2AV, and V2A models, providing a foundation for next-generation AI that understands and generates sound with real-world physical plausibility.
Deep Analysis & Enterprise Applications
The Novel PhyAVBench Framework
PhyAVBench is the first benchmark systematically evaluating audio-physics grounding for T2AV, I2AV, and V2A models. It introduces PhyAV-Sound-11K, a dataset of 11,605 newly recorded videos, and the Audio-Physics Sensitivity Test (APST) using paired text prompts. The core innovation is the Contrastive Physical Response Score (CPRS), which quantifies acoustic consistency between generated videos and real-world counterparts.
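The paper does not publish APST's data format, but its core idea is evaluating models on paired prompts that differ in exactly one controlled physical attribute. A minimal sketch of how such pairs might be organized follows; the field names, prompt texts, and example attributes are illustrative assumptions, not the benchmark's actual items.

```python
from dataclasses import dataclass

@dataclass
class APSTPair:
    """One hypothetical Audio-Physics Sensitivity Test item: two prompts
    differing only in a single controlled physical attribute."""
    attribute: str       # the varied physical property
    prompt_a: str        # baseline condition
    prompt_b: str        # contrastive condition
    expected_shift: str  # direction of the real-world acoustic change

# Illustrative examples of physically contrastive prompt pairs.
pairs = [
    APSTPair("material",
             "A metal spoon taps a glass cup",
             "A metal spoon taps a plastic cup",
             "brighter, longer-ringing timbre for glass"),
    APSTPair("fill level",
             "Water pours into an empty bottle",
             "Water pours into a nearly full bottle",
             "rising resonance pitch as the air column shortens"),
]

for p in pairs:
    # A full evaluation would generate audio-video for both prompts and
    # compare the resulting acoustic shift against the expected trend.
    print(f"[{p.attribute}] {p.prompt_a!r} vs {p.prompt_b!r} -> {p.expected_shift}")
```

Organizing test items this way keeps the physical variation isolated, so any acoustic difference between the two generated clips can be attributed to that one attribute.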
Quantifying Audio-Physics Grounding
The Audio-Physics Sensitivity Test (APST) uses paired prompts to evaluate directional consistency and magnitude alignment of generated audio with real-world, physics-grounded trends. The Contrastive Physical Response Score (CPRS) is a novel metric combining cosine similarity for directional alignment and a Gaussian-normalized projection for magnitude consistency. Human studies confirm CPRS strongly correlates with expert perception, making it a reliable automatic proxy.
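The paper describes CPRS only at this level of detail. The sketch below shows one way such a score could combine a cosine-similarity term for direction with a Gaussian-normalized projection term for magnitude. The equal weighting, the `sigma` width, and the use of audio-feature shift vectors as inputs are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cprs(delta_gen: np.ndarray, delta_ref: np.ndarray, sigma: float = 1.0) -> float:
    """Sketch of a Contrastive Physical Response Score.

    delta_gen: audio-feature shift between the two generated clips of a pair
    delta_ref: feature shift between the real-world reference recordings
    sigma: Gaussian width for magnitude normalization (assumed parameter)
    """
    # Directional alignment: cosine similarity between the two shifts.
    cos_sim = float(np.dot(delta_gen, delta_ref) /
                    (np.linalg.norm(delta_gen) * np.linalg.norm(delta_ref) + 1e-8))
    # Magnitude consistency: project the generated shift onto the reference
    # direction, then score its length against the reference magnitude
    # with a Gaussian kernel.
    ref_norm = float(np.linalg.norm(delta_ref))
    proj = float(np.dot(delta_gen, delta_ref) / (ref_norm + 1e-8))
    mag_score = float(np.exp(-((proj - ref_norm) ** 2) / (2 * sigma ** 2)))
    # Equal weighting of the two terms is an assumption.
    return 0.5 * (cos_sim + mag_score)
```

Under this formulation, a generated shift identical to the reference scores near 1.0, while a shift in the opposite direction is penalized on both terms, which matches the paper's goal of rewarding direction and magnitude jointly.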
State-of-the-Art Model Performance
A comprehensive evaluation of 17 SOTA models (e.g., Sora 2, Veo 3.1) across T2AV, I2AV, and V2A tasks reveals a significant performance gap. Even leading commercial models struggle with fundamental audio-physical phenomena, such as material-dependent timbre or Helmholtz resonance. This highlights a critical gap beyond audio-visual synchronization, emphasizing the need for physics-aware generative modeling.
Key Model Performance Gaps
- Sora 2 (T2AV): Achieves a CPRS of only 0.4512, showing that it struggles with fine-grained physical transitions despite leading on other metrics.
- MMAudio (V2A): Peaks at a CPRS of 0.4003, demonstrating limitations in capturing underlying audio-physical mechanisms in V2A tasks.
- Overall: Current models are predominantly semantic-driven, generating plausible audio types based on categories but failing to reflect nuanced, directional acoustic shifts dictated by varying physical properties or physical laws.
- Human Evaluation: Even SOTA models show a substantial 'reality gap' in PVR-MOS scores, underscoring the challenge of achieving rigorous physical and acoustic consistency.
Your AI Implementation Roadmap
A phased approach to integrate physically grounded audio-visual AI into your enterprise.
Phase 1: Needs Assessment & Data Curation
Identify specific audio-visual generation requirements and assess existing data assets. Initiate specialized data collection for physics-grounded audio, following PhyAVBench's methodology to build a robust foundation.
Phase 2: Physics-Aware Model Development
Develop or fine-tune generative AI models with explicit audio-physics grounding. Integrate mechanisms sensitive to material properties, environmental acoustics, and interaction dynamics, guided by PhyAVBench's test points.
Phase 3: Rigorous Validation & Benchmarking
Implement PhyAVBench for comprehensive model evaluation. Utilize APST and CPRS to quantitatively assess audio-physics sensitivity, ensuring models generate acoustically plausible and consistent content.
Phase 4: Integration & Continuous Optimization
Deploy validated physics-aware AI solutions into production workflows. Establish a feedback loop for continuous monitoring, performance optimization, and adaptation to evolving enterprise needs.
Ready to Build Physically Grounded AI?
Explore how PhyAVBench's insights can transform your enterprise's audio-visual content generation, leading to more realistic and impactful AI applications.