Enterprise AI Analysis
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic recordings for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond extracting key segments from raw laryngeal videos, MLVAS also generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection.
Executive Impact at a Glance
MLVAS provides objective, quantitative metrics that translate directly into improved diagnostic accuracy and operational efficiency for healthcare enterprises.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Vocal Fold Paralysis (VFP) is a condition in which one of the vocal folds fails to move properly, leading to voice changes, difficulty swallowing, and potential breathing problems. Accurate diagnosis of VFP is crucial for appropriate medical or surgical intervention. Clinicians often use laryngeal videostroboscopy, but raw recordings can be time-consuming to analyze. AI methods can assist, but they often rely on a single modality or on pre-processed segments, neglecting crucial multimodal cues and the challenges posed by real-world raw video.
The MLVAS system integrates audio and video modalities. An audio-based Keyword Spotting (KWS) mechanism identifies relevant video segments containing complete phonation cycles. A pre-trained audio encoder (Dasheng) is fine-tuned for vocal fold pathology detection. Visual features are extracted using a two-stage glottis image segmentation: a U-Net model followed by diffusion-based refinement, which helps correct false positives. Novel Left and Right Vocal Fold Dynamics (LVFDyn and RVFDyn) are derived for precise diagnosis of unilateral VFP.
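To make the dynamics-extraction step concrete, here is a minimal sketch of how per-frame left and right fold displacements could be derived from binary glottis masks. The exact formulation in the paper is not reproduced here; this illustration assumes a simplified scheme in which the glottal midline is approximated by a quadratic fit to per-row centroids of the segmented glottis, and each side's dynamic value is the mean distance from that midline to the corresponding glottal edge.

```python
import numpy as np

def fold_dynamics(masks):
    """Illustrative LVFDyn/RVFDyn extraction from binary glottis masks.

    masks: array of shape (T, H, W), one binary glottis mask per frame.
    Returns two length-T series: left and right fold displacement.
    """
    lv, rv = [], []
    for mask in masks:
        rows = np.where(mask.any(axis=1))[0]
        if rows.size < 3:                 # glottis closed or undetected
            lv.append(0.0)
            rv.append(0.0)
            continue
        centers, lefts, rights = [], [], []
        for r in rows:
            cols = np.where(mask[r])[0]
            centers.append(cols.mean())   # per-row glottal centroid
            lefts.append(cols.min())      # left glottal edge
            rights.append(cols.max())     # right glottal edge
        # Quadratic fit of the midline: column position as a function of row
        coef = np.polyfit(rows, centers, deg=2)
        mid = np.polyval(coef, rows)
        # Mean edge-to-midline distance on each side for this frame
        lv.append(float(np.mean(mid - np.array(lefts))))
        rv.append(float(np.mean(np.array(rights) - mid)))
    return np.array(lv), np.array(rv)
```

A healthy phonation cycle would show both series oscillating as the glottis opens and closes, while a paralyzed fold would produce a near-flat series on its side.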
Experiments show that combining audio and video modalities significantly improves VFP detection. The proposed system achieves superior performance, with an F-score of 78.49% and high sensitivity (88.63%). The diffusion refinement module effectively reduces the false alarm rate in glottis segmentation. Statistical significance testing confirms the improvements from enhancement modules and multimodal integration. MLVAS accurately differentiates between left and right VFP using variance analysis of LVFDyn and RVFDyn.
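The "Audio+VFDyn" row in the results reflects a combination of audio features with vocal fold dynamics. As a hedged illustration of one way such fusion could be laid out (the concrete feature layout and classifier are not specified here and are assumptions), the sketch below concatenates a mean-pooled audio embedding with summary statistics of the two dynamics series:

```python
import numpy as np

def fuse_features(audio_emb, lvfdyn, rvfdyn):
    """Illustrative late feature fusion (hypothetical layout).

    audio_emb: (frames, dim) audio embedding, e.g. from an audio encoder.
    lvfdyn, rvfdyn: left/right vocal fold dynamics series.
    Returns a single feature vector for a downstream classifier.
    """
    dyn_stats = np.array([
        lvfdyn.mean(), lvfdyn.var(),   # left fold: average opening, oscillation
        rvfdyn.mean(), rvfdyn.var(),   # right fold: average opening, oscillation
    ])
    return np.concatenate([audio_emb.mean(axis=0), dyn_stats])
```

The fused vector would then feed a back-end classifier; the variance terms are what carry the asymmetry signal exploited for unilateral VFP.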
Enterprise Process Flow
| Feature/Model | ROC-AUC (%) | Accuracy (%) | F-score (%) | Mean of Sens. & Spec. (%) |
|---|---|---|---|---|
| RF (MFCC) | 78.59 | 56.23 | 69.56 | 56.23 |
| Dasheng (Spec.) | 85.47 | 75.18 | 78.49 | 75.18 |
| Audio+VFDyn (QF+DR) | 87.04 | 78.12 | 80.52 | 78.12 |
Clinical Impact: Automated Identification of Right Vocal Fold Paralysis
MLVAS was applied to patient #7530 from the SYSU-A dataset, who was diagnosed with right VFP. The system's VFDyn analysis clearly showed a markedly greater level of oscillation for the Left Vocal Fold Dynamics (LVFDyn) compared to the Right Vocal Fold Dynamics (RVFDyn). This objective measurement aligns perfectly with the clinical diagnosis, demonstrating MLVAS's ability to provide interpretable and accurate insights for unilateral VFP.
Key Achievements:
- Precise differentiation of left vs. right VFP.
- Objective validation of clinical diagnoses.
- Elimination of manual pre-editing for raw video footage.
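The lateralization logic described above can be sketched as a simple variance comparison. Note that the `ratio` threshold below is an assumed hyperparameter for illustration, not a value from the research: a fold that oscillates far less than its counterpart is flagged as the paralyzed side.

```python
import numpy as np

def lateralize_vfp(lvfdyn, rvfdyn, ratio=2.0):
    """Illustrative variance-based lateralization rule (ratio is assumed).

    The paralyzed fold barely oscillates, so a much smaller variance on
    one side indicates paralysis on that side.
    """
    lv, rv = np.var(lvfdyn), np.var(rvfdyn)
    if lv > ratio * rv:
        return "right VFP"   # right fold shows little movement
    if rv > ratio * lv:
        return "left VFP"    # left fold shows little movement
    return "no clear unilateral pattern"
```

For a case like patient #7530, the left series would oscillate strongly while the right stays nearly flat, yielding a "right VFP" call consistent with the clinical diagnosis.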
Calculate Your Potential ROI
Our Multimodal Laryngoscopic Video Analyzing System significantly reduces the manual effort and time required for diagnosing vocal fold paralysis, leading to substantial operational savings and improved diagnostic accuracy. Calculate your potential ROI below.
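As a back-of-envelope illustration of the ROI arithmetic (all inputs below are hypothetical placeholders, not figures from the research), the monthly savings can be valued as clinician time freed by automated segment extraction:

```python
def roi_estimate(videos_per_month, minutes_saved_per_video,
                 clinician_hourly_rate, monthly_system_cost):
    """Illustrative ROI calculation with hypothetical inputs.

    Values clinician time saved by automated video review at the
    clinician's hourly rate, then nets out the system cost.
    """
    hours_saved = videos_per_month * minutes_saved_per_video / 60.0
    monthly_savings = hours_saved * clinician_hourly_rate
    return (monthly_savings - monthly_system_cost) / monthly_system_cost
```

For example, 200 videos a month with 15 minutes saved each at a $120 hourly rate yields $6,000 in monthly savings against a $2,000 system cost.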
Implementation Timeline
Our structured approach ensures a smooth and efficient integration of MLVAS into your existing workflows.
Phase 1: Data Integration & Baseline Setup (2-4 Weeks)
Integrate existing laryngoscopic video and audio data. Configure the MLVAS system for initial data processing and establish baseline performance metrics. This phase includes setting up the Keyword Spotting model and initial glottis segmentation.
Phase 2: Model Fine-tuning & Feature Enhancement (4-8 Weeks)
Fine-tune the pre-trained audio encoder (Dasheng) with your specific clinical data. Implement and optimize the diffusion-based refinement for glottis segmentation and the Quadratic Fitting for Vocal Fold Dynamics extraction. Validate improvements in metric extraction.
Phase 3: Multimodal Integration & Clinical Validation (6-10 Weeks)
Integrate audio and visual features into the multimodal back-end classifier. Conduct extensive clinical validation with your medical team to assess VFP and Unilateral VFP detection performance. Gather feedback for iterative system improvements.
Phase 4: Deployment & Ongoing Monitoring (Ongoing)
Deploy the MLVAS system into clinical practice. Provide training for users. Establish a continuous monitoring framework for performance and accuracy, ensuring the system adapts to new data and maintains high diagnostic reliability.
Ready to Transform Your Laryngeal Diagnosis Workflow?
Connect with our AI specialists to explore how MLVAS can be tailored to your enterprise's unique needs and deliver a significant impact on clinical outcomes and operational efficiency.