Enterprise AI Analysis
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic recordings for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond extracting key segments from raw laryngeal videos, MLVAS also generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection.
Executive Impact at a Glance
MLVAS provides objective, quantitative metrics that translate directly into improved diagnostic accuracy and operational efficiency for healthcare enterprises.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Vocal Fold Paralysis (VFP) is a condition in which one of the vocal folds fails to move properly, leading to voice changes, difficulty swallowing, and potential breathing problems. Accurate diagnosis of VFP is crucial for appropriate medical or surgical intervention. Clinicians often use laryngeal videostroboscopy, but raw recordings can be time-consuming to analyze. AI methods can assist, but they often rely on a single modality or on pre-processed segments, neglecting crucial multimodal cues and the challenges posed by real-world raw video.
The MLVAS system integrates audio and video modalities. An audio-based Keyword Spotting (KWS) mechanism identifies relevant video segments containing complete phonation cycles. A pre-trained audio encoder (Dasheng) is fine-tuned for vocal fold pathology detection. Visual features are extracted using a two-stage glottis image segmentation: a U-Net model followed by diffusion-based refinement, which helps correct false positives. Novel Left and Right Vocal Fold Dynamics (LVFDyn and RVFDyn) are derived for precise diagnosis of unilateral VFP.
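To make the dynamics-extraction step concrete, here is a minimal sketch of how per-frame left and right fold displacements could be derived from binary glottis masks. The exact formulation in the paper is not reproduced here; this illustration assumes a simplified scheme in which the glottal midline is approximated by a quadratic fit to per-row centroids of the segmented glottis, and each side's dynamic value is the mean distance from that midline to the corresponding glottal edge.

```python
import numpy as np

def fold_dynamics(masks):
    """Illustrative LVFDyn/RVFDyn extraction from binary glottis masks.

    masks: array of shape (T, H, W), one binary glottis mask per frame.
    Returns two length-T series: left and right fold displacement.
    """
    lv, rv = [], []
    for mask in masks:
        rows = np.where(mask.any(axis=1))[0]
        if rows.size < 3:                 # glottis closed or undetected
            lv.append(0.0)
            rv.append(0.0)
            continue
        centers, lefts, rights = [], [], []
        for r in rows:
            cols = np.where(mask[r])[0]
            centers.append(cols.mean())   # per-row glottal centroid
            lefts.append(cols.min())      # left glottal edge
            rights.append(cols.max())     # right glottal edge
        # Quadratic fit of the midline: column position as a function of row
        coef = np.polyfit(rows, centers, deg=2)
        mid = np.polyval(coef, rows)
        # Mean edge-to-midline distance on each side for this frame
        lv.append(float(np.mean(mid - np.array(lefts))))
        rv.append(float(np.mean(np.array(rights) - mid)))
    return np.array(lv), np.array(rv)
```

A healthy phonation cycle would show both series oscillating as the glottis opens and closes, while a paralyzed fold would produce a near-flat series on its side.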
Experiments show that combining audio and video modalities significantly improves VFP detection. The proposed system achieves superior performance, with an F-score of 78.49% and high sensitivity (88.63%). The diffusion refinement module effectively reduces the false alarm rate in glottis segmentation. Statistical significance testing confirms the improvements from enhancement modules and multimodal integration. MLVAS accurately differentiates between left and right VFP using variance analysis of LVFDyn and RVFDyn.
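The "Audio+VFDyn" row in the results reflects a combination of audio features with vocal fold dynamics. As a hedged illustration of one way such fusion could be laid out (the concrete feature layout and classifier are not specified here and are assumptions), the sketch below concatenates a mean-pooled audio embedding with summary statistics of the two dynamics series:

```python
import numpy as np

def fuse_features(audio_emb, lvfdyn, rvfdyn):
    """Illustrative late feature fusion (hypothetical layout).

    audio_emb: (frames, dim) audio embedding, e.g. from an audio encoder.
    lvfdyn, rvfdyn: left/right vocal fold dynamics series.
    Returns a single feature vector for a downstream classifier.
    """
    dyn_stats = np.array([
        lvfdyn.mean(), lvfdyn.var(),   # left fold: average opening, oscillation
        rvfdyn.mean(), rvfdyn.var(),   # right fold: average opening, oscillation
    ])
    return np.concatenate([audio_emb.mean(axis=0), dyn_stats])
```

The fused vector would then feed a back-end classifier; the variance terms are what carry the asymmetry signal exploited for unilateral VFP.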
Enterprise Process Flow
| Feature/Model | ROC-AUC (%) | Accuracy (%) | F-score (%) | Mean of Sens. & Spec. (%) |
|---|---|---|---|---|
| RF (MFCC) | 78.59 | 56.23 | 69.56 | 56.23 |
| Dasheng (Spec.) | 85.47 | 75.18 | 78.49 | 75.18 |
| Audio+VFDyn (QF+DR) | 87.04 | 78.12 | 80.52 | 78.12 |
Clinical Impact: Automated Identification of Right Vocal Fold Paralysis
MLVAS was applied to patient #7530 from the SYSU-A dataset, who was diagnosed with right VFP. The system's VFDyn analysis clearly showed a markedly greater level of oscillation for the Left Vocal Fold Dynamics (LVFDyn) compared to the Right Vocal Fold Dynamics (RVFDyn). This objective measurement aligns perfectly with the clinical diagnosis, demonstrating MLVAS's ability to provide interpretable and accurate insights for unilateral VFP.
Key Achievements:
- Precise differentiation of left vs. right VFP.
- Objective validation of clinical diagnoses.
- Elimination of manual pre-editing for raw video footage.
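The lateralization logic described above can be sketched as a simple variance comparison. Note that the `ratio` threshold below is an assumed hyperparameter for illustration, not a value from the research: a fold that oscillates far less than its counterpart is flagged as the paralyzed side.

```python
import numpy as np

def lateralize_vfp(lvfdyn, rvfdyn, ratio=2.0):
    """Illustrative variance-based lateralization rule (ratio is assumed).

    The paralyzed fold barely oscillates, so a much smaller variance on
    one side indicates paralysis on that side.
    """
    lv, rv = np.var(lvfdyn), np.var(rvfdyn)
    if lv > ratio * rv:
        return "right VFP"   # right fold shows little movement
    if rv > ratio * lv:
        return "left VFP"    # left fold shows little movement
    return "no clear unilateral pattern"
```

For a case like patient #7530, the left series would oscillate strongly while the right stays nearly flat, yielding a "right VFP" call consistent with the clinical diagnosis.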
Calculate Your Potential ROI
Our Multimodal Laryngoscopic Video Analyzing System significantly reduces the manual effort and time required for diagnosing vocal fold paralysis, leading to substantial operational savings and improved diagnostic accuracy. Calculate your potential ROI below.
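As a back-of-envelope illustration of the ROI arithmetic (all inputs below are hypothetical placeholders, not figures from the research), the monthly savings can be valued as clinician time freed by automated segment extraction:

```python
def roi_estimate(videos_per_month, minutes_saved_per_video,
                 clinician_hourly_rate, monthly_system_cost):
    """Illustrative ROI calculation with hypothetical inputs.

    Values clinician time saved by automated video review at the
    clinician's hourly rate, then nets out the system cost.
    """
    hours_saved = videos_per_month * minutes_saved_per_video / 60.0
    monthly_savings = hours_saved * clinician_hourly_rate
    return (monthly_savings - monthly_system_cost) / monthly_system_cost
```

For example, 200 videos a month with 15 minutes saved each at a $120 hourly rate yields $6,000 in monthly savings against a $2,000 system cost.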
Implementation Timeline
Our structured approach ensures a smooth and efficient integration of MLVAS into your existing workflows.
Phase 1: Data Integration & Baseline Setup (2-4 Weeks)
Integrate existing laryngoscopic video and audio data. Configure the MLVAS system for initial data processing and establish baseline performance metrics. This phase includes setting up the Keyword Spotting model and initial glottis segmentation.
Phase 2: Model Fine-tuning & Feature Enhancement (4-8 Weeks)
Fine-tune the pre-trained audio encoder (Dasheng) with your specific clinical data. Implement and optimize the diffusion-based refinement for glottis segmentation and the Quadratic Fitting for Vocal Fold Dynamics extraction. Validate improvements in metric extraction.
Phase 3: Multimodal Integration & Clinical Validation (6-10 Weeks)
Integrate audio and visual features into the multimodal back-end classifier. Conduct extensive clinical validation with your medical team to assess VFP and Unilateral VFP detection performance. Gather feedback for iterative system improvements.
Phase 4: Deployment & Ongoing Monitoring (Ongoing)
Deploy the MLVAS system into clinical practice. Provide training for users. Establish a continuous monitoring framework for performance and accuracy, ensuring the system adapts to new data and maintains high diagnostic reliability.
Ready to Transform Your Laryngeal Diagnosis Workflow?
Connect with our AI specialists to explore how MLVAS can be tailored to your enterprise's unique needs and deliver a significant impact on clinical outcomes and operational efficiency.