Enterprise AI Analysis: Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR


This paper introduces Nwāchā Munā, a 5.39-hour Devanagari speech corpus for Nepal Bhasha (Newari), and establishes the first script-preserving ASR benchmark for the language. It demonstrates that proximal cross-lingual transfer from Nepali can match multilingual pretraining (Whisper-Small) in an ultra-low-resource setting: with data augmentation, Character Error Rate (CER) drops from a 52.54% zero-shot baseline to 17.59%, while using fewer parameters than Whisper-Small. This computationally efficient approach is crucial for digitally enabling endangered languages.

Key Performance Indicators

This research highlights significant advancements in ASR for low-resource languages, demonstrating tangible improvements.

5.39 hrs Hours of Newari Speech Data
52.54% Zero-Shot CER Baseline
17.59% Augmented Data CER (NepConformer)
3 hrs Hours Training Time (Conformer)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Data Collection & Augmentation
Model Training & Transfer Learning

The Nwāchā Munā corpus was meticulously curated from digital and print media, including Wikipedia, the OSCAR dataset, regional newspapers, and primary-school textbooks, totaling 5.39 hours of transcribed speech. A dual-modal strategy combined original field recordings with web-sourced audio, all normalized to 16 kHz mono-channel WAV. Data augmentation techniques (speed perturbation, volume randomization, and noise injection) proved critical, reducing CER from 52.54% to 17.59% for NepConformer and matching Whisper-Small's performance with fewer parameters. Pseudo-labeling was attempted but degraded performance because of domain shift in the broadcast data, underscoring the importance of domain alignment.
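The three augmentation techniques above can be sketched in plain Python over a raw waveform (a list of samples at 16 kHz). This is a minimal illustration, not the paper's implementation; function names, the gain range, and the 20 dB default SNR are assumptions for the sketch.

```python
import math
import random

SAMPLE_RATE = 16_000  # the corpus is normalized to 16 kHz mono WAV

def speed_perturb(wave, factor):
    """Resample by linear interpolation; factor 1.25 makes audio ~25% faster."""
    n_out = int(round(len(wave) / factor))
    out = []
    for i in range(n_out):
        pos = i * (len(wave) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append(wave[lo] * (1 - frac) + wave[hi] * frac)
    return out

def volume_randomize(wave, rng, lo=0.5, hi=1.5):
    """Scale the whole utterance by a random gain (range is illustrative)."""
    gain = rng.uniform(lo, hi)
    return [s * gain for s in wave]

def noise_inject(wave, rng, snr_db=20.0):
    """Add Gaussian noise at a target signal-to-noise ratio."""
    sig_p = sum(s * s for s in wave) / len(wave) or 1e-12
    noise_std = math.sqrt(sig_p / 10 ** (snr_db / 10))
    return [s + rng.gauss(0.0, noise_std) for s in wave]

# Example: one second of a 440 Hz tone as a stand-in for a recording.
rng = random.Random(0)
wave = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(SAMPLE_RATE)]
fast = speed_perturb(wave, 1.25)   # 12,800 samples: same content, shorter
noisy = noise_inject(volume_randomize(wave, rng), rng)
```

In practice each training utterance would be emitted several times with different perturbations, multiplying the effective hours of audio without new transcription effort.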

17.59% Final CER achieved with Augmented Data (NepConformer)

Newari ASR System Development Flow

Textual Data Acquisition (5727 sentences)
Audio Data Acquisition (5.39 hrs)
Data Pre-processing (16 kHz mono WAV)
Acoustic Modeling (Conformer/Whisper)
Language Modeling & Decoding (KenLM)
Evaluation (CER)
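The final step of the flow, CER evaluation, is a character-level Levenshtein distance normalized by reference length. A self-contained sketch (the paper does not specify its scoring code; this is the standard definition):

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit_distance(ref, hyp) / len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Works directly on Devanagari strings, since Python iterates code points.
assert cer("नमस्ते", "नमस्ते") == 0.0
```

Because CER operates on code points rather than words, it is the natural metric for a script-preserving Devanagari benchmark, where word segmentation is itself nontrivial.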

The study utilized a comparative experimental framework to evaluate cross-lingual transfer strategies. A zero-shot NepConformer baseline yielded 52.54% CER, highlighting the need for explicit fine-tuning. Supervised fine-tuning of NepConformer reduced CER to 18.72%, comparable to Whisper-Small (18.76%), demonstrating that acoustic and orthographic proximity between Nepali and Newari enables effective encoder reuse. Decoder-only fine-tuning yielded similar results (18.77% CER), indicating that the pre-trained Nepali encoder features are sufficiently general. Shallow fusion with an external KenLM n-gram model further improved lexical regularity.
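Shallow fusion reranks recognizer hypotheses by adding a weighted external language-model score to the acoustic score. A minimal sketch of that scoring step, with a generic callable standing in for the KenLM model (the real KenLM Python bindings expose `kenlm.Model.score`, which returns a base-10 log probability); the weight value and dictionary shapes here are illustrative:

```python
def fused_score(am_logprob: float, lm_logprob: float, lm_weight: float = 0.3) -> float:
    """Shallow fusion: acoustic log-prob plus weighted LM log-prob."""
    return am_logprob + lm_weight * lm_logprob

def rescore(hypotheses, lm, lm_weight: float = 0.3):
    """Rank beam hypotheses by fused score.

    `hypotheses` is a list of {"text": str, "am": float} dicts;
    `lm` maps a text string to its log probability.
    """
    return sorted(hypotheses,
                  key=lambda h: fused_score(h["am"], lm(h["text"]), lm_weight),
                  reverse=True)

# Toy example: the LM overturns a small acoustic preference.
hyps = [{"text": "a", "am": -1.0}, {"text": "b", "am": -1.2}]
table = {"a": -5.0, "b": -1.0}
best = rescore(hyps, lambda t: table[t])[0]
```

The LM weight (often called λ) is typically tuned on a development set; too high a weight lets the n-gram model override acoustically correct but rare words.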

Comparison of ASR Model Strategies (CER %)

Strategy CER (%)
Zero-Shot NepConformer 52.54
NepConformer (Fine-tuned on base data) 18.72
Whisper-Small (Fine-tuned on base data) 18.76
NepConformer + Augmented Data 17.59
Whisper-Small + Augmented Data 17.88

Proximal Transfer Success Story

The study demonstrated that proximal transfer from Nepali to Newari can yield performance comparable to large multilingual models like Whisper-Small. This is attributed to the acoustic and orthographic proximity between the two languages, enabling efficient reuse of pre-trained encoder features. This approach offers a computationally efficient alternative for developing ASR systems for ultra-low-resource South Asian languages.


Your AI Implementation Roadmap

Our structured approach ensures a seamless and successful integration of AI, tailored to your enterprise needs.

Phase 1: Data Curation & Augmentation

Gather and meticulously transcribe speech data from the target endangered language. Implement domain-aligned data augmentation strategies to expand the dataset effectively, avoiding domain-shift issues.

Phase 2: Proximal Model Adaptation

Fine-tune a pre-trained ASR model from a proximal, higher-resource language. Focus on encoder reuse and decoder adaptation to leverage existing acoustic representations while adapting to the target language's linguistic patterns.

Phase 3: Language Model Integration & Refinement

Integrate external n-gram language models via shallow fusion to enhance lexical regularity. Conduct iterative evaluation and error analysis to refine the model for optimal transcription accuracy.

Phase 4: Community Engagement & Deployment

Collaborate with language communities for feedback and ongoing data collection. Deploy the ASR system as an open-source tool, enabling broader digital access and linguistic preservation efforts.

Ready to Transform Your Enterprise with AI?

Our experts are ready to discuss your unique challenges and design a bespoke AI solution that drives measurable results.
