Enterprise AI Analysis
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
Presented by Kai-Tuo Xu, Feng-Long Xie, Xu Tang, Yao Hu from Xiaohongshu Inc.
Unlocking SOTA Performance and Efficiency in Mandarin ASR with Open-Source Models.
We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements for superior performance and optimal efficiency across various applications. FireRedASR comprises two variants:

- **FireRedASR-LLM**: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework that leverages large language model (LLM) capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of 3.05%, surpassing the latest SOTA of 3.33% with an 8.4% relative CER reduction (CERR). It demonstrates superior generalization over industrial-grade baselines, achieving 24%-40% CERR in multi-source Mandarin ASR scenarios such as video, live streaming, and intelligent assistants.
- **FireRedASR-AED**: Designed to balance high performance with computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture. On public Mandarin benchmarks, FireRedASR-AED (1.1B parameters) achieves an average CER of 3.18%, slightly behind FireRedASR-LLM but still ahead of the latest SOTA model, which has over 12B parameters, while offering a more compact size suitable for resource-constrained applications.

Moreover, both models exhibit competitive results on Chinese dialect and English speech benchmarks and excel in singing lyrics recognition. To advance research in speech processing, we release our models and inference code at https://github.com/FireRedTeam/FireRedASR.
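The repository ships inference code alongside the released checkpoints. Below is a hypothetical usage sketch: the class name, checkpoint path, and argument names are assumptions based on common ASR-toolkit conventions, so consult the repository README for the authoritative API.

```python
# Hypothetical usage sketch -- the class name, checkpoint path, and argument
# names below are assumptions; see the repository README for the actual API.
from fireredasr.models.fireredasr import FireRedAsr

# Load the released AED checkpoint ("llm" would select FireRedASR-LLM).
model = FireRedAsr.from_pretrained("aed", "pretrained_models/FireRedASR-AED-L")

# Transcribe a batch of 16 kHz mono WAV files.
results = model.transcribe(
    ["utt1"],                      # utterance IDs
    ["examples/wav/sample.wav"],   # audio paths
    {"use_gpu": 1, "beam_size": 3, "nbest": 1},
)
print(results)
```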
Executive Impact: Next-Gen Mandarin ASR
FireRedASR marks a significant advancement in Mandarin ASR, delivering state-of-the-art accuracy and robust real-world performance, while ensuring open-source accessibility.
Deep Analysis & Enterprise Applications
The modules below unpack the paper's key findings and their enterprise applications.
FireRedASR-AED: The Efficient Encoder-Decoder Foundation
FireRedASR-AED utilizes an Attention-based Encoder-Decoder (AED) architecture with a Conformer-based Encoder and a Transformer-based Decoder, balancing high performance with computational efficiency at up to 1.1B parameters. It serves as an effective speech representation module and suits resource-constrained applications, achieving an average CER of 3.18% on public Mandarin benchmarks and outperforming the latest 12B+-parameter SOTA model.
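To make the AED design concrete, here is a minimal PyTorch sketch. Dimensions are illustrative, positional encodings are omitted, and a vanilla TransformerEncoder stands in for the paper's Conformer blocks; this is not the released implementation.

```python
# Minimal AED ASR sketch: conv subsampling front-end, an encoder (a plain
# Transformer stands in for FireRedASR-AED's Conformer blocks), and an
# autoregressive Transformer decoder. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AedAsr(nn.Module):
    def __init__(self, n_mels=80, d_model=512, vocab=6000):
        super().__init__()
        # Conv front-end: 4x temporal subsampling of the filterbank features.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, tokens):
        # feats: (B, T, n_mels) log-mel features; tokens: (B, U) previous tokens.
        enc = self.encoder(self.subsample(feats.transpose(1, 2)).transpose(1, 2))
        u = tokens.size(1)  # causal mask so each position sees only its past
        causal = torch.triu(torch.full((u, u), float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(tokens), enc, tgt_mask=causal)
        return self.out(dec)  # (B, U, vocab) next-token logits

model = AedAsr()
logits = model(torch.randn(2, 400, 80), torch.randint(0, 6000, (2, 16)))
```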
FireRedASR-LLM: State-of-the-Art through LLM Integration
FireRedASR-LLM adopts an Encoder-Adapter-LLM framework that combines a Conformer-based audio Encoder, a lightweight audio-text alignment Adapter, and a pre-trained text-based LLM (initialized from Qwen2-7B-Instruct). With 8.3B parameters, this architecture is designed for SOTA performance and seamless end-to-end speech interaction. The Adapter projects the encoder output into the LLM's semantic space, using frame splicing to shorten the sequence for computational efficiency.
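A minimal sketch of such an adapter follows. The layer sizes (encoder width 1280, 7B-scale Qwen2 hidden width 3584), the MLP structure, and the splice factor of 2 are illustrative assumptions; the released adapter may differ.

```python
import torch
import torch.nn as nn

class SpliceAdapter(nn.Module):
    """Splices adjacent encoder frames to shorten the sequence, then projects
    the result into the LLM embedding space. Sizes here are assumptions."""
    def __init__(self, enc_dim=1280, llm_dim=3584, splice=2):
        super().__init__()
        self.splice = splice
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * splice, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_out):            # enc_out: (B, T, enc_dim)
        b, t, d = enc_out.shape
        t = t - t % self.splice            # drop frames that don't fill a group
        spliced = enc_out[:, :t].reshape(b, t // self.splice, d * self.splice)
        return self.proj(spliced)          # (B, T/splice, llm_dim)

# The projected frames act as soft tokens that are combined with the
# text-prompt embeddings and fed to the LLM.
adapter = SpliceAdapter()
soft_tokens = adapter(torch.randn(2, 100, 1280))  # -> (2, 50, 3584)
```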
CER (%) on public Mandarin benchmarks; Average-4 is the mean CER over the four test sets.

| Model | #Params | AISHELL-1 | AISHELL-2 | WenetSpeech test_net | WenetSpeech test_meeting | Average-4 |
|---|---|---|---|---|---|---|
| FireRedASR-LLM | 8.3B | 0.76 | 2.15 | 4.60 | 4.67 | 3.05 |
| FireRedASR-AED | 1.1B | 0.55 | 2.52 | 4.88 | 4.76 | 3.18 |
| Seed-ASR | 12B+ | 0.68 | 2.27 | 4.66 | 5.69 | 3.33 |
| Qwen-Audio | 8.4B | 1.30 | 3.10 | 9.50 | 10.87 | 6.19 |
| SenseVoice-L | 1.6B | 2.09 | 3.04 | 6.01 | 6.73 | 4.47 |
| Whisper-Large-v3 | 1.6B | 5.14 | 4.96 | 10.48 | 18.87 | 9.86 |
| Paraformer-Large | 0.2B | 1.68 | 2.85 | 6.74 | 6.97 | 4.56 |
CER (%) on multi-source speech and singing test sets. CERR is the relative CER reduction achieved by FireRedASR-LLM over each model, e.g., for ProviderA-Large on speech: (4.56 − 3.48) / 4.56 = 23.7%.

| Model | Speech CER (%) | Speech CERR | Singing CER (%) | Singing CERR |
|---|---|---|---|---|
| FireRedASR-LLM | 3.48 | 0.0% | 7.05 | 0.0% |
| FireRedASR-AED | 3.74 | 7.0% | 7.51 | 6.1% |
| ProviderA-Large | 4.56 | 23.7% | 14.16 | 50.2% |
| ProviderA-Base | 5.67 | 38.6% | 21.37 | 67.0% |
| Paraformer-Large | 5.80 | 40.0% | 21.19 | 66.7% |
Results on Chinese dialect and English benchmarks: CER (%) on KeSpeech, WER (%) on LibriSpeech.

| Model | KeSpeech | LibriSpeech test-clean | LibriSpeech test-other |
|---|---|---|---|
| FireRedASR-LLM | 3.56 | 1.73 | 3.67 |
| FireRedASR-AED | 4.48 | 1.93 | 4.44 |
| Previous SOTA | 6.70 | 1.82 | 3.50 |
High-Quality and Diverse Training Data
Our models are trained on approximately 70,000 hours of high-quality, manually transcribed Mandarin Chinese speech, complemented by 11,000 hours of English speech. This extensive and diverse corpus, collected from real-world scenarios, provides superior training signals compared to weakly-labeled datasets, enhancing generalization capabilities. The inclusion of singing data significantly improves performance in musical content recognition.
Optimized Training Strategy with Progressive Regularization
We implemented a Progressive Regularization Training strategy to optimize model convergence and performance for FireRedASR-AED. This involves initially training without regularization for rapid convergence, then gradually introducing stronger regularization (dropout, SpecAugment) as overfitting tendencies emerge. This approach, combined with adjusted learning rates for larger models, proved crucial for achieving superior outcomes.
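A minimal sketch of such a schedule appears below. The step boundaries and target strengths are illustrative assumptions, not the values used to train FireRedASR.

```python
# Illustrative progressive-regularization schedule; the step boundaries and
# target strengths are assumptions, not FireRedASR's actual training values.
def regularization_at(step: int) -> dict:
    """Return regularization hyperparameters for the current training step."""
    if step < 50_000:      # phase 1: no regularization for rapid convergence
        return {"dropout": 0.0, "spec_augment": False}
    elif step < 150_000:   # phase 2: mild regularization as overfitting emerges
        return {"dropout": 0.05, "spec_augment": True, "freq_masks": 1, "time_masks": 1}
    else:                  # phase 3: full-strength regularization
        return {"dropout": 0.1, "spec_augment": True, "freq_masks": 2, "time_masks": 2}

# In the training loop, the model and augmentation pipeline are reconfigured
# whenever these values change, e.g.:
#   cfg = regularization_at(global_step)
#   set_dropout(model, cfg["dropout"])   # hypothetical helper
```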
Scaling Law Observations
Empirical studies confirm that performance improves with model size, consistent with scaling laws. For FireRedASR-AED, CER decreases steadily as parameters grow from 140M to 1.1B. FireRedASR-LLM likewise improves as its encoder is scaled (7.3% CERR from the XS to the L configuration), suggesting further gains are available at larger capacities.
Your FireRedASR Implementation Roadmap
A phased approach to integrate FireRedASR into your enterprise, ensuring a smooth transition and measurable impact.
Phase 1: Pilot & Customization
Deploy FireRedASR in a controlled environment, integrate with existing systems, and fine-tune models to your specific vocabulary and acoustic conditions for optimal accuracy.
Phase 2: Scaled Deployment
Roll out FireRedASR across departments or critical applications, providing training and support to end-users. Establish monitoring and feedback loops for continuous improvement.
Phase 3: Advanced Integration & Expansion
Explore deeper integration with LLM-based speech interaction systems, leverage FireRedASR's multilingual capabilities, and expand to new use cases like voice analytics and content moderation.
Ready to Transform Your Speech Interactions?
Unlock the power of industrial-grade Mandarin ASR. Book a free consultation to discuss how FireRedASR can elevate your enterprise's voice applications.