
Enterprise AI Analysis

FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration

Presented by Kai-Tuo Xu, Feng-Long Xie, Xu Tang, and Yao Hu of Xiaohongshu Inc.
Unlocking SOTA Performance and Efficiency in Mandarin ASR with Open-Source Models.

We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements for superior performance and optimal efficiency across various applications. FireRedASR comprises two variants:

FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction, it adopts an Encoder-Adapter-LLM framework that leverages large language model (LLM) capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of 3.05%, surpassing the latest SOTA of 3.33% with an 8.4% relative CER reduction (CERR). It also generalizes better than industrial-grade baselines, achieving 24%-40% CERR in multi-source Mandarin ASR scenarios such as video, live streaming, and intelligent assistants.

FireRedASR-AED: Designed to balance high performance with computational efficiency and to serve as an effective speech representation module in LLM-based speech models, it uses an Attention-based Encoder-Decoder (AED) architecture. On public Mandarin benchmarks, FireRedASR-AED (1.1B parameters) achieves an average CER of 3.18%, slightly behind FireRedASR-LLM but still ahead of the latest SOTA model, which has over 12B parameters. Its more compact size makes it suitable for resource-constrained applications.

Moreover, both models deliver competitive results on Chinese dialect and English speech benchmarks and excel at singing lyrics recognition. To advance research in speech processing, we release our models and inference code at https://github.com/FireRedTeam/FireRedASR.
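Since every result below is reported as CER or CERR, the following minimal sketch shows how the two metrics are computed. These are the standard definitions, not the authors' evaluation code:

```python
# Minimal sketch of the CER / CERR metrics used throughout this analysis.
# Standard definitions, not the authors' evaluation code.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between reference and hypothesis characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def cerr(cer_baseline: float, cer_model: float) -> float:
    """Relative CER reduction of a model against a baseline."""
    return (cer_baseline - cer_model) / cer_baseline

# Headline figure from the abstract: 3.33% (previous SOTA) -> 3.05%
print(f"{cerr(3.33, 3.05):.1%}")  # -> 8.4%
```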

Executive Impact: Next-Gen Mandarin ASR

FireRedASR marks a significant advancement in Mandarin ASR, delivering state-of-the-art accuracy and robust real-world performance, while ensuring open-source accessibility.

3.05% Average CER (FireRedASR-LLM)
8.4% Relative CER Reduction vs. Previous SOTA
40% Max Relative CERR vs. Industrial Baselines
67% Max CERR in Singing Lyrics Recognition

Deep Analysis & Enterprise Applications

The sections below explore the specific findings from the research through an enterprise-focused lens.


FireRedASR-AED: The Efficient Encoder-Decoder Foundation

FireRedASR-AED utilizes an Attention-based Encoder-Decoder (AED) architecture with a Conformer-based Encoder and Transformer-based Decoder. It balances high performance with computational efficiency, comprising up to 1.1B parameters. This variant serves as an effective speech representation module and is suitable for resource-constrained applications, achieving an average CER of 3.18% on public Mandarin benchmarks, outperforming larger SOTA models.
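As a structural illustration, here is a compact PyTorch sketch of an AED-style ASR model: a convolutional subsampling front-end, an encoder, and an autoregressive Transformer decoder. A plain Transformer encoder stands in for FireRedASR's Conformer blocks, and every hyperparameter below is illustrative rather than taken from the paper:

```python
# Structural sketch of an attention-based encoder-decoder (AED) ASR model:
# conv subsampling front-end, encoder, autoregressive Transformer decoder.
# A plain Transformer encoder stands in for FireRedASR's Conformer blocks;
# positional encodings are omitted and all hyperparameters are illustrative.
import torch
import torch.nn as nn

class ASREncoderDecoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 512, vocab: int = 6000):
        super().__init__()
        # Conv front-end: 4x temporal subsampling of the fbank features.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=12,
        )
        self.embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels); tokens: (batch, len) previous outputs
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        memory = self.encoder(x)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        y = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(y)  # per-token logits over the vocabulary

model = ASREncoderDecoder()
logits = model(torch.randn(2, 400, 80), torch.randint(0, 6000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 6000])
```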

FireRedASR-LLM Integration Framework

Speech Input → Conformer Encoder → Audio-Text Adapter (Frame Splicing) → Large Language Model (Qwen2-7B-Instruct)

FireRedASR-LLM: State-of-the-Art through LLM Integration

FireRedASR-LLM adopts an Encoder-Adapter-LLM framework: a Conformer-based audio encoder, a lightweight audio-text alignment adapter, and a pre-trained text LLM (initialized from Qwen2-7B-Instruct). With 8.3B parameters in total, the architecture is designed for SOTA performance and seamless end-to-end speech interaction. The adapter projects the encoder output into the LLM's semantic space, applying frame splicing to shorten the speech sequence for computational efficiency.
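A hedged sketch of the adapter idea follows: splice adjacent encoder frames to shorten the sequence, then project the result into the LLM's embedding space. The splice factor and dimensions are illustrative assumptions (3584 matches Qwen2-7B's hidden size), not values from the paper:

```python
# Hedged sketch of the adapter in an Encoder-Adapter-LLM stack: splice
# adjacent encoder frames to shorten the sequence, then project into the
# LLM embedding space. Splice factor and dimensions are illustrative
# assumptions (3584 matches Qwen2-7B's hidden size), not the paper's values.
import torch
import torch.nn as nn

class FrameSpliceAdapter(nn.Module):
    def __init__(self, enc_dim: int = 512, llm_dim: int = 3584, splice: int = 2):
        super().__init__()
        self.splice = splice
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * splice, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        b, t, d = enc_out.shape
        t = t - t % self.splice                  # drop any ragged tail frames
        # concatenate each group of `splice` consecutive frames
        x = enc_out[:, :t].reshape(b, t // self.splice, d * self.splice)
        return self.proj(x)                      # (b, t // splice, llm_dim)

adapter = FrameSpliceAdapter()
speech_embeds = adapter(torch.randn(1, 100, 512))
print(speech_embeds.shape)  # torch.Size([1, 50, 3584])
# These embeddings would be prepended to the LLM's prompt token embeddings.
```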

Comparison on Public Mandarin ASR Benchmarks (CER%)

| Model | #Params | aishell1 | aishell2 | ws_net | ws_meeting | Average-4 |
|---|---|---|---|---|---|---|
| FireRedASR-LLM | 8.3B | 0.76 | 2.15 | 4.60 | 4.67 | 3.05 |
| FireRedASR-AED | 1.1B | 0.55 | 2.52 | 4.88 | 4.76 | 3.18 |
| Seed-ASR | 12B+ | 0.68 | 2.27 | 4.66 | 5.69 | 3.33 |
| Qwen-Audio | 8.4B | 1.30 | 3.10 | 9.50 | 10.87 | 6.19 |
| SenseVoice-L | 1.6B | 2.09 | 3.04 | 6.01 | 6.73 | 4.47 |
| Whisper-Large-v3 | 1.6B | 5.14 | 4.96 | 10.48 | 18.87 | 9.86 |
| Paraformer-Large | 0.2B | 1.68 | 2.85 | 6.74 | 6.97 | 4.56 |

Comparison on Multi-source Mandarin Speech & Singing Benchmarks

| Model | Speech CER (%) | Speech CERR | Singing CER (%) | Singing CERR |
|---|---|---|---|---|
| FireRedASR-LLM | 3.48 | 0.0% | 7.05 | 0.0% |
| FireRedASR-AED | 3.74 | 7.0% | 7.51 | 6.1% |
| ProviderA-Large | 4.56 | 23.7% | 14.16 | 50.2% |
| ProviderA-Base | 5.67 | 38.6% | 21.37 | 67.0% |
| Paraformer-Large | 5.80 | 40.0% | 21.19 | 66.7% |

CERR denotes the relative CER reduction achieved by FireRedASR-LLM against each listed model.

Comparison on Chinese Dialect and English ASR Benchmarks

| Model | KeSpeech | LibriSpeech test-clean | LibriSpeech test-other |
|---|---|---|---|
| FireRedASR-LLM | 3.56 | 1.73 | 3.67 |
| FireRedASR-AED | 4.48 | 1.93 | 4.44 |
| Previous SOTA | 6.70 | 1.82 | 3.50 |

High-Quality and Diverse Training Data

Our models are trained on approximately 70,000 hours of high-quality, manually transcribed Mandarin Chinese speech, complemented by 11,000 hours of English speech. This extensive and diverse corpus, collected from real-world scenarios, provides superior training signals compared to weakly-labeled datasets, enhancing generalization capabilities. The inclusion of singing data significantly improves performance in musical content recognition.

Optimized Training Strategy with Progressive Regularization

We implemented a Progressive Regularization Training strategy to optimize model convergence and performance for FireRedASR-AED. This involves initially training without regularization for rapid convergence, then gradually introducing stronger regularization (dropout, SpecAugment) as overfitting tendencies emerge. This approach, combined with adjusted learning rates for larger models, proved crucial for achieving superior outcomes.
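As an illustration, here is a minimal sketch of such a staged schedule. The stage boundaries and regularization strengths are invented for the example and are not the paper's actual settings:

```python
# Illustrative sketch of progressive regularization: train with no
# regularization first for rapid convergence, then step up dropout and
# SpecAugment strength as overfitting emerges. Stage boundaries and
# strengths here are assumptions for the example, not the paper's schedule.

STAGES = [
    # (start_epoch, dropout, n_time_masks)
    (0,  0.0, 0),   # stage 1: no regularization, fastest convergence
    (10, 0.1, 2),   # stage 2: mild dropout + light SpecAugment
    (20, 0.3, 4),   # stage 3: stronger regularization
]

def regularization_for_epoch(epoch: int) -> tuple[float, int]:
    """Return the (dropout, SpecAugment time-mask count) active at an epoch."""
    dropout, n_masks = STAGES[0][1], STAGES[0][2]
    for start, d, m in STAGES:
        if epoch >= start:
            dropout, n_masks = d, m
    return dropout, n_masks

for epoch in (0, 12, 25):
    d, m = regularization_for_epoch(epoch)
    print(f"epoch {epoch:2d}: dropout={d}, time_masks={m}")
```

In practice the returned values would be pushed into the model's dropout modules and the SpecAugment data-augmentation step at the start of each epoch.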

Scaling Law Observations

Empirical studies confirm that performance improves with model size, consistent with scaling laws. For FireRedASR-AED, CER decreases steadily as parameters grow from 140M to 1.1B. FireRedASR-LLM shows the same trend when its encoder is scaled (a 7.3% CERR from the XS to the L configuration), suggesting further headroom at larger capacities.
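One way to sanity-check such a trend is to fit a power law CER ≈ a·N^(−b) on a log-log scale. The sketch below uses placeholder data points for illustration only, not measurements from the paper:

```python
# Sketch of checking scaling-law behaviour by fitting CER ~ a * N^(-b)
# on a log-log scale. The data points below are placeholder values for
# illustration only, NOT measurements from the paper.
import numpy as np

params = np.array([0.14e9, 0.4e9, 0.7e9, 1.1e9])  # model sizes (illustrative)
cers   = np.array([4.4, 3.9, 3.5, 3.2])           # matching CERs (illustrative)

slope, intercept = np.polyfit(np.log(params), np.log(cers), 1)
print(f"fitted exponent b = {-slope:.3f}")        # positive b: CER falls with scale
print(f"extrapolated CER at 2B params: {np.exp(intercept) * (2e9) ** slope:.2f}")
```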


Your FireRedASR Implementation Roadmap

A phased approach to integrate FireRedASR into your enterprise, ensuring a smooth transition and measurable impact.

Phase 1: Pilot & Customization

Deploy FireRedASR in a controlled environment, integrate with existing systems, and fine-tune models to your specific vocabulary and acoustic conditions for optimal accuracy.

Phase 2: Scaled Deployment

Roll out FireRedASR across departments or critical applications, providing training and support to end-users. Establish monitoring and feedback loops for continuous improvement.

Phase 3: Advanced Integration & Expansion

Explore deeper integration with LLM-based speech interaction systems, leverage FireRedASR's multilingual capabilities, and expand to new use cases like voice analytics and content moderation.

Ready to Transform Your Speech Interactions?

Unlock the power of industrial-grade Mandarin ASR. Book a free consultation to discuss how FireRedASR can elevate your enterprise's voice applications.
