Enterprise AI Analysis
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
For enterprises, leveraging cutting-edge AI for speech synthesis can revolutionize customer interactions, content creation, and accessibility. Spark-TTS offers a pathway to highly controllable and efficient voice generation.
Executive Impact & Key Metrics
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
BiCodec: Unified Speech Tokenization
Spark-TTS introduces BiCodec, a novel single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This allows for both compact representation and fine-grained acoustic attribute control, simplifying LLM integration for TTS.
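The key property of this decomposition is that the semantic stream scales with utterance length while the global stream stays fixed. A minimal sketch of those output shapes (token rate, codebook size, and global token count are assumed for illustration; the real codec uses learned encoders with vector quantization):

```python
import numpy as np

SEMANTIC_RATE_HZ = 50    # assumed low-bitrate semantic token rate
CODEBOOK_SIZE = 8192     # assumed codebook size
N_GLOBAL_TOKENS = 32     # assumed fixed-length global (speaker) token count

def tokenize(audio_seconds: float, rng: np.random.Generator):
    """Return (semantic_tokens, global_tokens) with BiCodec-style shapes.

    Semantic tokens grow with utterance length (linguistic content);
    global tokens are fixed-length regardless of duration (speaker attributes).
    Random ids stand in for the codec's quantizer outputs.
    """
    n_semantic = int(audio_seconds * SEMANTIC_RATE_HZ)
    semantic = rng.integers(0, CODEBOOK_SIZE, size=n_semantic)
    global_ = rng.integers(0, CODEBOOK_SIZE, size=N_GLOBAL_TOKENS)
    return semantic, global_

rng = np.random.default_rng(0)
sem, glb = tokenize(3.0, rng)
print(len(sem), len(glb))  # 150 32
```

A 3-second and a 30-second utterance would thus differ only in the semantic stream; the speaker representation stays a compact, fixed-size block, which is what makes the single-stream LLM interface tractable.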
Spark-TTS LLM Integration Flow
Enterprise Process Flow
Spark-TTS leverages the Qwen2.5 LLM with a chain-of-thought (CoT) approach to predict speech tokens. It supports both zero-shot voice cloning from reference audio and novel speaker generation via coarse- or fine-grained attribute control (gender, pitch, speed).
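The chain-of-thought ordering can be pictured as: coarse attribute labels first (the "plan"), then optional fine-grained values, then the semantic speech tokens. A hedged sketch of assembling such a control prompt; the tag names below are illustrative, not Spark-TTS's actual token vocabulary:

```python
def build_control_prompt(text, gender=None, pitch_level=None, speed_level=None,
                         pitch_mel=None, speed_sps=None):
    """Assemble a CoT-style control prefix: coarse labels before fine values."""
    parts = [f"<text>{text}</text>"]
    # Coarse-grained attributes come first...
    for name, val in [("gender", gender), ("pitch", pitch_level),
                      ("speed", speed_level)]:
        if val is not None:
            parts.append(f"<{name}:{val}>")
    # ...then optional fine-grained values refine them.
    if pitch_mel is not None:
        parts.append(f"<pitch_value:{pitch_mel}>")
    if speed_sps is not None:
        parts.append(f"<speed_value:{speed_sps}>")
    return "".join(parts)

prompt = build_control_prompt("Hello world", gender="male", pitch_level="low",
                              speed_level="slow", pitch_mel=150, speed_sps=3)
print(prompt)
```

Omitting the fine-grained values leaves the model free to sample them, which is how the same interface supports both coarse and precise control.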
Codec Reconstruction Performance Comparison
| Model | STOI↑ | PESQ NB↑ | PESQ WB↑ | UTMOS↑ | SIM↑ |
|---|---|---|---|---|---|
| BiCodec (Spark-TTS) | 0.92 | 3.13 | 2.51 | 4.18 | 0.80 |
| X-codec2 | 0.92 | 3.04 | 2.43 | 4.13 | 0.82 |
| StableCodec | 0.91 | 2.91 | 2.24 | 4.23 | 0.62 |
| Encodec (low-bitrate) | 0.84 | 1.94 | 1.56 | 1.58 | 0.60 |
Spark-TTS achieves state-of-the-art reconstruction quality in the sub-1 kbps range, outperforming most existing codecs like Encodec and ranking competitively with specialized acoustic codecs like X-codec2 and StableCodec, particularly in STOI, PESQ, and UTMOS scores. This is attributed to its BiCodec architecture.
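A back-of-envelope check shows why a single low-rate semantic stream lands under 1 kbps. Both figures below are assumptions for illustration (a 50 Hz token rate and an 8192-entry codebook), not values confirmed by the comparison above:

```python
import math

token_rate_hz = 50    # assumed semantic token rate (tokens per second)
codebook_size = 8192  # assumed codebook size -> 13 bits per token

bits_per_token = math.log2(codebook_size)
bitrate_bps = token_rate_hz * bits_per_token
print(bitrate_bps)  # 650.0 -> comfortably under 1 kbps
```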
Zero-shot TTS Intelligibility Prowess
Spark-TTS demonstrates significant superiority in intelligibility for zero-shot TTS scenarios. On Chinese test sets, it achieves a Character Error Rate (CER) of 1.20%, ranking second only to closed-source models. For English, it achieves a WER of 1.98%, also highly competitive. This highlights its strong zero-shot voice cloning capability.
VoxBox: A Foundation for Controllable TTS
VoxBox: A 100,000-Hour Open-Source Dataset for Controllable TTS
To facilitate research in controllable TTS, Spark-TTS introduces VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations including gender, pitch, and speaking rate. This open-source resource establishes a new benchmark for reproducible TTS research, enabling both coarse-grained (e.g., 'male, low pitch, slow speed') and fine-grained control (e.g., 'pitch=150 Mel, speed=3 SPS') for voice creation beyond reference audio.
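Coarse and fine-grained control connect through simple bucketing: a precise value such as `pitch=150 Mel` falls into a coarse label such as `moderate`. The bucket boundaries below are made up for the sketch and are not VoxBox's actual annotation thresholds:

```python
def coarse_pitch(pitch_mel: float) -> str:
    """Map a fine-grained pitch value (Mel) to a coarse label (illustrative cutoffs)."""
    if pitch_mel < 130:
        return "low"
    if pitch_mel < 200:
        return "moderate"
    return "high"

def coarse_speed(speed_sps: float) -> str:
    """Map syllables-per-second to a coarse speed label (illustrative cutoffs)."""
    if speed_sps < 3.5:
        return "slow"
    if speed_sps < 5.5:
        return "moderate"
    return "fast"

print(coarse_pitch(150), coarse_speed(3))  # moderate slow
```

Annotating every utterance with both levels is what lets a single model be trained for prompts as loose as "male, low pitch" or as exact as "pitch=150 Mel, speed=3 SPS".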
Advanced ROI Calculator
Estimate the potential return on investment for integrating Spark-TTS into your enterprise operations.
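The arithmetic behind such a calculator is straightforward; all figures below are placeholders to be replaced with your own cost and savings estimates:

```python
def roi_percent(annual_savings: float, annual_cost: float) -> float:
    """Simple ROI: net benefit divided by cost, as a percentage."""
    return (annual_savings - annual_cost) / annual_cost * 100

# Placeholder example: $120k/year saved on voice production
# against $40k/year in integration and compute costs.
print(roi_percent(120_000, 40_000))  # 200.0
```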
Your Implementation Roadmap
A typical rollout plan for integrating Spark-TTS into your enterprise ecosystem.
Phase 1: Foundation Setup & BiCodec Integration
Duration: 2-4 Weeks
Integrate BiCodec into existing LLM infrastructure and establish initial tokenization pipelines. This involves setting up the necessary hardware/software environment and connecting BiCodec for efficient speech encoding/decoding.
Phase 2: LLM Fine-tuning & Attribute Control
Duration: 4-8 Weeks
Fine-tune the Qwen2.5 LLM with the VoxBox dataset. Focus on enabling and refining both coarse-grained (gender, style) and fine-grained (pitch, speed) attribute control mechanisms for diverse voice generation.
Phase 3: Zero-Shot & Voice Creation Rollout
Duration: 3-6 Weeks
Implement and extensively test zero-shot voice cloning capabilities. Roll out attribute-based novel voice creation, ensuring high fidelity and customizability, ready for internal and external applications.
Phase 4: Advanced Customization & Scalability
Duration: Ongoing
Explore further disentanglement of timbre characteristics, optimize the system for real-time performance, and scale the solution for multilingual deployment and integration into various enterprise platforms.
Ready to Transform Your Voice AI?
Connect with our experts to explore how Spark-TTS can drive innovation and efficiency in your enterprise.
