AI RESEARCH PAPER ANALYSIS
Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech
Authors: Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan
Publication Year: 2026
Abstract: Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this setting, because zero-shot TTS can dynamically reconstruct a voice from a reference prompt alone. This paper formalizes the task as Speech Generation Speaker Poisoning (SGSP): modifying a trained model so that it cannot generate specific identities while preserving utility for all other speakers. The study evaluates inference-time filtering and parameter-modification baselines across settings with 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified with AUC and Forget Speaker Similarity (FSSIM). The framework achieves strong privacy for up to 15 speakers but reveals scalability limits at 100 speakers due to increased identity overlap, establishing a novel problem formulation and evaluation framework to drive further advances in generative voice privacy.
Executive Impact: Key Findings for Enterprise AI
This research introduces a critical framework for enhancing privacy in zero-shot TTS, directly impacting enterprise applications requiring secure voice synthesis and identity protection. Below are the core findings and their implications.
Deep Analysis & Enterprise Applications
The Rising Challenge of Voice Cloning Privacy
The rapid advancement of generative AI, particularly in zero-shot Text-to-Speech (TTS), allows for high-fidelity voice cloning from minimal audio prompts. While powerful, this capability presents significant privacy and security risks, including impersonation and the spread of misinformation. This paper directly addresses the critical need to prevent TTS models from replicating specific speaker identities.
The core problem, formalized as Speech Generation Speaker Poisoning (SGSP), involves modifying trained TTS models to disable the synthesis of voices from a defined 'forget set' while maintaining performance for a 'retain set'. Unlike traditional machine unlearning, SGSP tackles the unique challenge of zero-shot generalization, where models can clone unseen voices, requiring more robust identity erasure methods.
Advancing Speaker Erasure: From TGP to EGP+Triplet
The research explores several methods for achieving targeted speaker poisoning:
- Naive Baselines: Initial attempts relied on inference-time filtering strategies such as "Pretrained + Speaker Filtering" and "Pretrained + Ground Truth Filtering." These external filters are vulnerable when model weights are public, since an adversary can simply bypass the filtering pipeline.
- Teacher-Guided Poisoning (TGP): Building on the knowledge distillation paradigm, TGP adapts a teacher model to guide the student model in generating random speaker identities from the retain set when prompted with a forget set speaker.
- Encoder-Guided Poisoning (EGP): EGP refines TGP by directly using the style encoder's output as the fine-tuning target, bypassing teacher-generated outputs. This provides a cleaner optimization signal, leading to superior performance.
- Contrastive Learning (Triplet Loss): To further enhance privacy, a triplet loss objective is incorporated. This loss explicitly penalizes embeddings that remain similar to the forget set, pushing generated outputs away from negative references sampled from the forget set.
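The triplet objective described above can be sketched as follows; this is a minimal illustration using cosine distance on speaker embeddings, and the margin value and distance choice are assumptions for illustration, not the paper's exact hyperparameters.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on speaker embeddings: pull the generated
    (anchor) embedding toward a retain-set target (positive) and push it
    away from a forget-set reference (negative)."""
    d_pos = 1.0 - cosine_sim(anchor, positive)  # distance to retain-set target
    d_neg = 1.0 - cosine_sim(anchor, negative)  # distance to forget-set reference
    return max(0.0, d_pos - d_neg + margin)
```

When the generated embedding already matches its retain-set target and is orthogonal to the forget-set reference, the hinge is inactive and the loss is zero; when the ordering is reversed, the loss grows with the violation plus the margin.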
Robust Evaluation: AUC and FSSIM
To rigorously assess the effectiveness of the proposed methods, a comprehensive evaluation framework was developed, focusing on both utility and privacy.
Utility Metrics:
- Word Error Rate (WER): Measures speech intelligibility.
- UTMOS: An automated proxy for Mean Opinion Score, assessing naturalness.
- Speaker Similarity (SSIM): Cosine similarity between reference and synthesized speech embeddings, measuring identity preservation for the retain set.
Privacy Metrics:
- Area Under the Curve (AUC): Quantifies the separability of similarity distributions between retain and forget sets, providing a distribution-level assessment of prompt-output dissimilarity. An AUC of 1.0 indicates perfect separation.
- Forget Speaker Similarity (FSSIM): A stronger privacy metric that measures the similarity between each generated sample and every speaker in the forget set. It is reported as Avg-FSSIM and Max-FSSIM, with the latter capturing worst-case leakage.
This framework ensures a nuanced understanding of privacy-utility trade-offs, moving beyond raw similarity scores to comprehensive, distribution-aware metrics.
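The distribution-level AUC and the worst-case FSSIM described above can be computed with plain NumPy. This is a minimal sketch under assumed conventions (row-vector speaker embeddings, cosine similarity as the identity measure), not the paper's evaluation code.

```python
import numpy as np

def auc_separability(retain_sims, forget_sims):
    """Probability that a retain-set prompt-output similarity exceeds a
    forget-set one (Mann-Whitney formulation); 1.0 = perfect separation,
    0.5 = indistinguishable distributions."""
    r = np.asarray(retain_sims, dtype=float)[:, None]
    f = np.asarray(forget_sims, dtype=float)[None, :]
    return float(np.mean((r > f) + 0.5 * (r == f)))

def max_fssim(generated, forget_speakers):
    """Worst-case leakage: the highest cosine similarity between any
    generated sample and any forget-set speaker embedding."""
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    s = forget_speakers / np.linalg.norm(forget_speakers, axis=1, keepdims=True)
    return float(np.max(g @ s.T))
```

Avg-FSSIM would replace the final `np.max` with a mean over the same pairwise similarity matrix, which is why the two variants can diverge sharply when only a few samples leak.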
Performance Highlights & Scalability Challenges
The research presents detailed results across settings with 1, 15, and 100 forgotten speakers. Key findings include:
- Effective Single-Speaker Erasure: EGP+Triplet achieves an AUC of 0.95 for a single forgotten speaker, demonstrating strong privacy without significant utility degradation.
- Scalability to 15 Speakers: EGP+Triplet maintains good performance for up to 15 speakers, showing a measurable similarity gap between retain and forget sets.
- The 100-Speaker Challenge: At 100 forgotten speakers, the distinction largely collapses due to increased identity overlap. Max-FSSIM remains high (around 0.91), indicating persistent worst-case leakage and the diminishing effectiveness of triplet loss.
- EGP Outperforms TGP: EGP consistently shows better performance, attributed to a cleaner optimization signal by directly targeting original style encoder representations, avoiding generative noise introduced by a teacher model.
These results highlight the potential of parameter-modifying approaches for targeted speaker erasure but also underscore the significant challenges in scaling these methods to a large number of identities due to latent space crowding.
Enterprise Process Flow: Targeted Speaker Poisoning
Encoder-Guided Poisoning with triplet loss achieved an AUC of 0.95 for one forgotten speaker, indicating excellent separation between the retain- and forget-set similarity distributions and thus strong individual voice privacy.
| Method | # Forgotten Speakers | SSIM (Forget) ↓ | AUC ↑ | Max FSSIM ↓ |
|---|---|---|---|---|
| PT + GTF (baseline) | 1 | 0.63 | 0.90 | N/A |
| TGP+Triplet | 1 | 0.74 | 0.74 | 0.74 |
| EGP+Triplet | 1 | 0.48 | 0.95 | 0.48 |
| EGP+Triplet | 15 | 0.74 | 0.91 | 0.91 |
| EGP+Triplet | 100 | 0.72 | 0.64 | 0.91 |
The Scalability Wall: Challenges with 100 Forgotten Speakers
While the proposed framework demonstrates robust targeted speaker poisoning for up to 15 speakers, scaling to 100 speakers reveals significant limitations. The research observed increased identity overlap between the retain and forget sets, making large-scale SGSP substantially more difficult.
The Max-FSSIM metric, designed to capture worst-case leakage, remained high at 0.91 even with advanced methods like EGP+Triplet for 15 and 100 speakers. This indicates that while average similarity might be reduced, there's still a persistent risk of an individual generated sample closely resembling a forgotten speaker. This challenge is attributed to latent space crowding, where pushing an embedding away from one negative sample inadvertently pushes it toward another in a high-dimensional space.
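The crowding effect just described can be seen in a deliberately tiny 2-D toy example (hypothetical dimensions and values, chosen only for illustration): a repulsion step directly away from one forget-set embedding increases similarity to another forget-set embedding on the opposite side.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D speaker space: two forget-set embeddings flank a generated embedding.
neg1 = np.array([1.0, 0.0])
neg2 = np.array([-1.0, 0.0])
emb = np.array([0.0, 1.0])   # currently orthogonal to both negatives

# One triplet-style repulsion step directly away from neg1 ...
emb_new = emb - 0.3 * neg1

# ... lowers similarity to neg1 but raises similarity to neg2.
print(cos(emb_new, neg1), cos(emb_new, neg2))
```

With 100 forget-set speakers scattered through the embedding space, almost every repulsion direction points toward some other forbidden identity, which is consistent with the high Max-FSSIM the paper reports at that scale.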
For enterprises, this implies that while protecting a small, critical set of identities (e.g., key executives) is achievable, ensuring privacy across a very large employee base remains an open research problem requiring new approaches to disentangle speaker identities in complex latent spaces.
Your AI Implementation Roadmap
A typical enterprise AI journey, from initial strategy to scaled deployment, involves several key phases. Our approach ensures a seamless and impactful integration.
Phase 1: Discovery & Strategy
In-depth analysis of current operations, identification of high-impact AI opportunities, and development of a tailored AI strategy aligned with business goals. This includes data readiness assessment.
Phase 2: Pilot & Proof-of-Concept
Design and implementation of a targeted pilot program or proof-of-concept for a selected use case. Focus on demonstrating tangible ROI and refining the solution based on initial results.
Phase 3: Development & Integration
Full-scale development of AI solutions, robust model training, and seamless integration with existing enterprise systems. Emphasis on scalability, security, and performance.
Phase 4: Deployment & Optimization
Go-live of AI solutions across relevant departments. Continuous monitoring, performance optimization, and iterative improvements based on real-world feedback and evolving business needs.
Phase 5: Scaling & Innovation
Expansion of AI capabilities to new areas of the business, exploring advanced applications, and fostering an internal culture of AI-driven innovation for sustained competitive advantage.
Ready to Transform Your Enterprise with AI?
The future of enterprise operations is intelligent. Discuss how targeted AI solutions can enhance efficiency, security, and innovation within your organization.