Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

08/28/2023
by Hyungchan Yoon, et al.

For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to generalize well to out-of-domain data (i.e., the target speaker's speech). However, approaches to this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for transformers, known as sparse attention, to improve the TTS model's generalization ability. In particular, we prune redundant connections from self-attention layers, removing those whose attention weights fall below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
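
The abstract does not give the exact formulation, so the following is only a minimal NumPy sketch of the general idea of threshold-based attention pruning, not the authors' implementation. The function names (`pruned_attention`, `soft_prune_mask`), the renormalization of surviving weights, and the sigmoid relaxation with a `temperature` parameter are all assumptions made for illustration; the sigmoid is one common way to make a hard threshold differentiable and may differ from the method proposed in the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pruned_attention(q, k, v, threshold):
    """Single-head scaled dot-product attention with hard pruning.

    Attention weights below `threshold` are set to zero. Renormalizing the
    surviving weights so each row sums to 1 is an assumption of this sketch.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    pruned = np.where(weights >= threshold, weights, 0.0)
    row_sums = pruned.sum(axis=-1, keepdims=True)
    # If a whole row was pruned, fall back to the unpruned weights for that row.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    pruned = np.where(row_sums > 0, pruned / safe_sums, weights)
    return pruned @ v, pruned


def soft_prune_mask(weights, threshold, temperature=0.01):
    """Hypothetical differentiable relaxation of the hard mask.

    A sigmoid centered at `threshold` approaches the step function as
    `temperature` shrinks, so gradients can flow to the threshold.
    """
    return 1.0 / (1.0 + np.exp(-(weights - threshold) / temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(3, 8))  # 3 query positions, model dimension 8
    k = rng.normal(size=(4, 8))  # 4 key positions
    v = rng.normal(size=(4, 8))
    out, attn = pruned_attention(q, k, v, threshold=0.2)
    print(out.shape, int((attn == 0).sum()), "connections pruned")
```

In an actual training setup the threshold would be a learnable parameter in an automatic-differentiation framework; the soft mask above only shows why a smooth gate makes that possible, whereas the hard comparison in `pruned_attention` has zero gradient with respect to the threshold.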

research
10/12/2022

Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Several recently proposed text-to-speech (TTS) models achieved to genera...
research
04/01/2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...
research
01/25/2022

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

With recent advancements in voice cloning, the performance of speech syn...
research
03/30/2022

Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Variational auto-encoder(VAE) is an effective neural network architectur...
research
03/02/2023

Improving Transformer-based End-to-End Speaker Diarization by Assigning Auxiliary Losses to Attention Heads

Transformer-based end-to-end neural speaker diarization (EEND) models ut...
research
05/15/2020

ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

We propose a neural network for zero-shot voice conversion (VC) without ...
research
06/05/2022

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

We present a novel way of conditioning a pretrained denoising diffusion ...
