
Sample Efficient Adaptive Text-to-Speech

by Yutian Chen, et al.

We present a meta-learning approach for adaptive text-to-speech (TTS) with limited data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires little data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
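As a rough illustration of strategy (i), the sketch below adapts a multi-speaker model to a new speaker by optimizing only a fresh speaker embedding while every weight of the shared core stays frozen. This is a minimal toy stand-in: the `MultiSpeakerTTS` class, its small convolutional "core", and the `adapt_new_speaker` helper are all hypothetical names invented for this example, not the paper's actual WaveNet architecture or code.

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Toy multi-speaker model: a shared core conditioned on a speaker embedding."""

    def __init__(self, n_speakers, emb_dim=16, channels=8):
        super().__init__()
        # One learned embedding per training speaker (as in the paper's setup).
        self.embeddings = nn.Embedding(n_speakers, emb_dim)
        # Tiny conv stack standing in for the shared conditional WaveNet core.
        self.core = nn.Conv1d(1 + emb_dim, channels, kernel_size=3, padding=1)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, audio, speaker_emb):
        # Broadcast the speaker embedding across time and condition the core on it.
        t = audio.size(-1)
        cond = speaker_emb.unsqueeze(-1).expand(-1, -1, t)
        h = torch.relu(self.core(torch.cat([audio, cond], dim=1)))
        return self.out(h)

def adapt_new_speaker(model, audio, target, steps=50, lr=0.1):
    """Strategy (i): fit only a new speaker embedding; the shared core is frozen."""
    for p in model.parameters():
        p.requires_grad_(False)          # freeze the multi-speaker core
    emb = nn.Parameter(torch.zeros(1, model.embeddings.embedding_dim))
    opt = torch.optim.Adam([emb], lr=lr)  # optimize the embedding alone
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(audio, emb), target)
        loss.backward()
        opt.step()
    return emb
```

Because only a low-dimensional embedding is trained, this variant needs very little adaptation data; strategy (ii) would instead leave all parameters trainable during the same loop, and strategy (iii) would replace the loop with a forward pass through a trained encoder.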



