Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

06/06/2021
by Dongchan Min, et al.

With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of a given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice from a single short (1-3 sec) speech audio, significantly outperforming baselines.
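The core idea of SALN can be sketched in a few lines: normalize the hidden features of the text encoder as in standard layer normalization, then replace the fixed gain and bias with ones predicted from a style vector extracted from reference audio. The sketch below is a minimal, hypothetical NumPy illustration, not the authors' implementation; all names (`W_g`, `W_b`, etc.) and the linear style-to-affine mapping are illustrative assumptions.

```python
import numpy as np

def style_adaptive_layer_norm(h, style, W_g, b_g, W_b, b_b, eps=1e-5):
    """Illustrative SALN sketch (not the paper's code):
    layer-normalize h, then apply a gain and bias that are
    linear functions of the style vector."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    normed = (h - mu) / (sigma + eps)      # standard layer norm, no affine
    gain = style @ W_g + b_g               # style-conditioned gain
    bias = style @ W_b + b_b               # style-conditioned bias
    return gain * normed + bias

# Toy dimensions: 4 phoneme hidden states of width 16, 8-dim style vector
rng = np.random.default_rng(0)
d_style, d_hidden = 8, 16
h = rng.standard_normal((4, d_hidden))     # text-encoder hidden states
style = rng.standard_normal(d_style)       # style vector from reference audio
W_g = rng.standard_normal((d_style, d_hidden)) * 0.1
b_g = np.ones(d_hidden)                    # gain defaults near 1
W_b = rng.standard_normal((d_style, d_hidden)) * 0.1
b_b = np.zeros(d_hidden)                   # bias defaults near 0

out = style_adaptive_layer_norm(h, style, W_g, b_g, W_b, b_b)
```

Because the gain and bias depend on the style vector rather than being learned constants, the same text encoder can be steered toward a new speaker's style at inference time without any fine-tuning, which is what enables one-shot adaptation.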


Related research

02/14/2018 · Neural Voice Cloning with a Few Samples
Voice cloning is a highly desired feature for personalized speech interf...

08/28/2022 · Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Transfer tasks in text-to-speech (TTS) synthesis - where one or more asp...

01/13/2021 · Whispered and Lombard Neural Speech Synthesis
It is desirable for a text-to-speech system to take into account the env...

07/06/2021 · AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
While recent text to speech (TTS) models perform very well in synthesizi...

04/03/2018 · Unsupervised Learning of Sequence Representations by Autoencoders
Traditional machine learning models have problems with handling sequence...

12/14/2020 · Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
The style of the speech varies from person to person and every person ex...

09/07/2022 · AudioLM: a Language Modeling Approach to Audio Generation
We introduce AudioLM, a framework for high-quality audio generation with...
