FastSpeech: Fast, Robust and Controllable Text to Speech

05/22/2019
by   Yi Ren, et al.
0

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of target mel-sprectrogram sequence for parallel mel-sprectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the skipped words and repeated words, and can adjust voice speed smoothly. Most importantly, compared with autoregressive models, our model speeds up the mel-sprectrogram generation by 270x. Therefore, we call our model FastSpeech. We will release the code on Github.

READ FULL TEXT

page 7

page 8

research
08/15/2022

Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0

Neural network-based Text-to-Speech has significantly improved the quali...
research
06/08/2020

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Advanced text to speech (TTS) models such as FastSpeech can synthesize s...
research
10/29/2020

DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech

With the number of smart devices increasing, the demand for on-device te...
research
07/30/2020

Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

This paper proposes a controllable end-to-end text-to-speech (TTS) syste...
research
01/19/2022

MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription

Neural network based end-to-end Text-to-Speech (TTS) has greatly improve...
research
06/15/2021

MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

Recent developments in deep learning have significantly improved the qua...
research
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...

Please sign up or login with your details

Forgot password? Click here to reset