Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

05/22/2020
by   Jaehyeon Kim, et al.

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite their advantages, these parallel TTS models cannot be trained without guidance from autoregressive TTS models, which serve as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. We introduce Monotonic Alignment Search (MAS), an internal alignment search algorithm for training Glow-TTS. By leveraging the properties of flows, MAS searches for the most probable monotonic alignment between text and the latent representation of speech. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive TTS model Tacotron 2 at synthesis with comparable speech quality, requiring only 1.5 seconds to synthesize one minute of speech end to end. We further show that our model can be easily extended to a multi-speaker setting. Our demo page and code are publicly available.
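
The abstract describes MAS only at a high level, so the following is a minimal sketch of how a most-probable monotonic alignment can be found with dynamic programming. It is not the authors' implementation. It assumes that a matrix `log_lik[i][j]` (a hypothetical name) already holds the log-likelihood of latent frame `j` under the prior of text token `i`, and that a valid alignment starts at the first token, ends at the last token, never moves backwards, and never skips a token. The function name `monotonic_alignment_search` is likewise illustrative.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for each latent frame, the index of the text token it aligns to.

    log_lik[i][j] is assumed to be the precomputed log-likelihood of
    latent frame j under text token i (illustrative input format).
    """
    n_tokens = len(log_lik)
    n_frames = len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total log-likelihood of any monotonic path that
    # aligns frames 0..j and ends with frame j assigned to token i.
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        # Token i is unreachable at frame j if i > j (tokens cannot be skipped).
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]  # previous frame used the same token
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # previous token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        # Step back to the previous token when forced (i == j) or when it scores better.
        if i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return alignment


# Toy example: 3 text tokens, 4 latent frames.
toy = [
    [0.0, -0.5, -5.0, -5.0],
    [-5.0, -1.0, 0.0, -5.0],
    [-5.0, -5.0, -5.0, 0.0],
]
print(monotonic_alignment_search(toy))  # [0, 0, 1, 2]
```

The table has one entry per (token, frame) pair, so the search costs time proportional to the number of tokens times the number of frames. In this sketch, ties between staying and advancing are broken in favour of advancing, which is an arbitrary choice.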

research
12/07/2020

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

In this work, we address the Text-to-Speech (TTS) task by proposing a no...
research
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...
research
09/30/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 ...
research
04/28/2022

Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

Recent deep learning Text-to-Speech (TTS) systems have achieved impressi...
research
05/30/2022

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Text-to-Speech (TTS) has recently seen great progress in synthesizing hi...
research
08/23/2021

One TTS Alignment To Rule Them All

Speech-to-text alignment is a critical component of neural text-to-speech...
research
08/30/2021

Neural HMMs are all you need (for high-quality attention-free TTS)

Neural sequence-to-sequence TTS has achieved significantly better output...
