One TTS Alignment To Rule Them All

08/23/2021
by   Rohan Badlani, et al.
0

Speech-to-text alignment is a critical component of neural textto-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive endto-end TTS models rely on durations extracted from external sources. In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework, easily applicable to a variety of neural TTS models. The framework combines forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior. In our experiments, the alignment learning framework improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron 2) and non-autoregressive (FastPitch, FastSpeech 2, RAD-TTS). Specifically, it improves alignment convergence speed of existing attention-based mechanisms, simplifies the training pipeline, and makes the models more robust to errors on long utterances. Most importantly, the framework improves the perceived speech synthesis quality, as judged by human evaluators.

READ FULL TEXT
research
10/23/2019

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

Despite the ability to produce human-level speech for in-domain text, at...
research
05/18/2020

MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

To speed up the inference of neural speech synthesis, non-autoregressive...
research
07/07/2021

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

This paper describes a variational auto-encoder based non-autoregressive...
research
12/07/2020

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

In this work, we address the Text-to-Speech (TTS) task by proposing a no...
research
05/22/2020

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet hav...
research
02/12/2021

VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-spee...
research
11/02/2020

FeatherTTS: Robust and Efficient attention based Neural TTS

Attention based neural TTS is elegant speech synthesis pipeline and has ...

Please sign up or login with your details

Forgot password? Click here to reset