Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

10/23/2019
by Eric Battenberg, et al.

Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
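The summary names the attention families without spelling them out, so below is a minimal NumPy sketch of one decoder step of Graves-style GMM attention, the location-relative family the paper builds on. The parameter layout, the softplus reparameterization, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softplus(x):
    # Smooth positive reparameterization; keeps scales and mean deltas >= 0.
    return np.log1p(np.exp(x))

def gmm_attention_step(params, prev_means, num_inputs):
    """One decoder step of GMM-based location-relative attention (sketch).

    params:      (K, 3) raw network outputs per mixture component,
                 assumed layout: (weight logit, raw scale, raw mean delta).
    prev_means:  (K,) mixture means carried over from the previous step.
    num_inputs:  length of the encoder output sequence.
    Returns (alignment over encoder steps, updated means).
    """
    w_logit, scale_raw, delta_raw = params[:, 0], params[:, 1], params[:, 2]

    # Normalized mixture weights via a softmax over the K components.
    weights = np.exp(w_logit - w_logit.max())
    weights /= weights.sum()

    sigma = softplus(scale_raw) + 1e-5        # positive standard deviations
    means = prev_means + softplus(delta_raw)  # means can only move forward

    positions = np.arange(num_inputs)[:, None]  # (N, 1)
    # Location-relative: the alignment depends only on distance to the
    # mixture means; there is no content-based query/key comparison.
    gauss = np.exp(-0.5 * ((positions - means) / sigma) ** 2)
    gauss /= sigma * np.sqrt(2.0 * np.pi)
    alignment = (gauss * weights).sum(axis=1)   # (N,)
    return alignment, means
```

Because the mean deltas are constrained to be non-negative, the attention window can only advance along the text, which is the property that lets this family keep tracking position on utterances far longer than anything seen in training. DCA pursues the same goal within the additive energy-based family, roughly by filtering the previous alignment with learned static and query-dependent dynamic convolutions rather than placing an explicit Gaussian window.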


Related research

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention (01/25/2022)
With recent advancements in voice cloning, the performance of speech syn...

One TTS Alignment To Rule Them All (08/23/2021)
Speech-to-text alignment is a critical component of neural text-to-speech...

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (08/30/2019)
End-to-end text-to-speech (TTS) synthesis is a method that directly conv...

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (10/23/2020)
Attention-based end-to-end text-to-speech synthesis (TTS) is superior to...

Attention-Based Models for Speech Recognition (06/24/2015)
Recurrent sequence generators conditioned on input data through an atten...

Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time LPCNet (01/30/2021)
In this work, a robust and efficient text-to-speech system, named Triple...

Location Attention for Extrapolation to Longer Sequences (11/10/2019)
Neural networks are surprisingly good at interpolating and perform remar...
