VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

by   Peng Liu, et al.

This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with Residual Attention mechanism, which refines the textual-to-acoustic alignment layer-wisely. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for residual attention module. By leveraging the coarse global alignment from previous attention layer as an extra input, the following attention layer can produce a refined version of alignment. This amortizes the burden of learning the textual-to-acoustic alignment among multiple attention layers and outperforms the use of only a single attention layer in robustness. An utterance-level speaking speed factor is computed by a jointly-trained speaking speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine number of acoustic frames at inference. Experimental results show that VARA-TTS achieves slightly inferior speech quality to an AR counterpart Tacotron 2 but an order-of-magnitude speed-up at inference; and outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.


VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

This paper describes a variational auto-encoder based non-autoregressive...

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

This paper proposes a hierarchical and multi-scale variational autoencod...

Parallel Tacotron: Non-Autoregressive and Controllable TTS

Although neural end-to-end text-to-speech models can synthesize highly n...

Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis

This paper proposes a forward attention method for the sequenceto- seque...

One TTS Alignment To Rule Them All

Speech-to-text alignment is a critical component of neural textto-speech...

MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

To speed up the inference of neural speech synthesis, non-autoregressive...

FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

This paper proposes a non-autoregressive extension of our previously pro...

Please sign up or login with your details

Forgot password? Click here to reset