End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

10/19/2020
by Yusuke Yasuda, et al.

Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable into TTS and enables joint optimization of all modules from scratch. We formulate our method based on conditional VQ-VAE to handle discrete duration in a variational autoencoder and provide a theoretical explanation to justify our method. In our framework, a connectionist temporal classification (CTC)-based forced aligner acts as the approximate posterior, and text-to-duration works as the prior in the variational autoencoder. We evaluated our proposed method with a listening test and compared it with other TTS methods based on soft-attention or explicit duration modeling. The results showed that our systems were rated between soft-attention-based methods (Transformer-TTS, Tacotron2) and explicit duration modeling-based methods (FastSpeech).
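
The roles named in the abstract (forced aligner as approximate posterior, text-to-duration as prior) can be illustrated with the generic evidence lower bound of a conditional variational autoencoder with a latent duration. This is only a sketch under assumed notation, not the objective from the paper: x (input text), y (acoustic features), d (discrete duration sequence), and the parameter names \theta and \phi are all introduced here for illustration, and the paper's VQ-VAE-based formulation may differ in form.

```latex
% Generic conditional-VAE bound with a latent duration d (assumed notation):
%   q_\phi(d | x, y)   : approximate posterior (the role played by the forced aligner)
%   p_\theta(d | x)    : prior (the role played by the text-to-duration module)
%   p_\theta(y | x, d) : decoder generating acoustic features given text and duration
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(d \mid x, y)}\bigl[\log p_\theta(y \mid x, d)\bigr]
  \;-\;
  D_{\mathrm{KL}}\bigl(q_\phi(d \mid x, y)\,\|\,p_\theta(d \mid x)\bigr)
```

The first term rewards reconstructing the speech features from the text and the aligned durations. The second term pulls the text-only duration prior toward the durations inferred with access to the speech, which is what allows the prior to stand in for the aligner at synthesis time.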


research
12/16/2022

Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

Text-to-speech synthesis (TTS) is a task to convert texts into speech. T...
research
05/09/2022

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Text to speech (TTS) has made rapid progress in both academia and indust...
research
07/07/2021

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

This paper describes a variational auto-encoder based non-autoregressive...
research
08/30/2019

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

End-to-end text-to-speech (TTS) synthesis is a method that directly conv...
research
03/14/2022

Modeling Tie Duration in ERGM-Based Dynamic Network Models

Krivitsky and Handcock (2014) proposed a Separable Temporal ERGM (STERGM...
research
06/24/2022

End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

The recent text-to-speech (TTS) has achieved quality comparable to that ...
