Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

06/08/2019
by Eric Battenberg, et al.

Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model, we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, the proposed model is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior.
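
To make the idea of capacity as a bound on mutual information concrete, here is a minimal sketch rather than the paper's implementation. It relies on a standard fact: the KL divergence between the variational posterior and the prior, averaged over the data, is an upper bound on the mutual information between the data and the latent. The abstract does not say what form the posterior takes or how the constraint is enforced, so the diagonal Gaussian posterior, the standard normal prior, and the `capacity_penalty` helper (with its `capacity_limit` and `weight` parameters) are all assumptions made for illustration.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in nats, for one example.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mean, log_var)
    )


def capacity_estimate(means, log_vars):
    """Average the per-example KL over a batch.

    Under the assumptions above, this average upper-bounds the mutual
    information between the input and the latent embedding.
    """
    kls = [
        gaussian_kl_to_standard_normal(m, lv)
        for m, lv in zip(means, log_vars)
    ]
    return sum(kls) / len(kls)


def capacity_penalty(kl_avg, capacity_limit, weight=1.0):
    """Hypothetical penalty pulling the average KL toward a target capacity.

    This is an illustrative stand-in, not the constraint mechanism
    described in the paper.
    """
    return weight * abs(kl_avg - capacity_limit)


if __name__ == "__main__":
    # Two 2-dimensional posteriors: one equal to the prior, one shifted.
    means = [[0.0, 0.0], [1.0, 0.0]]
    log_vars = [[0.0, 0.0], [0.0, 0.0]]
    kl_avg = capacity_estimate(means, log_vars)
    print("average KL (nats):", kl_avg)
    print("penalty at a limit of 1 nat:", capacity_penalty(kl_avg, 1.0))
```

In a training loop, a term like this would be added to the reconstruction loss. A larger `capacity_limit` lets the embedding carry more information about the reference utterance, and a smaller one pushes the model toward ignoring it.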

research
07/13/2022

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Expressive text-to-speech has shown improved performance in recent years...
research
11/02/2022

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Disentanglement of a speaker's timbre and style is very important for st...
research
12/11/2018

Learning latent representations for style control and transfer in end-to-end speech synthesis

In this paper, we introduce the Variational Autoencoder (VAE) to an end-...
research
08/04/2018

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

Global Style Tokens (GSTs) are a recently-proposed method to learn laten...
research
11/06/2018

Robust and fine-grained prosody control of end-to-end speech synthesis

We propose prosody embeddings for emotional and expressive speech synthe...
research
08/30/2019

Implicit Deep Latent Variable Models for Text Generation

Deep latent variable models (LVM) such as variational auto-encoder (VAE)...
research
06/22/2018

A Variational Prosody Model for the decomposition and synthesis of speech prosody

The quest for comprehensive generative models of intonation that link li...
