Learning Robust Latent Representations for Controllable Speech Synthesis

05/10/2021
by Shakti Kumar, et al.

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, which limits control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE), in which we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.
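
To make the idea of "different latent variables encoding the same features" concrete, the following is a minimal sketch of one way to measure redundancy between latent dimensions. It is not the estimator or objective used by RTI-VAE, which the abstract does not specify. It assumes each pair of latent dimensions is jointly Gaussian, in which case their mutual information is -0.5 * ln(1 - rho^2), where rho is the Pearson correlation. The function names (`pearson`, `gaussian_mi`, `pairwise_mi_penalty`) are hypothetical and chosen for illustration only.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def gaussian_mi(x, y):
    """Mutual information (in nats) under a bivariate Gaussian assumption."""
    rho_sq = pearson(x, y) ** 2
    # Clip so perfectly correlated dimensions give a large finite value.
    rho_sq = min(rho_sq, 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def pairwise_mi_penalty(latents):
    """Sum of Gaussian MI over all pairs of latent dimensions.

    `latents` is a list of samples, each a list of latent values
    (assumption: every dimension varies across the samples).
    """
    dims = list(zip(*latents))  # transpose to per-dimension sequences
    return sum(
        gaussian_mi(dims[i], dims[j])
        for i, j in combinations(range(len(dims)), 2)
    )
```

A penalty near zero suggests the dimensions are linearly unrelated across the batch, while a large value flags dimensions that carry overlapping information. Because the Gaussian assumption only captures linear dependence, this is a rough diagnostic rather than a full mutual-information estimate.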

research
06/25/2021

InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents

We introduce a simple and effective method for learning VAEs with contro...
research
04/02/2020

Guided Variational Autoencoder for Disentanglement Learning

We propose an algorithm, guided variational autoencoder (Guided-VAE), th...
research
12/23/2019

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

In this paper, we are interested in unsupervised speech enhancement usin...
research
03/25/2023

Beta-VAE has 2 Behaviors: PCA or ICA?

Beta-VAE is a very classical model for disentangled representation learn...
research
04/07/2022

Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

In this paper, we propose a novel prosody disentangle method for prosodi...
research
05/16/2020

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful repres...
research
03/27/2019

Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis

The field of Text-to-Speech has experienced huge improvements last years...
