Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

03/09/2020
by Ting-yao Hu, et al.

We present a method to generate speech from input text and a style vector extracted from a reference speech signal in an unsupervised manner, i.e., without any style annotation such as speaker information. During training, existing unsupervised methods generate speech by computing the style vector from the corresponding ground-truth sample and using a decoder to combine it with the input text. Training the model this way leaks content information into the style vector: the decoder can exploit the leaked content and ignore part of the input text while still minimizing the reconstruction loss. At inference time, when the reference speech does not match the input text, the synthesized output may therefore miss some of the input content. We refer to this problem as "content leakage", which we address by explicitly estimating and minimizing the mutual information between the style and the content through an adversarial training formulation. We call our method MIST - Mutual Information based Style Content Separation. The main goal of the method is to preserve the input content in the synthesized speech signal, which we measure by the word error rate (WER), and we show substantial improvements over state-of-the-art unsupervised speech synthesis methods.
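
The abstract describes estimating and minimizing the mutual information between the style vector and the content representation through an adversarial formulation. Below is a minimal PyTorch sketch of one such scheme, using a MINE-style statistics network as the MI estimator; the names (StatisticsNet, mi_lower_bound), the dimensions, and the stand-in tensors are illustrative assumptions, not the authors' implementation.

```python
# Sketch: adversarial MI minimization between a style vector s and a
# content representation c. NOT the paper's code; all names are assumed.
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """T(s, c): scores joint vs. shuffled (marginal) style/content pairs."""
    def __init__(self, style_dim, content_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + content_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, c):
        return self.net(torch.cat([s, c], dim=-1))

def mi_lower_bound(T, s, c):
    """Donsker-Varadhan bound: E[T(s,c)] - log E[exp(T(s,c'))]."""
    joint = T(s, c).mean()
    c_shuffled = c[torch.randperm(c.size(0))]  # break the (s, c) pairing
    marginal = torch.logsumexp(T(s, c_shuffled), dim=0) \
        - torch.log(torch.tensor(float(c.size(0))))
    return joint - marginal

# Adversarial loop: the estimator ascends the bound (tighter MI estimate);
# the synthesis model descends it, so the style vector carries less content.
style_dim, content_dim, batch = 64, 128, 32
T = StatisticsNet(style_dim, content_dim)
opt_T = torch.optim.Adam(T.parameters(), lr=1e-4)

# Stand-ins for the style-encoder / text-encoder outputs of a TTS model.
s = torch.randn(batch, style_dim, requires_grad=True)
c = torch.randn(batch, content_dim)
opt_model = torch.optim.Adam([s], lr=1e-4)

for step in range(5):
    # 1) Update the MI estimator to maximize the lower bound.
    loss_T = -mi_lower_bound(T, s.detach(), c)
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    # 2) Update the model to minimize the estimated MI
    #    (in practice, added to the usual reconstruction loss).
    loss_mi = mi_lower_bound(T, s, c)
    opt_model.zero_grad(); loss_mi.backward(); opt_model.step()
```

In a full system, s would come from the style encoder and c from the text/content encoder, and the MI term would be one component of the total training objective alongside reconstruction.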

