Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis

08/04/2021
by Xudong Dai, et al.

Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Researchers have attempted to resolve this problem by adding an automatic speech recognition (ASR) loss or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be forced to focus on style information rather than on textual information contained in the reference speech by a well-designed downsample-upsample filter, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication. Furthermore, we use instance normalization in convolution layers to help the system learn a better latent style space. Objective metrics, such as a significantly lower word error rate (WER), demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the model retains its prosody transferability compared with baseline models such as the original GST-Tacotron and ASR-guided Tacotron.
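
The downsample-upsample filter described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `information_sieve`, the `(frames, dim)` array layout, the choice to filter along the time axis, and the interval value are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def information_sieve(style_seq: np.ndarray, interval: int) -> np.ndarray:
    """Hypothetical sketch: keep every `interval`-th frame of a
    (frames, dim) style-embedding sequence, then restore the original
    length by duplicating each kept frame."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    num_frames = style_seq.shape[0]
    kept = style_seq[::interval]                   # downsample along time
    restored = np.repeat(kept, interval, axis=0)   # upsample by duplication
    return restored[:num_frames]                   # trim any overshoot


# Toy input: 7 frames of 2-dimensional embeddings (values are made up).
frames = np.arange(14, dtype=np.float32).reshape(7, 2)
sieved = information_sieve(frames, interval=3)
print(sieved[:, 0])  # frame 0 repeated 3x, frame 3 repeated 3x, then frame 6
```

In this sketch, the fine-grained frame-to-frame variation between kept frames is discarded, which is one plausible reading of how such a filter could limit how much textual detail survives in the style embeddings.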

research
03/09/2020

Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

We present a method to generate speech from input text and a style vecto...
research
11/08/2020

Fine-grained style modelling and transfer in text-to-speech synthesis via content-style disentanglement

This paper presents a novel neural model for fine-grained style modeling...
research
10/09/2021

Using multiple reference audios and style embedding constraints for speech synthesis

The end-to-end speech synthesis model can directly take an utterance as ...
research
11/01/2017

Uncovering Latent Style Factors for Expressive Speech Synthesis

Prosodic modeling is a core problem in speech synthesis. The key challen...
research
09/28/2018

Open-Ended Content-Style Recombination Via Leakage Filtering

We consider visual domains in which a class label specifies the content ...
research
10/12/2021

Fine-grained style control in Transformer-based Text-to-speech Synthesis

In this paper, we present a novel architecture to realize fine-grained s...
research
08/26/2018

Analyzing Learned Representations of a Deep ASR Performance Prediction Model

This paper addresses a relatively new task: prediction of ASR performanc...
