Hierarchical Generative Modeling for Controllable Speech Synthesis

10/16/2018
by Wei-Ning Hsu et al.

This paper proposes a neural end-to-end text-to-speech (TTS) model that can control latent attributes of the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g., clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g., noise level, speaking rate) and enables disentangled, fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates the model's ability to control the aforementioned attributes. In particular, it can consistently synthesize high-quality clean speech regardless of the quality of the training data for the target speaker.
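
The two-level latent structure described above can be illustrated with a minimal NumPy sketch. It covers only the latent prior, not the TTS model itself: a categorical variable y picks an attribute group, a diagonal-covariance Gaussian conditioned on y produces the latent vector z, and marginalizing over y gives a GMM density. The function names, the diagonal covariance, and all parameter values are assumptions made for illustration; in the paper these quantities would be learned, and nothing here is taken from the authors' implementation.

```python
import numpy as np


def sample_hierarchical_latent(weights, means, stds, n_samples, rng):
    """Draw (y, z) pairs from a two-level prior (hypothetical helper).

    weights: (K,) mixture probabilities of the categorical variable y.
    means, stds: (K, D) per-component mean and standard deviation
                 (diagonal covariance is an assumption of this sketch).
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)

    # First level: categorical attribute group, e.g. clean vs. noisy.
    y = rng.choice(len(weights), size=n_samples, p=weights)

    # Second level: Gaussian attribute configuration conditioned on y.
    eps = rng.standard_normal((n_samples, means.shape[1]))
    z = means[y] + stds[y] * eps
    return y, z


def gmm_log_density(z, weights, means, stds):
    """log p(z) = log sum_k pi_k N(z; mu_k, diag(sigma_k^2))."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)

    # Per-component Gaussian log densities, shape (N, K).
    diff = (z[:, None, :] - means[None]) / stds[None]
    log_norm = -0.5 * np.sum(diff**2 + np.log(2 * np.pi * stds[None] ** 2), axis=-1)

    # Add log mixture weights and reduce with a stable log-sum-exp.
    log_joint = np.log(weights)[None] + log_norm
    m = log_joint.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True)))[:, 0]
```

For example, calling `sample_hierarchical_latent([0.5, 0.5], [[-2.0, 0.0], [2.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], 4, np.random.default_rng(0))` returns four component indices and four 2-dimensional latent vectors. Fixing y and moving z within that component is the kind of fine-grained control the abstract attributes to the second level.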

research
09/17/2020

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

This paper proposes a hierarchical generative model with a multi-grained...
research
04/11/2022

Fine-grained Noise Control for Multispeaker Speech Synthesis

A text-to-speech (TTS) model typically factorizes speech attributes such...
research
10/03/2019

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

We present a novel generative model that combines state-of-the-art neura...
research
02/06/2020

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

This paper proposes a hierarchical, fine-grained and interpretable laten...
research
08/17/2020

Deep Variational Generative Models for Audio-visual Speech Separation

In this paper, we are interested in audio-visual speech separation given...
research
10/18/2022

Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

In this paper, we propose a method for intermediating multiple speakers'...
research
07/30/2020

Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

This paper proposes a controllable end-to-end text-to-speech (TTS) syste...
