Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

02/06/2020
by Guangzhi Sun, et al.

This paper proposes a hierarchical, fine-grained, and interpretable latent variable model for prosody, built on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer-level representations on coarser-level ones. In addition, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Reconstruction results show that the new structure does not degrade model performance while allowing better interpretability. The paper interprets the learned prosody attributes and compares word-level and phone-level prosody representations. Both qualitative and quantitative evaluations demonstrate improved disentanglement of the latent dimensions.
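
The two kinds of conditioning described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the actual model is built on Tacotron 2 with learned networks, whereas the code below uses small random linear maps as hypothetical stand-ins. The names `sample_autoregressive_latent`, `make_weights`, and all sizes are assumptions chosen for illustration. It shows only the sampling pattern: each latent dimension is drawn conditioned on the dimensions already sampled, and the phone-level latent is additionally conditioned on the word-level latent.

```python
import math
import random


def sample_autoregressive_latent(context, weights, rng):
    """Sample latent dimensions one at a time.

    Dimension d is drawn from a Gaussian whose mean and log-variance are
    linear functions of the context vector and of the dimensions sampled
    before it (a toy stand-in for a learned conditional prior).
    """
    z = []
    for w_mu, w_logvar in weights:
        # Features: the context plus every latent dimension sampled so far.
        features = list(context) + z
        mu = sum(f * w for f, w in zip(features, w_mu))
        logvar = sum(f * w for f, w in zip(features, w_logvar))
        z.append(mu + math.exp(0.5 * logvar) * rng.gauss(0.0, 1.0))
    return z


def make_weights(context_size, num_dims, rng):
    """Hypothetical random linear maps; a real model would learn these."""
    return [
        (
            [rng.gauss(0.0, 0.1) for _ in range(context_size + d)],
            [rng.gauss(0.0, 0.1) for _ in range(context_size + d)],
        )
        for d in range(num_dims)
    ]


rng = random.Random(0)

# Coarse level: a word-level latent conditioned on a word-level context
# (a random vector here, standing in for encoder output).
word_context = [rng.gauss(0.0, 1.0) for _ in range(4)]
word_z = sample_autoregressive_latent(
    word_context, make_weights(4, 3, rng), rng
)

# Fine level: the phone-level latent sees its own context AND the
# word-level latent, so the finer representation depends on the coarser one.
phone_context = [rng.gauss(0.0, 1.0) for _ in range(4)] + word_z
phone_z = sample_autoregressive_latent(
    phone_context, make_weights(len(phone_context), 3, rng), rng
)
```

Because the first dimension depends only on the context and each later dimension depends on all earlier ones, the dimensions are ordered. The abstract links this kind of hierarchical structure to better interpretability and disentanglement; the sketch shows the ordering only and does not demonstrate those results.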

