Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis

06/15/2021
by Devang S Ram Mohan et al.

Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: F0, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can significantly further increase naturalness.
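
The abstract describes the conditioning mechanism but gives no implementation. Purely as a rough, hypothetical illustration of the idea, the sketch below concatenates phone-aligned F0, energy, and duration values with text-encoder outputs, so the same interface serves externally provided, text-predicted, or predicted-then-edited prosody. All names and dimensions here (ProsodyConditionedEncoder, text_dim=512, and so on) are assumptions for the sketch, not taken from the paper.

```python
import torch
import torch.nn as nn

class ProsodyConditionedEncoder(nn.Module):
    """Hypothetical sketch: fuse per-phone prosodic features with text encodings.

    The three acoustic correlates named in the abstract (F0, energy,
    duration) arrive as a (batch, phones, 3) tensor, are projected to the
    model width, and are concatenated with the phone-level text encodings
    before a decoder would attend to them.
    """

    def __init__(self, text_dim=512, prosody_dim=3, hidden_dim=512):
        super().__init__()
        self.prosody_proj = nn.Linear(prosody_dim, hidden_dim)  # lift the 3 features
        self.combine = nn.Linear(text_dim + hidden_dim, hidden_dim)

    def forward(self, text_enc, prosody):
        # text_enc: (batch, T_phones, text_dim) phone-level text encodings
        # prosody:  (batch, T_phones, 3) per-phone [F0, energy, duration],
        #           predicted from text or supplied/edited externally
        p = self.prosody_proj(prosody)
        return torch.tanh(self.combine(torch.cat([text_enc, p], dim=-1)))

# Usage: editing a slice of the prosody tensor changes only those phones,
# the kind of temporally precise, disentangled control the abstract
# contrasts with VAE-style global latents.
enc = ProsodyConditionedEncoder()
text_enc = torch.randn(1, 20, 512)   # encodings for a 20-phone utterance
predicted = torch.randn(1, 20, 3)    # stand-in for text-predicted prosody
edited = predicted.clone()
edited[:, 5:10, 0] += 1.0            # raise F0 on phones 5-9 only
rendition_a = enc(text_enc, predicted)
rendition_b = enc(text_enc, edited)  # a distinct rendition of the same text
```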

Related research

06/06/2022 - UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder
In this paper, we propose a novel unsupervised text-to-speech (UTTS) fra...

05/17/2019 - CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speec...

04/23/2023 - SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
In recent Text-to-Speech (TTS) systems, a neural vocoder often generates...

04/28/2020 - Conditional Spoken Digit Generation with StyleGAN
This paper adapts a StyleGAN model for speech generation with minimal or...

07/28/2017 - Improving coreference resolution with automatically predicted prosodic information
Adding manually annotated prosodic information, specifically pitch accen...

02/19/2021 - Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
The prosody of a spoken word is determined by its surrounding context. I...

06/10/2019 - Using generative modelling to produce varied intonation for speech synthesis
Unlike human speakers, typical text-to-speech (TTS) systems are unable t...
