Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS

10/06/2021
by Tuomo Raitio, et al.

Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often reflects the average prosodic style of the database rather than exhibiting more versatile prosodic variation. Moreover, many models cannot control the output prosody, so the same text input cannot be rendered in different styles. In this work, we train a non-autoregressive parallel neural TTS model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining quality equal to or better than that of the baseline model.
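
To make the idea of utterance-wise prosodic conditioning features more concrete, the following is a minimal Python sketch of how frame-level tracks might be summarized into one value per utterance and then offset to steer a prosodic dimension. It is not the authors' implementation: the function names `utterance_prosody` and `apply_bias`, the 5%/95% quantile definition of pitch range, and the use of plain means are all illustrative assumptions, and spectral tilt is omitted for brevity.

```python
import math
from statistics import mean


def utterance_prosody(f0_hz, energy, phone_durations_s):
    """Summarize frame-level tracks into utterance-wise prosody features.

    f0_hz: per-frame F0 in Hz, with 0.0 marking unvoiced frames.
    energy: per-frame energy on any consistent linear scale.
    phone_durations_s: per-phone durations in seconds.

    The feature definitions below are assumptions made for illustration.
    """
    # Use log-F0 of voiced frames only; unvoiced frames carry no pitch.
    log_f0 = sorted(math.log(f) for f in f0_hz if f > 0.0)
    if not log_f0:
        raise ValueError("no voiced frames in utterance")

    # Assumed pitch-range definition: spread between low and high quantiles.
    lo = log_f0[int(0.05 * (len(log_f0) - 1))]
    hi = log_f0[int(0.95 * (len(log_f0) - 1))]

    return {
        "pitch": mean(log_f0),
        "pitch_range": hi - lo,
        "duration": mean(phone_durations_s),
        "energy": mean(energy),
    }


def apply_bias(features, bias):
    """Return a copy with user-chosen offsets added to selected dimensions."""
    return {name: value + bias.get(name, 0.0) for name, value in features.items()}


if __name__ == "__main__":
    feats = utterance_prosody(
        f0_hz=[0.0, 100.0, 200.0, 0.0, 100.0],
        energy=[1, 2, 3, 4, 5],
        phone_durations_s=[0.1, 0.3],
    )
    # Raise the pitch dimension while leaving the others unchanged.
    print(apply_bias(feats, {"pitch": 0.5}))
```

In a setup like this, the biased feature vector would be passed to the TTS model as its conditioning input at synthesis time; how the model consumes it, and how word-level features are layered on top, is described in the full paper rather than here.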


09/14/2020

Controllable neural text-to-speech synthesis using intuitive prosodic features

Modern neural text-to-speech (TTS) synthesis can generate speech that is...
11/12/2020

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Prosody modeling is an essential component in modern text-to-speech (TTS...
10/06/2021

Emphasis control for parallel neural TTS

The semantic information conveyed by a speech signal is strongly influen...
04/08/2022

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

This paper proposes a hierarchical and multi-scale variational autoencod...
03/26/2021

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

This paper introduces Parallel Tacotron 2, a non-autoregressive neural t...
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...