Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS

10/06/2021
by Tuomo Raitio, et al.

Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often reflects the average prosodic style of the training database rather than exhibiting more versatile prosodic variation. Moreover, many models cannot control the output prosody, so the same text input cannot be rendered in different styles. In this work, we train a non-autoregressive parallel neural TTS model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining quality equal to or better than that of the baseline model.
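
To make the idea of utterance-wise prosodic features more concrete, the following is a minimal Python sketch of how frame-level pitch and energy tracks could be summarized into one value per utterance. The function name, the 5 ms frame shift, the percentile-based pitch range, and the use of total utterance length as the duration feature are all illustrative assumptions and are not taken from the paper; spectral tilt is omitted because the abstract does not say how it is computed.

```python
import math
from statistics import mean


def utterance_prosody_features(f0_hz, energy, frame_shift_s=0.005):
    """Summarize frame-level tracks into utterance-wise prosody features.

    f0_hz:  per-frame F0 in Hz, with 0.0 marking unvoiced frames.
    energy: per-frame energy on a linear scale, same length as f0_hz.
    All feature definitions below are assumptions made for illustration.
    """
    if len(f0_hz) != len(energy):
        raise ValueError("f0_hz and energy must have the same length")

    # Work in the log domain and ignore unvoiced frames.
    voiced = sorted(math.log(f) for f in f0_hz if f > 0.0)
    if not voiced:
        raise ValueError("utterance has no voiced frames")

    # Pitch: mean log-F0 over voiced frames.
    pitch = mean(voiced)

    # Pitch range: spread between the 5th and 95th percentiles of log-F0
    # (nearest-rank approximation on the sorted values).
    last = len(voiced) - 1
    low = voiced[round(0.05 * last)]
    high = voiced[round(0.95 * last)]

    return {
        "pitch": pitch,
        "pitch_range": high - low,
        # Duration: total utterance length in seconds (assumed definition).
        "duration": len(f0_hz) * frame_shift_s,
        # Energy: mean frame energy over the whole utterance.
        "energy": mean(energy),
    }


features = utterance_prosody_features(
    [0.0, 100.0, 200.0, 0.0], [1.0, 2.0, 3.0, 2.0]
)
```

In a conditioning setup like the one described above, such per-utterance values would typically be normalized before being fed to the model, so that each dimension can be shifted at synthesis time to change the speaking style.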


research
09/14/2020

Controllable neural text-to-speech synthesis using intuitive prosodic features

Modern neural text-to-speech (TTS) synthesis can generate speech that is...
research
11/12/2020

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Prosody modeling is an essential component in modern text-to-speech (TTS...
research
10/06/2021

Emphasis control for parallel neural TTS

The semantic information conveyed by a speech signal is strongly influen...
research
04/16/2021

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

We propose TalkNet, a non-autoregressive convolutional neural model for ...
research
04/08/2022

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

This paper proposes a hierarchical and multi-scale variational autoencod...
research
03/26/2021

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

This paper introduces Parallel Tacotron 2, a non-autoregressive neural t...
research
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...
