Enhancing audio quality for expressive Neural Text-to-Speech

08/13/2021
by Abdelhamid Ezzerg, et al.

Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even for recent TTS architectures, since there seems to be a trade-off between the expressiveness of generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques close the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
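
One plausible reading of "closing the gap" in MUSHRA terms is a relative gap reduction: the share of the score difference between the baseline system and the recordings that the improved system recovers. The sketch below assumes that definition, which the abstract does not spell out. The function name `gap_closed_percent` and all scores are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def gap_closed_percent(baseline: float, system: float, reference: float) -> float:
    """Share of the baseline-to-reference gap recovered by `system`, in percent.

    Assumed definition (not confirmed by the abstract):
        100 * (system - baseline) / (reference - baseline)
    """
    gap = reference - baseline
    if gap == 0:
        raise ValueError("baseline and reference scores are equal; there is no gap to close")
    return 100.0 * (system - baseline) / gap


# Hypothetical mean MUSHRA scores on the 0-100 scale (NOT taken from the paper).
baseline_score = 50.0     # baseline TTS system
improved_score = 69.5     # system with the combined techniques
recordings_score = 100.0  # human recordings used as the upper anchor

print(gap_closed_percent(baseline_score, improved_score, recordings_score))
```

With these made-up scores, the improved system recovers 19.5 of the 50 points separating the baseline from the recordings, which is a 39% gap reduction under the assumed definition.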


