Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder

08/25/2023
by Xuyuan Li, et al.

Neural networks can now generate high-quality, highly expressive single-sentence speech. Paragraph-level speech synthesis, however, remains a challenge because it requires coherent acoustic features while delivering fluctuating speech styles. Moreover, training these models directly on over-length speech degrades the quality of the synthesized speech. To address these problems, we propose a high-quality and expressive paragraph speech synthesis system with a multi-step variational autoencoder. Specifically, we employ multi-step latent variables to capture speech information at different grammatical levels, then use these features in parallel to generate the speech waveform. We also propose a three-step training method to improve the decoupling ability. Our model was trained on a single-speaker French audiobook corpus released for the Blizzard Challenge 2023. Experimental results show that our system significantly outperforms the baseline models.
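
The abstract does not spell out how the multi-step latent variables are parameterized, so the following is only a generic sketch of latent variables at several levels in a variational autoencoder, not the authors' implementation. It assumes diagonal Gaussian posteriors with a standard normal prior at each level; the level names (`paragraph`, `sentence`, `phoneme`), the dimensions, and the helper functions are all hypothetical and chosen for illustration.

```python
import math
import random


def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical posterior parameters (mean, log-variance) per level.
# In a real system these would come from level-specific encoders.
posteriors = {
    "paragraph": ([0.0, 0.0], [0.0, 0.0]),
    "sentence": ([1.0, -1.0], [0.0, 0.0]),
    "phoneme": ([0.5], [math.log(2.0)]),
}

rng = random.Random(0)

# One latent sample per level; a decoder could consume all of them in parallel.
latents = {level: sample_latent(mu, lv, rng) for level, (mu, lv) in posteriors.items()}

# One KL regularizer per level; their sum would be added to a reconstruction loss.
kl_terms = {level: kl_diag_gaussian(mu, lv) for level, (mu, lv) in posteriors.items()}
total_kl = sum(kl_terms.values())
```

Keeping a separate KL term per level makes it possible to weight or schedule each level independently during training, which is one plausible way a staged training procedure could be arranged; the abstract does not say whether the three-step method works this way.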
