Low-resource expressive text-to-speech using data augmentation

11/11/2020
by Goeric Huybrechts, et al.

While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel three-step methodology that avoids the costly recording of large amounts of target data, making it possible to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train a TTS model. Finally, we fine-tune that model to further increase quality. Our evaluations show that the proposed changes bring significant improvements over non-augmented models across many perceived aspects of synthesised speech. We demonstrate the proposed approach on two styles (newscaster and conversational), on various speakers, and on both single- and multi-speaker models, illustrating the robustness of our approach.
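
The three steps can be pictured as a data-flow sketch. The following is only an illustration under stated assumptions, not the authors' implementation: `Utterance`, `convert_voice`, and `build_training_sets` are hypothetical names, the voice conversion is a placeholder that merely relabels the speaker, and the choice to fine-tune on the real target recordings alone is an assumption, since the abstract does not say which data the fine-tuning step uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record standing in for one recording and its metadata."""
    text: str
    speaker: str
    style: str
    synthetic: bool = False


def convert_voice(utt: Utterance, target_speaker: str) -> Utterance:
    """Placeholder for a voice conversion model (assumption).

    A real system would transform the audio; here we only relabel the
    speaker while keeping the text and the speaking style.
    """
    return Utterance(utt.text, target_speaker, utt.style, synthetic=True)


def build_training_sets(target_recordings, supporting_recordings,
                        target_speaker, style):
    """Return (training_set, fine_tuning_set) for the sketched pipeline."""
    # Step 1: augment with style recordings from other speakers,
    # converted to the target speaker's identity.
    synthetic = [
        convert_voice(u, target_speaker)
        for u in supporting_recordings
        if u.style == style and u.speaker != target_speaker
    ]
    # Step 2: train the TTS model on real plus synthetic data.
    training_set = list(target_recordings) + synthetic
    # Step 3: fine-tune. Using only the real target recordings here is
    # an assumption made for this sketch.
    fine_tuning_set = list(target_recordings)
    return training_set, fine_tuning_set


if __name__ == "__main__":
    target = [Utterance("Good evening.", "A", "newscaster")]
    support = [
        Utterance("Markets closed higher today.", "B", "newscaster"),
        Utterance("Hey, how's it going?", "B", "conversational"),
    ]
    train, finetune = build_training_sets(target, support, "A", "newscaster")
    print(len(train), "training utterances;", len(finetune), "for fine-tuning")
```

In this sketch only the supporting utterance in the requested style is converted, so the training set mixes one real and one synthetic utterance, all attributed to the target speaker.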

research
07/29/2022

Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

The availability of data in expressive styles across languages is limite...
research
06/24/2021

Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Whilst recent neural text-to-speech (TTS) approaches produce high-qualit...
research
02/10/2022

Cross-speaker style transfer for text-to-speech using data augmentation

We address the problem of cross-speaker style transfer for text-to-speec...
research
09/02/2020

Efficient neural speech synthesis for low-resource languages through multilingual modeling

Recent advances in neural TTS have led to models that can produ...
research
10/18/2022

Risk of re-identification for shared clinical speech recordings

Large, curated datasets are required to leverage speech-based tools in h...
research
01/11/2023

Modelling low-resource accents without accent-specific TTS frontend

This work focuses on modelling a speaker's accent that does not have a d...
research
06/16/2021

Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

This paper proposes a general enhancement to the Normalizing Flows (NF) ...
