TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis

10/26/2020
by Min-Jae Hwang, et al.

In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (non-AR) TTS system. Recently proposed non-AR models, such as FastSpeech 2, have successfully achieved fast speech synthesis. However, their quality is not satisfactory, especially when the amount of training data is insufficient. To address this problem, we propose an effective data augmentation method using a well-designed AR TTS system. In this method, large-scale synthetic corpora consisting of text-waveform pairs with phoneme durations are generated by the AR TTS system and then used to train the target non-AR model. Perceptual listening test results showed that the proposed method significantly improved the quality of the non-AR TTS system. In particular, we augmented five hours of a training database to 179 hours of a synthetic one. Using these databases, our TTS system consisting of a FastSpeech 2 acoustic model with a Parallel WaveGAN vocoder achieved a mean opinion score of 3.74, which is 40% higher than that achieved by the conventional method.
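
The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Utterance` record, the `augment` function, and the `toy_teacher` stand-in for the AR TTS system are all hypothetical names, and mixing the recorded utterances with the synthetic ones in a single training corpus is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Utterance:
    """One training example: text, per-phoneme durations, and audio."""
    phonemes: List[str]
    durations: List[int]    # frames per phoneme
    waveform: List[float]   # audio samples
    synthetic: bool         # True if produced by the teacher AR TTS


# Hypothetical teacher interface: phonemes -> (durations, waveform).
Teacher = Callable[[Sequence[str]], Tuple[List[int], List[float]]]


def augment(recorded: List[Utterance],
            extra_texts: Sequence[Sequence[str]],
            teacher: Teacher) -> List[Utterance]:
    """Extend a small recorded corpus with teacher-generated utterances."""
    corpus = list(recorded)  # assumption: recorded data is kept alongside synthetic
    for phonemes in extra_texts:
        durations, waveform = teacher(phonemes)
        if len(durations) != len(phonemes):
            raise ValueError("teacher must return one duration per phoneme")
        corpus.append(Utterance(list(phonemes), durations, waveform, True))
    return corpus


def toy_teacher(phonemes: Sequence[str],
                frames_per_phoneme: int = 2,
                hop: int = 4) -> Tuple[List[int], List[float]]:
    """Placeholder for an AR TTS system: fixed durations, silent audio."""
    durations = [frames_per_phoneme] * len(phonemes)
    waveform = [0.0] * (sum(durations) * hop)
    return durations, waveform
```

In a real setup, `toy_teacher` would be replaced by a trained AR acoustic model and vocoder, and the resulting corpus would be passed to the non-AR model's training loop; the phoneme durations matter because a duration-based model such as FastSpeech 2 needs them as training targets.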

research
10/29/2018

Neural source-filter-based waveform model for statistical parametric speech synthesis

Neural waveform models such as the WaveNet are used in many recent text-...
research
06/16/2023

Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

Many neural text-to-speech architectures can synthesize nearly natural s...
research
04/07/2018

A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

Recent advances in speech synthesis suggest that limitations such as the...
research
02/15/2021

PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components

We propose PeriodNet, a non-autoregressive (non-AR) waveform generation ...
research
05/28/2021

Differentiable Artificial Reverberation

We propose differentiable artificial reverberation (DAR), a family of ar...
research
07/14/2021

High-Speed and High-Quality Text-to-Lip Generation

As a key component of talking face generation, lip movements generation ...
research
06/30/2022

TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder

Recent advances in synthetic speech quality have enabled us to train tex...
