JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

03/31/2022
by Dan Lim, et al.

In neural text-to-speech (TTS), two-stage systems, i.e., cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram, and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS in a subjective evaluation (MOS) and some objective evaluations.
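The abstract contrasts separately optimized cascade training with a single joint objective. As a rough illustration only (all functions, values, and loss weights below are toy placeholders, not the paper's implementation), the difference amounts to whether the acoustic model and vocoder minimize two independent losses, or one combined loss that also carries an alignment term:

```python
# Toy sketch of the training-objective difference described above.
# Everything here is an illustrative placeholder, not JETS code.

def cascade_losses(gt_mel, pred_mel, gt_wave, voc_wave):
    # Two-stage training: the acoustic model and the vocoder each minimize
    # their own loss. The vocoder is trained on ground-truth mels, so at
    # inference it sees predicted mels it never encountered during training
    # (the "acoustic feature mismatch" that motivates fine-tuning).
    acoustic_loss = sum(abs(p - g) for p, g in zip(pred_mel, gt_mel))
    vocoder_loss = sum(abs(v - g) for v, g in zip(voc_wave, gt_wave))
    return acoustic_loss, vocoder_loss  # optimized independently

def joint_loss(pred_mel, gt_mel, voc_wave, gt_wave, align_loss, w_align=2.0):
    # Joint training (JETS-style): one objective backpropagates through both
    # modules plus an alignment-learning term, so there is no feature
    # mismatch and no external speech-text aligner is needed.
    # The weight w_align is made up for this sketch.
    mel_term = sum(abs(p - g) for p, g in zip(pred_mel, gt_mel))
    wave_term = sum(abs(v - g) for v, g in zip(voc_wave, gt_wave))
    return mel_term + wave_term + w_align * align_loss

total = joint_loss([0.5, 0.5], [0.0, 1.0], [0.1], [0.0], align_loss=0.25)
print(total)  # close to 1.6 (1.0 mel + 0.1 wave + 0.5 weighted alignment)
```

The real models replace these placeholder sums with FastSpeech2's variance/duration losses and HiFi-GAN's adversarial and feature-matching losses; the sketch only shows why a single differentiable objective removes the mismatch between training and inference.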


Related research

03/22/2022 - A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
Speech synthesis has come a long way as current text-to-speech (TTS) mod...

04/05/2022 - AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech
The quality of end-to-end neural text-to-speech (TTS) systems highly dep...

10/28/2022 - Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis
Several fully end-to-end text-to-speech (TTS) models have been proposed ...

07/05/2023 - LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
Despite previous efforts in melody-to-lyric generation research, there i...

10/30/2022 - Adaptive Speech Quality Aware Complex Neural Network for Acoustic Echo Cancellation with Supervised Contrastive Learning
Acoustic echo cancellation (AEC) is designed to remove echoes, reverbera...

07/11/2022 - DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
Current text to speech (TTS) systems usually leverage a cascaded acousti...

02/13/2019 - Recurrent Neural Networks with Stochastic Layers for Acoustic Novelty Detection
In this paper, we adapt Recurrent Neural Networks with Stochastic Layers...
