DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

07/11/2022
by   Yanqing Liu, et al.
0

Current text to speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffer from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrogram) are pre-designed and lose phase information, which are sub-optimal. To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of VQ-GAN), with an auxiliary loss on the acoustic model to predict intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain +0.14 over DelightfulTTS, and more method analyses further verify the effectiveness of the developed system.

READ FULL TEXT
research
03/06/2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

Neural text-to-speech (TTS) generally consists of cascaded architecture ...
research
06/21/2021

Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Current two-stage TTS framework typically integrates an acoustic model w...
research
03/31/2022

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

In neural text-to-speech (TTS), two-stage system or a cascade of separat...
research
07/07/2022

End-to-end Speech-to-Punctuated-Text Recognition

Conventional automatic speech recognition systems do not produce punctua...
research
07/08/2022

End-to-End Binaural Speech Synthesis

In this work, we present an end-to-end binaural speech synthesis system ...
research
08/16/2020

Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders

Unsupervised representation learning of speech has been of keen interest...
research
04/02/2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

The mainstream neural text-to-speech(TTS) pipeline is a cascade system, ...

Please sign up or login with your details

Forgot password? Click here to reset