ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

07/19/2018
by Wei Ping, et al.

In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly peaked output distributions. Our method computes the KL divergence in closed form, which simplifies the training algorithm and makes distillation very efficient. In addition, we propose the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2017). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.
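To illustrate why Gaussian output distributions allow a closed-form KL divergence, here is a minimal sketch for the univariate case. The first function is the standard KL divergence between two Gaussians. The second adds a penalty on the gap between the log-scales; that penalty, its form, and the weight `lam` are assumptions made for illustration, since the abstract says only that the KL divergence is regularized. Function and parameter names are hypothetical.

```python
import math


def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) for q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    var_q = math.exp(2.0 * log_sigma_q)
    var_p = math.exp(2.0 * log_sigma_p)
    # log(sigma_p / sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2
    return (log_sigma_p - log_sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5


def regularized_kl(mu_q, log_sigma_q, mu_p, log_sigma_p, lam=1.0):
    """KL plus a squared log-scale gap.

    ASSUMPTION: the regularizer's form and the weight `lam` are illustrative
    choices, not taken from the abstract.
    """
    penalty = lam * (log_sigma_p - log_sigma_q) ** 2
    return gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p) + penalty


if __name__ == "__main__":
    # Student q (the flow) and teacher p (the autoregressive model) at one time step.
    print(gaussian_kl(0.0, 0.0, 0.0, 0.0))
    print(gaussian_kl(1.0, 0.0, 0.0, 0.0))
    print(regularized_kl(0.0, math.log(0.5), 0.0, 0.0, lam=1.0))
```

Because the mean gap is divided by `var_p`, the plain KL term grows very large when the teacher distribution is highly peaked, which is one plausible motivation for adding a scale-matching penalty.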

research
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...
research
11/06/2020

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

We describe a sequence-to-sequence neural network which can directly gen...
research
07/09/2019

Multi-Speaker End-to-End Speech Synthesis

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end...
research
10/22/2020

Parallel Tacotron: Non-Autoregressive and Controllable TTS

Although neural end-to-end text-to-speech models can synthesize highly n...
research
06/08/2020

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Advanced text to speech (TTS) models such as FastSpeech can synthesize s...
research
06/19/2018

End-to-End Speech Recognition From the Raw Waveform

State-of-the-art speech recognition systems rely on fixed, hand-crafted ...
research
10/28/2022

Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

Several fully end-to-end text-to-speech (TTS) models have been proposed ...
