A Spectral Energy Distance for Parallel Speech Synthesis

08/03/2020
by Alexey A. Gritsenko, et al.

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
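To make the idea concrete, the following is a minimal sketch of an energy-distance-style loss over magnitude spectrograms. It is an illustration, not the paper's exact training loss: the frame length, hop size, norm, and single-resolution STFT here are simplifying assumptions, and the paper's full method uses its own multi-scale formulation. The key structural ingredient shown is the repulsive term between two independent model samples, which is what makes the objective a proper energy distance rather than plain spectrogram regression.

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=256, hop=128):
    """Magnitude STFT of a 1-D waveform (Hann window, no padding).

    Illustrative single-resolution transform; the paper's method
    combines several spectrogram resolutions.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_energy_distance(gen_a, gen_b, real):
    """Energy-distance-style loss between generated and real audio.

    `gen_a` and `gen_b` are two independent model samples drawn for the
    same conditioning; `real` is the ground-truth waveform. The two
    "attractive" terms pull generated spectrograms toward the real one;
    the "repulsive" term between the two generated samples penalizes a
    collapsed (deterministic) generator.
    """
    s_a, s_b, s_r = (magnitude_spectrogram(w) for w in (gen_a, gen_b, real))
    attract = np.linalg.norm(s_a - s_r) + np.linalg.norm(s_b - s_r)
    repel = np.linalg.norm(s_a - s_b)
    return attract - repel
```

Because every term is an expectation over pairs of samples, averaging this quantity over a minibatch gives an unbiased estimate of the population distance, which is what allows stable, non-adversarial training of an implicit generative model.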


Related research:

09/25/2019  High Fidelity Speech Synthesis with Adversarial Networks
10/30/2018  Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks
04/09/2019  Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
04/08/2019  GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
12/03/2019  WaveFlow: A Compact Flow-based Model for Raw Audio
10/25/2019  Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
06/26/2022  Your Autoregressive Generative Model Can be Better If You Treat It as an Energy-Based One
