Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

by Yuki Saito, et al.

A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network (DNN) techniques can be applied to artificially synthesize speech waveforms, the quality of the synthetic speech is low compared with that of natural speech. One of the issues causing the quality degradation is an over-smoothing effect often observed in the generated speech parameters. A GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator. Since the objective of the GANs is to minimize the divergence (i.e., distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the over-smoothing effect on the generated speech parameters. We evaluated its effectiveness in text-to-speech and voice conversion, and found that the proposed method can generate more natural spectral parameters and F_0 than the conventional minimum generation error training algorithm, regardless of its hyper-parameter settings. Furthermore, we investigated the effect of the divergence of various GANs, and found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving synthetic speech quality.
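
The training objective described above, a weighted sum of a generation loss and an adversarial loss, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the toy logistic discriminator, the mean-squared-error form of the generation loss, and the names `adv_weight`, `disc_w`, and `disc_b` are all assumptions made only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def discriminator_prob(params, disc_w, disc_b):
    # Toy logistic discriminator (an assumption): probability that each
    # frame of speech parameters is natural rather than generated.
    return sigmoid(params @ disc_w + disc_b)


def generation_loss(natural, generated):
    # Mean squared error between natural and generated speech parameters,
    # standing in here for the conventional generation loss.
    return float(np.mean((natural - generated) ** 2))


def adversarial_loss(generated, disc_w, disc_b, eps=1e-12):
    # Loss for deceiving the discriminator: small when the discriminator
    # judges the generated parameters to be natural.
    p_natural = discriminator_prob(generated, disc_w, disc_b)
    return float(-np.mean(np.log(p_natural + eps)))


def acoustic_model_loss(natural, generated, disc_w, disc_b, adv_weight=1.0):
    # Weighted sum of the generation loss and the adversarial loss.
    # adv_weight is a hypothetical hyper-parameter name.
    return (generation_loss(natural, generated)
            + adv_weight * adversarial_loss(generated, disc_w, disc_b))


def discriminator_loss(natural, generated, disc_w, disc_b, eps=1e-12):
    # Cross-entropy loss for telling natural and generated parameters apart.
    p_nat = discriminator_prob(natural, disc_w, disc_b)
    p_gen = discriminator_prob(generated, disc_w, disc_b)
    return float(-np.mean(np.log(p_nat + eps))
                 - np.mean(np.log(1.0 - p_gen + eps)))
```

In a setup like this, the two losses would be minimized in alternation: the discriminator on `discriminator_loss`, then the acoustic model on `acoustic_model_loss`. Setting `adv_weight` to zero reduces the acoustic-model objective to the generation loss alone.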




Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech

We propose a novel training algorithm for a multi-speaker neural text-to...

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

Recent studies have shown that text-to-speech synthesis quality can be i...

WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

We propose a learning-based filter that allows us to directly modify a s...

Sampling-based speech parameter generation using moment-matching networks

This paper presents sampling-based speech parameter generation using mom...

A comparison of Vietnamese Statistical Parametric Speech Synthesis Systems

In recent years, statistical parametric speech synthesis (SPSS) systems ...

Pattern Detection in the Activation Space for Identifying Synthesized Content

Generative Adversarial Networks (GANs) have recently achieved unpreceden...

Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

This paper proposes voicing-aware conditional discriminators for Paralle...
