Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

by Takuhiro Kaneko, et al.

In speech synthesis, a generative adversarial network (GAN), which trains a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such discriminators allow synthesized speech to come sufficiently close to real speech; however, model size and computation time grow with the number of discriminators. As an alternative, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator with a Wave-U-Net architecture. This discriminator is unique in that it assesses a waveform sample-wise, at the same resolution as the input signal, while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to closely match real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight when used in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight when used in VITS. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/.
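To make the "same resolution as the input signal" property concrete, the following is a minimal shape-bookkeeping sketch of a 1-D encoder-decoder with skip connections. It is not the authors' implementation: the function names `trace_unet_lengths` and `skip_pairs`, the strides `(2, 2, 4)`, and the input length of 8192 samples are all illustrative assumptions, and no actual convolutions are performed. It only shows why a decoder that mirrors the encoder's downsampling can emit one real/fake score per input sample.

```python
# Hypothetical shape bookkeeping for a 1-D U-Net-style discriminator.
# The strides and input length are illustrative assumptions,
# not values taken from the paper.

def trace_unet_lengths(num_samples, strides=(2, 2, 4)):
    """Return (encoder_lengths, decoder_lengths) for a 1-D encoder-decoder.

    Each encoder stage divides the time length by its stride; each decoder
    stage multiplies it back in reverse order, so a skip connection can join
    feature maps of equal length.
    """
    encoder_lengths = [num_samples]
    for s in strides:
        if encoder_lengths[-1] % s != 0:
            raise ValueError("input length must be divisible by every stride")
        encoder_lengths.append(encoder_lengths[-1] // s)

    decoder_lengths = [encoder_lengths[-1]]
    for s in reversed(strides):
        decoder_lengths.append(decoder_lengths[-1] * s)
    return encoder_lengths, decoder_lengths


def skip_pairs(encoder_lengths, decoder_lengths):
    """Pair each decoder output with the encoder feature of the same depth."""
    # decoder_lengths[k] sits at the same depth as encoder_lengths[-1 - k].
    return [(encoder_lengths[-1 - k], decoder_lengths[k])
            for k in range(1, len(decoder_lengths))]


enc, dec = trace_unet_lengths(8192)
print(enc)  # [8192, 4096, 2048, 512]
print(dec)  # [512, 2048, 4096, 8192]
```

Under these assumptions, the final decoder length equals the input length, so the discriminator output can be read as a per-sample decision, and every pair returned by `skip_pairs(enc, dec)` has matching lengths, which is what allows encoder features to be merged into the decoder at each level.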




Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Neural vocoders based on the generative adversarial neural network (GAN)...

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Synthesizing high-fidelity complex images from text is challenging. Base...

UU-Nets Connecting Discriminator and Generator for Image to Image Translation

Adversarial generative model have successfully manifest itself in image ...

EBEN: Extreme bandwidth extension network applied to speech signals captured with noise-resilient microphones

In this paper, we present Extreme Bandwidth Extension Network (EBEN), a ...

DeblurGAN-v2: Deblurring (Orders-of-Magnitude) Faster and Better

We present a new end-to-end generative adversarial network (GAN) for sin...

DVGAN: Stabilize Wasserstein GAN training for time-domain Gravitational Wave physics

Simulating time-domain observations of gravitational wave (GW) detector ...

Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture

This paper presents a configurable version of Extreme Bandwidth Extensio...
