BigVGAN: A Universal Neural Vocoder with Large-Scale Training

06/09/2022
by Sang-gil Lee, et al.

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on a mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities and anti-aliased representations into the generator, which bring the desired inductive bias for waveform synthesis and significantly improve audio quality. Based on our improved generator and state-of-the-art discriminators, we train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to this scale, while maintaining high-fidelity output without over-regularization. BigVGAN achieves state-of-the-art zero-shot performance in various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music, and instrumental audio in unseen (even noisy) recording environments. We will release our code and model at: https://github.com/NVIDIA/BigVGAN
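
The abstract does not name the periodic nonlinearity; in the full paper it is the Snake activation, f(x) = x + (1/alpha) * sin^2(alpha * x), where alpha is a learned per-channel parameter that sets the frequency of the periodic term. The following is a minimal scalar sketch of that formula only, not the released implementation: the function names `snake` and `snake_layer` are hypothetical, alpha is held fixed rather than learned, and the anti-aliasing step (low-pass filtering around the nonlinearity) is omitted.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The linear term keeps the function monotonic, while the sin^2 term
    adds a periodic component whose frequency is set by alpha.
    alpha is a fixed argument here (assumption); in the paper it is learned.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_layer(xs, alpha: float = 1.0):
    """Apply the activation elementwise to a sequence of samples."""
    return [snake(x, alpha) for x in xs]


if __name__ == "__main__":
    samples = [-1.0, -0.5, 0.0, 0.5, 1.0]
    # A larger alpha gives a higher-frequency, lower-amplitude periodic term.
    for alpha in (0.5, 1.0, 4.0):
        print(alpha, snake_layer(samples, alpha))
```

Because the derivative is 1 + sin(2 * alpha * x), which is never negative, the activation stays non-decreasing while still carrying a periodic component, which is the inductive bias the abstract refers to.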

research
11/02/2022

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Recent development of neural vocoders based on the generative adversaria...
research
11/01/2021

RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses

Most GAN(Generative Adversarial Network)-based approaches towards high-f...
research
09/06/2023

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Generative adversarial network (GAN)-based vocoders have been intensivel...
research
01/12/2021

Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis

Training Generative Adversarial Networks (GAN) on high-fidelity images u...
research
04/30/2020

Jukebox: A Generative Model for Music

We introduce Jukebox, a model that generates music with singing in the r...
research
06/06/2023

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Scaling text-to-speech to a large and wild dataset has been proven to be...
research
05/09/2023

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

Despite recent advances in syncing lip movements with any audio waves, c...
