Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

05/11/2020
by   Geng Yang, et al.

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which has been shown to benefit speech generation. Second, we replace the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back into full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation, which will be open-sourced shortly, achieves a real-time factor of 0.03 on CPU without hardware-specific optimization.
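To put the efficiency figures in the abstract in concrete terms, the short sketch below works through the arithmetic. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio), which the abstract itself does not spell out; the function names are illustrative and do not come from the authors' implementation.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def reduction_ratio(baseline: float, proposed: float) -> float:
    """How many times smaller the proposed cost is than the baseline cost."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return baseline / proposed


# Figures quoted in the abstract.
RTF_CPU = 0.03          # reported real-time factor on CPU
MELGAN_GFLOPS = 5.85    # original MelGAN
MB_MELGAN_GFLOPS = 0.95 # multi-band MelGAN

# Under the assumed RTF definition, 10 s of speech takes about 0.3 s to generate.
seconds_for_ten = RTF_CPU * 10.0

# Equivalently, generation runs roughly 33x faster than real time.
realtime_multiple = 1.0 / RTF_CPU

# The reported complexity drops by a factor of roughly 6.2.
complexity_reduction = reduction_ratio(MELGAN_GFLOPS, MB_MELGAN_GFLOPS)

if __name__ == "__main__":
    print(f"time for 10 s of audio: {seconds_for_ten:.2f} s")
    print(f"faster than real time:  {realtime_multiple:.1f}x")
    print(f"complexity reduction:   {complexity_reduction:.2f}x")
```

A real-time factor below 1 means synthesis is faster than playback, so under this reading the reported 0.03 leaves substantial headroom for the rest of a TTS pipeline on the same CPU.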

research
05/12/2020

FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction

In this paper, we propose the FeatherWave, yet another variant of WaveRN...
research
02/17/2020

Lifter Training and Sub-band Modeling for Computationally Efficient and High-Quality Voice Conversion Using Spectral Differentials

In this paper, we propose computationally efficient and high-quality met...
research
10/23/2022

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Entertainment-oriented singing voice synthesis (SVS) requires a vocoder ...
research
06/15/2021

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Most neural vocoders employ band-limited mel-spectrograms to generate wa...
research
06/27/2022

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Neural vocoders based on the generative adversarial neural network (GAN)...
research
11/08/2022

Improving performance of real-time full-band blind packet-loss concealment with predictive network

Packet loss concealment (PLC) is a tool for enhancing speech degradation...
research
09/04/2019

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

In this paper, we present a generic and robust multimodal synthesis syst...
