WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

05/15/2020
by Po-chun Hsu, et al.

In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far less computation than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 5000 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, the proposed method generates 44.1 kHz speech waveform 1.2 times faster than real-time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.
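The joint objective described above combines the flow's likelihood with frequency-domain losses. The sketch below illustrates that combination in a minimal, hedged form: the tiny affine "flow" and the single-resolution log-magnitude spectral loss are illustrative stand-ins, not the paper's actual architecture or its exact multi-resolution loss.

```python
import numpy as np

def flow_nll(x, scale=1.0, shift=0.0):
    """Negative log-likelihood of x under a toy affine flow
    z = (x - shift) / scale with a standard-normal prior.
    The log-determinant of the transform is -log(scale) per sample."""
    z = (x - shift) / scale
    log_prob = -0.5 * (z ** 2 + np.log(2 * np.pi)) - np.log(scale)
    return -log_prob.mean()

def spectral_loss(pred, target, n_fft=256, hop=64):
    """L1 distance between log-magnitude spectrograms at one resolution
    (WG-WaveNet applies frequency-domain losses; a multi-resolution
    variant would sum this over several n_fft/hop settings)."""
    def logmag(x):
        frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
        mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=-1))
        return np.log(mags + 1e-7)
    return np.abs(logmag(pred) - logmag(target)).mean()

def joint_loss(x, pred, target, alpha=1.0):
    # Total objective: minimize NLL of the training audio under the flow,
    # plus a weighted frequency-domain loss on the post-filter output.
    return flow_nll(x) + alpha * spectral_loss(pred, target)
```

In training, `x` would be ground-truth audio passed through the flow, while `pred` is the post-filtered output compared against `target`; the weight `alpha` balancing the two terms is a hypothetical parameter here.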


