HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

09/18/2023
by Yinghao Aaron Li, et al.

Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating the inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain. The filter uses a sinusoidal source derived from the fundamental frequency (F0), which is inferred by a pre-trained F0 estimation network to keep inference fast. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves performance comparable to BigVGAN while being four times faster with only 1/6 of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
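To make the idea of a sinusoidal source driven by F0 concrete, here is a minimal NumPy sketch of a generic harmonic-plus-noise excitation in the style of neural source-filter models. It is not the authors' implementation: the function name, the parameter names, and all default values (sample rate, hop length, number of harmonics, amplitudes) are assumptions chosen for illustration, and the F0 contour is passed in as an array rather than predicted by a pre-trained network. The abstract states that HiFTNet applies its source filter in the time-frequency domain; this sketch only builds the time-domain excitation.

```python
import numpy as np


def harmonic_plus_noise_source(f0_hz, sample_rate=22050, hop_length=256,
                               num_harmonics=8, sine_amp=0.1,
                               noise_std=0.003, seed=0):
    """Build a sample-level excitation from a frame-level F0 contour.

    f0_hz holds one F0 value per frame, with 0 marking unvoiced frames.
    All names and defaults are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)
    f0_frames = np.asarray(f0_hz, dtype=np.float64)

    # Upsample frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop_length)
    voiced = f0 > 0

    # Instantaneous phase of the fundamental: running sum of frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)

    # One sinusoid per harmonic; mute harmonics at or above Nyquist.
    k = np.arange(1, num_harmonics + 1)[:, None]
    below_nyquist = (k * f0[None, :]) < (sample_rate / 2.0)
    sines = np.sin(k * phase[None, :]) * below_nyquist
    harmonic = sine_amp * sines.sum(axis=0) / num_harmonics

    # Voiced samples: harmonics plus weak noise. Unvoiced samples: noise only.
    noise = rng.standard_normal(f0.shape)
    return np.where(voiced, harmonic + noise_std * noise,
                    (sine_amp / 3.0) * noise)


# Example: two unvoiced frames followed by two voiced frames at 220 Hz.
excitation = harmonic_plus_noise_source([0.0, 0.0, 220.0, 220.0])
```

In a source-filter vocoder, an excitation like this would then be shaped by learned layers conditioned on the mel-spectrogram; that part is outside the scope of the sketch.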


research
11/02/2022

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Recent development of neural vocoders based on the generative adversaria...
research
10/28/2022

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

We propose a lightweight end-to-end text-to-speech model using multi-ban...
research
08/27/2019

Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Neural source-filter (NSF) models are deep neural networks that produce ...
research
08/14/2023

iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

The inverse short-time Fourier transform network (iSTFTNet) has garnered...
research
11/25/2022

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

We present a neural vocoder designed with low-powered Alternative and Au...
research
05/12/2022

Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation

This paper introduces a unified source-filter network with a harmonic-pl...
research
03/04/2022

iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

In recent text-to-speech synthesis and voice conversion systems, a mel-s...
