Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

11/13/2022
by Jacob J Webber, et al.

Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed “autovocoder” reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
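
To illustrate why an inverse STFT counts as a "simple, fast operation", the following is a minimal NumPy sketch of an STFT and its overlap-add inverse. It is not the autovocoder itself, which learns its representation and is not reproduced here. The function names `stft` and `istft`, the Hann window, and the `n_fft` and `hop` values are assumptions chosen for illustration. The sketch shows only that a complex spectrogram (magnitude and phase) inverts to the waveform with plain DSP, whereas a mel-spectrogram discards phase and therefore needs Griffin-Lim or a neural vocoder.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Complex STFT of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]        # periodic Hann window
    x = np.pad(x, n_fft // 2, mode="reflect")  # centre frames on the samples
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)


def istft(spec, n_fft=1024, hop=256):
    """Invert a complex STFT by inverse FFT and weighted overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)              # undo the squared-window weighting
    return out[n_fft // 2:length - n_fft // 2]  # drop the reflect padding


# Round trip on a synthetic signal (length chosen as a multiple of hop).
rng = np.random.default_rng(0)
t = np.arange(16384) / 16000.0
signal = np.sin(2 * np.pi * 220.0 * t) + 0.1 * rng.standard_normal(t.size)
reconstructed = istft(stft(signal))
max_error = np.max(np.abs(signal - reconstructed))
print(f"max reconstruction error: {max_error:.2e}")
```

Every step here is an FFT, a multiplication, or an addition, so it can be written with differentiable tensor operations and placed inside a trained model, which is the property the abstract relies on.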

research
10/30/2018

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

The state-of-the-art in text-to-speech synthesis has recently improved c...
research
07/11/2020

Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

The performance of text-to-speech (TTS) systems heavily depends on spect...
research
11/25/2022

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

We present a neural vocoder designed with low-powered Alternative and Au...
research
06/21/2021

Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Current two-stage TTS framework typically integrates an acoustic model w...
research
08/20/2018

Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks

We propose the multi-head convolutional neural network (MCNN) architectu...
research
10/25/2019

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

We propose Parallel WaveGAN, a distillation-free, fast, and small-footpr...
research
04/22/2021

Restoring degraded speech via a modified diffusion model

There are many deterministic mathematical operations (e.g. compression, ...