Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks

08/20/2018
by   Sercan O. Arik, et al.

We propose the multi-head convolutional neural network (MCNN), an architecture for synthesizing waveforms from spectrograms. MCNN performs nonlinear interpolation with transposed convolution layers arranged in parallel heads. It achieves more than an order of magnitude higher compute intensity than commonly used iterative algorithms such as Griffin-Lim, which allows efficient utilization of modern multi-core processors and very fast (more than 300x real-time) waveform synthesis. To train MCNN, we use a large-scale speech recognition dataset and losses defined on waveforms that are related to perceptual audio quality. We demonstrate that MCNN is a very promising approach for high-quality speech synthesis that requires neither iterative algorithms nor autoregression.
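To make the idea of parallel transposed-convolution heads concrete, here is a minimal NumPy sketch. It is a toy illustration rather than the architecture from the paper: the single-channel input, the ELU nonlinearity, the plain summation of heads, and the names `transposed_conv1d` and `multi_head_upsample` are all assumptions introduced for this example.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution of a single-channel signal.

    Each input sample scatters a scaled copy of `kernel` into the output,
    and consecutive copies are offset by `stride` samples, so the output
    is an upsampled (interpolated) version of the input.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

def elu(y):
    """ELU nonlinearity (an assumed choice for this sketch)."""
    return np.where(y > 0, y, np.expm1(np.minimum(y, 0.0)))

def multi_head_upsample(x, head_kernels, stride):
    """Run several heads in parallel and sum their outputs.

    Assumption: every head uses the same kernel length and stride so that
    all head outputs have the same length and can be added directly.
    """
    heads = [elu(transposed_conv1d(x, k, stride)) for k in head_kernels]
    return np.sum(heads, axis=0)

# Hypothetical usage: upsample a 4-sample signal by 2x with three heads.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4)
kernels = [rng.standard_normal(4) for _ in range(3)]
waveform = multi_head_upsample(signal, kernels, stride=2)
```

Because the heads do not depend on each other, they can be evaluated concurrently, and nothing in the sketch is iterative or autoregressive: the output is produced in a single forward pass.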

research
03/28/2019

A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet

Neural speech synthesis algorithms are a promising new approach for codi...
research
07/11/2020

Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

The performance of text-to-speech (TTS) systems heavily depends on spect...
research
11/13/2022

Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as ...
research
04/30/2020

A convolutional neural-network model of human cochlear mechanics and filter tuning for real-time applications

Auditory models are commonly used as feature extractors for automatic sp...
research
09/19/2019

WEnets: A Convolutional Framework for Evaluating Audio Waveforms

We describe a new convolutional framework for waveform evaluation, WEnet...
research
11/15/2018

Comprehensive evaluation of statistical speech waveform synthesis

Statistical TTS systems that directly predict the speech waveform have r...
research
09/21/2022

An Initial study on Birdsong Re-synthesis Using Neural Vocoders

Modern speech synthesis uses neural vocoders to model raw waveform sampl...
