R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

06/30/2022
by   Kyle Kastner, et al.
0

This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.

READ FULL TEXT

page 2

page 5

research
06/17/2021

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

This paper introduces WaveGrad 2, a non-autoregressive generative model ...
research
10/22/2020

NU-GAN: High resolution neural upsampling with GAN

In this paper, we propose NU-GAN, a new method for resampling audio from...
research
04/25/2021

Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Speech synthesis and music audio generation from symbolic input differ i...
research
09/12/2016

WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network for generating raw ...
research
10/31/2018

WaveGlow: A Flow-based Generative Network for Speech Synthesis

In this paper we propose WaveGlow: a flow-based network capable of gener...
research
02/09/2020

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Autoregressive convolutional neural networks (CNNs) have been widely exp...
research
04/05/2017

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Generative models in vision have seen rapid progress due to algorithmic ...

Please sign up or login with your details

Forgot password? Click here to reset