Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

12/16/2017
by   Jonathan Shen, et al.
0

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/29/2017

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages,...
research
09/14/2020

Controllable neural text-to-speech synthesis using intuitive prosodic features

Modern neural text-to-speech (TTS) synthesis can generate speech that is...
research
11/24/1998

Text-To-Speech Conversion with Neural Networks: A Recurrent TDNN Approach

This paper describes the design of a neural network that performs the ph...
research
04/08/2019

GELP: GAN-Excited Liner Prediction for Speech Synthesis from Mel-spectrogram

Recent advances in neural network -based text-to-speech have reached hum...
research
07/03/2019

Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features

This paper describes a conditional neural network architecture for Manda...
research
05/20/2020

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce h...
research
11/24/1998

Speech Synthesis with Neural Networks

Text-to-speech conversion has traditionally been performed either by con...

Please sign up or login with your details

Forgot password? Click here to reset