End-to-End Adversarial Text-to-Speech

06/05/2020
by Jeff Donahue, et al.

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable monotonic interpolation scheme to predict the duration of each input token. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision.
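
The central component sketched in the abstract is an aligner that stretches per-token (character or phoneme) representations to the audio frame rate using predicted token lengths, keeping the whole generator feed-forward and differentiable. As a rough illustration only, the NumPy sketch below shows one way such a monotonic interpolation over predicted lengths could be written; the function name, the Gaussian-style kernel width sigma, and the toy input shapes are assumptions for the example rather than the paper's exact formulation, and a real model would run this on framework tensors so gradients reach the length predictor.

import numpy as np

def monotonic_interpolation(token_feats, token_lengths, num_out_frames, sigma=10.0):
    """Upsample per-token features to a fixed output frame rate.

    token_feats:    (N, D) one feature vector per input character/phoneme
    token_lengths:  (N,)   predicted positive length of each token, in output frames
    num_out_frames: number of frame-rate steps to produce
    sigma:          width of the interpolation kernel (assumed hyperparameter)
    """
    ends = np.cumsum(token_lengths)                # (N,) token end positions
    centres = ends - token_lengths / 2.0           # (N,) token centre positions
    t = np.arange(num_out_frames)[:, None]         # (T, 1) output frame indices
    # Squared distance from every output frame to every token centre.
    dist2 = (t - centres[None, :]) ** 2            # (T, N)
    logits = -dist2 / (sigma ** 2)
    # Softmax over tokens, per frame: each frame is a smooth mixture of token features.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ token_feats                   # (T, D) frame-aligned features

# Toy usage: 5 tokens with 8-dim features, stretched to 40 output frames.
feats = np.random.randn(5, 8).astype(np.float32)
lengths = np.array([6.0, 10.0, 4.0, 12.0, 8.0])
aligned = monotonic_interpolation(feats, lengths, num_out_frames=40)
print(aligned.shape)  # (40, 8)

Because every output frame is a smooth, softmax-weighted mixture of token features centred on the cumulative predicted lengths, the alignment is monotonic by construction and the interpolation weights vary smoothly with the predicted lengths, so it can be trained jointly with the adversarial feedback and spectrogram prediction losses described above.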

Related research

03/21/2022
Differentiable Duration Modeling for End-to-End Text-to-Speech
Parallel text-to-speech (TTS) models have recently enabled fast and high...

09/25/2019
High Fidelity Speech Synthesis with Adversarial Networks
Generative adversarial networks have seen rapid development in recent ye...

06/11/2021
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Several recent end-to-end text-to-speech (TTS) models enabling single-st...

03/29/2017
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages,...

03/04/2020
AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
Targeting at both high efficiency and performance, we propose AlignTTS t...

11/19/2021
Differentiable Wavetable Synthesis
Differentiable Wavetable Synthesis (DWTS) is a technique for neural audi...

02/18/2019
End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model
Time-aligned lyrics can enrich the music listening experience by enablin...
