Neural HMMs are all you need (for high-quality attention-free TTS)

08/30/2021
by Shivam Mehta, et al.

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. This paper demonstrates that the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network. This leads to an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations. We discuss how to combine innovations from both classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving the same naturalness prior to the post-net. Unlike Tacotron 2, our system also allows easy control over speaking rate. Audio examples and code are available at https://shivammehta007.github.io/Neural-HMM/
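
To make the abstract's claim of maximising "the full sequence likelihood without approximations" concrete, below is a minimal NumPy sketch (not the authors' released code) of the log-space forward algorithm for a left-right no-skip HMM. In the paper's setting a neural network would produce the per-frame emission log-probabilities and per-state transition probabilities; here both are random placeholders, and the names forward_log_likelihood, emit_logp and advance_p are illustrative, not from the paper.

```python
# Minimal sketch, assuming a left-right no-skip topology in which state n
# can only stay (n -> n) or advance (n -> n + 1), starting in the first
# state and ending in the last. Placeholder values stand in for network
# outputs; this is not the authors' implementation.
import numpy as np

def forward_log_likelihood(emit_logp, advance_p):
    """emit_logp: (T, N) log p(frame_t | state_n), from the decoder network.
    advance_p: (N,) probability of moving from state n to state n + 1.
    Returns log p(frames) under the left-right no-skip HMM."""
    T, N = emit_logp.shape
    log_stay = np.log1p(-advance_p)   # log P(n -> n)
    log_adv = np.log(advance_p)       # log P(n -> n + 1)

    alpha = np.full(N, -np.inf)
    alpha[0] = emit_logp[0, 0]        # must start in the first state
    for t in range(1, T):
        stay = alpha + log_stay                       # remain in state n
        adv = np.full(N, -np.inf)
        adv[1:] = alpha[:-1] + log_adv[:-1]           # arrive from n - 1
        alpha = np.logaddexp(stay, adv) + emit_logp[t]
    return alpha[-1]                  # must finish in the last state

# Toy usage with random placeholders standing in for network outputs:
rng = np.random.default_rng(0)
T, N = 50, 10                         # 50 acoustic frames, 10 HMM states
emit_logp = rng.normal(-5.0, 1.0, (T, N))
advance_p = rng.uniform(0.1, 0.5, N)
print(forward_log_likelihood(emit_logp, advance_p))
```

Because each state's duration is geometric in its self-transition probability, rescaling advance_p at synthesis time is one plausible way to realise the speaking-rate control the abstract mentions, though the paper's actual control mechanism may differ.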


Related research

11/13/2022 · OverFlow: Putting flows on top of neural transducers for better TTS
Neural HMMs are a type of neural transducer recently proposed for sequen...

08/09/2020 · SpeedySpeech: Efficient Neural Speech Synthesis
While recent neural sequence-to-sequence models have greatly improved th...

12/07/2020 · EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
In this work, we address the Text-to-Speech (TTS) task by proposing a no...

05/29/2018 · Can DNNs Learn to Lipread Full Sentences?
Finding visual features and suitable models for lipreading tasks that ar...

05/22/2020 · Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet hav...

05/10/2020 · CTC-synchronous Training for Monotonic Attention Model
Monotonic chunkwise attention (MoChA) has been studied for the online st...

11/25/2020 · FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge
Nowadays more and more applications can benefit from edge-based text-to-...
