Close to Human Quality TTS with Transformer

09/19/2018
by Naihan Li, et al.

Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by self-attention, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts to the final audio. Experiments are conducted to test the efficiency and performance of the new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and comes very close to human quality (4.39 vs. 4.44 in MOS).
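
Since the abstract only outlines the architecture, here is a minimal, illustrative sketch (not the authors' implementation) of the core idea: replacing Tacotron2's RNNs and attention with multi-head self-attention that maps phoneme embeddings to mel-spectrogram frames. It assumes PyTorch's built-in nn.Transformer; the module names, dimensions (d_model=256, n_mels=80), and learned positional embeddings are assumptions chosen for brevity, and the WaveNet vocoder that would turn the predicted mel spectrogram into audio is not shown.

```python
# Minimal sketch of a Transformer-based acoustic model: phoneme IDs -> mel
# frames. Hyperparameters and layer choices are illustrative assumptions,
# not values from the paper.
import torch
import torch.nn as nn


class TransformerTTSSketch(nn.Module):
    def __init__(self, n_phonemes=100, n_mels=80, d_model=256,
                 nhead=4, num_layers=3, max_len=1000):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # Learned positional embeddings stand in for the paper's positional
        # encodings (an assumption made for brevity).
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.mel_prenet = nn.Linear(n_mels, d_model)
        # Multi-head self-attention replaces the RNN encoder/decoder, so all
        # time steps of a sequence are processed in parallel during training.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)   # predicted mel frames
        self.stop_out = nn.Linear(d_model, 1)       # stop-token logits

    def forward(self, phonemes, mels):
        # phonemes: (batch, src_len) int64; mels: (batch, tgt_len, n_mels)
        src_pos = torch.arange(phonemes.size(1), device=phonemes.device)
        tgt_pos = torch.arange(mels.size(1), device=mels.device)
        src = self.phoneme_emb(phonemes) + self.pos_emb(src_pos)
        tgt = self.mel_prenet(mels) + self.pos_emb(tgt_pos)
        # Causal mask: each decoder frame attends only to earlier frames.
        causal = self.transformer.generate_square_subsequent_mask(
            mels.size(1)).to(mels.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.mel_out(hidden), self.stop_out(hidden)


if __name__ == "__main__":
    model = TransformerTTSSketch()
    phonemes = torch.randint(0, 100, (2, 20))   # dummy phoneme IDs
    mels = torch.zeros(2, 50, 80)               # dummy (teacher-forced) mels
    mel_pred, stop_pred = model(phonemes, mels)
    print(mel_pred.shape, stop_pred.shape)      # (2, 50, 80) (2, 50, 1)
```

In this sketch the decoder is teacher-forced on ground-truth mel frames under a causal mask during training; at inference, frames would be generated autoregressively and then passed to a neural vocoder.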

Related research

09/05/2019 · Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network
Due to the highly parallelizable architecture, Transformer is faster to ...

11/01/2018 · Hybrid Self-Attention Network for Machine Translation
The encoder-decoder is the typical framework for Neural Machine Translat...

02/24/2020 · Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Transformer-based models have brought a radical change to neural machine...

12/29/2021 · Universal Transformer Hawkes Process with Adaptive Recursive Iteration
Asynchronous events sequences are widely distributed in the natural worl...

03/04/2020 · AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
Targeting at both high efficiency and performance, we propose AlignTTS t...

07/27/2021 · PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution
Since the superiority of Transformer in learning long-term dependency, t...