Self-Attention Linguistic-Acoustic Decoder

by Santiago Pascual, et al.
Universitat Politècnica de Catalunya

The conversion from text to speech relies on an accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, we try to overcome the limitations of the recursive structure by using a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder network is competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU inference time. On average, it increases Mel cepstral distortion by 0.1 to 0.3 dB, but it is over an order of magnitude faster. Fast inference is important for the deployment of speech synthesis systems on devices with restricted resources, like mobile phones or embedded systems, where speaking virtual assistants are gaining importance.
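To make the idea of emulating recurrence with attention and positioning codes concrete, the following is a minimal sketch, not the authors' implementation. It assumes sinusoidal position codes in the style of the original transformer, uses identity projections instead of learned query/key/value weights, and applies no masking of future frames; the abstract specifies none of these choices. The names `positional_codes`, `self_attention`, and the toy `linguistic` frames are hypothetical and used only for illustration.

```python
import math


def positional_codes(num_frames, dim):
    """Sinusoidal position codes: one dim-sized vector per frame (assumed scheme)."""
    codes = []
    for pos in range(num_frames):
        row = []
        for i in range(dim):
            # Each sin/cos pair shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        codes.append(row)
    return codes


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections (no learned weights)."""
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in frames:
        # Every frame scores every other frame directly; no step-by-step recursion.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(a * value[d] for a, value in zip(attn, frames)) for d in range(dim)]
        )
    return outputs, weights


# Toy stand-in for a sequence of linguistic feature frames (hypothetical values).
linguistic = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.3, 0.3, 0.3, 0.3],
]

# Position codes are added to the inputs so attention can tell frames apart by order.
codes = positional_codes(len(linguistic), 4)
inputs = [[x + p for x, p in zip(frame, code)] for frame, code in zip(linguistic, codes)]

outputs, weights = self_attention(inputs)
```

Because each output frame depends only on a weighted sum over all input frames, the whole sequence can be computed in parallel, which is the property the abstract credits for faster inference compared with a recurrent model.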


