Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding

11/01/2019
by Pan Zhou, et al.

The Transformer has recently shown promising results in many sequence-to-sequence transformation tasks. It replaces the recurrent neural networks (RNNs) of attention-based encoder-decoder (AED) models with stacks of feed-forward self-attention layers in both the encoder and decoder. A self-attention layer learns temporal dependence by incorporating sinusoidal positional embeddings of the tokens in a sequence, which enables parallel computation and therefore faster training iterations than the sequential operation of an RNN. The deeper layers of the Transformer also allow it to outperform RNN-based AED models. However, this parallelization makes it hard to apply schedule sampling during training. Self-attention with sinusoidal positional embeddings may also degrade performance on longer sequences that contain similar acoustic or semantic information at different positions. To address these problems, we propose parallel schedule sampling (PSS) and relative positional embedding (RPE) to help the Transformer generalize to unseen data. On a 10,000-hour ASR task, our proposed methods achieve a 7% relative improvement for short utterances and 30% for long utterances.
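For context, the absolute sinusoidal positional embedding the abstract contrasts with RPE can be sketched as follows. This is a minimal pure-Python illustration of the standard formula PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); it is not code from the paper, and the function name is our own:

```python
import math

def sinusoidal_positional_embedding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000.0 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Because these embeddings depend only on absolute position, two stretches of a long utterance with similar acoustic content but different positions receive different embeddings; relative positional embedding instead conditions attention on token distance, which is the mismatch the abstract targets.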


Related research

- 04/28/2018: Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese
- 08/22/2021: Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
- 06/18/2020: Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR
- 09/24/2022: A Deep Investigation of RNN and Self-attention for the Cyrillic-Traditional Mongolian Bidirectional Conversion
- 02/07/2020: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
- 11/17/2020: s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis
- 02/04/2023: Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation
