Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention

03/29/2021
by Chengdong Liang, et al.

Self-attention (SA), which encodes vector sequences according to their pairwise similarity, is widely used in speech recognition for its strong context-modeling ability. However, its accuracy degrades on long sequences: the weighted-average operator can disperse the attention distribution, so that the relationships between adjacent signals are ignored. To address this issue, we introduce relative-position-aware self-attention (RPSA), which retains the global dependency modeling ability of self-attention while improving its localness modeling. Because the local window length of the original RPSA is fixed and sensitive to different test data, we further propose Gaussian-based self-attention (GSA), whose window length is learnable and adapts automatically to the test data. We then generalize GSA to a new residual Gaussian self-attention (resGSA) for further performance improvement. We apply RPSA, GSA, and resGSA to Transformer-based speech recognition, respectively. Experimental results on the AISHELL-1 Mandarin speech recognition corpus demonstrate the effectiveness of the proposed methods. For example, the resGSA-Transformer achieves a character error rate (CER) of 5.86%, lower than that of the SA-Transformer. Although the proposed resGSA-Transformer performs only slightly better than the RPSA-Transformer, it does not require manual tuning of the window length.
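
The abstract describes GSA as standard self-attention whose attention logits are biased by a Gaussian window over relative positions, with a learnable window width. Below is a minimal PyTorch sketch of that idea; the class name, the single-head layout, the softplus parameterization of the width, and the exact form of the bias are illustrative assumptions rather than the authors' implementation (resGSA would additionally combine such a branch with plain attention through a residual connection).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianSelfAttention(nn.Module):
    """Single-head self-attention with a learnable Gaussian locality bias
    over relative positions (illustrative sketch, not the paper's code)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.d_model = d_model
        # Learnable window parameter; softplus keeps the width positive,
        # so the effective window length is learned from data rather than
        # hand-tuned as in a fixed-window RPSA.
        self.raw_sigma = nn.Parameter(torch.tensor(5.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        T = x.size(1)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_model ** 0.5  # (B, T, T)

        # Squared distance between query and key positions.
        pos = torch.arange(T, device=x.device, dtype=x.dtype)
        dist2 = (pos[:, None] - pos[None, :]) ** 2  # (T, T)

        # Gaussian penalty on the logits: distant frames are down-weighted,
        # keeping global attention while emphasizing local context.
        sigma = F.softplus(self.raw_sigma)
        scores = scores - dist2 / (2.0 * sigma ** 2)

        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, v)
```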
