Simplified Self-Attention for Transformer-based End-to-End Speech Recognition

05/21/2020
by Haoneng Luo, et al.

Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks, owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly consist of two submodules: position-wise feedforward layers and self-attention network (SAN) layers. In this paper, to reduce model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer that employs an FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1 task and on internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model achieves over 20% relative reduction in model parameters on the AISHELL-1 task. With this roughly 20% parameter reduction, the proposed model shows no loss of recognition performance on the 20,000-hour large-scale task.
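As a rough illustration of the idea, the following is a minimal PyTorch sketch of such a layer, assuming the FSMN memory block can be approximated by a depthwise 1-D convolution over time and that the value and output projections of a standard SAN layer are kept. The class name, window size, and the choice to share the memory output between query and key are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedSelfAttention(nn.Module):
    """Sketch of an SSAN-style layer: query/key projections replaced by a memory block."""

    def __init__(self, d_model: int, n_heads: int, memory_size: int = 11):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # FSMN-style memory block approximated as a per-channel weighted sum over
        # a window of neighbouring frames (depthwise temporal convolution).
        self.memory = nn.Conv1d(
            d_model, d_model, kernel_size=memory_size,
            padding=memory_size // 2, groups=d_model, bias=False)
        # Value and output projections kept as in a conventional SAN layer (assumption).
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, d = x.shape
        # Query and key both come from the memory block instead of Q/K projections.
        mem = self.memory(x.transpose(1, 2)).transpose(1, 2)            # (b, t, d)
        q = mem.view(b, t, self.n_heads, self.d_head).transpose(1, 2)   # (b, h, t, d_head)
        k = q                                                           # shared with query
        v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v).transpose(1, 2).reshape(b, t, d)
        return self.w_o(out)

Dropping the learned query and key projection matrices in favour of a single per-channel memory block is what reduces the parameter count relative to a conventional SAN layer in this sketch.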

Related research:

05/21/2020 - SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
End-to-end speech recognition has become popular in recent years, since ...

10/28/2019 - Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
We explore options to use Transformer networks in neural transducer for ...

10/29/2022 - XNOR-FORMER: Learning Accurate Approximations in Long Speech Transformers
Transformers are among the state of the art for many tasks in speech, vi...

04/30/2019 - Very Deep Self-Attention Networks for End-to-End Speech Recognition
Recently, end-to-end sequence-to-sequence models for speech recognition ...

03/19/2022 - Similarity and Content-based Phonetic Self Attention for Speech Recognition
Transformer-based speech recognition models have achieved great success ...

07/06/2022 - Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
Conformer has proven to be effective in many speech processing tasks. It...

01/06/2022 - TransVPR: Transformer-based place recognition with multi-level attention aggregation
Visual place recognition is a challenging task for applications such as ...
