Very Deep Self-Attention Networks for End-to-End Speech Recognition

04/30/2019
by Ngoc-Quan Pham, et al.

Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolved around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed the performance of previous end-to-end approaches and even match conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both the encoder and decoder, combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieves 9.9% and 17.7% WER on the Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
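The "stochastic residual connections" mentioned in the abstract follow the stochastic-depth idea: during training each residual sub-layer is randomly skipped, and its contribution is rescaled at inference to match the training-time expectation. Below is a minimal PyTorch sketch of one such Transformer encoder layer; the class name StochasticTransformerLayer, the drop probability p_drop, and all dimensions are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation) of a pre-norm Transformer
# encoder layer with stochastic residual connections. During training each
# sub-layer is skipped with probability p_drop; at inference its output is
# scaled by (1 - p_drop) to preserve the expected activation magnitude.
import torch
import torch.nn as nn

class StochasticTransformerLayer(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, p_drop=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_drop = p_drop  # illustrative value, not from the paper

    def _residual(self, x, sublayer):
        if self.training:
            if torch.rand(1).item() < self.p_drop:
                return x                       # skip the sub-layer entirely
            return x + sublayer(x)             # full residual update
        return x + (1.0 - self.p_drop) * sublayer(x)  # expected value at test time

    def _self_attn(self, h):
        q = self.norm1(h)
        return self.attn(q, q, q, need_weights=False)[0]

    def forward(self, x):
        x = self._residual(x, self._self_attn)
        x = self._residual(x, lambda h: self.ff(self.norm2(h)))
        return x

# Example usage: one layer over a (batch, time, d_model) feature sequence.
layer = StochasticTransformerLayer()
out = layer(torch.randn(4, 100, 512))
```

A common design choice with stochastic depth is to let p_drop grow with layer index, so deeper layers in a 48-layer stack are skipped more often and the network behaves like an ensemble of shallower sub-networks; whether to use such a linear schedule is a tunable decision, not fixed by the sketch above.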
