Deep Sparse Conformer for Speech Recognition

09/01/2022
by   Xianchao Wu, et al.

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by combining the Transformer's capacity to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In a Conformer block, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions: sparser and deeper. We adapt a sparse self-attention mechanism with 𝒪(L log L) time complexity and memory usage. A deep normalization strategy is applied to the residual connections to enable stable training of encoders with a hundred or more Conformer blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves CERs of 5.52%, 4.03%, and 4.50% on the three evaluation sets, and 4.16%, 2.84%, and 3.20% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and 100 encoder layers.
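
The block layout named in the abstract (macaron feed-forward sandwich, half-step residuals, post layer normalization) follows the original Conformer design. Below is a minimal PyTorch sketch of that layout, assuming PyTorch as the framework; the convolution module is simplified to a single depthwise convolution, whereas the full Conformer version also includes pointwise convolutions, GLU, and BatchNorm.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Minimal sketch of a Conformer block: two macaron-style feed-forward
    modules with half-step residuals sandwich the self-attention and
    convolution modules, followed by a final (post) layer normalization."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 ff_mult: int = 4, conv_kernel: int = 31):
        super().__init__()
        def feed_forward():
            return nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, ff_mult * d_model),
                nn.SiLU(),  # Swish activation used in Conformer
                nn.Linear(ff_mult * d_model, d_model),
            )
        self.ff1, self.ff2 = feed_forward(), feed_forward()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        # Depthwise conv over time; the full Conformer conv module also has
        # pointwise convs, GLU, and BatchNorm, omitted here for brevity.
        self.dw_conv = nn.Conv1d(d_model, d_model, conv_kernel,
                                 padding=conv_kernel // 2, groups=d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)                  # half-step residual
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                  # self-attention module
        c = self.conv_norm(x).transpose(1, 2)      # (batch, d_model, time)
        x = x + self.dw_conv(c).transpose(1, 2)    # convolution module
        x = x + 0.5 * self.ff2(x)                  # half-step residual
        return self.final_norm(x)                  # post layer normalization
```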
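The abstract states only the 𝒪(L log L) complexity class of the sparse self-attention, not the exact mechanism. One well-known scheme in that class is ProbSparse self-attention from Informer (Zhou et al.): each query is scored on only 𝒪(log L) sampled keys, the top 𝒪(log L) "active" queries attend exactly, and the remaining queries fall back to the mean of the values. The sketch below illustrates that idea under the assumption that the paper's mechanism is of this kind; it is not a reproduction of the paper's exact formulation.

```python
import math
import torch

def prob_sparse_attention(q, k, v, factor: int = 5):
    """Illustrative O(L log L) sparse self-attention in the spirit of
    Informer's ProbSparse mechanism (an assumption; the abstract only
    gives the complexity class). q, k, v: (batch, heads, L, d)."""
    B, H, L, D = q.shape
    u = min(L, factor * math.ceil(math.log(L + 1)))  # active queries
    s = min(L, factor * math.ceil(math.log(L + 1)))  # sampled keys per query

    # 1) Score each query on a random subset of keys: O(L log L) work.
    idx = torch.randint(L, (s,), device=q.device)
    sample_scores = q @ k[:, :, idx].transpose(-2, -1) / math.sqrt(D)
    # Sparsity measure per query: max sampled score minus mean sampled score.
    m = sample_scores.max(-1).values - sample_scores.mean(-1)   # (B, H, L)

    # 2) Exact attention only for the top-u "active" queries: O(L log L).
    top = m.topk(u, dim=-1).indices                             # (B, H, u)
    q_act = q.gather(2, top.unsqueeze(-1).expand(-1, -1, -1, D))
    attn = torch.softmax(q_act @ k.transpose(-2, -1) / math.sqrt(D), dim=-1)

    # 3) Lazy queries output the mean of values; active queries overwrite.
    out = v.mean(dim=2, keepdim=True).expand(B, H, L, D).clone()
    out.scatter_(2, top.unsqueeze(-1).expand(-1, -1, -1, D), attn @ v)
    return out

# Example: q = k = v = torch.randn(2, 4, 1000, 64)
#          out = prob_sparse_attention(q, k, v)   # out: (2, 4, 1000, 64)
```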
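The "deep normalization strategy" for residual connections is consistent with the DeepNorm recipe of DeepNet (Wang et al., 2022), which scales the residual branch by α = (2N)^(1/4) for an N-layer encoder before a post-LayerNorm. Assuming that formulation, a minimal sketch:

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Residual connection with deep normalization, assumed to follow the
    DeepNorm recipe of DeepNet (Wang et al., 2022):
        out = LayerNorm(alpha * x + sublayer(x)),  alpha = (2N)^(1/4)
    for an N-layer encoder. The abstract names only a 'deep normalization
    strategy'; DeepNorm is the assumed instantiation."""

    def __init__(self, sublayer: nn.Module, d_model: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = (2 * num_layers) ** 0.25  # e.g. ~3.76 for N = 100
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

# Hypothetical usage: wrap each sublayer of a 100-layer encoder, e.g.
#   DeepNormResidual(feed_forward, d_model=256, num_layers=100)
```

Up-weighting the identity path by α limits how much any single sublayer can perturb the residual stream, which is what makes training at hundred-block depth stable; DeepNet additionally scales sublayer weight initialization by β = (8N)^(-1/4), omitted here.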


research · 09/30/2022
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
The Transformer architecture model, based on self-attention and multi-he...

research · 09/01/2022
Attention Enhanced Citrinet for Speech Recognition
Citrinet is an end-to-end convolutional Connectionist Temporal Classific...

research · 10/28/2019
DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition
Self-attention networks (SAN) have been introduced into automatic speech...

research · 06/02/2022
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
The recently proposed Conformer model has become the de facto backbone m...

research · 11/08/2020
On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
Self-attention models such as Transformers, which can capture temporal r...

research · 10/31/2022
FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition
The recently proposed Conformer architecture which combines convolution ...

research · 08/22/2021
Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
Recurrent neural network transducers (RNN-T) are a promising end-to-end ...
