Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

06/02/2022
by Sehoon Kim, et al.

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks thanks to its hybrid attention-convolution architecture, which captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After reexamining the design choices for both the macro- and micro-architecture of Conformer, we propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed by feed-forward modules, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depth-wise downsampling layer to sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same amount of FLOPs. Our code is open-sourced and available online.
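To make the macro-architecture changes concrete, here is a minimal PyTorch sketch of the revised block ordering (multi-head attention or convolution, each followed by a feed-forward module) and the Temporal U-Net skeleton with its depth-wise downsampling layer. All module compositions, class names (SqueezeformerBlock, TemporalUNet), and hyperparameters (d_model, block counts, kernel sizes) are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SqueezeformerBlock(nn.Module):
    """One block in the simplified ordering: MHA -> FFN -> Conv -> FFN,
    i.e. each attention/convolution module is followed by a feed-forward
    module, rather than Conformer's Macaron (FFN-sandwich) structure.
    Sub-module composition and norm placement here are assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn1 = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2,
                      groups=d_model),        # depth-wise temporal convolution
            nn.SiLU(),                        # single Swish activation (simplified)
            nn.Conv1d(d_model, d_model, 1),   # point-wise projection
        )
        self.ffn2 = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); residual connection around each sub-module
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        x = self.norms[0](x + attn_out)
        x = self.norms[1](x + self.ffn1(x))
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norms[2](x + conv_out)
        x = self.norms[3](x + self.ffn2(x))
        return x


class TemporalUNet(nn.Module):
    """Temporal U-Net skeleton: run blocks at the full frame rate, halve the
    temporal resolution with a depth-wise strided convolution, run more
    blocks on the shorter sequence, then upsample and add back the
    pre-downsampling features as a skip connection."""

    def __init__(self, d_model: int = 256, n_pre: int = 7, n_post: int = 6):
        super().__init__()
        self.pre = nn.ModuleList([SqueezeformerBlock(d_model) for _ in range(n_pre)])
        self.down = nn.Conv1d(d_model, d_model, 3, stride=2, padding=1,
                              groups=d_model)  # depth-wise 2x subsampling
        self.post = nn.ModuleList([SqueezeformerBlock(d_model) for _ in range(n_post)])
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.pre:
            x = block(x)
        skip = x
        x = self.down(x.transpose(1, 2)).transpose(1, 2)  # (B, ~T/2, D)
        for block in self.post:
            x = block(x)
        x = self.up(x.transpose(1, 2)).transpose(1, 2)    # back to ~T frames
        return x[:, : skip.size(1)] + skip                # trim odd lengths, fuse skip
```

Halving the temporal resolution mid-network is where the attention savings come from: self-attention cost grows quadratically with sequence length, so the post-downsampling blocks run at roughly a quarter of the attention FLOPs, while the skip connection restores fine-grained temporal detail at the output.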

Related research

10/31/2022
FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition
The recently proposed Conformer architecture which combines convolution ...

09/01/2022
Attention Enhanced Citrinet for Speech Recognition
Citrinet is an end-to-end convolutional Connectionist Temporal Classific...

09/01/2022
Deep Sparse Conformer for Speech Recognition
Conformer has achieved impressive results in Automatic Speech Recognitio...

08/31/2021
Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition
The recently proposed Conformer architecture has shown state-of-the-art ...

05/29/2023
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
State-of-the-art ASR systems have achieved promising results by modeling...

06/10/2021
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
Attention based language models have become a critical component in stat...

11/23/2021
Romanian Speech Recognition Experiments from the ROBIN Project
One of the fundamental functionalities for accepting a socially assistiv...
