DeepAI AI Chat
Log In Sign Up

Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

by   Andrei Andrusenko, et al.
ITMO University
Speech Technology Center

Optimization of modern ASR architectures is among the highest priority tasks since it saves many computational resources for model training and inference. The work proposes a new Uconv-Conformer architecture based on the standard Conformer model. It consistently reduces the input sequence length by 16 times, which results in speeding up the work of the intermediate layers. To solve the convergence issue connected with such a significant reduction of the time dimension, we use upsampling blocks like in the U-Net architecture to ensure the correct CTC loss calculation and stabilize network training. The Uconv-Conformer architecture appears to be not only faster in terms of training and inference speed but also shows better WER compared to the baseline Conformer. Our best Uconv-Conformer model shows 47.8 acceleration on the CPU and GPU, respectively. Relative WER reduction is 7.3 and 9.2


page 1

page 2

page 3

page 4


Improved Mask-CTC for Non-Autoregressive End-to-End ASR

For real-world deployment of automatic speech recognition (ASR), the sys...

Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions

This paper proposes a method to relax the conditional independence assum...

A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems

It's challenging to customize transducer-based automatic speech recognit...

Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Transformer-based end-to-end speech recognition models have received con...

Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition

In Uyghur speech, consonant and vowel reduction are often encountered, e...

Avoid Overthinking in Self-Supervised Models for Speech Recognition

Self-supervised learning (SSL) models reshaped our approach to speech, l...

SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale

End-to-end automatic speech recognition systems represent the state of t...