On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

06/26/2022
by   Raviraj Joshi, et al.

Streaming automatic speech recognition (ASR) models are more popular and better suited to voice-based applications. However, non-streaming models provide better performance because they look at the entire audio context. To leverage the benefits of a non-streaming model in streaming applications like voice search, it is commonly used in a second-pass re-scoring mode: the candidate hypotheses generated by the streaming model are re-scored using the non-streaming model. In this work, we evaluate non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall, we show that the Transformer model offers acceptable word error rate (WER) with the lowest latency requirements. We report a relative WER improvement of around 16% with a latency overhead under 5 ms. We also highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable WER. Moreover, we observe that in the second-pass re-scoring mode all the encoders provide similar benefits, whereas the difference in performance is prominent in the standalone text generation mode.
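As a rough illustration of the second-pass re-scoring idea described above, the following minimal sketch re-ranks an n-best list from a first-pass (streaming) model using scores from a second-pass (non-streaming) model. The function names, the linear score interpolation, and the `weight` parameter are assumptions made for this example; the abstract does not specify how the two scores are combined. A small helper for relative WER improvement is included for reference.

```python
from typing import Callable, List, Tuple

# Hypothetical types: each candidate is (text, first_pass_log_score).
Candidate = Tuple[str, float]


def rescore_nbest(
    nbest: List[Candidate],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Return the hypothesis with the best interpolated score.

    Assumption: scores are log-probabilities (higher is better) and are
    combined by simple linear interpolation. This is one common choice,
    not necessarily the one used in the paper.
    """
    def combined(candidate: Candidate) -> float:
        text, first_pass = candidate
        return (1.0 - weight) * first_pass + weight * second_pass_score(text)

    return max(nbest, key=combined)[0]


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy n-best list with made-up scores, for illustration only.
    toy_nbest = [("by shoes", -1.0), ("buy shoes", -1.2)]

    # Stand-in for a non-streaming model (e.g. an LAS model scoring a text).
    def toy_scorer(text: str) -> float:
        return 0.0 if text.startswith("buy") else -2.0

    print(rescore_nbest(toy_nbest, toy_scorer))
```

In this toy setup the second-pass scorer overrides the first-pass ranking; in practice the scorer would be the non-streaming model evaluating each hypothesis against the full audio.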

research
10/27/2020

Cascaded encoders for unifying streaming and non-streaming ASR

End-to-end (E2E) automatic speech recognition (ASR) models, by now, have...
research
10/31/2022

Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

Recently, there has been an increasing interest in two-pass streaming en...
research
11/15/2021

Attention based end to end Speech Recognition for Voice Search in Hindi and English

We describe here our work with automatic speech recognition (ASR) in the...
research
10/06/2022

WakeUpNet: A Mobile-Transformer based Framework for End-to-End Streaming Voice Trigger

End-to-end models have gradually become the main technical stream for vo...
research
05/29/2023

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based a...
research
09/20/2023

Speak While You Think: Streaming Speech Synthesis During Text Generation

Large Language Models (LLMs) demonstrate impressive capabilities, yet in...
research
06/29/2022

On the Prediction Network Architecture in RNN-T for ASR

RNN-T models have gained popularity in the literature and in commercial ...
