A comparison of end-to-end models for long-form speech recognition

by   Chung-Cheng Chiu, et al.

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.


End-to-end attention-based distant speech recognition with Highway LSTM

End-to-end attention-based models have been shown to be competitive alte...

A review of on-device fully neural end-to-end automatic speech recognition algorithms

In this paper, we review various end-to-end automatic speech recognition...

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Improving the performance of end-to-end ASR models on long utterances ra...

VADOI:Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

While end-to-end models have shown great success on the Automatic Speech...

Attention-based Transducer for Online Speech Recognition

Recent studies reveal the potential of recurrent neural network transduc...

Recognizing long-form speech using streaming end-to-end models

All-neural end-to-end (E2E) automatic speech recognition (ASR) systems t...

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

The network architecture of end-to-end (E2E) automatic speech recognitio...

Please sign up or login with your details

Forgot password? Click here to reset