On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

05/28/2020
by Jinyu Li, et al.

Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder is properly initialized. Among the three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.


research
10/22/2020

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great succes...
research
07/30/2020

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

Because of its streaming nature, recurrent neural network transducer (RN...
research
11/09/2020

Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

In this work, to measure the accuracy and efficiency for a latency-contr...
research
06/29/2022

On the Prediction Network Architecture in RNN-T for ASR

RNN-T models have gained popularity in the literature and in commercial ...
research
05/18/2020

Attention-based Transducer for Online Speech Recognition

Recent studies reveal the potential of recurrent neural network transduc...
research
08/12/2020

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

Transfer learning (TL) is widely used in conventional hybrid automatic s...
research
04/22/2020

Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription

While end-to-end ASR systems have proven competitive with the convention...
