In recent years, we have witnessed significant progress in automatic speech recognition (ASR), mainly due to the use of deep learning algorithms [17, 28]. Deep-model-based ASR systems have mainly focused on the hybrid framework and consist of many components, including an acoustic model (AM), a pronunciation model, and a language model (LM). These components are trained separately with different objective functions, and extra expert linguistic knowledge may be needed. Recently, an emerging trend in the ASR community is to rectify this disjoint training issue by replacing hybrid systems with end-to-end (E2E) systems [26, 21, 5, 23, 7, 3, 27, 15, 25, 19]. The three major E2E approaches are built on Connectionist Temporal Classification (CTC) [11, 12, 10], the Attention-based Encoder-Decoder (AED) [8, 1, 2, 9, 5], and the recurrent neural network transducer (RNN-T) [13, 24, 20]. Unlike conventional hybrid models, E2E models do not require token alignment information between the input acoustic frames and the output token sequence during training.
CTC maps the input speech frames to the target label sequence by marginalizing over all possible alignments. A dynamic-programming-based forward-backward algorithm is usually used to train the model. An advantage of the CTC approach is that it performs frame-level decoding, as the conventional hybrid model does, and hence can be applied to online speech recognition. However, one disadvantage is its conditional independence assumption given the input acoustic frames. AED, on the other hand, makes no such assumption and is presumably more powerful than CTC for the speech recognition task. However, one drawback of the AED model is that the entire input sequence is required to start the decoding process due to the global attention mechanism, which makes real-time streaming ASR challenging, despite some recent attempts in this direction [6, 22].
RNN-T is an extension to CTC, which consists of three components: an encoder, a prediction network, and a joint network that integrates the outputs of the encoder and prediction networks to predict the target labels. RNN-T overcomes the conditional independence assumption of CTC with its prediction network; moreover, it allows streaming ASR because it still performs frame-level monotonic decoding. Hence, there has been a significant research effort to promote this approach in the ASR community [20, 24, 18, 4, 14], and RNN-T has recently been successfully deployed on embedded devices.
However, compared to CTC or AED, RNN-T is much more difficult to train due to its model structure and the synchronous decoding constraint. Moreover, its training is very memory-demanding because of the 3-dimensional output tensor [24, 20]; an approach that reduces this memory cost and enables large mini-batch training has been proposed. To tackle the training difficulty, initializing the encoder and prediction networks of an RNN-T with a CTC model and an RNN language model (RNNLM), respectively, has proven beneficial [24, 14]. In this paper, we explore other model initialization approaches to overcome the training difficulty of RNN-T models. Specifically, we propose to utilize external token alignment information to pre-train RNN-T. Two types of pre-training methods are investigated, referred to as encoder pre-training and whole-network pre-training, respectively. Encoder pre-training initializes only the encoder of the RNN-T, while the other components are trained from random initialization. Whole-network pre-training, as its name suggests, pre-trains the whole network with an auxiliary objective function instead of the RNN-T loss. The proposed methods are evaluated on 3,400 hours of voice assistant data and 65,000 hours of production data. The experimental results show that the accuracy of the RNN-T model can be significantly improved with our proposed pre-training methods, with up to 28% relative word error rate (WER) reduction.
The rest of this paper is organized as follows: Section 2 briefly introduces the basic RNN-T model, including model training and decoding. The two proposed pre-training methods are described in Section 3 and Section 4, respectively. Next, Section 5 presents the experimental results and analysis. Section 6 gives the conclusions.
2 RNN Transducer Model
The RNN-T model was proposed as an extension to the CTC model. A typical RNN-T model has three components, as shown in Figure 1: an encoder, a prediction network, and a joint network. Compared with CTC, RNN-T does not have the conditional independence assumption, because the prediction network emits output tokens conditioned on the previous prediction results.
To be more precise, the encoder in an RNN-T model is an RNN that maps each acoustic frame x_t to a high-level feature representation h_t^enc, where t is the time index:

h_t^enc = f^enc(x_t)
The prediction network, which is also based on RNNs, converts the previous non-blank output token y_{u-1} to a high-level representation h_u^pre, where u is the label index of each output token:

h_u^pre = f^pre(y_{u-1})
Given the hidden representations of both acoustic features and labels from the encoder and prediction network, the joint network integrates the information using a feed-forward network:

z_{t,u} = f^joint(h_t^enc, h_u^pre)
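As an illustration, the joint-network combination above can be sketched as follows; the element-wise addition used to combine the two representations, the toy dimensions, and the weight matrix are illustrative assumptions, not the exact configuration used in this paper.

```python
import math

def joint(h_enc, h_pred, W_out):
    """Hypothetical joint network: combine the encoder and prediction
    representations (here by element-wise addition, one common choice),
    project to the output vocabulary, and apply a softmax."""
    combined = [e + p for e, p in zip(h_enc, h_pred)]
    logits = [sum(w * c for w, c in zip(row, combined)) for row in W_out]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: 2-dim representations, 3 output tokens (incl. blank).
probs = joint([0.5, -0.2], [0.1, 0.3], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

The output is a probability distribution over the target tokens plus blank for one (t, u) grid point of the 3-dimensional output tensor.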
The decoding of RNN-T operates frame by frame. Starting from the first frame fed to the encoder, if the current output is not blank, then the prediction network is updated with that output token; otherwise, if the output is blank, the encoder is updated with the next frame. Decoding terminates when the last frame of the input sequence is consumed. In this way, real-time streaming is satisfied. Either greedy search or beam search can be used in the decoding stage; they store different numbers of intermediate states.
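The frame-by-frame decoding loop described above can be sketched as follows, using toy stand-ins for the three networks; the stub functions (`step_encoder`, `step_predictor`, `joint`) are hypothetical and only mimic the control flow of greedy search.

```python
def rnnt_greedy_decode(frames, step_encoder, step_predictor, joint, blank=0):
    """Sketch of frame-synchronous RNN-T greedy decoding: advance the
    encoder on blank, advance the prediction network on non-blank output."""
    hyp = []
    pred_state = step_predictor(None)      # initial prediction-network state
    for x in frames:
        h_enc = step_encoder(x)
        while True:
            token = joint(h_enc, pred_state)   # argmax token for this (t, u)
            if token == blank:
                break                          # consume the next frame
            hyp.append(token)
            pred_state = step_predictor(token) # feed back the non-blank token
    return hyp

# Toy stand-ins (assumptions): the "encoder" echoes the frame, the
# "predictor" returns the last token, and the "joint" emits the frame's
# label once per frame and blank otherwise.
frames = [(1,), (0,), (2,)]
def step_encoder(x): return x
def step_predictor(tok): return tok
def joint(h_enc, pred_state):
    return h_enc[0] if h_enc[0] != 0 and pred_state != h_enc[0] else 0

print(rnnt_greedy_decode(frames, step_encoder, step_predictor, joint))  # → [1, 2]
```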
3 Encoder Pre-training
In an RNN-T model, the encoder and prediction network usually have different model structures, which makes it difficult to train them well at the same time. Directly training an RNN-T from random initialization may yield a model biased toward one of the components, i.e., dominated by either the acoustic or the language input. Most groups adopt an initialization strategy that initializes the encoder with a CTC model and the prediction network with an RNNLM [14, 24, 30]. However, the output sequence of CTC is a series of spikes separated by blanks. Thus, after CTC-based pre-training, most encoder outputs tend to generate blank, which results in incorrect inference for the RNN-T model.
In our work, we propose to utilize external alignments to pre-train the encoder with the cross-entropy (CE) criterion. The encoder is regarded as a token classification model rather than a CTC model. As shown in the right part of Figure 2, an RNN-based token classification model is first trained with the CE loss. In this paper, we use "CE loss" to denote the cross-entropy loss function, "CTC loss" to denote the CTC forward-backward-algorithm-based loss function, and "RNN-T loss" to denote the RNN-T loss function.
In our experiments, we use word piece units as target tokens; these have been explored in the context of machine translation and successfully applied to E2E ASR [24, 10]. With word-level alignments, we can obtain the boundary frame index of each word. For a word divided into more than one word piece, we equally allocate the total frames inside the word boundary to its word pieces. There is a marginal case in which a word contains more word pieces than frames, which prevents us from generating token alignments. Such cases account for less than 0.01% of all training utterances, so we simply remove those utterances in the pre-training stage. In this way, we obtain hard alignments of target tokens for all frames.
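The equal-allocation step described above might look like the following sketch; the `##`-prefixed word-piece naming and the choice to spread any remainder frames over the first pieces are our assumptions.

```python
def allocate_frames(word_start, word_end, word_pieces):
    """Equally allocate the frames inside a word boundary to its word
    pieces; returns (piece, start_frame, end_frame) spans. Returns None
    when the word has more pieces than frames (the rare case that is
    dropped from pre-training)."""
    n_frames = word_end - word_start
    n_pieces = len(word_pieces)
    if n_pieces > n_frames:
        return None
    spans, start = [], word_start
    for i, piece in enumerate(word_pieces):
        # base share plus one extra frame for the first (n_frames % n_pieces) pieces
        length = n_frames // n_pieces + (1 if i < n_frames % n_pieces else 0)
        spans.append((piece, start, start + length))
        start += length
    return spans

# "playing" → ["play", "##ing"] over frames [10, 17): 7 frames → 4 + 3
print(allocate_frames(10, 17, ["play", "##ing"]))
```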
Based on the encoder structure, one extra fully connected (FC) layer is added on top of the encoder, whose output is used for token classification. The objective is

L_CE = − Σ_t log P(a_t | x_t), with P(· | x_t) = Softmax(FC(h_t^enc)),

where FC(·) represents the fully connected layer, t is the frame index, and the output dimension of FC(·) is the target dimension, which is also the dimension of the label. Here a_t is the word piece label for each input frame x_t. After encoder pre-training, each output h_t^enc, the high-level representation of the input acoustic features, is expected to contain the alignment information.
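A minimal numeric sketch of the frame-level CE objective, assuming per-frame posteriors have already been produced by the FC layer and a softmax (the toy posteriors and the per-frame averaging are illustrative assumptions):

```python
import math

def frame_ce_loss(frame_probs, alignment):
    """Frame-level cross-entropy used for encoder pre-training: average
    negative log-probability of each frame's aligned token."""
    return -sum(math.log(p[a]) for p, a in zip(frame_probs, alignment)) / len(alignment)

# Toy posteriors over 3 tokens for 2 frames, aligned to tokens 0 and 2.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
loss = frame_ce_loss(probs, [0, 2])
```

Minimizing this loss pushes each frame's posterior toward its hard alignment label, which is the sense in which the encoder becomes a token classification model.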
4 Whole-network Pre-training
In encoder pre-training, the encoder is regarded as performing token mapping (CTC-loss pre-training) or token aligning (CE-loss pre-training). However, these pre-training methods only consider part of the RNN-T model. In this paper, we also explore a whole-network pre-training method that uses external token alignment information. Unlike other models, the output of the RNN-T is three-dimensional. Thus, the key challenge for whole-network pre-training is the design of the label tensor: the optimizer reduces the CE loss between the output of the model and a crafted three-dimensional label tensor.
Three different types of label tensor are explored in this study; examples are given in Figure 3. In each label tensor, the horizontal axis represents the time dimension and the vertical axis the output token dimension. We represent blank as an additional class, shown as a one-hot vector in the label tensor. In the first label tensor, we set all the output target grids of each frame to the one-hot vector corresponding to its alignment label. The last row of the label tensor is set to all blanks, which indicates the end of the utterance. Thus, after pre-training, the encoder output is supposed to contain the alignment information. However, this design only considers the frame-by-frame alignment and ignores the output-token-level information. If we directly perform RNN-T decoding on it, we cannot obtain the correct inference sequence.
Thus, taking the decoding process into consideration, we design a second label tensor. Each frame is assigned its token alignment, and each target token's position is determined by its sequence order. When performing pre-training, we only compute the CE over the non-empty part of the label tensor. A blank token is inserted under each target token to ensure correct decoding results. If we directly perform the RNN-T decoding algorithm on this label tensor, the correct results should be obtained; the decoding path is illustrated by the red arrow on the label tensor in Figure 3. In the given example, directly performing decoding on it yields, after removing the blank tokens, the final result 'A B s C', which is the same as the alignment of this utterance.
However, in this design, almost half of the valid part is blank, so blank tokens dominate the pre-training process. Therefore, we design a third label tensor that keeps only the non-blank part of the second one: it retains only one grid, with its corresponding alignment label, for each frame. In order to provide blank information during the pre-training stage, we set the short pauses (space tokens shorter than 3 frames) of each utterance to blank; that is, some grids in the valid part of the label tensor become blank. After pre-training is done, we replace the CE loss with the RNN-T loss and proceed to standard RNN-T training.
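The alignment-path label tensor can be sketched as follows: each frame keeps a single labeled grid, and the row index advances when a new output token starts. Treating consecutive identical labels as one token and the exact row-indexing convention are simplifying assumptions on our part.

```python
def build_align_path(frame_labels, blank="<b>"):
    """For each frame, keep one target grid: the frame's aligned label,
    placed at the row of the current output token; blank frames stay at
    the current row. Returns a list of (frame, row, label) triples."""
    path, row, prev = [], 0, None
    for t, lab in enumerate(frame_labels):
        if lab != blank and lab != prev:
            row += 1              # a new output token starts a new row
        path.append((t, row, lab))
        prev = lab
    return path

# Frames aligned to "A A <pause> B"; the pause keeps the blank label.
print(build_align_path(["A", "A", "<b>", "B"]))
```

During pre-training, the CE would then be computed only on these (frame, row) grids rather than over the whole three-dimensional tensor.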
5 Experiments and Analysis
5.1 Experimental setup
The proposed methods are evaluated on 3,400 hours of Cortana voice assistant data and 65,000 hours of Microsoft production data. For the Cortana data, the training and test sets consist of approximately 3,400 hours and 6 hours of English audio, respectively. The 65,000 hours of production data are transcribed data from a wide range of Microsoft products. The test sets cover 13 application scenarios, such as Cortana and far-field speakers, totaling 1.8 million (M) words. Training and test material is anonymized, with personally identifiable information removed. In this work, we first evaluate the methods on the Cortana data and then evaluate the best method on the very large scale 65,000-hour production data.
The input feature is an 80-dimension log Mel filter bank vector for every 10 milliseconds (ms) of speech. Eight vectors are stacked together to form an input frame to the encoder, and the frame shift is 30 ms. All RNN-T models adopt the configuration recommended in [20, 16]. All encoders (Enc.) have 6 hidden LSTM layers, and all prediction networks (Pred.) have 2 hidden LSTM layers. The joint network has two linear layers without any activation functions. Layer normalization is used in all LSTM layers, and the hidden dimension is 1280 with the projection dimension equal to 640. The output layer models 4,000 word piece units together with the blank token. The word piece units are generated by running byte pair encoding on the acoustic training texts.
5.2 Evaluation on Cortana data
Experimental results of whole-network pre-training are shown in Table 1. The RNN-T baseline is trained from random initialization. For the pre-trained models, the whole network is first pre-trained with the CE loss and then trained with the RNN-T loss. Using the pre-trained network as a seed model, the final word error rate (WER) can be significantly reduced: all designed label tensors improve RNN-T training, achieving 10% to 12% relative WER reduction.
Model                                  WER (%)
+ Pre-train (all align)                13.53
+ Pre-train (correct decoding)         13.66
+ Pre-train (align path - sp blank)    13.23
In Table 2, we evaluate the encoder pre-training methods on the 3,400-hour Cortana data. Using a pre-trained CTC model to initialize the encoder does not improve accuracy. This is because the output of CTC is a sequence of spikes containing many blank tokens without any meaning; hence, if we use the pre-trained CTC model as the seed for the encoder of the RNN-T, most encoder outputs will generate blank, which does not help RNN-T training. When we use a CE-loss pre-trained encoder to initialize the encoder of the RNN-T, we achieve a significant improvement over training from random initialization: a 28% relative WER reduction compared with both the RNN-T baseline and CTC-based encoder pre-training.
Model            Enc. Pre-train       WER (%)
In all the encoder pre-training experiments in Table 2, the prediction network and joint network are trained from random initialization; the only difference is the parameter seed of the encoder. Comparing the CTC-loss and CE-loss based encoder pre-training methods, there is a large WER gap between the two approaches: initializing the encoder as a token-aligning model rather than a sequence-mapping model yields much better accuracy. This is because the RNN-T encoder performs frame-to-token aligning, extracting the high-level features of each input frame.
5.3 Evaluation on very large scale data
From our experiments, both encoder pre-training and whole-network pre-training can improve the performance of the RNN-T model. To obtain more convincing results, we evaluate our proposed methods on very large scale data, using the 65,000-hour Microsoft production data set. The results are shown in Table 3. Due to the very large resource requirements and computation cost, we only evaluate the CE-based encoder pre-training method, which obtained the best accuracy in the Cortana experiments. All results are obtained using beam search with a beam width of 5.
Besides our proposed methods, we evaluate the widely used CTC+RNNLM pre-training strategy [24, 30, 14] for comparison: a well-trained CTC model initializes the encoder, and a well-trained RNNLM initializes the prediction network. This CTC+RNNLM initialization reduces the average WER from 12.63 to 12.29 over the 13 test scenarios with 1.8 M words. In contrast, our proposed approach, which pre-trains the encoder with alignments using the CE loss, significantly outperforms the other methods, achieving an average WER of 11.34. Compared with training from random initialization, our proposed method obtains a 10% relative WER reduction on this very large scale task.
Model                               WER (%)
+ Pre-train (Enc. CTC, Pred. LM)    12.29
+ Pre-train (Enc. CE)               11.34
5.4 Output time delay comparison
Although RNN-T is a natural streaming model, it still has latency compared to hybrid models. With the help of alignments for model initialization, we hope to reduce the latency of RNN-T. To better understand the advantages of our proposed pre-training methods, we compare the gap between the ground-truth word alignment and the word alignment generated by greedy decoding from different RNN-T models. The visualization is performed on the test set of the Cortana data. As shown in Figure 4, the central axis represents the ground-truth word alignment. The output alignment distributions are normalized to a normal distribution. The horizontal axis represents the number of frames away from the ground-truth word alignment, and the vertical axis represents the ratio of words.
As Figure 4 shows, different RNN-T models exhibit different time delays relative to the ground truth. This is because the RNN-T model tends to look at several future frames, which provide more information for token recognition. The baseline RNN-T model has an average delay of around 10 frames. In contrast, with the proposed pre-training methods, the average delay is significantly reduced: using a CE pre-trained encoder to initialize the RNN-T reduces the average delay to 6 frames, and whole-network pre-training reduces it to 5 frames. The reason is that pre-training provides alignment information to the RNN-T model, which guides the model to make decisions earlier. This shows the advantage of our proposed pre-training methods in terms of time delay during the decoding stage.
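The average emission delay discussed above can be measured with a simple helper, assuming per-word start frames are available for both the ground-truth and the decoded alignments (the helper and its inputs are illustrative, not the exact measurement script used here):

```python
def average_delay(ref_starts, hyp_starts):
    """Average per-word emission delay (in frames) of the hypothesis
    word alignment relative to the ground-truth word alignment."""
    deltas = [h - r for r, h in zip(ref_starts, hyp_starts)]
    return sum(deltas) / len(deltas)

# Ground-truth word start frames vs. frames where a model emitted each word.
print(average_delay([3, 10, 22], [13, 20, 32]))  # → 10.0
```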
6 Conclusions
In this work, we explore the training strategy of the RNN-T model and propose two pre-training approaches that use external alignment information: encoder pre-training and whole-network pre-training. Encoder pre-training uses the CE loss to pre-train only the encoder of the RNN-T, while whole-network pre-training pre-trains the whole RNN-T model with the CE loss, using three kinds of designed label tensors. The proposed methods are evaluated on 3,400 hours of Cortana data and 65,000 hours of production data. Compared with training from random initialization, whole-network pre-training obtains a 12% relative WER reduction, and encoder pre-training obtains a 28% and a 10% relative WER reduction on the 3,400-hour Cortana and 65,000-hour production data, respectively. Compared to the widely used CTC+RNNLM initialization strategy on very large scale data, encoder pre-training still outperforms it by an 8% relative WER reduction. Our proposed methods also significantly reduce the time delay of the RNN-T model.
-  (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
-  (2016) End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–4949.
-  (2017) Exploring Neural Transducers for End-to-End Speech Recognition. In Proc. ASRU.
-  (2017) Exploring neural transducers for end-to-end speech recognition. In Proc. ASRU, pp. 206–213.
-  (2016) Listen, Attend and Spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964.
-  (2017) Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382.
-  (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP.
-  (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.
-  (2015) Attention-Based Models for Speech Recognition. In NIPS.
-  (2019) Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1880–1892.
-  (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pp. 369–376.
-  (2014) Towards End-to-End Speech Recognition with Recurrent Neural Networks. In ICML (PMLR), pp. 1764–1772.
-  (2012) Sequence Transduction with Recurrent Neural Networks. CoRR abs/1211.3711.
-  (2013) Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pp. 6645–6649.
-  (2018) Towards Discriminatively-trained HMM-based End-to-end models for Automatic Speech Recognition. In Proc. ICASSP.
-  (2019) Streaming end-to-end speech recognition for mobile devices. In Proc. ICASSP, pp. 6381–6385.
-  (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
-  (2019) Large-scale multilingual speech recognition with a streaming end-to-end model. arXiv preprint arXiv:1909.05330.
-  (2018) Advancing acoustic-to-word CTC model. In Proc. ICASSP, pp. 5794–5798.
-  (2019) Improving RNN Transducer Modeling for End-to-End Speech Recognition. In Proc. ASRU.
-  (2015) EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding. In Proc. ASRU, pp. 167–174.
-  (2019) Triggered attention for end-to-end speech recognition. In Proc. ICASSP, pp. 5666–5670.
-  (2017) A Comparison of Sequence-to-Sequence Models for Speech Recognition. In Proc. Interspeech, pp. 939–943.
-  (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU, pp. 193–199.
-  (2018) Improving the performance of online neural transducer models. In Proc. ICASSP, pp. 5864–5868.
-  (2015) Learning Acoustic Frame Labeling for Speech Recognition with Recurrent Neural Networks. In Proc. ICASSP, pp. 4280–4284.
-  (2017) Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech.
-  (2011) Conversational speech transcription using context-dependent deep neural networks. In Proc. INTERSPEECH.
-  (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
-  (2018) Exploring RNN-transducer for Chinese speech recognition. CoRR abs/1811.05097.
-  (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.