High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

03/17/2020
by Jinyu Li, et al.

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross-entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks and incorporates future context frames to gather more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM in which one head has zero latency and the other head has a small latency compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated on test sets containing 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2% relative WER reduction over the conventional LSTM acoustic model, with similar perceived latency.
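Since the abstract describes the architecture only at a high level, the following minimal PyTorch sketch illustrates one way a two-head cltLSTM could be organized: a stack of time-LSTM layers handles temporal modeling, depth-LSTM cells scan the layer trajectory for senone classification, and the two output heads differ only in whether the depth-LSTM sees future context frames. All names, dimensions, the lookahead size, and the exact wiring are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class TwoHeadCltLSTM(nn.Module):
    def __init__(self, feat_dim=80, hidden=1024, layers=4,
                 lookahead=4, num_senones=9404):
        super().__init__()
        self.lookahead = lookahead
        # Time-LSTM stack: handles temporal modeling, layer by layer.
        self.time_lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)])
        # Depth-LSTM cells scan across layers (the "layer trajectory")
        # and are responsible for target classification.
        self.depth_cell_zero = nn.LSTMCell(hidden, hidden)   # zero-latency head
        self.depth_cell_ctx = nn.LSTMCell(hidden, hidden)    # low-latency head
        # Projection of current + future frames for the contextual head.
        self.context_proj = nn.ModuleList(
            [nn.Linear(hidden * (lookahead + 1), hidden) for _ in range(layers)])
        self.head_zero = nn.Linear(hidden, num_senones)
        self.head_ctx = nn.Linear(hidden, num_senones)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        B, T, _ = x.shape
        layer_outs = []
        h = x
        for lstm in self.time_lstms:
            h, _ = lstm(h)           # temporal modeling at this layer
            layer_outs.append(h)     # each is (B, T, hidden)

        logits_zero, logits_ctx = [], []
        for t in range(T):
            hz = cz = x.new_zeros(B, self.depth_cell_zero.hidden_size)
            hc = cc = x.new_zeros(B, self.depth_cell_ctx.hidden_size)
            for l, out in enumerate(layer_outs):
                # Zero-latency head: only the current frame from each layer.
                hz, cz = self.depth_cell_zero(out[:, t], (hz, cz))
                # Contextual head: current frame plus `lookahead` future frames,
                # zero-padded near the end of the utterance.
                fut = out[:, t:t + self.lookahead + 1]
                pad = self.lookahead + 1 - fut.size(1)
                if pad > 0:
                    fut = torch.cat(
                        [fut, fut.new_zeros(B, pad, fut.size(2))], dim=1)
                ctx = self.context_proj[l](fut.reshape(B, -1))
                hc, cc = self.depth_cell_ctx(ctx, (hc, cc))
            logits_zero.append(self.head_zero(hz))
            logits_ctx.append(self.head_ctx(hc))
        # Two senone posteriors per frame: one computed with zero lookahead,
        # one that also sees a few future frames.
        return torch.stack(logits_zero, dim=1), torch.stack(logits_ctx, dim=1)

The decoupling described in the abstract is visible in this sketch: the unidirectional time-LSTMs never require future frames, so the zero-latency head can stream frame by frame, while the contextual head trades a small, fixed lookahead for additional context and hence higher accuracy.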
