Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

04/25/2021
by Thibault Doutre, et al.

Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time, since their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no access to future context, and therefore suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill knowledge from a non-streaming teacher model on unsupervised utterances, and then train a streaming student on the teacher's predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them with Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon streaming models of previous work [1]: the WER decreases by 41% for Spanish, 27% ...
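The paper itself does not include code, but the ROVER combination step it relies on can be sketched in a few lines: each teacher's 1-best hypothesis is aligned word-by-word against a backbone hypothesis, and every word position is decided by majority vote, yielding the pseudo-label on which the streaming student is trained. The sketch below is a simplified illustration only; the function names and toy hypotheses are invented for this example, and real ROVER builds a word transition network over all hypotheses and can weight votes by confidence scores.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two word sequences.

    Returns a list of (ref_word, hyp_word) pairs, where None marks an
    insertion or deletion relative to the other sequence.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match / substitution
                cost[i - 1][j] + 1,                               # deletion
                cost[i][j - 1] + 1,                               # insertion
            )
    # Backtrace to recover the aligned word pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return list(reversed(pairs))


def rover_vote(hypotheses):
    """Combine several 1-best hypotheses by per-slot majority voting.

    Simplification: the first hypothesis serves as the alignment backbone,
    and words inserted relative to it are ignored; full ROVER instead builds
    a word transition network and can weight votes by confidence.
    """
    backbone = hypotheses[0]
    slots = [[word] for word in backbone]        # one vote bucket per backbone word
    for hyp in hypotheses[1:]:
        pos = 0
        for ref_word, hyp_word in align(backbone, hyp):
            if ref_word is None:
                continue                          # insertion w.r.t. backbone: dropped here
            slots[pos].append(hyp_word)           # hyp_word may be None (a deletion vote)
            pos += 1
    voted = [Counter(votes).most_common(1)[0][0] for votes in slots]
    return [word for word in voted if word is not None]


# Hypothetical teacher outputs for one unsupervised utterance.
teacher_hyps = [
    "play the song now".split(),   # RNN-T teacher
    "play a song now".split(),     # CTC teacher
    "play the song".split(),       # another RNN-T teacher
]
print(rover_vote(teacher_hyps))    # ['play', 'the', 'song', 'now']
```

In the distillation setup described above, such a voting step would be applied to the outputs of the CTC and RNN-T teachers on each unsupervised utterance, and the resulting word sequences would serve as training targets for the causal streaming student.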

Related research

10/22/2020 · Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data
Streaming end-to-end automatic speech recognition (ASR) models are widel...

08/31/2023 · Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer
Streaming automatic speech recognition (ASR) models are restricted from ...

10/27/2022 · Contextual-Utterance Training for Automatic Speech Recognition
Recent studies of streaming automatic speech recognition (ASR) recurrent...

09/20/2023 · Speak While You Think: Streaming Speech Synthesis During Text Generation
Large Language Models (LLMs) demonstrate impressive capabilities, yet in...

05/27/2017 · Lifelong Generative Modeling
Lifelong learning is the problem of learning multiple consecutive tasks ...

06/13/2023 · DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer
Conformer-based end-to-end models have become ubiquitous these days and ...

07/20/2023 · Globally Normalising the Transducer for Streaming Speech Recognition
The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates a...
