Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

10/22/2020
by   Thibault Doutre, et al.

Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and in on-device applications. Because these models are expected to transcribe speech with minimal latency, they are constrained to be causal, with no access to future context, unlike their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method that uses a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models, not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relative to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
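The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `teacher_transcribe` and `student_update` are hypothetical callables standing in for a non-streaming teacher's decoder and one training step of a streaming student, and the `Audio` type is a placeholder for a waveform.

```python
from typing import Callable, Iterable, List, Tuple

Audio = List[float]  # placeholder waveform type (assumption, for illustration only)


def pseudo_label(
    teacher_transcribe: Callable[[Audio], str],
    unlabeled_audio: Iterable[Audio],
) -> List[Tuple[Audio, str]]:
    """Pair every unlabeled utterance with the transcript the teacher produces."""
    return [(audio, teacher_transcribe(audio)) for audio in unlabeled_audio]


def train_student(
    student_update: Callable[[Audio, str], None],
    pairs: List[Tuple[Audio, str]],
    epochs: int = 1,
) -> int:
    """Feed the teacher-labeled pairs to the student; return the number of updates."""
    steps = 0
    for _ in range(epochs):
        for audio, transcript in pairs:
            # The student treats the teacher transcript as if it were a ground-truth label.
            student_update(audio, transcript)
            steps += 1
    return steps
```

Because the teacher supplies the labels, the amount of training data for the student is limited only by how much unlabeled audio is available, not by how much has been manually transcribed.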
