Universal ASR: Unify and Improve Streaming ASR with Full-context Modeling

by Jiahui Yu, et al.

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with inplace knowledge distillation. The Universal ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public dataset LibriSpeech and an internal large-scale dataset, MultiDomain. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Universal ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
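The idea of one set of weights serving both modes, with distillation between them, can be illustrated with a small NumPy sketch. This is not the paper's implementation. It is a toy under stated assumptions: a single self-attention layer stands in for ContextNet or Conformer, a per-frame cross-entropy stands in for whatever sequence loss the actual networks use, and the full-context pass is assumed to act as the teacher for the streaming pass. All names (`make_params`, `encode`, `joint_loss`, `distill_weight`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_params(dim, vocab, seed=0):
    # Hypothetical toy weights, shared by both modes.
    rng = np.random.default_rng(seed)
    return {
        "wq": rng.normal(size=(dim, dim)),
        "wk": rng.normal(size=(dim, dim)),
        "wv": rng.normal(size=(dim, dim)),
        "wo": rng.normal(size=(dim, vocab)),
    }


def encode(frames, params, streaming):
    """Toy one-layer encoder. `streaming` changes only the attention mask,
    never the weights, so both modes share every parameter."""
    q = frames @ params["wq"]
    k = frames @ params["wk"]
    v = frames @ params["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if streaming:
        # Causal mask: frame t may attend only to frames <= t.
        num_frames = frames.shape[0]
        future = np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    context = softmax(scores) @ v
    # Per-frame label posteriors, shape (num_frames, vocab).
    return softmax(context @ params["wo"])


def joint_loss(frames, labels, params, distill_weight=1.0):
    """Sum of both modes' losses plus a distillation term (assumed form)."""
    p_full = encode(frames, params, streaming=False)   # assumed teacher
    p_stream = encode(frames, params, streaming=True)  # assumed student
    idx = np.arange(len(labels))
    ce_full = -np.log(p_full[idx, labels]).mean()
    ce_stream = -np.log(p_stream[idx, labels]).mean()
    # KL(full || stream). In an autodiff framework the teacher term would be
    # wrapped in a stop-gradient so that only the streaming pass is pulled
    # toward the full-context predictions.
    kl = (p_full * (np.log(p_full) - np.log(p_stream))).sum(axis=-1).mean()
    total = ce_full + ce_stream + distill_weight * kl
    return total, {"ce_full": ce_full, "ce_stream": ce_stream, "kl": kl}


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frames = rng.normal(size=(6, 4))     # 6 frames of 4-dim features
    labels = rng.integers(0, 5, size=6)  # toy per-frame targets
    total, parts = joint_loss(frames, labels, make_params(dim=4, vocab=5))
    print(total, parts)
```

Because the two passes differ only in the mask, "in place" here means the teacher and student are the same network evaluated twice on the same batch, so no separately trained teacher model is needed in this sketch.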



Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Recently, online end-to-end ASR has gained increasing attention. However...

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

History and future contextual information are known to be important for ...

DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer

Conformer-based end-to-end models have become ubiquitous these days and ...

Multi-mode Transformer Transducer with Stochastic Future Context

Automatic speech recognition (ASR) models make fewer errors when more su...

Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals

Millimeter wave (mmWave) based speech recognition provides more possibil...

Deformable TDNN with adaptive receptive fields for speech recognition

Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based...

Semi-Autoregressive Streaming ASR With Label Context

Non-autoregressive (NAR) modeling has gained significant interest in spe...
