Cascaded encoders for unifying streaming and non-streaming ASR

10/27/2020
by Arun Narayanan, et al.

End-to-end (E2E) automatic speech recognition (ASR) models have by now shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both of these modes simultaneously. The proposed model consists of a streaming encoder and a non-streaming encoder. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10% to 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
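
The data flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the averaging "encoders", the context sizes, and the argmax "decoder" are all hypothetical stand-ins chosen only to show the wiring. The streaming encoder sees past frames only, the non-streaming encoder reads nothing but the streaming encoder's output (with future context), and one decoder serves both modes.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time step


def streaming_encoder(features: Frames, left_context: int = 2) -> Frames:
    """Causal stand-in: each output frame averages the current and past frames only."""
    out = []
    for t in range(len(features)):
        window = features[max(0, t - left_context): t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def non_streaming_encoder(streamed: Frames, context: int = 2) -> Frames:
    """Non-causal stand-in: consumes only the streaming encoder's output,
    but may look at both past and future frames of it."""
    out = []
    for t in range(len(streamed)):
        window = streamed[max(0, t - context): t + context + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def shared_decoder(encoded: Frames) -> List[int]:
    """Toy decoder shared by both modes: index of the largest value per frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in encoded]


def recognize(features: Frames, mode: str) -> List[int]:
    """Run the cascade; the mode selects which encoder output the decoder reads."""
    streamed = streaming_encoder(features)
    if mode == "streaming":
        return shared_decoder(streamed)
    if mode == "non_streaming":
        return shared_decoder(non_streaming_encoder(streamed))
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    print(recognize(feats, "streaming"))
    print(recognize(feats, "non_streaming"))
```

In this sketch the streaming branch never waits for future frames, while the non-streaming branch reuses the streaming computation and adds right context on top of it, which is the cascading idea in miniature.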


06/26/2022

On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

The streaming automatic speech recognition (ASR) models are more popular...
02/17/2022

Non-Autoregressive ASR with Self-Conditioned Folded Encoders

This paper proposes CTC-based non-autoregressive ASR with self-condition...
04/13/2022

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

In this paper, we propose a dynamic cascaded encoder Automatic Speech Re...
02/02/2022

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes a token-level serialized output training (t-SOT), a ...
05/14/2020

Streaming keyword spotting on mobile devices

In this work we explore the latency and accuracy of keyword spotting (KW...
03/29/2022

Streaming parallel transducer beam search with fast-slow cascaded encoders

Streaming ASR with strict latency constraints is required in many speech...
06/29/2022

On the Prediction Network Architecture in RNN-T for ASR

RNN-T models have gained popularity in the literature and in commercial ...