Streaming parallel transducer beam search with fast-slow cascaded encoders

03/29/2022
by   Jay Mahadeokar, et al.

Streaming ASR with strict latency constraints is required in many speech recognition applications. To achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models because they lack future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon this cascaded encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g. fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders, where the slow encoder corrects the mistakes made by the fast encoder. The proposed algorithm achieves up to 20% WER reduction with a slight increase in token emission delays on the public Librispeech dataset and in-house datasets. We also explore techniques to reduce computation by distributing processing between the fast and slow encoders. Lastly, we explore sharing parameters in the fast encoder to reduce the memory footprint. This enables low-latency processing on edge devices with low computation cost and a low memory footprint.
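The fast-slow emission schedule described above can be illustrated with a toy sketch. This is not the paper's beam search: there are no beams, transducer scores, or real encoders here, and the two passes run sequentially rather than in parallel. The names `fast_slow_decode`, `fast_step`, `slow_step`, and `slow_every` are hypothetical, and each "chunk" is a stand-in string rather than audio. The sketch only shows the idea that a fast pass emits tentative tokens on every chunk, and a slow pass periodically replaces them with a corrected hypothesis.

```python
from typing import Callable, List, Sequence

Tokens = List[str]


def fast_slow_decode(
    chunks: Sequence[str],
    fast_step: Callable[[Tokens, str], Tokens],
    slow_step: Callable[[Tokens, Sequence[str]], Tokens],
    slow_every: int = 2,
) -> List[Tokens]:
    """Return the hypothesis visible to the user after each fast chunk.

    Assumption: the slow pass fires once every `slow_every` fast chunks
    and its output overrides whatever the fast pass emitted meanwhile.
    """
    committed: Tokens = []   # tokens confirmed by the slow pass
    partial: Tokens = []     # tentative fast tokens since the last slow update
    pending: List[str] = []  # chunks the slow pass has not processed yet
    shown: List[Tokens] = []

    for i, chunk in enumerate(chunks, start=1):
        pending.append(chunk)
        partial = fast_step(partial, chunk)  # low-latency, possibly wrong
        if i % slow_every == 0:
            # The slow pass sees more context and corrects the fast output.
            committed = slow_step(committed, pending)
            pending = []
            partial = []  # the fast pass restarts from the corrected state
        shown.append(committed + partial)
    return shown


# Toy stand-ins: the fast pass mishears "sat" as "sad"; the slow pass does not.
chunks = ["the", "cat", "sat", "down"]
shown = fast_slow_decode(
    chunks,
    fast_step=lambda hyp, c: hyp + [{"sat": "sad"}.get(c, c)],
    slow_step=lambda hyp, cs: hyp + list(cs),
    slow_every=2,
)
for step, hyp in enumerate(shown, start=1):
    print(step, " ".join(hyp))
```

In this toy run the third step shows the fast pass's mistake ("sad"), and the fourth step shows it replaced once the slow pass catches up. Chunks left over when the input length is not a multiple of `slow_every` stay tentative in this sketch; a real system would need a final flush.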


research
11/21/2020

A Better and Faster End-to-End Model for Streaming ASR

End-to-end (E2E) models have been shown to outperform state-of-the-art conven...
research
04/13/2022

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

In this paper, we propose a dynamic cascaded encoder Automatic Speech Re...
research
02/17/2022

Non-Autoregressive ASR with Self-Conditioned Folded Encoders

This paper proposes CTC-based non-autoregressive ASR with self-condition...
research
09/13/2022

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Language identification is critical for many downstream tasks in automat...
research
10/27/2020

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Recently, online end-to-end ASR has gained increasing attention. However...
research
11/02/2022

Conversation-oriented ASR with multi-look-ahead CBS architecture

During conversations, humans are capable of inferring the intention of t...
research
03/31/2023

Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

In end-to-end (E2E) speech recognition models, a representational tight-...
