A Better and Faster End-to-End Model for Streaming ASR

11/21/2020
by   Bo Li, et al.
0

End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on latency results in a quality degradation. To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which has shown good improvements for ASR. Secondly, we also explore running a 2nd-pass beam search to improve quality. In order to ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm we called Cascaded Encoders. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/29/2019

Two-Pass End-to-End Speech Recognition

The requirements for many applications of state-of-the-art speech recogn...
research
03/29/2022

Streaming parallel transducer beam search with fast-slow cascaded encoders

Streaming ASR with strict latency constraints is required in many speech...
research
11/28/2022

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

We explore unifying a neural segmenter with two-pass cascaded encoder AS...
research
03/28/2020

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

Thus far, end-to-end (E2E) models have not been shown to outperform stat...
research
03/31/2023

Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

Conformer models maintain a large number of internal states, the vast ma...
research
03/31/2023

Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

In end-to-end (E2E) speech recognition models, a representational tight-...
research
04/13/2022

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

In this paper, we propose a dynamic cascaded encoder Automatic Speech Re...

Please sign up or login with your details

Forgot password? Click here to reset