FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

10/21/2020
by   Jiahui Yu, et al.

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches, including Early and Late Penalties and Constrained Alignments, penalize emission delay by manipulating per-token or per-frame probability predictions in sequence transducer models. Although they reduce delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is better suited to the sequence-level optimization of transducer models for streaming ASR by applying it to various end-to-end streaming ASR networks, including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer. We achieve 150-300 ms latency reduction with significantly better accuracy over previous techniques on a Voice Search test set. On LibriSpeech, FastEmit also improves streaming ASR accuracy while reducing latency from 210 ms to only 30 ms.
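The abstract measures quality by word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the number of reference words), the following may help; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("lights" -> "light")
# over five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.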


research
05/06/2021

Reducing Streaming ASR Model Delay with Self Alignment

Reducing prediction delay for streaming end-to-end ASR models with minim...
research
11/01/2022

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

In this paper, we present TrimTail, a simple but effective emission regu...
research
11/01/2022

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Automatic speech recognition (ASR) systems typically rely on an external...
research
12/22/2022

Alignment Entropy Regularization

Existing training criteria in automatic speech recognition (ASR) permit t...
research
05/29/2023

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based a...
research
08/19/2023

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

Automatic speech recognition (ASR) based on transducers is widely used. ...
research
07/04/2023

Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Connectionist Temporal Classification (CTC) is a widely used criterion f...
