FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

by   Jiahui Yu, et al.

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. While being successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is more suitable to the sequence-level optimization of transducer models for streaming ASR by applying it on various end-to-end streaming ASR networks including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer. We achieve 150-300 ms latency reduction with significantly better accuracy over previous techniques on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4 latency from 210 ms to only 30 ms on LibriSpeech.


page 1

page 2

page 3

page 4


Reducing Streaming ASR Model Delay with Self Alignment

Reducing prediction delay for streaming end-to-end ASR models with minim...

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

In this paper, we present TrimTail, a simple but effective emission regu...

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Automatic speech recognition (ASR) systems typically rely on an external...

Alignment Entropy Regularization

Existing training criteria in automatic speech recognition(ASR) permit t...

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based a...

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

Automatic speech recognition (ASR) based on transducers is widely used. ...

Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Connectionist Temporal Classification (CTC) is a widely used criterion f...

Please sign up or login with your details

Forgot password? Click here to reset