Reducing Streaming ASR Model Delay with Self Alignment

by Jaeyoung Kim, et al.

Reducing prediction delay in streaming end-to-end ASR models with minimal performance regression is a challenging problem. Constrained alignment is a well-known approach that penalizes predicted word boundaries using external low-latency acoustic models. In contrast, the recently proposed FastEmit is a sequence-level delay regularization scheme that encourages vocabulary tokens over blanks without any reference alignments. Although these schemes succeed in reducing delay, ASR word error rate (WER) often degrades severely after they are applied. In this paper, we propose a novel delay-constraining method named self alignment. Self alignment does not require external alignment models. Instead, it uses Viterbi forced alignments from the trained model to find the lower-latency alignment direction. In LibriSpeech evaluation, self alignment outperformed existing schemes: 25 at the similar word error rate. For Voice Search evaluation, 12 reductions were achieved compared to FastEmit and constrained alignment with more than 2
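The idea of reusing the model's own Viterbi forced alignment can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the transducer-style lattice layout (`blank_lp` of shape `(T, U+1)`, `label_lp` of shape `(T, U)`), the function names, the reading of "lower latency alignment direction" as a one-frame shift to the left, and the form of the penalty. It is not the authors' implementation.

```python
import numpy as np

def viterbi_emit_frames(blank_lp, label_lp):
    """Best-path emission frame for each of the U reference labels.

    blank_lp[t, u]: log-prob of emitting blank at lattice node (t, u), shape (T, U+1).
    label_lp[t, u]: log-prob of emitting the (u+1)-th label at node (t, u), shape (T, U).
    (Hypothetical layout, assumed for this sketch.)
    """
    T, U = label_lp.shape
    best = np.full((T, U + 1), -np.inf)
    from_label = np.zeros((T, U + 1), dtype=bool)
    best[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive via a blank from the previous frame, or via a label on this frame.
            stay = best[t - 1, u] + blank_lp[t - 1, u] if t > 0 else -np.inf
            emit = best[t, u - 1] + label_lp[t, u - 1] if u > 0 else -np.inf
            from_label[t, u] = emit > stay
            best[t, u] = max(stay, emit)
    # Backtrack from the final node (T-1, U) to recover where each label was emitted.
    frames = [0] * U
    t, u = T - 1, U
    while u > 0:
        if from_label[t, u]:
            frames[u - 1] = t
            u -= 1
        else:
            t -= 1
    return frames

def left_shift(frames):
    """Assumed 'lower latency direction': emit every label one frame earlier."""
    return [max(t - 1, 0) for t in frames]

def self_alignment_penalty(label_lp, frames):
    """Negative log-prob of the labels along the left-shifted path (assumed form)."""
    shifted = left_shift(frames)
    return -sum(label_lp[t, u] for u, t in enumerate(shifted))
```

In a training loop, a term like `self_alignment_penalty` would be scaled by a weight and added to the main transducer loss, so that probability mass is pushed toward emitting each token slightly earlier than the model currently does. The weight and the exact loss combination are left open here because the abstract does not state them.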



