Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

by   Hirofumi Inaguma, et al.

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously, especially for long-form and noisy speech. We also compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that the CTC-ST can achieve a comparable tradeoff of accuracy and latency without relying on external alignment information. The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).


page 1

page 4


CTC-synchronous Training for Monotonic Attention Model

Monotonic chunkwise attention (MoChA) has been studied for the online st...

StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

While attention-based encoder-decoder (AED) models have been successfull...

Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Recently, a few novel streaming attention-based sequence-to-sequence (S2...

Mutually-Constrained Monotonic Multihead Attention for Online ASR

Despite the feature of real-time decoding, Monotonic Multihead Attention...

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition

Recently, attention-based encoder-decoder (AED) models have shown state-...

Reducing Streaming ASR Model Delay with Self Alignment

Reducing prediction delay for streaming end-to-end ASR models with minim...

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion

Grapheme-to-phoneme (G2P) conversion is an important task in automatic s...