Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

02/28/2021
by Hirofumi Inaguma, et al.

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust to long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a non-negligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously, especially for long-form and noisy speech. We also compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST achieves a comparable accuracy-latency tradeoff without relying on external alignment information. The best MoChA system shows performance comparable to that of an RNN-transducer (RNN-T).
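
To make the synchronization idea above concrete, the following is a minimal sketch of how such a boundary-matching term could be formulated. The boundary symbols b_u and the interpolation weights λ_ctc and λ_sync are illustrative notation introduced here, not necessarily the paper's exact formulation.

```latex
% Sketch of a CTC-synchronous training objective (illustrative notation).
% b_u^{ctc}: frame index of the u-th token boundary obtained from the CTC
%            branch (e.g., by forced alignment of the reference transcript).
% b_u^{att}: expected boundary of the u-th token under MoChA's monotonic
%            attention distribution \alpha_{u,j}, i.e., \sum_j j \, \alpha_{u,j}.
\begin{align}
  \mathcal{L}_{\mathrm{sync}}  &= \frac{1}{U} \sum_{u=1}^{U}
      \bigl| b_u^{\mathrm{att}} - b_u^{\mathrm{ctc}} \bigr|, \\
  \mathcal{L}_{\mathrm{total}} &= (1 - \lambda_{\mathrm{ctc}})\,\mathcal{L}_{\mathrm{att}}
      + \lambda_{\mathrm{ctc}}\,\mathcal{L}_{\mathrm{ctc}}
      + \lambda_{\mathrm{sync}}\,\mathcal{L}_{\mathrm{sync}}.
\end{align}
```

Because the CTC branch shares the encoder with MoChA, the reference boundaries b_u^{ctc} come from the model itself rather than from an external aligner, which is why the abstract describes CTC-ST as self-distillation of alignment knowledge.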

Related research

05/10/2020 · CTC-synchronous Training for Monotonic Attention Model
Monotonic chunkwise attention (MoChA) has been studied for the online st...

07/01/2021 · StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
While attention-based encoder-decoder (AED) models have been successfull...

04/10/2020 · Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Recently, a few novel streaming attention-based sequence-to-sequence (S2...

05/04/2023 · Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
Transducer and Attention based Encoder-Decoder (AED) are two widely used...

03/26/2021 · Mutually-Constrained Monotonic Multihead Attention for Online ASR
Despite the feature of real-time decoding, Monotonic Multihead Attention...

07/10/2020 · Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition
Recently, attention-based encoder-decoder (AED) models have shown state-...

04/06/2019 · Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion
Grapheme-to-phoneme (G2P) conversion is an important task in automatic s...
