Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

04/10/2020
by Hirofumi Inaguma, et al.

Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed relative to the actual acoustic boundaries because their unidirectional encoders lack future information. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several training strategies that leverage external hard alignments extracted from the hybrid model. We investigate using the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and (4) directly minimize the differentiable expected latency loss. Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and that the decoder-side methods even improve the recognition accuracy in certain cases. We also present some analysis to understand the behaviors of streaming S2S models.
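As a rough illustration of strategy (4), the sketch below computes one plausible form of an expected latency: the mean absolute gap between each token's expected emission frame and its reference boundary from the hard alignment. The abstract does not give the formula, so the function name `expected_latency`, the inputs `alpha` and `boundaries`, and the absolute-difference form are all assumptions made for this example, not the authors' definition.

```python
def expected_latency(alpha, boundaries):
    """Mean absolute gap between expected emission frame and reference boundary.

    alpha[i][j]   : assumed probability that token i is emitted at frame j
                    (each row is expected to sum to 1).
    boundaries[i] : assumed reference frame index of token i, taken from an
                    external hard alignment.
    Both names are hypothetical and chosen for this sketch only.
    """
    if len(alpha) != len(boundaries):
        raise ValueError("alpha and boundaries must cover the same tokens")
    if not alpha:
        raise ValueError("at least one token is required")

    total = 0.0
    for probs, boundary in zip(alpha, boundaries):
        # Expected emission frame under the alignment distribution of this token.
        expected_frame = sum(j * p for j, p in enumerate(probs))
        total += abs(expected_frame - boundary)
    return total / len(alpha)


# Token 0 is emitted exactly on its boundary; token 1 is split between
# frames 2 and 3 while its boundary is frame 2.
alpha = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
boundaries = [1, 2]
print(expected_latency(alpha, boundaries))
```

The expression uses only sums, products, and an absolute value, so when written with tensors in an automatic-differentiation framework it would be differentiable almost everywhere and could be added to the training loss; plain Python is used here only to keep the arithmetic visible.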

research
02/28/2021

Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

This article describes an efficient training method for online streaming...
research
07/01/2021

StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

While attention-based encoder-decoder (AED) models have been successfull...
research
12/05/2017

Improving the Performance of Online Neural Transducer Models

Having a sequence-to-sequence model which can operate in an online fashi...
research
10/14/2022

Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a ...
research
11/04/2022

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Sequence transducers, such as the RNN-T and the Conformer-T, are one of ...
research
01/25/2022

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

A streaming style inference of encoder-decoder automatic speech recognit...
research
05/06/2021

Reducing Streaming ASR Model Delay with Self Alignment

Reducing prediction delay for streaming end-to-end ASR models with minim...
