End-to-end automatic speech recognition (ASR) systems have been successfully developed and now achieve performance competitive with conventional hybrid systems. Among various end-to-end ASR models, attention-based sequence-to-sequence (S2S) models [5, 9] have been shown to outperform [28, 4] other models such as connectionist temporal classification (CTC) [14, 15] and the recurrent neural network transducer (RNN-T) [16]. However, unlike frame-synchronous models such as CTC and RNN-T, S2S models are difficult to apply directly to the streaming scenario because the attention scores are normalized globally over all the encoder memories.
To address this, several streaming attention-based S2S models have been proposed, including local windowing methods [24, 6, 33, 17, 25], the neural transducer [18], hard monotonic attention [29, 7], adaptive computation steps [23], triggered attention [27], and continuous integrate-and-fire [11]. Among these frameworks, monotonic chunkwise attention (MoChA) [7] can be optimized efficiently in parallel and shows promising results on large-scale ASR tasks [26, 12, 20, 13]. However, these ‘streaming’ models are still unusable in many online tasks because their token generation latency is not small enough. This is an important issue that current research has not addressed well. We have observed that the decision boundary for token generation in these models is delayed from the actual acoustic boundary (see Figure 1). This is because the unidirectional encoder lacks future information, and the model is optimized to use as many future frames as possible to maximize the log probabilities over the transcription. This leads to inevitable latency and hurts the user experience.
In this paper, we propose minimum latency training strategies to reduce the latency of streaming attention-based S2S models. To the best of our knowledge, this has not been investigated for such ASR models so far. In this work, we focus on the algorithmic delay in token generation and refer to it as latency. Inspired by [30, 39], we leverage external hard alignments extracted from a hybrid model as supervision. The goal is to force the model to learn accurate alignments, which would reduce the latency while preserving the recognition accuracy. We adopt MoChA as the streaming S2S model and explore using the alignments on both the encoder and decoder sides. On the encoder side, we perform (1) multi-task learning and (2) pre-training with the framewise cross-entropy objective, which have been shown to be effective for stabilizing CTC training and further improving ASR accuracy [31, 39, 38]. On the decoder side, we propose two novel methods: (3) delay constrained training (DeCoT) and (4) minimum latency training (MinLT). DeCoT is conducted by removing inappropriate alignments during the marginalization over all possible alignments. Moreover, a regularization term is introduced to avoid the exponential decay of attention weights caused by the sequential dependency in the decoder. MinLT is performed by directly minimizing a differentiable expected latency estimated from the expected boundary locations.
Experimental evaluations on the 3.4k-hour Cortana dataset demonstrate that our proposed methods significantly reduce the latency on both the encoder and decoder sides, with the decoder-side methods (DeCoT and MinLT) being more effective. Surprisingly, we also observe significant improvements in ASR accuracy when leveraging the alignment information on the decoder side. Furthermore, we conduct an ablation study to understand the behavior of streaming S2S models.
2 Streaming sequence-to-sequence ASR
2.1 Monotonic chunkwise attention (MoChA)
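We build streaming attention-based sequence-to-sequence (S2S) models based on the monotonic chunkwise attention (MoChA) model [7]. MoChA is an extension of the hard monotonic attention model and introduces an additional soft chunkwise attention mechanism on top of it. During training, monotonic alignments $\alpha_{i,j}$ are learnt by marginalizing over all possible alignments represented by selection probabilities $p_{i,j}$ as follows:
$$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \Big( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \Big) \tag{1}$$
$$\phantom{\alpha_{i,j}} = p_{i,j} \Big( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \Big), \tag{2}$$
$$p_{i,j} = \sigma(e^{\mathrm{mono}}_{i,j}), \qquad e^{\mathrm{mono}}_{i,j} = g \frac{v^{\top}}{\lVert v \rVert} \mathrm{ReLU}(W_h h_j + W_s s_i + b) + r, \tag{2.1}$$
where $g$, $v$, $W_h$, $W_s$, $b$, and $r$ are learnable parameters, $s_i$ is the $i$-th decoder state, $h_j$ is the $j$-th encoder state, $\sigma$ is a logistic sigmoid function, and $e^{\mathrm{mono}}_{i,j}$ is the monotonic energy activation. Additional soft chunkwise attention is performed over small chunks ($w$ frames) from each boundary by normalizing the chunk energy activation $u_{i,j}$ (implemented with different parameters as in Eq. (2.1)):
$$\beta_{i,j} = \sum_{k=j}^{j+w-1} \Big( \alpha_{i,k} \frac{\exp(u_{i,j})}{\sum_{l=k-w+1}^{k} \exp(u_{i,l})} \Big). \tag{3}$$
The expected context vector is calculated as a weighted sum of the encoder memories, $c_i = \sum_{j} \beta_{i,j} h_j$, and the subsequent token generation steps are the same as in global attention. Fortunately, Eq. (2) and (3) can be calculated efficiently in parallel with cumulative sum, cumulative product, and moving sum operations during training. At test time, each token is generated once $p_{i,j}$ surpasses a threshold of 0.5, and the corresponding attention weight is then set to 1.0. See [29, 7] for more details.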
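The marginalization over monotonic alignments can be illustrated with a small sequential reference implementation. This is a minimal sketch, not the parallel cumulative-sum/product form used in training: it assumes a matrix `p` of selection probabilities (rows are output tokens, columns are encoder frames) and the usual convention that attention starts at the first frame. The function name and the 0-based indexing are illustrative choices.

```python
import numpy as np

def expected_monotonic_alignment(p):
    """Sequential reference form of the expected monotonic alignment.

    p: (L, T) array, p[i, j] = probability of stopping at frame j for token i.
    Returns alpha: (L, T), alpha[i, j] = probability that the boundary of
    token i falls on frame j.
    """
    L, T = p.shape
    alpha = np.zeros((L, T))
    prev = np.zeros(T)
    prev[0] = 1.0  # assumption: before the first token, attention sits at frame 0
    for i in range(L):
        q = 0.0  # q_j = sum_k prev[k] * prod_{l=k}^{j-1} (1 - p[i, l])
        for j in range(T):
            # q_j = (1 - p[i, j-1]) * q_{j-1} + alpha[i-1, j]
            q = (q * (1.0 - p[i, j - 1]) if j > 0 else 0.0) + prev[j]
            alpha[i, j] = p[i, j] * q
        prev = alpha[i]
    return alpha
```

With uniform probabilities of 0.5, the rows of `alpha` sum to less than one and the mass shrinks from one token to the next, which is the decay effect that motivates the regularization discussed later in the paper.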
2.2 Enhanced monotonic attention with 1D convolution
Although monotonic alignments are efficiently formulated in Eq. (2), the binary decision on whether to generate the next token is parameterized by the selection probability $p_{i,j}$ and depends on the corresponding encoder output $h_j$. In theory, an RNN encoder should capture past information over several timesteps. However, we found that using surrounding encoder outputs as additional key features in Eq. (2.1) makes the binary decision more robust. Specifically, we apply a 1-dimensional convolutional layer to the encoder outputs before projecting them into the attention space. We set the kernel size to 5 and the channel size to the dimension of the encoder output. Note that this enhanced MoChA decoder looks at 2 ($= (5-1)/2$) future frames at every input timestep for the boundary prediction.
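The lookahead introduced by the convolution can be checked with a toy example. The sketch below assumes a centered (symmetrically zero-padded) kernel, which is what makes a kernel of size 5 reach 2 frames into the future; the function names and the NumPy implementation are illustrative and not taken from the paper.

```python
import numpy as np

def centered_conv1d(h, weight):
    """1-D convolution over time with a centered kernel.

    h: (T, D) encoder outputs; weight: (K, D, E) with odd kernel size K.
    Output frame t mixes input frames t-(K-1)/2 ... t+(K-1)/2 (zero-padded).
    """
    K = weight.shape[0]
    half = (K - 1) // 2
    T, D = h.shape
    padded = np.concatenate([np.zeros((half, D)), h, np.zeros((half, D))], axis=0)
    out = np.zeros((T, weight.shape[2]))
    for t in range(T):
        window = padded[t:t + K]  # frames t-half .. t+half of the original input
        out[t] = np.einsum("kd,kde->e", window, weight)
    return out

def lookahead_frames(kernel_size):
    """Number of future frames seen by a centered kernel."""
    return (kernel_size - 1) // 2
```

Perturbing frame `t + 2` changes the output at frame `t`, while perturbing frame `t + 3` does not, so the boundary decision at each timestep waits for exactly two extra encoder frames under this assumption.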
3 Problem specification
3.1 Definition of latency
S2S models do not guarantee that token boundaries are accurately aligned to the corresponding acoustic boundaries. When a unidirectional encoder is used, monotonic attention weights are generally distributed several frames into the future to maximize the log probabilities of the target sequence, which causes inevitable latency (see Figure 1).
In this work, we focus on this issue and define this delay as latency. The goal of this work is to reduce the latency as much as possible while maintaining the recognition accuracy. The most relevant works are [30, 39], where the authors also tackle the same issue in the CTC acoustic model. However, to the best of our knowledge, this problem has not been investigated so far in the streaming attention-based S2S models, which are categorized as label-synchronous models and behave very differently from frame-synchronous models.
3.2 Evaluation metric
In this work, we adopt corpus-level latency $\Delta_{\mathrm{corpus}}$ and utterance-level latency $\Delta_{\mathrm{utt}}$ as evaluation metrics for latency. We regard input timesteps where monotonic attention weights are activated as the predicted boundaries. The latency of the $i$-th token in the $k$-th utterance is calculated as the difference between the predicted boundary $\hat{b}_i^k$ and the corresponding gold boundary $b_i^k$. For the corpus-level latency, we take an average over all tokens in the evaluation set as follows:
$$\Delta_{\mathrm{corpus}} = \frac{1}{\sum_{k=1}^{N} |y^k|} \sum_{k=1}^{N} \sum_{i=1}^{|y^k|} (\hat{b}_i^k - b_i^k),$$
where $N$ is the number of utterances in the evaluation set and $y^k$ is the $k$-th reference. For the utterance-level latency, we take an average of the mean latency in each utterance as follows:
$$\Delta_{\mathrm{utt}} = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{|y^k|} \sum_{i=1}^{|y^k|} (\hat{b}_i^k - b_i^k).$$
We mainly report the average, median, and 90th and 99th percentiles of the corpus-level latency distributions. Since the length of each hypothesis must be equal to that of the corresponding reference for these metrics, we use teacher forcing when calculating the latency.
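The difference between the two metrics is easiest to see on a toy example. The sketch below assumes boundaries are given as per-utterance lists of frame indices and that latency is the signed difference between predicted and gold boundaries; the function names are illustrative.

```python
import numpy as np

def token_latencies(pred, gold):
    """Per-utterance arrays of (predicted boundary - gold boundary) in frames."""
    return [np.asarray(p, dtype=float) - np.asarray(g, dtype=float)
            for p, g in zip(pred, gold)]

def corpus_level_latency(pred, gold):
    """Average over all tokens in the evaluation set."""
    return float(np.concatenate(token_latencies(pred, gold)).mean())

def utterance_level_latency(pred, gold):
    """Average of the per-utterance mean latencies."""
    return float(np.mean([d.mean() for d in token_latencies(pred, gold)]))

def latency_summary(pred, gold):
    """Average, median, 90th and 99th percentile of the token latencies."""
    d = np.concatenate(token_latencies(pred, gold))
    return {
        "ave": float(d.mean()),
        "med": float(np.median(d)),
        "p90": float(np.percentile(d, 90)),
        "p99": float(np.percentile(d, 99)),
    }
```

For predicted boundaries `[[5, 12, 20], [9]]` and gold boundaries `[[3, 10, 15], [8]]`, the corpus-level latency is 2.5 frames while the utterance-level latency is 2.0 frames, because the short second utterance carries as much weight as the long one in the latter.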
4 Strategies for minimum latency training
In this section, we propose minimum latency training strategies applicable to the encoder and decoder to reduce the latency of streaming S2S models. We leverage external hard alignments obtained from the acoustic model in the hybrid system. The acoustic model is trained to minimize the senone-level framewise cross-entropy (CE) loss. Let $A = (a_1, \ldots, a_T)$ ($a_j$ is a $V$-dimensional one-hot vector, $V$: vocabulary size) be a hard alignment corresponding to the input sequence $x = (x_1, \ldots, x_T)$, and $b = (b_1, \ldots, b_L)$ be a sequence of token boundaries (end points) for each reference $y = (y_1, \ldots, y_L)$. $b$ can be obtained from $A$.
4.1 Leveraging hard alignments in the encoder
4.1.1 Multi-task learning with framewise CE objective (MTL-CE)
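We first propose multi-task learning with a framewise cross-entropy (CE) objective using the hard alignments. We hypothesize that framewise supervision regularizes the encoder representations so that each encoder output is aligned to the true acoustic location, which would be helpful for calculating accurate boundaries in Eq. (2.1). We attach another softmax layer for the framewise CE objective on top of the encoder and jointly optimize the CE objective $\mathcal{L}_{\mathrm{S2S}}$ (S2S branch) and the framewise CE objective $\mathcal{L}_{\mathrm{CE}}$ (CE branch) with an interpolation weight $\lambda_{\mathrm{CE}}$ ($0 \le \lambda_{\mathrm{CE}} \le 1$):
$$\mathcal{L}_{\mathrm{total}} = (1 - \lambda_{\mathrm{CE}}) \mathcal{L}_{\mathrm{S2S}} + \lambda_{\mathrm{CE}} \mathcal{L}_{\mathrm{CE}}, \tag{4}$$
where $\mathcal{L}_{\mathrm{CE}} = -\sum_{j} a_j^{\top} \log q_j$, $a_j$ is the one-hot alignment label of the $j$-th frame, and $q_j$ are posterior distributions from the CE branch. Following [38], we insert two linear projection layers after the top encoder layer, one for each branch, as bottleneck layers (see Figure 2). The outputs of the two projection layers are concatenated and fed into the S2S branch. The softmax layer in the framewise CE branch is discarded during inference.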
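The joint objective can be sketched in a few lines. The example assumes the two losses are interpolated with a single weight in the range 0 to 1 (consistent with the two-stage schedule described for pre-training, where the weight is set to 1.0 and then to 0) and that the framewise CE is averaged over frames; the averaging and the names `framewise_ce`, `mtl_loss` and `lam_ce` are assumptions for illustration.

```python
import numpy as np

def framewise_ce(logits, alignment):
    """Framewise cross-entropy of the CE branch against a hard alignment.

    logits: (T, V) unnormalized scores; alignment: (T,) integer label per frame.
    Returns the mean negative log-likelihood over frames (assumed reduction).
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_q[np.arange(len(alignment)), alignment].mean())

def mtl_loss(loss_s2s, loss_ce, lam_ce):
    """Interpolate the S2S loss and the framewise CE loss."""
    assert 0.0 <= lam_ce <= 1.0
    return (1.0 - lam_ce) * loss_s2s + lam_ce * loss_ce
```

Setting `lam_ce` to 1.0 trains only the CE branch (and thus the encoder), and setting it to 0 recovers plain S2S training.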
4.1.2 Pre-training with the framewise CE objective (PT-CE)
Next, we propose pre-training the encoder with the framewise CE objective. Specifically, we first train only the encoder until convergence by setting the framewise CE weight $\lambda_{\mathrm{CE}}$ in Eq. (4) to 1.0 (stage 1), and then optimize all parameters except for the CE branch by setting $\lambda_{\mathrm{CE}}$ to 0 (stage 2). In this way, we do not have to carefully tune the weight for the framewise CE objective. Unlike in Section 4.1.1, we do not stack any linear projection layers on the encoder in this method.
4.2 Leveraging hard alignments in the decoder
4.2.1 Delay constrained training (DeCoT)
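The above two methods utilize the hard alignments on the encoder side. Here, we leverage them on the decoder side. Since the expected alignment $\alpha_{i,j}$ is optimized by marginalizing over all possible alignments during training, these alignments can include arbitrary future contexts as long as monotonicity is not violated, which increases the latency. Therefore, we remove inappropriate alignments whose boundaries exceed the acceptable latency $\delta$ [frame] in Eq. (2) as follows:
$$\alpha_{i,j} = \begin{cases} p_{i,j} \Big( (1 - p_{i,j-1}) \dfrac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \Big) & (j \le b_i + \delta) \\ 0 & (\text{otherwise}) \end{cases}$$
where $p_{i,j}$ is the selection probability and $b_i$ is the $i$-th gold boundary. This delay constrained training (DeCoT) is illustrated in the top left box in Figure 2.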
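The constraint can be sketched as a mask applied while the expected alignment is accumulated. This is a minimal sequential sketch under the assumption that positions more than `delta` frames past the gold boundary of each token are simply zeroed; the function name, the 0-based frame indices and the start-at-frame-0 convention are illustrative.

```python
import numpy as np

def decot_alignment(p, gold_boundaries, delta):
    """Expected monotonic alignment with a delay constraint.

    p: (L, T) selection probabilities; gold_boundaries: length-L frame indices;
    delta: acceptable latency in frames. Alignment mass at frames later than
    gold_boundaries[i] + delta is removed for token i.
    """
    L, T = p.shape
    alpha = np.zeros((L, T))
    prev = np.zeros(T)
    prev[0] = 1.0  # assumption: attention starts at frame 0
    for i in range(L):
        q = 0.0
        for j in range(T):
            q = (q * (1.0 - p[i, j - 1]) if j > 0 else 0.0) + prev[j]
            if j <= gold_boundaries[i] + delta:
                alpha[i, j] = p[i, j] * q
            # else: alpha[i, j] stays 0, i.e. the late alignment is discarded
        prev = alpha[i]
    return alpha
```

Because the discarded mass is not redistributed, each row of the constrained alignment sums to less than the unconstrained one, which is one way to see why an additional term that restores the total attention mass becomes useful.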
DeCoT was investigated for CTC acoustic models in [30], and we extend it to MoChA, which is also optimized by marginalizing over all possible alignment paths. Unlike CTC, where alignments are calculated with the forward-backward algorithm, the expected boundaries in MoChA are calculated only in the forward direction because of the sequential dependency in the decoder. This causes exponential decay of $\alpha_{i,j}$ and leads to almost zero context vectors, especially in the latter part of the output sequence. To recover from this, we introduce a regularization term that keeps the number of boundaries as close as possible to the number of output tokens $L$:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{S2S}} + \lambda_{\mathrm{QUA}} \Big| L - \sum_{i=1}^{L} \sum_{j=1}^{T} \alpha_{i,j} \Big|,$$
where $\alpha_{i,j}$ is the expected monotonic attention weight of the $i$-th token at the $j$-th input timestep, $T$ is the number of input timesteps, and $\lambda_{\mathrm{QUA}}$ ($\ge 0$) is a tunable hyperparameter. We name this the quantity loss, inspired by [11]. The quantity loss emphasizes the valid alignments during the marginalization process and has an effect similar to re-normalizing the attention weights. Note that $\alpha_{i,j}$ should not be explicitly normalized over the encoder outputs since we cannot see the entire outputs during inference [29].
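A minimal sketch of this regularizer follows, assuming it is the absolute difference between the number of output tokens and the total expected attention mass, added to the S2S loss with a non-negative weight. The names `quantity_loss`, `total_loss` and `lam_qua` are illustrative.

```python
import numpy as np

def quantity_loss(alpha):
    """|L - sum of all expected attention weights| for an (L, T) alignment."""
    num_tokens = alpha.shape[0]
    return float(abs(num_tokens - alpha.sum()))

def total_loss(loss_s2s, alpha, lam_qua=1.0):
    """S2S loss plus the weighted quantity loss."""
    return loss_s2s + lam_qua * quantity_loss(alpha)
```

When every token places its full mass on one frame, the term vanishes; when the mass decays (for example rows summing to 0.875 and 0.6875), the penalty equals the missing mass, 0.4375 in that case.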
4.2.2 Minimum latency training (MinLT)
The above DeCoT assumes a fixed latency for each token by setting the tolerance to a constant value. However, the actual latency differs from token to token depending on various factors such as speaking speed and the number of characters in each subword. Therefore, we next explore directly minimizing the expected latency over the target sequence during training. MinLT was investigated in simultaneous NMT to reduce the expected latency without any supervision by considering the ratio of input and output lengths [1]. There, the speeds of reading source tokens and writing target tokens are assumed to be almost constant during the whole translation process. However, this cannot be directly applied to the ASR task since non-silence frames, which are typically skipped by the decoder, are not uniformly distributed over the input speech. Hence, we design a differentiable expected latency objective for the ASR task and minimize it jointly with the CE objective:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{S2S}} + \lambda_{\mathrm{MinLT}} \frac{1}{L} \sum_{i=1}^{L} \big| \hat{b}_i - b_i \big|, \qquad \hat{b}_i = \sum_{j=1}^{T} j \, \alpha_{i,j},$$
where $\hat{b}_i$ represents the expected boundary location of the $i$-th token, $b_i$ is the corresponding gold boundary, $L$ is the number of output tokens, and $\lambda_{\mathrm{MinLT}}$ ($\ge 0$) is a tunable hyperparameter.
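The expected latency can be computed directly from the expected alignment, which is what makes it differentiable. The sketch below assumes the expected boundary of each token is the attention-weighted average frame index and that the penalty is the mean absolute distance to the gold boundary; 0-based frame indices and the function name are illustrative choices.

```python
import numpy as np

def expected_latency(alpha, gold_boundaries):
    """Mean absolute gap between expected and gold boundaries, in frames.

    alpha: (L, T) expected monotonic alignment; gold_boundaries: length-L
    frame indices using the same indexing as the columns of alpha.
    """
    T = alpha.shape[1]
    expected_boundary = alpha @ np.arange(T)  # sum_j j * alpha[i, j]
    gold = np.asarray(gold_boundaries, dtype=float)
    return float(np.mean(np.abs(expected_boundary - gold)))
```

For a hard alignment with boundaries at frames 2 and 5 and gold boundaries at frames 1 and 3, the value is 1.5 frames. Unlike the delay constraint, this term adapts to each token instead of applying one fixed tolerance.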
5.1 Experimental conditions
All experiments were conducted on Microsoft’s Cortana voice assistant task. The training data contains around 3.3 million utterances (3.4k hours) in US English. The test set contains about 5600 utterances (6 hours). All the data is anonymized with personally identifiable information removed. We used 80-channel log-mel filterbank coefficients computed with a 25 ms window shifted every 10 ms. Three successive frames were stacked together to form 240-dimensional input features, which reduces the time resolution by a factor of 3 (30 ms per frame). The encoder consists of a 6-layer unidirectional gated recurrent unit (GRU) network with 1024 hidden units in each layer. The decoders of both the offline and streaming S2S models were composed of a 2-layer GRU with 512 units per layer. We performed layer normalization [3] after each layer in both the encoder and decoder to stabilize training. We used a chunk size of 4 for MoChA. We used dropout regularization and label smoothing [32] with probabilities 0.1 and 0.2, respectively. We used 34k mixed units for the output vocabulary [22]. Training was performed using the Adam optimizer [21], with separate learning rates for the global attention and MoChA models. We set the weights of both the quantity loss and the expected latency objective to 1.0. Beam search decoding was performed with a beam width of 8. We did not use an external language model for decoding. We report word error rate (WER) on the test set and latency statistics on the validation set since we do not have alignments for the test set.
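The frame stacking described above can be sketched as a reshape. The example assumes non-overlapping groups of three consecutive frames and drops any trailing frames that do not fill a group; the function name and the handling of the remainder are assumptions.

```python
import numpy as np

def stack_frames(feats, n_stack=3):
    """Stack n_stack consecutive frames into one wider frame.

    feats: (T, D) features at a 10 ms shift. Returns (T // n_stack, n_stack * D),
    so 80-dimensional inputs become 240-dimensional at a 30 ms shift.
    """
    T, D = feats.shape
    t_out = T // n_stack
    return feats[:t_out * n_stack].reshape(t_out, n_stack * D)

FRAME_SHIFT_MS = 10 * 3  # each stacked frame advances 30 ms
```

Under this setup, one encoder frame corresponds to 30 ms of audio, so a delay of 10 stacked frames would amount to 300 ms.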
5.2.1 Baseline streaming S2S model
We first compare the offline and streaming S2S models in Table 1. There are large gaps between the bidirectional and unidirectional encoders, and also between the offline and streaming S2S models. To bridge these gaps, the CTC objective, external language model integration [8, 19, 34], pre-training, and sequence training such as MBR can be leveraged. However, this is not the focus of this work. In our experiments, we found that MoChA is very sensitive to hyperparameters such as the learning rate. Tuning the clipping value in Eq. (2) was also critical to avoid numerical instabilities. The proposed 1-dimensional convolutional (1D-Conv) layer gave a 4.24% relative WER reduction over the original MoChA. Therefore, we use MoChA with the 1D-Conv layer as our baseline in the following experiments.
| | Model | WER |
| Offline | Bidirectional (global attention) | 7.01 |
| | Unidirectional (global attention) | 8.44 |
| Streaming | MoChA (chunk: 4) | 10.37 |
| | + 1D-Conv (baseline) | 9.93 |
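The relative gain quoted for the 1D-Conv layer follows from the two streaming rows of the table. A small helper (the name is illustrative) makes the arithmetic explicit:

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent when going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline

# MoChA (chunk: 4) at 10.37 WER vs. MoChA + 1D-Conv at 9.93 WER
gain = relative_reduction(10.37, 9.93)
print(f"{gain:.2f}% relative WER reduction")
```

This gives (10.37 - 9.93) / 10.37, or about 4.24%.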
5.2.2 Leveraging hard alignments in the encoder
Next, we show results of leveraging hard alignments in the encoder of MoChA in Table 2. With multi-task learning with the framewise CE objective (MTL-CE), the latency was significantly reduced at the cost of WER. As the weight for the framewise CE objective increased, the latency was reduced further while WER degraded more. At one setting of the weight, we obtained a 40% reduction in median latency with only a 5.6% relative WER degradation. Pre-training with the framewise CE objective (PT-CE) also reduced the latency but sacrificed more WER than MTL-CE. These observations are contrary to previous works leveraging the framewise CE objective in the CTC model [30, 39, 38]. One possible explanation is that CTC is a frame-synchronous model while MoChA is a label-synchronous model. In addition, we did not use phoneme-level supervision, in order to rule out accuracy gains that come from joint optimization with lower-level labels [13, 35, 2].
| Model | WER | Corpus-level latency [frame] |
5.2.3 Leveraging hard alignments in the decoder
We then leverage the hard alignments on the decoder side. Results are shown in Table 3 and Figure 3. We initialized all models except for the baseline with the baseline MoChA (warm start), since alignment constraints in the decoder make training MoChA from scratch much harder. Note that we did not provide framewise supervision to the encoder in these experiments. Both DeCoT and MinLT significantly reduced the latency. An interesting observation was that WER also improved significantly at the same time with DeCoT at some tolerance settings: we obtained relative WER improvements of 8.0% and 10.6% at two of these settings. One possible explanation is that clean paths were emphasized more when out-of-boundary (potentially noisy) paths were removed during training. DeCoT reduces outliers, as confirmed by the drastic latency reduction in the tail of the distribution, but too strong a constraint with a small tolerance collapsed the model. In contrast, MinLT is effective for shifting the center of the latency distribution to the left, since the median improved by 40%. Note that when the $i$-th boundary corresponding to a non-EOS token is not activated until the last input timestep, we set the boundary to the last input timestep to calculate the latency.
Considering the fact that both the latency and WER improved, hard alignments can be used more efficiently on the decoder side. Since error signals regarding latency were directly connected to the decoder side, the model could balance the accuracy and latency more effectively than techniques on the encoder side.
| Model | WER | Corpus-level latency [frame] |
Finally, we conducted an ablation study on the decoder side in Table 4. Quantity loss was not necessary for the baseline MoChA and MinLT, but was essential for DeCoT. Warm start training was also necessary for both DeCoT and MinLT. This is probably because incorrect error signals were propagated into the model in the early training stage when training from scratch. The combination of DeCoT and MinLT with warm start training degraded WER too much, although the latency was reduced further. The baseline WER was indeed improved by the additional updates from warm start training at the cost of latency, but DeCoT still outperformed this warm-started baseline with much smaller latency. We also tried to directly shift the boundary locations without boundary supervision by setting $b_i$ to 0 for all $i$ in Eq. (5). However, this did not reduce the latency, which confirms the effectiveness of our proposed expected latency objective.
| Model | WER | Corpus-level latency [frame] | | | |
| | | Ave. | Med. | 90th | 99th |
| w/ warm start | 9.21 | 12.27 | 11.00 | 22.23 | 43.16 |
| w/ quantity loss | 10.30 | 11.24 | 10.00 | 20.39 | 36.01 |
| w/o warm start | 10.72 | 6.28 | 7.00 | 11.12 | 36.03 |
| w/o quantity loss | 14.28 | 3.93 | 3.00 | 7.20 | 27.39 |
| w/o warm start | 13.60 | 11.83 | 10.00 | 21.41 | 45.06 |
| w/ quantity loss | 13.66 | 6.82 | 6.00 | 10.45 | 25.57 |
| w/ $b_i = 0$ in Eq. (5) | 9.29 | 12.11 | 11.00 | 21.77 | 42.85 |
In this paper, we tackled the delayed token generation problem in streaming attention-based S2S ASR models. We explored leveraging external hard alignments obtained from a hybrid ASR model so that the decision to generate the next token is made as early as possible while maintaining the recognition accuracy. We proposed several strategies applicable to the encoder and decoder subnetworks. Experimental evaluations demonstrated that hard alignments were effective in both subnetworks for latency reduction, and that they further reduced the word error rate when applied to the decoder.
References
[1] (2019) Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of ACL, pp. 1313–1323.
[2] (2017) Direct acoustics-to-word models for English conversational speech recognition. In Proceedings of Interspeech, pp. 959–963.
[3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[4] (2017) Exploring neural transducers for end-to-end speech recognition. In Proceedings of ASRU, pp. 206–213.
[5] Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of ICASSP, pp. 4960–4964.
[6] (2016) On online attention-based speech recognition and joint mandarin character-pinyin training. In Proceedings of Interspeech, pp. 3404–3408.
[7] (2018) Monotonic chunkwise attention. In Proceedings of ICLR.
[8] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of ICASSP, pp. 4774–4778.
[9] (2015) Attention-based models for speech recognition. In Proceedings of NeurIPS, pp. 577–585.
[10] Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[11] (2019) CIF: Continuous integrate-and-fire for end-to-end speech recognition. arXiv preprint arXiv:1905.11235.
[12] (2019) An online attention-based model for speech recognition. In Proceedings of Interspeech, pp. 4390–4394.
[13] (2019) Improved multi-stage training of online attention-based encoder-decoder models. In Proceedings of ASRU, pp. 70–77.
[14] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of ICML, pp. 369–376.
[15] (2014) Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of ICML, pp. 1764–1772.
[16] (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
[17] (2017) Gaussian prediction based attention for online end-to-end speech recognition. In Proceedings of Interspeech, pp. 3692–3696.
[18] (2015) A neural transducer. arXiv preprint arXiv:1511.04868.
[19] (2017) An analysis of incorporating an external language model into a sequence-to-sequence model. In Proceedings of ICASSP, pp. 5824–5828.
[20] (2019) Attention based on-device streaming speech recognition with large speech corpus. In Proceedings of ASRU, pp. 956–963.
[21] (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] (2018) Advancing acoustic-to-word CTC model. In Proceedings of ICASSP, pp. 5794–5798.
[23] (2019) End-to-end speech recognition with adaptive computation steps. In Proceedings of ICASSP, pp. 6246–6250.
[24] Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pp. 1412–1421.
[25] (2019) An analysis of local monotonic attention variants. In Proceedings of Interspeech, pp. 1398–1402.
[26] (2019) Online hybrid CTC/attention architecture for end-to-end speech recognition. In Proceedings of Interspeech, pp. 2623–2627.
[27] (2019) Triggered attention for end-to-end speech recognition. In Proceedings of ICASSP, pp. 5666–5670.
[28] (2017) A comparison of sequence-to-sequence models for speech recognition. In Proceedings of Interspeech, pp. 939–943.
[29] (2017) Online and linear-time attention by enforcing monotonic alignments. In Proceedings of ICML, pp. 2837–2846.
[30] (2015) Acoustic modelling with CD-CTC-sMBR LSTM RNNs. In Proceedings of ASRU, pp. 604–609.
[31] (2015) Learning acoustic frame labeling for speech recognition with recurrent neural networks. In Proceedings of ICASSP, pp. 4280–4284.
[32] Rethinking the inception architecture for computer vision. In Proceedings of CVPR, pp. 2818–2826.
[33] (2017) Local monotonic attention mechanism for end-to-end speech and language processing. In Proceedings of IJCNLP, pp. 431–440.
[34] (2018) A comparison of techniques for language model integration in encoder-decoder speech recognition. In Proceedings of SLT, pp. 369–375.
[35] (2017) Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In Proceedings of Interspeech, pp. 3532–3536.
[36] (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253.
[37] (2018) Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition. In Proceedings of Interspeech, pp. 761–765.
[38] (2018) A multistage training framework for acoustic-to-word model. In Proceedings of Interspeech, pp. 786–790.
[39] (2018) Acoustic modeling with DFSMN-CTC and joint CTC-CE learning. In Proceedings of Interspeech, pp. 771–775.