A comparison of end-to-end models for long-form speech recognition

by Chung-Cheng Chiu, et al.

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real-world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive with RNN-T on long-form speech recognition.


1 Introduction

End-to-end models have become a popular choice for speech recognition, thanks to both the simplicity of building them and their superior performance over conventional systems [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. In contrast to conventional systems, which are composed of separate acoustic, pronunciation, and language modeling components, end-to-end approaches formulate the speech recognition problem directly as a mapping from utterances to transcripts, which greatly simplifies the training and decoding processes. Popular end-to-end models fall into three broad classes: 1) those that are based on the connectionist temporal classification (CTC) [13] criterion, 2) those that are based on the RNN-T criterion, and 3) those that make use of an attention mechanism.

While recent studies have shown that end-to-end models are very competitive with conventional systems, they have focused mainly on short utterances, which last from a few seconds to a few tens of seconds at most. Few works have investigated long-form transcription, a capability that is fundamental to applications like continuous transcription of meetings, presentations, or lectures. In the limited literature on this topic that we are aware of [12], the authors show that end-to-end CTC models can generalize well to long utterances, but it remains unanswered whether RNN-T and attention-based models can provide the same robustness. [7] evaluates attention-based models on long-form audio by concatenating multiple short utterances into long utterances, yet it is unclear whether observations on synthetic data will generalize to real-world cases.

In this work, we both evaluate and improve the performance of end-to-end models on long-form audio. Focusing on RNN-T and attention-based models, we first evaluate popular end-to-end models on a long-form ASR task: the transcription of YouTube videos [12]. The YouTube dataset contains a mix of long-form audio that closely reflects both common real-life use cases (conversations, lectures, TV shows, etc.) and a broad variety of domains (education, sports, etc.). The training data is automatically generated via the confidence island method detailed in [14] and contains utterances that are several seconds long, while the test set is human-transcribed and has utterances that are a few minutes long. Comparison on this task shows that standard soft-attention-based models generalize poorly to long audio.

Next, we incorporate two mechanisms in order to improve the generalization of attention-based models to long utterances. The first mechanism is a monotonicity constraint in the attention model, exploiting the observation that in ASR, the target sequence (transcript) and source sequence (acoustic signal) are monotonically aligned. We explore a few different mechanisms to enforce monotonicity, including the monotonic attention model [15], the monotonic chunkwise attention model [16], the monotonic infinite lookback attention model [17], and the GMM-based monotonic attention model [18, 19]. Our results show that enforcing a monotonicity constraint does improve the generalization of attention-based models to long utterances, but is still not sufficient to fully solve the problem. Thus we also incorporate a novel decoding algorithm that breaks long utterances into overlapping segments. We show that the combination of these two mechanisms enables attention-based models to match the performance of RNN-T on a long-form task.

The rest of the paper is organized as follows: section 2 describes the attention models, section 3 describes the RNN-T model evaluated in this work, and section 4 briefly introduces the decoding strategy that helps the models be robust to long-form utterances. Section 5 presents our evaluation results, and section 6 concludes with our observations.

2 Attention-Based Models

(a) Soft attention.
(b) Monotonic attention.
(c) MoChA attention.
(d) MILk attention.
(e) GMM monotonic attention.
(f) RNN-T.
Figure 1: A simple diagram comparing the end-to-end approaches evaluated in this work. The horizontal axis corresponds to the encoder steps while the vertical axis corresponds to prediction steps. (a)-(e) are attention-based models. The GMM monotonic attention uses a mixture of multiple distributions, but for clarity of illustration (e) plots the single-distribution case. The monotonic attention (b) and RNN-T (f) exhibit the same behavior in terms of selecting encoder hidden states for making predictions, but differ in how the selected encoder state is used by the decoder.

A popular and effective approach to building end-to-end models is with attention-based models. The common architecture of attention-based models consists of an encoder, a decoder, and an attention mechanism. The encoder produces a state at each input timestep, the decoder maintains a state at each output timestep, and the attention mechanism supplies a context vector. The models explored in this work use encoders with bi-directional RNNs. The context vector is computed based on the encoder hidden states through the use of an attention mechanism. Within this class of models, a variety of underlying attention mechanisms can be used. Below we describe the computation of the context vector with respect to each attention mechanism explored in this work.

2.1 Soft attention

In the standard soft attention model, the attention context is computed based on the entire sequence of encoder hidden states, which fundamentally limits the length of sequences this attention model can scale to, for two reasons. Firstly, the attention computation cost at each decoding step is linear in the sequence length, so when the source sequence is very long, computing the attention context becomes too expensive. Secondly, when the sequence is very long, the attention mechanism can easily get confused, resulting in a non-monotonically moving attention head. In our experiments, we show that a soft attention model trained on short utterances has difficulty scaling to long utterances and suffers from a high deletion rate.
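To make the per-step cost concrete, the following is a minimal sketch of one soft attention step. The dot-product energy and the function name `soft_attention` are assumptions made for illustration; they are not the energy function or implementation used in our experiments.

```python
import math


def soft_attention(query, encoder_states):
    """Return (weights, context) for one decoding step of soft attention."""
    # One energy per encoder state: the work grows linearly with the source length.
    energies = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    # Numerically stable softmax over the *entire* source sequence.
    peak = max(energies)
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

Because every encoder state contributes to every decoding step, nothing in this computation prevents the weights from jumping backwards or skipping ahead on long inputs.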

This problem with the soft attention model can be mitigated by exploiting the fact that in ASR, the alignment between source and target is always monotonic. Given where the attention head was at the previous decoding step, one can limit the focus to only a subsequence of the encoder hidden states when computing the attention context for the next decoding step. In the rest of this section, we describe a few variants of the soft attention model that exploit this constraint in different ways.

2.2 Monotonic attention


Raffel et al. [15] proposed an attention mechanism that scans the sequence of the encoder hidden states in a left-to-right order and selects a particular encoder state for computing the context vector. The selection probability is computed by passing an energy function through a logistic function to parameterize a Bernoulli random variable. The hard monotonic decision process, however, prevents the attention mechanism from being trained with standard backpropagation. To solve this problem, they proposed to replace the one-hot attention vector (it is 1 for the chosen encoder state and 0 elsewhere) with a soft expected attention probability vector during training.

With the monotonic attention mechanism, at each decoding step the decision process starts from the previously selected state and makes a frame-by-frame decision sequentially. This focuses the attention decision to only a sub-sequence of the encoder output, and thus in theory has better potential to scale to long-form utterances compared to the standard soft attention mechanism.
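The frame-by-frame decision can be sketched as follows for inference. This is an illustrative sketch only: the function name `monotonic_attend`, the precomputed `energies` list, and the use of a fixed 0.5 threshold in place of sampling the Bernoulli variable are assumptions.

```python
import math


def monotonic_attend(energies, start, threshold=0.5):
    """Scan encoder positions left to right from `start`.

    Returns the index of the first position whose selection probability
    reaches `threshold`, or None if the end of the sequence is reached.
    """
    for j in range(start, len(energies)):
        p_select = 1.0 / (1.0 + math.exp(-energies[j]))  # logistic function
        if p_select >= threshold:
            return j
    return None
```

At the next decoding step the scan would resume from the returned index, so each step inspects only the encoder states between two consecutive selections rather than the whole sequence.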

2.3 Monotonic Chunkwise Attention

While the monotonic attention mechanism provides better scalability for long sequences, it limits itself to considering only a single encoder state and therefore reduces the power of the attention model. The monotonic chunkwise attention (MoChA) [16] mechanism remedies this by allowing an additional lookback window over which soft attention is applied. The context vector in MoChA is more similar to that of standard soft attention in that it is a weighted combination of a set of encoder states, as opposed to the monotonic attention mechanism, which uses only a single encoder state.
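As a rough sketch of the lookback window, the function below applies a softmax only to the chunk of encoder positions that ends at the monotonically selected index. The name `chunkwise_weights` and the precomputed `chunk_energies` are hypothetical; the selected index is assumed to come from a separate monotonic decision.

```python
import math


def chunkwise_weights(chunk_energies, selected, chunk_size):
    """Softmax over the `chunk_size` encoder positions ending at `selected`."""
    start = max(0, selected - chunk_size + 1)
    window = chunk_energies[start:selected + 1]
    peak = max(window)
    exps = [math.exp(e - peak) for e in window]
    total = sum(exps)
    # Positions outside the chunk receive zero weight.
    weights = [0.0] * len(chunk_energies)
    for offset, e in enumerate(exps):
        weights[start + offset] = e / total
    return weights
```

With `chunk_size=1` the weights collapse to a one-hot vector at the selected index, which is the monotonic attention case.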

2.4 Monotonic Infinite Lookback Attention

The MoChA mechanism extends the capability of the monotonic attention mechanism by allowing it to look back over a fixed window of encoder states from the current attention head. This fixed window size may still limit the full potential of the attention mechanism. The monotonic infinite lookback attention (MILk) mechanism was proposed in [17] to allow the attention window to look back all the way to the beginning of the sequence.

The MILk attention mechanism has to be coupled with a latency loss that encourages the model to make the emission decision earlier. To see why, note that without the latency loss, the model may decide to wait until the end of the source sequence to make even the first prediction, which effectively recovers the standard soft attention mechanism and loses the benefit brought by the monotonic attention mechanism.

2.5 GMM monotonic attention

Graves [18] proposed GMM attention, which explicitly enforces that the mode of the probability mass generated by the attention module always moves incrementally toward the end of the source sequence. The selection probability over encoder positions at each output timestep is defined by a mixture of Gaussian functions. The parameters of the GMM distribution (Eq. 5) are estimated by a single-layer feedforward network. We added a variance floor to make training more stable.
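The following sketch illustrates one step of a GMM-style monotonic attention under stated assumptions. The parameterization (exponentiated step sizes, softmax mixture weights, normalized Gaussian densities), the function name `gmm_attention_step`, and the default `variance_floor` value are all placeholders chosen for illustration, not the configuration used in this work.

```python
import math


def gmm_attention_step(prev_means, raw_steps, variances, mix_logits,
                       num_encoder_steps, variance_floor=1e-3):
    """Return (new_means, weights) for one decoding step."""
    # exp() keeps every step positive, so each mean can only move forward.
    means = [m + math.exp(s) for m, s in zip(prev_means, raw_steps)]
    # Flooring the variance avoids degenerate, infinitely sharp components.
    floored = [max(v, variance_floor) for v in variances]
    # Softmax turns unconstrained logits into mixture weights.
    peak = max(mix_logits)
    exps = [math.exp(l - peak) for l in mix_logits]
    mix = [e / sum(exps) for e in exps]
    weights = []
    for j in range(num_encoder_steps):
        density = 0.0
        for pi, mu, var in zip(mix, means, floored):
            density += (pi * math.exp(-0.5 * (j - mu) ** 2 / var)
                        / math.sqrt(2.0 * math.pi * var))
        weights.append(density)
    return means, weights
```

In a real model the raw step sizes, variances, and mixture logits would be predicted from the decoder state by the feedforward network; here they are passed in directly.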

3 RNN Transducer

Besides attention-based models, RNN-T [3, 4] has been used successfully to build end-to-end models for speech recognition [2]. RNN-T is most similar to the monotonic attention model in that both models scan the encoder states sequentially to select a particular encoder state as the next context vector. This sequential scanning property is essential in allowing RNN-T to scale well to long utterances.

At decoding time, given a new encoder state, both the RNN-T model and the monotonic attention model make a “predict/no-predict” decision. The two models differ, however, in how this decision affects the decoder’s token prediction. In the monotonic attention mechanism, if a “predict” decision is made, the decoder takes the encoder state as attention context to make a token prediction. If a “no-predict” decision is made instead, the decoder does nothing and simply waits for the next encoder state. In contrast, RNN-T treats “no-predict” as one of the output tokens. Essentially, in RNN-T the “predict/no-predict” decision happens at the output level.
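A minimal greedy decoding loop illustrates how “no-predict” appears as an ordinary output token in RNN-T. Here `predict_token` is a hypothetical stand-in for the combined prediction and output networks, and `max_symbols_per_step` is an assumed safety cap; neither is taken from our system.

```python
BLANK = "<blank>"


def rnnt_greedy_decode(encoder_states, predict_token, max_symbols_per_step=5):
    """Greedy RNN-T-style decoding over a sequence of encoder states."""
    hypothesis = []
    for state in encoder_states:
        # Several tokens may be emitted before moving to the next encoder state.
        for _ in range(max_symbols_per_step):
            token = predict_token(state, hypothesis)
            if token == BLANK:
                # "no-predict" is just another output token: advance the encoder.
                break
            hypothesis.append(token)
    return hypothesis
```

The decoder consumes one encoder state at a time and never revisits earlier states, which is the sequential scanning property discussed above.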

In training the RNN-T model, we compute the sum of probabilities over all valid combinations of “predict/no-predict” choices with an efficient dynamic programming algorithm; see [3, 4] for details. In training the monotonic attention model, we compute the expected attention probabilities over the source sequence in order to avoid backpropagating through discrete “predict/no-predict” choices; see [15] for more details.

A comparison of how each model selects encoder states for prediction is shown in Fig. 1.

4 Overlapping Inference

Figure 2: Overlapping inference. The algorithm first breaks a long utterance into overlapped segments, each of which is then transcribed independently. It then merges the transcripts from overlapped segments into a consensus transcript for the original long utterance. In case there are conflicts in predictions, it prefers the predictions further from the utterance boundary.

Overlapping inference is a decoding strategy that we propose in order to further improve the performance of attention-based models on long-form audio. In general, due to various constraints, end-to-end models are often trained on short utterances only. Hence, there is an inherent mismatch between training and inference when a model trained on short utterances alone is used to transcribe long utterances. Overlapping inference is designed to bridge this train/inference mismatch.

A straightforward approach to this train/inference mismatch problem is to break a long utterance into fixed length segments, and then transcribe each segment independently. This however will result in deteriorated performance especially at the segment boundaries, for two reasons. First, a segment boundary may cut through the middle of a word, making it impossible to recover the original word from either of the segments, as illustrated in Fig 2. Second, the recognition quality can be poor at the beginning of a segment due to lack of context. A smarter segmenter can be used, e.g. based on some voice activity detection algorithms to segment only when there is a sufficiently long silence. However, those segmenters can still produce long segments when no sufficiently long pause/silence is detected.

Overlapping inference improves over the aforementioned approaches based on fixed-length or smarter segmenters with a simple trick: it breaks a long utterance into overlapping segments. In our experiments, we chose 50% overlap, which means that any point of audio is covered by exactly two segments. The information lost at a segment boundary can always be recovered by referring to the other overlapping segment.
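The windowing itself can be sketched in a few lines. The function name `overlapping_windows` and the choice to clip the final window at the end of the audio are assumptions made for this illustration.

```python
def overlapping_windows(total_len, window_len):
    """Return (start, end) pairs of fixed-length windows with 50% overlap."""
    hop = window_len / 2  # 50% overlap: each window starts half a window later
    windows = []
    start = 0.0
    while True:
        windows.append((start, min(start + window_len, total_len)))
        if start + window_len >= total_len:
            break
        start += hop
    # Note: in this sketch the first and last half-window are covered once.
    return windows
```

For a 40-second utterance and 16-second windows this yields four windows starting at 0, 8, 16, and 24 seconds, so each window is transcribed independently and twice as much audio is decoded overall.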

4.1 Combine overlapping windows with 50% overlap

In overlapping inference, we create windows of a fixed length and a fixed overlap of 50%. A special property of this setup is that any word in the utterance will always get recognized twice, by two consecutive windows. Conveniently, two parallel hypotheses can be constructed: one as the concatenation of the words recognized in the odd-numbered windows and one as the concatenation of the words recognized in the even-numbered windows. Each word in a hypothesis is identified by its window and by its position within that window.
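A small sketch shows how the two parallel hypotheses could be assembled from per-window transcripts. The function name `parallel_hypotheses` and the `(word, window_index)` tuple layout are illustrative assumptions.

```python
def parallel_hypotheses(window_transcripts):
    """Split per-window word lists into odd-window and even-window hypotheses.

    Windows are numbered from 1. Each word is tagged with its window number
    so that later steps know which window produced it.
    """
    odd, even = [], []
    for idx, words in enumerate(window_transcripts, start=1):
        target = odd if idx % 2 == 1 else even
        target.extend((word, idx) for word in words)
    return odd, even
```

Because consecutive windows overlap by half, the windows within each hypothesis do not overlap one another, so each hypothesis on its own reads as a plain segmented transcript.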

4.1.1 Matching the two hypotheses

Next, we search for the best matching between the odd-window and even-window hypotheses. The problem closely resembles the edit distance minimization that is commonly used in WER calculation, with only a minor difference: it disallows matching words that are more than one window apart. In other words, the process only matches words whose windows overlap. The solution can be found efficiently using a dynamic programming algorithm. The result of matching is a sequence of word pairs. Each pair holds a word from the odd-window hypothesis and a word from the even-window hypothesis, each tagged with the window it came from, and either side of a pair may instead be a marker denoting no prediction.
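The constrained matching can be sketched as a standard edit-distance alignment with one extra check. The unit costs, the function name `match_hypotheses`, and the `(word, window_index)` representation are assumptions for illustration; `None` plays the role of the no-prediction marker.

```python
def match_hypotheses(odd, even):
    """Align two lists of (word, window) items; None marks 'no prediction'."""
    n, m = len(odd), len(even)
    inf = float("inf")
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j

    def pair_cost(i, j):
        (wo, ko), (we, ke) = odd[i - 1], even[j - 1]
        if abs(ko - ke) > 1:
            return inf  # words more than one window apart may not be matched
        return cost[i - 1][j - 1] + (0 if wo == we else 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(pair_cost(i, j), cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the end to recover the sequence of matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == pair_cost(i, j):
            pairs.append((odd[i - 1], even[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((odd[i - 1], None))
            i -= 1
        else:
            pairs.append((None, even[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs
```

Words that appear in only one hypothesis, such as those in the uncovered half of the first or last window, end up paired with `None`.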

4.1.2 Tie-breaking

During inference, models generally see more contextual information for words further away from window boundaries, and our approach therefore assigns higher confidence to those words. Concretely, we define a confidence score based on the relative location of a word in its window, computed from the starting time of the window and the starting time of the word. The score peaks at the center of the window and decays linearly towards the boundaries on both sides. For the RNN-T model, we define the starting time of a word as the time step at which the model decides to emit the word, and in the case of no prediction the process uses the starting time of the matched word as the starting time of the missing word. For attention-based models, we use the relative position of the word instead, and the score simplifies to depend only on the position of the word among the matched words in its window. The final hypothesis selects, from each matched pair, the word with the higher confidence score.
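The tie-breaking rule can be sketched as follows. The exact scaling of the score is an assumption: any function that peaks at the window center and decays linearly towards both boundaries gives the same selections. The names `confidence` and `merge_pairs` and the `(word, word_start, window_start)` layout are hypothetical.

```python
def confidence(word_start, window_start, window_len):
    """Score that peaks at the window center and decays linearly to the edges."""
    center = window_start + window_len / 2
    return -abs(word_start - center) / window_len


def merge_pairs(pairs, window_len):
    """Pick the more confident side of each matched pair.

    Each side is (word_or_None, word_start, window_start). A None word is a
    'no prediction' that borrows the start time of the word it was matched to.
    """
    merged = []
    for odd_item, even_item in pairs:
        best = max(
            (odd_item, even_item),
            key=lambda item: confidence(item[1], item[2], window_len),
        )
        if best[0] is not None:  # a confident 'no prediction' drops the word
            merged.append(best[0])
    return merged
```

With 16-second windows, a word starting at 9 seconds is near the center of the window starting at 0 seconds and near the edge of the window starting at 8 seconds, so the prediction from the first window wins.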

We note that with 50% overlap, overlapping inference doubles the inference computation cost. We are exploring ways to cut down this computational cost by reducing the overlap at the boundaries.

5 Experiments

Table 1: Word error rates of end-to-end models on the YouTube test set. Columns: Model, Original, Segment (two thresholds), Overlapping inference (two chunk sizes). For Original, the utterances are segmented based on the appearance of silence. The resulting segmented utterances range from a few seconds to a few minutes long. Segment corresponds to imposing an additional segmentation threshold on the silence-segmented utterances, where utterances longer than the threshold are force-segmented. We compare the results of using overlapping inference with two chunk sizes. *: the word error rates of CTC models are from [12] and are obtained with language models.

We conduct our experiments on the YouTube dataset, the same dataset as used in [12]. YouTube videos cover a wide variety of different domains [20] and have a wide range of lengths, making them an ideal test bed for this long-form transcription study.

As in [12], our training data consists of English utterances extracted according to an island-of-confidence approach [14]. To run the algorithm, a pre-existing conventional ASR system is used. This system is adapted on a per-video basis with a video-specific language model built from the user-uploaded transcripts. The segments (also called ‘islands’) where the ASR-produced transcript exactly matches the user-uploaded transcript are extracted as our training data. In total, 125 thousand hours of data were extracted. Most of the extracted segments are short utterances, and at training time we cap the length of our training utterances.

The test set comprises 296 videos of varying length. The videos in the test set are much longer than the training samples, and hence ASR models trained on short utterances need to scale to long videos in order to perform well.

The input uses log-Mel features computed over short sliding windows. Each input time step stacks several frames of these features, with context frames taken from both the left and the right, and the stacked features are downsampled to a lower frame rate. We compared the soft attention, monotonic attention, monotonic chunkwise attention, monotonic infinite lookback attention, GMM-based monotonic attention, and RNN-T models. All our end-to-end models have an encoder composed of multiple layers of bi-directional LSTMs. The architecture of the attention models follows the same design as the bi-directional model described in [1], but does not use scheduled sampling, minimum word error rate training [10], or second-pass language model rescoring. For the MoChA model we use a fixed chunk size. In the MILk model we applied a latency loss different from the one proposed in [17], as the original latency loss is tailored for machine translation, where the source and target sequences have similar lengths. Our latency loss minimizes the root-mean-square value of the interval between two consecutive emissions.
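The quantity minimized by this latency loss can be illustrated on a list of emission times. This is a sketch under assumptions: the function name `rms_emission_interval` is hypothetical, and during training the emission times would be expected positions under the soft alignment rather than hard integers.

```python
import math


def rms_emission_interval(emission_times):
    """Root-mean-square of the gaps between consecutive emission times."""
    intervals = [b - a for a, b in zip(emission_times, emission_times[1:])]
    if not intervals:
        return 0.0  # fewer than two emissions: no interval to penalize
    return math.sqrt(sum(d * d for d in intervals) / len(intervals))
```

Squaring the intervals penalizes a single long wait more than several short ones, so emissions spread evenly over the source score lower than emissions that stall and then catch up late.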


The GMM monotonic attention model uses a mixture of multiple components. As for the RNN-T model, the encoder is the same as in the attention-based models, the prediction network consists of LSTM layers with a projection [21] per layer, and the output network is followed by a softmax layer that predicts graphemes. The prediction and output network architecture is the same as that of the RNN-T grapheme model described in [2]. All models are implemented with TensorFlow-Lingvo [22], and the RNN-T model further utilizes techniques described in [23, 24] to improve training efficiency.

The results are summarized in Table 1. As the word error rates (WERs) show, attention models have problems scaling up to long utterances. In particular, the high WERs are mostly due to deletion errors. For example, the 67.1 WER of the soft attention model consists of 63.2 deletion errors, 0.7 insertion errors, and 3.1 substitution errors. This implies that when recognizing long-form utterances, attention models fail to generalize to the whole utterance and only produce transcripts for a sub-sequence of it. The attention approaches that exploit the monotonic alignment property all perform better than the vanilla soft attention model, but they still exhibit serious problems generalizing to long utterances. Among all the attention mechanisms, the GMM-based monotonic attention model performs the best. The RNN-T model is robust to long-form utterances, and in fact achieves the best quality when longer utterances are preserved. In [12], the authors reported that CTC end-to-end models with bi-directional LSTMs outperform phone-based models. The attention models are significantly worse than the CTC models on long-form speech recognition, but the RNN-T model is able to outperform them.
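As a quick check on the soft attention breakdown quoted above, the short sketch below computes the share of each error type. The function name `error_shares` is illustrative. The three components sum to 67.0, which agrees with the reported 67.1 up to rounding.

```python
def error_shares(deletions, insertions, substitutions):
    """Fraction of the total error contributed by each error type."""
    total = deletions + insertions + substitutions
    return {
        "deletion": deletions / total,
        "insertion": insertions / total,
        "substitution": substitutions / total,
    }


shares = error_shares(63.2, 0.7, 3.1)
```

Deletions account for roughly 94% of the errors, which is consistent with the model transcribing only part of each long utterance.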

The issue of long-form speech recognition with attention models can be addressed with the use of overlapping inference. By segmenting the utterances into smaller chunks that are closer in length to the training utterances, the attention-based models can provide competitive quality. Simply segmenting utterances into smaller sub-sequences can lead to missing context at the segmentation boundaries, which explains the quality loss between the "Segment 16s" approach and the "Overlapping inference 16s" approach. With the help of overlapping inference, the MILk model provides the best quality among the end-to-end models, though the differences from the other attention-based models are not large.

Segmenting utterances into shorter sub-sequences also causes the model to lose contextual information. The results of the RNN-T model provide a reference for measuring this information loss. With regular threshold-based segmentation, the RNN-T model suffers a relative quality loss. In comparison, overlapping inference was able to recover this quality loss.

6 Conclusions

In this work we compare various end-to-end models for long-form speech recognition. The end-to-end models are trained on short utterances and evaluated on much longer utterances. The evaluation results show that the RNN-T model is able to scale to long utterances and provides very strong quality. The attention-based models in general cannot generalize to long-form utterances. The GMM-based monotonic attention model performs the best among all attention models on this task, but still lags significantly behind the RNN-T model. We show that by incorporating overlapping inference, we can improve the performance of the attention-based models to be very competitive with that of the RNN-T model.

7 Acknowledgement

We would like to thank Hagen Soltau for his contributions to the YouTube dataset and the recipe for training the RNN-T model.


  • [1] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • [2] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo yiin Chang, Kanishka Rao, and Alexander Gruenstein, “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  • [3] A. Graves, “Sequence transduction with recurrent neural networks,” CoRR, vol. abs/1211.3711, 2012.
  • [4] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  • [5] A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” in Proc. ICML, 2014.
  • [6] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” CoRR, vol. abs/1508.01211, 2015.
  • [7] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” in Proc. NIPS, 2015.
  • [8] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-End Attention-based Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2016.
  • [9] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring Neural Transducers for End-to-End Speech Recognition,” in Proc. ASRU, 2017.
  • [10] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A Comparison of Sequence-to-sequence Models for Speech Recognition,” in Proc. Interspeech, 2017.
  • [11] Y. Zhang, W. Chan, and N. Jaitly, “Very Deep Convolutional Networks for End-to-End Speech Recognition,” in Proc. ICASSP, 2017.
  • [12] Hagen Soltau, Hank Liao, and Hasim Sak, “Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition,” in Interspeech 2017, 2017.
  • [13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. ICML, 2006.
  • [14] Hank Liao, Erik McDermott, and Andrew Senior, “Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013.
  • [15] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, “Online and linear-time attention by enforcing monotonic alignments,” in Proceedings of the 34th International Conference on Machine Learning, Doina Precup and Yee Whye Teh, Eds., International Convention Centre, Sydney, Australia, 06–11 Aug 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 2837–2846, PMLR.
  • [16] Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” in International Conference on Learning Representations, 2018.
  • [17] Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in ACL, 2019.
  • [18] Alex Graves, “Generating sequences with recurrent neural networks,” 2013.
  • [19] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, “Local monotonic attention mechanism for end-to-end speech and language processing,” arXiv preprint arXiv:1705.08091, 2017.
  • [20] Arun Narayanan, Ananya Misra, Khe Chai Sim, Golan Pundak, Anshuman Tripathi, Mohamed Elfeky, Parisa Haghani, Trevor Strohman, and Michiel Bacchiani, “Toward domain-invariant speech recognition via large scale training,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018.
  • [21] Ruoming Pang, Tara Sainath, Rohit Prabhavalkar, Suyog Gupta, Yonghui Wu, Shuyuan Zhang, and Chung-Cheng Chiu, “Compression of end-to-end models,” in Interspeech 2018, 2018.
  • [22] Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, et al., “Lingvo: a modular and scalable framework for sequence-to-sequence modeling,” 2019.
  • [23] Khe Chai Sim, Arun Narayanan, Tom Bagby, Tara N. Sainath, and Michiel Bacchiani, “Improving the efficiency of forward-backward algorithm using batched computation in tensorflow,” in ASRU, 2017.
  • [24] Tom Bagby, Kanishka Rao, and Khe Chai Sim, “Efficient implementation of recurrent neural network transducer in tensorflow,” 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 506–512, 2018.