The last decade has seen rapid improvements in automatic speech recognition (ASR) technology through advances in deep learning. Recently, there has been growing interest in building so-called end-to-end ASR systems – systems consisting of a single neural network, which directly output character-based or word-based units: e.g., connectionist temporal classification (CTC) [2, 3] with character [4, 5] or word [6, 7] targets; attention-based encoder-decoder models [8, 9, 10, 11, 12]; and the recurrent neural network transducer (RNN-T) [13, 14, 15].
Most previous works investigating end-to-end models have evaluated models in the setting where training and test utterances are relatively short (i.e., tens of seconds) and drawn from the same domain¹ [11, 17, 18]. In previous work, we identified two problems that affect end-to-end models. First, we observe that end-to-end models are particularly sensitive to a domain mismatch between training and inference, caused by overfitting to the training domain. Since end-to-end models learn all components jointly, the effect is more pronounced than would be expected in conventional models. A second problem, which is a specific kind of domain mismatch, is the observation that end-to-end models trained on short training segments do not perform well when decoding much longer utterances during inference (e.g., longer YouTube videos) [19, 21]; this problem is particularly acute for non-streaming attention-based models, but, somewhat surprisingly, also affects streaming end-to-end models such as RNN-T.²
¹ In this context, we use ‘domain’ to refer to utterances which share a common property, e.g., audiobooks (long read-speech utterances) or voice search queries (short utterances).
² By streaming models, we refer to models which produce and update hypotheses for each input speech frame (e.g., the CTC, or RNN-T models with unidirectional encoders). Similarly, we refer to models which examine all of the input speech before producing an output hypothesis (e.g., RNN-T with a bi-directional encoder, or attention-based encoder-decoder models) as non-streaming models.
Our previous works proposed a number of solutions to address these problems: training on diverse domains; simulating long-form speech by manipulating the encoder/decoder states; or performing inference over short overlapping segments which can be assembled into the complete hypothesis. Although our proposed solutions improved performance on out-of-domain and long-form audio, our previous works did not characterize the fundamental reasons for the degradation in performance. In the present work, we perform a detailed analysis of the RNN-T model to determine which model components are primarily responsible for this performance degradation, finding that the encoder network in the model is most susceptible to overfitting. In light of this observation, we reinterpret previously proposed solutions [19, 21] as additional regularization constraints imposed on the model to prevent overfitting, and find that combining multiple regularization techniques results in the best performance. In experimental evaluations, we decode a YouTube test set using three RNN-T models: a model trained using short segments of YouTube data; a model trained using short segments of Search data; and a model trained on the Librispeech dataset. We find that combining various regularization techniques improves the models trained on YouTube and Search data by and , respectively. In combination with our proposed dynamic overlapping inference technique (see Section 5), our mismatched Librispeech-trained models show dramatic word error rate (WER) improvements from to on the YouTube test set.
2 RNN Transducer
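The RNN-T model was proposed by Graves [13, 14] as an improvement over CTC. As with CTC, the RNN-T model introduces a special blank symbol, $\langle b \rangle$, which models the alignments between the speech frames, $\mathbf{x} = (x_1, \ldots, x_T)$, and the output label sequence, $\mathbf{y} = (y_1, \ldots, y_U)$. We denote the number of speech frames by $T$, with each $x_t \in \mathbb{R}^d$, and $\mathcal{Y}$ denotes the set of output labels with $y_u \in \mathcal{Y}$. The set of all valid frame-level alignments can be written as: $\mathcal{B}(\mathbf{x}, \mathbf{y}) = \{\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_{T+U})\}$, where $\hat{y}_i \in \mathcal{Y} \cup \{\langle b \rangle\}$, such that $\hat{\mathbf{y}}$ is identical to $\mathbf{y}$ after removing all blank symbols. During training, RNN-T uses the forward-backward algorithm to maximize $P(\mathbf{y} \mid \mathbf{x}) = \sum_{\hat{\mathbf{y}} \in \mathcal{B}(\mathbf{x}, \mathbf{y})} P(\hat{\mathbf{y}} \mid \mathbf{x})$, taking all valid alignments into consideration.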
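To make the sum over alignments concrete, the following is a minimal sketch of the forward half of the forward-backward computation, written in plain Python. It assumes the standard transducer lattice in which each alignment ends with a blank emitted at the last frame; the function names and the table layout (`log_blank[t][u]`, `log_label[t][u]`) are illustrative choices, not taken from the paper's implementation.

```python
import math


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_log_likelihood(log_blank, log_label):
    """Return the log-probability of the label sequence, summed over alignments.

    log_blank[t][u]: log-prob of emitting blank at frame t after u labels.
    log_label[t][u]: log-prob of emitting the (u+1)-th label at frame t
                     after u labels (only read for u < U).
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    # alpha[t][u]: log-prob of having emitted u labels by frame t.
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at the previous frame
                alpha[t][u] = log_add(alpha[t][u],
                                      alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrive by emitting a label at the current frame
                alpha[t][u] = log_add(alpha[t][u],
                                      alpha[t][u - 1] + log_label[t][u - 1])
    # Every alignment terminates with a blank at the last frame.
    return alpha[T - 1][U] + log_blank[T - 1][U]


# Toy lattice: T = 2 frames, U = 1 label, every emission has probability 0.5.
half = math.log(0.5)
toy_blank = [[half, half], [half, half]]
toy_label = [[half, half], [half, half]]
toy_ll = rnnt_log_likelihood(toy_blank, toy_label)
print(math.exp(toy_ll))  # approximately 0.25: two alignments of probability 0.125
```

In the toy lattice there are exactly two valid alignments (label then two blanks, or blank, label, blank), which is why the total probability is twice 0.5 cubed.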
3 The Generalization Problem
This section characterizes the generalization problem for streaming and non-streaming models and presents experimental observations that identify components that contribute to poor generalization.
3.1 Non-streaming ASR Models
Our experimental setup is similar to . The training data is extracted from YouTube videos . The training utterances are generally short: the percentile length is seconds and the percentile is seconds. During training we filter out utterances that are longer than seconds. We evaluate the model on two YouTube test sets, YT-short and YT-long. YT-short is comprised of videos with length ranging from to minutes, with a total duration of hours. YT-long is comprised of videos with length ranging from seconds to minutes, with a total duration of hours. The videos in both test sets are much longer than the training samples, thus allowing us to test the long-form generalization of non-streaming ASR models.
Our RNN-T model’s encoder stacks a macro layer times, where the macro layer consists of -D convolution with filter width and filters with stride, a -D max pooling layer with width and stride , and bidirectional LSTM layers with hidden units in each direction and a -dimensional projection per layer. The prediction network has a unidirectional LSTM with hidden units. The output network has hidden units and the final output uses a k word piece model. As input, the model uses -dimensional log-Mel features, computed with a ms window, shifted every ms.
Observations: As shown in Figure 2, the RNN-T model trained with short utterances exhibits high WER (due to deletion errors) when evaluated on both test sets. Analyzing overall word errors as a function of training steps, we observe that the model starts to introduce higher deletion errors as training proceeds; the phenomenon is particularly significant on YT-long, which has longer utterances. The model stops improving after k steps on YT-short and gets worse on YT-long, which indicates a form of overfitting.
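Because the analysis hinges on WER being dominated by deletions, it can help to see how a WER figure decomposes into error types. The sketch below is a generic Levenshtein alignment over words; it is not the scoring tool used in the paper, and the example strings are invented for illustration.

```python
def wer_breakdown(reference, hypothesis):
    """Return (wer, substitutions, deletions, insertions) for two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # all reference words deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)          # match
            else:
                diag = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            cost[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    edits, subs, dels, inss = cost[len(ref)][len(hyp)]
    return edits / max(len(ref), 1), subs, dels, inss


# A hypothesis that drops half of the reference words.
result = wer_breakdown("the cat sat on the mat", "the cat mat")
print(result)  # (0.5, 0, 3, 0): three deletions out of six reference words
```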
To better understand this overfitting issue we compare various training setups that freeze parts of the model after initially training all components for k steps: updating only the encoder; or the prediction network; or both the prediction and the joint networks. As can be observed in Fig. 2, high word error rates (WERs) are correlated with models that update the encoder. Models with a smaller encoder, or the ones that do not update the encoder layers show better generalization than the baseline model. Furthermore, updating only the encoder results in worse performance on long-form sets than the baseline model, which indicates that the overfitting issue does not simply reflect the number of parameters being updated but is particularly associated with the encoder.
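The freezing setups compared here amount to restricting which parameter groups receive updates. The following toy sketch illustrates the idea with plain SGD on lists of floats; the component names, the `SETUPS` table, and `sgd_step` are hypothetical stand-ins rather than the training code used in the experiments.

```python
def sgd_step(params, grads, trainable, lr=0.5):
    """Apply one SGD step, updating only components named in `trainable`."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: copied unchanged
    return updated


# Hypothetical parameter groups mirroring the three RNN-T components.
params = {"encoder": [1.0, 2.0], "prediction": [0.5], "joint": [1.0]}
grads = {"encoder": [1.0, 1.0], "prediction": [1.0], "joint": [1.0]}

# The training setups compared in the text, expressed as trainable sets.
SETUPS = {
    "baseline": {"encoder", "prediction", "joint"},
    "encoder_only": {"encoder"},
    "prediction_only": {"prediction"},
    "prediction_and_joint": {"prediction", "joint"},
}

after = {name: sgd_step(params, grads, trainable)
         for name, trainable in SETUPS.items()}
print(after["encoder_only"])          # only the encoder weights move
print(after["prediction_and_joint"])  # the encoder stays at [1.0, 2.0]
```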
To analyze why encoder overfitting results in high deletion errors, we sample one video from YT-long, and segment the first seconds of audio. We then evaluate the model at k and k steps on these segments and compare the probability of blank prediction for the first steps. Note that the encoder is bidirectional, so the entire audio segment will influence the predictions for the first steps. Furthermore, the first frames contain no speech and the models should, ideally, predict blanks with high probability. As can be seen in Fig. 3, at k steps, the model has similar confidence across different utterance lengths. On the other hand, at k steps the model’s confidence varies considerably with respect to the utterance length, in particular for utterances longer than seconds. Note that during training the model has only seen utterances less than seconds long. Thus, as training proceeds, the encoder’s prediction for blank symbols fails to generalize well on long utterances. Since this model has bidirectional LSTMs, the backward LSTM contributes towards the high variance of blank probability at the first few steps of predictions. This affects the model’s hypotheses in multiple ways: first, a high blank probability causes partial hypotheses consisting of a sequence of blank tokens to have a higher probability than the correct sequence; second, this blank sequence eventually causes other partial hypotheses that have fewer blanks and are more accurate to be dropped from the beam, causing additional search errors. As a result, the WER is dominated by deletions.
3.2 Streaming ASR Models
A typical streaming application is voice search on mobile phones. We, therefore, choose this task for the streaming use-case. Due to latency constraints the model size is more limited than in the non-streaming case, and our experimental setup mimics that in . 128-dimensional log-mel features from 4 contiguous frames are stacked to form a 512-dimensional input, which is then subsampled by a factor of 3 along the time dimension. The RNN-T model has 8 encoder layers made up of unidirectional LSTMs. Each layer has 2048 units and a projection layer with 640 output units. The decoder consists of 2 unidirectional LSTMs, also with 2048 units and 640 projections, similar to the encoder layers. The joint network has a single layer with 640 units. The target is represented by a sequence of word piece tokens, with a vocabulary size of 4096.
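The input pipeline described above (stack 4 contiguous 128-dimensional frames, then subsample by 3) can be sketched as follows. The text does not say whether stacking uses left or right context or how edges are padded, so this sketch assumes right context with no padding; `stack_and_subsample` is an illustrative name.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Stack `stack` contiguous frames, then keep every `stride`-th result.

    Assumption: each stacked frame covers frames t .. t+stack-1 (right
    context) and incomplete windows at the end are dropped.
    """
    stacked = []
    for t in range(len(frames) - stack + 1):
        window = frames[t:t + stack]
        stacked.append([value for frame in window for value in frame])
    return stacked[::stride]


# Ten toy 128-dimensional frames; frame t is filled with the value t.
toy_frames = [[float(t)] * 128 for t in range(10)]
stacked_frames = stack_and_subsample(toy_frames)
print(len(stacked_frames), len(stacked_frames[0]))  # 3 512
```

With ten input frames there are seven full windows, and keeping every third one leaves the windows that start at frames 0, 3, and 6.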
The training data consists of anonymized and hand-transcribed utterances representative of Google search traffic. We use multicondition training (MTR) to simulate noisy conditions, and randomly downsample the data from 16 kHz to 8 kHz to improve generalization to varying input sample rates. The training utterances are short, with mean and median duration of and seconds, respectively. The percentile is seconds. The and percentiles for the target sequences of word pieces are, respectively, and tokens. As test sets, we use a mix of in-domain and out-of-domain data. A test set similar to the training domain and composed of hours of anonymized and hand-transcribed search queries forms the in-domain test set (Search; median length is seconds). Our out-of-domain test set consists of 7 hours of speech generated using a text-to-speech system, which is acoustically simple but much longer than the training utterances (TTS-Audiobook; median length is seconds).
Observations: Similar to the non-streaming models, we observe that the streaming model also overfits to the training domain after approximately k steps as shown in Fig. 4. Unlike non-streaming models, however, performance keeps improving on the in-domain Search set with more training. Performance on the TTS-Audiobook set, in contrast, gets worse as training progresses. Next, we freeze parts of the model after k steps, as before, and continue training just the encoder; the prediction network; or both the prediction and joint networks. Confirming the observations made for the non-streaming models, freezing the encoder layers and only updating the prediction, or the prediction and the joint layers prevents this overfitting behavior. After k steps, both the baseline, which updates all model parameters, and the model that only updates encoder layers obtain WERs of and , respectively, on TTS-Audiobook; the model that only updates the prediction network obtains a WER of . Thus, the degradation in performance is significantly less severe when the model does not update the encoder after k steps.
The results presented in the streaming and the non-streaming case indicate that the encoder is most responsible for the overfitting behavior of RNN-T. In the next section, we explore various regularization strategies to reduce overfitting.
4 Regularization Cocktail
As the generalization issue is caused by encoder overfitting, it can be remedied effectively through regularization. Different domains and architectures can benefit from different regularization techniques, and thus we combine them during training to create a regularization cocktail:
Variational Weight Noise: Variational weight noise adds Gaussian noise to the weight matrix during training, and has been shown to be effective in improving generalization. In our approach we start the training process without noise, and start adding it after a predefined number of steps. The weight noise is re-sampled at every training step.
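As a rough illustration of this schedule, the sketch below perturbs a copy of the weights with Gaussian noise once a start step has been reached, drawing fresh noise on every call. The function name and the `start_step` and `stddev` values are assumptions made for the example; the source does not state the values used.

```python
import random


def noisy_weights(weights, step, start_step, stddev, rng):
    """Return the weights to use at `step`, with noise once it is enabled.

    The stored weights are never modified; only the returned copy is
    perturbed, and a new noise sample is drawn on every call.
    """
    if step < start_step:
        return list(weights)  # noise not yet enabled
    return [w + rng.gauss(0.0, stddev) for w in weights]


rng = random.Random(0)
clean = [1.0, 2.0, 3.0]
early = noisy_weights(clean, step=5, start_step=10, stddev=0.1, rng=rng)
late_a = noisy_weights(clean, step=10, start_step=10, stddev=0.1, rng=rng)
late_b = noisy_weights(clean, step=11, start_step=10, stddev=0.1, rng=rng)
print(early == clean, late_a != clean, late_a != late_b)
```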
Random state sampling and random state passing: Random state sampling (RSS) and random state passing (RSP) were proposed in prior work as a way to address generalization of streaming RNN-T models to long-form speech. RSS assumes that LSTM states follow a normal distribution and samples initial LSTM states from it during training. RSS is readily applicable for bidirectional models as well. RSP, on the other hand, saves LSTM states from each mini-batch during training, and uses them as initial states for examples in the subsequent batch. When used with unidirectional models, this mimics random concatenation of examples during training.
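The two techniques differ only in where the initial recurrent state comes from. The sketch below uses a toy one-unit recurrent cell in place of an LSTM so that it stays self-contained; the cell, its weights, and all function names are hypothetical, and the sketch assumes every mini-batch has the same size.

```python
import math
import random


def rnn_final_state(inputs, h0, w_in=0.5, w_rec=0.9):
    """Toy one-unit recurrent cell standing in for an LSTM layer."""
    h = h0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def sample_initial_states(batch_size, mean, stddev, rng):
    """RSS: draw each example's initial state from a normal distribution."""
    return [rng.gauss(mean, stddev) for _ in range(batch_size)]


def run_with_state_passing(batches):
    """RSP: final states of one mini-batch initialize the next mini-batch."""
    states = [0.0] * len(batches[0])
    history = []
    for batch in batches:
        initial = states
        states = [rnn_final_state(x, h0) for x, h0 in zip(batch, initial)]
        history.append((initial, states))
    return history


# Two mini-batches of two examples each (each example is a list of inputs).
toy_batches = [[[0.1, 0.2], [0.3]], [[0.4], [0.5, 0.6]]]
history = run_with_state_passing(toy_batches)
rss_states = sample_initial_states(2, mean=0.0, stddev=1.0,
                                   rng=random.Random(0))
print(history[1][0] == history[0][1])  # True: states carried across batches
```

Because the second batch starts from the first batch's final states, a unidirectional model effectively sees the two examples in each batch slot as one longer utterance.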
5 Dynamic Overlapping Inference
In addition to improving generalization during training, we attempt to improve generalization during decoding with overlapping inference. This method segments a long utterance into multiple fixed-length segments which are decoded independently. Since each segment lacks context from neighboring segments, we allow some overlap between successive segments, and merge the decoded hypotheses in the overlapped region. The original method was proposed in the context of models which do not have any alignment information for the hypothesis; it therefore required a 50% overlap between segments, which roughly doubles the computational cost compared to regular inference.
Here, we extend overlapping inference to relax the 50% overlap requirement. Our proposed algorithm, dynamic overlapping inference (DOI), uses the frame-level alignment obtained from each RNN-T hypothesis (i.e., we use the frame associated with each non-blank label) to match and merge hypotheses between segments. This allows us to significantly relax the 50% overlap requirement, thus greatly increasing computational efficiency. The process is illustrated in Fig. 5.
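One simple way to use per-label frame indices when merging is sketched below. This is a deliberately simplified stand-in, not the matching procedure of the paper: it assumes each decoded word carries a global frame index and resolves every overlap by cutting at its midpoint, keeping words before the cut from the earlier segment and words after it from the later one. The segment boundaries and words are invented.

```python
def merge_overlapping(segments):
    """Merge per-segment hypotheses using the frame index of each word.

    segments: list of (start_frame, end_frame, tokens), where tokens is a
    list of (word, global_frame) and consecutive segments overlap.
    Assumption: each overlap is split at its midpoint.
    """
    merged = []
    for k, (start, end, tokens) in enumerate(segments):
        # Lower cut: midpoint of the overlap with the previous segment.
        lo = start if k == 0 else (start + segments[k - 1][1]) // 2
        # Upper cut: midpoint of the overlap with the next segment.
        last = k == len(segments) - 1
        hi = end if last else (segments[k + 1][0] + end) // 2
        merged.extend(word for word, frame in tokens if lo <= frame < hi)
    return merged


# Two segments overlapping on frames 80..99; words near a segment edge are
# unreliable ("mu" and "azz" are truncated by the segment boundary).
toy_segments = [
    (0, 100, [("play", 10), ("some", 40), ("jazz", 85), ("mu", 98)]),
    (80, 180, [("azz", 82), ("music", 99), ("please", 130)]),
]
merged_words = merge_overlapping(toy_segments)
print(" ".join(merged_words))  # play some jazz music please
```

Cutting at the midpoint keeps each word from the segment in which it has the most surrounding context, which is why a small overlap can suffice when frame indices are available.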
We evaluate the generalization performance of the proposed techniques for the models described in Sec. 3. The experimental setup is identical to that in Sec. 3. All models are implemented with Lingvo.
|+ RSS + VN||9.1||9.0||14.8||14.9||19.3||19.2|
Non-Streaming Models: The results using the regularization cocktail and dynamic overlapping inference (DOI) are shown in Tab. 1. All regularization techniques and their combinations help improve performance. In particular, SpecAugment + RSS + VN obtains a improvement on YT-short and a improvement on YT-long, and DOI obtains and improvement on YT-short and YT-long, respectively. We further evaluate the model on a long-form call-center test set described in to assess its robustness on an unseen domain. The proposed regularization cocktail improves WER by and when using regular inference and DOI, respectively. In general, DOI provides significant improvement when models have generalization issues on the target domain, and provides similar quality to regular inference for models that do not have this issue.
|+ RSP + VN||11.9||25.3|
Streaming Models: Results are shown in Tab. 2. As with the non-streaming models, the model with multiple regularizations gave the best improvements. SpecAugment + VN + RSP obtains a 76% improvement on TTS-Audiobook and a 62% improvement on YT-short. Other combinations also help, but are slightly worse than SpecAugment + VN + RSP. It should be noted that some of these models still perform better at the k checkpoint. For example, SpecAugment + VN obtains 14.5% and 22.5% on TTS-Audiobook and YT-short, respectively, at k steps. Although the combination doesn’t completely prevent overfitting, the degradation, as the model converges on the training data, is much lower than for the baseline. We note that DOI does not help with streaming models, likely because it relies on alignment and end-to-end streaming models are known to produce poor alignments unless they are constrained during training.
|Librispeech test clean|
|Librispeech test other|
Librispeech: The final set of results is for the RNN-T model trained on Librispeech. We follow the architecture of LAS-6-1280 described in . The prediction network has the same LSTM setup as the LAS decoder. The joint network has hidden units, and uses the same word piece model. The results are shown in Tab. 3. Despite achieving low WERs on the Librispeech test sets, the model exhibits high deletion errors on YT-short. DOI reduces the deletion error from to . The model still has a WER of after using DOI, mainly due to substitutions () caused by phonetically similar words. This is likely caused by the limited vocabulary the Librispeech model is exposed to during training.
The Librispeech RNN-T model exhibited WERs on YT-short even at early stages of training, and, therefore, multiple regularizations do not remedy the issue as well as DOI. The regularization cocktail mainly addresses generalization to new domains. When the gap between training and test domains is large, addressing the long-form issue during inference provides a more robust solution.
This work presents an analysis of the generalization problem observed in RNN-T based end-to-end ASR models. Our results demonstrate that the model’s affinity to predict blank sequences when there is a mismatch between training and test distributions causes this problem, which results in high deletion rates. Our analysis identified the root cause of this problem to be encoder overfitting. We proposed a regularization cocktail that significantly improves the performance of streaming and non-streaming RNN-T models trained with large-scale data. For models trained on a smaller dataset, where regularization alone doesn’t improve performance, we proposed a dynamic overlapping inference strategy that significantly improves generalization. Future work will explore alternative model architectures and regularization techniques that address the generalization of models trained on smaller datasets.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov 2012.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. of ICML, 2006, pp. 369–376.
-  A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” in Proc. of ICML, 2014, pp. 1764–1772.
-  A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling Up End-to-End Speech Recognition,” arXiv preprint arXiv:1412.5567, 2014.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,” in Proc. of ICML, 2016, pp. 173–182.
-  H. Soltau, H. Liao, and H. Sak, “Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition,” in Proc. of Interspeech, 2017, pp. 3707–3711.
-  K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct Acoustics-to-Word Models for English Conversational Speech Recognition,” in Proc. of Interspeech, 2017, pp. 959–963.
-  W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in Proc. of ICASSP, 2016, pp. 4960–4964.
-  J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” in Proc. of NIPS, 2015, pp. 577–585.
-  S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP, 2018.
-  C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=Hko85plCW
-  A. Graves, “Sequence transduction with recurrent neural networks,” CoRR, vol. abs/1211.3711, 2012.
-  A. Graves, A. r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of ICASSP, 2013.
-  K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer,” in Proc. of ASRU, 2017, pp. 193–199.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
-  D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. of Interspeech, 2019.
-  C. Lüscher, E. Beck, K. Irie, M. Kitza, M. W, A. Zeyer, R. Schlüter, and H. Ney, “Rwth asr systems for librispeech: Hybrid vs attention,” in Proc. of Interspeech, 2019.
-  A. Narayanan, R. Prabhavalkar, C. Chiu, D. Rybach, T. Sainath, and T. Strohman, “Recognizing Long-Form Speech Using Streaming End-to-End Models,” to appear in Proc. of ASRU, 2019.
-  A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani, “Toward domain-invariant speech recognition via large scale training,” in Proc. of SLT, 2018.
-  C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan, R. Prabhavalkar, Z. Chen, T. Sainath, and Y. Wu, “A Comparison of End-to-end Models for Long-form Speech Recognition,” in Proc. of ASRU, 2019.
-  S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
-  H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition,” in Interspeech 2017, 2017.
-  R. Pang, T. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C.-C. Chiu, “Compression of end-to-end models,” in Proc. of Interspeech, 2018.
-  M. Schuster and K. Nakajima, “Japanese and Korean Voice Search,” in Proc. of ICASSP, 2012, pp. 5149–5152.
-  R. Prabhavalkar, O. Alsharif, A. Bruguier, and I. McGraw, “On the compression of recurrent neural networks with an application to lvcsr acoustic modeling for embedded speech recognition,” in Proc. of ICASSP, 2016, pp. 5970–5974.
-  X. Gonzalvo, S. Tazari, C.-A. Chan, M. Becker, A. Gutkin, and H. Silen, “Recent Advances in Google Real-time HMM-driven Unit Selection Synthesizer,” in Proc. of Interspeech, 2016.
-  A. Graves, “Practical variational inference for neural networks,” in Proc. of NeurIPS, 2011, pp. 2348–2356.
-  D. S. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. V. Le, and Y. Wu, “Specaugment on large scale datasets,” in Proc. of ICASSP, 2020.
-  J. Shen, P. Nguyen, Y. Wu, Z. Chen, et al., “Lingvo: a modular and scalable framework for sequence-to-sequence modeling,” 2019.
-  A. Senior, H. Sak, F. de Chaumont Quitry, T. Sainath, and K. Rao, “Acoustic modelling with cd-ctc-smbr lstm rnns,” in Proc. of ASRU, 2015.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015.