1 Introduction & Related Work
Popular end-to-end (E2E) models for automatic speech recognition include connectionist temporal classification (CTC), the recurrent neural network transducer (RNN-T) and attention-based encoder-decoder models [5, 6]. While E2E models are generally trained only on paired audio and transcriptions, an external language model (LM) trained on a much larger amount of text data (possibly from a better-matched domain) can further boost performance. Without modifying the model structure, shallow fusion (SF) has been a widely used and effective LM integration approach for E2E models, which simply applies a log-linear model combination.
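As a concrete illustration, SF scoring at each decoding step is just a weighted sum of log-probabilities. A minimal sketch (function and variable names are our own, not from any specific toolkit):

```python
import math

def shallow_fusion_score(log_p_asr, log_p_lm, lm_scale):
    """SF: log-linear combination of the E2E model score and the
    external LM score for one label at one decoding step."""
    return log_p_asr + lm_scale * log_p_lm

# hypothetical per-label log-probabilities for one decoding step
score = shallow_fusion_score(math.log(0.5), math.log(0.1), lm_scale=0.6)
```

In beam search, this score is accumulated per hypothesis; the LM scale is tuned on a development set.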
However, with context dependency directly included in the posterior distribution, both RNN-T and attention models implicitly learn an internal LM (ILM) as a sequence prior restricted to the audio transcriptions only. This ILM usually has a strong mismatch to the external LM, which can limit the performance of LM integration approaches such as simple SF. Existing approaches to handle the ILM fall into three major categories: ILM suppression, ILM adaptation and ILM correction.
ILM suppression requires training modification and performs similarly to ILM correction. ILM adaptation has an even higher complexity and may still need ILM correction for further improvement. Therefore, ILM correction appears to be the simplest and most effective approach for LM integration, which also has a better mathematical justification [10, 11].
The hybrid autoregressive transducer (HAT) [11] proposed to estimate the ILM from the transducer neural network (NN) by excluding the impact of the encoder, which is justified with a detailed proof under certain approximations. This approach was further investigated and extended by including the average encoder output instead. In [11, 17], ILM training (ILMT) is applied to include the ILM into the E2E model training for joint optimization. Recently, a comprehensive study on ILM methods for the attention model proposed the mini-LSTM approach, which outperforms all previous methods.
In this work, we study various ILM correction-based LM integration methods for the RNN-T model. We formulate different ILM approaches proposed for different E2E systems into a common RNN-T framework. We provide a decoding interpretation of two major reasons for performance improvement with ILM correction, which is further experimentally verified with detailed analysis. Additionally, we extend the HAT proof [11] and propose an exact-ILM training framework that enables a theoretical justification for other ILM approaches. A systematic comparison is conducted for in-domain and cross-domain evaluation on the Librispeech and TED-LIUM Release 2 (TLv2) corpora, respectively. The effect of ILMT on these approaches is also investigated.
Given a speech utterance, let $a_1^S$ denote the output (sub)word sequence of length $S$ from a vocabulary $V$. Let $X$ denote the corresponding acoustic feature sequence and $h_1^T$ denote the encoder output, which transforms the input into high-level representations.
The RNN-T model defines the sequence posterior as:
$$P_\theta(a_1^S \mid X) = \sum_{y_1^{T+S}:\, \mathcal{B}(y_1^{T+S}) = a_1^S} \; \prod_{i=1}^{T+S} P_\theta\big(y_i \mid \mathcal{B}(y_1^{i-1}), h_{t_i}\big) \quad (1)$$
Here $y_i \in V \cup \{\epsilon\}$ are blank-augmented alignment labels, $\mathcal{B}$ is the collapse function removing all blanks $\epsilon$, $t_i$ is the encoder frame at alignment step $i$, and $\theta$ denotes the RNN-T NN parameters. The probability in Eq. 1 can also be computed from the lattice representation of the RNN-T topology by dynamic programming over partial paths reaching each lattice node.
Without an external LM, $P_\theta(a_1^S \mid X)$ can be directly applied in the maximum a posteriori (MAP) decoding to obtain the optimal output sequence:
$$\hat{a}_1^{\hat{S}} = \operatorname*{argmax}_{S,\, a_1^S} \; P_\theta(a_1^S \mid X)$$
To integrate an external LM $P_{\text{LM}}$ as a modularized component, Bayes' theorem needs to be applied:
$$\hat{a}_1^{\hat{S}} = \operatorname*{argmax}_{S,\, a_1^S} \; \frac{P_\theta(a_1^S \mid X)}{P_\theta(a_1^S)} \cdot P_{\text{LM}}(a_1^S)$$
where $P_\theta(a_1^S)$ is the sequence prior implied by the RNN-T model.
3 Internal LM
Here $P_\theta(a_1^S)$ is effectively the ILM, which is implicitly learned and contained in $\theta$ after training. It would be most accurately obtained by marginalizing $P_\theta(a_1^S \mid X)$ over all possible inputs. Since the exact summation is intractable, an estimate $P_{\text{ILM}}(a_1^S)$ is usually applied for approximation.
3.1 ILM Estimation
One straightforward way is to assume that the ILM closely captures the statistics of the acoustic training transcriptions. This is the density ratio approach [10], which trains a separate LM on the audio transcriptions.
To be more consistent with the $P_\theta$ computation, another popular direction is to partially reuse the RNN-T NN for computing $P_{\text{ILM}}$. A general formulation can be given as:
$$P_{\text{ILM}}(a_s \mid a_1^{s-1}) = \operatorname*{softmax}_{v \in V}\big[\, J(\tilde{h}, g_s)\,\big](a_s) \quad (6)$$
where $g_s$ is the prediction network output for the history $a_1^{s-1}$, $J$ is the joint network, and $\tilde{h}$ is a substitute for the encoder output, with the softmax renormalized over $V$ only,
excluding the blank logit output. This is also done in prior work. In fact, we recommend always using a single softmax distribution over $V \cup \{\epsilon\}$ for $P_\theta$, since it is equivalent to separate blank and label distributions after renormalization, but has the advantage of further discrimination between blank and speech labels. This can partially prevent blank from becoming too dominant, which may otherwise lead to a sensitive decoding behavior with high deletion errors, as observed in prior work. Existing ILM estimation approaches can then be categorized by the way of representing $\tilde{h}$:
where an additional small NN is introduced to generate $\tilde{h}$ based on the label history. The mini-LSTM method falls into this category.
All these approaches are based on the fixed RNN-T parameters $\theta$. For the mini-LSTM approach, an LM-like loss based on Eq. 6 and Eq. 7 is used to train the additional NN on the audio transcriptions. This effectively combines the advantages of using transcription statistics and of partially reusing the RNN-T NN.
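As an illustrative sketch of this class of estimates (assuming a single joint softmax over blank and labels; all names here are hypothetical), the ILM label distribution can be obtained by dropping the blank logit and renormalizing over the remaining labels:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ilm_label_distribution(joint_logits, blank_index):
    """ILM estimate: drop the blank logit of the joint-network output
    and renormalize over the speech labels only."""
    label_logits = [l for i, l in enumerate(joint_logits) if i != blank_index]
    return softmax(label_logits)

# hypothetical joint logits computed with a zeroed-out encoder input
probs = ilm_label_distribution([2.0, 0.5, -1.0, 0.3], blank_index=0)
```

The choice of $\tilde{h}$ (zero, average encoder output, or a mini-LSTM output) only changes how the joint logits are computed; the renormalization step is shared.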
3.2 ILM Training
The RNN-T model is commonly trained with a full-sum loss over all alignment paths as in Eq. 1.
When reusing the RNN-T NN for $P_{\text{ILM}}$, one can also combine the RNN-T loss and the ILM loss into a multi-task training criterion to train all parameters jointly:
$$L_{\text{ILMT}} = L_{\text{RNN-T}} + \lambda_{\text{ILM}} \cdot L_{\text{ILM}}$$
where $\lambda_{\text{ILM}}$ is a tunable scale.
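The multi-task combination amounts to a simple weighted sum of the two losses; a schematic sketch with toy values (the function and scale names are ours, not from the paper):

```python
def ilmt_loss(rnnt_full_sum_loss, ilm_ce_loss, ilm_scale):
    """ILMT objective: RNN-T full-sum loss plus a scaled LM-like
    cross-entropy loss over the training transcriptions; in training,
    both terms back-propagate into the shared prediction and joint
    network parameters."""
    return rnnt_full_sum_loss + ilm_scale * ilm_ce_loss

# toy loss values for illustration
loss = ilmt_loss(12.5, 4.0, ilm_scale=0.2)
```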
3.3 Decoding Interpretation
Since both $P_{\text{LM}}$ and $P_{\text{ILM}}$ are only defined over $V$, we can further expand Eq. 5 into a per-step decoding rule where the blank probability is left unchanged and only the label probabilities are modified. This reveals two major reasons (R1 and R2) for the improvement with ILM correction:
R1: The label distribution of $P_\theta$ is rebalanced by the prior removal, so that we rely more on the external LM for context modeling, which is a desired behavior.
R2: The division by $P_{\text{ILM}} < 1$ boosts the label probability against the (usually high) blank probability, so that the importance of the external LM can be increased without suffering a large increase of deletion errors.
R2 explains why SF alone can only achieve limited performance. ILM correction may also alleviate the need for heuristic decoding approaches such as length normalization [4, 20] and length reward [21, 10, 22]. However, stronger boosting can also lead to an increase of insertion and/or substitution errors. Therefore, both scales require careful tuning in practice.
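The per-step scoring with both scales can be sketched as follows (a simplified view with hypothetical names; blank steps would use the plain model score only):

```python
import math

def label_step_score(log_p_label, log_p_lm, log_p_ilm, lam1, lam2):
    """Per-step label score with ILM correction:
    log P_theta + lam1 * log P_LM - lam2 * log P_ILM.
    SF is the special case lam2 = 0."""
    return log_p_label + lam1 * log_p_lm - lam2 * log_p_ilm

# toy values: subtracting the (negative) ILM log-probability boosts labels
sf_score = label_step_score(math.log(0.2), math.log(0.05), math.log(0.1), 0.6, 0.0)
ilm_score = label_step_score(math.log(0.2), math.log(0.05), math.log(0.1), 0.6, 0.4)
```

With the blank score unchanged, the positive ILM correction term raises every label score relative to blank, which is exactly the boosting effect (R2).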
3.4 Discussion: Exact-ILM Training
In Appendix A of the HAT paper [11], a detailed proof is given to show that the encoder-excluded estimate recovers the ILM under a certain assumption on the joint network output. Here we further generalize this assumption to allow a learnable, label-history-dependent substitute for the encoder contribution. This exact-ILM training brings various possibilities: we can directly plug in the mini-LSTM approach by defining its output as this substitute, which then inherits the same theoretical justification.
Experimental verification is done on the 960h Librispeech corpus for in-domain evaluation, and on the 207h TLv2 corpus for out-of-domain evaluation. We use 5k acoustic data-driven subword modeling (ADSM) units
trained on the Librispeech corpus. We follow the ADSM text segmentation and train two individual external LMs on the corresponding LM training data of each corpus. Both LMs contain 4 long short-term memory (LSTM) layers with 2048 units.
The encoder contains 2 convolutional layers followed by 6 bidirectional LSTM (BLSTM) layers with 640 units for each direction. A subsampling factor of 4 is applied via 2 max-pooling layers in the middle of the BLSTM stack. The prediction network contains an embedding layer of size 256 and 2 LSTM layers with 640 units. We use the standard additive joint network, which contains 1 linear layer of size 1024 with the tanh activation, and another linear layer followed by the final softmax. The RNN-T model is only trained on the Librispeech corpus. We first apply a Viterbi training variant
for 30 full epochs and then continue to fine-tune the model with the full-sum loss for 15 full epochs. This converged model is used as the base model for all further experiments.
The density ratio LM uses the same structure as the external LM. The mini-LSTM follows the structure proposed in the original work. For both, we only train for 0.5-1 full epochs on Librispeech. For the mini-LSTM training, we use the CE loss, with the setup chosen separately for in-domain and cross-domain evaluation, respectively. Additionally, we use the base RNN-T model to generate a Viterbi alignment and only consider those encoder frames where labels occur.
For ILMT, we follow prior work to choose the ILM loss scale and initialize with the base model. To avoid potential improvement arising simply from a much longer re-training with learning rate reset, we only apply fine-tuning with the ILMT loss for an additional 5-10 full epochs on Librispeech. Since the additional NN is only relevant for the ILM estimation, we also freeze it during this procedure.
The decoding follows the description in Section 3.3. We apply alignment-synchronous search with score-based pruning and a beam limit of 128. We explicitly do not apply any heuristic approaches in decoding, to better reflect the effect of each LM integration method. All scales are optimized on the dev sets.
4.2 LM Integration Evaluation
Table 1 shows the word error rate (WER) results of the aforementioned LM integration methods evaluated on the in-domain Librispeech and out-of-domain TLv2 tasks. As expected, the external LMs bring significant improvement over the standalone RNN-T, and all ILM correction approaches improve further over simple SF. For both tasks, all approaches that reuse the RNN-T NN outperform the density ratio approach. The zero- and average-encoder variants show similar performance, and the mini-LSTM trained with the LM-like loss performs slightly better than the two. The proposed exact-ILM training further improves the mini-LSTM approach.
ILMT brings consistent improvement for all three RNN-T-NN-based approaches on the cross-domain TLv2 task, while the overall impact on the in-domain Librispeech is much smaller. This is in line with the observation in HAT [11], where a decreasing ILM loss alone leads to no improvement of the overall performance. The mini-LSTM approach again performs the best.
To verify the two decoding-perspective benefits of ILM correction claimed in Section 3.3, we conduct additional analytical experiments on the Librispeech dev-other set using the base RNN-T model.
To simulate the effect of boosting the label probability (R2) without the effect of the rebalanced label distribution (R1), we apply a constant length reward on top of SF for decoding. To simulate the effect of R1 without the effect of R2, we apply a modified evaluation as follows. At each decoding step, we first apply the ILM correction to the label probabilities only and then renormalize them, so that the total label probability mass, and thus the blank probability, remains unchanged:
$$P'(a \mid \cdot) = \big(1 - P_\theta(\epsilon \mid \cdot)\big) \cdot \frac{P_\theta(a \mid \cdot)\, /\, P_{\text{ILM}}(a \mid \cdot)^{\lambda_2}}{\sum_{v \in V} P_\theta(v \mid \cdot)\, /\, P_{\text{ILM}}(v \mid \cdot)^{\lambda_2}}$$
For the baseline SF, the optimal LM scale already leads to a high Del error and we cannot increase it further. Boosting the label probability with the length reward largely reduces the Del error to the same level as Ins. It also allows a slight increase of the external LM importance for better performance. This verifies the individual effect of R2. Rebalancing the label distribution with the renormalized ILM correction reduces the Sub error, as we rely more on the external LM for context modeling. However, it still suffers from the high Del error without the boosting effect. This verifies the individual effect of R1. When combining the length reward and the renormalized ILM correction, we see that the benefits are complementary. Finally, applying the full ILM correction allows further enlarging the effects of R1 and R2 with larger scales, and thus achieves further improvement. It also eliminates the need for a length reward.
| setup | LM scale | WER | Sub | Del | Ins |
| + length reward | 0.65 | 5.6 | 4.4 | 0.6 | 0.6 |
| + length reward | 0.65 | 5.4 | 4.3 | 0.6 | 0.5 |
| + length reward | 0.95 | 5.2 | 4.1 | 0.6 | 0.5 |
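The R1-only simulation described above (rebalancing the label probabilities while keeping the blank mass fixed) can be sketched as follows, with hypothetical toy numbers:

```python
def renormalize_labels(label_probs, ilm_probs, blank_prob, lam2):
    """R1-only simulation: apply the ILM correction to the label
    probabilities, then renormalize so the total label mass
    (1 - blank_prob) is unchanged, i.e. rebalancing without any
    boosting against blank."""
    corrected = [p / (q ** lam2) for p, q in zip(label_probs, ilm_probs)]
    z = sum(corrected)
    return [(1.0 - blank_prob) * c / z for c in corrected]

# toy step distribution: blank mass 0.8 is kept fixed
probs = renormalize_labels([0.10, 0.05, 0.05], [0.5, 0.3, 0.2],
                           blank_prob=0.8, lam2=0.3)
```

Since the blank probability is untouched, any WER change under this scheme can only come from the rebalanced label distribution, isolating R1 from R2.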
In this work, we provided a detailed formulation to compare various ILM correction-based LM integration methods in a common RNN-T framework. We explained two major reasons for performance improvement with ILM correction from a decoding interpretation, which are experimentally verified with detailed analysis. Moreover, we proposed an exact-ILM training framework by extending the proof in HAT [11], which enables a theoretical justification for other ILM approaches. All investigated LM integration methods are systematically compared on the in-domain Librispeech and out-of-domain TLv2 tasks. The recently proposed mini-LSTM ILM approach for the attention model also performs best for the RNN-T model. Our proposed exact-ILM training can further improve its performance.
This work was partly funded by the Google Faculty Research Award for “Label Context Modeling in Automatic Speech Recognition”. We thank Mohammad Zeineldeen and Wilfried Michel for useful discussion.
-  Zoltán Tüske, George Saon, Kartik Audhkhasi, and Brian Kingsbury, “Single Headed Attention based Sequence-to-sequence Model for State-of-the-Art Results on Switchboard,” in Proc. Interspeech, 2020, pp. 551–555.
-  Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
-  Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. Int. Conf. on Machine Learning (ICML), 2006, pp. 369–376.
-  Alex Graves, “Sequence Transduction with Recurrent Neural Networks,” 2012, https://arxiv.org/abs/1211.3711.
-  Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-End Attention-based Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2016, pp. 4945–4949.
-  William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
-  Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” 2015, http://arxiv.org/abs/1503.03535.
-  Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models,” in Proc. Interspeech, 2021, pp. 2856–2860.
-  Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Early Stage LM Integration Using Local and Global Log-Linear Combination,” in Proc. Interspeech, 2020, pp. 3605–3609.
-  Erik McDermott, Hasim Sak, and Ehsan Variani, “A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition,” in IEEE ASRU, 2019, pp. 434–441.
-  Ehsan Variani, David Rybach, Cyril Allauzen, and Michael Riley, “Hybrid Autoregressive Transducer (HAT),” in Proc. ICASSP, 2020, pp. 6139–6143.
-  Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, and Yifan Gong, “Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,” in IEEE SLT, 2021, pp. 243–250.
-  Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Librispeech Transducer Model with Internal Language Model Prior Correction,” in Proc. Interspeech, 2021.
-  Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, and Lei He, “Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS,” in Proc. Interspeech, 2021, pp. 751–755.
-  Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, and Zoltán Tüske, “Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio,” in Proc. Interspeech, 2021, pp. 2027–2031.
-  Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, and Hannes Heikinheimo, “Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network,” in Proc. Interspeech, 2021, pp. 1882–1886.
-  Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, and Yifan Gong, “Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition,” in Proc. ICASSP, 2021, pp. 7338–7342.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
-  Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks,” in Proc. LREC, 2014, pp. 3935–3939.
-  Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, and Yifan Gong, “On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer,” in Proc. Interspeech, 2021, pp. 3435–3439.
-  Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 2014, http://arxiv.org/abs/1412.5567.
-  George Saon, Zoltán Tüske, Daniel Bolaños, and Brian Kingsbury, “Advancing RNN Transducer Technology for Speech Recognition,” in Proc. ICASSP, 2021, pp. 5654–5658.
-  Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, and Hermann Ney, “Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition,” in Proc. Interspeech, 2021, pp. 2886–2890.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney, “Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2007, pp. 649–652.
-  Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, “Monotonic Recurrent Neural Network Transducer and Decoding Strategies,” in IEEE ASRU, 2019, pp. 944–948.
-  Wei Zhou, Simon Berger, Ralf Schlüter, and Hermann Ney, “Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2021, pp. 5644–5648.
-  Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “On Architectures and Training for Raw Waveform Feature Extraction in ASR,” in IEEE ASRU, 2021, (to appear).
-  George Saon, Zoltán Tüske, and Kartik Audhkhasi, “Alignment-Length Synchronous Decoding for RNN Transducer,” in Proc. ICASSP, 2020, pp. 7804–7808.