1 Introduction & Related Work
End-to-end (E2E) speech recognition has shown great simplicity and state-of-the-art performance [1, 2]. Common E2E approaches include connectionist temporal classification (CTC) [3], recurrent neural network transducer (RNN-T) [4] and attention-based encoder-decoder models [5, 6]. While E2E models are in general trained only on paired audio-transcription data, an external language model (LM) trained on a much larger amount of text data (possibly from a better-matched domain) can further boost performance. Without modifying the model structure, shallow fusion (SF) [7] has been a widely used and effective LM integration approach for E2E models, which simply applies a log-linear model combination.
However, with context dependency directly included in the posterior distribution, both RNN-T and attention models implicitly learn an internal LM (ILM) as a sequence prior restricted to the audio transcription only. This ILM usually has a strong mismatch to the external LM, which can limit the performance of LM integration such as simple SF. Existing approaches to handle the ILM fall into 3 major categories:

ILM suppression, ILM correction, and ILM adaptation.
ILM suppression requires training modification and performs similarly to ILM correction [8]. ILM adaptation has an even higher complexity and may still need ILM correction for further improvement [14]. Therefore, ILM correction appears to be the simplest and most effective approach for LM integration, and it also has a better mathematical justification [10, 11].
The density ratio approach [10] estimates the ILM directly from the transcription statistics. The hybrid autoregressive transducer (HAT) [11] proposed to estimate the ILM from the transducer neural network (NN) by excluding the impact of the encoder, which is justified with a detailed proof under certain approximations. This approach is further investigated in [12] and extended by including the average encoder output in [13]. In [11, 17], ILM training (ILMT) is applied to include the ILM in the E2E model training for joint optimization. Recently, [8] conducted a comprehensive study on ILM methods for the attention model, and proposed the mini-LSTM approach that outperforms all previous methods.
In this work, we study various ILM correction-based LM integration methods for the RNN-T model. We formulate different ILM approaches proposed for different E2E systems in a common RNN-T framework. We provide a decoding interpretation of two major reasons for performance improvement with ILM correction, which is further experimentally verified with detailed analysis. Additionally, we extend the HAT proof [11] and propose an exact-ILM training framework which enables a theoretical justification for other ILM approaches. Systematic comparison is conducted for both in-domain and cross-domain evaluation on the Librispeech [18] and TED-LIUM Release 2 (TLv2) [19] corpora, respectively. The effect of ILMT on these approaches is also investigated.
2 RNN-Transducer
Given a speech utterance, let $a_1^S$ denote the output (sub)word sequence of length $S$ from a vocabulary $V$. Let $X$ denote the corresponding acoustic feature sequence and $h_1^T = f^{\text{enc}}(X)$ denote the encoder output, which transforms the input into high-level representations.
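The RNN-T model [4] defines the sequence posterior of the label sequence $a_1^S$ given the acoustic features $X$, with encoder output $h_1^T$, as:
$$P_{\text{RNNT}}(a_1^S|X) = \sum_{y_1^U:\,\mathcal{B}(y_1^U)=a_1^S} \; \prod_{u=1}^{U} P_{\text{RNNT}}(y_u|\mathcal{B}(y_1^{u-1}), h_1^T) \qquad (1)$$
Here $y_1^U$ is the blank-augmented alignment sequence, which is mapped to $a_1^S$ by the collapsing function $\mathcal{B}$ that removes all blanks $\epsilon$. The RNN-T NN contains an encoder $f^{\text{enc}}$, a prediction network $f^{\text{pred}}$ and a joint network $J$ followed by a softmax. We denote $\theta_{\text{RNNT}}$ as the RNN-T NN parameters. The probability in Eq. 1 can also be represented from the lattice representation of the RNN-T topology [4]. By denoting $y_1^{u-1}$ as a path reaching a node $(t, s-1)$, we have:
$$P_{\text{RNNT}}(y_u|\mathcal{B}(y_1^{u-1}), h_1^T) = P_{\text{RNNT}}(y_u|a_1^{s-1}, h_t) = \text{Softmax}\left[J\left(f^{\text{pred}}(a_1^{s-1}), h_t\right)\right] \qquad (2)$$
Without an external LM, $P_{\text{RNNT}}(a_1^S|X)$ can be directly applied in maximum a posteriori (MAP) decoding to obtain the optimal output sequence:
$$X \rightarrow \tilde{a}_1^{\tilde{S}} = \arg\max_{a_1^S,\, S} P(a_1^S|X) \qquad (3)$$
To integrate the RNN-T model and an external LM $P_{\text{LM}}$ into Eq. 3 as modularized components, Bayes' theorem needs to be applied:
$$\arg\max_{a_1^S,\, S} P(a_1^S|X) = \arg\max_{a_1^S,\, S} P(a_1^S) \cdot P(X|a_1^S) \qquad (4)$$
$$= \arg\max_{a_1^S,\, S} P_{\text{LM}}^{\lambda_1}(a_1^S) \cdot \frac{P_{\text{RNNT}}(a_1^S|X)}{P_{\text{RNNT-ILM}}^{\lambda_2}(a_1^S)} \qquad (5)$$
Here $\lambda_1$ and $\lambda_2$ are the scales applied to the external LM and the ILM, respectively.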
3 Internal LM
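$P_{\text{RNNT-ILM}}(a_1^S)$ is effectively the ILM, which is implicitly learned and contained in $P_{\text{RNNT}}(a_1^S|X)$ after training. It should be most accurately obtained by the marginalization $\sum_X P_{\text{RNNT}}(a_1^S|X) P(X)$. Since the exact summation is intractable, an estimated $P_{\text{ILM}}(a_1^S)$ is usually applied for approximation.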
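To illustrate how an estimated ILM enters decoding, the following minimal sketch combines sequence-level log-probabilities log-linearly and subtracts the scaled ILM score. The function names, the two hypotheses and all probabilities and scale values are hypothetical and chosen only to show that the ranking can change once the ILM is corrected; setting the ILM scale to 0 recovers plain SF.

```python
import math


def combined_score(log_p_rnnt, log_p_lm, log_p_ilm, lm_scale, ilm_scale):
    """Log-linear combination: RNN-T score + scaled LM score - scaled ILM score."""
    return log_p_rnnt + lm_scale * log_p_lm - ilm_scale * log_p_ilm


# Hypothetical sequence-level probabilities for two competing hypotheses.
HYPOTHESES = {
    "hyp_a": {"rnnt": 0.50, "lm": 0.010, "ilm": 0.20},
    "hyp_b": {"rnnt": 0.30, "lm": 0.030, "ilm": 0.02},
}


def best_hypothesis(lm_scale, ilm_scale):
    """Return the name of the hypothesis with the highest combined score."""
    def score(name):
        p = HYPOTHESES[name]
        return combined_score(math.log(p["rnnt"]), math.log(p["lm"]),
                              math.log(p["ilm"]), lm_scale, ilm_scale)
    return max(HYPOTHESES, key=score)


print(best_hypothesis(lm_scale=0.3, ilm_scale=0.0))  # shallow fusion only
print(best_hypothesis(lm_scale=0.3, ilm_scale=0.3))  # with ILM correction
```

With these made-up numbers, SF keeps the hypothesis favored by the RNN-T model, while the ILM-corrected score prefers the hypothesis that the external LM supports and the ILM considers unlikely.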
3.1 ILM Estimation
One straightforward way is to assume that $P_{\text{ILM}}$ closely captures the statistics of the acoustic training transcription. This is the density ratio approach [10], which trains a separate $P_{\text{ILM}}$ on the audio transcription.
To be more consistent with the $P_{\text{RNNT}}$ computation, another popular direction is to partially reuse the RNN-T NN for computing $P_{\text{ILM}}$. A general formulation can be given as:
$$P_{\text{ILM}}(a_1^S) = \prod_{s=1}^{S} P'(a_s|a_1^{s-1}) \qquad (6)$$
$$P'(a_s|a_1^{s-1}) = \text{Softmax}\left[J_{\setminus\epsilon}\left(f^{\text{pred}}(a_1^{s-1}), h'\right)\right] \qquad (7)$$
where $h'$ is some global representation that replaces the encoder output and $J_{\setminus\epsilon}$ stands for the joint network $J$ excluding the blank logit output. This is also done in [12]. In fact, we recommend always using a single distribution for $P_{\text{RNNT}}$, since it is equivalent to the separate distributions in [11] after renormalization, but has the advantage of further discrimination between blank and speech labels. This can partially prevent blank from being too dominant, which may lead to a sensitive decoding behavior with high deletion errors as observed in [20]. Existing ILM estimation approaches are then categorized by the way of representing $h'$:
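$h'_{\text{zero}}$: $h' = 0$, i.e. the encoder output is replaced by zeros [11, 12].
$h'_{\text{avg}}$: $h'$ is the average encoder output [13].
$h'_{a_1^{s-1}}$: $h' = f_{\theta_{\text{ILM}}}(f^{\text{pred}}(a_1^{s-1}))$, where an additional NN $f_{\theta_{\text{ILM}}}$ is introduced to generate $h'$ based on $f^{\text{pred}}(a_1^{s-1})$. The mini-LSTM method [8] falls into this category (denoted as $h'_{\text{mini-LSTM}}$).
All these approaches are based on fixed $\theta_{\text{RNNT}}$. For the $h'_{\text{mini-LSTM}}$ approach, an LM-like loss $L_{\text{ILM}}$ based on Eq. 6 and Eq. 7 is used to train the additional $f_{\theta_{\text{ILM}}}$ over the audio transcription. This effectively combines the advantage of using transcription statistics and partial RNN-T NN.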
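The ILM estimation from a reused joint network can be sketched as follows. This is a toy illustration, not the implementation used in the experiments: the joint network is reduced to a tanh over the sum of its two inputs followed by one linear layer, and `WEIGHTS`, `BIAS`, `PRED_OUT` and the blank index are hypothetical values. The block also checks numerically that a softmax over the non-blank logits equals the full softmax renormalized over the non-blank labels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_logits(pred_out, enc_out, weights, bias):
    """Toy additive joint network: tanh(pred + enc), then one linear layer."""
    hidden = [math.tanh(p + e) for p, e in zip(pred_out, enc_out)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def ilm_distribution(pred_out, h_prime, weights, bias, blank_index=0):
    """Label distribution with the encoder output replaced by h_prime and
    the blank logit excluded before the softmax."""
    logits = joint_logits(pred_out, h_prime, weights, bias)
    label_logits = [x for i, x in enumerate(logits) if i != blank_index]
    return softmax(label_logits)


# Hypothetical parameters: 3 outputs (blank + 2 labels), hidden size 2.
WEIGHTS = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
BIAS = [1.0, 0.0, 0.0]
PRED_OUT = [0.7, -0.1]

zero_encoder = [0.0, 0.0]  # the zero-encoder variant of h'
ilm = ilm_distribution(PRED_OUT, zero_encoder, WEIGHTS, BIAS)
full = softmax(joint_logits(PRED_OUT, zero_encoder, WEIGHTS, BIAS))
renormalized = [p / (1.0 - full[0]) for p in full[1:]]
print(ilm, renormalized)
```

Passing an averaged encoder vector, or the output of an extra network applied to the prediction network output, as `h_prime` gives the other two variants in the same interface.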
3.2 ILM Training
The RNN-T model is commonly trained with a full-sum loss over all alignment paths as in Eq. 1.
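As an illustration of the full-sum over alignment paths, the sketch below runs the forward recursion on the standard RNN-T lattice of [4]. It is a minimal example under stated assumptions: the function name and input layout are hypothetical, the per-node probabilities are given directly instead of coming from an NN, and the standard topology is used rather than the strictly monotonic variant used later in the experiments.

```python
import math


def rnnt_full_sum(blank_prob, label_prob):
    """Sum the probabilities of all alignment paths on the RNN-T lattice.

    blank_prob[t][s]: probability of emitting blank at node (t, s), s = 0..S.
    label_prob[t][s]: probability of emitting the correct next label at
                      node (t, s), s = 0..S-1.
    """
    T = len(blank_prob)
    S = len(blank_prob[0]) - 1
    alpha = [[0.0] * (S + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for s in range(S + 1):
            if t == 0 and s == 0:
                continue
            # Arrive by a blank from the previous frame (same label position).
            stay = alpha[t - 1][s] * blank_prob[t - 1][s] if t > 0 else 0.0
            # Arrive by a label emission within the same frame.
            emit = alpha[t][s - 1] * label_prob[t][s - 1] if s > 0 else 0.0
            alpha[t][s] = stay + emit
    # A final blank at the last frame terminates every path.
    return alpha[T - 1][S] * blank_prob[T - 1][S]


# Toy lattice with T = 2 frames and S = 1 label, all probabilities 0.5.
blank = [[0.5, 0.5], [0.5, 0.5]]
label = [[0.5], [0.5]]
sequence_prob = rnnt_full_sum(blank, label)
full_sum_loss = -math.log(sequence_prob)
print(sequence_prob, full_sum_loss)
```

In this toy case there are two alignment paths of three steps each, so the sequence probability is the sum of two products of three factors.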
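When reusing the RNN-T NN for $P_{\text{ILM}}$, one can also combine the RNN-T loss $L_{\text{RNNT}}$ and the ILM loss $L_{\text{ILM}}$ as a multi-task training to train all parameters, including $\theta_{\text{RNNT}}$, jointly:
$$L_{\text{ILMT}} = L_{\text{RNNT}} + \alpha L_{\text{ILM}}$$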
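A minimal sketch of such a multi-task objective is given below, assuming the ILM loss is the negative log-likelihood of the transcription under the estimated ILM and that the two losses are combined with a weight called `alpha` here. The function names and all numeric values are hypothetical.

```python
import math


def ilm_loss(ilm_label_probs):
    """Negative log-likelihood of the transcription under the estimated ILM,
    given the ILM probability of each reference label."""
    return -sum(math.log(p) for p in ilm_label_probs)


def ilmt_loss(rnnt_loss, ilm_label_probs, alpha):
    """Multi-task loss: RNN-T full-sum loss plus weighted ILM loss."""
    return rnnt_loss + alpha * ilm_loss(ilm_label_probs)


# Hypothetical values for one utterance with two labels.
total = ilmt_loss(rnnt_loss=2.0, ilm_label_probs=[0.5, 0.25], alpha=0.2)
print(total)
```

Setting `alpha` to 0 falls back to the plain RNN-T loss, which makes the weight a direct control over how strongly the ILM is shaped during training.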
3.3 Decoding Interpretation
Since both $P_{\text{LM}}$ and $P_{\text{ILM}}$ are only defined over $V$, we can further expand Eq. 5 as:

R1: The label distribution of $P_{\text{RNNT}}$ is rebalanced with the prior removal, so that we rely more on the external LM for context modeling, which is a desired behavior.
R2: The division by $P_{\text{ILM}}$ boosts the label probability against the (usually high) blank probability, so that the importance of the external LM can be increased without suffering a large increase in deletion errors.
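The boosting effect can be made concrete with a per-step sketch in which only non-blank labels receive the external LM and ILM terms, while the blank keeps its RNN-T score. The function name, the token names and all probabilities and scales below are hypothetical.

```python
import math


def step_scores(p_rnnt, p_lm, p_ilm, lm_scale, ilm_scale, blank="<blank>"):
    """Per-step log-scores: labels get LM and ILM terms, blank does not."""
    scores = {}
    for token, prob in p_rnnt.items():
        score = math.log(prob)
        if token != blank:
            score += lm_scale * math.log(p_lm[token])
            score -= ilm_scale * math.log(p_ilm[token])
        scores[token] = score
    return scores


# Hypothetical distributions at one decoding step.
P_RNNT = {"<blank>": 0.7, "a": 0.2, "b": 0.1}
P_LM = {"a": 0.6, "b": 0.4}
P_ILM = {"a": 0.5, "b": 0.5}

sf = step_scores(P_RNNT, P_LM, P_ILM, lm_scale=0.6, ilm_scale=0.0)
corrected = step_scores(P_RNNT, P_LM, P_ILM, lm_scale=0.6, ilm_scale=0.4)
print(sf["<blank>"] - sf["a"], corrected["<blank>"] - corrected["a"])
```

Because the ILM log-probability is negative, subtracting its scaled value raises every label score and leaves the blank score unchanged, so the gap between blank and label shrinks.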
R2 explains why SF (ILM scale $\lambda_2 = 0$) can only achieve limited performance. It may also alleviate the need for heuristic approaches in decoding such as length normalization [4, 20] and length reward [21, 10, 22]. However, an increasing ILM scale $\lambda_2$ with more boosting can also lead to an increase in insertion and/or substitution errors. Therefore, both scales require careful tuning in practice.
3.4 Discussion: Exact-ILM Training
In Appendix A of [11], a detailed proof is given to show:
Here we further generalize this assumption to be:
This exact-ILM training brings various possibilities. We can directly plug in the approach by defining:
4 Experiments
4.1 Setup
Experimental verification is done on the 960h Librispeech corpus [18] for in-domain evaluation, and on the 207h TLv2 corpus [19] for out-of-domain evaluation. We use 5k acoustic data-driven subword modeling (ADSM) units [23] trained on the Librispeech corpus. We follow the ADSM text segmentation and train two individual external LMs on the corresponding LM training data of each corpus. Both LMs contain 4 long short-term memory (LSTM) [24] layers with 2048 units.
We use 50-dimensional gammatone features [25] and a strictly monotonic version of RNN-T [26]. The encoder contains 2 convolutional layers followed by 6 bidirectional LSTM (BLSTM) layers with 640 units for each direction. A subsampling by a factor of 4 is applied via 2 max-pooling layers in the middle of the BLSTM stack. The prediction network contains an embedding layer of size 256 and 2 LSTM layers with 640 units. We use the standard additive joint network, which contains 1 linear layer of size 1024 with the tanh activation, and another linear layer followed by the final softmax. The RNN-T model is trained only on the Librispeech corpus. We first apply a Viterbi training variant [27] for 30 full epochs and then continue to fine-tune the model with the full-sum loss for 15 full epochs. This converged model is used as the base model for all further experiments.
The density ratio [10] LM uses the same structure as . The for the approach follows the same structure of [8]. For both and , we only train for 0.5 to 1 full epoch on Librispeech. For the , we use the CE loss for , and and for in-domain and cross-domain evaluation, respectively. Additionally for , we use the base RNN-T model to generate a Viterbi alignment and only consider those encoding frames where labels occur.
For ILMT, we follow [17] to use and initialize with the base model. To avoid the potential improvement just from a much longer re-training with learning rate reset [28], we only apply fine-tuning with for an additional 5 to 10 full epochs on Librispeech. Since is only relevant for and , we also freeze during this procedure.
The decoding follows the description in Section 3.3. We apply alignment-synchronous search [29] with score-based pruning and a beam limit of 128. We explicitly do not apply any heuristic approach for decoding to better reflect the effect of each LM integration method. All scales are optimized on the dev sets.
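The combination of score-based pruning and a beam limit can be sketched as below. This is a generic illustration, not the search implementation of [29]: the function name, the hypothesis tuples and the score threshold are hypothetical, and only the beam limit of 128 is taken from the setup above.

```python
def prune(hyps, beam_limit, score_threshold):
    """Keep hypotheses whose log-score is within score_threshold of the best
    one, then cap the result at beam_limit entries.

    hyps: list of (log_score, hypothesis) tuples.
    """
    if not hyps:
        return []
    ranked = sorted(hyps, key=lambda h: h[0], reverse=True)
    best_score = ranked[0][0]
    kept = [h for h in ranked if best_score - h[0] <= score_threshold]
    return kept[:beam_limit]


# Hypothetical partial hypotheses at one search step.
candidates = [(-1.0, "a"), (-1.5, "b"), (-9.0, "c"), (-2.0, "d")]
wide = prune(candidates, beam_limit=128, score_threshold=5.0)
narrow = prune(candidates, beam_limit=2, score_threshold=5.0)
print(wide, narrow)
```

With a generous beam limit the score threshold does most of the pruning, which keeps the search cost tied to how ambiguous the current step is.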
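Table 1: WER [%] on Librispeech and TLv2.
Model Train / Evaluation | Librispeech dev-clean | Librispeech dev-other | Librispeech test-clean | Librispeech test-other | TLv2 dev | TLv2 test
no LM | 3.3 | 9.7 | 3.6 | 9.5 | 19.8 | 20.3
SF | 2.2 | 5.9 | 2.4 | 6.3 | 15.5 | 16.4
density ratio | 2.1 | 5.7 | 2.4 | 6.0 | 14.1 | 15.0
 | 2.0 | 5.1 | 2.3 | 5.6 | 13.6 | 14.4
 | 2.1 | 5.0 | 2.3 | 5.5 | 13.5 | 14.6
+ | 2.0 | 5.0 | 2.2 | 5.3 | 13.4 | 14.4
+ | 2.0 | 4.9 | 2.2 | 5.2 | 13.2 | 14.0
 | 2.0 | 5.1 | 2.2 | 5.4 | 13.3 | 14.2
 | 2.1 | 5.2 | 2.3 | 5.5 | 13.5 | 14.3
 | 2.0 | 5.0 | 2.2 | 5.4 | 13.2 | 14.1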
4.2 LM Integration Evaluation
Table 1 shows the word error rate (WER) results of the aforementioned LM integration methods evaluated on the in-domain Librispeech and out-of-domain TLv2 tasks. As expected, the external LMs bring significant improvement over the standalone RNN-T, and all ILM correction approaches improve further over simple SF. For both tasks, all methods that reuse the RNN-T NN outperform the density ratio approach. The zero and average encoder output estimates show similar performance, and the mini-LSTM approach trained with the LM-like loss performs slightly better than the two. The proposed exact-ILM training further improves the mini-LSTM approach.
ILMT brings consistent improvement for all 3 approaches that reuse the RNN-T NN on the cross-domain TLv2 task, while the overall impact on the in-domain Librispeech task is much smaller. This is in line with the observation in HAT [11], where a decreasing ILM loss leads to no improvement in overall performance. The mini-LSTM approach again performs the best.
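The size of these gains is easier to compare as relative WER reductions. The short sketch below computes them for the rows of Table 1 whose labels are unambiguous (no LM, SF, density ratio); the helper name is hypothetical and the WER values are copied from the table.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER [%] copied from Table 1.
librispeech_test_other = {"no LM": 9.5, "SF": 6.3, "density ratio": 6.0}
tlv2_test = {"no LM": 20.3, "SF": 16.4, "density ratio": 15.0}

for name, row in (("Librispeech test-other", librispeech_test_other),
                  ("TLv2 test", tlv2_test)):
    sf_gain = relative_reduction(row["no LM"], row["SF"])
    dr_gain = relative_reduction(row["SF"], row["density ratio"])
    print(f"{name}: SF vs. no LM {sf_gain:.1f}%, density ratio vs. SF {dr_gain:.1f}%")
```

The relative gain of the density ratio approach over SF is larger on the cross-domain TLv2 test set than on the in-domain Librispeech test-other set.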
4.3 Analysis
To verify the 2 decoding-perspective benefits of ILM correction claimed in Section 3.3, we conduct additional analytical experiments on the Librispeech dev-other set using the base RNN-T model.
To simulate the effect of boosting the label probability (R2) without the effect of the rebalanced label distribution (R1), we apply a constant length reward on top of SF for decoding. To simulate the effect of R1 without the effect of R2, we apply a modified evaluation as follows. For each , we first apply a renormalization as:
For the baseline SF, the optimal LM scale $\lambda_1$ already leads to a high Del error and we cannot increase it further. Boosting the label probability with length reward largely reduces the Del error to the same level as Ins. It also allows a slight increase of the external LM importance ($\lambda_1$) for better performance. This verifies the individual effect of R2. Rebalancing the label distribution with reduces the Sub error as we rely more on the external LM for context modeling. However, it still suffers from the high Del error without the boosting effect. This verifies the individual effect of R1. When combining length reward and , we see that the benefits are complementary. Finally, applying the ILM correction allows further enlarging the effect of R1 and R2 with larger scales, and thus achieves further improvement. It also eliminates the need for length reward.
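A constant length reward can be sketched as a fixed bonus per emitted label added to the hypothesis log-score. The function name, scores, lengths and reward value below are hypothetical and only illustrate how the reward counteracts a preference for shorter hypotheses, i.e. deletions.

```python
def with_length_reward(log_score, num_labels, reward):
    """Add a constant reward for every emitted (non-blank) label."""
    return log_score + reward * num_labels


# Hypothetical case: the shorter hypothesis (one label deleted) scores
# higher before the reward is applied.
short_hyp = {"log_score": -4.0, "num_labels": 3}
full_hyp = {"log_score": -4.5, "num_labels": 4}

reward = 0.6
short_total = with_length_reward(short_hyp["log_score"], short_hyp["num_labels"], reward)
full_total = with_length_reward(full_hyp["log_score"], full_hyp["num_labels"], reward)
print(short_total, full_total)
```

Unlike the ILM correction, this bonus is the same for every label, so it boosts labels against blank without changing the relative ranking among labels at a given step.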
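Table 2: WER [%] and error breakdown on Librispeech dev-other.
Evaluation | LM scale $\lambda_1$ | ILM scale $\lambda_2$ | WER | Sub | Del | Ins
SF | 0.6 | 0 | 5.9 | 4.4 | 1.0 | 0.5
+ length reward | 0.65 | | 5.6 | 4.4 | 0.6 | 0.6
 | 0.6 | 0.3 | 5.7 | 4.2 | 1.0 | 0.5
+ length reward | 0.65 | | 5.4 | 4.3 | 0.6 | 0.5
 | 0.85 | 0.4 | 5.1 | 4.0 | 0.6 | 0.5
+ length reward | 0.95 | | 5.2 | 4.1 | 0.6 | 0.5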
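As a quick sanity check on the error breakdown, the sketch below verifies that substitutions, deletions and insertions add up to the reported WER in every row. The values are copied from the table above in row order; the variable names are hypothetical.

```python
# (WER, Sub, Del, Ins) in percent, in the row order of the table above.
ROWS = [
    (5.9, 4.4, 1.0, 0.5),
    (5.6, 4.4, 0.6, 0.6),
    (5.7, 4.2, 1.0, 0.5),
    (5.4, 4.3, 0.6, 0.5),
    (5.1, 4.0, 0.6, 0.5),
    (5.2, 4.1, 0.6, 0.5),
]

# Rounding to one decimal avoids spurious floating-point mismatches.
consistent = [round(sub + dele + ins, 1) == wer for wer, sub, dele, ins in ROWS]
print(all(consistent))
```

The same rows also show the pattern discussed in the text: the deletion rate drops from 1.0 to 0.6 whenever the label probability is boosted.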
5 Conclusion
In this work, we provided a detailed formulation to compare various ILM correction-based LM integration methods in a common RNN-T framework. We explained two major reasons for performance improvement with ILM correction from a decoding interpretation, which are experimentally verified with detailed analysis. Moreover, we proposed an exact-ILM training framework by extending the proof in HAT [11], which enables a theoretical justification for other ILM approaches. All investigated LM integration methods are systematically compared on the in-domain Librispeech and out-of-domain TLv2 tasks. The mini-LSTM ILM approach recently proposed for the attention model [8] also performs the best for the RNN-T model. Our proposed exact-ILM training can further improve its performance.
6 Acknowledgements
This work was partly funded by the Google Faculty Research Award for “Label Context Modeling in Automatic Speech Recognition”. We thank Mohammad Zeineldeen and Wilfried Michel for useful discussion.
References

[1] Zoltán Tüske, George Saon, Kartik Audhkhasi, and Brian Kingsbury, “Single Headed Attention based Sequence-to-sequence Model for State-of-the-Art Results on Switchboard,” in Proc. Interspeech, 2020, pp. 551–555.
[2] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
[3] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. Int. Conf. on Machine Learning (ICML), 2006, pp. 369–376.
[4] Alex Graves, “Sequence Transduction with Recurrent Neural Networks,” 2012, https://arxiv.org/abs/1211.3711.
[5] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-End Attention-based Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2016, pp. 4945–4949.
[6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
[7] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” 2015, http://arxiv.org/abs/1503.03535.
[8] Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models,” in Proc. Interspeech, 2021, pp. 2856–2860.
[9] Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Early Stage LM Integration Using Local and Global Log-Linear Combination,” in Proc. Interspeech, 2020, pp. 3605–3609.
[10] Erik McDermott, Hasim Sak, and Ehsan Variani, “A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition,” in IEEE ASRU, 2019, pp. 434–441.
[11] Ehsan Variani, David Rybach, Cyril Allauzen, and Michael Riley, “Hybrid Autoregressive Transducer (HAT),” in Proc. ICASSP, 2020, pp. 6139–6143.
[12] Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, and Yifan Gong, “Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,” in IEEE SLT, 2021, pp. 243–250.
[13] Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Librispeech Transducer Model with Internal Language Model Prior Correction,” in Proc. Interspeech, 2021.
[14] Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, and Lei He, “Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS,” in Proc. Interspeech, 2021, pp. 751–755.
[15] Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, and Zoltán Tüske, “Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio,” in Proc. Interspeech, 2021, pp. 2027–2031.
[16] Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, and Hannes Heikinheimo, “Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network,” in Proc. Interspeech, 2021, pp. 1882–1886.
[17] Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, and Yifan Gong, “Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition,” in Proc. ICASSP, 2021, pp. 7338–7342.
[18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
[19] Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks,” in Proc. LREC, 2014, pp. 3935–3939.
[20] Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, and Yifan Gong, “On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer,” in Proc. Interspeech, 2021, pp. 3435–3439.
[21] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 2014, http://arxiv.org/abs/1412.5567.
[22] George Saon, Zoltán Tüske, Daniel Bolaños, and Brian Kingsbury, “Advancing RNN Transducer Technology for Speech Recognition,” in Proc. ICASSP, 2021, pp. 5654–5658.
[23] Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, and Hermann Ney, “Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition,” in Proc. Interspeech, 2021, pp. 2886–2890.
[24] Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney, “Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2007, pp. 649–652.
[26] Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, “Monotonic Recurrent Neural Network Transducer and Decoding Strategies,” in IEEE ASRU, 2019, pp. 944–948.
[27] Wei Zhou, Simon Berger, Ralf Schlüter, and Hermann Ney, “Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2021, pp. 5644–5648.

[28] Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “On Architectures and Training for Raw Waveform Feature Extraction in ASR,” in IEEE ASRU, 2021, (to appear).
[29] George Saon, Zoltán Tüske, and Kartik Audhkhasi, “Alignment-Length Synchronous Decoding for RNN Transducer,” in Proc. ICASSP, 2020, pp. 7804–7808.