On Language Model Integration for RNN Transducer based Speech Recognition

10/13/2021, by Wei Zhou, et al.

The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of the RNN-Transducer (RNN-T) can limit the performance of LM integration such as simple shallow fusion. A Bayesian interpretation suggests removing this sequence prior as an ILM correction. In this work, we study various ILM correction-based LM integration methods formulated in a common RNN-T framework. We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction, which is further verified experimentally with detailed analysis. We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer, which enables a theoretical justification for other ILM approaches. Systematic comparison is conducted for in-domain and cross-domain evaluation on the Librispeech and TED-LIUM Release 2 corpora, respectively. Our proposed exact-ILM training can further improve the best ILM method.




1 Introduction & Related Work

End-to-end (E2E) speech recognition has shown great simplicity and state-of-the-art performance [1, 2]. Common E2E approaches include connectionist temporal classification (CTC) [3], the recurrent neural network transducer (RNN-T) [4] and attention-based encoder-decoder models [5, 6]. While E2E models are generally trained only on paired audio-transcription data, an external language model (LM) trained on a much larger amount of text data (possibly from a better-matched domain) can further boost performance. Without modifying the model structure, shallow fusion (SF) [7] has been a widely used, effective LM integration approach for E2E models, which simply applies a log-linear model combination.
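As a minimal illustration (with made-up hypothesis scores, not the paper's implementation), SF ranks hypotheses by this log-linear combination of the E2E posterior and the external LM:

```python
import math

def shallow_fusion_score(log_p_rnnt, log_p_lm, lm_scale):
    """Shallow fusion: log p(a|x) + lm_scale * log p_LM(a)."""
    return log_p_rnnt + lm_scale * log_p_lm

# two hypothetical competing hypotheses: (acoustic posterior, LM probability)
hyps = {"a": (0.50, 0.10), "b": (0.40, 0.60)}
lm_scale = 0.6
best = max(hyps, key=lambda k: shallow_fusion_score(
    math.log(hyps[k][0]), math.log(hyps[k][1]), lm_scale))
# with the LM included, hypothesis "b" overtakes the acoustically better "a"
```

Without the LM (lm_scale = 0), hypothesis "a" would win on the acoustic score alone; the external LM can flip such decisions toward linguistically more plausible outputs.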

However, with context dependency directly included in the posterior distribution, both RNN-T and attention models implicitly learn an internal LM (ILM) as a sequence prior restricted to the audio transcriptions only. This ILM usually has a strong mismatch with the external LM, which can limit the performance of LM integration approaches such as simple SF. Existing approaches to handle the ILM fall into 3 major categories:


  • ILM suppression: The ILM can be suppressed during E2E model training by limiting the context/model size [8] or by introducing an external LM at an early stage [9].

  • ILM correction: The ILM can be estimated in various ways [10, 11, 12, 13, 8] and then corrected from the posterior in decoding, which fits into a Bayesian interpretation.

  • ILM adaptation: The ILM can be adapted on the same text data used by the external LM to alleviate the mismatch. This can be done via text-to-speech [14, 15] to generate training audio for the E2E model, or via a partial model update directly on the text [16].

ILM suppression requires training modification and performs similarly to ILM correction [8]. ILM adaptation has an even higher complexity and may still need ILM correction for further improvement [14]. Therefore, ILM correction appears to be the simplest and most effective approach for LM integration, and it also has a better mathematical justification [10, 11].

The density ratio [10] approach estimates the ILM directly from the transcription statistics. The hybrid autoregressive transducer (HAT) [11] proposed to estimate the ILM from the transducer neural network (NN) by excluding the impact of the encoder, which is justified with a detailed proof under a certain approximation. This approach is further investigated in [12] and extended by including the average encoder output in [13]. In [11, 17], ILM training (ILMT) is applied to include the ILM in the E2E model training for joint optimization. Recently, [8] conducted a comprehensive study of ILM methods for the attention model, and proposed the mini-LSTM approach that outperforms all previous methods.

In this work, we study various ILM correction-based LM integration methods for the RNN-T model. We formulate different ILM approaches proposed for different E2E systems in a common RNN-T framework. We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction, which is further verified experimentally with detailed analysis. Additionally, we extend the HAT proof [11] and propose an exact-ILM training framework which enables a theoretical justification for other ILM approaches. Systematic comparison is conducted for in-domain and cross-domain evaluation on the Librispeech [18] and TED-LIUM Release 2 (TLv2) [19] corpora, respectively. The effect of ILMT on these approaches is also investigated.

2 RNN-Transducer

Given a speech utterance, let a_1^S denote the output (sub)word sequence of length S from a vocabulary V. Let x_1^{T'} denote the corresponding acoustic feature sequence and h_1^T = f^enc(x_1^{T'}) denote the encoder output, which transforms the input into high-level representations. The RNN-T model [4] defines the sequence posterior as:

    p_Λ(a_1^S | x_1^{T'}) = Σ_{y_1^T : B(y_1^T) = a_1^S}  Π_{t=1}^{T}  p_Λ(y_t | B(y_1^{t-1}), h_1^T)    (1)

Here y_1^T is the blank (ε) augmented alignment sequence, which is uniquely mapped to a_1^S via the collapsing function B to remove all blanks. The common NN structure of RNN-T contains an encoder f^enc, a prediction network f^pred and a joint network f^joint followed by a softmax activation. We denote Λ as the RNN-T NN parameters. The probability in Eq. 1 can also be represented from the lattice representation of the RNN-T topology [4]. By denoting y_1^t as a path reaching a node (t, s) in this lattice, we have:

    p_Λ(y_{t+1} | a_1^s, h_1^T) = softmax_{V ∪ {ε}} [ f^joint(h_{t+1}, g_s) ]    (2)

where g_s = f^pred(a_1^s), and y_1^{t+1} reaches (t+1, s+1) if y_{t+1} ∈ V, or (t+1, s) otherwise.
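To make the lattice view concrete, the following is a small sketch (not the paper's implementation) of the full-sum over a strictly monotonic RNN-T lattice, where at each frame the path either emits blank and keeps the label position s, or emits the next label and advances s:

```python
import math

def rnnt_full_sum_log_prob(log_probs_blank, log_probs_label):
    """Full-sum over all alignment paths of a strictly monotonic RNN-T lattice.
    log_probs_blank[t][s]: log p(blank) at frame t with s labels emitted so far.
    log_probs_label[t][s]: log p(a_{s+1}) at frame t with s labels emitted so far.
    Both arrays are illustrative stand-ins for joint-network softmax outputs."""
    T = len(log_probs_blank)
    S = len(log_probs_label[0])          # number of output labels
    NEG_INF = float("-inf")

    def logsumexp(a, b):
        if a == NEG_INF: return b
        if b == NEG_INF: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # alpha[s]: log prob of reaching label position s after the current frame
    alpha = [NEG_INF] * (S + 1)
    alpha[0] = 0.0
    for t in range(T):
        new = [NEG_INF] * (S + 1)
        for s in range(S + 1):
            stay = alpha[s] + log_probs_blank[t][s]            # emit blank
            adv = (alpha[s - 1] + log_probs_label[t][s - 1]    # emit label a_s
                   if s > 0 else NEG_INF)
            new[s] = logsumexp(stay, adv)
        alpha = new
    return alpha[S]
```

For instance, with T = 2 frames, one label, and constant probabilities 0.6 (blank) / 0.4 (label), the two valid paths sum to 0.4 * 0.6 + 0.6 * 0.4 = 0.48.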

Without an external LM, p_Λ(a_1^S | x_1^{T'}) can be applied directly in the maximum a posteriori (MAP) decoding to obtain the optimal output sequence:

    a_1^{S*} = argmax_{S, a_1^S}  p_Λ(a_1^S | x_1^{T'})    (3)

Within a Bayesian framework, to integrate the RNN-T model and an external LM p_LM jointly into Eq. 3 as modularized components, Bayes' theorem needs to be applied:

    a_1^{S*} = argmax_{S, a_1^S}  p_LM(a_1^S) · p(x_1^{T'} | a_1^S)    (4)
             = argmax_{S, a_1^S}  p_LM^{λ1}(a_1^S) · p_Λ(a_1^S | x_1^{T'}) / p_Λ^{λ2}(a_1^S)    (5)

which suggests the removal of the RNN-T model's internal sequence prior p_Λ(a_1^S). Here λ1 and λ2 are scales applied in common practice. The SF approach [7] essentially omits p_Λ(a_1^S) completely by setting λ2 = 0.

3 Internal LM

The sequence prior p_Λ(a_1^S) is effectively the ILM, which is implicitly learned and contained in Λ after training. It would be obtained most accurately by the marginalization p_Λ(a_1^S) = Σ_{x_1^{T'}} p_Λ(a_1^S | x_1^{T'}) · p(x_1^{T'}). Since this exact summation is intractable, an estimated p_ILM(a_1^S) is usually applied as an approximation.

3.1 ILM Estimation

One straightforward way is to assume that p_Λ(a_1^S) closely captures the statistics of the acoustic training transcriptions. This is the density ratio approach [10], which trains a separate LM on the audio transcriptions as p_ILM.

To be more consistent with the computation of p_Λ, another popular direction is to partially reuse the RNN-T NN for computing p_ILM. A general formulation can be given as:

    p_ILM(a | a_1^s) = softmax_V [ f^joint(g', g_s) ]    (6)

where the softmax is defined over V and g' is some global representation replacing the encoder output. Here Eq. 6 shows a strong correspondence to Eq. 2, except that Eq. 2 is defined over V ∪ {ε} in general. In [11], a separate blank distribution in p_Λ is proposed, so that its label distribution can be used directly for p_ILM. However, the same can always be achieved by a simple renormalization:

    p_ILM(a | a_1^s) = softmax_V [ f̃^joint(g', g_s) ]    (7)

where f̃^joint stands for the joint network excluding the blank logit output. This is also done in [12]. In fact, we recommend to always use a single distribution for p_Λ, since it is equivalent to the separate distributions in [11] after renormalization, but has the advantage of further discrimination between blank and speech labels. This can partially prevent blank from becoming too dominant, which may lead to a sensitive decoding behavior with high deletion errors, as observed in [20]. Existing ILM estimation approaches can then be categorized by the way of representing g':

  1. g' = 0, i.e. a zero encoder output [11, 12]

  2. g' = h̄, the average encoder output [13, 8]

  3. g' generated by an additional NN. The mini-LSTM method [8] falls into this category.

All these approaches are based on a fixed Λ. For the mini-LSTM approach, an LM-like loss based on Eq. 6 and Eq. 7 is used to train the additional NN over the audio transcriptions. This effectively combines the advantages of using transcription statistics and of partially reusing the RNN-T NN.
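A sketch of the g' = 0 estimate with the renormalization of Eq. 7, using an illustrative linear joint network (all names, shapes and weights here are assumptions for demonstration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy joint network: a linear layer over [encoder repr ; prediction repr],
# an illustrative stand-in for a trained RNN-T joint network
D, V = 8, 5                       # repr size, output size (index 0 = blank)
W = rng.normal(size=(V, 2 * D))

def joint_logits(h, g):
    return W @ np.concatenate([h, g])

def ilm_label_probs(g):
    """Estimate p_ILM(.|history): replace the encoder output with a global
    representation (here g' = 0) and renormalize the joint-network output
    over the labels only, i.e. drop the blank logit (Eq. 7)."""
    logits = joint_logits(np.zeros(D), g)
    label_logits = logits[1:]                 # exclude blank (index 0)
    e = np.exp(label_logits - label_logits.max())
    return e / e.sum()

g = rng.normal(size=D)   # prediction-network output for some label history
p = ilm_label_probs(g)   # a proper distribution over the 4 non-blank labels
```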

3.2 ILM Training

The RNN-T model is commonly trained with a full-sum loss L_RNN-T over all alignment paths as in Eq. 1. When reusing the RNN-T NN for p_ILM, one can also combine L_RNN-T and an LM-like ILM loss L_ILM in a multi-task training to train all parameters Λ jointly:

    L = L_RNN-T + λ · L_ILM

where λ is a scaling factor. This joint training is applied to the zero-encoder approach in [11, 17], and we also apply it to the average-encoder and mini-LSTM approaches in this work.
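A hedged sketch of this multi-task objective, where the ILM part is an LM-like cross-entropy of the estimated ILM on the reference transcription (function names and inputs are illustrative):

```python
import math

def ilm_cross_entropy(ilm_probs_per_pos, targets):
    """LM-like ILM loss: negative log-likelihood of each reference label
    under the estimated internal LM (one distribution per label position)."""
    return -sum(math.log(p[a]) for p, a in zip(ilm_probs_per_pos, targets))

def ilmt_loss(rnnt_loss, ilm_probs_per_pos, targets, scale):
    """Multi-task ILM training: L = L_RNN-T + scale * L_ILM."""
    return rnnt_loss + scale * ilm_cross_entropy(ilm_probs_per_pos, targets)
```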

3.3 Decoding Interpretation

Since both p_LM and p_ILM are only defined over V, we can further expand Eq. 5 into the per-step scores:

    q(y_t = ε) = p_Λ(ε | a_1^s, h_1^T)
    q(y_t = a) = p_Λ(a | a_1^s, h_1^T) · p_LM^{λ1}(a | a_1^s) / p_ILM^{λ2}(a | a_1^s),   a ∈ V

which reflects the exact scoring used at each search step. This also reveals, from a decoding perspective, two reasons why applying ILM correction can improve recognition performance:

  1. R1: The label distribution of p_Λ is rebalanced by the prior removal, so that we rely more on the external LM for context modeling, which is a desired behavior.

  2. R2: The division by p_ILM^{λ2} boosts the label probabilities against the (usually high) blank probability, so that the importance of the external LM can be increased without suffering a large increase of deletion errors.

R2 explains why SF (no ILM correction) can only achieve limited performance. It may also alleviate the need for heuristic approaches in decoding such as length normalization [4, 20] and length reward [21, 10, 22]. However, increasing the ILM scale for more boosting can also lead to an increase of insertion and/or substitution errors. Therefore, both scales require careful tuning in practice.
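The per-step scoring of this section can be sketched as follows (log domain; a toy illustration with hypothetical probabilities, not the search implementation):

```python
import math

def step_scores(log_p_joint, log_p_lm, log_p_ilm, lm_scale, ilm_scale):
    """Per-step search scores: blank (index 0) is left unchanged, while each
    label score gets the external LM added and the ILM subtracted."""
    blank = log_p_joint[0]
    labels = [q + lm_scale * lm - ilm_scale * ilm
              for q, lm, ilm in zip(log_p_joint[1:], log_p_lm, log_p_ilm)]
    return blank, labels

# a frame where blank dominates (p = 0.8) over two labels (0.15, 0.05)
joint = [math.log(0.8), math.log(0.15), math.log(0.05)]
lm = [math.log(0.3), math.log(0.7)]
ilm = [math.log(0.6), math.log(0.4)]
b_sf, l_sf = step_scores(joint, lm, ilm, lm_scale=0.6, ilm_scale=0.0)   # SF
b_ilm, l_ilm = step_scores(joint, lm, ilm, lm_scale=0.6, ilm_scale=0.4)
# dividing by the ILM raises all label scores while blank stays fixed (R2)
```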

3.4 Discussion: Exact-ILM Training

In Appendix A of [11], a detailed proof is given to show that the label distribution of the model recovers the ILM (Eq. 8), if the following assumption can hold: the joint network output decomposes additively into an encoder-dependent term and a label-history-dependent term (Eq. 9). Eq. 8 then leads to the zero-encoder ILM approach. However, this assumption usually does not hold if the joint network contains some non-linearity, which is mostly the case for better performance.

Here we further generalize this assumption by allowing the encoder-dependent term to be any function of the encoder output with the same output size as the label logits. As long as this function is independent of the label history, the proof still holds and we obtain the ILM estimate of Eq. 10. Instead of relying on the assumption, we can also train the model to fulfill Eq. 9, eventually towards an exact ILM estimation. Besides the mean squared error (MSE) loss, this objective can also be a cross-entropy (CE) loss, since both sides of Eq. 9 are logits over V, to which we can directly apply a softmax. The CE loss may introduce a constant shift to Eq. 9, which is still valid for the proof in HAT [11] to achieve Eq. 10. Note that this training requires a reasonable Viterbi alignment of the acoustic data to match encoder frames and label positions, which can be obtained via a pretrained RNN-T model.
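The two objective variants can be sketched as below, under our reading of this section: both sides of Eq. 9 are treated as label-logit vectors, with all names and shapes assumed for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exact_ilm_losses(logits_lhs, logits_rhs):
    """MSE and CE variants of the exact-ILM objective.  Both arguments are
    label-logit vectors standing in for the two sides of Eq. 9; for CE,
    softmax(logits_lhs) serves as the target distribution."""
    mse = float(np.mean((logits_lhs - logits_rhs) ** 2))
    # stable log-softmax of the right-hand side
    log_q = logits_rhs - logits_rhs.max()
    log_q = log_q - np.log(np.exp(log_q).sum())
    ce = float(-(softmax(logits_lhs) * log_q).sum())
    return mse, ce
```

In line with the constant-shift remark above, the CE variant is invariant to adding a constant to one side's logits, while the MSE variant is not.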

This exact-ILM training brings various possibilities. We can directly plug in the mini-LSTM approach by using its additionally generated representation as the global representation in the ILM estimate, and extend its training to a multi-task combination of the LM-like loss and the exact-ILM loss. This extension directly enables a theoretical justification for the mini-LSTM ILM approach. Similar to ILMT, we can also combine the RNN-T loss and the exact-ILM loss for a joint training of all parameters, where the model is additionally forced to better fulfill the assumption towards an exact ILM estimation. This may require a careful design, which is not investigated in this work.

4 Experiments

4.1 Setup

Experimental verification is done on the 960h Librispeech corpus [18] for in-domain evaluation, and on the 207h TLv2 corpus [19] for out-of-domain evaluation. We use 5k acoustic data-driven subword modeling (ADSM) units [23] trained on the Librispeech corpus. We follow the ADSM text segmentation and train two individual external LMs on the corresponding LM training data of each corpus. Both LMs contain 4 long short-term memory (LSTM) [24] layers with 2048 units.

We use 50-dimensional gammatone features [25] and a strictly monotonic version of RNN-T [26]. The encoder contains 2 convolutional layers followed by 6 bidirectional-LSTM (BLSTM) layers with 640 units per direction. A subsampling factor of 4 is applied via 2 max-pooling layers in the middle of the BLSTM stack. The prediction network contains an embedding layer of size 256 and 2 LSTM layers with 640 units. We use the standard additive joint network, which contains 1 linear layer of size 1024 with tanh activation, and another linear layer followed by the final softmax. The RNN-T model is trained only on the Librispeech corpus. We firstly apply a Viterbi training variant [27] for 30 full epochs and then continue to fine-tune the model with the full-sum loss for 15 full epochs. This converged model is used as the base model for all further experiments.

The density ratio [10] LM uses the same structure as the external LM. The additional NN for the mini-LSTM approach follows the same structure as in [8]. For both the density ratio LM and the mini-LSTM, we train for only 0.5-1 full epoch on Librispeech. For the exact-ILM training, we use the CE loss, with the loss scale tuned separately for in-domain and cross-domain evaluation. Additionally for the exact-ILM training, we use the base RNN-T model to generate a Viterbi alignment and only consider those encoder frames where labels occur.

For ILMT, we follow [17] for the loss scale and initialize with the base model. To avoid potential improvement arising simply from a much longer re-training with learning rate reset [28], we only apply ILMT fine-tuning for an additional 5-10 full epochs on Librispeech. We also freeze those parameters that are only relevant for individual ILM approaches during this procedure.

The decoding follows the description in Section 3.3. We apply alignment-synchronous search [29] with score-based pruning and a beam limit of 128. We explicitly do not apply any heuristic approach for decoding to better reflect the effect of each LM integration method. All scales are optimized on the dev sets.

Model / Train        Librispeech dev      Librispeech test     TLv2
                     clean    other       clean    other       dev      test
no LM                3.3      9.7         3.6      9.5         19.8     20.3
SF                   2.2      5.9         2.4      6.3         15.5     16.4
density ratio        2.1      5.7         2.4      6.0         14.1     15.0
                     2.0      5.1         2.3      5.6         13.6     14.4
                     2.1      5.0         2.3      5.5         13.5     14.6
  +                  2.0      5.0         2.2      5.3         13.4     14.4
  +                  2.0      4.9         2.2      5.2         13.2     14.0
                     2.0      5.1         2.2      5.4         13.3     14.2
                     2.1      5.2         2.3      5.5         13.5     14.3
                     2.0      5.0         2.2      5.4         13.2     14.1

Table 1: WER[%] results of LM integration evaluation on the in-domain Librispeech and out-of-domain TLv2 corpora.

4.2 LM Integration Evaluation

Table 1 shows the word error rate (WER) results of the aforementioned LM integration methods evaluated on the in-domain Librispeech and out-of-domain TLv2 tasks. As expected, the external LMs bring significant improvement over the standalone RNN-T, and all ILM correction approaches improve further over simple SF. For both tasks, all methods that reuse the RNN-T NN outperform the density ratio approach. The zero-encoder and average-encoder estimates show similar performance, and the mini-LSTM trained with the LM-like loss performs slightly better than both. The proposed exact-ILM training further improves the mini-LSTM approach.

ILMT brings consistent improvement for all 3 NN-based approaches on the cross-domain TLv2 task, while the overall impact on the in-domain Librispeech task is much smaller. This is in line with the observation in HAT [11], where a decreasing ILM perplexity leads to no improvement in the overall performance. The mini-LSTM approach again performs best.

4.3 Analysis

To verify the two decoding-perspective benefits of ILM correction claimed in Section 3.3, we conduct additional analytical experiments on the Librispeech dev-other set using the base RNN-T model. To simulate the effect of boosting the label probability (R2) without the effect of the rebalanced label distribution (R1), we apply a constant length reward upon SF in decoding. To simulate the effect of R1 without the effect of R2, we apply a modified evaluation as follows. For each label context, we firstly renormalize the ILM-corrected label distribution over the vocabulary. Then we modify the label probability used for search by rescaling this renormalized distribution with the total label probability mass of the original model. This effectively restricts the label probability back to a single distribution w.r.t. the blank probability of the RNN-T model, but still maintains the rebalanced label distribution to some extent. We denote this operation as renorm. Table 2 shows the WER, substitution (Sub), deletion (Del) and insertion (Ins) error rate results together with the optimized scales for these experiments.

For the baseline SF, the optimal LM scale already leads to a high Del error and cannot be increased further. Boosting the label probability with the length reward largely reduces the Del error to the same level as Ins. It also allows a slight increase of the external LM importance for better performance. This verifies the individual effect of R2. Rebalancing the label distribution with renorm reduces the Sub error, as we rely more on the external LM for context modeling. However, it still suffers from the high Del error without the boosting effect. This verifies the individual effect of R1. Combining the length reward and renorm shows that the benefits are complementary. Finally, applying the full ILM correction allows further enlarging the effects of R1 and R2 with larger scales, and thus achieves further improvement. It also eliminates the need for a length reward.

Evaluation            LM scale   ILM scale   WER    Sub    Del    Ins
SF                    0.6        0           5.9    4.4    1.0    0.5
  + length reward     0.65                   5.6    4.4    0.6    0.6
renorm                0.6        0.3         5.7    4.2    1.0    0.5
  + length reward     0.65                   5.4    4.3    0.6    0.5
ILM correction        0.85       0.4         5.1    4.0    0.6    0.5
  + length reward     0.95                   5.2    4.1    0.6    0.5

Table 2: WER[%] and Sub, Del, Ins error rate[%] results on the dev-other set of the in-domain Librispeech corpus. Analytical evaluation using the base RNN-T model.

5 Conclusion

In this work, we provided a detailed formulation to compare various ILM correction-based LM integration methods in a common RNN-T framework. We explained two major reasons for the performance improvement with ILM correction from a decoding interpretation, which were experimentally verified with detailed analysis. Moreover, we proposed an exact-ILM training framework by extending the proof in HAT [11], which enables a theoretical justification for other ILM approaches. All investigated LM integration methods were systematically compared on the in-domain Librispeech and out-of-domain TLv2 tasks. The recently proposed mini-LSTM ILM approach for the attention model also performs best for the RNN-T model, and our proposed exact-ILM training can further improve its performance.

6 Acknowledgements

This work was partly funded by the Google Faculty Research Award for “Label Context Modeling in Automatic Speech Recognition”. We thank Mohammad Zeineldeen and Wilfried Michel for useful discussion.


  • [1] Zoltán Tüske, George Saon, Kartik Audhkhasi, and Brian Kingsbury, “Single Headed Attention based Sequence-to-sequence Model for State-of-the-Art Results on Switchboard,” in Proc. Interspeech, 2020, pp. 551–555.
  • [2] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
  • [3] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. Int. Conf. on Machine Learning (ICML), 2006, pp. 369–376.
  • [4] Alex Graves, “Sequence Transduction with Recurrent Neural Networks,” 2012, https://arxiv.org/abs/1211.3711.
  • [5] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-End Attention-based Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2016, pp. 4945–4949.
  • [6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
  • [7] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” 2015, http://arxiv.org/abs/1503.03535.
  • [8] Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models,” in Proc. Interspeech, 2021, pp. 2856–2860.
  • [9] Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Early Stage LM Integration Using Local and Global Log-Linear Combination,” in Proc. Interspeech, 2020, pp. 3605–3609.
  • [10] Erik McDermott, Hasim Sak, and Ehsan Variani, “A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition,” in IEEE ASRU, 2019, pp. 434–441.
  • [11] Ehsan Variani, David Rybach, Cyril Allauzen, and Michael Riley, “Hybrid Autoregressive Transducer (HAT),” in Proc. ICASSP, 2020, pp. 6139–6143.
  • [12] Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, and Yifan Gong, “Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,” in IEEE SLT, 2021, pp. 243–250.
  • [13] Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Librispeech Transducer Model with Internal Language Model Prior Correction,” in Proc. Interspeech, 2021.
  • [14] Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, and Lei He, “Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS,” in Proc. Interspeech, 2021, pp. 751–755.
  • [15] Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, and Zoltán Tüske, “Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio,” in Proc. Interspeech, 2021, pp. 2027–2031.
  • [16] Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, and Hannes Heikinheimo, “Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network,” in Proc. Interspeech, 2021, pp. 1882–1886.
  • [17] Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, and Yifan Gong, “Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition,” in Proc. ICASSP, 2021, pp. 7338–7342.
  • [18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [19] Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks,” in Proc. LREC, 2014, pp. 3935–3939.
  • [20] Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, and Yifan Gong, “On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer,” in Proc. Interspeech, 2021, pp. 3435–3439.
  • [21] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 2014, http://arxiv.org/abs/1412.5567.
  • [22] George Saon, Zoltán Tüske, Daniel Bolaños, and Brian Kingsbury, “Advancing RNN Transducer Technology for Speech Recognition,” in Proc. ICASSP, 2021, pp. 5654–5658.
  • [23] Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, and Hermann Ney, “Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition,” in Proc. Interspeech, 2021, pp. 2886–2890.
  • [24] Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [25] Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney, “Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2007, pp. 649–652.
  • [26] Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, “Monotonic Recurrent Neural Network Transducer and Decoding Strategies,” in IEEE ASRU, 2019, pp. 944–948.
  • [27] Wei Zhou, Simon Berger, Ralf Schlüter, and Hermann Ney, “Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition,” in Proc. ICASSP, 2021, pp. 5644–5648.
  • [28] Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, and Hermann Ney, “On Architectures and Training for Raw Waveform Feature Extraction in ASR,” in IEEE ASRU, 2021, (to appear).
  • [29] George Saon, Zoltán Tüske, and Kartik Audhkhasi, “Alignment-Length Synchronous Decoding for RNN Transducer,” in Proc. ICASSP, 2020, pp. 7804–7808.