Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR

by Yufei Liu, et al.

An end-to-end (E2E) speech recognition model implicitly learns a biased internal language model (ILM) during training. To fuse an external LM during inference, the scores produced by the biased ILM need to be estimated and subtracted. In this paper we propose two novel approaches to estimate the biased ILM based on Listen-Attend-Spell (LAS) models. The simpler method replaces the context vector of the LAS decoder at every time step with a learnable vector. The more advanced method uses a simple feed-forward network to directly map query vectors to context vectors, making the generation of the context vectors independent of the LAS encoder. Both the learnable vector and the mapping network are trained on the transcriptions of the training data to minimize the perplexity while all the other parameters of the LAS model are fixed. Experiments show that the ILMs estimated by the proposed methods achieve the lowest perplexity. In addition, they also significantly outperform the shallow fusion method and two previously proposed Internal Language Model Estimation (ILME) approaches on multiple datasets.




1 Introduction

Footnote: This work was done while Yufei Liu was an intern at ByteDance.

End-to-end (E2E) Automatic Speech Recognition (ASR) models are becoming increasingly popular due to 1) their success in achieving state-of-the-art results; and 2) their compactness, in that both the acoustic and language models are jointly learned with a single network. One of the most popular E2E models is the Listen-Attend-Spell (LAS) model [1], which is also called the attention-based encoder-decoder (AED) model.

Though compact and effective in modeling, E2E models are inherently limited in taking full advantage of an external language model (LM) that is trained on much larger text-only data. This can be explained by the Bayesian probabilistic theory that governs speech recognition models. Given an acoustic feature sequence $X$ and the corresponding word/sub-word sequence $Y$, an E2E model directly learns the posterior $P(Y|X)$. By Bayes' rule, $P(Y|X)$ can be decomposed into an acoustic model $P(X|Y)$ and a language model $P(Y)$. The language model $P(Y)$ is trained using only the training transcripts and is thus suboptimal. When more external text data is available, one can train a much more robust external language model $P_{\text{ext}}(Y)$. To fuse the E2E model with the external LM $P_{\text{ext}}(Y)$, the effect of the internal LM $P(Y)$ has to be removed first.
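The Bayesian argument can be written out as a short derivation. This is an illustrative sketch rather than a formula quoted from the paper: $P(Y)$ denotes the internal LM, $P_{\text{ext}}(Y)$ the external LM, and $P_{\text{fused}}$ is a name introduced here for the fused posterior.

```latex
\begin{align}
P(Y \mid X) &\propto P(X \mid Y)\, P(Y) \\
P(X \mid Y) &\propto \frac{P(Y \mid X)}{P(Y)} \\
P_{\text{fused}}(Y \mid X) &\propto P(X \mid Y)\, P_{\text{ext}}(Y)
  = \frac{P(Y \mid X)}{P(Y)}\, P_{\text{ext}}(Y) \\
\log P_{\text{fused}}(Y \mid X) &= \log P(Y \mid X) - \log P(Y) + \log P_{\text{ext}}(Y) + \text{const}
\end{align}
```

The last line shows why the internal LM score is subtracted in the log domain before the external LM score is added. In practice both LM terms are usually scaled by tunable weights.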

However, the difficulty lies in that we cannot easily estimate the internal LM $P(Y)$ in an E2E model. Many simplified methods that simply ignore the effect of the internal LM, such as component fusion [15], cold fusion [17], and simple fusion [18], have been proposed. As the internal language model is implicitly retained in the E2E posterior $P(Y|X)$, the fusion process could be biased, yielding sub-optimal results in both intra- and inter-domain ASR scenarios.

Recently, the Hybrid Autoregressive Transducer (HAT) [21] and Density Ratio approaches [10, 19] were proposed as extensions of shallow fusion. The Density Ratio method uses a completely separate model trained on the E2E training transcripts to approximate the internal LM. On the other hand, HAT estimates its internal LM by removing the effect of the encoder from the network. Both methods have been shown to outperform shallow fusion, especially in cross-domain tasks. Inspired by HAT, Meng et al. proposed an Internal Language Model Estimation (ILME) method [12] to estimate the prior for both the RNN-T and the AED models. Compared with HAT, ILME extends the LM fusion work to the AED model. The implementations of ILME and HAT are similar.

Though the performance improvement reported in [12] is significant, zeroing out the encoder’s output may lead to a mismatch during inference, since an E2E model is trained to optimize a standard E2E loss function. To deal with the mismatch issue, Meng et al. later proposed an internal language model training method [11] to further update the model parameters engaged in the prediction of the internal LM score. To relax the constraint of zeroing out the encoder’s output, Zeineldeen et al. proposed a series of prior estimation methods for better external LM fusion [23], achieving better performance than the method proposed in [12].

In this paper, we propose two novel ILME methods to better estimate the prior LM using AED models. The simpler method explicitly treats the decoder as an LM that is responsible for the internal language model (ILM) score estimation. Different from previous works, we propose to learn such an ILM by optimizing the context vector on the training transcripts after the normal training procedure. The advantage of this method is that it is simple and yet can be applied to LAS models with different types of encoders. Alternatively, as an improvement to the first method, where the context vector is a constant vector during inference, we propose another method that allows a different context vector at each decoding step. To achieve this, we propose to replace the attention component of the decoder with a simple feed-forward network that maps the output query vector directly to the context vector.

2 Related work

External LM fusion has long been studied in the end-to-end ASR community. Though the fusion methods are diverse, such as cold fusion [17], deep fusion [4], and component fusion [15], shallow fusion [13, 2, 6, 7, 20], which is a log-linear interpolation of the scores produced by the E2E model and a separately trained LM, is the predominant approach due to its simplicity and effectiveness.

From the Bayesian probabilistic theory, people have come to realize the necessity of removing the effect of the implicitly learned internal LM (ILM) in an E2E model for better integration. This can be seen as removing the bias of the training text data. However, exact calculation of the internal LM is impossible, and various approximate methods have been proposed. The Density Ratio method [10, 24, 19] uses a separately trained LM to approximate the internal LM. On the other hand, some researchers have argued that it is better to use the same set of E2E model parameters to estimate the internal LM. By assuming that the internal LM is mostly learned by the decoder in an AED model or the prediction network in an RNN-T model, Hybrid Autoregressive Transducer (HAT [21]), Internal LM Estimation (ILME [12]), and Internal LM Training (ILMT [11]) have been proposed, all of which mask the context vector appropriately. Recently, Zeineldeen et al. [23] proposed a series of optimization methods based on ILME and ILMT. One of them averages the attention-based context vector, which requires paired audio-transcript training data to estimate the internal LM. In contrast, the methods proposed in this paper estimate the internal LM from text data only. From this perspective, the context vector calculated by the proposed methods is more closely related to the internal LM. [23] also proposes to use a “Mini-LSTM” network to produce decoding-synchronous context vectors. This is similar to the adaptive context learning method proposed in Section 4.2. However, as shown in the experiments, the proposed adaptive context learning method significantly outperforms the “Mini-LSTM” approach. We suspect that this is because an LSTM is itself a powerful sequence model, and thus instead of estimating the biased ILM it may learn a “separate” LM on its own.

3 Attention-based encoder-decoder ASR

As mentioned above, all our work is based on the attention-based encoder-decoder (AED) model [1]. In the following experiments, we trained the LAS model with different encoders, namely BLSTM, Transformer [8], and Conformer [3], while the decoder was fixed to an LSTM architecture for simplicity.

The objective of an E2E model, including AED, is to predict the posterior of a word sequence $Y$ given the input feature sequence $X$. In an AED model [1], the encoder learns to map the feature representation $X$ to a higher-level representation $H$, where the dimension and sequence length of $X$ and $H$ normally differ due to the transformation and down-sampling operations. The attention network determines which subset of the sequence $H$ is to be attended to, given the decoder’s hidden state representation $q_t$. That is, Eqs. (1) and (2) compute the attention weighting vector $a_t$ at step $t$ and the context vector $c_t$, a weighted sum of the encoder outputs that the decoder employs to predict the next token.

With the obtained context vector, the decoder proceeds as in Eqs. (3)-(5), where $e_t$ is the embedding vector corresponding to the output token $y_t$, which is normally a word-piece in practice. As a result, the word/token sequence posterior is obtained as the product of Eq. (5) over all decoding steps.
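For reference, one common way to write Eqs. (1)-(5) for a LAS model is sketched below. The numbering follows the references in the text, but the exact argument lists (for example, whether the attention network also sees the previous attention weights) and the symbols $H$, $h_i$, $q_t$, $a_t$, $c_t$, $e_t$, $W$, and $b$ are assumptions, not the paper's verbatim equations.

```latex
\begin{align}
a_t &= \mathrm{AttentionNet}(q_t, H) \tag{1} \\
c_t &= \sum_{i} a_{t,i}\, h_i \tag{2} \\
q_t &= \mathrm{DecoderRNN}(q_{t-1}, e_{t-1}, c_{t-1}) \tag{3} \\
z_t &= W\,[q_t; c_t] + b \tag{4} \\
P(y_t \mid y_{<t}, X) &= \mathrm{softmax}(z_t) \tag{5}
\end{align}
```

Under this form, the sequence posterior is $P(Y \mid X) = \prod_t P(y_t \mid y_{<t}, X)$, and the encoder output $H$ reaches the decoder only through the context vector $c_t$.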

4 Proposed method

4.1 Non-adaptive context learning for ILME (NACL-ILME)

From what is presented in Section 3, if the context vector in Eq. (2) is not estimated from the encoder’s output, the decoder itself can be seen as an LM. This motivates us to learn the context vector by using only the training transcript. Specifically, during inference Eq. (2) is rewritten as

$$c_t = c,$$

where $c$ is a learnable vector. In other words, the context vector $c_t$ in Eq. (3) is replaced with a constant vector. The context vector $c$ is trained to minimize the perplexity of the decoder on the training transcription. Furthermore, during the training of $c$, all the other parameters of the AED model are kept fixed. The learned context vector, together with the fixed decoder, forms the estimated ILM.
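The training procedure above can be illustrated with a minimal sketch. It is not the authors' implementation: the frozen "decoder" is a hypothetical toy in which the context vector enters only the output layer, so the gradient can be written by hand, whereas the paper uses an LSTM decoder. All function names and sizes are assumptions made for illustration. Only the context vector is updated; the decoder weights are never touched.

```python
import copy
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def make_frozen_decoder(vocab_size, state_dim, ctx_dim, seed=0):
    """Toy stand-in for a trained decoder; these weights are never updated."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "emb": rand_matrix(vocab_size, state_dim),  # token embeddings
        "w_q": rand_matrix(vocab_size, state_dim),  # output weights on the state
        "w_c": rand_matrix(vocab_size, ctx_dim),    # output weights on the context
    }


def next_token_probs(decoder, prev_token, context):
    """P(y_t | y_{t-1}, c) for the toy decoder (assumption: one-token history)."""
    state = [math.tanh(v) for v in decoder["emb"][prev_token]]
    logits = [
        sum(w * s for w, s in zip(w_q_row, state))
        + sum(w * c for w, c in zip(w_c_row, context))
        for w_q_row, w_c_row in zip(decoder["w_q"], decoder["w_c"])
    ]
    return softmax(logits)


def nll_and_grad(decoder, transcripts, context):
    """Mean negative log-likelihood and its gradient w.r.t. the context only."""
    total_nll, count = 0.0, 0
    grad = [0.0] * len(context)
    for tokens in transcripts:
        for prev_token, target in zip(tokens[:-1], tokens[1:]):
            probs = next_token_probs(decoder, prev_token, context)
            total_nll -= math.log(probs[target])
            count += 1
            for k, p in enumerate(probs):
                err = p - (1.0 if k == target else 0.0)
                for j, w in enumerate(decoder["w_c"][k]):
                    grad[j] += err * w
    return total_nll / count, [g / count for g in grad]


def learn_context(decoder, transcripts, ctx_dim, lr=0.3, steps=200):
    """Plain gradient descent on the context vector; the decoder stays frozen."""
    context = [0.0] * ctx_dim
    for _ in range(steps):
        _, grad = nll_and_grad(decoder, transcripts, context)
        context = [c - lr * g for c, g in zip(context, grad)]
    return context


# Token-id transcripts (made-up data for the sketch).
transcripts = [[0, 1, 2, 3, 1, 2, 5], [0, 4, 2, 3, 5], [0, 1, 2, 5]]
decoder = make_frozen_decoder(vocab_size=6, state_dim=4, ctx_dim=4)
frozen_copy = copy.deepcopy(decoder)
zero_nll, _ = nll_and_grad(decoder, transcripts, [0.0] * 4)
learned = learn_context(decoder, transcripts, ctx_dim=4)
learned_nll, _ = nll_and_grad(decoder, transcripts, learned)
print("perplexity with zero context:   ", math.exp(zero_nll))
print("perplexity with learned context:", math.exp(learned_nll))
```

In this toy setting the all-zero context plays the role of the Zero-out baseline, and the learned constant vector can only lower the transcript perplexity relative to it.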

4.2 Adaptive context learning for ILME (ACL-ILME)

What we propose in Section 4.1 is to learn a non-adaptive context vector, meaning that the same context vector is used at every time step during inference. This may not be optimal. From Eq. (2), we can see that the context vector varies along with the decoding state at every decoding step. The actual internal LM may benefit from this variation. Thus, we propose an adaptive context vector learning method in this section. The context vector $c_t$ is allowed to change at each decoding time step as it normally does (see Eq. (3)). However, $c_t$ should not depend on the output of the encoder.

To achieve this, we propose to generate the context vector by using only the current decoding output $q_t$. Specifically, from Eqs. (1) and (2), we can see that the encoder’s output $H$ affects the context vector through an attention mechanism. To train the internal language model, we propose to remove $H$ from Eqs. (1) and (2). We then have $c_t = f(q_t)$, where $f$ is a nonlinear mapping function. The learned mapping function, together with the decoder, forms the internal LM. Furthermore, $f$ should be as simple as possible to encourage the original decoder parameters to produce the ILM scores. We propose to use a simple feed-forward neural network for $f$, i.e.

$$c_t = \mathrm{FFNN}(q_t),$$

where FFNN refers to a feed-forward neural network.

Training of the feed-forward neural network is similar to that described in Section 4.1. Once the training of the entire AED model is done, we continue to train the feed-forward neural network to minimize the perplexity on the training text data. The parameters of the trained AED model are kept fixed while the feed-forward neural network is updated.
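A minimal sketch of the mapping from query vector to context vector is shown below. It follows the shape described in Section 5.2 (a fully connected 4-layer network with ReLU on the first three layers), but the layer sizes are shrunk for readability, and the initialization, the zero biases, and the names `init_ffnn` and `ffnn_context` are assumptions. Only the forward pass is shown.

```python
import math
import random


def init_ffnn(in_dim, hidden_dim, out_dim, num_layers=4, seed=0):
    """Create num_layers fully connected layers as (weights, bias) pairs."""
    rng = random.Random(seed)
    dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(d_in)  # assumed initialization scale
        weights = [[rng.uniform(-scale, scale) for _ in range(d_in)]
                   for _ in range(d_out)]
        layers.append((weights, [0.0] * d_out))
    return layers


def ffnn_context(layers, query):
    """Map a decoder query vector q_t to a context vector c_t.

    ReLU follows every layer except the last one. The encoder output is
    not an input, so c_t depends on the decoder state alone.
    """
    hidden = list(query)
    for index, (weights, bias) in enumerate(layers):
        hidden = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(weights, bias)]
        if index < len(layers) - 1:
            hidden = [max(0.0, v) for v in hidden]
    return hidden


layers = init_ffnn(in_dim=8, hidden_dim=16, out_dim=8)
query = [0.1 * i for i in range(8)]
context = ffnn_context(layers, query)
print(len(layers), len(context))
```

Because the network is stateless, the same query always yields the same context vector, which is one way it stays simpler than a recurrent mapping.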

Zeineldeen et al. recently also proposed a similar context learning method, called “Mini-LSTM”, in [23]. Our method differs from theirs in several respects. Firstly, rather than using the decoder’s output as input to learn the context vector, we use the decoder’s state vector, which acts as a query vector. Secondly, though they might also have noticed that the mapping function should be as simple as possible, they propose to use a “Mini-LSTM” as the mapping function. However, an LSTM is a powerful sequence model and might itself learn a different internal language model. Last but not least, our experiments show that the proposed method significantly outperforms the “Mini-LSTM” method proposed in [23].

4.3 ILME-based Language model fusion

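During inference, we need three scores to perform external LM fusion: the output of the AED ASR decoder $\log P(Y|X)$, the score produced by the estimated ILM $\log P_{\text{ILM}}(Y)$, and the score calculated by an external LM $\log P_{\text{ext}}(Y)$. The final inference results are found by using:

$$\hat{Y} = \arg\max_{Y} \left[ \log P(Y|X) - \lambda_{\text{ILM}} \log P_{\text{ILM}}(Y) + \lambda_{\text{ext}} \log P_{\text{ext}}(Y) \right],$$

where $\lambda_{\text{ILM}}$ and $\lambda_{\text{ext}}$ are the estimated ILM and external LM weighting factors respectively.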
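The score combination can be sketched in a few lines. For simplicity the sketch rescores a small list of complete hypotheses, which is an assumption; the text does not say whether fusion is applied per token inside beam search or to full hypotheses. The hypothesis scores are made-up numbers, and `fused_score` and `pick_best` are hypothetical names.

```python
def fused_score(log_p_asr, log_p_ilm, log_p_ext, ilm_weight, ext_weight):
    """ASR score minus weighted ILM score plus weighted external LM score."""
    return log_p_asr - ilm_weight * log_p_ilm + ext_weight * log_p_ext


def pick_best(hypotheses, ilm_weight, ext_weight):
    """Return the hypothesis with the highest fused score."""
    return max(
        hypotheses,
        key=lambda h: fused_score(h["asr"], h["ilm"], h["ext"],
                                  ilm_weight, ext_weight),
    )


# Made-up log scores: "A" is favored by the source-domain ILM,
# "B" by the target-domain external LM.
hypotheses = [
    {"text": "A", "asr": -4.0, "ilm": -3.0, "ext": -9.0},
    {"text": "B", "asr": -4.5, "ilm": -6.0, "ext": -5.0},
]

no_fusion = pick_best(hypotheses, ilm_weight=0.0, ext_weight=0.0)
ilme_fusion = pick_best(hypotheses, ilm_weight=0.25, ext_weight=0.35)
print(no_fusion["text"], ilme_fusion["text"])
```

Setting the ILM weight to zero reduces the rule to shallow fusion, and setting both weights to zero recovers the plain ASR score.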

5 Experiments

5.1 Data sets

To verify the effectiveness of the proposed methods for both intra- and inter-domain LM fusion, three manually transcribed data sets, i.e. 18k hours of in-house English data, the 960-hour Librispeech corpus [14], and 10k hours of in-house Mandarin data, were used to train the LAS models. The test-other set of Librispeech (2939 utterances), the test set of TED-LIUM-V3 [5] (1155 utterances), and an in-house medical-domain Mandarin test set (1092 utterances) were used to evaluate different models. The external LMs for the three test sets were trained on the corresponding domain text data, containing 802, 229, and 4398 million English words or Chinese characters, respectively.

5.2 Models and experimental setups

Three different LAS models with different encoders, i.e., a BLSTM, a Transformer [22, 9], and a Conformer [3], were trained for comparison. The BLSTM encoder was configured the same as in [1]. Together with the decoder, it forms a conventional LAS model [1]. The Transformer’s main parameters {layer, dim, head} are {18, 512, 8}, and the intermediate GLU layer size is 2048 with 0.1 dropout. The Conformer’s parameters {layer, dim, head} are {12, 512, 8}, and the intermediate SWISH layer size is also 2048 with 0.1 dropout. The convolution kernel size for the Conformer is 32. As for the decoder, we used the same architecture for all LAS models: a 4-layer LSTM with 1024 hidden units. For LAS models trained on the English datasets, we use Byte Pair Encoding (BPE) subword units with a vocabulary size of 7000. For LAS models trained on Chinese, 8046 Chinese characters are used as the modeling units. The external LMs are 3-layer LSTMs with 4096 units per layer. The feed-forward neural network used for the ACL-ILME mapping is a fully connected 4-layer network with 512 units per layer. ReLU is applied to the first three layers as the activation function. Both the learnable vector and the feed-forward network are trained with an initial learning rate of 0.001, which decays to a final learning rate of 0.0001 over 10000 steps. All the LM fusion weights, i.e. those of the estimated ILM and the external LM, were tuned using grid search. All the experiments were conducted using the Lingvo [16] platform.
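A grid search over the two fusion weights could look like the sketch below. The grid range and the 0.05 step are assumptions (the weights reported in the tables are multiples of 0.05), and `toy_error` is a hypothetical stand-in for decoding a development set and measuring WER; it does not reproduce any result from the paper.

```python
import itertools


def grid_search(evaluate, ext_weights, ilm_weights):
    """Return (error, ext_weight, ilm_weight) with the lowest error on the grid."""
    best = None
    for ext_w, ilm_w in itertools.product(ext_weights, ilm_weights):
        error = evaluate(ext_w, ilm_w)
        if best is None or error < best[0]:
            best = (error, ext_w, ilm_w)
    return best


def toy_error(ext_w, ilm_w):
    """Hypothetical error surface; NOT real ASR results."""
    return 10.0 + (ext_w - 0.35) ** 2 + (ilm_w - 0.25) ** 2


grid = [i / 20 for i in range(13)]  # 0.0, 0.05, ..., 0.6
best_error, best_ext, best_ilm = grid_search(toy_error, grid, grid)
print(best_error, best_ext, best_ilm)
```

In a real setup, `evaluate` would run decoding with the given weights, so the cost of the search grows with the product of the two grid sizes.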

5.3 Results

5.3.1 ILM Perplexity evaluation

One way to evaluate the effectiveness of different ILME methods is to calculate the perplexity of the estimated ILM on the training text. Experiments were carried out on the Librispeech data only. All the estimated ILMs were trained with a subset containing 90% of the transcription of the training data. The perplexity of ILMs estimated by different ILME methods was evaluated using the remaining 10% of the transcription. Note that the perplexity of estimated ILM is evaluated on the dev set in most previous works [12, 23]. We prefer to evaluate them on a held-out training set instead since the main purpose of an ILM is to recover the biased ILM that the original E2E model implicitly learned from the limited training data. The perplexity on the training data will reflect the performance of an ILME method better. Table 1 shows the perplexity of LAS models with different encoders and different ILME methods.
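The evaluation protocol above amounts to a 90/10 split of the transcripts followed by a per-token perplexity computation. The sketch below assumes a random split with a fixed seed and natural-log token probabilities; the paper does not specify how the 10% subset was chosen, and the function names are hypothetical.

```python
import math
import random


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def split_transcripts(transcripts, held_out_fraction=0.1, seed=0):
    """Shuffle a copy and hold out a fraction (the seed is an arbitrary choice)."""
    shuffled = list(transcripts)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


all_ids = [f"utt-{i}" for i in range(20)]  # placeholder transcript ids
train, held_out = split_transcripts(all_ids)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_over_four = [math.log(0.25)] * 8
print(len(train), len(held_out), perplexity(uniform_over_four))
```

A lower perplexity on the held-out transcripts means the estimated ILM assigns higher probability to text drawn from the same distribution the E2E model was trained on.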

ILME method   BLSTM   Transformer   Conformer
Zero-out      387     6247          3271
Mini-LSTM     240     460           477
NACL-ILME     266     528           563
ACL-ILME      235     428           463
Table 1: Perplexity of different LAS models and different internal language modeling methods evaluated on the held-out training transcript. Columns give the encoder type.

From Table 1, we can see that the proposed ACL-ILME method consistently yields the lowest perplexity among all methods, closely followed by the Mini-LSTM method. However, the perplexity of the Zero-out method is very large, especially when a Transformer or Conformer encoder is used. Since the context vector is a weighted summation of the encoder’s output in a LAS model, we suspect that this may relate to the distribution of the encoder’s outputs.

Figure 1: The numeric distribution of three different encoders’ outputs: (a) BLSTM, (b) Conformer, (c) Transformer.

Figure 1 shows the distribution of three different encoders’ outputs. We can see that the distribution of the BLSTM’s output is symmetric. In addition, all values fall in the range of [-1,1]. Therefore, a zero context vector is a reasonable assumption if the BLSTM is used as the encoder. On the other hand, the output distributions of both the Transformer and the Conformer encoders are rather “messy”. The dynamic range is also relatively large, leading to an “unpredictable” context vector. Thus zeroing out the context vector leads to very large perplexities due to assumption mismatch.

5.3.2 Cross-domain LM fusion

Table 2 reports the Word Error Rate (WER) of the proposed methods for cross-domain LM fusion. The Librispeech test-other test data was used as the target domain, while the source ASR models were trained using the 18k-hours in-house English data.

Encoder       Row           None    SF      Zero    LSTM    NACL    ACL
BLSTM         WER           10.35   8.75    7.97    7.68    7.56    6.88
              Ext. LM wt.   0.0     0.15    0.25    0.3     0.35    0.35
              ILM wt.       0.0     0.0     0.1     0.15    0.15    0.25
Transformer   WER           8.94    7.44    7.11    6.35    6.29    5.99
              Ext. LM wt.   0.0     0.15    0.25    0.3     0.4     0.4
              ILM wt.       0.0     0.0     0.05    0.15    0.2     0.25
Conformer     WER           8.96    7.61    7.61    6.85    6.98    6.41
              Ext. LM wt.   0.0     0.1     0.1     0.15    0.25    0.25
              ILM wt.       0.0     0.0     0.0     0.15    0.1     0.15
Table 2: WER(%) on Librispeech test-other. Three AED models were trained with the 18k-hours in-house English data. Different fusion methods, including no fusion (None), Shallow Fusion (SF), Zero-out context (Zero), Mini-LSTM (LSTM), the NACL-ILME (NACL), and the ACL-ILME (ACL) are compared. For each encoder, the two rows below the WER give the tuned external LM weight and ILM weight.

As can be seen from Table 2, the proposed methods, particularly the ACL-ILME method, consistently achieve the lowest WER with different encoders. Though the proposed NACL-ILME method is very simple, it achieves results comparable to the Mini-LSTM method. The Zero-out method works slightly better than the shallow fusion method with the BLSTM and Transformer encoders and matches it with the Conformer encoder. Finally, compared with the shallow fusion method, the proposed ACL-ILME achieves up to about 21% relative WER reduction (from 8.75% to 6.88% with the BLSTM encoder).
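The relative reductions can be recomputed directly from the SF and ACL columns of Table 2. The snippet below is plain arithmetic on those reported WERs; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# (shallow fusion WER, ACL-ILME WER) per encoder, copied from Table 2.
table2 = {
    "BLSTM": (8.75, 6.88),
    "Transformer": (7.44, 5.99),
    "Conformer": (7.61, 6.41),
}

reductions = {name: relative_reduction(sf, acl) for name, (sf, acl) in table2.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}% relative WER reduction over shallow fusion")
```

The largest gain comes from the BLSTM encoder, at roughly 21%, with about 19% for the Transformer and about 16% for the Conformer.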

Encoder       Row           None    SF      Zero    LSTM    NACL    ACL
BLSTM         WER           14.8    12.53   11.88   10.63   10.36   10.45
              Ext. LM wt.   0.0     0.15    0.2     0.4     0.45    0.4
              ILM wt.       0.0     0.0     0.1     0.3     0.35    0.3
Transformer   WER           15.05   14.35   14.35   14.35   14.35   14.35
              Ext. LM wt.   0.0     0.05    0.05    0.05    0.05    0.05
              ILM wt.       0.0     0.0     0.0     0.0     0.0     0.0
Conformer     WER           12.78   12.25   10.95   11.05   10.4    10.92
              Ext. LM wt.   0.0     0.05    0.25    0.3     0.45    0.4
              ILM wt.       0.0     0.0     0.15    0.25    0.35    0.3
Table 3: WER(%) on TED-LIUM-V3 test data. Different fusion methods, including no fusion (None), Shallow Fusion (SF), Zero-out context (Zero), Mini-LSTM (LSTM), the NACL-ILME (NACL), and the ACL-ILME (ACL) are compared. For each encoder, the two rows below the WER give the tuned external LM weight and ILM weight.

Table 3 shows the performance of different models trained on the 960-hour Librispeech data but tested on the TED-LIUM-V3 data. The source E2E model was fused with the same external LM by using different cross-domain LM fusion methods. The proposed NACL-ILME method achieves the best performance among all the LM fusion methods with the BLSTM and Conformer encoders. However, none of the ILME methods has an advantage over simple shallow fusion when the encoder is a Transformer. This might be because all the ILME methods can only compensate for the linguistic bias introduced by the decoder, whereas a LAS model with a Transformer encoder might learn most of the linguistic information in its encoder. Another possible reason is that the perplexity (i.e. 99.4) of the external language model on TED-LIUM-V3 is relatively large. As a result, the improvement from using ILME methods in Table 3 is rather limited compared with what has been achieved in Table 2.


Encoder       Row           None    SF      Zero    LSTM    NACL    ACL
Transformer   CER           6.72    5.93    5.93    5.50    5.50    4.80
              Ext. LM wt.   0.0     0.15    0.15    0.35    0.35    0.45
              ILM wt.       0.0     0.0     0.0     0.25    0.25    0.4
Table 4: CER(%) on an in-house Chinese medical test set. Different fusion methods, including no fusion (None), Shallow Fusion (SF), Zero-out context (Zero), Mini-LSTM (LSTM), the NACL-ILME (NACL), and the ACL-ILME (ACL) are compared. The two rows below the CER give the tuned external LM weight and ILM weight.

Table 4 shows the Character Error Rate (CER) on Bytedance’s in-house medical data set. The source LAS model was trained on the 10k-hour Mandarin data. Only the Transformer was used as the encoder due to limited time. Again, the ACL-ILME method achieves the lowest CER. We found that the proposed ACL-ILME method is very effective at differentiating homophonic Chinese characters, which is critical in Mandarin ASR. We believe this is why the CER reduction is so pronounced on this test set.

5.3.3 Intra-domain LM fusion

Table 5 reports the WER results for intra-domain LM fusion. The source E2E models were trained on the 960-hour Librispeech data set and evaluated on the test-other test set. Both proposed methods outperform the Zero-out and shallow fusion methods by a clear margin.

Encoder       Row           None    SF      Zero    LSTM    NACL    ACL
BLSTM         WER           7.13    6.22    5.44    5.17    5.18    5.16
              Ext. LM wt.   0.0     0.15    0.35    0.55    0.55    0.55
              ILM wt.       0.0     0.0     0.2     0.4     0.4     0.4
Transformer   WER           7.6     7.06    6.98    6.51    6.71    6.63
              Ext. LM wt.   0.0     0.1     0.25    0.25    0.2     0.2
              ILM wt.       0.0     0.0     0.15    0.15    0.15    0.1
Conformer     WER           6.3     5.8     5.48    5.19    5.19    5.11
              Ext. LM wt.   0.0     0.1     0.2     0.3     0.35    0.35
              ILM wt.       0.0     0.0     0.05    0.25    0.3     0.3
Table 5: WER(%) results on the Librispeech test-other dataset for intra-domain LM fusion. Different fusion methods, including no fusion (None), Shallow Fusion (SF), Zero-out context (Zero), Mini-LSTM (LSTM), the NACL-ILME (NACL), and the ACL-ILME (ACL) are compared. For each encoder, the two rows below the WER give the tuned external LM weight and ILM weight.

6 Conclusion

In this paper, we proposed two novel ILME methods by learning a static context vector or a mapping between the query vector and the context vector. Experiments on multiple datasets demonstrate the effectiveness of the proposed methods. Compared with shallow fusion and other previously proposed ILME methods, the methods proposed in this paper significantly reduce the error rate of the system. In the future, we would like to extend these methods to the recurrent neural network transducer-based ASR framework.


References

  • [1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals (2015) Listen, attend and spell. arXiv preprint arXiv:1508.01211.
  • [2] J. Chorowski and N. Jaitly (2016) Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695.
  • [3] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
  • [4] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio (2015) On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
  • [5] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve (2018) TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. In Proc. SPECOM, pp. 198–208.
  • [6] T. Hori, S. Watanabe, and J. R. Hershey (2017) Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition. In Proc. ASRU, pp. 287–293.
  • [7] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar (2018) An analysis of incorporating an external language model into a sequence-to-sequence model. In Proc. ICASSP, pp. 1–5828.
  • [8] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al. (2019) A comparative study on transformer vs RNN in speech applications. In Proc. ASRU, pp. 449–456.
  • [9] S. Li, D. Raj, X. Lu, P. Shen, T. Kawahara, and H. Kawai (2019) Improving transformer-based speech recognition systems with compressed structure and speech attributes augmentation. In Proc. INTERSPEECH, pp. 4400–4404.
  • [10] E. McDermott, H. Sak, and E. Variani (2019) A density ratio approach to language model fusion in end-to-end automatic speech recognition. In Proc. ASRU, pp. 434–441.
  • [11] Z. Meng, N. Kanda, Y. Gaur, S. Parthasarathy, E. Sun, L. Lu, X. Chen, J. Li, and Y. Gong (2021) Internal language model training for domain-adaptive end-to-end speech recognition. In Proc. ICASSP, pp. 7338–7342.
  • [12] Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li, and Y. Gong (2021) Internal language model estimation for domain-adaptive end-to-end speech recognition. In Proc. SLT, pp. 243–250.
  • [13] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Proc. Interspeech, Vol. 2, pp. 1045–1048.
  • [14] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210.
  • [15] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie (2019) Component fusion: learning replaceable language model component for end-to-end speech recognition system. In Proc. ICASSP, pp. 5361–5635.
  • [16] J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C. Chiu, et al. (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295.
  • [17] A. Sriram, H. Jun, S. Satheesh, and A. Coates (2017) Cold fusion: training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.
  • [18] F. Stahlberg, J. Cross, and V. Stoyanov (2018) Simple fusion: return of the language model. arXiv preprint arXiv:1809.00125.
  • [19] M. Sugiyama, T. Suzuki, and T. Kanamori (2012) Density ratio estimation in machine learning. Cambridge University Press.
  • [20] S. Toshniwal, A. Kannan, C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu (2018) A comparison of techniques for language model integration in encoder-decoder speech recognition. In Proc. SLT, pp. 369–375.
  • [21] E. Variani, D. Rybach, C. Allauzen, and M. Riley (2020) Hybrid autoregressive transducer (HAT). In Proc. ICASSP, pp. 6139–6143.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NeurIPS, pp. 5998–6008.
  • [23] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney (2021) Investigating methods to improve language model integration for attention-based encoder-decoder ASR models. arXiv e-prints, pp. arXiv–2104.
  • [24] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney (2021) Librispeech transducer model with internal language model prior correction. arXiv preprint arXiv:2104.03006.