1 Introduction
End-to-end (E2E) automatic speech recognition (ASR) has attracted interest as a method of directly integrating acoustic models (AMs) and language models (LMs) because of its simple training and efficient decoding procedures. In recent years, various approaches have been studied, including connectionist temporal classification (CTC)
[1, 2, 3], attention-based encoder–decoder models [4, 5], hybrid models [6, 7], and transducers [8, 9, 10]. E2E ASR requires pairs of audio and text data for training. Even with a large amount of paired data, Del Rio et al. demonstrated that training with 960 h of LibriSpeech read speech does not yield sufficient performance in the mismatched domain of earnings calls [11]. If the target domain has paired data, adaptation techniques can be adopted [12, 13, 14, 15]. However, in most scenarios, orders of magnitude more text-only data are available in the target domain, and it is more efficient to shift the linguistic bias of the E2E ASR model towards the domain of interest using such data.
Many researchers have studied fusion methods using an external LM trained with text-only data. Shallow fusion, which linearly interpolates the E2E ASR model with an external LM, is the most popular approach
[4, 16, 17]. More structural integration can be observed in deep fusion [18], cold fusion [19], and component fusion [20], which require additional training. Fundamentally, the probability estimation of these LMs relies on a softmax computation; therefore, although efficient log-sum-exp tricks exist, the computational cost grows with the vocabulary size. The density ratio approach
[21] focuses more on domain adaptation by assuming that the source and target domains are acoustically consistent; it adapts the E2E ASR model with LMs trained in each domain following Bayes' rule. Recently, estimation of the internal LM, the linguistic bias of the E2E ASR model, has been investigated; subtracting it from the ASR posterior improves performance in both cross-domain and intra-domain scenarios [22, 23, 24, 25]. However, both the density ratio approach and internal LM estimation complicate the inference computation. In addition, due to domain mismatch, the estimation of the internal LM may not always be accurate.

In this paper, we propose a simple external LM fusion method for domain adaptation that takes the internal LM into account. Instead of subtracting an estimated internal LM and fusing with an external target-domain LM, we directly model their residual factor. The difference of the two probability distributions, namely the residual LM, is trained with a target-domain text-only dataset, considering the internal LM estimated in that domain. Thus, the residual LM not only conveys the linguistic characteristics of the dataset but also aggregates the estimation results of the internal LM over the target-domain corpus into the model, thereby alleviating the domain-mismatch problem. In addition, because the resulting distribution is no longer a probability, the residual LM can omit costly softmax computations in the output layer. We propose a training approach that applies smoothing to the internal LM probability and combines cross-entropy with mean squared error (MSE) in the loss function. The trained residual LM can be fused in the same simple manner as shallow fusion. We performed experiments to determine the effectiveness of the proposed residual LM in cross-domain and intra-domain scenarios using various corpora.
The results show that the proposed residual LM improves performance in cross-domain scenarios by 4.0% relative word error rate (WER) in the LibriSpeech-to-TED-LIUM 3 adaptation, with faster inference. Additionally, the residual LM fusion method performs robustly in intra-domain scenarios.
2 Formulation and Related Studies
ASR is the problem of determining the most probable token sequence \(Y\) given an input audio sequence \(X\). In a scenario in which the training and target domains differ, and the text corpus of the target domain is easily accessible, it is useful to combine the ASR model with an external LM trained on the target corpus. With a Bayesian interpretation, classical hybrid ASR systems determine the highest probability by combining the AM with the external LM, as follows:

\[ \hat{Y} = \arg\max_Y\, p(X \mid Y; \theta_\mathrm{AM})\, p(Y; \theta_\mathrm{LM}), \tag{1} \]

where \(\theta_\mathrm{AM}\) and \(\theta_\mathrm{LM}\) denote the parameters of the AM and LM, respectively. In E2E systems,
\(p(X \mid Y; \theta_\mathrm{AM})\) is replaced by E2E neural networks, which can be further decomposed into the following terms using Bayes' theorem:

\[ \hat{Y} = \arg\max_Y\, \frac{p(Y \mid X; \theta_\mathrm{S})\, p(X; \theta_\mathrm{S})}{p(Y; \theta_\mathrm{S})}\, p(Y; \theta_\mathrm{T}), \tag{2} \]

where \(\theta_\mathrm{S}\) is the parameter set of the E2E ASR model. For domain adaptation, the E2E models are trained in a source domain and the external LMs in a target domain; thus, \(\theta_\mathrm{S}\) and \(\theta_\mathrm{T}\) represent the source and target domains, respectively. By omitting \(p(X; \theta_\mathrm{S})\), which is not required to search for the highest-probability \(Y\), the score function for recognition can be expressed on a logarithmic scale as

\[ \mathrm{score}(Y) = \log p(Y \mid X; \theta_\mathrm{S}) - \log p(Y; \theta_\mathrm{S}) + \log p(Y; \theta_\mathrm{T}). \tag{3} \]

The first term is the output posterior of the E2E ASR neural network, and the second term is the implicit linguistic bias (prior) of the trained E2E model. The last term is an external LM trained on the target-domain text corpus.
2.1 Shallow fusion
E2E ASR models are often used with an external LM trained on a text-only corpus of the target domain. This is reasonable because, unlike E2E ASR, which requires a large corpus of paired audio–text data for training, external LM training requires only text data, which can easily be obtained at scale. Shallow fusion is a common method for integrating an E2E ASR model with an LM [4, 16, 17]. In practice, the second term in Eq. (3) is omitted because it is intractable, and in shallow fusion, only the external LM term is combined by introducing an LM weight \(\lambda\), as:

\[ \mathrm{score}_\mathrm{SF}(Y) = \log p(Y \mid X; \theta_\mathrm{S}) + \lambda \log p(Y; \theta_\mathrm{T}). \tag{4} \]
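As a concrete sketch of Eq. (4), the fusion is just a weighted sum of log probabilities per candidate token. The function and parameter names below are ours, not from any particular toolkit:

```python
def shallow_fusion_score(log_p_asr, log_p_ext_lm, lm_weight=0.3):
    """Eq. (4): weighted log-linear combination of the E2E ASR posterior
    and the external LM score for one candidate token."""
    return log_p_asr + lm_weight * log_p_ext_lm

def best_token(asr_log_probs, ext_lm_log_probs, lm_weight=0.3):
    """Pick the top candidate under the fused score (greedy illustration;
    in practice the combination is applied inside beam search)."""
    scores = [shallow_fusion_score(a, l, lm_weight)
              for a, l in zip(asr_log_probs, ext_lm_log_probs)]
    return max(range(len(scores)), key=scores.__getitem__)
```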
2.2 Density ratio approach
The density ratio approach [21] assumes that the source and target domains are acoustically consistent. Assuming that an LM trained with text-only data of the source domain can represent the linguistic prior of the E2E model, i.e., the second term of Eq. (3), the density ratio uses the ratio of the LMs trained in the two domains for adaptation. Thus, the score can be expressed using the source-domain and target-domain LM weights, \(\lambda_\mathrm{S}\) and \(\lambda_\mathrm{T}\), as follows:

\[ \mathrm{score}_\mathrm{DR}(Y) = \log p(Y \mid X; \theta_\mathrm{S}) - \lambda_\mathrm{S} \log p(Y; \theta_\mathrm{S}^\mathrm{LM}) + \lambda_\mathrm{T} \log p(Y; \theta_\mathrm{T}). \tag{5} \]
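A corresponding sketch of the density-ratio score in Eq. (5); the weights here are placeholders, not tuned values:

```python
def density_ratio_score(log_p_asr, log_p_src_lm, log_p_tgt_lm,
                        w_src=0.3, w_tgt=0.3):
    """Eq. (5): subtract a source-domain LM score and add a target-domain
    LM score, approximating Bayes' rule under acoustically consistent
    source and target domains."""
    return log_p_asr - w_src * log_p_src_lm + w_tgt * log_p_tgt_lm
```

Note that two LMs must be evaluated at every decoding step.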
2.3 Internal language model estimation
Internal LM estimation (ILME) [22, 23, 24, 25] attempts to estimate the second term in Eq. (3), i.e., the internal LM \(p(Y; \theta_\mathrm{S})\). The score for decoding with the estimated internal LM \(p_\mathrm{ILM}\) is expressed as follows:

\[ \mathrm{score}_\mathrm{ILME}(Y) = \log p(Y \mid X; \theta_\mathrm{S}) - \lambda_\mathrm{I} \log p_\mathrm{ILM}(Y; \theta_\mathrm{S}) + \lambda_\mathrm{T} \log p(Y; \theta_\mathrm{T}), \tag{6} \]

where \(\lambda_\mathrm{I}\) is a weight parameter for the estimated internal LM.
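As a minimal illustration of the score combination in Eq. (6); again, names and default weights are ours:

```python
def ilme_score(log_p_asr, log_p_internal_lm, log_p_ext_lm,
               ilm_weight=0.2, lm_weight=0.3):
    """Eq. (6): subtract the weighted internal-LM score before adding
    the weighted external target-domain LM score."""
    return log_p_asr - ilm_weight * log_p_internal_lm + lm_weight * log_p_ext_lm
```

Compared with shallow fusion, obtaining `log_p_internal_lm` requires an extra decoder pass at every step.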
A common method of estimating the internal LM, \(p_\mathrm{ILM}(Y; \theta_\mathrm{S})\), is to replace the encoder output with zero-filled vectors and infer only with the decoder. This works because, particularly in attention-based encoder–decoder and transducer architectures, the decoder functions similarly to an LM: it estimates the next (\(i\)-th) token \(y_i\) given the previous outputs \(y_{1:i-1}\). This estimation approach is statistically reasonable in the source domain because the encoder output is most likely normalized to zero-mean vectors by a normalization layer. However, the estimation must be performed at every inference step, which complicates the computation. In addition, when it is estimated on mismatched-domain speech, the behavior of the internal LM becomes unpredictable, and the estimation may not always be accurate.

3 Residual Language Model
3.1 Definition of the residual language model
Instead of estimating the internal LM at every inference step, we propose to directly model the residual factor between the target-domain external LM and the estimated internal LM. The model predicts the difference between the second and third terms (the internal and external LMs, respectively) in Eq. (3), which we define as the residual LM. By directly modeling the residual term, the inference computation is simplified to that of shallow fusion, as follows:

\[ \mathrm{score}_\mathrm{RLM}(Y) = \log p(Y \mid X; \theta_\mathrm{S}) + \lambda_\mathrm{R}\, r(Y; \theta_\mathrm{R}), \tag{7} \]

where \(r(Y; \theta_\mathrm{R})\) is the proposed residual LM with weight \(\lambda_\mathrm{R}\). The residual LM conveys both the second and third terms in Eq. (3); thus, Eq. (7) strictly follows the score calculation in Eq. (3) derived from the Bayesian interpretation. The main differences from ILME are as follows.
- The residual LM models the residual factor between an external LM and the internal LM, which simplifies the inference procedure and reduces the computational cost.
- The difference of two LMs is no longer a probability distribution; thus, the residual LM can further omit the log-softmax operation, which requires a costly log-sum-exp calculation.
- The residual LM conveys the statistical behavior of the estimated internal LM in the target-domain text corpus.
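The decoding-time simplification can be sketched as follows: the residual LM emits raw (unnormalized) scores, so no softmax or log-sum-exp over the vocabulary is needed at fusion time. Names are illustrative:

```python
def residual_fusion_score(log_p_asr, residual_score, w_res=0.3):
    """Eq. (7): the residual-LM output is added directly; being an
    unnormalized score (not a probability), it needs no log-softmax."""
    return log_p_asr + w_res * residual_score
```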
The residual LM is trained using target-domain text-only data. To model the residual term of the external and internal LMs, we define the training target of the model output as follows:

\[ \hat{r}_i(v) = \log \tilde{p}_i(v) - \mu \log p_\mathrm{ILM}(v \mid y_{1:i-1}; \theta_\mathrm{S}), \tag{8} \]

where \(y_i\) is a reference label and \(\mu\) is a tunable parameter. To avoid log-zero computation, a smoothed label as in [26] is adopted for the reference distribution \(\tilde{p}_i\), as

\[ \tilde{p}_i(v) = (1 - \beta)\, \delta(v, y_i) + \frac{\beta}{V}, \tag{9} \]

where \(v\) is the vocabulary index, \(\delta(v, y_i)\) is the Dirac delta with respect to the reference \(y_i\), \(V\) is the vocabulary size, and \(\beta\) is the smoothing weight. Throughout the training data, the internal LM is estimated using Eq. (8). While ordinary ILME (Sec. 2.3) is performed only with the input speech sample, the residual LM statistically captures its behavior over the entire target-domain training data, which may be effective during inference on target-domain data.
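The target construction of Eqs. (8) and (9) can be sketched in a few lines of numpy; function names and default hyperparameters are ours:

```python
import numpy as np

def smoothed_label(vocab_size, ref_idx, beta=0.1):
    """Eq. (9): label-smoothed reference distribution over the vocabulary."""
    p = np.full(vocab_size, beta / vocab_size)
    p[ref_idx] += 1.0 - beta
    return p

def residual_target(p_ref, p_ilm, mu=1.0):
    """Eq. (8): target scores = log smoothed reference
    minus mu times the log internal-LM probability."""
    return np.log(p_ref) - mu * np.log(p_ilm)
```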
3.2 Smoothing of the internal language model
Because the estimation of the internal LM is not always reliably performed in the target domain, the probability distribution may be inaccurate. If the probability of the token of interest is incorrectly estimated to be significantly low, the target distribution diverges to infinity. To prevent this, we use a temperature \(\tau\), as introduced in [27], to soften the distribution:

\[ p^{\tau}_\mathrm{ILM}(v \mid y_{1:i-1}) = \mathrm{softmax}(z_i / \tau), \tag{10} \]

where \(z_i\) is the output of the decoder of the E2E ASR model. We further introduce a small value \(\epsilon\) to avoid log zero in Eq. (8). By replacing \(p_\mathrm{ILM}\) with \(p^{\tau}_\mathrm{ILM}\), the softened target is used instead as

\[ \hat{r}_i(v) = \log \tilde{p}_i(v) - \mu \log \left( p^{\tau}_\mathrm{ILM}(v \mid y_{1:i-1}) + \epsilon \right). \tag{11} \]
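A sketch of the temperature smoothing and the epsilon guard of Eqs. (10) and (11); the default values of tau and eps are illustrative only:

```python
import numpy as np

def softened_ilm(logits, tau=2.0):
    """Eq. (10): temperature-softened softmax of the decoder logits."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def softened_target(p_ref, logits, mu=1.0, tau=2.0, eps=1e-7):
    """Eq. (11): use the softened internal LM, guarded by eps against log 0."""
    return np.log(p_ref) - mu * np.log(softened_ilm(logits, tau) + eps)
```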
3.3 Training of the residual language model
The residual LM is trained to minimize the distance between the target distribution \(\hat{r}\) and the model output \(r\). A straightforward approach is to minimize the L1 norm or the MSE between the model output and the target distribution. However, in our preliminary experiments, models trained this way did not perform reasonably. We assume that this is because Euclidean-based optimization attempts to minimize the distances of all vocabulary entries equally, which does not contribute to improving recognition performance.
To stably train the residual LM, we decompose the target function by introducing a normalization term, as follows:

\[ \hat{r}_i(v) = \log \hat{q}_i(v) + \hat{b}_i, \tag{12} \]

where \(\hat{q}_i\) and \(\hat{b}_i\) are defined as

\[ \hat{q}_i(v) = \frac{\exp \hat{r}_i(v)}{\sum_{v'} \exp \hat{r}_i(v')}, \tag{13} \]

\[ \hat{b}_i = \log \sum_{v'} \exp \hat{r}_i(v'). \tag{14} \]

Thus, the target function is decomposed into the probabilistic term, \(\hat{q}_i\), and the bias term, \(\hat{b}_i\). We propose to separately minimize the distances pertaining to these terms, combining the cross-entropy of the probabilities with the Euclidean distance of the biases.
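The decomposition of Eqs. (12)-(14) is exactly a softmax plus a log-sum-exp, which the following numpy sketch (our naming) makes explicit:

```python
import numpy as np

def decompose_target(r_hat):
    """Eqs. (12)-(14): split target scores into a normalized probabilistic
    term q_hat (softmax) and a scalar bias b_hat (log-sum-exp)."""
    r = np.asarray(r_hat, dtype=float)
    m = r.max()
    b_hat = m + np.log(np.exp(r - m).sum())   # Eq. (14), stable logsumexp
    q_hat = np.exp(r - b_hat)                 # Eq. (13)
    return q_hat, b_hat
```

By construction, log q_hat + b_hat recovers r_hat exactly, i.e., Eq. (12).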
3.3.1 Cross-entropy loss for the probabilistic term
LMs are generally optimized by minimizing the negative log-likelihood via the cross-entropy, which is equivalent to minimizing the perplexity. Inspired by this, we apply cross-entropy to optimize the probabilistic term, \(\hat{q}_i\), in Eq. (12). To extract the probabilistic factor from the residual LM, we normalize the model output by applying the softmax function:

\[ q_i(v) = \frac{\exp r_i(v; \theta_\mathrm{R})}{\sum_{v'} \exp r_i(v'; \theta_\mathrm{R})}. \tag{15} \]

The cross-entropy loss is then accumulated over the target-domain dataset as

\[ \mathcal{L}_\mathrm{CE} = -\sum_i \sum_v \hat{q}_i(v) \log q_i(v). \tag{16} \]
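A per-position sketch of Eqs. (15) and (16); function names are ours:

```python
import numpy as np

def log_softmax(r):
    """Eq. (15) in the log domain, computed stably."""
    r = np.asarray(r, dtype=float)
    m = r.max()
    return r - (m + np.log(np.exp(r - m).sum()))

def ce_loss(r_model, q_hat):
    """Eq. (16): cross-entropy between the target probabilistic term q_hat
    and the softmax of the residual-LM output."""
    return float(-(np.asarray(q_hat) * log_softmax(r_model)).sum())
```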
3.3.2 Mean-squared-error loss for the bias term
Analogously to Eqs. (12)-(14), the model output can be decomposed as

\[ r_i(v; \theta_\mathrm{R}) = \log q_i(v) + b_i, \tag{17} \]

where the bias of the model output is recovered by

\[ b_i = \log \sum_{v'} \exp r_i(v'; \theta_\mathrm{R}). \tag{18} \]

The bias term is optimized by minimizing the MSE between the model bias and the target bias of Eq. (14):

\[ \mathcal{L}_\mathrm{MSE} = \sum_i \left( b_i - \hat{b}_i \right)^2. \tag{19} \]
3.3.3 Integrated objective function
Finally, we integrate the aforementioned objectives in a hybrid manner. The cross-entropy and MSE losses are combined in a weighted sum using a parameter \(\alpha\), as follows:

\[ \mathcal{L} = \alpha \mathcal{L}_\mathrm{CE} + (1 - \alpha) \mathcal{L}_\mathrm{MSE}. \tag{20} \]

Note that both losses contain statistics of the internal LM, in \(\hat{q}_i\) of Eq. (16) and \(\hat{b}_i\) of Eq. (19). Therefore, the trained residual LM considers the statistical behavior of the internal LM in the target-domain data.
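Putting the pieces together, a single training objective might look like the following sketch. We assume the weighted sum in Eq. (20) is a convex combination with weight alpha; all names are illustrative:

```python
import numpy as np

def logsumexp(r):
    """Stable log-sum-exp over a score vector."""
    r = np.asarray(r, dtype=float)
    m = r.max()
    return float(m + np.log(np.exp(r - m).sum()))

def integrated_loss(r_model, q_hat, b_hat, alpha=0.5):
    """Eq. (20) sketch: weighted sum of the cross-entropy loss on the
    probabilistic term (Eq. (16)) and the MSE loss on the bias term
    (Sec. 3.3.2), for one output position."""
    log_q = np.asarray(r_model, dtype=float) - logsumexp(r_model)
    l_ce = -(np.asarray(q_hat) * log_q).sum()
    l_mse = (logsumexp(r_model) - b_hat) ** 2
    return float(alpha * l_ce + (1.0 - alpha) * l_mse)
```

When the model output equals the target scores, the bias MSE vanishes and the cross-entropy reduces to the target's entropy.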
4 Experiments
4.1 Cross-domain evaluation
To evaluate the effectiveness of the proposed residual LM on domain adaptation, we evaluated cross-domain scenarios in English and Japanese.
4.1.1 Experimental setup
For the English evaluation, we trained an E2E ASR model with the LibriSpeech dataset
[28], a read-speech corpus, and applied it to the TED-LIUM 3 [29] dev/test sets, which consist of spontaneous lecture-style speech. For Japanese, we trained an E2E ASR model with the lecture-style CSJ corpus [30] and then applied it to the LaboroTV dev set [31], a corpus of TV programs. We trained streaming Transformer E2E ASR models following [32]. The input acoustic features were 80-dimensional filter-bank features. The Transformer architecture consisted of 12 encoder blocks and six decoder blocks, with four-head 256-unit attention layers and 2048-unit feedforward layers. Contextual block encoding [33] was applied to the encoder with a block size of 40, a shift size of 16, and a look-ahead size of 8. The models were trained using multitask learning with the CTC loss, as in [6], with a weight of 0.3. We used the Adam optimizer with Noam learning-rate decay and applied SpecAugment [34].

External LMs for the baseline shallow fusion [16], as well as the proposed residual LMs, were trained using the text-only data of the training sets of the target corpora, i.e., TED-LIUM 3 and LaboroTV. The LMs were four-layer unidirectional LSTMs with 1024 units for the English task and two-layer unidirectional LSTMs with 2048 units for Japanese. We applied byte-pair-encoding subword tokenization with 5000 token classes for the English LMs; the tokens for the Japanese LMs were 3262 character classes. The training weight \(\mu\) in Eq. (8), the temperature \(\tau\) in Eq. (10), and the loss-integration weight \(\alpha\) in Eq. (20) were fixed across all experiments.
In addition to shallow fusion, we compared the residual LM with the density ratio approach [21] and ILME [22]. The LM weight \(\lambda\) and the density-ratio weight \(\lambda_\mathrm{S}\) in Eq. (5) were determined by a parameter search for each dataset; the internal LM weight \(\lambda_\mathrm{I}\) in Eq. (6) was shared across both languages. The beam size for decoding was 10. The internal LM was estimated by replacing the encoder output with a zero-filled tensor, as in
[22, 24]. We also measured the inference speed using 100 randomly sampled utterances from each dev set of the target domain.

Table 1: WER (LibriSpeech to TED-LIUM 3) and CER (CSJ to LaboroTV) with relative decoding speed in the cross-domain scenarios.

Method                 LS -> TED-LIUM 3 (WER)      CSJ -> LaboroTV (CER)
                       Dev    Test   Dec. Speed    Dev    Dec. Speed
Shallow Fusion [16]    13.2   12.6   x1.0          24.6   x1.0
Density Ratio [21]     12.9   12.7   x0.92         21.9   x0.97
ILME [22]              12.9   12.2   x0.58         23.7   x0.58
  w/ Smoothing         12.9   12.2   —             24.3   —
Residual LM            12.6   12.1   x1.08         22.7   x1.04
  w/o Smoothing        12.9   12.2   —             24.1   —
4.1.2 Experimental results
The experimental results are listed in Table 1. In both the English and Japanese scenarios, the density ratio approach and ILME achieved lower error rates than the baseline shallow fusion. The proposed residual LM performed better than ILME and achieved the best English adaptation result, with a WER of 12.1% on the test set (a 4.0% relative WER improvement over shallow fusion). Although the residual LM performed worse than the density ratio approach on the Japanese task, it still improved over both the baseline shallow fusion and ILME. We attribute this to the residual LM's ability to capture the statistical behavior of the internal LM, as discussed in Sec. 3.1.

The decoding speeds are shown relative to shallow fusion. ILME required almost twice the decoding time of shallow fusion because the decoder of the E2E ASR model had to be run twice, once for ASR and once for ILME. The density ratio approach was also slightly slower than shallow fusion because it computes two LMs. The proposed residual LM was slightly faster than the baseline shallow fusion because it omits the softmax operation, which matters particularly for the larger vocabulary of the English setup.

We performed further ablation studies on the proposed residual LM. When we replaced the smoothed target with the regular target, as defined in Eq. (8), we observed a significant performance drop in the Japanese case. We assume that the internal LM estimation is not always accurate in the target domain. On the other hand, applying smoothing did not help ILME improve performance. We assume that using smoothed soft labels in training has a positive effect similar to that of knowledge distillation [35].
4.2 Intra-domain evaluation
4.2.1 Chinese/Japanese corpora
We performed an intra-domain evaluation to confirm that the proposed residual LM has no adverse impact under matched conditions. The residual LMs were evaluated on AISHELL-1 [36] and on the CSJ evaluation sets. The experiments followed the configuration in Sec. 4.1.1, except that the LMs for AISHELL-1 consisted of two LSTM layers with 650 units and an output of 4233 character classes. The LM weights, the ILME weights, and the training parameters were tuned for the respective evaluation sets.
The character error rate (CER) results are presented in Table 2. In our reproduction of ILME, even with our best effort to search for the parameters, we observed degradation on both the AISHELL-1 and CSJ evaluation sets. The literature reported that ILME had a positive effect even in intra-domain scenarios [22], but it was evaluated only on English data and not tested in other languages. In contrast, our proposed residual LM performed robustly across the corpora and even achieved lower CERs, particularly in the Japanese scenario.
4.2.2 Transformer LM evaluation using LibriSpeech
Lastly, we evaluated a state-of-the-art architecture using the LibriSpeech dataset. For the E2E ASR model, we adopted 12 Conformer encoder blocks [9] with 512-unit eight-head attention and 2048-unit feedforward layers, followed by six Transformer decoder blocks with 512-unit eight-head attention and 2048-unit feedforward layers. The LMs were 16-layer Transformers. The fusion and training parameters were tuned accordingly, and the beam size was 30. Inference speed was also evaluated using the test-clean set.
Table 3 shows the results. No significant difference was observed between shallow fusion and ILME in this setup. Although we observed a slight degradation on the test-other set, both ILME and the proposed residual LM performed robustly with the large model architecture. We observed a similar inference-speed tendency as in Table 1.
5 Conclusion
We proposed a simple external LM fusion method for domain adaptation that considers the internal LM estimation in its training. We directly modeled the residual factor between an external target-domain LM and the internal LM of the E2E ASR model, which we call the residual LM. The residual LM was stably trained using a combination of cross-entropy and MSE losses. The experimental results indicated that the proposed residual LM performed better than internal LM estimation in most of the cross-domain and intra-domain scenarios.
References

[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. of 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[2] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Proc. of ASRU Workshop, 2015, pp. 167–174.
[3] D. Amodei et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. of 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 173–182.
[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. of NIPS, 2015, pp. 577–585.
 [5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP, 2016, pp. 4960–4964.
[6] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
 [7] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., “A comparative study on transformer vs RNN in speech applications,” in Proc. of ASRU Workshop, 2019, pp. 449–456.
 [8] A. Graves, A.R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of ICASSP, 2013, pp. 6645–6649.
[9] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. of Interspeech, 2020, pp. 5036–5040.
[10] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. of ICASSP, 2020, pp. 7829–7833.
 [11] M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Żelasko, and M. Jetté, “Earnings21: A practical benchmark for ASR in the wild,” in Proc. of Interspeech, 2021, pp. 3465–3469.
 [12] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, “Adaptation of contextdependent deep neural networks for automatic speech recognition,” in Spoken Language Technology Workshop (SLT), 2012 IEEE, 2012, pp. 366–369.
 [13] M. Delcroix, K. Kinoshita, C. Yu, A. Ogawa, T. Yoshioka, and T. Nakatani, “Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions,” in Proc. of ICASSP, 2016, pp. 5270–5274.
 [14] O. Klejch, J. Fainberg, and P. Bell, “Learning to adapt: A metalearning approach for speaker adaptation,” in Proc. of Interspeech, 2018, pp. 867–871.

[15] E. Tsunoo, Y. Kashiwagi, S. Asakawa, and T. Kumakura, “End-to-end adaptation with backpropagation through WFST for on-device speech recognition system,” in Proc. of Interspeech, 2019, pp. 764–768.
[16] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
 [17] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP, 2018, pp. 5824–5828.
 [18] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv preprint arXiv:1503.03535, 2015.
 [19] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” in Proc. of Interspeech, 2018, pp. 387–391.
[20] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in Proc. of ICASSP, 2019, pp. 5361–5365.
[21] E. McDermott, H. Sak, and E. Variani, “A density ratio approach to language model fusion in end-to-end automatic speech recognition,” in Proc. of ASRU Workshop, 2019, pp. 434–441.
[22] Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li, and Y. Gong, “Internal language model estimation for domain-adaptive end-to-end speech recognition,” in Proc. of SLT Workshop, 2021, pp. 243–250.
[23] Z. Meng, N. Kanda, Y. Gaur, S. Parthasarathy, E. Sun, L. Lu, X. Chen, J. Li, and Y. Gong, “Internal language model training for domain-adaptive end-to-end speech recognition,” in Proc. of ICASSP, 2021, pp. 7338–7342.
[24] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Investigating methods to improve language model integration for attention-based encoder–decoder ASR models,” in Proc. of Interspeech, 2021, pp. 2856–2860.
 [25] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech transducer model with internal language model prior correction,” in Proc. of Interspeech, 2021, pp. 2052–2056.

[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. of CVPR, 2016, pp. 2818–2826.
[27] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
[29] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in Proc. of International Conference on Speech and Computer, 2018, pp. 198–208.
[30] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, “Spontaneous speech corpus of Japanese,” in Proc. of LREC, 2000, pp. 947–952.
[31] S. Ando and H. Fujihara, “Construction of a large-scale Japanese ASR corpus on TV recordings,” in Proc. of ICASSP, 2021, pp. 6948–6952.
 [32] E. Tsunoo, C. Narisetty, M. Hentschel, Y. Kashiwagi, and S. Watanabe, “Run-and-back stitch search: Novel block synchronous decoding for streaming encoder-decoder ASR,” arXiv preprint arXiv:2201.10190, 2022.
 [33] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Transformer ASR with contextual block processing,” in Proc. of ASRU Workshop, 2019, pp. 427–433.
 [34] D. S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. of Interspeech, 2019.
 [35] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[36] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in Proc. of Oriental COCOSDA, 2017, pp. 1–5.