
Self-Normalized Importance Sampling for Neural Language Modeling

To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large-vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at the cost of slightly degraded perplexity and almost no visible degradation in word error rate. While noise contrastive estimation is one of the most popular choices, we recently showed that other sampling-based criteria can also perform well, as long as an extra correction step is applied, in which the intended class posterior probability is recovered from the raw model outputs. In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed. Compared to noise contrastive estimation, our method is directly comparable in terms of application complexity. Through self-normalized language model training as well as lattice rescoring experiments, we show that the proposed self-normalized importance sampling is competitive on both research-oriented and production-oriented automatic speech recognition tasks.





1 Introduction

Nowadays, word-based neural language models (LMs) consistently give better perplexities than count-based language models [19, 9], and are commonly used for second-pass rescoring or first-pass decoding of automatic speech recognition (ASR) outputs [13, 2, 1]. One challenge in training such LMs, especially when the vocabulary is large, is the traversal over the full vocabulary in the softmax normalization. This brings inefficiencies during both training and testing, and calls for sampling-based methods to ease the computational burden. Many sampling-based training criteria have been proposed and investigated. Some prominent examples include: hierarchical softmax [14], negative sampling [15], importance sampling (IS) [3], and noise contrastive estimation (NCE) [8].

While NCE is one of the most popular choices for language modeling [16, 7, 4], Gao et al. [6] recently showed that other sampling-based criteria can also perform well, as long as an extra correction step is applied, in which the intended class posterior probability is recovered from the raw model outputs. This observation motivates the present work. We notice that with simple modifications of the original IS training criterion, the additional correction step can be omitted and the class posteriors can be obtained directly from the model outputs. In other words, LMs trained with such criteria are also self-normalized; accordingly, we call this type of training self-normalized importance sampling.

Compared to Gao et al. [6], our method is simpler in the sense that no additional correction is required, and the LMs are directly trained to give the intended class posteriors. Compared to NCE, our method is comparable and competitive, which we further show through extensive experiments on both research-oriented and production-oriented ASR tasks.

2 Related Work

Neural network LMs are shown to bring consistent improvements over count-based LMs [18, 19, 9]. These neural LMs are then used either in second-pass lattice rescoring or first-pass decoding for ASR [13, 2, 1, 11, 12]. To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization, various sampling-based training criteria have been proposed and investigated [14, 15, 3, 8, 16, 7, 4]. These methods mainly deal with how to build up a contrastive loss such that the model is able to tell true target words from noisy ones. Related, but originating from a different motivation, namely to explicitly encourage normalization during training and hope for self-normalization during testing, variance regularization is also discussed in the literature [5]. Recently, in [6], the authors identify the relationships between the model optima under various sampling-based training criteria and the contextual posterior probabilities. They show that with one additional correction step, the intended posteriors can be recovered. This work can be thought of as a follow-up and extension of [6], in which we propose several simple modifications of the IS training criterion to directly obtain self-normalized models, without the additional correction step.

3 Methodology

In this section, we first describe several training criteria, including binary cross entropy (BCE), NCE and IS. Then, we pinpoint why IS is not self-normalized by default. Afterwards, we discuss three modes of modification that enable self-normalization of IS.

3.1 Binary Cross Entropy (BCE)

BCE is a traditional training criterion. Denoting a training pair of context $x$ and target class $c$ by $(x, c)$, and the model output for class $c$ given $x$ by $q_\theta(c|x)$, BCE only requires the model outputs to be bounded in $[0, 1]$, and not necessarily to form a normalized distribution:

$$F_{\text{BCE}} = \sum_{(x,c)} \Big[ \log q_\theta(c|x) + \sum_{\tilde{c} \neq c} \log\big(1 - q_\theta(\tilde{c}|x)\big) \Big] \quad (1)$$

It can be shown that the optimum when maximizing $F_{\text{BCE}}$ is the empirical posterior distribution $p(c|x)$, which indicates that the model outputs are self-normalized.
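The self-normalization property can be checked numerically. The following sketch (our illustration, not the authors' code) uses the standard per-class decomposition of BCE, in which each class contributes $p \log q + (1 - p)\log(1 - q)$ for a fixed context; maximizing each term lands exactly on the empirical posterior, so the optimal outputs sum to one:

```python
import math

# Empirical class posteriors p(c|x) for one fixed context x (a toy 4-word vocabulary).
posteriors = [0.5, 0.25, 0.15, 0.10]

def bce_term(p, q):
    # Per-class contribution to the BCE objective: p * log q + (1 - p) * log(1 - q).
    return p * math.log(q) + (1 - p) * math.log(1 - q)

# Grid-search the maximizer of each per-class term over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
optima = [max(grid, key=lambda q, p=p: bce_term(p, q)) for p in posteriors]

print(optima)       # each optimum sits at the empirical posterior p(c|x)
print(sum(optima))  # sums to ~1.0: the optimal outputs are self-normalized
```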

3.2 Fundamentals of Sampling-Based Training Criteria

In sampling-based training criteria, for each training pair $(x, c)$, the summation of some function $f$ over random samples $\tilde{c}_1, \dots, \tilde{c}_K$, drawn from some noise distribution $D$, approximates in expectation a summation over all classes $\tilde{c}$, weighted by the expected count $\mathbb{E}[N(\tilde{c})]$ of class $\tilde{c}$ among the $K$ samples:

$$\sum_{k=1}^{K} f(\tilde{c}_k) \approx \sum_{\tilde{c}} \mathbb{E}[N(\tilde{c})] \, f(\tilde{c}) \quad (2)$$

For instance, if we do sampling with replacement, the expected count would be $\mathbb{E}[N(\tilde{c})] = K D(\tilde{c})$.
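This approximation can be illustrated with a quick Monte-Carlo sketch (ours; the vocabulary size, the function f, and the noise distribution D are arbitrary). Since each class is drawn about K·D(c̃) times under sampling with replacement, dividing every sampled term by K·D(c̃) turns the sum over K samples into an estimator of the full-vocabulary sum:

```python
import math
import random

random.seed(0)
V, K = 50, 200_000

# An arbitrary Zipf-like noise distribution D over the vocabulary,
# and an arbitrary bounded function f of the class index.
weights = [1.0 / (c + 1) for c in range(V)]
Z = sum(weights)
D = [w / Z for w in weights]
f = [math.sin(c) + 2.0 for c in range(V)]

exact = sum(f)  # the full-vocabulary summation we want to approximate

# Draw K classes with replacement; class c appears about K * D(c) times,
# so dividing each sampled term by K * D(c) estimates the full sum.
samples = random.choices(range(V), weights=D, k=K)
estimate = sum(f[c] / (K * D[c]) for c in samples)

print(exact, estimate)  # the two are close for large K
```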


3.3 Noise Contrastive Estimation (NCE)

NCE is a popular BCE-style criterion [7, 17], due to its self-normalization property:

$$F_{\text{NCE}} = \sum_{(x,c)} \Big[ \log \frac{q_\theta(c|x)}{q_\theta(c|x) + K D(c)} + \sum_{k=1}^{K} \log \frac{K D(\tilde{c}_k)}{q_\theta(\tilde{c}_k|x) + K D(\tilde{c}_k)} \Big] \quad (3)$$

With a sufficient amount of training data and enough samples $K$, the optimum of the model output is $q_\theta(c|x) = p(c|x)$.
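For concreteness, here is a small sketch of the standard NCE objective of Gutmann and Hyvärinen [8] in the posterior form commonly used for LMs [16]; the toy scores and the noise distribution are made up for illustration and are not from the paper:

```python
import math
import random

random.seed(1)

def nce_objective(q, target, D, k):
    # NCE objective for one training pair (x, c), to be maximized:
    #   log( q(c|x) / (q(c|x) + K*D(c)) )
    # + sum over K noise samples of log( K*D(c~) / (q(c~|x) + K*D(c~)) ).
    samples = random.choices(range(len(q)), weights=D, k=k)
    obj = math.log(q[target] / (q[target] + k * D[target]))
    for c in samples:
        obj += math.log(k * D[c] / (q[c] + k * D[c]))
    return obj

# Toy numbers: 5-word vocabulary, uniform noise distribution, K = 3 samples.
q = [0.4, 0.2, 0.2, 0.1, 0.1]  # hypothetical (self-normalized) model outputs
D = [0.2] * 5
value = nce_objective(q, target=0, D=D, k=3)
print(value)  # a finite negative log-value
```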

3.4 Importance Sampling (IS)

IS is another sampling-based BCE-style criterion:

$$F_{\text{IS}} = \sum_{(x,c)} \Big[ \log q_\theta(c|x) + \sum_{k=1}^{K} \frac{1}{K D(\tilde{c}_k)} \log\big(1 - q_\theta(\tilde{c}_k|x)\big) \Big] \quad (4)$$

Here, the model optimum is not normalized:

$$q_\theta(c|x) = \frac{p(c|x)}{1 + p(c|x)}$$

Nonetheless, as described in [6], $q_\theta(c|x)$ can be transformed to $p(c|x)$ with an additional correction step $\frac{q_\theta(c|x)}{1 - q_\theta(c|x)}$. In IS, the summation over samples is actually an approximation of the summation over all classes, which makes IS very similar to the original BCE criterion, except that the sum is over all classes rather than excluding the target class. By modifying criterion (4) to approximate the exact BCE criterion, we show that it is possible to obtain $p(c|x)$ as the optimum output directly, hence achieving self-normalized models.
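The correction described in [6] can be sanity-checked numerically. Assuming, as stated above, that the negative term of IS effectively sums over all classes including the target, the per-class objective for a fixed context is $p \log q + \log(1-q)$; the sketch below (ours, with a hypothetical posterior value) locates its maximizer and applies the correction $q/(1-q)$:

```python
import math

p = 0.3  # hypothetical empirical posterior p(c|x) of some class c

def is_term(q):
    # Per-class IS objective with the target class included among the
    # "negative" terms: p * log q + log(1 - q).
    return p * math.log(q) + math.log(1 - q)

grid = [i / 100000 for i in range(1, 100000)]
q_opt = max(grid, key=is_term)

print(q_opt)                # ~ p / (1 + p) = 0.2307..., i.e. not normalized
print(q_opt / (1 - q_opt))  # the correction q / (1 - q) recovers ~ p = 0.3
```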

3.5 Self-Normalized Importance Sampling

3.5.1 Sampling including the Target Class

An intuitive way to exclude the target class from the summation in criterion (4) is to simply subtract the target term $\log\big(1 - q_\theta(c|x)\big)$, which the sampled sum contributes in expectation:

$$F_{\text{Mode1}} = \sum_{(x,c)} \Big[ \log q_\theta(c|x) + \sum_{k=1}^{K} \frac{1}{K D(\tilde{c}_k)} \log\big(1 - q_\theta(\tilde{c}_k|x)\big) - \log\big(1 - q_\theta(c|x)\big) \Big]$$

It can be shown that with this simple modification, the model optimum is now exactly at $p(c|x)$. However, the additional term can result in a large gradient. When the target class $c$ does not appear in the samples $\tilde{c}_1, \dots, \tilde{c}_K$, the gradient of a single training pair with respect to $q_\theta(c|x)$ is

$$\frac{1}{q_\theta(c|x)} + \frac{1}{1 - q_\theta(c|x)}$$

Suppose that $p(c|x)$ is large (close to one) and $q_\theta(c|x)$ is close to its optimum $p(c|x)$: then the gradient is large and makes it hard to converge. In the following experimental results, we show that when $K$ is small, which means there is little chance to draw the target class among the samples, this can lead to bad performance.

3.5.2 Sampling excluding the Target Class

Another way to obtain self-normalized outputs is to find some proper distribution or function that directly approximates the summation without the target class $c$:

$$\sum_{k=1}^{K} f(\tilde{c}_k) \approx \sum_{\tilde{c} \neq c} \log\big(1 - q_\theta(\tilde{c}|x)\big)$$

Considering equation (2), we propose two approaches: letting the expected count of $c$ be zero, i.e. $\mathbb{E}[N(c)] = 0$, or letting the value of the function be zero when the summation index is $c$, i.e. $f(c) = 0$.

For the first approach (Mode2), $\mathbb{E}[N(c)] = 0$ is equivalent to $D(c) = 0$, i.e. the probability of sampling the target class is zero. To this end, we utilize a different distribution $D_c$ with $D_c(c) = 0$ for each target class during training. Compared to IS, Mode2 has the same formula, but uses different noise distributions for sampling:

$$F_{\text{Mode2}} = \sum_{(x,c)} \Big[ \log q_\theta(c|x) + \sum_{k=1}^{K} \frac{1}{K D_c(\tilde{c}_k)} \log\big(1 - q_\theta(\tilde{c}_k|x)\big) \Big]$$

Because in this way it is not possible to sample the target class $c$, there is no divide-by-zero problem in the criterion.
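A minimal sketch of how such a target-dependent noise distribution $D_c$ might be constructed (our assumption for illustration, not necessarily the paper's implementation): zero out the target's probability mass and renormalize the remaining classes.

```python
import random

random.seed(2)

def target_excluding_dist(D, target):
    # Build D_c from D: set the target's mass to zero and renormalize the rest.
    rest = 1.0 - D[target]
    return [0.0 if c == target else D[c] / rest for c in range(len(D))]

D = [0.4, 0.3, 0.2, 0.1]  # hypothetical base noise distribution
D_c = target_excluding_dist(D, target=1)

samples = random.choices(range(4), weights=D_c, k=1000)
print(D_c)           # [0.571..., 0.0, 0.285..., 0.142...]
print(1 in samples)  # False: the target class is never drawn
```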

For the second approach (Mode3), we simply set $f$ to zero when the target class gets sampled:

$$F_{\text{Mode3}} = \sum_{(x,c)} \Big[ \log q_\theta(c|x) + \sum_{k: \, \tilde{c}_k \neq c} \frac{1}{K D(\tilde{c}_k)} \log\big(1 - q_\theta(\tilde{c}_k|x)\big) \Big]$$

Compared to Mode2, Mode3 is more efficient, since it can use one distribution $D$ for all training pairs $(x, c)$; for Mode2, on the other hand, it is necessary to use a different distribution for each target class. For Mode3, it is safer to do sampling without replacement: if replacement is allowed, in extreme cases many samples may hit the target class and evaluate to zero, contributing nothing to the summation and thus degrading the precision of the approximation.
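Mode3 can be sketched as follows (our illustration with made-up scores; the Efraimidis-Spirakis key trick stands in for whatever weighted without-replacement sampler is actually used, and the 1/(K·D) reweighting is kept from the with-replacement case as an approximation): draw distinct classes from a single shared noise distribution and zero out any draw that hits the target.

```python
import math
import random

random.seed(3)

def sample_without_replacement(D, k):
    # Efraimidis-Spirakis keys: draw k distinct classes, with higher-weight
    # classes more likely to be included.
    keys = sorted(range(len(D)),
                  key=lambda c: random.random() ** (1.0 / D[c]),
                  reverse=True)
    return keys[:k]

def mode3_negative_sum(q, target, D, k):
    # Approximates the sum over classes != target of log(1 - q(c|x)):
    # a sampled target class contributes zero, i.e. f(c) = 0.
    samples = sample_without_replacement(D, k)
    return sum(0.0 if c == target else math.log(1.0 - q[c]) / (k * D[c])
               for c in samples)

V = 6
D = [1.0 / V] * V                     # one shared noise distribution for all pairs
q = [0.3, 0.2, 0.1, 0.15, 0.15, 0.1]  # hypothetical model outputs
val = mode3_negative_sum(q, target=0, D=D, k=4)
print(val)  # a finite negative value; the target class never contributes
```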

4 Experimental Results

4.1 Experimental Setup

We conduct experiments on the public Switchboard and Librispeech datasets, as well as on one in-house dataset: 8kHz US English (EN). The detailed statistics of these three datasets can be found in Table 1. For EN, the 7473M running words in Table 1 are the total amount used for baseline count-based LM training, while roughly 694M running words are used for neural network training.

We use an LSTM LM for Switchboard and a Transformer [20] LM for Librispeech, following the setup in [6] in terms of model architectures. For EN, we use LSTM LMs. We use hybrid HMM acoustic models. For Switchboard and Librispeech, the lattice generation follows [6]; for EN, lattices are generated with 4-gram Kneser-Ney LMs.

We evaluate our models with properly normalized perplexities (PPLs) and second-pass rescoring word error rates (WERs). In rescoring, the model outputs are used as-is and no additional renormalization is done. We report rescoring results on the clean and other test sets for Librispeech, and on the Switchboard (SWB) and CallHome (CH) Hub5'00 test sets for Switchboard. For EN, we use an in-house test set.

For sampling-based criteria, we use log-uniform distributions as noise distributions, following [6]. The number of samples $K$ is kept fixed unless mentioned otherwise. We do sampling with replacement, except for Mode3. During training, each batch shares the same sampled classes for efficiency [21, 10], except for Mode2. For Mode2, to build up the desired distribution $D_c$ for each training pair $(x, c)$, we use the vocabulary size minus one as the number of labels in the log-uniform distribution, and map the sampled indices to the classes other than $c$.
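The log-uniform (Zipfian) noise distribution over frequency ranks, in the standard form popularized by candidate-sampling implementations [21, 10], assigns P(r) = log((r+2)/(r+1)) / log(V+1) to rank r. We assume this standard form matches the paper's; a minimal sketch:

```python
import math

def log_uniform_prob(rank, vocab_size):
    # P(rank) = log((rank + 2) / (rank + 1)) / log(vocab_size + 1),
    # for ranks 0 .. vocab_size - 1, words sorted by decreasing frequency.
    return math.log((rank + 2) / (rank + 1)) / math.log(vocab_size + 1)

V = 30_000  # e.g. roughly the Switchboard vocabulary size
head = [log_uniform_prob(r, V) for r in range(5)]
total = sum(log_uniform_prob(r, V) for r in range(V))

print(head)   # frequent (low-rank) words receive most of the probability mass
print(total)  # the series telescopes to log(V + 1) / log(V + 1) = 1
```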


The average training time per batch is evaluated on NVIDIA GeForce GTX 1080 Ti GPUs, with a batch size of 32 for Switchboard, and 64 for Librispeech and EN.

corpus vocabulary size running words
Switchboard 30k 24M
Librispeech 200k 812M
EN 250k 7473M
Table 1: Data statistics of the three datasets.

4.2 Main Results

We first compare the three modes of self-normalized IS on Switchboard, employing NCE as the baseline. As shown in Table 2, for Mode1 with $K = 100$, the performance is much worse than that of the other methods. With $K = 8000$, on the other hand, the performance is on par with the others, which verifies our statement about the large-gradient problem in Section 3.5.1. For both $K = 100$ and $K = 8000$, Mode2 and Mode3 are comparable with NCE. Considering that Mode3 is more efficient than Mode2 in terms of the noise distribution, we choose Mode3 as our main method. For both Mode2 and Mode3, we attribute the relatively small differences in PPL to the small vocabulary size.

criterion K PPL
NCE 100 54.4
8000 52.7
Mode1 100 87.2
8000 53.0
Mode2 100 53.9
8000 53.5
Mode3 100 54.0
8000 53.0
Table 2: PPLs of NCE and Mode{1,2,3} on Switchboard.

We further investigate the influence of the number of samples $K$ on Switchboard in Mode3. Figure 1 shows that with increasing $K$, the PPL goes down and gradually converges, while the training time goes up. One can find the best trade-off between training speed and performance according to the specific task. Naturally, this trade-off depends on the model architecture and the vocabulary size, so we do not run further experiments to investigate this relationship on other datasets.

Figure 1: The influence of the number of samples $K$.

Table 3 shows the comparison of different criteria on Switchboard. In this task, the LSTM LMs are interpolated with a count-based LM for evaluation. As can be seen, the performance of the sampling-based methods is close to that of the traditional methods, while obtaining the expected training speedup. Compared to the original IS, the outputs of Mode3 can be used as scores directly in rescoring, without correction.

criterion train time [s/batch] PPL WER [%]: Hub5'00 SWB CH
CE 0.100 49.9 10.1 6.8 13.4
BCE 0.107 52.3 10.3 6.9 13.7
IS 0.079 51.5 10.3 7.0 13.7
NCE 0.079 51.4 10.2 6.9 13.6
Mode3 0.090 51.7 10.2 6.9 13.6
Table 3: PPLs and WERs of different criteria on Switchboard.

Table 4 exhibits similar results on Librispeech. Since the vocabulary size is much larger than for Switchboard, the training speedup gained from sampling is more significant, while the performance remains similar to that of the traditional methods. With a different model architecture and a larger vocabulary, Mode3 is thus still competitive with NCE.

criterion train time [s/batch] PPL WER [%]: clean other
CE 0.302 57.7 2.5 5.4
BCE 0.358 58.5 2.5 5.4
IS 0.206 58.4 2.6 5.5
NCE 0.206 57.9 2.5 5.4
Mode3 0.216 58.3 2.5 5.4
Table 4: PPLs and WERs of different criteria on Librispeech.

Experimental results on EN are presented in Table 5. Here, the baseline LM used for first-pass decoding is count-based. With LSTM LM rescoring, we observe improvements in WER. Moreover, Mode3 shows performance similar to NCE for both $K = 100$ and $K = 8000$ on this production-oriented ASR dataset, validating our claim that the proposed self-normalized IS training criterion is a competitive sampling-based training criterion for neural LMs.

corpus criterion K train time [s/batch] PPL WER [%]
EN baseline - - 82.3 13.7
 NCE 100 0.092 65.9 13.3
  8000 0.098 55.9 13.1
 Mode3 100 0.089 68.0 13.3
  8000 0.114 55.8 13.1
Table 5: PPLs and WERs of different criteria on EN.

5 Conclusion

We propose self-normalized importance sampling for training neural language models. Previously, noise contrastive estimation was a popular choice of sampling-based training criterion. In our recent work, we showed that other sampling-based training criteria, including importance sampling, can perform on par with noise contrastive estimation; however, one caveat of that work is an additional correction step to recover the posterior distribution. In this work, we eliminate the need for this step by modifying the importance sampling criterion to be self-normalized. Through extensive experiments on both research-oriented and production-oriented datasets, we obtain competitive perplexities as well as word error rates with our improved method.

6 Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 694537, project “SEQCLAS”). The work reflects only the authors’ views and the European Research Council Executive Agency (ERCEA) is not responsible for any use that may be made of the information it contains. This work was partially supported by the project HYKIST funded by the German Federal Ministry of Health on the basis of a decision of the German Federal Parliament (Bundestag) under funding ID ZMVI1-2520DAT04A.


  • [1] E. Beck, R. Schlüter, and H. Ney (2020) LVCSR with transformer language models. In INTERSPEECH, pp. 1798–1802.
  • [2] E. Beck, W. Zhou, R. Schlüter, and H. Ney (2019) LSTM language models for LVCSR in first-pass decoding and lattice-rescoring. arXiv preprint arXiv:1907.01030.
  • [3] Y. Bengio and J. Senecal (2003) Quick training of probabilistic neural nets by importance sampling. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. R4, pp. 17–24.
  • [4] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland (2015) Recurrent neural network language model training with noise contrastive estimation for speech recognition. In ICASSP, pp. 5411–5415.
  • [5] X. Chen, X. Liu, Y. Wang, M. J. F. Gales, and P. C. Woodland (2016) Efficient training and evaluation of recurrent neural network language models for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2146–2157.
  • [6] Y. Gao, D. Thulke, A. Gerstenberger, K. V. Tran, R. Schlüter, and H. Ney (2021) On sampling-based training criteria for neural language modeling. In Proc. Interspeech 2021, pp. 1877–1881.
  • [7] J. Goldberger and O. Melamud (2018) Self-normalization properties of language modeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 764–773.
  • [8] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, pp. 297–304.
  • [9] K. Irie, A. Zeyer, R. Schlüter, and H. Ney (2019) Language modeling with deep transformers. In Interspeech, pp. 3905–3909.
  • [10] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
  • [11] S. Kumar, M. Nirschl, D. Holtmann-Rice, H. Liao, A. T. Suresh, and F. Yu (2017) Lattice rescoring strategies for long short term memory language models in speech recognition. In ASRU, pp. 165–172.
  • [12] K. Li, D. Povey, and S. Khudanpur (2021) A parallelizable lattice rescoring strategy with neural language models. In ICASSP, pp. 6518–6522.
  • [13] X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Woodland (2014) Efficient lattice rescoring using recurrent neural network language models. In ICASSP, pp. 4908–4912.
  • [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR 2013, Workshop Track Proceedings.
  • [15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119.
  • [16] A. Mnih and Y. W. Teh (2012) A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012.
  • [17] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • [18] M. Sundermeyer, H. Ney, and R. Schlüter (2015) From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 517–529.
  • [19] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [21] B. Zoph, A. Vaswani, J. May, and K. Knight (2016) Simple, fast noise-contrastive estimation for large RNN vocabularies. In Proceedings of NAACL-HLT 2016, pp. 1217–1222.