1 Introduction
The goal of statistical language modeling is to estimate the probability of sentences or sequences of words
(Bahl et al., 1990). By the chain rule of probability theory, this is equivalent to estimation of the conditional probability of a word given all preceding words. This problem is key to natural language processing, with applications not only in typeahead systems, but also machine translation
(Brown et al., 1993) and automatic speech recognition (Bahl et al., 1990). While earlier work on statistical language modeling focused on ngram language models (Kneser and Ney, 1995; Chen and Goodman, 1999), recent advances are based on variants of neural language models (Bengio et al., 2003; Mikolov et al., 2010; Dauphin et al., 2016), which have yielded state of the art performance on several large scale benchmarks (Jozefowicz et al., 2016). Neural approaches require less memory than ngrams and generalize better, but at a substantial increase in computational complexity at both training and test time. Despite the superior performance of neural models, even better results in terms of perplexity can be achieved by ensembling neural models with ngrams (Mikolov et al., 2011; Chelba et al., 2013; Jozefowicz et al., 2016). However, Fig. 1, which shows results using a single constant scalar to weight the output distributions of a neural model and an ngram model, suggests that the relative contributions of the two models are not simple. For example, the neural model generalizes better than the ngram on rarer words, yet it is on rarer words that the ensemble yields the largest gains.

In this work we study more sophisticated methods for combining the predictions of ngram and neural language models than a fixed scalar weight. We propose a simple gating network which takes as input a handful of features based on frequency statistics and produces as output an input-dependent weight to be used in the ensemble, effectively turning the ensemble into an adaptive mixture of experts model. We show that, given already trained neural and ngram language models, the gating network can be trained quickly on a handful of examples. The gating network consistently yields better results than the ensemble with a fixed weight, while adding a negligible computational cost.
We evaluated our proposed approach on the One Billion Word benchmark (Chelba et al., 2013), the largest publicly available benchmark for language modeling, and on the Wall Street Journal corpus, demonstrating sizable gains on both datasets.
2 Related work
The method we propose is a particular instance of a mixture of experts (MoE) (Jacobs et al., 1991), where the experts are pretrained and the gating is a possibly recurrent function of some handcrafted features. The advantages are twofold. First, we do not need to update the experts, which are very large systems; instead, we can quickly learn to modulate between them. Second, the gating network is tiny, as we do not need to represent the actual input words, which further speeds up training and reduces sample complexity.
Shazeer et al. (2017) also proposed to use an MoE for language modeling. The major technical difference is that they employ MoEs in between the layers of a deep LSTM, while we do so at the output. While our focus is to design a lightweight system that optimally combines pretrained models, their focus is to train a single, much higher-capacity system, a far more engineering-intensive endeavor.
In (Kneser and Steinbiss, 1993), information about the ground-truth performance of several ngram models on the previous 400 words is used to predict the optimal interpolation weights at each position. However, this approach can only be applied if the ground truth is known in advance and enough context is available.
A few works have explored using ngram features within a neural language model. Mikolov et al. (2011) and Chelba et al. (2013) train a neural model jointly with a maximum entropy model taking ngram features as input. Neubig and Dyer (2016) propose an approach more similar to ours, except that the gating network takes as input the hidden state of the neural model. This has two drawbacks. First, the hidden state may already have lost the discriminative information necessary for selecting the expert. Second, the gating network operates on a much higher dimensional input and therefore requires more data to train. Moreover, we neither attempt to tune the experts nor care about how they were trained (Jean et al., 2014; Grave et al., 2016); we use them as black boxes and only train the gating network, a much simpler task.
3 Basic Models
The goal of language modeling is to estimate the probability of the next word w_t given its context sequence w_1, ..., w_{t-1}; the context is empty if t = 1. In this section, we introduce the models we use to instantiate our experts.
3.1 Ngram Language Models
Ngram models rely on the following Markov assumption: the next word depends only on the n-1 previous words, P(w_t | w_1^{t-1}) ≈ P(w_t | w_{t-n+1}^{t-1}); maximum likelihood estimation then yields P(w_t | w_{t-n+1}^{t-1}) = c(w_{t-n+1}^{t}) / c(w_{t-n+1}^{t-1}), where c(·) stands for the number of occurrences of the sequence in the training corpus.
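As a concrete sketch of this maximum-likelihood estimator (toy corpus and names are illustrative; a real model would add smoothing, as discussed next):

```python
from collections import Counter

def mle_ngram_prob(tokens, n=3):
    """Estimate P(w_t | w_{t-n+1} .. w_{t-1}) by maximum likelihood:
    the count of the full ngram divided by the count of its prefix."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    def prob(context, word):
        c_full = ngrams[tuple(context) + (word,)]
        c_prefix = prefixes[tuple(context)]
        return c_full / c_prefix if c_prefix else 0.0
    return prob

corpus = "the cat sat on the mat the cat ran".split()
prob = mle_ngram_prob(corpus, n=3)
assert prob(("the", "cat"), "sat") == 0.5  # c("the cat sat") / c("the cat") = 1/2
```

Note that any ngram unseen in training receives probability zero here, which is exactly the sparsity problem discussed below.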
For high-order models, only a small fraction of the possible ngrams appear in the training corpus, a problem also referred to as data sparsity, which would yield zero probability for almost all sentences. To counteract this, several backoff techniques have been suggested, the most popular being defined as:

P(w_t | w_{t-n+1}^{t-1}) = P̃(w_t | w_{t-n+1}^{t-1})   if c(w_{t-n+1}^{t}) > 0,
P(w_t | w_{t-n+1}^{t-1}) = α(w_{t-n+1}^{t-1}) P(w_t | w_{t-n+2}^{t-1})   otherwise.    (1)
where α and P̃ are called backoff coefficients and discounted probabilities, respectively. In this work, we use the Kneser-Ney formulation (Kneser and Ney, 1995), which yields state of the art results among ngram models.
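A rough sketch of the backoff recursion of Eq. (1), using plain absolute discounting in place of the full Kneser-Ney estimator (all names and the toy counts are ours):

```python
def backoff_prob(counts, context, word, discount=0.75):
    """Eq. (1) sketch: use the discounted higher-order estimate when the
    full ngram was observed, otherwise back off to a shorter context,
    scaled by a backoff coefficient alpha. `counts` holds counts for all
    ngram orders, keyed by tuples of words."""
    context = tuple(context)
    full = context + (word,)
    c_prefix = sum(c for ng, c in counts.items()
                   if len(ng) == len(full) and ng[:-1] == context)
    if counts.get(full, 0) > 0:
        return (counts[full] - discount) / c_prefix   # discounted probability
    if not context:
        return 1e-9  # word unseen even as a unigram: tiny floor
    n_types = sum(1 for ng in counts
                  if len(ng) == len(full) and ng[:-1] == context)
    alpha = (discount * n_types / c_prefix) if c_prefix else 1.0
    return alpha * backoff_prob(counts, context[1:], word, discount)

counts = {("the", "cat", "sat"): 1, ("the", "cat", "ran"): 1,
          ("cat", "sat"): 1, ("cat", "ran"): 1,
          ("the",): 2, ("cat",): 2, ("sat",): 1, ("ran",): 1}
assert backoff_prob(counts, ("the", "cat"), "sat") == 0.125  # (1 - 0.75) / 2
```

Kneser-Ney additionally replaces the lower-order counts with continuation counts; this sketch only shows the backoff structure itself.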
3.2 Neural Language Model
Another approach to reduce sparsity is to encode the context w_1, ..., w_{t-1} as a fixed-length dense vector h_{t-1}. To do so, each word w_i is mapped to an embedding vector x_i. The sequence of vectors x_1, ..., x_{t-1} is then fed to a neural network to produce h_{t-1}. A linear classifier is then applied to h_{t-1} to estimate the probability distribution over the next word:

P(w_t | w_1^{t-1}) = softmax(W h_{t-1} + b)    (2)
Different types of networks can be used as encoders, such as fully connected (Bengio et al., 2003), convolutional (Dauphin et al., 2016) or recurrent ones (Mikolov et al., 2010; Chelba et al., 2013). In this work we use LSTMs (Hochreiter and Schmidhuber, 1997), which are nowadays among the strongest-performing methods.
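The output layer of Eq. (2) reduces to a linear map of the encoder state followed by a softmax; a minimal self-contained sketch (toy dimensions and weights):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def next_word_dist(h, W, b):
    """Eq. (2): linear classifier over the encoder state h, then softmax,
    giving a distribution over the vocabulary. W has one row per word."""
    logits = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

# toy 2-dim state, 3-word vocabulary (all weights illustrative)
p = next_word_dist([0.5, -1.0],
                   [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                   [0.0, 0.0, 0.1])
assert abs(sum(p) - 1.0) < 1e-12 and all(x > 0 for x in p)
```

In practice h would come from the LSTM encoder and W would span a vocabulary of up to 793k words, which is why sampled softmax is used at training time (Section 5.1).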
4 Mixture of experts
Different experts have different strengths and weaknesses. The complementarity of neural and ngram language models explains why ensembling works so well. However, it is conceivable that different contexts may need different weightings. MoEs address exactly this issue by enhancing the model with a gating network that weighs the experts in an input-dependent manner. In the following, we first analyze where ngrams outperform neural language models, and then propose a simple gating mechanism to automatically select the most suitable expert.
4.1 Analysis
Table 1: Example contexts where the ngram model significantly outperforms the neural model.

…will be made at an event at the San Francisco Museum of Science
Robert Jew of the National Archives and Records Administration
…We need professional advice , said Senator George H. Winner
…he shows up armed to buy machine guns and siliences
Xbox 360 ( R ) video game and entertainment system from
Ngram language models require large amounts of memory, which grow with the amount of training data (Silva et al., 2016), but they are fast at test time, as they only require table lookups. They can easily memorize patterns but do not generalize well to rare events. On the other hand, neural language models are much more compact and generalize much better, but require more computation. Jozefowicz et al. (2016) found that the relative advantage of neural models increases as the frequency of the target word decreases. While we observe the same behavior, we also notice that the relative improvement of an ensemble over the neural model is bigger for rare words, as shown in Fig. 1.
In order to gain a better understanding, we selected sentences from the One Billion Word dataset where the ngram model significantly outperforms the neural model; see some examples in Table 1. In the vast majority of cases, these contexts contain long proper nouns, e.g., Senator George H. Winner. As quantitative evidence, we found that 23% of the words on which the ngram model outperforms the neural one are capitalized, versus 13% in the general distribution. In other cases, we found phrases that exactly match training examples. In both cases, ngram models are better equipped to predict the next word, since these are essentially memorization tasks. This also explains why ngrams yield the biggest gains on the rarest word bucket of Fig. 1, despite being limited on rare words (because of data sparsity). Overall, assessing whether the ngram model is better than the neural model seems fairly easy to predict. Next, we propose a simple method to do so.
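The capitalization statistic can be reproduced in spirit with a short helper, where the mask stands in for "positions on which the ngram model wins" (function name and toy inputs are ours):

```python
def capitalized_fraction(tokens, mask=None):
    """Fraction of capitalized tokens, optionally restricted to positions
    where a boolean mask holds (e.g. where the ngram expert beats the
    neural one on log-likelihood)."""
    if mask is None:
        mask = [True] * len(tokens)
    selected = [t for t, m in zip(tokens, mask) if m]
    return sum(t[:1].isupper() for t in selected) / len(selected)

assert capitalized_fraction(["Senator", "George", "H.", "Winner", "said"]) == 0.8
```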
4.2 Gating Network
Table 2: Features used as input to the gating network.

feature  #features
backoff weights (α)  5
discounted probabilities (P̃)  5
logarithm of position (log t)  1
entropy, mode of the ngram prediction  2
entropy, mode of the neural prediction  2
A mixture of experts can be written as:

P(w_t | w_1^{t-1}) = g_t P_ngram(w_t | w_1^{t-1}) + (1 - g_t) P_neural(w_t | w_1^{t-1})    (3)

where g_t is a scalar between 0 and 1 which is the output of our gating network, and P_ngram and P_neural are defined by Eq. 1 and Eq. 2, respectively.
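Concretely, Eq. (3) is a per-time-step convex combination of the two experts' output distributions; a minimal sketch (toy distributions):

```python
def mixture(g, p_ngram, p_neural):
    """Eq. (3): convex combination of the two experts' distributions over
    the vocabulary, with input-dependent weight g in [0, 1]."""
    assert 0.0 <= g <= 1.0
    return [g * a + (1.0 - g) * b for a, b in zip(p_ngram, p_neural)]

p = mixture(0.3, [0.7, 0.2, 0.1], [0.2, 0.5, 0.3])
assert abs(sum(p) - 1.0) < 1e-12  # still a valid distribution
```

A fixed ensemble is the special case where g is the same constant at every position; the gating network instead predicts g from the features below.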
In this work, we propose to use as gating network a small model that takes as input a handful of handcrafted features. We choose our features to convey the intuition that switching to an ngram model should depend both on the frequency of the word and on the entropy of the prediction. We therefore use the backoff Kneser-Ney coefficients and discounted probabilities, as well as the entropy of the distribution over the next word and its mode. In order to account for positional information in the sentence (as we also found that ngrams are worse at later positions in long sentences), we add the log of the word position. The full list of features is given in Table 2. We have a total of 15 features, denoted as the FULL set, which we use as input to the gating network. We denote as SIMPLE the subset of features that require nothing more than the coefficients of the existing ngram model. This allows us to investigate the relative importance of the signal from the neural language model and to further reduce the computational burden.
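A sketch of how the FULL feature vector might be assembled; the grouping of the last four features is our reading of Table 2, and all names are illustrative:

```python
import math

def entropy(p):
    """Shannon entropy of a distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gating_features(backoff_coeffs, discounted_probs, position, p_ngram, p_neural):
    """Assemble the 15-dimensional FULL feature vector: 5 backoff
    coefficients, 5 discounted probabilities, log word position, and the
    entropy and mode (max probability) of each expert's prediction."""
    feats = list(backoff_coeffs) + list(discounted_probs) + [math.log(position)]
    feats += [entropy(p_ngram), max(p_ngram)]    # ngram-side signals
    feats += [entropy(p_neural), max(p_neural)]  # neural-side signals (FULL only)
    return feats

f = gating_features([0.1] * 5, [0.2] * 5, 3, [0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
assert len(f) == 15
```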
We train the gating network with a cross-entropy loss using Eq. 3 as the predictive distribution, without updating the expert models.
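Since the experts are frozen, this reduces to ordinary gradient descent through the gate alone; a minimal sketch with a linear gate and plain SGD (both simpler than the actual gating models and optimizer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gating_step(theta, feats, p_ngram, p_neural, lr=0.5):
    """One SGD step on -log p, where p = g * p_ngram + (1 - g) * p_neural
    (Eq. 3) and g = sigmoid(theta . feats). p_ngram and p_neural are the
    frozen experts' probabilities of the true next word."""
    g = sigmoid(sum(t * f for t, f in zip(theta, feats)))
    p = g * p_ngram + (1.0 - g) * p_neural
    grad_z = -(p_ngram - p_neural) * g * (1.0 - g) / p  # d(-log p)/dz
    theta = [t - lr * grad_z * f for t, f in zip(theta, feats)]
    return theta, -math.log(p)

# toy example with one feature: the ngram expert is better here,
# so the gate weight on it should grow and the loss should drop
theta, losses = [0.0], []
for _ in range(20):
    theta, loss = gating_step(theta, [1.0], p_ngram=0.8, p_neural=0.2)
    losses.append(loss)
assert losses[-1] < losses[0] and theta[0] > 0.0
```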
5 Experiments
We performed language modeling at the word level using the One Billion Word benchmark (Chelba et al., 2013) and the Wall Street Journal (WSJ) corpus¹; see details in Table 3.

¹Obtained from http://www.fit.vutbr.cz/~imikolov/rnnlm/kaldiwsj.tgz
Table 3: Dataset statistics.

  WSJ  1B
training tokens  36M  768M
unique words  20k  793k
5.1 Expert models
We used the KenLM toolkit (Heafield, 2011) to train 5-gram models with modified Kneser-Ney smoothing for both datasets.
For the One Billion Word dataset, we used the best neural model reported by Jozefowicz et al. (2016), composed of two LSTM layers with 8192 hidden units each and projection layers with 1024 units. Due to the large vocabulary size, we trained using a sampled softmax (Jean et al., 2014). We trained the model for four days on 8 GPUs. For the smaller WSJ we used a much simpler model with a single LSTM layer of 500 units and trained it until convergence. We used dropout as a regularization technique for both models (Srivastava et al., 2014).
5.2 Gating model
The gating model was trained using only validation data. For WSJ we used 14k sentences for training and the remaining 2k for early stopping. For 1B Word, we used two separate held-out partitions to train and validate the gating models, each with 6k sentences. In both cases, we did not use any test data to train or tune the gating model.
We experimented with three architectures: a linear model (LIN), a fully connected neural network with two hidden layers with 32 units each (MLP), and an LSTM with 8 units on top of the previous MLP to test if temporal dependencies matter.
The largest gating network has only 2572 trainable parameters. All hyperparameters were tuned on the 1B Word dataset and applied to WSJ as is. We normalize features on the training set to have zero mean and unit variance. All networks were trained until convergence with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate that was halved every 5k steps.

5.3 Results
First, we evaluated the three gating architectures on the validation set of the One Billion Word dataset, as shown in Table 4. While the extra features decrease the perplexity, the choice of model architecture has a much larger impact. In the following, we only consider the LSTM architecture for the gating model.
Table 4: Validation perplexity on the One Billion Word dataset.

model  SIMPLE  FULL
LIN  30.82  30.75
MLP  30.35  30.30
LSTM  30.05  29.85
Next, we evaluated the model on the test sets, see Table 5. We compare the mixture of experts with static ensembling, which is the current state of the art.
Table 5: Test perplexity.

model  WSJ  1B
ngram LM  113.23  66.96
neural LM  71.39  33.01
ensemble  67.44  29.80
HIDDEN
+end-to-end
SIMPLE
FULL

For mixture models we report mean and standard error over 10 runs. Results in bold show a significant improvement over the best baseline (two-tailed t-test).

To compare our approach with the method proposed in Neubig and Dyer (2016), we trained a gating model using the last hidden layer of the neural language model as features (HIDDEN; 500 features for WSJ and 1024 for 1B Word). We tried two variants of the model: with and without gradient flow through the neural LM. Fine-tuning requires computing the full softmax as well as its gradient; attempting to do this with the 1B Word neural model resulted in running out of GPU memory. As described in Section 5.1, this model is trained with a sampled softmax, which requires much less memory than the full softmax. For this reason, we report results with HIDDEN features and end-to-end training only for WSJ.
On both datasets, the mixture model with FULL features shows a significant improvement over these baselines. Note that on the 1B Word dataset, SIMPLE features alone outperform HIDDEN features by a margin. We speculate that this is caused by a better ratio of model size to dataset size: SIMPLE features are derived from an ngram model, which scales better than a neural model.
5.4 Conclusion
We proposed a very simple yet effective method to combine pretrained neural and ngram language models. Instead of using a fixed ensembling weight, we learn a per-time-step predictor of the optimal weight between the two models. The gating network is small and fast to train and run, because it takes as input handcrafted features found by analyzing where ngrams outperform neural models.
References
 Bahl et al. (1990) Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. 1990. A maximum likelihood approach to continuous speech recognition. In Readings in speech recognition, pages 308–319. Elsevier.

 Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
 Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
 Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
 Chen and Goodman (1999) Stanley F Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.
 Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.
 Grave et al. (2016) Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309.
 Heafield (2011) Kenneth Heafield. 2011. Kenlm: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
 Jacobs et al. (1991) Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79–87.
 Jean et al. (2014) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kneser and Ney (1995) Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.
 Kneser and Steinbiss (1993) Reinhard Kneser and Volker Steinbiss. 1993. On the dynamic adaptation of stochastic language models. In Acoustics, Speech, and Signal Processing, 1993. ICASSP93., 1993 IEEE International Conference on, volume 2, pages 586–589. IEEE.
 Mikolov et al. (2011) Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černockỳ. 2011. Strategies for training large scale neural network language models. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages 196–201. IEEE.
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
 Neubig and Dyer (2016) Graham Neubig and Chris Dyer. 2016. Generalizing and hybridizing count-based and neural language models. In Conference on Empirical Methods in Natural Language Processing.
 Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
 Silva et al. (2016) Joaquim F Silva, Carlos Goncalves, and Jose C Cunha. 2016. A theoretical model for ngram distribution in big data corpora. In Big Data (Big Data), 2016 IEEE International Conference on, pages 134–141. IEEE.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.