Lightweight Adaptive Mixture of Neural and N-gram Language Models

04/20/2018 ∙ by Anton Bakhtin, et al. ∙ 0

It is often the case that the best performing language model is an ensemble of a neural language model with n-grams. In this work, we propose a method to improve how these two models are combined. By using a small network which predicts the mixture weight between the two models, we adapt their relative importance at each time step. Because the gating network is small, it trains quickly on small amounts of held out data, and does not add overhead at scoring time. Our experiments carried out on the One Billion Word benchmark show a significant improvement over the state of the art ensemble without retraining of the basic modules.




1 Introduction

The goal of statistical language modeling is to estimate the probability of sentences or sequences of words (Bahl et al., 1990). By the chain rule of probability theory, this is equivalent to estimating the conditional probability of a word given all preceding words. This problem is key to natural language processing, with applications not only in type-ahead systems, but also in machine translation (Brown et al., 1993) and automatic speech recognition (Bahl et al., 1990). While earlier work on statistical language modeling focused on n-gram language models (Kneser and Ney, 1995; Chen and Goodman, 1999), recent advances are based on variants of neural language models (Bengio et al., 2003; Mikolov et al., 2010; Dauphin et al., 2016), which have yielded state of the art performance on several large scale benchmarks (Jozefowicz et al., 2016). Neural approaches require less memory than n-grams and generalize better, but at the cost of a substantial increase in computational complexity at both training and test time. Despite the superior performance of neural models, even better perplexity can be achieved by ensembling neural models with n-grams (Mikolov et al., 2011; Chelba et al., 2013; Jozefowicz et al., 2016). However, Fig. 1, which shows results using a single constant scalar to weigh the output distributions of a neural model and an n-gram model, suggests that the relative contributions of the two models are not simple. For example, the neural model generalizes better than the n-gram model on rarer words, yet it is on rarer words that the ensemble yields the largest gains.

Figure 1: As the frequency of the word-to-predict decreases (from left to right), the relative performance of neural models gets better compared to n-grams (orange curve). Yet, ensembling the two models (with a fixed scalar weight) is more effective on rarer words (blue curve). Bins were built by sorting words by frequency and by dividing them into buckets with equal probability mass.
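The bucketing described in the caption of Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `equal_mass_buckets` and the toy token list are assumptions made here, and a word is assigned to a bucket according to the cumulative token mass of all more frequent words.

```python
from collections import Counter

def equal_mass_buckets(tokens, num_buckets):
    """Split the vocabulary into frequency-sorted buckets of roughly equal token mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Most frequent words first; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    buckets = [[] for _ in range(num_buckets)]
    cumulative = 0
    for word in ordered:
        # Bucket index is given by the mass accumulated before this word.
        index = min(num_buckets - 1, cumulative * num_buckets // total)
        buckets[index].append(word)
        cumulative += counts[word]
    return buckets

# Toy corpus (hypothetical): 12 tokens, 7 word types.
tokens = "the the the the cat cat sat sat on mat a b".split()
buckets = equal_mass_buckets(tokens, 3)
```

With real data the buckets are only approximately equal in mass, since a single very frequent word cannot be split across buckets.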

In this work we study methods for combining the outputs of n-gram and neural language models that are more sophisticated than a fixed scalar weight. We propose a simple gating network which takes as input a handful of features based on frequency statistics and outputs an input-dependent weight to be used in the ensemble, effectively turning the ensemble into an adaptive mixture of experts model. We show that, given already trained neural and n-gram language models, the gating network can be trained quickly on a handful of examples. The gating network consistently yields better results than the ensemble which uses a fixed weight in the mixture, while adding a negligible computational cost. We evaluated our proposed approach on the One Billion Word benchmark (Chelba et al., 2013), the biggest publicly available benchmark for language modeling, and on the Wall Street Journal corpus, demonstrating sizable gains on both datasets.

2 Related work

The method we propose is a particular instance of a mixture of experts (MoEs) (Jacobs et al., 1991), where the experts are pre-trained and the gating is a possibly recurrent function of some handcrafted features. The advantages are twofold. First, we do not need to update the experts, which are very large systems; instead, we can quickly learn to modulate between them. Second, the gating network is tiny because it does not need to represent the actual input words, which further speeds up training and reduces sample complexity.

Shazeer et al. (2017) also proposed to use MoEs for language modeling. The major technical difference is that they employ MoEs in between layers of a deep LSTM, while we apply them at the output. While our focus is to design a lightweight system that optimally combines pre-trained models, their focus is to train a single system of much higher capacity, which is a far more engineering-intensive endeavor.

Kneser and Steinbiss (1993) use information about the ground truth performance of several n-gram models on the previous 400 words to predict the optimal interpolation weights at each position. However, this approach can only be applied if the ground truth is known in advance and enough context is given.

A few works explored using n-gram features within a neural language model. Mikolov et al. (2011) and Chelba et al. (2013) train a neural model jointly with a maximum entropy model taking as input n-gram features. Neubig and Dyer (2016) propose an approach more similar to ours, except that the gating network takes as input the hidden state of the neural model. This has two drawbacks. First, the hidden state may already have lost the discriminative information necessary for the selection of the expert. Second, the gating network operates on a much higher dimensional input, and therefore requires more data to train. Moreover, we do not attempt to tune the experts, nor do we care how they were trained (Jean et al., 2014; Grave et al., 2016); we use them as black boxes and only train the gating network, which is a much simpler task.

3 Basic Models

The goal of language modeling is to estimate the probability of a next word $w_t$ given its context sequence $w_1, \dots, w_{t-1}$; the context being empty if $t = 1$. In this section, we introduce the models we use to instantiate our experts.

3.1 N-gram Language Models

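N-gram models rely on the following Markov assumption: the next word depends only on the previous $n-1$ words: $p(w_t \mid w_1^{t-1}) = p(w_t \mid w_{t-n+1}^{t-1})$, where $w_i^j$ denotes the sequence $w_i, \dots, w_j$; maximum likelihood estimation then yields: $p(w_t \mid w_{t-n+1}^{t-1}) = C(w_{t-n+1}^{t}) / C(w_{t-n+1}^{t-1})$, where $C(\cdot)$ stands for the number of occurrences of the sequence in the training corpus.

For high order models, e.g., $n = 5$, only a small fraction of the n-grams appear in the training corpus, a problem also referred to as data sparsity, which would yield probability $0$ for almost all sentences. To counteract this, several back-off techniques have been suggested, the most popular being defined as:

$$p_{ng}(w_t \mid w_{t-n+1}^{t-1}) = \begin{cases} p^*(w_t \mid w_{t-n+1}^{t-1}) & \text{if } C(w_{t-n+1}^{t}) > 0 \\ \alpha(w_{t-n+1}^{t-1}) \, p_{ng}(w_t \mid w_{t-n+2}^{t-1}) & \text{otherwise} \end{cases} \qquad (1)$$

where $\alpha$ and $p^*$ are called back-off coefficients and discounted probabilities, respectively. In this work, we use the Kneser-Ney formulation (Kneser and Ney, 1995), which yields state of the art results among n-gram models.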
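To make the back-off recursion concrete, the following sketch looks up a probability in two toy tables: one holding discounted probabilities for the n-grams seen in training, and one holding back-off coefficients for contexts. The function name `backoff_prob`, the table layout, and all numbers are assumptions chosen for illustration; they are not taken from this work or from any toolkit.

```python
def backoff_prob(word, context, discounted, backoff):
    """Return p(word | context), backing off to shorter contexts when needed.

    context    : tuple of preceding words, oldest first
    discounted : dict mapping n-gram tuples to discounted probabilities
    backoff    : dict mapping context tuples to back-off coefficients
    """
    ngram = context + (word,)
    if ngram in discounted:
        return discounted[ngram]
    if not context:
        # Unseen unigram: assumed to have zero mass in this toy example.
        return 0.0
    # Contexts without a stored coefficient are assumed to back off with weight 1.
    alpha = backoff.get(context, 1.0)
    return alpha * backoff_prob(word, context[1:], discounted, backoff)

# Toy bigram model (hypothetical numbers) over the vocabulary {the, cat, dog}.
discounted = {("the",): 0.5, ("cat",): 0.25, ("dog",): 0.25, ("the", "cat"): 0.7}
backoff = {("the",): 0.4}

p_cat = backoff_prob("cat", ("the",), discounted, backoff)  # seen bigram
p_dog = backoff_prob("dog", ("the",), discounted, backoff)  # backs off to the unigram
p_the = backoff_prob("the", ("the",), discounted, backoff)  # backs off to the unigram
```

The coefficient 0.4 was chosen so that the three conditional probabilities given the context "the" sum to one, which is the role the back-off coefficients play in a properly normalized model.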

3.2 Neural Language Model

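Another approach to reduce sparsity is to encode the context $w_1, \dots, w_{t-1}$ as a fixed length dense vector $h_t$. To do so, each word $w_i$ is mapped to an embedding vector $x_i$. The sequence of vectors $x_1, \dots, x_{t-1}$ is then fed to a neural network $f$ to produce $h_t$. A linear classifier $W$ is then applied to $h_t$ to estimate the probability distribution over the next word:

$$p_{nn}(\cdot \mid w_1, \dots, w_{t-1}) = \mathrm{softmax}(W h_t) \qquad (2)$$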

Different types of networks could be used as encoders, such as fully connected (Bengio et al., 2003), convolutional (Dauphin et al., 2016) or recurrent (Mikolov et al., 2010; Chelba et al., 2013). In this work we use LSTMs (Hochreiter and Schmidhuber, 1997), which are currently among the strongest performing models.
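The final step of a neural language model, a linear classifier applied to the encoded context followed by normalization, can be sketched in a few lines. The code below assumes a softmax normalization and uses a made-up two-dimensional hidden vector and three-word vocabulary; the encoder that would produce the hidden vector is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def next_word_distribution(hidden, weights):
    """Apply a linear classifier (one weight row per vocabulary word), then normalize."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)

# Hypothetical encoder output and classifier weights.
hidden = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = next_word_distribution(hidden, weights)
```

For a vocabulary of hundreds of thousands of words this normalization is the dominant cost, which is why large models are often trained with approximations such as sampled softmax.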

4 Mixture of experts

Different experts have different strengths and weaknesses. The complementarity of neural and n-gram language models explains why ensembling works so well. However, it is conceivable that different contexts may need different weighting. MoEs address exactly this issue, enhancing the model with a gating network that weighs experts in an input dependent manner. We first analyze where n-grams outperform neural language models, and then propose a simple gating mechanism to automatically select the most suitable expert.

4.1 Analysis

…will be made at an event at the San Francisco Museum of Science
Robert Jew of the National Archives and Records Administration
…We need professional advice , said Senator George H. Winner
…he shows up armed to buy machine guns and siliences
Xbox 360 ( R ) video game and entertainment system from
Table 1: Examples of contexts where the n-gram model significantly outperforms the neural model. The word to be predicted is marked in bold font.

N-gram language models require a large amount of memory, which grows with the amount of training data (Silva et al., 2016), but they are fast at test time as they require only table lookups. They can easily memorize patterns but do not generalize well to rare events. On the other hand, neural language models are much more compact and generalize much better, but require more computation. Jozefowicz et al. (2016) found that the relative advantage of neural models increases as the frequency of the target word decreases. While we observe the same behavior, we also notice that the relative improvement of an ensemble over the neural model is bigger for rare words, as shown in Fig. 1.

To gain a better understanding, we selected sentences from the One Billion Word dataset where the n-gram model significantly outperforms the neural model; see some examples in Table 1. In the vast majority of cases, these contexts contain long proper nouns, e.g., Senator George H. Winner. As quantitative evidence, we found that 23% of the words on which the n-gram model outperforms the neural model are capitalized, versus 13% in the general distribution. In other cases, we found phrases that exactly match training examples. In both cases, n-gram models are better equipped at predicting, since these are essentially memorization tasks. This also explains why n-grams yield the biggest gains on the rarest word bucket of Fig. 1, despite being limited on rare words (because of data sparsity). Overall, it seems that whether the n-gram model is better than the neural model can be predicted fairly easily. Next, we propose a simple method to do so.

4.2 Gating Network

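Feature                                                   Count
back-off weights ($\alpha$)                               5
discounted probabilities ($p^*$)                          5
logarithm of position ($\log t$)                          1

entropy of the next-word distributions (n-gram, neural)   2
mode of the next-word distributions (n-gram, neural)      2
Table 2: Description of features used. Top block of features are used for both SIMPLE and FULL. Lower block is used only for FULL. The notation follows that in Eq. 1 and Eq. 2.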

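A mixture of experts can be written as:

$$p(w_t \mid w_1, \dots, w_{t-1}) = g_t \, p_{ng}(w_t \mid w_{t-n+1}, \dots, w_{t-1}) + (1 - g_t) \, p_{nn}(w_t \mid w_1, \dots, w_{t-1}) \qquad (3)$$

where $g_t$ is a scalar between $0$ and $1$ which is the output of our gating network, and $p_{ng}$ and $p_{nn}$ are defined by Eq. 1 and Eq. 2 respectively.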
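The difference between a static ensemble and a gated mixture can be illustrated with a short sketch. It assumes the convention that the gate value weighs the n-gram expert and its complement weighs the neural expert; the opposite convention is equivalent up to relabeling. The function name `mixture_prob` and all probabilities and gate values are invented for the example.

```python
def mixture_prob(gate, p_ngram, p_neural):
    """Convex combination of the two experts' probabilities for one target word."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return gate * p_ngram + (1.0 - gate) * p_neural

# (p_ngram, p_neural) for three hypothetical positions in a sentence.
expert_probs = [(0.8, 0.4), (0.05, 0.6), (0.3, 0.3)]

# Static ensemble: one constant weight for every position.
static = [mixture_prob(0.25, ng, nn) for ng, nn in expert_probs]

# Adaptive mixture: a different gate value per position (hypothetical values,
# standing in for the output of a gating network).
gates = [0.9, 0.1, 0.5]
adaptive = [mixture_prob(g, ng, nn) for g, (ng, nn) in zip(gates, expert_probs)]
```

Because the combination is convex, the mixture is a valid probability distribution whenever both experts are, whatever the gate outputs.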

In this work, we propose to use as gating network a small model that takes as input a handful of handcrafted features. We choose our features to convey our intuition that switching to an n-gram model should depend on both the frequency of the word and the entropy of the prediction. We therefore use both the back-off Kneser-Ney coefficients and discounted probabilities, as well as the entropy of the distribution over the next word and its mode. To account for positional information in the sentence (as we also found that n-grams are worse at later positions in long sentences), we also add the log of the word position. The full list of features is given in Table 2. We have a total of 15 features, denoted as the FULL set, which we use as input to the gating network. We denote as SIMPLE the subset of features that require nothing more than the coefficients from the existing n-gram model. This allows us to investigate the relative importance of the signal from the neural language model and to further reduce the computational burden.
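The distribution-based and positional features can be computed as in the sketch below. Two assumptions are made here that the text does not spell out: the "mode" feature is taken to be the probability of the most likely word, and positions are counted from 1. The function names are illustrative only.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def distribution_features(dist):
    """Entropy of the next-word distribution and the probability of its mode."""
    return [entropy(dist), max(dist)]

def position_feature(position):
    """Logarithm of the 1-based position of the word in the sentence."""
    return math.log(position)

# A confident and an uncertain toy distribution over a four-word vocabulary.
confident = distribution_features([0.97, 0.01, 0.01, 0.01])
uncertain = distribution_features([0.25, 0.25, 0.25, 0.25])
```

A peaked distribution gives low entropy and a high mode probability, while a flat one gives the opposite, so the pair summarizes how sure an expert is about its prediction.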

We train the gating network with cross-entropy loss using Eq. 3 as predictive distribution, and without updating the expert models.
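As an illustration of this training setup, the sketch below fits a linear gate with plain stochastic gradient descent on the negative log-likelihood of the mixture, keeping the experts' probabilities fixed. It is a simplified stand-in: it assumes a sigmoid output, a gate that weighs the n-gram expert, a single toy feature, and plain SGD instead of the optimizer used in the experiments. All names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_value(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def mean_nll(examples, w, b):
    """Average negative log-likelihood of the target words under the mixture."""
    total = 0.0
    for x, p_ng, p_nn in examples:
        g = gate_value(x, w, b)
        total -= math.log(g * p_ng + (1.0 - g) * p_nn)
    return total / len(examples)

def train_linear_gate(examples, num_features, lr=0.5, epochs=200):
    """examples: list of (features, p_ngram_of_target, p_neural_of_target)."""
    w, b = [0.0] * num_features, 0.0
    for _ in range(epochs):
        for x, p_ng, p_nn in examples:
            g = gate_value(x, w, b)
            mix = g * p_ng + (1.0 - g) * p_nn
            # Gradient of -log(mix) with respect to the pre-sigmoid activation.
            grad_z = -(p_ng - p_nn) * g * (1.0 - g) / mix
            w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]
            b -= lr * grad_z
    return w, b

# Toy data: the feature is +1 where the n-gram expert is better, -1 otherwise.
examples = [([1.0], 0.9, 0.1), ([-1.0], 0.05, 0.6)]
before = mean_nll(examples, [0.0], 0.0)
w, b = train_linear_gate(examples, num_features=1)
after = mean_nll(examples, w, b)
```

Only the gate's parameters receive gradients; the expert probabilities enter the loss as constants, which is what keeps training cheap.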

5 Experiments

We performed language modeling at the word level using the One Billion Word benchmark (Chelba et al., 2013) and the Wall Street Journal (WSJ) corpus; see details in Table 3.

                  WSJ    1B Word
training tokens   36M    768M
unique words      20k    793k
Table 3: Data sizes for the WSJ and 1B Word corpora.

5.1 Expert models

We used the KenLM toolkit (Heafield, 2011) to train a 5-gram model with modified Kneser-Ney smoothing for both datasets.

For the One Billion Word dataset, we used the best neural model reported by Jozefowicz et al. (2016), composed of two LSTM layers with 8192 hidden units each, and projection layers with 1024 units. Due to the large vocabulary size, we trained using sampled softmax (Jean et al., 2014). We trained the model for four days on 8 GPUs. For the smaller WSJ we used a much simpler model with a single LSTM layer with 500 units and trained it until convergence. We used dropout (Srivastava et al., 2014) as a regularization technique for both models.

5.2 Gating model

The gating model was trained using only validation data. For WSJ we used 14k sentences for training and the remaining 2k for early stopping. For 1B Word, we used two separate held-out partitions to train and validate the gating models, each with 6k sentences. In both cases we did not use any test data to train or tune the gating model.

We experimented with three architectures: a linear model (LIN), a fully connected neural network with two hidden layers with 32 units each (MLP), and an LSTM with 8 units on top of the previous MLP to test if temporal dependencies matter.

The largest gating network has only 2572 trainable parameters. All hyper-parameters were tuned on the 1B Word dataset and applied to WSJ as is. We normalize the features on the training set to have zero mean and unit variance. All networks were trained until convergence with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate that was halved every 5k steps.
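The feature normalization and the step-wise learning rate decay can be sketched as follows. The helper names are assumptions, the statistics are fitted on the training features only, and `initial_lr` is a placeholder argument because no particular value is assumed here.

```python
import statistics

def fit_standardizer(rows):
    """Compute per-feature mean and standard deviation on the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant features get a deviation of 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(row, means, stds):
    """Apply the training-set statistics to one feature vector."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def halved_learning_rate(initial_lr, step, halve_every=5000):
    """Learning rate after `step` updates when it is halved every `halve_every` steps."""
    return initial_lr * 0.5 ** (step // halve_every)

# Toy training features (hypothetical): the second feature is constant.
train_rows = [[1.0, 5.0], [3.0, 5.0]]
means, stds = fit_standardizer(train_rows)
normalized = standardize([3.0, 5.0], means, stds)
```

The same means and deviations would then be reused unchanged on validation and test features, so that no statistics leak from the evaluation data.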

5.3 Results

First, we evaluated the three gating architectures on the validation set of the One Billion Word dataset, as shown in Table 4. While extra features decrease the perplexity, the choice of model architecture has a much larger effect. In the following, we only consider the LSTM architecture for the gating model.

        feature set
        SIMPLE   FULL
LIN     30.82    30.75
MLP     30.35    30.30
LSTM    30.05    29.85
Table 4: Validation perplexity (lower is better) for different features and architectures of the gating model. Each experiment is an average over 10 runs.
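For reference, perplexity is the exponential of the average negative log-probability that a model assigns to the observed words. A minimal sketch, with an illustrative function name and toy probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each observed token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Because the average is taken in log space, a few tokens with very low probability can dominate the score, which is why gains on rare words matter for the overall number.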

Next, we evaluated the model on the test sets, see Table 5. We compare the mixture of experts with static ensembling, which is the current state of the art.

              WSJ      1B Word
n-gram LM     113.23   66.96
neural LM     71.39    33.01
ensemble      67.44    29.80
+end-to-end            -
Table 5: Test perplexity on WSJ and 1B (lower is better). For MoE models we report mean and standard error over 10 runs. Results in bold show significant improvement over the best baseline (two-tailed t-test).
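The mean and standard error over repeated runs can be computed as below. The sketch assumes the usual definition of the standard error as the sample standard deviation divided by the square root of the number of runs; the input values are arbitrary placeholders, not results from this work.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean for a list of run results."""
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 in the denominator).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Placeholder values standing in for the metric measured in four runs.
mean, sem = mean_and_standard_error([1.0, 2.0, 3.0, 4.0])
```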


To compare our approach with the method proposed by Neubig and Dyer (2016), we trained a gating model using the last hidden layer of the neural language model as features (HIDDEN; 500 features for WSJ and 1024 for 1B Word). We tried two variants of the model: with and without gradient flow through the neural LM. Fine-tuning requires computing the full softmax as well as its gradient. Attempting this with the 1B Word neural model exhausted GPU memory; as described in Section 5.1, this model is trained with a sampled softmax, which requires much less memory than the full softmax. For this reason we report results on HIDDEN features with end-to-end training only for WSJ.

On both datasets the mixture model with FULL features shows a significant improvement over these baselines. Note that on the 1B Word dataset SIMPLE features alone outperform HIDDEN features by a margin. We speculate that this is caused by the better ratio of model size to dataset size: SIMPLE features are derived from an n-gram model, which scales better than a neural model.

5.4 Conclusion

We proposed a very simple yet effective method to combine pretrained neural and n-gram language models. Instead of static ensembling, we learn a per-time-step predictor of the optimal weight between the two models. The gating network is small and fast to train and run, because it takes as input handcrafted features found by analyzing where n-grams outperform neural models.