Neural Machine Translation on Android
Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
Neural machine translation (NMT) is a deep learning-based method for translation that has recently shown promising results as an alternative to statistical approaches. NMT systems directly model the probability of the next word in the target sentence simply by conditioning a recurrent neural network on the source sentence and previously generated target words.
While both simple and surprisingly accurate, NMT systems typically need to have very high capacity in order to perform well: Sutskever2014 used a 4-layer LSTM with 1000 hidden units per layer (herein 4×1000) and Zhou2016 obtained state-of-the-art results on English → French with a 16-layer LSTM with 512 units per layer. The sheer size of the models requires cutting-edge hardware for training and makes using the models on standard setups very challenging.
This issue of excessively large networks has been observed in several other domains, with much focus on fully-connected and convolutional networks for multi-class classification. Researchers have particularly noted that large networks seem to be necessary for training, but learn redundant representations in the process [Denil et al.2013]. Therefore compressing deep models into smaller networks has been an active area of research. As deep learning systems obtain better results on NLP tasks, compression also becomes an important practical issue with applications such as running deep learning models for speech and translation locally on cell phones.
Pruning methods zero out weights or entire neurons based on an importance criterion: LeCun1990 use (a diagonal approximation to) the Hessian to identify weights whose removal minimally impacts the objective function, while Han2016 remove weights based on thresholding their absolute values. Knowledge distillation approaches [Bucila et al.2006, Ba and Caruana2014, Hinton et al.2015] learn a smaller student network to mimic the original teacher network by minimizing the loss (typically $L_2$ or cross-entropy) between the student and teacher output.
In this work, we investigate knowledge distillation in the context of neural machine translation. We note that NMT differs from previous work which has mainly explored non-recurrent models in the multi-class prediction setting. For NMT, while the model is trained on multi-class prediction at the word-level, it is tasked with predicting complete sequence outputs conditioned on previous decisions. With this difference in mind, we experiment with standard knowledge distillation for NMT and also propose two new versions of the approach that attempt to approximately match the sequence-level (as opposed to word-level) distribution of the teacher network. This sequence-level approximation leads to a simple training procedure wherein the student network is trained on a newly generated dataset that is the result of running beam search with the teacher network.
We run experiments to compress a large state-of-the-art LSTM model, and find that with sequence-level knowledge distillation we are able to learn a smaller LSTM that roughly matches the performance of the full system. We see similar results when compressing a model on a smaller data set. Furthermore, we observe that our proposed approach has other benefits, such as not requiring any beam search at test-time. As a result we are able to perform greedy decoding on the student model 10 times faster than beam search on the teacher model with comparable performance. Our student models can even be run efficiently on a standard smartphone (https://github.com/harvardnlp/nmt-android). Finally, we apply weight pruning on top of the student network to obtain a model that has 13 times fewer parameters than the original teacher model. We have released all the code for the models described in this paper (https://github.com/harvardnlp/seq2seq-attn).
Let $s = [s_1, \dots, s_I]$ and $t = [t_1, \dots, t_J]$ be (random variable sequences representing) the source/target sentence, with $I$ and $J$ respectively being the source/target lengths. Machine translation involves finding the most probable target sentence given the source:
$$\arg\max_{t \in \mathcal{T}} \; p(t \mid s)$$
where $\mathcal{T}$ is the set of all possible sequences. NMT models parameterize $p(t \mid s)$ with an encoder neural network which reads the source sentence and a decoder neural network which produces a distribution over the target sentence (one word at a time) given the source. We employ the attentional architecture from Luong2015, which achieved state-of-the-art results on English → German translation. (Specifically, we use the global-general attention model with the input-feeding approach; we refer the reader to the original paper for further details.)
Knowledge distillation describes a class of methods for training a smaller student network to perform better by learning from a larger teacher network (in addition to learning from the training data set). We generally assume that the teacher has previously been trained, and that we are estimating parameters for the student. Knowledge distillation suggests training by matching the student’s predictions to the teacher’s predictions. For classification this usually means matching the probabilities either via $L_2$ on the log scale [Ba and Caruana2014] or by cross-entropy [Li et al.2014, Hinton et al.2015].
Concretely, assume we are learning a multi-class classifier over a data set of examples of the form $(x, y)$ with possible classes $\mathcal{V}$. The usual training criterion is to minimize NLL for each example from the training data,
$$\mathcal{L}_{\text{NLL}}(\theta) = -\sum_{k=1}^{|\mathcal{V}|} \mathbb{1}\{y = k\} \log p(y = k \mid x; \theta)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function and $p$ the distribution from our model (parameterized by $\theta$). This objective can be seen as minimizing the cross-entropy between the degenerate data distribution (which has all of its probability mass on one class) and the model distribution $p(y \mid x; \theta)$.
In knowledge distillation, we assume access to a learned teacher distribution $q(y \mid x; \theta_T)$, possibly trained over the same data set. Instead of minimizing cross-entropy with the observed data, we minimize the cross-entropy with the teacher’s probability distribution,
$$\mathcal{L}_{\text{KD}}(\theta; \theta_T) = -\sum_{k=1}^{|\mathcal{V}|} q(y = k \mid x; \theta_T) \log p(y = k \mid x; \theta)$$
where $\theta_T$ parameterizes the teacher distribution and remains fixed.
Note the cross-entropy setup is identical, but the target distribution is no longer a sparse distribution. (In some cases the entropy of the teacher/student distribution is increased by annealing it with a temperature term $\tau > 1$, $\tilde{p}(y \mid x) \propto p(y \mid x)^{1/\tau}$.) Training on $q(y \mid x; \theta_T)$ is attractive since it gives more information about other classes for a given data point (e.g. similarity between classes) and has less variance in gradients [Hinton et al.2015].
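Since this new objective has no direct term for the training data, it is common practice to interpolate between the two losses,
$$\mathcal{L}(\theta; \theta_T) = (1 - \alpha)\, \mathcal{L}_{\text{NLL}}(\theta) + \alpha\, \mathcal{L}_{\text{KD}}(\theta; \theta_T)$$
where $\alpha$ is the mixture parameter combining the one-hot distribution and the teacher distribution.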
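The interpolation between the data loss and the teacher loss can be illustrated with a small numerical sketch. The code below is not the authors' implementation: the function names (`nll_loss`, `kd_loss`, `interpolated_loss`), the toy logits, and the use of NumPy are assumptions made for illustration, and it assumes the convention that a weight of `alpha` is placed on the teacher term.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll_loss(student_probs, label):
    """Cross-entropy against the one-hot data distribution."""
    return float(-np.log(student_probs[label]))

def kd_loss(student_probs, teacher_probs):
    """Cross-entropy against the (dense) teacher distribution."""
    return float(-np.sum(teacher_probs * np.log(student_probs)))

def interpolated_loss(student_probs, teacher_probs, label, alpha):
    """(1 - alpha) * NLL + alpha * KD; alpha is the mixture parameter."""
    return ((1.0 - alpha) * nll_loss(student_probs, label)
            + alpha * kd_loss(student_probs, teacher_probs))

# Hypothetical 3-class example: the teacher spreads mass over classes 0 and 1.
student = softmax(np.array([2.0, 1.0, 0.1]))
teacher = softmax(np.array([3.0, 2.5, -1.0]))
loss = interpolated_loss(student, teacher, label=0, alpha=0.5)
```

With `alpha = 0` the loss reduces to ordinary NLL, and with a one-hot teacher the distillation term coincides with NLL, which is one way to check that the two terms are on the same scale.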
The large sizes of neural machine translation systems make them an ideal candidate for knowledge distillation approaches. In this section we explore three different ways this technique can be applied to NMT.
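NMT systems are trained directly to minimize word NLL, $\mathcal{L}_{\text{WORD-NLL}}$, at each position. Therefore if we have a teacher model, standard knowledge distillation for multi-class cross-entropy can be applied. We define this distillation for a sentence as,
$$\mathcal{L}_{\text{WORD-KD}} = -\sum_{j=1}^{J} \sum_{k=1}^{|\mathcal{V}|} q(t_j = k \mid s, t_{<j}) \log p(t_j = k \mid s, t_{<j})$$
where $\mathcal{V}$ is the target vocabulary set. The student can further be trained to optimize the mixture of $\mathcal{L}_{\text{WORD-KD}}$ and $\mathcal{L}_{\text{WORD-NLL}}$. In the context of NMT, we refer to this approach as word-level knowledge distillation and illustrate this in Figure 1 (left).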
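As a rough illustration of word-level knowledge distillation over a whole sentence, the sketch below sums the per-position cross-entropy between teacher and student. It assumes both models have already been run on the same gold prefix, so that their per-position distributions are available as `(J, |V|)` arrays; the array values and function names are hypothetical.

```python
import numpy as np

def word_kd_loss(student_probs, teacher_probs):
    """Sum over positions j and vocabulary entries k of -q * log p."""
    return float(-np.sum(teacher_probs * np.log(student_probs)))

def word_nll_loss(student_probs, gold_ids):
    """Sum over positions of -log p(gold word at that position)."""
    positions = np.arange(len(gold_ids))
    return float(-np.sum(np.log(student_probs[positions, gold_ids])))

# Toy sentence with J = 2 positions and a vocabulary of size 3.
student_probs = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.7, 0.1]])
teacher_probs = np.array([[0.8, 0.1, 0.1],
                          [0.1, 0.8, 0.1]])
gold_ids = np.array([0, 1])

alpha = 0.5  # hypothetical mixture weight
mixed = ((1 - alpha) * word_nll_loss(student_probs, gold_ids)
         + alpha * word_kd_loss(student_probs, teacher_probs))
```

If the teacher rows are replaced by one-hot rows at the gold words, `word_kd_loss` equals `word_nll_loss`, which mirrors the relation between the two objectives described above.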
Word-level knowledge distillation allows transfer of these local word distributions. Ideally however, we would like the student model to mimic the teacher’s actions at the sequence-level. The sequence distribution is particularly important for NMT, because wrong predictions can propagate forward at test-time.
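First, consider the sequence-level distribution specified by the model over all possible sequences $t \in \mathcal{T}$,
$$p(t \mid s) = \prod_{j=1}^{J} p(t_j \mid s, t_{<j})$$
for any length $J$. The sequence-level negative log-likelihood for NMT then involves matching the one-hot distribution over all complete sequences,
$$\mathcal{L}_{\text{SEQ-NLL}} = -\sum_{t \in \mathcal{T}} \mathbb{1}\{t = y\} \log p(t \mid s) = -\sum_{j=1}^{J} \sum_{k=1}^{|\mathcal{V}|} \mathbb{1}\{y_j = k\} \log p(t_j = k \mid s, t_{<j}) = \mathcal{L}_{\text{WORD-NLL}}$$
where $y = [y_1, \dots, y_J]$ is the observed sequence. Of course, this just shows that from a negative log-likelihood perspective, minimizing word-level NLL and sequence-level NLL are equivalent in this model.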
But now consider the case of sequence-level knowledge distillation. As before, we can simply replace the distribution from the data with a probability distribution derived from our teacher model. However, instead of using a single word prediction, we use $q(t \mid s)$ to represent the teacher’s sequence distribution over the sample space of all possible sequences,
$$\mathcal{L}_{\text{SEQ-KD}} = -\sum_{t \in \mathcal{T}} q(t \mid s) \log p(t \mid s)$$
Note that $\mathcal{L}_{\text{SEQ-KD}}$ is inherently different from $\mathcal{L}_{\text{WORD-KD}}$, as the sum is over an exponential number of terms. Despite its intractability, we posit that this sequence-level objective is worthwhile. It gives the teacher the chance to assign probabilities to complete sequences and therefore transfer a broader range of knowledge. We thus consider an approximation of this objective.
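Our simplest approximation is to replace the teacher distribution $q$ with its mode,
$$q(t \mid s) \approx \mathbb{1}\{t = \arg\max_{t' \in \mathcal{T}} q(t' \mid s)\}$$
Observing that finding the mode is itself intractable, we use beam search to find an approximation. The loss is then
$$\mathcal{L}_{\text{SEQ-KD}} \approx -\sum_{t \in \mathcal{T}} \mathbb{1}\{t = \hat{y}\} \log p(t \mid s) = -\log p(t = \hat{y} \mid s)$$
where $\hat{y}$ is now the output from running beam search with the teacher model.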
Using the mode seems like a poor approximation for the teacher distribution $q(t \mid s)$, as we are approximating an exponentially-sized distribution with a single sample. However, previous results showing the effectiveness of beam search decoding for NMT lead us to believe that a large portion of $q$’s mass lies in a single output sequence. In fact, in experiments we find that $q(\hat{y} \mid s)$, the teacher probability of the beam search output, accounts (on average) for a measurable share of the distribution for both German → English and Thai → English (Table 1). (Additionally, there are simple ways to better approximate $q(t \mid s)$. One way would be to consider a $K$-best list from beam search and renormalize the probabilities over that list.)
To summarize, sequence-level knowledge distillation suggests to: (1) train a teacher model, (2) run beam search over the training set with this model, (3) train the student network with cross-entropy on this new dataset. Step (3) is identical to the word-level NLL process except now on the newly-generated data set. This is shown in Figure 1 (center).
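The three steps above can be sketched on a toy problem. In the code below, `toy_teacher` is a hypothetical stand-in for a trained teacher (it simply prefers the upper-cased source word at each position), and `beam_search` is a plain, unoptimized beam search written for illustration; neither reflects the released code. Step (3) is not shown, since it is ordinary cross-entropy training on the pairs returned by `make_seq_kd_data`.

```python
import math

EOS = "</s>"

def toy_teacher(source, prefix):
    """Hypothetical teacher: next-token distribution given source and prefix.

    Assumes lower-case source tokens so the two candidate words differ.
    """
    pos = len(prefix)
    if pos < len(source):
        word = source[pos]
        return {word.upper(): 0.7, word: 0.2, EOS: 0.1}
    return {EOS: 1.0}

def beam_search(next_dist, source, beam_size=5, max_len=20):
    """Return hypotheses as (tokens, log-prob), best first."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            for token, prob in next_dist(source, prefix).items():
                hyp = (prefix + (token,), score + math.log(prob))
                if token == EOS:
                    finished.append(hyp)   # complete hypothesis
                else:
                    expanded.append(hyp)   # still growing
        if not expanded:
            break
        expanded.sort(key=lambda h: h[1], reverse=True)
        beams = expanded[:beam_size]       # keep the top hypotheses
    pool = finished if finished else beams
    return sorted(pool, key=lambda h: h[1], reverse=True)

def make_seq_kd_data(sources, teacher, beam_size=5):
    """Step (2): replace each gold target with the teacher's best output."""
    data = []
    for source in sources:
        best_tokens, _ = beam_search(teacher, source, beam_size)[0]
        target = [tok for tok in best_tokens if tok != EOS]
        data.append((source, target))
    return data

seq_kd_data = make_seq_kd_data([["a", "b"], ["c"]], toy_teacher)
```

Because the student only ever sees the teacher's highest-scoring output for each source sentence, the new data set has the same size as the original one.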
Next we consider integrating the training data back into the process, such that we train the student model as a mixture of our sequence-level teacher-generated data ($\mathcal{L}_{\text{SEQ-KD}}$) with the original training data ($\mathcal{L}_{\text{SEQ-NLL}}$),
$$\mathcal{L} = (1 - \alpha)\, \mathcal{L}_{\text{SEQ-NLL}} + \alpha\, \mathcal{L}_{\text{SEQ-KD}} = -(1 - \alpha) \log p(y \mid s) - \alpha \sum_{t \in \mathcal{T}} q(t \mid s) \log p(t \mid s)$$
where $y$ is the gold target sequence.
Since the second term is intractable, we could again apply the mode approximation from the previous section,
$$\mathcal{L} = -(1 - \alpha) \log p(y \mid s) - \alpha \log p(\hat{y} \mid s)$$
and train on both observed ($y$) and teacher-generated ($\hat{y}$) data. However, this process is non-ideal for two reasons: (1) unlike for standard knowledge distillation, it doubles the size of the training data, and (2) it requires training on both the teacher-generated sequence and the true sequence, conditioned on the same source input. The latter concern is particularly problematic since we observe that $y$ and $\hat{y}$ are often quite different.
As an alternative, we propose a single-sequence approximation that is more attractive in this setting. This approach is inspired by local updating [Liang et al.2006], a method for discriminative training in statistical machine translation (although to our knowledge not for knowledge distillation). Local updating suggests selecting a training sequence which is close to $y$ and has high probability under the teacher model,
$$\tilde{y} = \arg\max_{t \in \mathcal{T}} \; \text{sim}(t, y)\, q(t \mid s)$$
where $\text{sim}$ is a function measuring closeness (e.g. Jaccard similarity or BLEU [Papineni et al.2002]). Following local updating, we can approximate this sequence by running beam search and choosing
$$\tilde{y} \approx \arg\max_{t \in \mathcal{T}_K} \; \text{sim}(t, y)$$
where $\mathcal{T}_K$ is the $K$-best list from beam search. We take $\text{sim}$ to be smoothed sentence-level BLEU [Chen and Cherry2014].
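The selection step can be sketched in a few lines. Note the substitution: the text uses smoothed sentence-level BLEU as the similarity function, whereas the sketch below uses Jaccard similarity over token sets (mentioned above as another possible choice) to stay short. The function names and the toy K-best list are hypothetical.

```python
def jaccard(candidate, reference):
    """Jaccard similarity between the token sets of two sequences."""
    a, b = set(candidate), set(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def select_seq_inter(k_best, gold, sim=jaccard):
    """Pick the K-best entry closest to the gold target.

    Ties are broken in favour of the earlier (higher-scoring) beam entry,
    because max() returns the first maximal element.
    """
    return max(k_best, key=lambda cand: sim(cand, gold))

gold = "the cat sat on the mat".split()
k_best = [
    "a cat is on a mat".split(),      # teacher's top hypothesis
    "the cat sat on a mat".split(),   # closer to the gold target
    "cat mat".split(),
]
chosen = select_seq_inter(k_best, gold)
```

Any sentence-level similarity with the same `(candidate, reference)` signature could be passed as `sim`, including a smoothed BLEU implementation.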
We justify training on $\tilde{y}$ from a knowledge distillation perspective with the following generative process: suppose that there is a true target sequence (which we do not observe) that is first generated from the underlying data distribution $\mathcal{D}$. And further suppose that the target sequence that we observe ($y$) is a noisy version of the unobserved true sequence: i.e. (i) $t \sim \mathcal{D}$, (ii) $y \sim \epsilon(t)$, where $\epsilon$ is, for example, a noise function that independently replaces each element in $t$ with a random element in $\mathcal{V}$ with some small probability. (While we employ a simple (unrealistic) noise function for illustrative purposes, the generative story is quite plausible if we consider a more elaborate noise function which includes additional sources of noise such as phrase reordering, replacement of words with synonyms, etc. One could view translation as having two sources of variance that should be modeled separately: variance due to the source sentence ($t \sim \mathcal{D}$), and variance due to the individual translator ($y \sim \epsilon(t)$).) In such a case, ideally the student’s distribution should match the mixture distribution,
$$\mathcal{D}_{\text{SEQ-Inter}} \sim (1 - \alpha)\, \mathcal{D} + \alpha\, q(t \mid s)$$
In this setting, due to the noise assumption, $\mathcal{D}$ now has significant probability mass around a neighborhood of $y$ (not just at $y$), and therefore the $\arg\max$ of the mixture distribution is likely something other than $y$ (the observed sequence) or $\hat{y}$ (the output from beam search). We can see that $\tilde{y}$ is a natural approximation to the $\arg\max$ of this mixture distribution between $\mathcal{D}$ and $q(t \mid s)$ for some $\alpha$. We illustrate this framework in Figure 1 (right) and visualize the distribution over a real example in Figure 2.
To test out these approaches, we conduct two sets of NMT experiments: high resource (English → German) and low resource (Thai → English).
The English-German data comes from WMT 2014 (http://statmt.org/wmt14). The training set has m sentences and we take newstest2012/newstest2013 as the dev set and newstest2014 as the test set. We keep the top k most frequent words, and replace the rest with UNK. The teacher model is a LSTM (as in Luong2015) and we train two student models: and . The Thai-English data comes from IWSLT 2015 (https://sites.google.com/site/iwsltevaluation2015/mt-track). There are k sentences in the training set and we take 2010/2011/2012 data as the dev set and 2012/2013 as the test set, with a vocabulary size of k. Size of the teacher model is (which performed better than , models), and the student model is . Other training details mirror Luong2015.
We evaluate on tokenized BLEU with multi-bleu.perl, and experiment with the following variations:
Word-KD: Student is trained on the original data and additionally trained to minimize the cross-entropy of the teacher distribution at the word-level. We tested and found to work better.
Seq-KD: Student is trained on the teacher-generated data, which is the result of running beam search and taking the highest-scoring sequence with the teacher model. We use beam size (we did not see improvements with a larger beam).
Seq-Inter: Student is trained on the sequence on the teacher’s beam that had the highest BLEU (beam size ). We adopt a fine-tuning approach where we begin training from a pretrained model (either on original data or Seq-KD data) and train with a smaller learning rate (). For English-German we generate Seq-Inter data on a smaller portion of the training set () for efficiency.
The above methods are complementary and can be combined with each other. For example, we can train on teacher-generated data but still include a word-level cross-entropy term between the teacher/student (Seq-KD + Word-KD in Table 1), or fine-tune towards Seq-Inter data starting from the baseline model trained on original data (Baseline + Seq-Inter in Table 1). (For instance, ‘Seq-KD + Seq-Inter + Word-KD’ in Table 1 means that the model was trained on Seq-KD data and fine-tuned towards Seq-Inter data with the mixture cross-entropy loss at the word-level.)
|English → German WMT 2014|
|Teacher Baseline (Params: m)|
|Student Baseline (Params: m)|
|Seq-KD + Seq-Inter + Word-KD|
|Student Baseline (Params: m)|
|Seq-KD + Seq-Inter + Word-KD|
|Thai → English IWSLT 2015|
|Teacher Baseline (Params: m)|
|Student Baseline (Params: m)|
|Seq-KD + Seq-Inter + Word-KD|
Results of our experiments are shown in Table 1. We find that while word-level knowledge distillation (Word-KD) does improve upon the baseline, sequence-level knowledge distillation (Seq-KD) does better on English → German and performs similarly on Thai → English. Combining them (Seq-KD + Word-KD) results in further gains for the and models (although not for the model), indicating that these methods provide orthogonal means of transferring knowledge from the teacher to the student: Word-KD is transferring knowledge at the local (i.e. word) level while Seq-KD is transferring knowledge at the global (i.e. sequence) level.
Sequence-level interpolation (Seq-Inter), in addition to improving models trained via Word-KD and Seq-KD, also improves upon the original teacher model when that model, trained on the actual data, is fine-tuned towards Seq-Inter data (Baseline + Seq-Inter). In fact, greedy decoding with this fine-tuned model has similar performance () as beam search with the original model (), allowing for faster decoding even with an identically-sized model.
We hypothesize that sequence-level knowledge distillation is effective because it allows the student network to only model relevant parts of the teacher distribution (i.e. around the teacher’s mode) instead of ‘wasting’ parameters on trying to model the entire space of translations. Our results suggest that this is indeed the case: the probability mass that Seq-KD models assign to the approximate mode is much higher than is the case for baseline models trained on original data (Table 1: $p(t = \hat{y})$). For example, on English → German the (approximate) $\arg\max$ for the Seq-KD model (on average) accounts for of the total probability mass, while the corresponding number is for the baseline. This also explains the success of greedy decoding for Seq-KD models: since we are only modeling around the teacher’s mode, the student’s distribution is more peaked and therefore the $\arg\max$ is much easier to find. Seq-Inter offers a compromise between the two, with the greedily-decoded sequence accounting for of the distribution.
Finally, although past work has shown that models with lower perplexity generally tend to have higher BLEU, our results indicate that this is not necessarily the case. The perplexity of the baseline English → German model is while the perplexity of the corresponding Seq-KD model is , despite the fact that the Seq-KD model does significantly better for both greedy ( BLEU) and beam search ( BLEU) decoding.
|Beam = 1 (Greedy)|
Run-time complexity for beam search grows linearly with beam size. Therefore, the fact that sequence-level knowledge distillation allows for greedy decoding is significant, with practical implications for running NMT systems across various devices. To test the speed gains, we run the teacher/student models on GPU, CPU, and smartphone, and check the average number of source words translated per second (Table 2). We use a GeForce GTX Titan X for GPU and a Samsung Galaxy 6 smartphone. We find that we can run the student model 10 times faster with greedy decoding than the teacher model with beam search on GPU ( vs words/sec), with similar performance.
Although knowledge distillation enables training faster models, the number of parameters for the student models is still somewhat large (Table 1: Params), due to the word embeddings which dominate most of the parameters. (Word embeddings scale linearly while RNN parameters scale quadratically with the dimension size.) For example, on the English → German model the word embeddings account for approximately (m out of m) of the parameters. The size of word embeddings has little impact on run-time as the word embedding layer is a simple lookup table that only affects the first layer of the model.
We therefore focus next on reducing the memory footprint of the student models further through weight pruning. Weight pruning for NMT was recently investigated by See2016, who found that up to of the parameters in a large NMT model can be pruned with little loss in performance. We take our best English → German student model (Seq-KD + Seq-Inter) and prune of the parameters by removing the weights with the lowest absolute values. We then retrain the pruned model on Seq-KD data with a learning rate of and fine-tune towards Seq-Inter data with a learning rate of . As observed by See2016, retraining proved to be crucial. The results are shown in Table 3.
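Magnitude-based pruning of the kind described above can be sketched as follows. This is a minimal illustration on a single weight matrix, not the procedure used for the reported results: the function name `magnitude_prune`, the use of NumPy, and the choice of one threshold per matrix are all assumptions.

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out roughly `fraction` of the weights with the smallest |value|.

    Returns the pruned weights and the boolean keep-mask. Ties at the
    threshold are pruned as well, so slightly more than `fraction`
    may be removed.
    """
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)          # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.array([[0.5, -0.1],
              [0.05, -2.0]])
pruned, mask = magnitude_prune(w, fraction=0.5)
```

During retraining, one common way to keep pruned weights at zero is to re-apply `mask` after every parameter update; whether that matches the exact retraining recipe used here is not stated in the text.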
Our findings suggest that compression benefits achieved through weight pruning and knowledge distillation are orthogonal. (To our knowledge, combining pruning and knowledge distillation has not been investigated before.) Pruning of the weights in the student model results in a model with 13 times fewer parameters than the original teacher model with only a decrease of 0.4 BLEU. While pruning of the weights results in a more appreciable decrease of BLEU, the model is drastically smaller with m parameters, which is fewer than the original teacher model.
For models trained with word-level knowledge distillation, we also tried regressing the student network’s top-most hidden layer at each time step to the teacher network’s top-most hidden layer as a pretraining step, noting that Romero2015 obtained improvements with a similar technique on feed-forward models. We found this to give comparable results to standard knowledge distillation and hence did not pursue this further.
There have been promising recent results on eliminating word embeddings completely and obtaining word representations directly from characters with character composition models, which have many fewer parameters than word embedding lookup tables [Ling et al.2015a, Kim et al.2016, Ling et al.2015b, Jozefowicz et al.2016, Costa-Jussa and Fonollosa2016]. Combining such methods with knowledge distillation/pruning to further reduce the memory footprint of NMT systems remains an avenue for future work.
Compressing deep learning models is an active area of current research. Pruning methods involve pruning weights or entire neurons/nodes based on some criterion. LeCun1990 prune weights based on an approximation of the Hessian, while Han2016 show that a simple magnitude-based pruning works well. Prior work on removing neurons/nodes includes Srinivas2015 and Mariet2016. See2016 were the first to apply pruning to Neural Machine Translation, observing that different parts of the architecture (input word embeddings, LSTM matrices, etc.) admit different levels of pruning. Knowledge distillation approaches train a smaller student model to mimic a larger teacher model, by minimizing the loss between the teacher/student predictions [Bucila et al.2006, Ba and Caruana2014, Li et al.2014, Hinton et al.2015]. Romero2015 additionally regress on the intermediate hidden layers of the student/teacher network as a pretraining step, while Mou2015 obtain smaller word embeddings from a teacher model via regression. There has also been work on transferring knowledge across different network architectures: Chan2015b show that a deep non-recurrent neural network can learn from an RNN; Geras2016 train a CNN to mimic an LSTM for speech recognition. Kuncoro2016 recently investigated knowledge distillation for structured prediction by having a single parser learn from an ensemble of parsers.
Other approaches for compression involve low rank factorizations of weight matrices [Denton et al.2014, Jaderberg et al.2014, Lu et al.2016, Prabhavalkar et al.2016], sparsity-inducing regularizers [Murray and Chiang2015], binarization of weights [Courbariaux et al.2016, Lin et al.2016], and weight sharing [Chen et al.2015, Han et al.2016]. Finally, although we have motivated sequence-level knowledge distillation in the context of training a smaller model, there are other techniques that train on a mixture of the model’s predictions and the data, such as local updating [Liang et al.2006], hope/fear training [Chiang2012], SEARN [Daumé III et al.2009], DAgger [Ross et al.2011], and minimum risk training [Och2003, Shen et al.2016].
In this work we have investigated existing knowledge distillation methods for NMT (which work at the word-level) and introduced two sequence-level variants of knowledge distillation, which provide improvements over standard word-level knowledge distillation.
We have chosen to focus on translation as this domain has generally required the largest capacity deep learning models, but the sequence-to-sequence framework has been successfully applied to a wide range of tasks including parsing [Vinyals et al.2015a], summarization [Rush et al.2015], dialogue [Vinyals and Le2015, Serban et al.2016, Li et al.2016], NER/POS-tagging [Gillick et al.2016], image captioning [Vinyals et al.2015b, Xu et al.2015], video generation [Srivastava et al.2015], and speech recognition [Chan et al.2015a]. We anticipate that methods described in this paper can be used to similarly train smaller models in other domains.
[Denton et al.2014] Exploiting Linear Structure within Convolutional Neural Networks for Efficient Evaluation. In Proceedings of NIPS.
[Murray and Chiang2015] Auto-sizing Neural Networks: With Applications to N-Gram Language Models. In Proceedings of EMNLP.
[Ross et al.2011] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of AISTATS.