We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
In a common family of neural network language models, the current input word is represented as the vector $c \in \mathbb{R}^C$ and is projected to a dense representation using a word embedding matrix $U$. Some computation is then performed on the word embedding $U^\top c$, which results in a vector of activations $h_2$. A second matrix $V$ then projects $h_2$ to a vector $h_3$ containing one score per vocabulary word: $h_3 = V h_2$. The vector of scores is then converted to a vector of probability values $p$, which represents the model's prediction of the next word, using the softmax function.

For example, in the LSTM-based language models of [Sundermeyer et al.2012, Zaremba et al.2014], for a vocabulary of size $C$, the one-hot encoding is used to represent the input $c$, and $U \in \mathbb{R}^{C \times H}$. An LSTM is then employed, which results in an activation vector $h_2$ that, similarly to $U^\top c$, is also in $\mathbb{R}^H$. In this case, $U$ and $V$ are of exactly the same size.

We call $U$ the input embedding and $V$ the output embedding. In both matrices, we expect rows that correspond to similar words to be similar: for the input embedding, we would like the network to react similarly to synonyms, while in the output embedding, we would like the scores of words that are interchangeable to be similar [Mnih and Teh2012].
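The forward pass described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the models used in the experiments: the sizes `C` and `H` are arbitrary toy values, the weights are random, and a single `tanh` stands in for the LSTM computation that sits between the two embeddings.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def forward(U, V, word_index):
    """Toy forward pass: one-hot input -> input embedding -> scores -> probabilities."""
    C = U.shape[0]
    c = np.zeros(C)
    c[word_index] = 1.0          # one-hot encoding of the current input word
    embedded = U.T @ c           # equals row `word_index` of U
    h2 = np.tanh(embedded)       # stand-in for the LSTM layers (assumption)
    h3 = V @ h2                  # one score per vocabulary word
    return softmax(h3)

rng = np.random.default_rng(0)
C, H = 10, 4                     # toy vocabulary and hidden sizes (assumptions)
U = rng.normal(size=(C, H))      # input embedding
V = rng.normal(size=(C, H))      # output embedding, same shape as U

p_untied = forward(U, V, word_index=3)
p_tied = forward(U, U, word_index=3)   # tying: one matrix plays both roles
```

Because both matrices have one row per vocabulary word and the same width, passing the same array for both roles is all that tying requires in this toy setting.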
While $U$ and $V$ can both serve as word embeddings, in the literature, only the former serves this role. In this paper, we compare the quality of the input embedding to that of the output embedding, and we show that the latter can be used to improve neural network language models. Our main results are as follows: (i) We show that in the word2vec skip-gram model, the output embedding is only slightly inferior to the input embedding. This is shown using metrics that are commonly used to measure embedding quality. (ii) In recurrent neural network based language models, the output embedding outperforms the input embedding. (iii) By tying the two embeddings together, i.e., enforcing $U = V$, the joint embedding evolves in a more similar way to the output embedding than to the input embedding of the untied model. (iv) Tying the input and output embeddings leads to an improvement in the perplexity of various language models. This is true both with and without dropout. (v) When not using dropout, we propose adding an additional projection $P$ before $V$ and applying regularization to $P$. (vi) Weight tying in neural translation models can reduce their size (number of parameters) to less than half of their original size without harming their performance.

Neural network language models (NNLMs) assign probabilities to word sequences. Their resurgence was initiated by [Bengio et al.2003]. Recurrent neural networks were first used for language modeling in [Mikolov et al.2010] and [Pascanu et al.2013]. The first model that implemented language modeling with LSTMs [Hochreiter and Schmidhuber1997] was [Sundermeyer et al.2012]. Following that, [Zaremba et al.2014] introduced a dropout [Srivastava2013] augmented NNLM. [Gal2015, Gal and Ghahramani2016] proposed a new dropout method, which is referred to as Bayesian Dropout below, that improves on the results of [Zaremba et al.2014].
The skip-gram word2vec model introduced in [Mikolov et al.2013a, Mikolov et al.2013b] learns representations of words. This model learns a representation for each word in its vocabulary, both in an input embedding matrix and in an output embedding matrix. When training is complete, the vectors that are returned are the input embeddings. The output embedding is typically ignored, although [Mitra et al.2016, Mnih and Kavukcuoglu2013] use both the output and input embeddings of words in order to compute word similarity. Recently, [Goldberg and Levy2014] argued that the output embedding of the word2vec skip-gram model needs to be different from the input embedding.
As we show, tying the input and the output embeddings is indeed detrimental in word2vec. However, it improves performance in NNLMs.
In neural machine translation (NMT) models [Kalchbrenner and Blunsom2013, Cho et al.2014, Sutskever et al.2014, Bahdanau et al.2014], the decoder, which generates the translation of the input sentence in the target language, is a language model that is conditioned on both the previous words of the output sentence and on the source sentence. State-of-the-art results in NMT have recently been achieved by systems that segment the source and target words into subword units [Sennrich et al.2016a]. One such method [Sennrich et al.2016b] is based on the byte pair encoding (BPE) compression algorithm [Gage1994]. BPE segments rare words into their more commonly appearing subwords.

Weight tying was previously used in the log-bilinear model of [Mnih and Hinton2009], but the decision to use it was not explained, and its effect on the model's performance was not tested. Independently and concurrently with our work, [Inan et al.2016] presented an explanation for weight tying in NNLMs based on [Hinton et al.2015].
In this work, we employ three different model categories: NNLMs, the word2vec skip-gram model, and NMT models. Weight tying is applied similarly in all models. For translation models, we also present a three-way weight tying method.
The NNLMs contain an input embedding matrix, two LSTM layers ($h_1$ and $h_2$), a third hidden scores/logits layer $h_3$, and a softmax layer. The loss used during training is the cross-entropy loss without any regularization terms.
Following [Zaremba et al.2014], we employ two models: large and small. The large model employs dropout for regularization. The small model is not regularized, so we propose the following regularization scheme. A projection matrix $P \in \mathbb{R}^{H \times H}$ is inserted before the output embedding, i.e., $h_3 = V P h_2$. The regularizing term $\lambda \|P\|_2$ is then added to the small model's loss function. The same value of $\lambda$ is used in all of our experiments.

Projection regularization allows us to use the same embedding (as both the input and output embedding) with some adaptation that is under regularization. It is, therefore, especially suited for weight tying (WT).
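A minimal sketch of projection regularization, under stated assumptions, is given below. The text does not say which matrix norm is meant, so the Frobenius norm (NumPy's default for matrices) is assumed here. The value of `lam`, the identity initialization of `P`, and the toy sizes are placeholders chosen for illustration and are not taken from the experiments.

```python
import numpy as np

def pr_loss(V, P, h2, target, lam):
    """Cross-entropy loss with the projection regularization term added."""
    h3 = V @ (P @ h2)                         # scores with the projection inserted
    log_probs = h3 - h3.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    cross_entropy = -log_probs[target]
    penalty = lam * np.linalg.norm(P)         # Frobenius norm of P (assumption)
    return cross_entropy + penalty

rng = np.random.default_rng(0)
C, H = 10, 4                                  # toy sizes (assumptions)
V = rng.normal(size=(C, H))                   # output (or tied) embedding
P = np.eye(H)                                 # identity start (assumption)
h2 = rng.normal(size=H)
lam = 0.1                                     # placeholder value, not from the paper

loss_without = pr_loss(V, P, h2, target=2, lam=0.0)
loss_with = pr_loss(V, P, h2, target=2, lam=lam)
```

The penalty depends only on `P`, so the embedding itself is left free while the adaptation applied to it is kept small.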
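While training a vanilla untied NNLM, at timestep $t$, with current input word sequence $i_{1:t} = [i_1, \ldots, i_t]$ and current target output word $o_t$, the negative log likelihood loss is given by $L_t = -\log p_t(o_t \mid i_{1:t})$, where

$$p_t(o_t \mid i_{1:t}) = \frac{\exp\left(V_{o_t}^\top h_2^{(t)}\right)}{\sum_{x=1}^{C} \exp\left(V_x^\top h_2^{(t)}\right)},$$

$U_k$ ($V_k$) is the $k$th row of $U$ ($V$), which corresponds to word $k$, and $h_2^{(t)}$ is the vector of activations of the topmost LSTM layer's output at time $t$. For simplicity, we assume that at each timestep $t$, $i_t \neq o_t$. Optimization of the model is performed using stochastic gradient descent.

The update for row $k$ of the input embedding is:

$$\frac{\partial L_t}{\partial U_k} = \begin{cases} \left( \sum_{x=1}^{C} p_t(x \mid i_{1:t}) \, V_x^\top - V_{o_t}^\top \right) \dfrac{\partial h_2^{(t)}}{\partial U_{i_t}} & k = i_t \\ 0 & k \neq i_t \end{cases}$$

For the output embedding, row $k$'s update is:

$$\frac{\partial L_t}{\partial V_k} = \begin{cases} \left( p_t(o_t \mid i_{1:t}) - 1 \right) h_2^{(t)} & k = o_t \\ p_t(k \mid i_{1:t}) \, h_2^{(t)} & k \neq o_t \end{cases}$$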
Therefore, in the untied model, at every timestep, the only row that is updated in the input embedding is the row representing the current input word. This means that vectors representing rare words are updated only a small number of times. The output embedding updates every row at each timestep.
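The contrast between the two update patterns can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions, not the NNLM itself: a single `tanh` of the current input word's embedding stands in for the LSTM, and the sizes and word indices are arbitrary.

```python
import numpy as np

def untied_gradients(U, V, i_t, o_t):
    """Gradients of the negative log likelihood for a toy model with h2 = tanh(U[i_t])."""
    h2 = np.tanh(U[i_t])                       # stand-in for the LSTM (assumption)
    scores = V @ h2
    p = np.exp(scores - scores.max())
    p /= p.sum()

    # Output embedding: row k gets p[k] * h2, and row o_t gets (p[o_t] - 1) * h2.
    err = p.copy()
    err[o_t] -= 1.0
    grad_V = np.outer(err, h2)

    # Input embedding: only the row of the current input word gets a gradient.
    grad_h2 = V.T @ err                        # sum_x p[x] * V[x] - V[o_t]
    grad_U = np.zeros_like(U)
    grad_U[i_t] = grad_h2 * (1.0 - h2 ** 2)    # chain rule through tanh
    return grad_U, grad_V

rng = np.random.default_rng(0)
C, H = 8, 3                                    # toy sizes (assumptions)
U = rng.normal(size=(C, H))
V = rng.normal(size=(C, H))
grad_U, grad_V = untied_gradients(U, V, i_t=2, o_t=5)

rows_updated_in_U = int(np.count_nonzero(np.abs(grad_U).sum(axis=1)))
rows_updated_in_V = int(np.count_nonzero(np.abs(grad_V).sum(axis=1)))
```

In this toy setting a single row of the input embedding receives a nonzero gradient, while every row of the output embedding does.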
In tied NNLMs, we set $U = V = S$. The update for each row in $S$ is the sum of the updates obtained for the two roles of $S$ as both an input and output embedding.

The update for row $k \neq i_t$ (any row other than that of the current input word $i_t$) is similar to the update of row $k$ in the untied NNLM's output embedding (the only difference being that $U$ and $V$ are both replaced by a single matrix $S$). In this case, there is no update from the input embedding role of $S$.

The update for row $k = i_t$ is made up of a term from the input embedding (case $k = i_t$) and a term from the output embedding (case $k \neq o_t$, where $o_t$ is the current target word). The second term grows linearly with $p_t(i_t \mid i_{1:t})$, the probability that the model assigns to the current input word appearing again as the next word. This probability is expected to be close to zero, since words seldom appear twice in a row (the low probability in the network was also verified experimentally). The update that occurs in this case is, therefore, mostly impacted by the update from the input embedding role of $S$.
To conclude, in the tied NNLM, every row of $S$ is updated during each iteration, and for all rows except one, this update is similar to the update of the output embedding of the untied model. This implies a greater degree of similarity of the tied embedding to the untied model's output embedding than to its input embedding.
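The claim that the tied update is the sum of the two roles can be verified on a toy model. The following sketch makes the same simplifying assumption as the analysis (only the current input word feeds the hidden state) and replaces the LSTM with a `tanh`. The finite-difference check, the sizes, and the word indices are illustrative choices.

```python
import numpy as np

def loss(S, i_t, o_t):
    h2 = np.tanh(S[i_t])                       # stand-in for the LSTM (assumption)
    scores = S @ h2
    return -scores[o_t] + np.log(np.exp(scores).sum())

def tied_gradient(S, i_t, o_t):
    h2 = np.tanh(S[i_t])
    scores = S @ h2
    p = np.exp(scores - scores.max())
    p /= p.sum()
    err = p.copy()
    err[o_t] -= 1.0
    output_role = np.outer(err, h2)            # every row, as for an untied output embedding
    input_role = np.zeros_like(S)
    input_role[i_t] = (S.T @ err) * (1.0 - h2 ** 2)   # only the input word's row
    return input_role, output_role

rng = np.random.default_rng(1)
C, H = 6, 3                                    # toy sizes (assumptions)
S = rng.normal(size=(C, H))
i_t, o_t = 1, 4
input_role, output_role = tied_gradient(S, i_t, o_t)
grad_S = input_role + output_role

# Central finite differences as an independent check of the summed gradient.
eps = 1e-6
numeric = np.zeros_like(S)
for k in range(C):
    for d in range(H):
        step = np.zeros_like(S)
        step[k, d] = eps
        numeric[k, d] = (loss(S + step, i_t, o_t) - loss(S - step, i_t, o_t)) / (2 * eps)
```

All rows except the one for the current input word receive exactly the output-role gradient, which matches the conclusion drawn above.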
The analysis above focuses on NNLMs for brevity. In word2vec, the update rules are similar, except that $h_2^{(t)}$ is replaced by the identity function. As argued by [Goldberg and Levy2014], in this case weight tying is not appropriate, because if $p_t(i_t \mid i_{1:t})$ is close to zero then so is the norm of the embedding of $i_t$. This argument does not hold for NNLMs, since the LSTM layers cause a decoupling of the input and output embeddings.
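Finally, we evaluate the effect of weight tying in neural translation models. In this model:

$$p_t(o_t \mid i_{1:t}, r) = \frac{\exp\left(V_{o_t}^\top G^{(t)}\right)}{\sum_{x=1}^{C_t} \exp\left(V_x^\top G^{(t)}\right)},$$

where $r = (r_1, \ldots, r_N)$ is the set of words in the source sentence, $U$ and $V$ are the input and output embeddings of the decoder, and $W$ is the input embedding of the encoder (in translation models $U, V \in \mathbb{R}^{C_t \times H}$ and $W \in \mathbb{R}^{C_s \times H}$, where $C_s$ / $C_t$ is the size of the vocabulary of the source / target). $G^{(t)}$ is the decoder, which receives the context vector, the embedding of the input word ($i_t$) in $U$, and its previous state at each timestep. $c_t$ is the context vector at timestep $t$, $c_t = \sum_{j \in r} a_{tj} h_j$, where $a_{tj}$ is the weight given to the $j$th annotation at time $t$: $a_{tj} = \frac{\exp(e_{tj})}{\sum_{k \in r} \exp(e_{tk})}$, and $e_{tj} = a_t(h_j)$, where $a$ is the alignment model. $F$ is the encoder, which produces the sequence of annotations $(h_1, \ldots, h_N)$.

The output of the decoder is then projected to a vector of scores using the output embedding: $l_t = V G^{(t)}$. The scores are then converted to probability values using the softmax function.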
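The attention step and the final projection can be sketched as follows. This is a simplified illustration rather than the DL4MT implementation: the alignment scores are random stand-ins for the output of the alignment model, a `tanh` of the context vector stands in for the decoder, and all sizes are arbitrary.

```python
import numpy as np

def context_vector(annotations, alignment_scores):
    """Softmax-weighted sum of the encoder annotations."""
    e = alignment_scores - alignment_scores.max()   # stabilise the softmax
    a = np.exp(e) / np.exp(e).sum()                 # attention weights, sum to 1
    return a, a @ annotations                       # weights and context vector

def decoder_scores(V, decoder_output):
    # Project the decoder output to one score per target-vocabulary word.
    return V @ decoder_output

rng = np.random.default_rng(0)
N, H, C_t = 5, 4, 7                                 # toy sizes (assumptions)
annotations = rng.normal(size=(N, H))               # one annotation per source word
alignment_scores = rng.normal(size=N)               # random stand-ins (assumption)
a, c_t = context_vector(annotations, alignment_scores)
V = rng.normal(size=(C_t, H))                       # decoder output embedding
scores = decoder_scores(V, np.tanh(c_t))            # tanh stands in for the decoder
```

The output embedding enters only in the last line, which is why tying it to the decoder's input embedding leaves the attention mechanism untouched.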
In our weight tied translation model, we tie the input and output embeddings of the decoder.
We observed that when preprocessing the ACL WMT 2014 EN→FR (http://statmt.org/wmt14/translationtask.html) and WMT 2015 EN→DE (http://statmt.org/wmt15/translationtask.html) datasets using BPE, many of the subwords appeared in the vocabulary of both the source and the target languages. Tab. 1 shows that up to 90% (85%) of BPE subwords between English and French (German) are shared.
Based on this observation, we propose three-way weight tying (TWWT), where the input embedding of the decoder, the output embedding of the decoder, and the input embedding of the encoder are all tied. The single source/target vocabulary of this model is the union of both the source and target vocabularies. In this model, both in the encoder and decoder, all subwords are embedded in the same duolingual space.
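Table 1: BPE subwords shared between the source and target vocabularies.

| Language pairs | Subwords only in source | Subwords only in target | Subwords in both |
|---|---|---|---|
| EN→FR | 2K | 7K | 85K |
| EN→DE | 3K | 11K | 80K |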
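The shared fractions quoted in the text follow directly from the counts in Tab. 1. The short calculation below reproduces them and also gives the size of the joint vocabulary that three-way tying would use. The counts are the rounded values from the table, so the results are approximate.

```python
# Subword counts from Tab. 1, in thousands: (only in source, only in target, in both).
counts = {
    "EN-FR": (2, 7, 85),
    "EN-DE": (3, 11, 80),
}

def shared_fraction(only_source, only_target, both):
    # Fraction of the joint (union) vocabulary that appears in both languages.
    union = only_source + only_target + both
    return both / union

shared = {pair: shared_fraction(*c) for pair, c in counts.items()}
joint_vocab_k = {pair: sum(c) for pair, c in counts.items()}

for pair in counts:
    print(f"{pair}: {shared[pair]:.1%} shared, joint vocabulary of about {joint_vocab_k[pair]}K")
```

Both language pairs end up with a joint vocabulary of roughly 94K subwords, only slightly larger than either single-language vocabulary.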
Our experiments study the quality of various embeddings, the similarity between them, and the impact of tying them on the word2vec skipgram model, NNLMs, and NMT models.
In order to compare the various embeddings, we pooled five embedding evaluation methods from the literature. These evaluation methods involve calculating pairwise (cosine) distances between embeddings and correlating these distances with human judgments of the strength of relationships between concepts. We use: Simlex-999 [Hill et al.2016], Verb-143 [Baker et al.2014], MEN [Bruni et al.2014], RareWord [Luong et al.2013] and MTurk-771 [Halawi et al.2012].
We begin by training both the tied and untied word2vec models on the text8 (http://mattmahoney.net/dc/textdata) dataset, using a vocabulary consisting only of words that appear at least five times. As can be seen in Tab. 2, the output embedding is almost as good as the input embedding. As expected, the embedding of the tied model is not competitive. The situation is different when training the small NNLM model on either the Penn Treebank [Marcus et al.1993] or text8 datasets (for PTB, we used the same train/validation/test set split and vocabulary as [Mikolov et al.2011], while on text8 we used the split/vocabulary from [Mikolov et al.2014]). These results are presented in Tab. 3. In this case, the input embedding is far inferior to the output embedding. The tied embedding is comparable to the output embedding.
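Table 2: Embedding quality of the word2vec skip-gram model trained on text8.

| Benchmark | Input | Output | Tied |
|---|---|---|---|
| Simlex-999 | 0.30 | 0.29 | 0.17 |
| Verb-143 | 0.41 | 0.34 | 0.12 |
| MEN | 0.66 | 0.61 | 0.50 |
| RareWord | 0.34 | 0.34 | 0.23 |
| MTurk-771 | 0.59 | 0.54 | 0.37 |

Table 3: Embedding quality of the small NNLM trained on PTB and text8.

| Embedding | PTB In | PTB Out | PTB Tied | text8 In | text8 Out | text8 Tied |
|---|---|---|---|---|---|---|
| Simlex-999 | 0.02 | 0.13 | 0.14 | 0.17 | 0.27 | 0.28 |
| Verb-143 | 0.12 | 0.37 | 0.32 | 0.20 | 0.35 | 0.42 |
| MEN | 0.11 | 0.21 | 0.26 | 0.26 | 0.50 | 0.50 |
| RareWord | 0.28 | 0.38 | 0.36 | 0.14 | 0.15 | 0.17 |
| MTurk-771 | 0.17 | 0.28 | 0.30 | 0.26 | 0.48 | 0.45 |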
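Table 4: Spearman's rank correlation between the pairwise distances of embeddings A and B.

| A | B | word2vec | NNLM(S) | NNLM(L) |
|---|---|---|---|---|
| In | Out | 0.77 | 0.13 | 0.16 |
| In | Tied | 0.19 | 0.31 | 0.45 |
| Out | Tied | 0.39 | 0.65 | 0.77 |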
A natural question given these results and the analysis in Sec. 3 is whether the word embedding in the weight tied NNLM model is more similar to the input embedding or to the output embedding of the original model. We, therefore, run the following experiment: First, for each embedding, we compute the cosine distances between each pair of words. We then compute Spearman’s rank correlation between these vectors of distances. As can be seen in Tab. 4, the results are consistent with our analysis and the results of Tab. 2 and Tab. 3: for word2vec the input and output embeddings are similar to each other and differ from the tied embedding; for the NNLM models, the output embedding and the tied embeddings are similar, the input embedding is somewhat similar to the tied embedding, and differs considerably from the output embedding.
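The comparison procedure described above can be sketched as follows. The embeddings here are random toy matrices, not trained models: `tied_emb` is constructed as a small perturbation of `output_emb` purely so that the two are similar by design. The rank correlation is computed with a simple NumPy helper that does not average ties, which is an acceptable simplification for continuous-valued distances.

```python
import numpy as np

def pairwise_cosine_distances(E):
    """Cosine distance for every unordered pair of rows of E, as a flat vector."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(E.shape[0], k=1)        # each pair once, no diagonal
    return 1.0 - sims[upper]

def spearman(x, y):
    """Spearman's rank correlation; ties are not averaged."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

rng = np.random.default_rng(0)
vocab, dim = 50, 16                                 # toy sizes (assumptions)
output_emb = rng.normal(size=(vocab, dim))
tied_emb = output_emb + 0.1 * rng.normal(size=(vocab, dim))   # similar by construction
input_emb = rng.normal(size=(vocab, dim))                     # unrelated embedding

d_out, d_tied, d_in = map(pairwise_cosine_distances, (output_emb, tied_emb, input_emb))
rho_out_tied = spearman(d_out, d_tied)
rho_in_out = spearman(d_in, d_out)
```

With real embeddings, each matrix would be restricted to a shared vocabulary in the same row order before the distances are compared.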
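Table 5: Perplexity of large NNLMs on PTB.

| Model | Size | Train | Val. | Test |
|---|---|---|---|---|
| Large [Zaremba et al.2014] | 66M | 37.8 | 82.2 | 78.4 |
| Large + Weight Tying | 51M | 48.5 | 77.7 | 74.3 |
| Large + BD [Gal2015] + WD | 66M | 24.3 | 78.1 | 75.2 |
| Large + BD + WT | 51M | 28.2 | 75.8 | 73.2 |
| RHN [Zilly et al.2016] + BD | 32M | 67.4 | 71.2 | 68.5 |
| RHN + BD + WT | 24M | 74.1 | 68.1 | 66.0 |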
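Table 6: Perplexity of models that do not use dropout on PTB.

| Model | Size | Train | Val. | Test |
|---|---|---|---|---|
| KN 5-gram | | | | 141 |
| RNN | | | | 123 |
| LSTM | | | | 117 |
| Stack RNN | 8.48M | | | 110 |
| FOFE-FNN | | | | 108 |
| Noisy LSTM | 4.65M | | 111.7 | 108.0 |
| Deep RNN | 6.16M | | | 107.5 |
| Small model | 4.65M | 38.0 | 120.7 | 114.5 |
| Small + WT | 2.65M | 36.4 | 117.5 | 112.4 |
| Small + PR | 4.69M | 50.8 | 116.0 | 111.7 |
| Small + WT + PR | 2.69M | 53.5 | 104.9 | 100.9 |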
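Table 7: Perplexity of the small NNLM on text8, IMDB, and BBC.

| Dataset | Split | Small | S + WT | S + PR | S + WT + PR |
|---|---|---|---|---|---|
| text8 | Train | 90.4 | 95.6 | 92.6 | 95.3 |
| text8 | Val. | | | | |
| text8 | Test | 195.3 | 187.1 | 199.0 | 183.2 |
| IMDB | Train | 71.3 | 75.4 | 72.0 | 72.9 |
| IMDB | Val. | 94.1 | 94.6 | 94.0 | 91.2 |
| IMDB | Test | 94.3 | 94.8 | 94.4 | 91.5 |
| BBC | Train | 28.6 | 30.1 | 42.5 | 45.7 |
| BBC | Val. | 103.6 | 99.4 | 104.9 | 96.4 |
| BBC | Test | 110.8 | 106.8 | 108.7 | 98.9 |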
We next study the effect of tying the embeddings on the perplexity obtained by the NNLM models. Following [Zaremba et al.2014], we study two NNLMs. The two models differ mostly in the size of the LSTM layers. In the small model, both LSTM layers contain 200 units and in the large model, both contain 1500 units. In addition, the large model uses three dropout layers, one placed right before the first LSTM layer, one between $h_1$ and $h_2$, and one right after $h_2$. The same dropout probability is used for all three layers. For both the small and large models, we use the same hyperparameters (i.e., weight initialization, learning rate schedule, batch size) as in [Zaremba et al.2014].

In addition to training our models on PTB and text8, following [Miyamoto and Cho2016], we also compare the performance of the NNLMs on the BBC [Greene and Cunningham2006] and IMDB [Maas et al.2011] datasets, each of which we process and split into a train/validation/test split (we use the same vocabularies as [Miyamoto and Cho2016]).
In the first experiment, which was conducted on the PTB dataset, we compare the perplexity obtained by the large NNLM model and our version in which the input and output embeddings are tied. As can be seen in Tab. 5, weight tying significantly reduces perplexity on both the validation set and the test set, but not on the training set. This indicates less overfitting, as expected due to the reduction in the number of parameters. Recently, [Gal and Ghahramani2016] proposed a modified model that uses Bayesian dropout and weight decay, and obtained improved performance. When the embeddings of this model are tied, a similar amount of improvement is gained. We tried this with and without weight decay and got similar results in both cases, with a slight improvement in the latter model. Finally, by replacing the LSTM with a recurrent highway network [Zilly et al.2016], state-of-the-art results are achieved when applying weight tying. The contribution of WT is also significant in this model.
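The size reduction from tying can be sanity-checked with a short calculation: tying removes exactly one embedding matrix. The sketch below assumes a PTB vocabulary of 10,000 words, which is the size of the commonly used PTB split but is not stated in the text. The hidden sizes and reported model sizes are taken from the text and from Tab. 5 and Tab. 6.

```python
def embedding_params(vocab_size, hidden_size):
    # One embedding matrix holds vocab_size x hidden_size weights.
    return vocab_size * hidden_size

PTB_VOCAB = 10_000        # assumption: vocabulary size of the standard PTB split
LARGE_HIDDEN = 1500       # LSTM units per layer in the large model
SMALL_HIDDEN = 200        # LSTM units per layer in the small model

saved_large = embedding_params(PTB_VOCAB, LARGE_HIDDEN)
saved_small = embedding_params(PTB_VOCAB, SMALL_HIDDEN)

# Differences between the untied and tied sizes, in millions of parameters.
reported_large_m = 66 - 51
reported_small_m = round(4.65 - 2.65, 2)

print(f"large: one embedding = {saved_large / 1e6:.1f}M, reported saving = {reported_large_m}M")
print(f"small: one embedding = {saved_small / 1e6:.1f}M, reported saving = {reported_small_m}M")
```

Under this assumption, the computed savings agree with the differences between the untied and tied model sizes in both tables.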
Perplexity results are often reported separately for models with and without dropout. In Tab. 6, we report the results of the small NNLM model, which does not use dropout, on PTB. As can be seen, both WT and projection regularization (PR) improve the results. When combining both methods, state-of-the-art results are obtained. An analogous table for text8, IMDB and BBC is Tab. 7, which shows a significant reduction in perplexity across these datasets when both PR and WT are used. PR does not help the large models, which employ dropout for regularization.
Finally, we study the impact of weight tying in attention-based NMT models, using the DL4MT (https://github.com/nyudl/dl4mttutorial) implementation. We train our EN→FR models on the parallel corpora provided by ACL WMT 2014. We use the data as processed by [Cho et al.2014] using the data selection method of [Axelrod et al.2011]. For EN→DE we train on data from the translation task of WMT 2015, validate on newstest2013 and test on newstest2014 and newstest2015. Following [Sennrich et al.2016b] we learn the BPE segmentation on the union of the vocabularies that we are translating from and to (we use BPE with 89500 merge operations). All models were trained using Adadelta [Zeiler2012] for 300K updates, have a hidden layer size of 1000 and all embedding layers are of size 500.
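Table 8: Size and translation performance of the baseline, decoder weight-tied (Decoder WT), and three-way weight-tied (TWWT) NMT models.

| Language pair | Model | Size | Validation | Test |
|---|---|---|---|---|
| EN→FR | Baseline | 168M | 29.49 | 33.13 |
| EN→FR | Decoder WT | 122M | 29.47 | 33.26 |
| EN→FR | TWWT | 80M | 29.43 | 33.46 |
| EN→DE | Baseline | 165M | 20.96 | 16.79 |
| EN→DE | Decoder WT | 119M | 21.09 | 16.54 |
| EN→DE | TWWT | 79M | 21.02 | 17.15 |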
Tab. 8 shows that even though the weight tied models have about 28% fewer parameters than the baseline models, their performance is similar. This is also the case for the three-way weight tied models, even though they have about 52% fewer parameters than their untied counterparts.
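The relative reductions can be recomputed from the sizes in Tab. 8. As a rough cross-check, the sketch also estimates the size of a single EN→FR decoder embedding from the rounded target-vocabulary counts in Tab. 1 and the embedding size of 500. That estimate is an approximation built on rounded table values, not a figure reported in the text.

```python
# Model sizes from Tab. 8, in millions of parameters.
sizes = {
    "EN-FR": {"baseline": 168, "decoder_wt": 122, "twwt": 80},
    "EN-DE": {"baseline": 165, "decoder_wt": 119, "twwt": 79},
}

def reduction(baseline, tied):
    # Fraction of the baseline's parameters removed by tying.
    return (baseline - tied) / baseline

reductions = {
    pair: {
        "decoder_wt": reduction(s["baseline"], s["decoder_wt"]),
        "twwt": reduction(s["baseline"], s["twwt"]),
    }
    for pair, s in sizes.items()
}

# Rough cross-check for EN-FR: the target vocabulary is about 7K + 85K subwords (Tab. 1).
EMBED_DIM = 500
target_vocab_en_fr = (7 + 85) * 1000
one_decoder_embedding_m = target_vocab_en_fr * EMBED_DIM / 1e6
saved_by_decoder_wt_m = sizes["EN-FR"]["baseline"] - sizes["EN-FR"]["decoder_wt"]

for pair, r in reductions.items():
    print(f"{pair}: decoder WT removes {r['decoder_wt']:.1%}, TWWT removes {r['twwt']:.1%}")
```

The estimated size of one decoder embedding matches the parameters saved by decoder tying for EN→FR, which is consistent with tying removing exactly one embedding matrix.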
[Axelrod et al.2011] Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

[Gal and Ghahramani2016] Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).

[Hill et al.2016] Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

[Maas et al.2011] Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150. Association for Computational Linguistics.

[Mnih and Kavukcuoglu2013] Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.