Deep Neural Networks have become general-purpose solutions for various Natural Language Processing tasks. For example, Transformer-based models [Vaswani et al.2017, Ott et al.2018] are state-of-the-art models for tasks such as Machine Translation, and huge pre-trained Neural Network models [Devlin et al.2018, Yang et al.2019, Lan et al.2019, Raffel et al.2019] have dominated multi-task learning in Natural Language Processing.
Although the design of the inner parts of the networks may vary, every Neural Network model needs modules that map between natural language words/characters/tokens and numerical vectors; these modules are called embeddings.
To relieve the great computational and storage cost brought by the huge embedding weights for input and output vocabularies, recent Neural Network models for text generation (e.g. Language Modeling or Neural Machine Translation models) use input-output shared embedding weights [Press and Wolf2016] with random initialization: when the input and output spaces are the same, the same embedding weights can serve in both the input and output layers, as Figure 1 illustrates. Embedding weight sharing greatly reduces the number of parameters of the model while preserving, and even improving, its performance.
Nevertheless, this paper points out that the weight-sharing method used in recent state-of-the-art Neural Network models has clear shortcomings and can be improved at low computational cost. After examining the existing input-output weight-sharing method with linear algebra and statistical theory, this paper shows that normalization should be applied to input-output shared embedding weights, and that such techniques work well with the shared embedding weight matrix in Neural Network text generation models. Moreover, the methods presented in this paper can be integrated with existing embedding parameter reduction methods, such as Adaptive Embedding/Softmax [Grave et al.2017, Baevski and Auli2018] or Embedding Layer Factorization [Lan et al.2019].
The normalization of word embeddings has been discussed before, for example in [Xing et al.2015], but those works focus on different aspects from this paper (the domain-transfer problem, or the density of word embeddings). In summary, the contributions of this paper are as follows:
1. This paper analyzes in depth and explains the effectiveness of input-output embedding weight sharing, then provides a theory-based method to improve it at trivial cost compared with the rest of the model.
2. The methods this paper presents can be easily integrated with other existing state-of-the-art embedding dimension reduction or parameter saving techniques.
3. In designed experiments, this paper demonstrates the effectiveness of the methods on different kinds of Neural Network models for Natural Language Processing.
The outline of this paper is as follows: Section 1 is the introduction; Section 2 presents the theoretical basis and the practical normalization methods for input-output shared embeddings; Section 3 gives the experiment configurations, results and analysis; Section 4 reviews related work; and Section 5 concludes the paper.
2 Normalization of Input-output Shared Embedding
2.1 Decoder Output Embedding and Input-output Embedding Weights Sharing
Input embedding and output Softmax layers are included in almost all Neural Network models for Natural Language Processing. Generally speaking, an input embedding maps a language token (which can be regarded as a one-hot token vector) to a relatively low-dimensional vector in a continuous numerical space (mostly real Euclidean space). Using $W$ to represent the input embedding matrix and $x$ to represent a one-hot input word vector (no batch size or sequence length is involved), the input embedding does the following computation:
$$e = W x$$
$W$ has the size of $d \times V$, $x$ has the size of $V \times 1$ (all 1D vectors are column vectors by default in this paper) and the generated embedding vector $e$ has the size of $d \times 1$. $V$ is the vocabulary size, which is often big (e.g. tens of thousands), and $d$ is the input embedding dimension.
At the output layer, suppose $y$ is the output embedding vector generated by the inner parts of the decoder for an output token; the output Softmax layer does the following computations to find the most probable token:
$$p = \mathrm{softmax}(U y + b), \qquad \hat{k} = \arg\max_i p_i$$
$U$ is a $V \times d'$ matrix for the Softmax layer kernel, and $b$ is a vector with size $V$ for the layer bias. $d'$ is the size of the output embedding vector.
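The two layers described above can be sketched in a few lines of NumPy. This is only a toy illustration: the sizes `V = 6` and `d = 4`, the random weights, and the names `W`, `U`, `b`, `x`, `y` are assumptions chosen here, and `y` is a random stand-in for the vector a real decoder would produce.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

V, d = 6, 4                      # toy vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))      # input embedding matrix, one column per token
U = rng.normal(size=(V, d))      # output Softmax kernel (not shared in this sketch)
b = np.zeros(V)                  # output Softmax bias

token = 2
x = np.zeros(V)
x[token] = 1.0                   # one-hot vector of the input token
e_vec = W @ x                    # input embedding: selects column `token` of W

y = rng.normal(size=d)           # stand-in for the decoder's output embedding vector
p = softmax(U @ y + b)           # distribution over the vocabulary
predicted = int(np.argmax(p))    # most probable token
```

Multiplying by a one-hot vector is the same as looking up one column of `W`, which is how embedding layers are usually implemented in practice.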
In some text generation Natural Language Processing tasks, the input and output token spaces are the same, e.g. language modeling, text summarization, and machine translation (when using a shared vocabulary). In this situation, if the output embedding vector and the input embedding vector have the same size ($d' = d$), it can be asserted that the space of output embedding vectors is isomorphic to the input embedding space. That is, given an output embedding vector $y$, it should be a non-negative $\ell_1$-normalized linear combination of the input embedding vectors $W_i$ (note that $W_i$ is the $i$-th column of matrix $W$):
$$y = \sum_{i=1}^{V} a_i W_i = W a, \qquad a_i \ge 0, \quad \sum_{i=1}^{V} a_i = 1$$
$a$ is a probability distribution (it is non-negative and normalized) whose component $a_i$ gives the probability that the feature vector $y$ should generate the $i$-th token for output, which is just the conditional probability of the output given the input. Apparently, Softmax layers can generate $a$, and a model can also use an unnormalized score vector $\hat{a}$ to estimate it, as the Softmax transformation is isotone (order-preserving). Because the embedding dimension $d$ is always smaller than the vocabulary size $V$, solving for $a$ means solving an underdetermined system of linear equations. Therefore, additional constraints need to be introduced into this problem. Using a priori knowledge, one can confidently assume that the real distribution $a$ should be sparse (only a small number of tokens are appropriate for output at a given position of a sentence), and the whole optimization problem is:
$$\min_a \|a\|_0 \quad \text{s.t.} \quad W a = y, \; a \ge 0, \; \|a\|_1 = 1$$
Or its relaxed convex version:
$$\min_a \|a\|_1 \quad \text{s.t.} \quad W a = y, \; a \ge 0$$
But this convex version is still computationally expensive to solve directly. As it is for the output layer of a big Neural Network, it is natural to hope that the estimation can be computed with simplified computations. Because Neural Network models just use the estimations for Softmax/Argmax classification, only the order of the $a_i$ (the components of $a$ on each dimension, which are the estimated scores for each token) determines the output. Thus, one can instead look for isotone estimations $\hat{a}$ of $a$, which have the property:
$$a_i \ge a_j \;\Rightarrow\; \hat{a}_i \ge \hat{a}_j$$
And the estimations are expected to be computed in a linear way, like a Fully Connected Layer:
$$\hat{a} = U y + b$$
where $U$ is a matrix having the same size as $W^T$ (it serves as the kernel of the Fully Connected Layer), and $b$ is the bias vector of the Fully Connected Layer. The most direct method is learning $U$ and $b$ from scratch. However, the size of $U$ is huge, so weight sharing methods attempt to bind $U$ and $W$ together with simple operations. So far, the most widely used, and almost the only used, setting for $U$ and $b$ is $U = W^T$ and $b = 0$ (it is the baseline in this paper’s experiments). It was introduced in [Press and Wolf2016] and has achieved good performance, and has therefore been adopted by many state-of-the-art Natural Language Processing models. The next subsection discusses how to improve this method with simple and low-cost computations.
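As a rough illustration of the baseline sharing scheme, the sketch below computes output scores with the transposed input embedding matrix and a zero bias, and counts the parameters saved. The function names and the sizes (a 32K vocabulary with 1024-dimensional embeddings) are assumptions made for this example only.

```python
import numpy as np

def tied_scores(W, y):
    """Baseline weight sharing: kernel = W^T and bias = 0, so scores are W^T y."""
    return W.T @ y

def embedding_params(V, d, tied):
    """Count input-embedding plus output-layer parameters."""
    input_params = V * d
    output_params = 0 if tied else V * d + V   # separate kernel and bias when untied
    return input_params + output_params

V, d = 32000, 1024   # assumed sizes for illustration
saved = embedding_params(V, d, tied=False) - embedding_params(V, d, tied=True)

W = np.arange(12, dtype=float).reshape(3, 4)   # tiny example with d = 3, V = 4
y = np.array([1.0, 0.0, -1.0])
scores = tied_scores(W, y)                     # one score per vocabulary entry
```

With these assumed sizes, sharing removes one full `V x d` kernel and its bias from the model.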
2.2 Normalization in Embedding Weight Matrix
First, let us see specifically what the baseline method computes. Still assume that $a$ is the real distribution behind $y$, i.e. $y = W a$; therefore,
$$\hat{a}_i = W_i^T y = W_i^T \sum_{j} a_j W_j = a_i \|W_i\|_2^2 + \sum_{j \ne i} a_j W_i^T W_j$$
Recall that $W_i$ is the $i$-th column of $W$ (the input embedding vector of the $i$-th token). As the target of Neural Network models is to find $\arg\max_i a_i$, the estimation of the largest $a_i$ matters most. Let $k = \arg\max_i a_i$, then
$$\hat{a}_k \approx a_k \|W_k\|_2^2$$
The approximation is based on the fact that $a$ is sparse and $a_k$ is much larger than the other components. Obviously, using the baseline method, the estimation of $a_k$ is biased: for a token with a small embedding vector norm it is under-estimated, and for a token with a large embedding vector norm it is over-estimated. To solve this problem, this paper designs several improved methods with small additional computation cost; the first two are embedding normalization methods.
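The norm bias can be seen in a hand-built example. In the sketch below the embedding columns are chosen to be orthogonal (an assumption that removes the cross terms entirely), so the baseline score of token i is exactly its true weight times the squared norm of its embedding; the matrix and the distribution `a` are made up for illustration.

```python
import numpy as np

# Columns are the token embeddings; token 1 has norm 2, the others norm 1.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([0.6, 0.4, 0.0])       # true distribution: token 0 is the most probable

y = W @ a                           # output vector as a combination of embeddings
baseline_scores = W.T @ y           # baseline sharing: scores = W^T y
sq_norms = np.sum(W ** 2, axis=0)   # squared column norms

true_best = int(np.argmax(a))
baseline_best = int(np.argmax(baseline_scores))
```

Here the baseline picks token 1 even though token 0 has the larger true weight, purely because token 1 has the larger embedding norm.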
2.2.1 $\ell_2$-normalized Input Embedding Weight Matrix
The most direct way to get rid of the norm problem is to use a column-wise $\ell_2$-normalized input embedding matrix $W$ having the structure:
$$W = \left[ \frac{w_1}{\|w_1\|_2}, \frac{w_2}{\|w_2\|_2}, \dots, \frac{w_V}{\|w_V\|_2} \right]$$
where $w_i$ denotes the unnormalized $i$-th embedding vector. And $U$ still remains the same as $W^T$. Using an $\ell_2$-normalized embedding matrix leads to an unbiased estimation of $a_k$. The proof is as follows: for $W_i$ and $W_j$ which are i.i.d. uniformly distributed on the unit sphere, and $i \ne j$,
$$E[W_i^T W_j] = 0$$
because, due to symmetry, it is easy to prove that the surface integral
$$\oint W_i^T W_j \, dS = 0$$
when $i \ne j$. Therefore, from equation (17),
$E[\hat{a}_k] = a_k$. The variance of $\hat{a}_k$ will be small when is .
One problem with this method is that it discards the diversity of the norms of the embedding vectors, which also carries semantic information.
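A minimal sketch of column-wise normalization follows, assuming the embedding matrix stores one token per column. The helper name `l2_normalize_columns`, the `eps` guard against zero columns, and the toy matrix are assumptions for this example, not details taken from the paper's implementation.

```python
import numpy as np

def l2_normalize_columns(W, eps=1e-12):
    """Scale every column of W to unit l2 norm."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W / np.maximum(norms, eps)

W_raw = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])
W = l2_normalize_columns(W_raw)   # shared by the input and output layers
a = np.array([0.6, 0.4, 0.0])     # assumed true distribution

y = W @ a                         # output vector built from normalized embeddings
scores = W.T @ y                  # shared output kernel is W^T
col_norms = np.linalg.norm(W, axis=0)
```

In this orthogonal toy case the scores recover `a` exactly; with non-orthogonal columns the cross terms remain, and only their expectation vanishes under the uniformity assumption used in the proof.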
2.2.2 Square-normalized Output Embedding
If we let the $i$-th row of $U$ be $U_i = W_i^T / \|W_i\|_2^2$, then $\hat{a}_k$ equals:
$$\hat{a}_k = \frac{W_k^T y}{\|W_k\|_2^2} = a_k + \sum_{j \ne k} a_j \frac{W_k^T W_j}{\|W_k\|_2^2} \approx a_k$$
when is . By scaling the output embedding kernel matrix by the squared norm, the estimation is also unbiased, and the value of $\hat{a}_k$ will never be over-estimated because of a large embedding vector norm. However, this method severely over-estimates the $\hat{a}_i$ of tokens with a small embedding vector norm, as it does not consider the non-negative $\ell_1$-normalized constraints on $a$, which means that $\hat{a}_i$ gets no penalty, and is even encouraged for output, when it is over 1.
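The sketch below illustrates both sides of square normalization, under the assumption that each score is the dot product of a token embedding with the output vector divided by the squared norm of that embedding. The function name, the `eps` guard, and both toy matrices are made up for this example.

```python
import numpy as np

def square_normalized_scores(W, y, eps=1e-12):
    """Score of token i: (W_i . y) / ||W_i||^2, with W_i the i-th column of W."""
    sq_norms = np.sum(W ** 2, axis=0)
    return (W.T @ y) / np.maximum(sq_norms, eps)

# Case 1: orthogonal columns with different norms; the true weights are recovered.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([0.6, 0.4, 0.0])
scores = square_normalized_scores(W, W @ a)

# Case 2: token 1 points in the same direction as token 0 but has a tiny norm.
W_small = np.array([[1.0, 0.1],
                    [0.0, 0.0]])
y = W_small[:, 0]                              # output equals the embedding of token 0
small_scores = square_normalized_scores(W_small, y)
```

In the second case the low-norm token receives a score of about 10 against 1 for the correct token, which is the over-estimation of small-norm tokens described above.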
For comparison, and as evidence for the effectiveness of Embedding Normalization, this paper introduces 2 more methods of sharing input and output embeddings. These 2 methods are similar to embedding normalization, but are based on different principles.
2.2.3 Calculate Distances for Estimation
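Another way to find the $k$ with the biggest $a_k$ is to calculate the "similarities" between $y$ and $W_i$. A scale-sensitive method is to calculate the $\ell_2$-distance:
$$D_i = \|y - W_i\|_2$$
For smoothness, its square is used:
$$D_i^2 = \|y\|_2^2 - 2 W_i^T y + \|W_i\|_2^2$$
As $\|y\|_2^2$ is the same for each $i$, and we would like to find the minimum of $D_i^2$, the $\hat{a}_i$ can take the following form:
$$\hat{a}_i = W_i^T y - \tfrac{1}{2} \|W_i\|_2^2$$
and it can be implemented by just setting $U = W^T$ and $b_i = -\tfrac{1}{2} \|W_i\|_2^2$.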
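The equivalence between nearest-embedding search and a linear layer can be checked numerically. The sketch assumes the score of token i is its dot product with the output vector minus half its squared norm; the factor one half is an assumption of this example (doubling the kernel and bias gives the same ranking), as are the random toy data and names.

```python
import numpy as np

def distance_scores(W, y):
    """Linear-layer form: kernel W^T, bias -0.5 * ||W_i||^2 per token."""
    return W.T @ y - 0.5 * np.sum(W ** 2, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 50))        # 50 toy tokens with 8-dimensional embeddings
y = rng.normal(size=8)              # stand-in decoder output vector

sq_dist = np.sum((W - y[:, None]) ** 2, axis=0)   # squared l2 distance to each token
scores = distance_scores(W, y)

nearest = int(np.argmin(sq_dist))
best = int(np.argmax(scores))
```

The token with the smallest squared distance is the one with the largest linear score, because the two differ only by the constant squared norm of `y` and a factor of minus two.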
2.2.4 Calculate Cosine Similarity for Estimation
This method reviews the value of $y$, and holds the view that the scale of $y$ may be unimportant, as the signal generated by the Neural Network layers can be amplified or reduced, and the most vital information is in its direction. Therefore, this method computes the cosine similarities between $y$ and $W_i$:
$$\hat{a}_i = \cos(y, W_i) = \frac{W_i^T y}{\|W_i\|_2 \, \|y\|_2}$$
It relaxes the constraint of and attempts to find the $k$ of:
$$k = \arg\max_i \cos(y, W_i)$$
If $y$ is scale-free, it will get better performance.
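A small sketch of cosine-similarity scoring, showing that rescaling the output vector does not change the scores. The function name, the `eps` guard, and the random toy data are assumptions for this illustration.

```python
import numpy as np

def cosine_scores(W, y, eps=1e-12):
    """Cosine similarity between y and every column of W."""
    denom = np.linalg.norm(W, axis=0) * np.linalg.norm(y)
    return (W.T @ y) / np.maximum(denom, eps)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 50))        # 50 toy tokens with 8-dimensional embeddings
y = rng.normal(size=8)              # stand-in decoder output vector

scores = cosine_scores(W, y)
scores_scaled = cosine_scores(W, 5.0 * y)   # same direction, larger magnitude
```

Because only the direction of `y` matters, the ranking of tokens is identical for `y` and any positive multiple of it.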
To theoretically evaluate these estimation methods, the following 3 important properties are proposed, which a good estimation of $a$ should have, but which the original embedding weight sharing method does not preserve.
1. Identity: if $y = W_k$, then $\arg\max_i \hat{a}_i = k$. This property means that, if the output embedding vector is the same as one input embedding vector, the output layer should return the corresponding token as output. The original embedding sharing method does not have this property. $\ell_2$-normalized Input Embedding Weight Matrix and Calculate Cosine Similarity for Estimation have this property, and the other 2 methods have it when there exists no and that (in almost all practical circumstances this is true).
2. Normality: , therefore always represents a probability. Only $\ell_2$-normalized Input Embedding Weight Matrix has this property.
3. Unbiased: $E[\hat{a}_k] = a_k$. The 2 Embedding Normalization methods both have this property.
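The Identity property can be checked mechanically on a toy vocabulary. The sketch below feeds each token embedding back in as the output vector and tests whether that token gets the top score; the three-token matrix and the helper names are assumptions for this example, chosen so that one embedding has a much larger norm than a nearly parallel one.

```python
import numpy as np

def baseline_scores(W, y):
    """Plain shared kernel: scores = W^T y."""
    return W.T @ y

def cosine_scores(W, y):
    """Cosine similarity between y and every column of W."""
    return (W.T @ y) / (np.linalg.norm(W, axis=0) * np.linalg.norm(y))

def satisfies_identity(score_fn, W):
    """True if, for every token k, y = W_k makes token k the top-scoring token."""
    return all(int(np.argmax(score_fn(W, W[:, k]))) == k for k in range(W.shape[1]))

# Token 1 is almost parallel to token 0 but three times longer.
W = np.array([[1.0, 3.0, 0.0],
              [0.0, 0.1, 2.0]])
W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)   # l2-normalized columns

baseline_ok = satisfies_identity(baseline_scores, W)          # original sharing
normalized_ok = satisfies_identity(baseline_scores, W_unit)   # l2-normalized input embedding
cosine_ok = satisfies_identity(cosine_scores, W)              # cosine similarity
```

On this example the original sharing fails the check, since the longer token 1 outscores token 0 on token 0's own embedding, while the normalized and cosine variants pass.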
Finally, the additional computational cost of the 4 methods all comes from computing the column-wise norms of $W$, which has a time complexity of $O(Vd)$, the same as the original output Softmax layer. Therefore, it is still trivial compared with the inner layers of Neural Network models.
In the following experimental sections, this paper shows the specific evaluation results and provides analysis for the 4 input-output embedding weight sharing methods.
3 Experiments and Results on Machine Translation Task
This section presents the settings, configurations, results and analysis of the experiments this paper carries out to validate the effectiveness of the embedding normalization methods. The experiments are based on the Machine Translation task, because it is one of the most important text generation tasks and has clear benchmarks. The techniques can also be applied to other text generation tasks, as long as input-output shared embedding is included.
3.1 Experimental Settings
The embedding normalization methods designed in this paper are not restricted to one particular kind of model, but are compatible with various kinds of Neural Network models with input-output shared embedding. This paper implements the methods on 2 kinds of state-of-the-art models for the Neural Machine Translation task, Transformer [Vaswani et al.2017] and DynamicConv [Wu et al.2019], on the following datasets: WMT 16’ En-De, and IWSLT 14’ En-De, En-Es, and En-Ro. The Transformer model is trained and tested on all datasets used, and the DynamicConv model is trained and tested on the WMT 16’ En-De and IWSLT 14’ En-De datasets.
On the WMT 16’ En-De dataset, this paper follows the same settings as [Vaswani et al.2017, Ott et al.2018]: it uses 4.5M sentence pairs for training, validates on newstest2013 and tests on newstest2014. The vocabulary has 32K symbols generated by sentencepiece [Kudo and Richardson2018] and BPE [Sennrich et al.2015].
On the IWSLT 14’ datasets, this paper uses English as the target language, and German (De), Spanish (Es) and Romanian (Ro) as source languages. For the dataset splits, in the De-En translation task the paper follows the same settings as [Edunov et al.2017] and [Xia et al.2019], and in the Es-En and Ro-En tasks, this paper uses as valid set and combines for test set.
3.1.2 Model Configurations
For the Transformer model on the WMT’16 En-De dataset, this paper applies the Transformer-big configuration described in [Vaswani et al.2017], which has 6 blocks in both the encoder and decoder, an embedding dimension of 1024, a hidden layer size of 4096, and 16 heads in each multi-head attention block. The label smoothing rate is 0.1. The training configuration follows [Ott et al.2018]: this paper simulates a 128-GPU training process with half-precision and batch size, 0.3 dropout rate, and an Adam optimizer with , , and . The learning rate scheduler is the same as in [Vaswani et al.2017], with the peak learning rate of .
For the Transformer model on the IWSLT 14’ datasets, this paper applies a Transformer model which has 6 blocks in both the encoder and decoder, an embedding dimension of 512, a hidden layer size of 1024, and 4 heads in each multi-head attention block. The label smoothing rate is 0.1. The training configuration is almost the same as for the Transformer on the WMT dataset, except that a weight-decay rate of is used. Default configurations for IWSLT 14’ translation Transformer models use separate input and output vocabularies, but the De-En translation model in this paper uses a joined vocabulary, like the models on the WMT 16’ En-De dataset, because this paper finds that this improves BLEU. Therefore, in IWSLT 14’ De-En, all embedding weights are shared, and in Es-En and Ro-En only the decoder embedding weights are shared.
For DynamicConv Models, as the model structures are similar between Transformer and DynamicConv models, the model configurations are also nearly the same between the models on corresponding datasets, except that DynamicConv models have 7 blocks in encoder parts. The training configuration is the same as [Wu et al.2019].
3.2.1 Evaluation Method
Like other Neural Machine Translation experiments, this paper uses BLEU (computed with the multi-bleu.perl script, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) as the evaluation metric for the quality of the generated translations.
The translations from the trained models are generated with beam search. For the Transformer-big model on the WMT 16’ En-De dataset, texts are generated with a beam width of 4 and a length penalty of 0.6. For Transformer models on the IWSLT 14’ datasets, the beam width is set to 5 and the length penalty is set to 1. For the DynamicConv model on the WMT 16’ En-De dataset, texts are generated with a beam width of 5 and a length penalty of 0.5. For DynamicConv models on the IWSLT 14’ datasets, the beam width is set to 4 and the length penalty is set to 1. For the Transformer and DynamicConv models on the WMT 16’ En-De dataset, this paper averages the last 10 checkpoints for testing.
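Checkpoint averaging is usually a parameter-wise mean over saved model states. The sketch below shows that idea on plain dictionaries of NumPy arrays; the function name, the dictionary layout, and the toy checkpoints are hypothetical and do not reproduce the tooling used for the experiments.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Parameter-wise mean of a list of {name: array} checkpoint dictionaries."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Ten toy checkpoints whose parameters are filled with 1.0, 2.0, ..., 10.0.
ckpts = [{"embed": np.full((2, 3), float(i)), "bias": np.array([float(i)])}
         for i in range(1, 11)]
avg = average_checkpoints(ckpts)
```

Every averaged parameter keeps its original shape, so the result can be loaded in place of a single checkpoint.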
3.2.2 Experimental Results
Tables 1 and 2 show the BLEU scores of Transformer-big and DynamicConv models on the WMT 16’ En-De dataset with different embedding sharing configurations (Baseline is the original sharing method). The models gain BLEU improvements from each of the 4 designed methods, and using $\ell_2$-normalized input embedding gets the best result for both models on the WMT 16’ En-De dataset. As analyzed before, the embedding sharing method with $\ell_2$-normalized input embedding has good properties: it can give normalized and unbiased estimations of the tokens’ output scores. For the Transformer-big model, a 0.6 BLEU improvement is achieved, and for the DynamicConv model, a 0.5 BLEU improvement is achieved.
Figure 2 is a histogram of the $\ell_2$-norms of the embedding vectors of language tokens in a baseline Transformer-big model on the WMT 16’ En-De dataset. The histogram illustrates that the embedding vectors spread out during training and end up with varied norms, which affects the quality of the output scores with the original output embedding.
|$\ell_2$-normalized input embedding||29.9|
|Square-normalized output embedding||29.4|
|$\ell_2$-normalized input embedding||30.2|
|Square-normalized output embedding||30.0|
For Transformer models on the IWSLT 14’ translation tasks, the 4 new weight sharing methods also outperform the baseline. Table 3 shows the BLEU results. In this setting, the square-normalized output embedding method is the best weight sharing method, achieving BLEU improvements of 0.86/0.73/0.60 on the German/Spanish/Romanian to English translation tasks. The Transformer model for the IWSLT 14’ translation tasks has a smaller embedding size. As normalizing the whole embedding matrix removes some degrees of freedom from the embedding vectors, it has a greater impact on embedding spaces with lower dimensions. Therefore, the $\ell_2$-normalized input embedding method cannot perform as well as on the WMT 16’ dataset.
|$\ell_2$-normalized input embedding||35.00|
|Square-normalized output embedding||35.16|
|$\ell_2$-normalized input embedding||34.65|
|Square-normalized output embedding||34.97|
|$\ell_2$-normalized input embedding||37.09|
|Square-normalized output embedding||37.50|
For DynamicConv models on the IWSLT 14’ De-En translation task, Table 4 gives the BLEU scores of the different embedding weight sharing methods. The methods perform similarly, and the method of calculating cosine similarities is slightly better than the others, with a 0.41 BLEU improvement over the baseline.
|l2-normalized input embedding||34.86|
|Square-normalized output embedding||34.74|
From all groups of experiments and their results, it can be concluded that the improved input-output embedding sharing methods work well on state-of-the-art Neural Machine Translation models, and that the methods based on normalizing input-output shared embeddings perform best in most scenarios. The statistical analysis has shown that these methods are unbiased, and that the $\ell_2$-normalized Input Embedding Weight Matrix method is additionally normalized, which contributes to their effectiveness.
Normalization of input-output shared embedding is indeed useful for building a better text generation Neural Network model. The normalization methods scale the embedding vectors and make the parts of the Neural Network models more interpretable: the input and output embeddings map between language token spaces and continuous numerical spaces with semantics, the encoder encodes features with the embedding vectors, and the decoder provides output vectors in the embedding space.
4 Related Works
With the fast growth of computational power and network training techniques, very large Neural Networks with huge numbers of parameters have gained the advantage and become state-of-the-art models in various Natural Language Processing tasks. Based on the self-attention Transformer [Vaswani et al.2017], BERT [Devlin et al.2018] and its descendants [Liu et al.2019, Lan et al.2019, Raffel et al.2019] dominate multi-task learning in Natural Language Processing. Due to the great number of words and tokens in natural language, each such model needs to spend a portion of its parameters on building mappings between language words/tokens and numerical tensors, which are also known as input and output embeddings. This need contributes a lot to the computational cost and storage requirements of the models.
To deal with this problem, one method is to reduce the structure of the embeddings. [Raunak2017, Acharya et al.2019] present effective methods to compress input embeddings, thereby lowering the number of parameters needed in input embeddings. Their shortcoming is that these methods are data-based, which means that intensive training of full models is still needed. [Lan et al.2019] directly factorizes the input embedding layer and can reduce its parameters to any scale. For output mapping, taking the long-tail distribution of words in natural languages into account, [Grave et al.2017] designs a hierarchical Softmax layer for language models, called Adaptive Softmax, which saves a great portion of the parameters in the output layer.
Another kind of parameter reduction technique does not adjust the structure of Neural Networks, but creates parameter sharing patterns in networks. [Xia et al.2019, Lan et al.2019] are examples of sharing parameters. [Baevski and Auli2018] proposed Adaptive Embedding, which, like Adaptive Softmax, uses different numbers of parameters to map word clusters of different frequencies. When Adaptive Embedding and Adaptive Softmax are used together, the parameters can be shared between them in a way.
In the Neural Machine Translation task, traditional Neural Network solutions are LSTM networks with an attention mechanism, or Convolutional Networks. Recently, the self-attention based Transformer network has shown its advantages: benefiting from optimized training schemes and data augmentation techniques such as back translation [Edunov et al.2018], its translation results are of superior quality. Some improvements and derivations of the Transformer have also come out. [Dai et al.2019] expands the context width of the Transformer to extract more features from context. [Wu et al.2019] proposes new operations, Light Convolution and Dynamic Convolution, as substitutes for self-attention in the Transformer, and presents competitive test results in the experiments. [Gu et al.2019] is a Transformer-based model in the non-autoregressive paradigm of Neural Machine Translation.
5 Conclusion
To decrease the number of parameters in Natural Language Processing models, the simple technique of using the same input and output embedding weight matrix is widely adopted by recent state-of-the-art models, and it preserves or improves the performance of text generation models.
To further optimize its effectiveness, this paper analyzes why the embedding sharing technique works, explores its shortcomings, and presents improvements and adjustments to it based on theoretical analysis. By applying normalization methods to embedding weight matrices, the bias of the estimations of output scores is eliminated. This paper’s work can be applied to various Neural Networks in Natural Language Processing, as long as input-output embedding sharing is included.
In the experiments on various datasets with 2 kinds of Neural Machine Translation models, the improved methods this paper designs are consistently effective, while barely increasing the training and inference time of the models.
- [Acharya et al.2019] Anish Acharya, Rahul Goel, Angeliki Metallinou, and Inderjit Dhillon. Online embedding compression for text classification using low rank matrix factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6196–6203, 2019.
- [Baevski and Auli2018] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
- [Dai et al.2019] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [Edunov et al.2017] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956, 2017.
- [Edunov et al.2018] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381, 2018.
- [Grave et al.2017] Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1302–1310. JMLR. org, 2017.
- [Gu et al.2019] Jiatao Gu, Changhan Wang, and Jake Zhao. Levenshtein transformer. arXiv preprint arXiv:1905.11006, 2019.
- [Kudo and Richardson2018] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
- [Lan et al.2019] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
- [Liu et al.2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [Ott et al.2018] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.
- [Press and Wolf2016] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
- [Raffel et al.2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- [Raunak2017] Vikas Raunak. Simple and effective dimensionality reduction for word embeddings. arXiv preprint arXiv:1708.03629, 2017.
- [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- [Sergey et al.2017] Edunov Sergey, Ott Myle, and Gross Sam. Fairseq. https://github.com/pytorch/fairseq, 2017.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- [Wu et al.2019] Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.
- [Xia et al.2019] Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5466–5473, 2019.
- [Xing et al.2015] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, 2015.
- [Yang et al.2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.