Normalization of Input-output Shared Embeddings in Text Generation Models

01/22/2020 ∙ by Jinyang Liu, et al. ∙ University of California, Riverside

Neural Network based models have been the state of the art for various Natural Language Processing tasks; however, the input and output dimension problem in these networks has still not been fully resolved, especially in text generation tasks (e.g. Machine Translation, Text Summarization), in which both input and output have huge vocabularies. To address this, input-output embedding weight sharing has been introduced and widely adopted, yet it still leaves room for improvement. Based on linear algebra and statistical theory, this paper locates the shortcoming of the existing input-output embedding weight sharing method, then proposes methods for improving input-output shared embeddings, among which normalization of the embedding weight matrix shows the best performance. These methods are nearly free of computational cost, can be combined with other embedding techniques, and show good effectiveness when applied to state-of-the-art Neural Network models. For Transformer-big models, the normalization techniques gain at best a 0.6 BLEU improvement over the original model on the WMT'16 En-De dataset, and similar BLEU improvements on IWSLT 14' datasets. For DynamicConv models, a 0.5 BLEU improvement is attained on the WMT'16 En-De dataset, and a 0.41 BLEU improvement is achieved on the IWSLT 14' De-En translation task.


1 Introduction

Deep Neural Networks have become the general solution for various Natural Language Processing tasks. For example, Transformer-based models [Vaswani et al.2017, Ott et al.2018] have been state-of-the-art models for tasks such as Machine Translation, and huge pre-trained Neural Network models [Devlin et al.2018, Yang et al.2019, Lan et al.2019, Raffel et al.2019] have dominated multi-task learning in Natural Language Processing.

Although the design of the inner parts of the networks may vary, every Neural Network model needs modules for mapping between natural language words/characters/tokens and numerical vectors, which are called embeddings.

To relieve the great computational cost and storage requirement brought by the huge embedding weights for the input and output vocabularies, recent Neural Network models for text generation (e.g. Language Modeling or Neural Machine Translation models) use input-output shared embedding weights [Press and Wolf2016]: when the input and output token spaces are the same, a single (randomly initialized) embedding weight matrix can serve in both the input and output layers, as figure 1 illustrates. The embedding weight sharing method greatly reduces the number of parameters of the model, while preserving and even raising its performance.

Figure 1: Neural network models with unshared and shared input-output embeddings. Model (a) on the left has separate input and output embeddings, and model (b) on the right shares the embedding weight matrix between the input and output layers.

Nevertheless, this paper points out that the weight-sharing method used in recent state-of-the-art Neural Network models has obvious shortcomings and can be improved at low computational cost. After examining the existing input-output weight-sharing method with linear algebra and statistical theory, this paper shows that normalization should be applied to the input-output shared embedding weights, and that this kind of technique works well with the shared embedding weight matrix in Neural Network text generation models. Moreover, the methods presented in this paper can also be well integrated with existing embedding parameter reduction methods, such as Adaptive Embedding/Softmax [Grave et al.2017, Baevski and Auli2018] or Embedding Layer Factorization [Lan et al.2019].

The normalization of word embeddings has been discussed before, e.g. in [Xing et al.2015], but those works focus on different aspects from this paper (the domain-transfer problem, or the density of word embeddings). In summary, the contributions of this paper are as follows:

1. This paper analyzes in depth and explains the effectiveness of input-output embedding weight sharing, then provides a theory-based method to improve it at trivial cost compared with the rest of the model.

2. The methods this paper presents can be simply integrated with other existing state-of-the-art embedding dimension reduction or parameter saving techniques.

3. In designed experiments, this paper demonstrates the effectiveness of the methods on different kinds of Neural Network models for Natural Language Processing.

The outline of this paper is as follows: Section 1 is the introduction, section 2 presents the theoretical basis and the practical normalization methods for input-output shared embeddings, section 3 describes the experiment configurations, results and analysis, section 4 reviews related works, and section 5 concludes the paper.

2 Normalization of Input-output Shared Embedding

2.1 Decoder Output Embedding and Input-output Embedding Weights Sharing

Input embedding and output Softmax layers are included in almost all Neural Network models for Natural Language Processing. Generally speaking, an input embedding maps a language token (regarded as a one-hot token vector) to a relatively low-dimensional vector in a continuous numerical space (mostly a real Euclidean space). Using $W$ to represent the input embedding matrix and $x$ to represent a one-hot input word vector (no batch size or sequence length is involved), the input embedding does the following computation:

$e = W x$    (1)

$W$ has size $d \times V$, $x$ has size $V \times 1$ (all 1D vectors are column vectors by default in this paper), and the generated embedding vector $e$ has size $d \times 1$. $V$ is the vocabulary size, which is often large (e.g. tens of thousands), and $d$ is the input embedding dimension.
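As a small illustration of equation (1) (a toy sketch with made-up sizes, not taken from the paper's code), the matrix-vector product with a one-hot vector is exactly a lookup of one column of $W$:

import torch

V, d = 10, 4                    # toy vocabulary size and embedding dimension
W = torch.randn(d, V)           # columns w_1 ... w_V are the token embeddings

token_id = 3
x = torch.zeros(V)
x[token_id] = 1.0               # one-hot input word vector

e_matmul = W @ x                # e = W x, equation (1)
e_lookup = W[:, token_id]       # the equivalent (and much cheaper) column lookup
assert torch.allclose(e_matmul, e_lookup)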

At the output layer, suppose $h$ is the output embedding vector generated by the inner parts of the decoder for an output token. The output Softmax layer does the following computations to find the most probable token:

$z = K h + b$    (2)

$\hat{p} = \mathrm{softmax}(z)$    (3)

$y = \arg\max_i \hat{p}_i$    (4)

$K$ is a $V \times d_o$ matrix serving as the Softmax layer kernel, and $b$ is a vector of size $V$ for the layer bias. $d_o$ is the size of the output embedding vector $h$.

In some text generation Natural Language Processing tasks, the input and output token spaces are the same, e.g. language modeling, text summarization, and machine translation (when using a shared vocabulary). In this situation, if the output embedding vector and the input embedding vectors have the same size ($d_o = d$), it can be asserted that the space of output embedding vectors is isomorphic to the input embedding space. That is, a given output embedding vector $h$ should be a non-negative $\ell_1$-normalized linear combination of the input embedding vectors (note that $w_i$ is the $i$-th column of matrix $W$):

$h = \sum_{i=1}^{V} p_i w_i$    (5)

in which $p = (p_1, \dots, p_V)^\top$ is a probability distribution (it is non-negative and normalized) whose component $p_i$ gives the probability that the feature vector $h$ should generate the $i$-th token for output, which is just the conditional probability of the output token given the input. Apparently, Softmax layers can generate an estimate of $p$, and a model can also use the pre-Softmax score vector $z$ to estimate $p$, as the Softmax transformation is isotone. Because the embedding dimension $d$ is always smaller than the vocabulary size $V$, solving equation (5) for $p$ is an underdetermined linear system. Therefore, additional constraints need to be introduced. Using a priori knowledge, one can confidently assume that the real distribution $p$ is sparse (only a small number of tokens are appropriate for output at a certain position of a sentence), so the whole optimization problem is:

$\min_{p} \|p\|_0$    (6)

$\text{s.t. } W p = h$    (7)

$p_i \ge 0, \; i = 1, \dots, V$    (8)

$\sum_{i=1}^{V} p_i = 1$    (9)

Or its relaxed convex version:

$\min_{p} \|p\|_1$    (10)

$\text{s.t. } W p = h$    (11)

$p_i \ge 0, \; i = 1, \dots, V$    (12)

$\sum_{i=1}^{V} p_i \le 1$    (13)

But this convex program is still too computationally expensive to solve directly. As it sits in the output layer of a large Neural Network, it is natural to hope that the estimation can be computed with simplified computations. Because Neural Network models only use the estimations for Softmax/Argmax classification, only the order of the components of $p$ (the estimation score for each token) determines the output. Thus, one can instead compute an isotone estimation $\hat{s}$ of $p$, which has the property:

$p_i > p_j \iff \hat{s}_i > \hat{s}_j$    (14)

And the estimation is expected to be computed in a linear way, like a Fully Connected Layer:

$\hat{s} = K h + b$    (15)

where $K$ is a matrix with the same size as $W^\top$ (it serves as the kernel of the Fully Connected Layer), and $b$ is the bias vector of the Fully Connected Layer. The most direct method is to learn $K$ and $b$ from scratch. However, the size of $K$ is huge, so weight sharing methods attempt to bind $K$ and $W$ together with simple operations. So far, the most widely used, and almost the only used, setting for $K$ and $b$ is $K = W^\top$ and $b = 0$ (it will be the baseline in this paper's experiments). It was introduced in [Press and Wolf2016], has achieved good performance, and has therefore been adopted by many state-of-the-art Natural Language Processing models. The next subsection discusses how to improve this method with simple and low-cost computations.
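For concreteness, the following PyTorch sketch shows how the baseline setting $K = W^\top$, $b = 0$ is commonly realized. It is a minimal illustration assuming the usual (V, d) weight layout of nn.Embedding, whose rows correspond to the columns $w_i$ of the paper's $d \times V$ matrix $W$; the class name is illustrative and this is not the exact implementation used in the experiments, which are built on fairseq-py.

import torch
import torch.nn as nn

class TiedOutputLayer(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def input_embed(self, tokens: torch.Tensor) -> torch.Tensor:
        # e = W x, realized as an embedding lookup
        return self.embed(tokens)

    def output_scores(self, h: torch.Tensor) -> torch.Tensor:
        # s_hat = W^T h with b = 0: the output kernel is tied to the input embedding
        return h @ self.embed.weight.t()

layer = TiedOutputLayer(vocab_size=32000, embed_dim=512)
h = torch.randn(2, 512)                  # two decoder output vectors
print(layer.output_scores(h).shape)      # torch.Size([2, 32000])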

2.2 Normalization in Embedding Weight Matrix

First, let's look at what the baseline method actually computes. Still assuming $p$ is the real distribution in equation (5),

$\hat{s} = W^\top h$    (16)

$\hat{s}_k = w_k^\top h = \sum_{i=1}^{V} p_i \, w_k^\top w_i$    (17)

Recall that $w_i$ is the $i$-th column of $W$ (the input embedding vector of the $i$-th token). As the target of Neural Network models is to find the argmax, the estimation of the largest $p_k$ matters most. Let $k^* = \arg\max_k p_k$; then

$\hat{s}_{k^*} = \sum_{i=1}^{V} p_i \, w_{k^*}^\top w_i \approx p_{k^*} \|w_{k^*}\|_2^2$    (18)

The approximation is based on the fact that $p$ is sparse and $p_{k^*}$ is much larger than the other components. Obviously, under the baseline method, the estimation of $p_k$ is biased: a token with a small embedding vector norm is under-estimated, and a token with a large embedding vector norm is over-estimated. To solve this problem, this paper designs several improvement methods with small additional computational cost; the first two are embedding normalization methods.
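This bias can be reproduced in a toy simulation. The sketch below (synthetic embeddings only, not data from a trained model) builds a sparse $p$ whose largest component belongs to token 3, yet the baseline score $\hat{s} = W^\top h$ selects token 7 merely because that token's embedding has a large norm:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, d = 1000, 64
directions = F.normalize(torch.randn(V, d), dim=1)   # random unit directions
norms = torch.ones(V)
norms[7] = 4.0                                       # token 7 gets an unusually large norm
W_rows = directions * norms.unsqueeze(1)             # row i is the embedding w_i

p = torch.zeros(V)
p[3], p[7] = 0.7, 0.3                                # sparse "true" distribution: token 3 should win
h = p @ W_rows                                       # h = sum_i p_i w_i, equation (5)

s_hat = W_rows @ h                                   # baseline scores s_hat_i = w_i^T h
print(int(s_hat.argmax()))                           # prints 7 here: the large norm dominates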

2.2.1 ℓ2-normalized Input Embedding Weight Matrix

The most direct way to get rid of the norm problem is to use a column-wise ℓ2-normalized input embedding matrix $\tilde{W}$ with the structure:

$\tilde{W} = \left[ \frac{w_1}{\|w_1\|_2}, \frac{w_2}{\|w_2\|_2}, \dots, \frac{w_V}{\|w_V\|_2} \right]$    (19)

and $K$ and $b$ remain the same as in the baseline ($K = \tilde{W}^\top$, $b = 0$). Using the ℓ2-normalized embedding matrix leads to an unbiased estimation of $p_k$. The proof is as follows: for $w_i$ and $w_j$ that are uniformly i.i.d. on the unit sphere, and $i \ne j$,

$\mathbb{E}\left[ w_i^\top w_j \right] = 0$    (20)

because, due to symmetry, it is easy to prove that the surface integral

$\oint_{\|w\|_2 = 1} w^\top u \, \mathrm{d}S = 0$    (21)

when $u$ is any fixed unit vector. Therefore, from equation (17), $\mathbb{E}[\hat{s}_k] = p_k$. The variance of $\hat{s}_k$ will be small when the embedding dimension $d$ is large.

One problem with this method is that it abandons the diversity of the norms of the embedding vectors, which also carries semantic information.
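A minimal PyTorch sketch of this method is given below, again assuming the (V, d) weight layout whose rows correspond to the columns of $W$; the raw parameter is re-normalized on the fly, and the same unit-norm matrix serves both the input lookup and the output scores. The class name is illustrative, and this sketch shows equation (19) rather than the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class L2NormalizedSharedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, embed_dim) * embed_dim ** -0.5)

    def normalized_weight(self) -> torch.Tensor:
        # w_i / ||w_i||_2 for every token, recomputed from the raw parameter
        return F.normalize(self.weight, p=2, dim=1)

    def input_embed(self, tokens: torch.Tensor) -> torch.Tensor:
        return F.embedding(tokens, self.normalized_weight())

    def output_scores(self, h: torch.Tensor) -> torch.Tensor:
        # s_hat = W_tilde^T h, the tied output projection with unit-norm columns
        return h @ self.normalized_weight().t()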

2.2.2 Square-normalized Output Embedding

If we let $b = 0$ and let the $i$-th row of $K$ equal

$K_i = \frac{w_i^\top}{\|w_i\|_2^2}$    (22)

then

$\mathbb{E}[\hat{s}_k] = \mathbb{E}\left[ \sum_{i=1}^{V} p_i \frac{w_k^\top w_i}{\|w_k\|_2^2} \right] = p_k$    (23)

under the same uniform-direction assumption as above. By scaling the output embedding kernel matrix by the squared norm, the estimation is also unbiased, and $\hat{s}_k$ will never be over-estimated for a token with a large embedding vector norm. However, this method severely over-estimates $\hat{s}_k$ for tokens with a small embedding vector norm, as it does not consider the non-negative $\ell_1$-normalized constraints on $p$: an estimated score gets no penalty, and is even encouraged for output, when it exceeds 1.
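A corresponding sketch of equation (22), under the same assumed (V, d) weight convention and with an illustrative function name, is given below; the small clamp only guards against division by zero and is not part of the formulation.

import torch

def square_normalized_scores(h: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # weight: (V, d) shared embedding matrix, h: (..., d) decoder output vectors
    sq_norms = weight.pow(2).sum(dim=1).clamp_min(1e-8)   # ||w_i||_2^2 per token
    return (h @ weight.t()) / sq_norms                    # s_hat_i = w_i^T h / ||w_i||^2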

For comparison and as further evidence of the effectiveness of embedding normalization, this paper introduces 2 more methods of sharing input and output embeddings. These 2 methods are similar to embedding normalization, but follow different principles.

2.2.3 Calculate Distances for Estimation

Another way to find the $k$ with the largest $p_k$ is to calculate the "similarities" between $h$ and each $w_i$. A scale-sensitive choice is the ℓ2-distance:

$d_i = \|h - w_i\|_2$    (24)

For smoothness, its square is used:

$d_i^2 = \|h\|_2^2 - 2\, w_i^\top h + \|w_i\|_2^2$    (25)

As $\|h\|_2^2$ is the same for every $i$, and we want to find the minimum of $d_i^2$, the score $\hat{s}_i$ can take the following form:

$\hat{s}_i = w_i^\top h - \frac{1}{2}\|w_i\|_2^2$    (26)

and it can be implemented by simply setting $K = W^\top$ and $b_i = -\frac{1}{2}\|w_i\|_2^2$.
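The same weight convention gives a short implementation of equation (26); the bias term is recomputed from the shared weight, so no extra parameters are introduced. The function name is illustrative.

import torch

def distance_scores(h: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # K = W^T with a fixed bias b_i = -||w_i||_2^2 / 2, so that the argmax over
    # s_hat_i picks the token whose embedding is closest to h in l2 distance
    bias = -0.5 * weight.pow(2).sum(dim=1)   # shape (V,)
    return h @ weight.t() + bias             # s_hat_i = w_i^T h - ||w_i||^2 / 2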

2.2.4 Calculate Cosine Similarity for Estimation

This method revisits the role of $h$ and holds the view that the scale of $h$ may be unimportant, since the signal generated by the Neural Network layers can be amplified or reduced, and the most vital information lies in its direction. Therefore, this method computes the cosine similarity between $h$ and $w_i$:

$\hat{s}_i = \cos(h, w_i) = \frac{w_i^\top h}{\|w_i\|_2 \|h\|_2}$    (27)

$\propto \frac{w_i^\top h}{\|w_i\|_2}$    (28)

It relaxes the constraint of equation (5) to a scale-free form and attempts to find the $k$ of:

$k = \arg\max_i \cos(h, w_i)$    (29)

If the output vector $h$ is indeed scale-free, this method will perform better.
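A sketch of equations (27)-(28) under the same convention is shown below (illustrative function name); the $1/\|h\|_2$ factor is dropped because it is shared by all tokens and does not change the argmax.

import torch

def cosine_scores(h: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    norms = weight.norm(p=2, dim=1).clamp_min(1e-8)   # ||w_i||_2 per token
    return (h @ weight.t()) / norms                   # s_hat_i = w_i^T h / ||w_i||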

To theoretically evaluate these estimation methods, the following 3 important properties are put forward, which a good estimation of $p$ should have, but which the original embedding weight sharing method does not preserve.

1. Identity: if $h = w_k$, then the output layer returns token $k$. This property means that, if the output embedding vector is identical to one input embedding vector, the output layer should return the corresponding token as output. The original embedding sharing method does not have this property. The ℓ2-normalized input embedding weight matrix and the cosine similarity estimation have this property, and the other 2 methods have it under mild conditions on the embedding vectors (which hold in almost all practical circumstances).

2. Normality: $\hat{s}_k \le 1$, so the estimation can always be interpreted as a probability. Only the ℓ2-normalized input embedding weight matrix has this property.

3. Unbiasedness: $\mathbb{E}[\hat{s}_k] = p_k$. The 2 embedding normalization methods both have this property.

Finally, the additional computational cost of all 4 methods lies in computing the column-wise norms of $W$, which has time complexity $O(Vd)$, the same as the original output Softmax layer, and is therefore still trivial compared with the inner layers of Neural Network models.

In the experimental part that follows, this paper presents specific evaluation results and analysis for the 4 input-output embedding weight sharing methods.

3 Experiments and Results on Machine Translation Task

This section presents the settings, configurations, results and analysis of the experiments this paper carries out to validate the effectiveness of the embedding normalization methods. The experiments are based on the Machine Translation task, because it is one of the most important text generation tasks and has clear benchmarks. The techniques can also be applied to other text generation tasks, as long as input-output shared embedding is used.

3.1 Experimental Settings

The embedding normalization methods designed in this paper are not restricted to one particular kind of model, but are compatible with various Neural Network models that use input-output shared embeddings. This paper implements the methods on 2 kinds of state-of-the-art models for Neural Machine Translation, Transformer [Vaswani et al.2017] and DynamicConv [Wu et al.2019], on the following datasets: WMT 16' En-De, and IWSLT 14' En-De, En-Es and En-Ro. The Transformer model is trained and tested on all of these datasets, and the DynamicConv model is trained and tested on the WMT 16' En-De and IWSLT 14' En-De datasets.

3.1.1 Datasets

On the WMT 16' En-De dataset, this paper follows the same settings as [Vaswani et al.2017, Ott et al.2018], using 4.5M sentence pairs for training, validating on newstest2013 and testing on newstest2014. The vocabulary has 32K symbols generated by sentencepiece [Kudo and Richardson2018] and BPE [Sennrich et al.2015].

On the IWSLT 14' datasets, this paper uses English as the target language, and German (De), Spanish (Es) and Romanian (Ro) as source languages. For the dataset splits of IWSLT 14', the De-En translation task follows the same settings as [Edunov et al.2017] and [Xia et al.2019], and for the Es-En and Ro-En tasks this paper uses the standard development split for validation and combines the provided test splits into the test set.

3.1.2 Model Configurations

For the Transformer model on the WMT 16' En-De dataset, this paper applies the Transformer-big configuration described in [Vaswani et al.2017], which has 6 blocks in both the encoder and the decoder, an embedding dimension of 1024, a hidden layer size of 4096, and 16 heads in the multi-head attention block. The label smoothing rate is 0.1. The training configuration is taken from [Ott et al.2018]: this paper simulates a 128-GPU training process with half-precision training and the corresponding enlarged effective batch size, a 0.3 dropout rate, and an Adam optimizer with the $\beta_1$, $\beta_2$ and $\epsilon$ values used there. The learning rate scheduler is the same as in [Vaswani et al.2017], with the peak learning rate following [Ott et al.2018].

For Transformer models on the IWSLT 14' datasets, this paper applies a Transformer model with 6 blocks in both the encoder and the decoder, an embedding dimension of 512, a hidden layer size of 1024, and 4 heads in the multi-head attention block. The label smoothing rate is 0.1. The training configuration is almost the same as for the Transformer on the WMT dataset, except that weight decay is additionally applied. Default configurations for IWSLT 14' translation Transformer models use separate input and output vocabularies, but the De-En translation model in this paper uses a joint vocabulary, like the models on the WMT 16' En-De dataset, because this paper finds that it improves BLEU. Therefore, in IWSLT 14' De-En all embedding weights are shared, and in Es-En and Ro-En only the decoder embedding weights are shared.

For DynamicConv models, as the model structures of Transformer and DynamicConv are similar, the model configurations on the corresponding datasets are also nearly the same, except that DynamicConv models have 7 blocks in the encoder. The training configuration is the same as in [Wu et al.2019].

All models are implemented with the fairseq-py toolkit [Sergey et al.2017] in PyTorch. Training and testing are performed on NVIDIA RTX 2080Ti, NVIDIA TESLA P100 and NVIDIA TESLA T4 GPUs.

3.2 Results

3.2.1 Evaluation Method

Like other Neural Machine Translation experiments, this paper uses BLEU (computed with multi-bleu.perl, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) as the evaluation metric for the quality of the generated translation texts.

The translations from the trained models are generated with beam search. For the Transformer-big model on the WMT 16' En-De dataset, texts are generated with a beam width of 4 and a length penalty of 0.6. For Transformer models on the IWSLT 14' datasets, the beam width is set to 5 and the length penalty to 1. For the DynamicConv model on the WMT 16' En-De dataset, texts are generated with a beam width of 5 and a length penalty of 0.5. For DynamicConv models on the IWSLT 14' datasets, the beam width is set to 4 and the length penalty to 1. For Transformer and DynamicConv models on the WMT 16' En-De dataset, this paper averages the last 10 checkpoints for testing.

3.2.2 Experimental Results

Tables 1 and 2 show the BLEU scores of Transformer-big and DynamicConv models on the WMT 16' En-De dataset with different embedding sharing configurations (Baseline is the original sharing method). The models gain BLEU improvements from each of the 4 designed methods, and the ℓ2-normalized input embedding gives the best result for both models on WMT 16' En-De. As analyzed before, the ℓ2-normalized input embedding sharing method has good properties: it gives normalized and unbiased estimations of the tokens' output scores. A 0.6 BLEU improvement is achieved for the Transformer-big model, and a 0.5 BLEU improvement for the DynamicConv model.

Figure 2 is a histogram of the ℓ2-norms of the embedding vectors of the language tokens in a baseline Transformer-big model on the WMT 16' En-De dataset. The histogram illustrates that the embedding vectors become scattered during training and spread out to have varied norms, which degrades the quality of the output scores under the original output embedding.

Configuration                          BLEU
Baseline                               29.3
ℓ2-normalized input embedding          29.9
Square-normalized output embedding     29.4
Distances                              29.8
Cosine similarities                    29.7
Table 1: BLEU scores of Transformer-big models on the WMT 16' En-De dataset with different embedding sharing configurations.
Configuration                          BLEU
Baseline                               29.7
ℓ2-normalized input embedding          30.2
Square-normalized output embedding     30.0
Distances                              29.8
Cosine similarities                    29.9
Table 2: BLEU scores of DynamicConv models on the WMT 16' En-De dataset with different embedding sharing configurations.
Figure 2: Frequency histogram of the ℓ2-norms of the embedding vectors.

For Transformer models on the IWSLT 14' translation tasks, the 4 new weight sharing methods also outperform the baseline. Table 3 shows the BLEU results. In this setting, the square-normalized output embedding method is the best among all methods, achieving BLEU improvements of 0.86/0.73/0.60 on the German/Spanish/Romanian to English translation tasks. The Transformer model for the IWSLT 14' translation tasks has a smaller embedding size. Since normalizing the whole embedding matrix removes some degrees of freedom from the embedding vectors, it has a greater impact on embedding spaces with lower dimensions. Therefore, the ℓ2-normalized input embedding method cannot perform as well as on the WMT 16' dataset.

Task    Configuration                          BLEU
De-En   Baseline                               34.30
        ℓ2-normalized input embedding          35.00
        Square-normalized output embedding     35.16
        Distances                              35.09
        Cosine similarities                    34.92
Es-En   Baseline                               34.24
        ℓ2-normalized input embedding          34.65
        Square-normalized output embedding     34.97
        Distances                              34.54
        Cosine similarities                    34.81
Ro-En   Baseline                               36.90
        ℓ2-normalized input embedding          37.09
        Square-normalized output embedding     37.50
        Distances                              37.22
        Cosine similarities                    37.34
Table 3: BLEU scores for Transformer models on the IWSLT 14' datasets with different embedding sharing techniques.

For DynamicConv models on the IWSLT 14' De-En translation task, Table 4 gives the BLEU scores of the different embedding weight sharing methods. Each method has similar performance, and the cosine similarity method is slightly better than the others, with a 0.41 BLEU improvement over the baseline.

Configuration                          BLEU
Baseline                               34.52
ℓ2-normalized input embedding          34.86
Square-normalized output embedding     34.74
Distances                              34.88
Cosine similarities                    34.93
Table 4: BLEU scores for DynamicConv models on the IWSLT 14' En-De dataset (De to En translation) with different embedding sharing techniques.

From all groups of experiments and their results, it can be concluded that the improved input-output embedding sharing methods work well on state-of-the-art Neural Machine Translation models, and that the normalization methods perform best in most scenarios. The statistical analysis has shown that these methods are unbiased, and the ℓ2-normalized input embedding weight matrix method is additionally normalized, which contributes to their effectiveness.

To build a better text generation Neural Network model, normalization of the input-output shared embedding is indeed useful. The normalization methods rescale the embedding vectors and make the parts of the Neural Network model more interpretable: the input and output embeddings are mappings between the language token space and a continuous numerical space with semantics, the encoder encodes features into that embedding space, and the decoder produces output vectors in the embedding space.

4 Related Works

With the fast growth of computational power and network training techniques, Neural Networks of great size and with huge numbers of parameters have begun to take the lead and become state-of-the-art models for various tasks in Natural Language Processing. Based on the self-attention Transformer [Vaswani et al.2017], BERT [Devlin et al.2018] and its descendants [Liu et al.2019, Lan et al.2019, Raffel et al.2019] dominate multi-task learning in Natural Language Processing. Due to the great number of words and tokens in natural language, each of these models needs to spend a portion of its parameters on building mappings between language words/tokens and numerical tensors, also known as the input and output embeddings. This need contributes a lot to the computational cost and storage requirements of the models.

To deal with this problem, one approach is to reduce the structure of the embeddings. [Raunak2017, Acharya et al.2019] present effective methods for compressing input embeddings, thereby lowering the number of parameters they need; their shortcoming is that these methods are data-based, which means that intensive training of the full models is still needed. [Lan et al.2019] directly factorizes the input embedding layer and can reduce its parameters at any scale. For the output mapping, taking the long-tail distribution of words in natural languages into account, [Grave et al.2017] designs a hierarchical Softmax layer for language models, called Adaptive Softmax, which saves a great portion of the parameters in the output layer.

Another kind of parameter reduction technique does not adjust the structure of Neural Networks, but creates parameter sharing patterns within networks. [Xia et al.2019, Lan et al.2019] are examples of sharing parameters. [Baevski and Auli2018] proposed Adaptive Embedding, which, like Adaptive Softmax, uses different numbers of parameters to map word clusters of different frequencies. When Adaptive Embedding and Adaptive Softmax are used together, part of their parameters can be shared.

For the Neural Machine Translation task, traditional Neural Network solutions are LSTM networks with attention mechanisms, or Convolutional Networks. Recently, the self-attention based Transformer network has shown its advantages: benefiting from optimized training schemes and data augmentation techniques such as back-translation [Edunov et al.2018], its translation results are of outperforming quality. Some improvements and derivatives of the Transformer have also emerged. [Dai et al.2019] expands the context width of the Transformer to extract more features from the context. [Wu et al.2019] proposes new operations, Lightweight Convolution and Dynamic Convolution, as substitutes for self-attention in the Transformer, and presents competitive test results in its experiments. [Gu et al.2019] is a Transformer-based model in the non-autoregressive paradigm of Neural Machine Translation.

5 Conclusion

To reduce the number of parameters across the different parts of Natural Language Processing models, the simple technique of using the same input and output embedding weight matrix has been widely adopted by recent state-of-the-art models, while preserving or even improving the performance of text generation models.

To further optimize its effectiveness, this paper analyzes why the embedding sharing technique works, explores its shortcomings, and presents improvements and adjustments based on theoretical analysis. By applying normalization to the embedding weight matrices, the bias in the estimation of the output scores is eliminated. This paper's work can be applied to a wide range of Neural Networks in Natural Language Processing, as long as input-output embedding sharing is used.

In the experiments with 2 kinds of Neural Machine Translation models on various datasets, the improvement methods this paper designs show consistent effectiveness, while adding almost no training or inference time to the models.

References