Neural sequence models, such as the Transformer [vaswani2017attention], long short-term memory [hochreiter1997long], and gated recurrent neural networks [chung2014empirical], have been firmly established as state-of-the-art approaches in sequence modelling and machine translation [sutskever2014sequence, bahdanau2014neural, cho2014learning]. Without a single exception, these models use distributed vector representations of words, referred to as word embeddings, as their cornerstone. Furthermore, research has shown that a higher-quality word embedding set benefits the whole model [kocmi2017exploration], and "meta-embedding" methods, first proposed by [yin2016learning], can yield an embedding set of improved quality. Therefore, meta-embedding can benefit language modelling.
Several meta-embedding methods have been proposed to yield an embedding set of better quality. For example, 1toN+ [yin2016learning] takes an ensemble of pre-trained embedding sets and uses a neural network to recover each word's corresponding vector within every source embedding set. An unsupervised approach is employed by [bollegala2017think]: for each word, a representation is learnt as a linear combination of its nearest neighbours. Other methods, despite their simplicity, such as concatenation [yin2016learning] or averaging the source word embeddings [coates2018frustratingly], provide strong baselines for meta-embedding. In this work, we explore a new way of meta-embedding, namely the Duo. To the best of our knowledge, our model is the first meta-embedding method based on the self-attention mechanism.
Our meta-embedding language model uses the Transformer as its backbone, a model architecture that eschews recurrence and relies only on attention, whose mechanism draws global dependencies between input and output. As recurrence is removed, parallelization is greatly enhanced. Because we learn from twice the embeddings, the number of heads in the Transformer doubles and better performance is gained. Moreover, we use weight sharing in the duo multi-head attention, so the number of parameters is reduced.
The mechanism of the Duo is that, instead of merely enlarging the dimension of the word embedding, which leads to an enormous increase in the number of parameters, we use separately trained embeddings as the key and the value for each word in the Transformer's self-attention mechanism. As the number of word embeddings doubles, the information available to attention also doubles. The discrepancy between the two pieces of independent embedding describes two different aspects of the same information. As our results demonstrate, this independence is quite beneficial to the training of the model.
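The key-and-value split above can be sketched in a few lines of NumPy. This is our minimal illustration, not the paper's implementation: keys are drawn from one pre-trained embedding ("Spongebob") and values from a second, independently trained one ("Patrick"), so the attention map and the attended content carry independent information.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def duo_attention(q, k, v):
    """Scaled dot-product attention where the keys come from one embedding
    set and the values from another (the Duo idea in its simplest form)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (seq, seq) attention logits
    return softmax(scores, axis=-1) @ v

# toy example: 4 tokens; queries/keys from embedding A, values from embedding B
rng = np.random.default_rng(0)
q = k = rng.standard_normal((4, 8))   # embedding A stream
v = rng.standard_normal((4, 8))       # embedding B stream
out = duo_attention(q, k, v)
print(out.shape)  # (4, 8)
```

In the vanilla Transformer, `k` and `v` would be projections of the same embedding; here they are genuinely independent vectors for the same tokens.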
Moreover, recent research [tang2018self] has shown that the Transformer has shortcomings in long-sequence learning; the Transformer-XL and other methods [child2019generating, sukhbaatar2019adaptive] have therefore been proposed to address the long-sequence problem. The good news is that our model is general enough that the Transformer-XL, along with other language models based on the Transformer, can employ the Duo mechanism to perform meta-embedding learning.
We examine our model on two representative tasks: text classification and machine translation. For text classification, the Duo mechanism exploits the information in two separately trained pieces of word embedding, e.g., GloVe [pennington2014glove] and fastText [joulin2016bag]. This meta-embedding model gives the language model more prior knowledge along independent aspects, thus leading to a better result. The machine translation task is trickier for meta-embedding learning, as it requires devising a decoder. In this paper, we propose a sequence-to-sequence meta-embedding language model to handle this problem, and the experiments show that learning in such a way leads to better performance and faster convergence.
All in all, our contributions are threefold:
We propose an attention-based way of meta-embedding for better language modelling.
To the best of our knowledge, we devise the first sequence-to-sequence encoder-decoder language model which directly uses two independent embeddings.
The Duo mechanism we propose is very general and can be employed on any language model based on the Transformer.
For deep learning methods in text classification, word embedding has been a focus of much research [mikolov2013distributed, pennington2014glove], as several studies have shown that text classification depends enormously on the effectiveness of the word embedding [shen2018baseline, wang2018joint]. The first part of our work focuses on combining different pre-trained word embeddings for text classification. In other words, the Duo mechanism enables two different pieces of pre-trained word embedding to perform on the same stage.
Methods for meta-embedding [yin2016learning] concern conducting a complementary combination of information from an ensemble of distinct word embedding sets, each trained with different methods and resources, to yield an embedding set of improved overall quality [kiela-etal-2018-dynamic, neill2018angular, coates2018frustratingly, muromagi2017linear, artetxe2018uncovering]. We believe this is one of the benefits of applying Duo.
There have also been extensive studies on refining the architecture of the Transformer. Adversarial training has been proven beneficial to language modelling [wang2019improving]. Additionally, to deal with the fixed-length problem, [dai2019transformer] extend the vanilla Transformer with recurrent units, which greatly enhances the original model. The Duo mechanism is a meta-embedding approach built on the Transformer.
3 Model Architecture
3.1 Duo Classifier
3.1.1 Duo Word Embedding
As demonstrated in Figure 1, we use two different embeddings, named Spongebob and Patrick, to represent the word embeddings of the same input sequence, whose length is not fixed across sentences. Hereafter, Spongebob and Patrick denote separately trained word embeddings; e.g., Spongebob can be GloVe 300d and Patrick can be Word2Vec 30d. For simplicity of notation, we use these names to refer to the text under each word embedding.
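The duo word embedding amounts to looking up the same token ids in two independently trained tables. The sketch below is ours (the vocabulary and dimensions are toy values, and the tables stand in for pre-trained sets such as GloVe and Word2Vec):

```python
import numpy as np

# Two independently trained embedding tables for the same vocabulary.
vocab = {"the": 0, "duo": 1, "wins": 2}
rng = np.random.default_rng(1)
spongebob = rng.standard_normal((len(vocab), 6))  # embedding set A, 6-d
patrick = rng.standard_normal((len(vocab), 4))    # embedding set B, 4-d

ids = [vocab[w] for w in ["the", "duo", "wins"]]  # the same token ids...
x_a = spongebob[ids]   # ...looked up in embedding A -> shape (3, 6)
x_b = patrick[ids]     # ...and in embedding B       -> shape (3, 4)
print(x_a.shape, x_b.shape)
```

Note that the two streams need not share a dimensionality; the fusion step described later reconciles them.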
3.1.2 Duo Classifier Attention
We simply let the model learn the attention parameters that balance the weights of different dimensions in our text classifier. In practice, we initialize two parameters, one for each embedding.
In our later experiments, we drop the softmax function, since doing so yields even faster computation while maintaining satisfying results.
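One plausible reading of this classifier attention (the paper's exact formula is elided in the text, so the scoring scheme below is our assumption) is that a learned parameter vector scores each token and the sentence vector is the weighted sum, with the softmax made optional:

```python
import numpy as np

def seq_attention(x, u, use_softmax=True):
    """Learned-vector attention over a token sequence.
    x: (seq, dim) token embeddings; u: (dim,) learned attention parameter.
    With use_softmax=False the raw scores serve directly as weights,
    which is cheaper to compute."""
    scores = x @ u                          # (seq,) one score per token
    if use_softmax:
        e = np.exp(scores - scores.max())
        scores = e / e.sum()
    return scores @ x                       # (dim,) weighted token sum

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 6))             # 5 tokens, 6-d embedding
u = rng.standard_normal(6)                  # learned attention parameter
s_soft = seq_attention(x, u, use_softmax=True)
s_raw = seq_attention(x, u, use_softmax=False)
print(s_soft.shape, s_raw.shape)
```

In the duo setting this runs once per embedding stream, with a separate parameter vector for each.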
3.1.3 Duo Sentence Embedding
The duo sentence embedding is a fusion of the two attended representations, so we introduce another fusion parameter.
where denotes the concatenation operation.
This gives the final representation of the sentence embedding. Its value is the weighted sum of the token representations based on their attention to each other. In other words, we learn the attention and the value separately by giving them different embeddings.
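A minimal fusion sketch, under our assumptions (the symbol names `W_f`, `s_a`, `s_b` are ours, not the paper's): the two attended sentence vectors are concatenated and projected by a learned fusion matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
s_a = rng.standard_normal(6)            # sentence vector from embedding A
s_b = rng.standard_normal(4)            # sentence vector from embedding B
W_f = rng.standard_normal((6 + 4, 8))   # fusion parameter, learned in training

# concatenate the two streams, then project to the final sentence embedding
s = np.concatenate([s_a, s_b]) @ W_f
print(s.shape)  # (8,)
```

The projection also reconciles the (possibly different) dimensionalities of the two source embeddings.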
3.1.4 Model Complexity
The number of parameters to be learned in our model comes from the Duo Classifier Attention layer, the Duo Sentence Embedding layer, and the final softmax layer. With the settings above and the number of labels set to 20, the total is no more than 0.4M parameters. Running on a machine with 8 GPUs, we achieve a state-of-the-art result on the 20NG text classification task in less than half an hour.
3.2 Duo Transformer
3.2.1 Duo Attention
Having reviewed the duo classifier, the Duo multi-head attention is simple and straightforward.
We have multi-head attention:
We can use a similar formulation to calculate the Duo multi-head attention.
From Figure 2, it seems that the number of parameters has doubled compared to vanilla attention. To eschew this over-complexity, we share weights in the multi-head attention of each layer: the corresponding projections of the two embedding streams share the same parameters. So the final multi-head attention has only slightly more projection parameters, and our experiments show that the weight sharing results in faster convergence.
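One way the weight sharing can work, sketched under our assumptions (the exact pairing of shared projections is elided in the text): both streams reuse the same query and key projections `W_q`, `W_k`, so only the value projections `W_va`, `W_vb` are duplicated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def duo_head(x_a, x_b, W_q, W_k, W_va, W_vb):
    """One duo attention head with shared Q/K projections: a single
    attention map is computed on stream A, then reads values projected
    from each embedding stream separately."""
    q, k = x_a @ W_q, x_a @ W_k
    att = softmax(q @ k.T / np.sqrt(W_q.shape[1]))
    return att @ (x_a @ W_va), att @ (x_b @ W_vb)

rng = np.random.default_rng(4)
x_a = rng.standard_normal((4, 6))    # stream from embedding A
x_b = rng.standard_normal((4, 6))    # stream from embedding B
W_q, W_k, W_va, W_vb = (rng.standard_normal((6, 3)) for _ in range(4))
h_a, h_b = duo_head(x_a, x_b, W_q, W_k, W_va, W_vb)
print(h_a.shape, h_b.shape)  # (4, 3) (4, 3)
```

Compared with duplicating the whole head, only the value projections add parameters, which matches the parameter counts reported in the ablation.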
3.3 Duo Decoder
The Duo Decoder is quite similar to the original Transformer decoder, except that the original Transformer uses the same source for its keys and values, while the Duo Decoder uses different ones. Our interpretation is that each key and value encodes different information from each word embedding, so they need to be decoded separately.
The vanilla Transformer shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [press2016using]. As we have a fusion layer, we still share weights, but only after a linear projection of the concatenated duo embedding layer, and the parameters of this projection are learnt during training.
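The shared-weight idea can be sketched as follows (names `E_a`, `E_b`, `P` are ours for illustration): the concatenated duo embedding tables are passed through the learned projection, and the projected matrix serves as the tied pre-softmax output weights.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab, d_a, d_b, d_model = 10, 6, 4, 8
E_a = rng.standard_normal((vocab, d_a))      # embedding table A
E_b = rng.standard_normal((vocab, d_b))      # embedding table B
P = rng.standard_normal((d_a + d_b, d_model))  # learned fusion projection

# tie the output weights to the projected, concatenated duo embeddings
E_tied = np.concatenate([E_a, E_b], axis=1) @ P   # (vocab, d_model)
h = rng.standard_normal(d_model)                  # a decoder output state
logits = E_tied @ h                               # pre-softmax scores, (vocab,)
print(E_tied.shape, logits.shape)
```

This keeps the parameter savings of weight tying while letting each source embedding contribute to the output distribution.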
3.4 Duo Layer Normalization
Another intriguing part is the Duo Layer Normalization. The output of traditional layer normalization [ba2016layer, he2016deep] is LayerNorm(x + Sublayer(x)) in each unit. However, considering the dimensional difference between the word embeddings, while guaranteeing a more fluid cross-information flow, we modify the original LayerNorm to the following formula:
This mechanism is used in the decoder layer between the masked multi-head attention and the feed-forward unit, as demonstrated in Figure 2.
3.5 Why Duo
Attention is undoubtedly a good idea in natural language processing. However, even the authors of the Transformer are aware of the limitation of equating the attention value with the word value. Thus, multi-head attention deals with the problem by linearly transforming the embedding before feeding it into scaled dot-product attention. However, such a linear transformation may not add much information; after all, it is linear, meaning there are still unbreakable constraints between the attention value and the word value.
For example, when we think of the abstract word 'duo', concrete words such as 'Spongebob', 'Patrick', 'Tom', and 'Jerry' are among the things we come up with. However, for the noun 'duo' to pay close attention to these concrete examples, this word would have to sit within the 'name cluster' of the embedding space. Yet 'duo' is undoubtedly not a name; it should be near these names in the abstract attention space, but not in the value space.
As another example, we can obtain the word vector for 'Spongebob' by adding a certain offset vector to the word vector for 'Patrick', and we get 'Tom' by adding the same vector to 'Jerry', thanks to the linear substructure of the embedding space. However, we have no idea what we would get by adding this vector to the word 'duo'.
Loosening the attention-value constraints enables the model to have diversity in the concrete embedding space while maintaining homogeneity in the abstract embedding space, which we use to calculate attention.
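The linear-offset argument above can be made concrete with toy, made-up 2-d vectors (these are illustrative values, not real word embeddings): the name pairs differ by the same offset, but adding that offset to the abstract word 'duo' lands on no meaningful neighbour.

```python
import numpy as np

patrick = np.array([1.0, 0.0])
offset = np.array([0.5, 0.5])       # 'Patrick' -> 'Spongebob' direction
spongebob = patrick + offset
jerry = np.array([0.0, 1.0])
tom = jerry + offset                 # the same offset links the second pair

duo = np.array([-2.0, -2.0])         # abstract word, far from the name cluster
mystery = duo + offset               # no word vector to match this point
print(spongebob, tom, mystery)
```

In the Duo model, 'duo' can still attend strongly to the names (attention space) without being forced to live among them (value space).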
4 Experiments
In this section, we first demonstrate the performance of our Duo Classifier on public text classification tasks. Then we show the results of running our model on machine translation tasks. We ran our models on 8 NVIDIA RTX 2080 Ti GPUs.
4.1 Duo Classifier
We compare our model with multiple state-of-the-art baselines on several public datasets in terms of accuracy. We use GloVe 50d and GloVe 300d as the pre-trained embeddings, which we find to be the best duo couple. We then run a series of self-comparison experiments on different combinations of word embeddings.
We explored a variety of duo couples, and it turns out that GloVe 50d and GloVe 300d yield the best results. Other hyper-parameters, including dropout and learning rate, are the same as in the original Transformer. We randomly selected 10% of the training set as a validation set. We trained our model for a maximum of 200 epochs using Adam [kingma2014adam] and stopped if the validation loss did not decrease for ten consecutive epochs. The results of other models on the same datasets are from [yao2019graph]. We ran our model ten times and report the mean. We then further explore the results of different combinations of duo couples.
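The early-stopping protocol above can be sketched as a minimal loop (the loss values here are synthetic, purely to exercise the logic): stop when the validation loss has not decreased for ten consecutive epochs, up to 200 epochs.

```python
def train_with_early_stopping(val_losses, patience=10, max_epochs=200):
    """Return the best validation loss and the epoch at which training
    stopped (None if the patience budget was never exhausted)."""
    best, since_best, stopped_at = float("inf"), 0, None
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0   # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:   # ten epochs without improvement
                stopped_at = epoch
                break
    return best, stopped_at

# validation loss improves for 5 epochs, then plateaus
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.65] * 20
best, stopped_at = train_with_early_stopping(losses)
print(best, stopped_at)  # 0.6 15
```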
We ran our experiments on five popular benchmark corpora: 20-Newsgroups (20NG, http://qwone.com/ jason/20Newsgroups/), Ohsumed (http://disi.unitn.it/moschitti/corpora.htm), R52 and R8 of Reuters 21578 (https://www.cs.umb.edu/ smimarog/textmining/datasets/), and Movie Review (MR, http://www.cs.cornell.edu/people/pabo/movie-review-data/). These datasets are widely used and recognized in recent publications, so we skip their details; readers can consult [yao2019graph] for more detailed settings.
As it turns out, our model achieves the best results on 4 out of 5 benchmarks (Table 1). It still ranks second on the R8 dataset, which we attribute to this dataset having fewer words than the others (only 7,688 words), so that information from word embeddings alone is not enough.
The main reasons why the Duo model works well are twofold. Firstly, we use separately trained embeddings, and previous research has shown that this meta-embedding technique can greatly improve performance. Secondly, we use the Transformer to combine these embeddings, a model proven more efficient than traditional RNN-based models. This holds even though, for text classification, we simply average the word embeddings in each document, in the spirit of meta-embedding.
TF-IDF + LR: a bag-of-words model with term frequency-inverse document frequency weighting; Logistic Regression is used as the classifier.
|Model||20NG||R8||R52||Ohsumed||MR|
|CNN-non-static (uses pre-trained word embeddings)||82.15||95.71||87.59||58.44||77.75|
|Text GCN [yao2019graph]||86.34||97.07||93.56||68.36||76.74|
Which Couple Is The Best
We explored various couples of word embeddings on the datasets, including different dimensions of GloVe [pennington2014glove], CBOW [mikolov2013distributed], and fastText [joulin2016bag]; the results are demonstrated in Tables 2, 3, and 4. It turns out that the GloVe 50d and GloVe 300d duo wins the competition. The results are obtained by running each couple ten times and averaging its performance on the 20NG, Ohsumed, and MR datasets. Without exception, the duo couple of GloVe 50d and GloVe 300d achieves the best results on all tasks. These results further prove the advantages of GloVe word embeddings. Additionally, it is no surprise to us that the diagonals of the tables show relatively less satisfying results: because those duo entries employ the same embedding twice, they reduce to simple one-layer single-head Transformer models.
|Word Embedding||GloVe 50d||GloVe 300d||fastText 300d||CBOW 50d||CBOW 300d|
|Word Embedding||GloVe 50d||GloVe 300d||fastText 300d||CBOW 50d||CBOW 300d|
|Word Embedding||GloVe 50d||GloVe 300d||fastText 300d||CBOW 50d||CBOW 300d|
4.2 Duo Machine Translation
After exploring the performance of Duo on the text classification task, we further investigate whether this meta-embedding mechanism can be applied to machine translation. The potential is considerable, as good performance on classification tasks means such a mechanism can encode a sentence much better. However, the real difficulty lies in the design of the decoder. We devised a meta-embedding decoder architecture based on the backbone of the Transformer, as demonstrated in Section 3.2. In this part, we examine the Duo Translator in terms of its BLEU score and its convergence speed.
For the machine translation models, we followed the same hyper-parameter setup described in [vaswani2017attention]; in particular, the feed-forward dimension was set to 2048. The number of layers for both the encoder and the decoder was set to 8. Additionally, we use weight sharing in the Duo multi-head attention to decrease the model complexity. Worth mentioning, we use GloVe 300d word embeddings followed by a feed-forward network to fix the discrepancy in dimensionality.
On the machine translation task, we report results on two mainstream benchmark datasets: WMT 2014 English to German (En-De), consisting of about 4.5 million sentence pairs, and WMT 2014 English to French (En-Fr), with 36M sentences. We used byte-pair encoding [britz2017massive] with vocabularies of 32K and 40K tokens for the respective tasks.
We demonstrate the effectiveness of our model in Table 5, which shows that meta-embedding clearly benefits translation. Specifically, our model achieves a state-of-the-art score on the WMT 2014 En-De benchmark and remains competitive on the WMT 2014 En-Fr benchmark. Notably, the meta-embedding Duo Transformer outperforms the vanilla Transformer by 1.3 and 1.1 BLEU on the respective tasks, further proving the advantage of the meta-embedding mechanism.
|Model||Param||WMT En-De||WMT En-Fr|
|Transformer big [vaswani2017attention]||213M||28.4||41.0|
|Weighted Transformer [ahmed2017weighted]||213M||28.9||41.4|
|Transformer with RPP [shaw2018self]||-||29.2||41.5|
|TaLK Convolution [lioutas2020time]||209M||29.6||43.2|
Figure 3, along with Table 8, also demonstrates the faster convergence, as well as the better performance, brought by meta-embeddings. The results are obtained by averaging three separate runs of each model on the WMT 2014 En-De validation set.
|Model||Param|
|+ Meta Embeddings||246M|
|+ Weight Sharing in Duo Multihead||220M|
|+ Duo Normalization||220M|
|+ Fusion Layer||220M|
To evaluate the function of the different parts of our architecture, we conducted an ablation study on the WMT 2014 En-De validation set. We used the same hyper-parameters as before, and the results are reported in Table 6. Initially, we add the meta embeddings to the vanilla Transformer model, and this appears to give the most salient improvement in performance. However, the number of parameters increases considerably, and the improvement may merely come from the additional parameters. Therefore, we shrink the model's size by weight sharing in the Duo multi-head attention. It turns out that this operation not only reduces the number of parameters but also improves performance. The subsequent Duo Normalization and Fusion layer also prove beneficial.
In this work, we presented the Duo model, the first meta-embedding mechanism based on self-attention, which improves the performance of language modelling by exploiting more than one word embedding.
For text classification tasks, a single-layer Duo Classifier achieves state-of-the-art results on many public benchmarks. Moreover, for machine translation tasks, we introduce the first encoder-decoder model with more than one embedding. Furthermore, we show that this meta-embedding mechanism benefits the vanilla Transformer in terms of not only better performance but also faster convergence.
Nowadays, although more and more attention is paid to meta-embeddings in natural language processing, we believe this mechanism has potential beyond the text classification task. We sincerely expect more investigations into this field.