Transformer-based NMT systems perform well on multiple translation tasks (vaswani2017attention). Multi-head attention is a key component of the Transformer model (vaswani2017attention). Multiple heads improve performance over a single head because they allow the model to jointly attend to different subspaces and hence capture richer features from sentences. For example, a head can capture positional information by attending to adjacent tokens, or it can capture syntactic information by attending to tokens in a particular syntactic dependency relation (voita2019analyzing). However, the transformer-base model with 8 heads at each layer scores only 1 BLEU point higher than a similar model with a single head at each layer (voita2019analyzing). This is because the majority of heads learn similar weights, so multiple heads attend to the same parts of the input. Hence, most of the heads are redundant and increase the computational cost without improving performance.
To avoid this redundancy, one approach is to prune the redundant heads based on an importance score. In this work, we focus on designing a method to compute the importance score of each head. Some recent work has analyzed the importance of heads by considering the average attention weights of each head at specific positions (voita2018context). However, the average of attention weights is a static measure of head importance, as it does not account for how the importance of each head varies with the input. The importance of a head is dynamic: a head can be very important for one word but less important for others. Thus, in this work, we propose a Dynamic Head Importance Computation Mechanism (DHICM) to calculate the importance score of each head, which can later be used to design a pruning strategy. Our key idea is to apply a second-level attention on the outputs of all heads during training, to dynamically calculate an importance score for each head that varies with the input. We also propose a new loss term that prevents our approach from assigning equal importance to all heads. Note that we apply DHICM to both the self-attention heads and the encoder-decoder attention heads present in the encoder and decoder of the transformer architecture.
To evaluate the performance of our method, we considered multiple translation tasks with different language pairs, namely Hindi-English, Belarusian-English, and German-English. Results show that DHICM outperforms the standard transformer model, particularly in low-resource conditions where much less training data is available. Moreover, DHICM adds only a small number of parameters, determined by the word embedding dimension, which is far fewer than the total number of parameters in the transformer-base model.
The transformer model has a large number of hyperparameters, which makes it computationally challenging to search for their optimal values. Thus, much of the previous work used the default values of the hyperparameters (gu2018meta; aharoni2019massively). However, these are not guaranteed to yield optimal performance on different datasets. Grid search over all hyperparameters is computationally intensive because the number of combinations across all possible values is exponential. Therefore, in this work, we perform grid search over a subset of hyperparameters, i.e., architecture hyperparameters and regularization hyperparameters, and experiments show that the hyperparameter values obtained with our method yield significantly better performance than the default values. To summarize, our work makes the following major contributions:
We propose a Dynamic Head Importance Computation Mechanism for transformer-based NMT systems, which computes the importance scores of all heads dynamically with respect to an input token.
We propose an additional loss term that helps the model compute different attention for different heads and filter the most important heads.
Our hyperparameter tuning method yields significantly better performance than the default values.
2.1 Single-Head Attention
Given a sequence of vectors of equal dimension and a query vector, single-head attention is a weighted aggregate of the vectors in the sequence, followed by a linear transformation. The weights are obtained using a scoring function, e.g., a multi-layer perceptron (bahdanau2014neural) or scaled dot product (vaswani2017attention), and the computation is parameterized by learnable weights. A transformer-based NMT system consists of an encoder and a decoder. The encoder encodes the input sequence of tokens and outputs a sequence of vectors, which the decoder uses to generate a sequence of tokens. If the query vector is generated by the encoder, the computed attention is known as self-attention; if it is generated by the decoder, the computed attention is known as encoder-decoder attention.
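The description above can be illustrated with a minimal sketch of single-head attention using scaled dot product as the scoring function. The names `single_head_attention`, `W_v` and `W_o` are illustrative assumptions, and the query and key projections of the full transformer are omitted for brevity, so this is a simplified instance rather than the exact formulation used in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def single_head_attention(xs, q, W_v, W_o):
    """Weighted aggregate of the value-projected inputs, then a linear map."""
    d = len(q)
    # Scaled dot-product score between the query and every input vector.
    scores = [sum(a * b for a, b in zip(q, x)) / math.sqrt(d) for x in xs]
    alphas = softmax(scores)  # attention weights, summing to 1
    values = [matvec(W_v, x) for x in xs]
    context = [sum(a * v[j] for a, v in zip(alphas, values))
               for j in range(len(values[0]))]
    return matvec(W_o, context), alphas

# Toy inputs (assumed values): three 2-dimensional vectors and one query.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, alphas = single_head_attention(xs, [1.0, 0.0], identity, identity)
```

In this toy case the first and third vectors have the same dot product with the query, so they receive equal weight, and both receive more weight than the second vector.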
2.2 Multi-Head Attention
The multi-head attention mechanism runs multiple single-head attention mechanisms in parallel (vaswani2017attention). Each head is an independent single-head attention with its own learnable weights. The output of each head is calculated independently, and the final output is computed from the outputs of all heads.
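As a rough sketch of this combination, the following code runs several heads independently and sums their outputs after each head's own output transformation, which is equivalent to one linear transformation over the concatenated head outputs. The per-head parameter triple `(W_q, W_v, W_o)` and the function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def head_output(xs, q, W_q, W_v):
    """One head: attend over xs with its own query and value projections."""
    pq = matvec(W_q, q)
    scores = [sum(a * b for a, b in zip(pq, x)) / math.sqrt(len(pq)) for x in xs]
    alphas = softmax(scores)
    values = [matvec(W_v, x) for x in xs]
    return [sum(a * v[j] for a, v in zip(alphas, values))
            for j in range(len(values[0]))]

def multi_head_attention(xs, q, heads):
    """Sum the per-head outputs, each passed through that head's own W_o."""
    total = None
    for W_q, W_v, W_o in heads:
        out = matvec(W_o, head_output(xs, q, W_q, W_v))
        total = out if total is None else [t + o for t, o in zip(total, out)]
    return total

# Toy setup (assumed values): identical heads make the result easy to check.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [1.0, 0.0]
one_head = multi_head_attention(xs, q, [(identity, identity, identity)])
two_heads = multi_head_attention(xs, q, [(identity, identity, identity)] * 2)
```

Because every head contributes through a fixed linear map, no head is weighted according to the current input, which is the limitation DHICM targets.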
3.1 Dynamic Head Importance Computation Mechanism (DHICM)
In the traditional transformer model, the output of the multi-head attention is a linear transformation over the concatenation of the outputs of all heads. Therefore, the outputs of all heads contribute equally. However, since not all heads are equally important to the input (Sec. 1), we propose to compute the importance of each head with respect to the input dynamically.
Our idea is that an additional attention layer allows the model to pay more attention to the heads that are more important to the input. Thus, we design a second-level attention that uses the input and the outputs of all heads to compute attention scores, i.e., importance scores for all heads with respect to the input. The second-level attention operates on the input to the multi-head attention module and on the output of each head (without the linear transformation described in Sections 2.1 and 2.2). We first learn a function that determines the attention, i.e., the importance score, for each head. To approximate this function, we considered both a multi-layer perceptron and scaled dot product. In our experiments, both achieved similar performance, and since scaled dot product requires fewer parameters, we used it to compute the importance scores (Equation 2).
The multi-head attention and the second-level attention each have their own scaling factor, and the second-level attention has its own learnable parameters. We also add a dropout layer (srivastava2014dropout) after computing the scores in Equation 2. Next, we compute the output of the second-level attention layer from the importance scores of the heads, using additional learnable parameters. This output is then passed to the feed-forward network. Note that DHICM learns only the additional parameters of the added second-level attention, which are far fewer than the total number of parameters in the standard transformer model (the word embedding dimension is typically 512).
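To make the idea of a second-level attention concrete, the following is a minimal sketch under stated assumptions: the input is projected by a hypothetical matrix `W_query`, scored against each head output with a scaled dot product, and the head outputs are mixed by the resulting importance scores before a single shared output matrix `W_out` is applied. These names and the exact form of the scoring are assumptions, not a reproduction of the paper's equations, and the dropout layer is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def second_level_attention(x, head_outputs, W_query, W_out):
    """Score each head against the input, then mix head outputs by those scores."""
    query = matvec(W_query, x)      # hypothetical projection of the input
    scale = math.sqrt(len(query))   # scaling factor of the second-level attention
    scores = [sum(a * b for a, b in zip(query, o)) / scale for o in head_outputs]
    importance = softmax(scores)    # one importance score per head, summing to 1
    mixed = [sum(p * o[j] for p, o in zip(importance, head_outputs))
             for j in range(len(head_outputs[0]))]
    # A single output matrix is shared by all heads.
    return matvec(W_out, mixed), importance

# Toy values (assumed): three heads with 2-dimensional outputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 0.0]
head_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
output, importance = second_level_attention(x, head_outputs, identity, identity)
_, uniform_importance = second_level_attention(x, [[1.0, 1.0]] * 3, identity, identity)
```

When all head outputs are identical, the scores collapse to a uniform distribution, which is the degenerate case that the additional loss term is meant to discourage.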
The cross-entropy loss is minimized to ensure that the model generates accurate tokens. If the cross-entropy loss is the only objective, however, the model might learn equal importance scores for all heads. This would indicate that all heads are equally important to the input and would prevent us from filtering the most important heads. To avoid this, we add an extra loss term that penalizes the model if the importance scores become equal for all heads. More formally, consider two vectors: one holds the importance score of every head as computed by the model, and the other represents equal importance of all heads, with every entry set to the same constant. Both vectors denote importance distributions over the heads: the first is learned by the model using the second-level attention, and the second is a uniform distribution. To keep the model from assigning equal importance to all heads, we maximize the Kullback-Leibler divergence (KL divergence) between the two distributions. Note that both distributions sum to 1. Specifically, the extra loss term is the KL divergence between the learned distribution and the uniform distribution.
The overall loss, in which we minimize the cross-entropy loss and maximize the KL divergence term, combines the two terms, with a hyperparameter that controls the effect of the KL divergence term on the overall loss. The objective is to minimize the overall loss.
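One way to realize this objective is sketched below. It assumes the divergence is taken from the learned distribution $p$ to the uniform distribution $u$ over $h$ heads, i.e., $\mathrm{KL}(p \,\|\, u) = \sum_i p_i \log(h\,p_i)$, and that the overall loss subtracts the weighted KL term from the cross-entropy loss so that minimizing the total maximizes the divergence. The direction of the divergence, the subtraction form, and the function names are assumptions; the weight of 0.1 is the value reported in Section 4.3.

```python
import math

def kl_to_uniform(p):
    """KL(p || u) for u uniform over len(p) heads; zero exactly when p is uniform."""
    h = len(p)
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi * h) for pi in p if pi > 0.0)

def overall_loss(ce_loss, head_scores, weight=0.1):
    """Cross-entropy minus the weighted KL term (assumed form of the objective)."""
    return ce_loss - weight * kl_to_uniform(head_scores)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.70, 0.10, 0.10, 0.10]
loss_uniform = overall_loss(2.0, uniform)  # no reward: scores are uniform
loss_peaked = overall_loss(2.0, peaked)    # lower loss: scores differ across heads
```

With equal cross-entropy, the peaked distribution yields a lower overall loss than the uniform one, so gradient descent is pushed away from assigning equal importance to all heads. How the per-token terms are aggregated over a batch is not specified here and is left open in the sketch.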
4.1 Dataset Description
Table 2 (excerpt): hyperparameter values.
| Feed forward dim. | 2048 | 2048 | 1024 | 128 |
| Dropout (Section 3.1) | N/A | 0.5 | 0.2 | 0.2 |
We used the German-English (De-En) parallel corpora from the IWSLT14 (cettolo2014report) and WMT17 (bojar2017results) shared translation tasks to evaluate the performance of our proposed method. Table 1 reports the number of parallel sentences in the training, validation and test splits of the datasets considered in our experiments. To compare with (iida2019attention), we used the WMT17 De-En training corpus as the training set and newstest13 as the validation set. Similar to (iida2019attention), we concatenated newstest14 and newstest17 to make one test set. We refer to the WMT17 dataset with this modified test set as the WMT17-CS dataset.
To assess the performance of our method on low-resource language pairs, we used the Hindi-English (Hi-En) parallel corpus from HindEnCorp0.5 (11858/00-097C-0000-0023-625F-0). The HindEnCorp0.5 dataset contains 270K sentence pairs, out of which we randomly sampled 7K sentence pairs each for the validation and test sets, and used the remaining sentence pairs as the training set. We also created smaller training sets from the complete IWSLT14 training set by randomly sampling 10K, 20K, 30K, 40K, 80K, 120K and 160K sentence pairs from the full training data. The validation and test sets were the same across all of these training sets. Finally, we evaluated our method on an extremely low-resource language pair, using the Belarusian-English (Be-En) parallel corpus from TED talks (qi2018and), which contains only 4.5K parallel sentences in the training set.
We used the Moses toolkit (koehn2007moses) to tokenize German, Belarusian and English sentences, and the IndicNLP Library to tokenize Hindi sentences. For open-vocabulary translation, we segmented words using byte-pair encoding (BPE; https://github.com/rsennrich/subword-nmt) (sennrich2015neural). For the Be-En parallel corpus, we learned 5K merge operations for Be and En separately. For the other datasets, we combined the source and target sentences of the training set for learning BPE. We learned 10K merge operations for the IWSLT14 dataset, and 20K merge operations for the other datasets.
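The random splitting and subsampling described above could be done along the following lines. This is a sketch on a toy corpus scaled down by a factor of 1000; the function names, the seed, and the choice to sample each subset independently (rather than as nested subsets) are assumptions, since the sampling procedure is not specified in more detail.

```python
import random

def split_corpus(pairs, n_valid, n_test, seed=0):
    """Shuffle once, then carve out disjoint validation, test and training sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

def sample_subsets(train, sizes, seed=0):
    """Draw one random training subset per requested size (without replacement)."""
    rng = random.Random(seed)
    return {n: rng.sample(train, n) for n in sizes}

# Toy stand-in for a 270K-pair corpus with 7K validation and 7K test pairs.
corpus = [(f"src {i}", f"tgt {i}") for i in range(270)]
train, valid, test = split_corpus(corpus, n_valid=7, n_test=7)
subsets = sample_subsets(train, sizes=[10, 20, 40, 80])
```

Fixing the seed keeps the validation and test sets identical across runs, which matches the requirement that they stay the same for all training set sizes.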
4.2 Hyperparameter Optimization
The transformer model has a large number of hyperparameters, and hence the number of combinations of possible values for these hyperparameters is exponential. Therefore, much of the previous work uses the default hyperparameters (e.g., (gu2018meta; aharoni2019massively)), even when the language pairs differ from the pairs originally used to determine the default values. However, different languages have different characteristics, and hyperparameters tuned for one language pair might not yield optimal performance for another. Furthermore, the amount of data available for training also affects the choice of hyperparameters. Hence, for each language pair, we perform extensive hyperparameter tuning to get better performance. Since the number of combinations is exponential, a full grid search is computationally very intensive, and random search is not guaranteed to yield optimal hyperparameters. Hence, we perform the hyperparameter search over different values for a subset of hyperparameters. We mainly tune two types of hyperparameters: architecture hyperparameters (e.g., number of attention heads, feed-forward dimension) and regularization hyperparameters (e.g., dropout, attention dropout, activation dropout, label smoothing). The remaining hyperparameters, such as the word embedding size and the number of layers for both encoder and decoder, are set to their default values (similar to (vaswani2017attention)) and kept constant throughout the search. We first tune the architecture hyperparameters while keeping the regularization hyperparameters at their default values. Next, we tune the regularization hyperparameters using the optimal values of the architecture hyperparameters. Since we consider only a small subset of hyperparameters, the number of combinations is not exponential, and hence we are able to use grid search to tune them.
The optimal hyperparameters are the ones that correspond to the minimum loss on the validation set. We also use early stopping (described in Section 4.3) to prevent our model from overfitting. Although our hyperparameter tuning method does not guarantee a global optimum, we observe a substantial improvement over the default hyperparameters in our experiments (Section 5). The default hyperparameter values and the optimal values obtained using our hyperparameter search are reported in Table 2.
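The two-stage procedure can be sketched as follows. The search spaces, default values and the `toy_validation_loss` function are made up for illustration; in practice `evaluate` would train a model with the given configuration and return its validation loss.

```python
from itertools import product

def grid_search(space, fixed, evaluate):
    """Try every combination in `space` (name -> candidate values) on top of `fixed`."""
    best_cfg, best_loss = None, float("inf")
    names = list(space)
    for values in product(*(space[n] for n in names)):
        cfg = {**fixed, **dict(zip(names, values))}
        loss = evaluate(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

def two_stage_search(arch_space, reg_space, defaults, evaluate):
    """Stage 1: tune architecture with default regularization.
    Stage 2: tune regularization with the best architecture held fixed."""
    best_arch, _ = grid_search(arch_space, defaults, evaluate)
    return grid_search(reg_space, best_arch, evaluate)

def toy_validation_loss(cfg):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (abs(cfg["heads"] - 4)
            + abs(cfg["ffn_dim"] - 1024) / 1024
            + abs(cfg["dropout"] - 0.3))

defaults = {"heads": 8, "ffn_dim": 2048, "dropout": 0.1, "label_smoothing": 0.1}
arch_space = {"heads": [2, 4, 8], "ffn_dim": [512, 1024, 2048]}
reg_space = {"dropout": [0.1, 0.3, 0.5]}
best_cfg, best_loss = two_stage_search(arch_space, reg_space, defaults, toy_validation_loss)
```

In this toy setting the staged search evaluates 9 + 3 = 12 configurations instead of the 27 that a joint grid over the same values would require, at the cost of not exploring interactions between the two groups.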
4.3 Experimental Setup and Baselines
We consider the standard Transformer-base model (vaswani2017attention) as a baseline, and we used the fairseq toolkit (ott2019fairseq) for implementation. We also analyzed the effect of applying our proposed approach, DHICM, to different layers of both the encoder and decoder of the transformer model, and observed that applying the second-level attention at the last layer of both encoder and decoder yields the best score.
We refer to the hyperparameters reported for the standard Transformer-base model (vaswani2017attention) as the Default Hyperparameters, and to those obtained using our hyperparameter search described in Section 4.2 as the Optimal Hyperparameters. We trained all models on 4 Nvidia GeForce RTX 2080 Ti GPUs. The number of encoder and decoder layers was set to 6, the number of tokens per batch was set to 8000, and the word embedding dimension was set to 512. We used the Adam optimizer (kingma2014adam) and an inverse square root learning rate scheduler with 4000 warmup steps, and used beam search with a beam size of 5 for generating the sentences. Our proposed approach adds two hyperparameters: the weight of the additional loss term (described in Section 3.1) and a dropout in the second-level attention (described in Section 3.1). The optimal values for the added dropout are provided in Table 2, and we set the weight of the additional loss term to 0.1 for all experiments, corresponding to the minimum loss on the validation set. We save model checkpoints after every epoch and select the best checkpoint based on the lowest validation loss. To minimize overfitting, we stop training if the validation loss does not decrease for 10 consecutive epochs.
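For reference, the learning rate schedule and the stopping rule can be sketched as below. This is a generic form of an inverse square root schedule with linear warmup, not fairseq's exact implementation, and `peak_lr` is a hypothetical parameter because no value is assumed here. The warmup length of 4000 steps and the patience of 10 epochs follow the setup above.

```python
import math

def inverse_sqrt_lr(step, peak_lr, warmup_steps=4000):
    """Linear warmup to peak_lr, then decay proportional to 1 / sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

class EarlyStopping:
    """Track the best validation loss and stop after `patience` epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record this epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stops after the third epoch.
val_losses = [3.0, 2.5, 2.4] + [2.45] * 12
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

The checkpoint to keep is the one from `stopper.best_epoch`, which corresponds to selecting the checkpoint with the lowest validation loss.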
For training the models on the smaller training sets randomly sampled from the full IWSLT14 training set (Sec. 4.1), we used the optimal hyperparameters found using the full IWSLT14 training set. We used BLEU (papineni2002bleu) as the evaluation metric to compare the performance of our approach with two versions of the baseline model: (i) T-base, the Transformer-base model trained using the Default hyperparameters, and (ii) T-optimal, the Transformer-base model trained using the Optimal hyperparameters (Sec. 4.2). Note that in all our experiments, the hyperparameters for T-optimal and DHICM are the same.
Table 3 shows the performance of the different methods. We observe that T-optimal outperforms T-base, which demonstrates that the optimal hyperparameters found in our extensive hyperparameter search yield higher performance than the default hyperparameters in (vaswani2017attention). DHICM achieves a still higher BLEU score, outperforming T-optimal on the HindEnCorp and WMT17-CS datasets by 3.45 and 1.12 BLEU points, respectively. We also performed an experiment on the extremely low-resource language pair Be-En, where T-base achieved a BLEU score of 4.09 and T-optimal achieved 5.49. Thus, T-optimal outperformed T-base by 1.4 BLEU points. Moreover, DHICM achieved a BLEU score of 6.29, outperforming T-optimal by 0.8 BLEU points. We also compared the performance of our method with the multi-hop multi-head attention model (iida2019attention) on the WMT17-CS De-En dataset, and observed that DHICM outperforms (iida2019attention) by 1.77 BLEU points.
Table 4 shows the BLEU scores achieved by the models trained on the smaller training sets that are randomly sampled from the full IWSLT14 training set. The performance of all methods increases with the training set size, and DHICM achieves much higher performance than T-base for all training set sizes. T-optimal and DHICM perform similarly on the larger training sets; on the smaller, low-resource training sets, however, our approach outperforms T-optimal by a large margin.
Since the hyperparameters for T-optimal and DHICM are the same, the gain in the performance of our method is due to the proposed second-level attention over the multi-head attention. In addition, our proposed loss function (Section 3.1) prevents the model from assigning the same importance to all heads. Thus, we are able to filter the more important heads.
Our proposed approach, DHICM, outperforms T-base and T-optimal by a large margin in low-resource conditions. We further analyzed the performance of the baseline model and DHICM, and observed that DHICM learns better word alignments, especially in low-resource conditions. One reason for the better alignment may be that, for each word, not all heads are equally important. The second-level attention that we designed allows the tokens to pay more attention to the heads that capture more relevant information for translation. Since the more relevant heads receive more attention, the parts of the input to which these heads attend in turn receive more attention, and thus the alignment improves. For example, the model can pay more attention to heads that capture syntactic or semantic information and relatively less attention to heads that capture positional information. This supports our hypothesis mentioned in Section 3.1.
We also verified this using the encoder-decoder attention distributions of the models, shown in Figure 1 (low-resource conditions) and Figure 2 (high-resource conditions). The decoder of the transformer model uses the outputs of the encoder to generate the tokens in the target language. Each generated token pays some attention to each token in the source language. The attention distribution matrix shows the attention paid by the generated tokens in the target sentence (rows) to the tokens in the source sentence (columns). In Figure 1(a) and Figure 2(a), we can see that most of the tokens on the source side get similar attention for the baseline approach. Moreover, the highest attention a source token receives is approximately 0.12 and 0.5 in Figure 1(a) and Figure 2(a), respectively. This implies that the most important source token for translation does not receive enough attention, resulting in poor word alignment. In contrast, for DHICM (Figure 1(b) and Figure 2(b)), we observe a large variance in the distribution of the attention paid by a target token to the source tokens. Thus, more appropriate source tokens receive higher attention scores in DHICM, leading to better word alignment, as shown in both Figure 1(b) and Figure 2(b). Also, when 160K training sentences are used for IWSLT14, although the performance of the baseline and DHICM is similar, DHICM learns better word alignments than the baseline (shown in Figure 2), as DHICM helps the model pay more attention to the more relevant source tokens. Moreover, DHICM allows the model to pay higher attention to the appropriate source words than the baseline model does. This shows that in both low-resource and high-resource conditions, DHICM helps the model pay higher attention to the more relevant source tokens.
We also analyzed the additional attention layer introduced in DHICM. Using the second-level attention, we compute the attention paid by each token to all the heads and plot these values as an attention distribution matrix. Figure 3 shows the attention distribution for the second-level attention added on top of the multi-head self-attention in the last layer of the encoder. The attention distribution matrix shows the attention paid by each source token (rows) to each of the 4 heads (columns). The distribution shows that each token pays a different amount of attention to each head, which supports our hypothesis that not all heads are equally important. Also, different tokens pay different amounts of attention to a particular head, which supports our hypothesis that the importance of a head is dynamic in nature, i.e., it varies as the input token changes. The attention distribution matrix also shows that the additional loss term indeed allows the model to compute different importance scores for different heads. In Figure 3, we can see that the second head gets the least attention from all the tokens. This shows that our proposed method identifies the least important heads, and thus, by incorporating DHICM, an appropriate pruning strategy can be developed to prune the least important heads.
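As a simple illustration of how such a matrix could be read, the sketch below averages the importance scores per head over all tokens and reports the head with the lowest mean as a pruning candidate. The scores are made up and are not taken from Figure 3, and averaging over tokens is only one hypothetical aggregation; it is not a pruning strategy proposed in this work.

```python
def mean_head_importance(attn):
    """Average each column of a tokens x heads matrix of importance scores."""
    n_tokens = len(attn)
    n_heads = len(attn[0])
    return [sum(row[h] for row in attn) / n_tokens for h in range(n_heads)]

def least_important_head(attn):
    """Index of the head with the lowest mean importance across tokens."""
    means = mean_head_importance(attn)
    return min(range(len(means)), key=means.__getitem__)

# Made-up scores for 3 tokens and 4 heads; each row sums to 1.
attn = [
    [0.40, 0.05, 0.30, 0.25],
    [0.20, 0.10, 0.45, 0.25],
    [0.35, 0.05, 0.20, 0.40],
]
candidate = least_important_head(attn)  # zero-based index of the candidate head
```

In this toy matrix the second head (index 1) receives the least attention from every token, mirroring the qualitative pattern described for Figure 3.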
7 Related Work
Some recent work has shown that most of the heads in a multi-head attention model become redundant during test time (michel2019sixteen). (voita2018context; voita2019analyzing) analyzed the heads in a multi-head attention model, based on some importance score that is calculated after the model is fully trained. In contrast, in this work, we propose to calculate the importance scores dynamically while training.
A recent work (iida2019attention) proposed to apply attention on top of the output of multi-head attention. However, they apply the additional attention layer only on the encoder, whereas in our proposed method we apply the second-level attention on both encoder and decoder, which helps the generated target words pay significant attention to appropriate source words and in turn improves the encoder-decoder attention distribution, as shown in Figure 1(b). Moreover, their approach might learn equal attention weights for the additional attention layer, which would make all the heads equally important. In such a case, their approach would perform similarly to the transformer-base model, despite adding more parameters than the standard transformer. To address this, we add an extra loss term in our method that penalizes learning similar weights for the second-level attention. This helps our method compute different importance scores for different heads. Furthermore, when calculating the final attention, they transform the output of each head using a different transformation matrix for each head, while our proposed approach, DHICM, uses a single transformation matrix for the outputs of all heads. Thus, DHICM learns fewer additional parameters while achieving better performance (their approach learns 550K additional parameters, whereas DHICM learns 500K).
8 Conclusion and Future Work
In this work, we proposed an effective Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of different heads during training. Our idea is to calculate the importance with an additional attention layer on top of the standard multi-head attention. We also proposed a loss function that prevents our method from computing equal importance for all heads and, together with the second-level attention, makes it possible to dynamically identify the heads that are most important to the input word. As a result, the generated target words pay significantly higher attention to the more relevant source words. We also performed extensive hyperparameter tuning on a subset of hyperparameters, and observed that the optimal hyperparameters obtained from our search yield a much higher BLEU score than the default hyperparameters. Experiments on multiple translation tasks show that DHICM outperforms the standard transformer model by a large margin, especially in low-resource settings. In the future, we will use the importance scores computed by DHICM to implement a strategy for pruning the less important heads. We would also like to explore further ways of reducing redundancy in multi-head attention.