
Multi-Head Attention with Disagreement Regularization

Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.





1 Introduction

Attention models are now a standard component of deep learning networks, contributing to impressive results in neural machine translation Bahdanau et al. (2015); Luong et al. (2015), image captioning Xu et al. (2015), and speech recognition Chorowski et al. (2015), among many other applications. Recently, Vaswani et al. (2017) introduced a multi-head attention mechanism to capture different contexts with multiple individual attention functions.

One strong point of multi-head attention is the ability to jointly attend to information from different representation subspaces at different positions. However, there is no mechanism to guarantee that different attention heads indeed capture distinct features. In response to this problem, we introduce a disagreement regularization term to explicitly encourage the diversity among multiple attention heads. The disagreement regularization serves as an auxiliary objective to guide the training of the related attention component.

Specifically, we propose three types of disagreement regularization, applied to the three key components involved in computing the feature vector with multi-head attention. Two of the terms maximize the cosine distances of the input subspaces and of the output representations, respectively, while the third disperses the positions attended by multiple heads using element-wise multiplication of the corresponding attention matrices. The three regularization terms can be used either individually or in combination.

We validate our approach on top of the advanced Transformer model Vaswani et al. (2017) for both English⇒German and Chinese⇒English translation tasks. Experimental results show that our approach consistently improves translation performance across language pairs. One encouraging finding is that Transformer-Base with disagreement regularization achieves performance comparable to Transformer-Big while training nearly twice as fast.

2 Background: Multi-Head Attention

Figure 1: Illustration of the multi-head attention, which jointly attends to different representation subspaces (colored boxes) at different positions (darker color denotes higher attention probability).

The attention mechanism aims to model the strength of relevance between representation pairs, such that a representation is allowed to build a direct relation with another representation. Instead of performing a single attention function, Vaswani et al. (2017) found it beneficial to capture different contexts with multiple individual attention functions. Figure 1 shows an example of a two-head attention model. For the query word “Bush”, the green and red heads pay attention to the different positions “talk” and “Sharon”, respectively.

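An attention function softly maps a sequence of queries $Q$ and a set of key-value pairs $\{K, V\}$ to outputs. More specifically, the multi-head attention model first transforms $Q$, $K$, and $V$ into $H$ subspaces with different, learnable linear projections, namely:

$$Q^h, K^h, V^h = Q W_h^Q, \; K W_h^K, \; V W_h^V$$

where $\{Q^h, K^h, V^h\}$ are respectively the query, key, and value representations of the $h$-th head. $\{W_h^Q, W_h^K, W_h^V\} \in \mathbb{R}^{d \times d_k}$ denote parameter matrices, and $d$ and $d_k$ represent the dimensionality of the model and of its subspace. Furthermore, $H$ attention functions are applied in parallel to produce the output states $\{O^1, \dots, O^H\}$, among them:

$$O^h = A^h V^h \quad \text{with} \quad A^h = \mathrm{softmax}\!\left(\frac{Q^h {K^h}^{\top}}{\sqrt{d_k}}\right)$$

Here $A^h$ is the attention distribution produced by the $h$-th attention head. Finally, the output states are concatenated to produce the final state.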
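
The computation described in this section can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function and variable names are hypothetical, the scaling by the square root of the subspace size follows the standard scaled dot-product attention of Vaswani et al. (2017), and the final output projection of the full Transformer is omitted because the text only describes concatenation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v):
    """Q: (n_q, d); K, V: (n_k, d); W_*: (H, d, d_k) per-head projections.

    Returns the concatenated output (n_q, H * d_k), the attention
    distributions A (H, n_q, n_k), the projected values V_h (H, n_k, d_k)
    and the per-head outputs O (H, n_q, d_k).
    """
    d_k = W_q.shape[-1]
    # Project the inputs into H subspaces, one per head.
    Q_h = np.einsum("nd,hdk->hnk", Q, W_q)
    K_h = np.einsum("nd,hdk->hnk", K, W_k)
    V_h = np.einsum("nd,hdk->hnk", V, W_v)
    # Scaled dot-product attention, computed for all heads in parallel.
    scores = Q_h @ K_h.transpose(0, 2, 1) / np.sqrt(d_k)
    A = softmax(scores, axis=-1)
    O = A @ V_h
    # Concatenate the head outputs along the feature dimension.
    concat = O.transpose(1, 0, 2).reshape(Q.shape[0], -1)
    return concat, A, V_h, O

# Toy self-attention example with assumed sizes (not the paper's settings).
rng = np.random.default_rng(0)
H, d, d_k, n = 4, 16, 4, 5
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(H, d, d_k)) for _ in range(3))
out, A, V_h, O = multi_head_attention(X, X, X, W_q, W_k, W_v)
print(out.shape, A.shape)  # (5, 16) (4, 5, 5)
```

The three returned intermediates (`V_h`, `A`, `O`) correspond to the subspaces, attended positions, and outputs on which a regularizer could act.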

3 Approach

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. To further guarantee the diversity, we enlarge the distances among multiple attention heads with disagreement regularization (Section 3.1). Specifically, we propose three types of disagreement regularization to encourage each head vector to be different from other heads (Section 3.2).

3.1 Framework

In this work, we take machine translation as the application. Given a source sentence $\mathbf{x}$ and its translation $\mathbf{y}$, a neural machine translation model is trained to maximize the conditional translation probability over a parallel training corpus.

We introduce an auxiliary regularization term in order to encourage the diversity among multiple attention heads. Formally, the training objective is revised as:

$$J(\theta) = \arg\max_{\theta} \big\{ L(\mathbf{y} \mid \mathbf{x}; \theta) + \lambda \, D(\mathbf{a} \mid \mathbf{x}, \mathbf{y}; \theta) \big\}$$

where $L(\cdot)$ is the likelihood term, $D(\cdot)$ is the disagreement term, $\mathbf{a}$ denotes the referred attention matrices, and $\lambda$ is a hyper-parameter that is empirically set to 1.0 in this paper. The auxiliary regularization term guides the related attention component to capture different features from the corresponding projected subspaces.

Note that the introduced regularization term works like $L_1$ and $L_2$ terms, which do not introduce any new parameters and only influence the training of the standard model parameters.
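
As a small sketch of how such an objective is typically used in practice, the maximized quantity (log-likelihood plus weighted disagreement) can be turned into a loss to minimize by negating it. The function name and the numbers below are hypothetical; only the weight of 1.0 comes from the text.

```python
def regularized_loss(log_likelihood, disagreement, lam=1.0):
    """Loss to minimize: the negative of (log-likelihood + lam * disagreement).

    No new parameters are introduced; the extra term only changes the
    gradients that reach the existing attention parameters.
    """
    return -(log_likelihood + lam * disagreement)

# Hypothetical per-batch values: a higher (less negative) disagreement
# score lowers the loss for the same log-likelihood.
print(regularized_loss(log_likelihood=-2.30, disagreement=-0.125))
print(regularized_loss(log_likelihood=-2.30, disagreement=0.0))
```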

3.2 Disagreement Regularization

Three types of regularization term, which are applied to three parts of the original multi-head attention, are introduced in this section.

Disagreement on Subspaces (Sub.)

This disagreement is designed to maximize the cosine distance between the projected values. Specifically, we first calculate the cosine similarity $\cos(\cdot)$ between the vector pair $V^i$ and $V^j$ in different value subspaces through the dot product of the normalized vectors, which measures the cosine of the angle between $V^i$ and $V^j$. (Footnote 1: We did not employ the Euclidean distance between vectors since we do not care about the absolute values in each vector.) Thus, the cosine distance is defined as the negative similarity, i.e., $-\cos(\cdot)$. Our training objective is to enlarge the average cosine distance among all head pairs. The regularization term is formally expressed as:

$$D_{\text{subspace}} = -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H} \frac{V^i \cdot V^j}{\lVert V^i \rVert \, \lVert V^j \rVert}$$
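
A cosine-based disagreement term of this kind can be sketched in NumPy as below. The sketch assumes each head is summarized by a single vector and that the average runs over all H×H ordered head pairs, self-pairs included; both are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def cosine_disagreement(heads, eps=1e-8):
    """heads: (H, dim), one vector per head.

    Returns minus the mean cosine similarity over all H*H ordered head
    pairs (self-pairs included); larger values mean more disagreement.
    """
    # Normalize each head vector; eps guards against division by zero.
    unit = heads / (np.linalg.norm(heads, axis=-1, keepdims=True) + eps)
    cos = unit @ unit.T  # (H, H) pairwise cosine similarities
    return -cos.mean()

identical = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (4, 1))
orthogonal = np.eye(4)
print(cosine_disagreement(identical))   # close to -1: all heads agree
print(cosine_disagreement(orthogonal))  # about -0.25: only self-pairs contribute
```

The same function applies unchanged to per-head output vectors, since the output-level term uses the same negative cosine similarity.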


Disagreement on Attended Positions (Pos.)

Another strategy is to disperse the attended positions predicted by multiple heads. Inspired by the agreement regularization Liang et al. (2006); Cheng et al. (2016), which encourages multiple alignments to be similar, we deploy a variant of the original term by introducing an alignment disagreement regularization. Formally, we employ the sum of the element-wise multiplication of corresponding matrix cells to measure the similarity between the matrices $A^i$ and $A^j$ of two heads:

$$D_{\text{position}} = -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H} \lvert A^i \odot A^j \rvert$$

(Footnote 2: We also used the squared element-wise subtraction of two matrices in our preliminary experiments, and found that it underperforms its multiplication counterpart, which is consistent with the results in Cheng et al. (2016).)
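
The position-level measure can be illustrated with a short NumPy sketch. It assumes the similarity of two heads is the sum of the element-wise product of their attention matrices, averaged over all H×H ordered head pairs and negated; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def position_disagreement(A):
    """A: (H, n_q, n_k) attention distributions, each row summing to 1.

    Returns minus the mean, over all H*H head pairs, of the summed
    element-wise product of the two attention matrices.
    """
    H = A.shape[0]
    # overlap[i, j] = sum over all cells of A[i] * A[j]
    overlap = np.einsum("iqk,jqk->ij", A, A)
    return -overlap.sum() / H**2

# Two heads that attend to exactly the same positions.
same = np.tile(np.eye(3), (2, 1, 1))
# Two heads that attend to disjoint positions.
spread = np.stack([np.eye(3), np.roll(np.eye(3), 1, axis=1)])
print(position_disagreement(same))    # -3.0
print(position_disagreement(spread))  # -1.5
```

Heads that place their probability mass on different positions have a smaller overlap, so the negated score is higher.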


Disagreement on Outputs (Out.)

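This disagreement directly applies regularization on the outputs of each attention head by maximizing the difference among them. Similar to the subspace strategy, we employ negative cosine similarity to measure the distance:

$$D_{\text{output}} = -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H} \frac{O^i \cdot O^j}{\lVert O^i \rVert \, \lVert O^j \rVert}$$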


4 Related Work

The regularization on attended positions is inspired by agreement learning in prior work, which encourages alignments or hidden variables of multiple models to be similar. Liang et al. (2006) first assigned agreement terms for jointly training word alignment in phrase-based statistical machine translation Koehn et al. (2003). The idea was further extended to other natural language processing tasks such as grammar induction Liang et al. (2008). Levinboim et al. (2015) extended the agreement for general bidirectional sequence alignment models with model invertibility regularization. Cheng et al. (2016) further explored the agreement on modeling the source-target and target-source alignments in neural machine translation models. In contrast to the mentioned approaches, which add agreement terms to the loss function, we deploy an alignment disagreement regularization by maximizing the distance among multiple attention heads.

As the standard multi-head attention model lacks effective control over the influence of different attention heads, Ahmed et al. (2017) used a weighted mechanism to combine them rather than simple concatenation. As an alternative approach to multi-head attention, Shen et al. (2018) extended the single relevance score to multi-dimensional attention weights, demonstrating the effectiveness of modeling multiple features for attention networks. Our approach is complementary to theirs: our model encourages diversity among multiple heads, while theirs enhances the power of each head.

5 Experiments

5.1 Setup

To compare with the results reported by previous work Gehring et al. (2017); Vaswani et al. (2017); Hassan et al. (2018), we conduct experiments on both the WMT2017 Chinese⇒English (Zh⇒En) and WMT2014 English⇒German (En⇒De) translation tasks. The Zh⇒En corpus consists of 20M sentence pairs, and the En⇒De corpus consists of 4M sentence pairs. We follow previous work to select the validation and test sets. Byte-pair encoding (BPE) Sennrich et al. (2016) with 32K merge operations is employed for both language pairs to alleviate the out-of-vocabulary problem. We use the case-sensitive 4-gram NIST BLEU score Papineni et al. (2002) as the evaluation metric, and the sign-test Collins et al. (2005) for statistical significance testing.

We evaluate the proposed approaches on the advanced Transformer model Vaswani et al. (2017), implemented on top of the open-source toolkit THUMT Zhang et al. (2017). We follow Vaswani et al. (2017) to set the configurations and have reproduced their reported results on the En⇒De task. All evaluations are conducted on the test sets. We have tested both Base and Big models, which differ in hidden size (512 vs. 1024) and number of attention heads (8 vs. 16). We study model variations with the Base model on the Zh⇒En task (Sections 5.2 and 5.3), and evaluate overall performance with the Big model on both the Zh⇒En and En⇒De tasks (Section 5.4).

| # | Regularization (Sub. / Pos. / Out.) | Speed | BLEU |
|---|---|---|---|
| 1 | × / × / × | 1.21 | 24.13 |
| 2 | ✓ / × / × | 1.15 | 24.64 |
| 3 | × / ✓ / × | 1.14 | 24.42 |
| 4 | × / × / ✓ | 1.15 | 24.78 |
| 5 | two ✓, one × | 1.12 | 24.73 |
| 6 | two ✓, one × | 1.11 | 24.38 |
| 7 | ✓ / ✓ / ✓ | 1.05 | 24.60 |

Table 1: Effect of regularization terms, which are applied to the encoder self-attention only. “Speed” denotes the training speed (steps/second).

| System | Architecture | Zh⇒En Speed | Zh⇒En BLEU | En⇒De Speed | En⇒De BLEU |
|---|---|---|---|---|---|
| Existing NMT systems | | | | | |
| Wu et al. (2016) | GNMT | n/a | n/a | n/a | 26.30 |
| Gehring et al. (2017) | ConvS2S | n/a | n/a | n/a | 26.36 |
| Vaswani et al. (2017) | Transformer-Base | n/a | n/a | n/a | 27.3 |
| | Transformer-Big | n/a | n/a | n/a | 28.4 |
| Hassan et al. (2018) | Transformer-Big | n/a | 24.2 | n/a | n/a |
| Our NMT systems | | | | | |
| this work | Transformer-Base | 1.21 | 24.13 | 1.28 | 27.64 |
| | + Disagreement | 1.06 | 24.85 | 1.10 | 28.51 |
| | Transformer-Big | 0.58 | 24.56 | 0.61 | 28.58 |
| | + Disagreement | 0.47 | 25.08 | 0.51 | 29.28 |

Table 3: Comparing with existing NMT systems on WMT17 Chinese⇒English and WMT14 English⇒German translation tasks. “” indicates that the model is significantly better than its baseline counterpart ().

5.2 Effect of Regularization Terms

In this section, we evaluate the impact of different regularization terms on the Zh⇒En task using Transformer-Base. For simplicity and efficiency, here we only apply the regularizations on the encoder side. As shown in Table 1, all the models with the proposed disagreement regularizations (Rows 2-4) consistently outperform the vanilla Transformer (Row 1). Among them, the Output term performs best, at +0.65 BLEU over the baseline model, while the Position term is less effective than the other two. In terms of training speed, the decrease is small (from 1.21 to 1.14-1.15 steps/second), which in turn demonstrates the advantage of our disagreement regularizations.

However, combinations of different disagreement regularizations fail to further improve translation performance (Rows 5-7). One possible reason is that the different regularization terms provide overlapping guidance, so combining them introduces little new information while making training more difficult.

| Applying to (Enc / E-D / Dec) | Speed | BLEU |
|---|---|---|
| × / × / × | 1.21 | 24.13 |
| ✓ / × / × | 1.15 | 24.78 |
| two ✓, one × | 1.10 | 24.67 |
| two ✓, one × | 1.11 | 24.69 |
| ✓ / ✓ / ✓ | 1.06 | 24.85 |

Table 2: Effect of regularization on different attention networks, i.e., encoder self-attention (“Enc”), encoder-decoder attention (“E-D”), and decoder self-attention (“Dec”).

5.3 Effect on Different Attention Networks

The Transformer consists of three attention networks, including encoder self-attention, decoder self-attention, and encoder-decoder attention. In this experiment, we investigate how each attention network benefits from the disagreement regularization. As seen from Table 2, all models consistently improve upon the baseline model. When applying disagreement regularization to all three attention networks, we achieve the best performance, which is +0.72 BLEU score better than the baseline model. The training speed decreases by 12%, which is acceptable considering the performance improvement.

5.4 Main Results

Finally, we validate the proposed disagreement regularization on both the WMT17 Chinese-to-English and WMT14 English-to-German translation tasks. Specifically, we adopt the Output disagreement regularization, which is applied to all three attention networks. The results are summarized in Table 3. We can see that our implementation of the Transformer outperforms all existing NMT systems and matches the Transformer results reported in previous work. Incorporating disagreement regularization consistently improves translation performance for both the Base and Big Transformer models across language pairs, demonstrating the effectiveness of the proposed approach. It is encouraging to see that Transformer-Base with disagreement regularization achieves performance comparable to Transformer-Big while training nearly twice as fast (1.06 vs. 0.58 steps/second on Zh⇒En and 1.10 vs. 0.61 on En⇒De).

| Regularization on | Disagreement on Sub. | Disagreement on Pos. | Disagreement on Out. |
|---|---|---|---|
| n/a | 0.882 | 0.007 | 0.881 |
| Subspace | 0.999 | 0.006 | 0.935 |
| Position | 0.882 | 0.219 | 0.882 |
| Output | 0.989 | 0.006 | 0.997 |

Table 4: Effect of different regularization terms on the three disagreement measurements. “n/a” denotes the baseline model without any regularization term. Larger value denotes more disagreement (at most 1.0).

| Reg. | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 |
|---|---|---|---|---|---|---|
| n/a | 0.040 | 0.009 | 0.002 | 0.003 | 0.008 | 0.006 |
| Sub. | 0.039 | 0.009 | 0.001 | 0.003 | 0.006 | 0.005 |
| Pos. | 0.217 | 0.167 | 0.219 | 0.242 | 0.233 | 0.249 |
| Out. | 0.048 | 0.009 | 0.002 | 0.003 | 0.008 | 0.006 |

Table 5: Disagreement on attended positions with respect to the levels of the encoder layers.

5.5 Quantitative Analysis of Regularization

In this section, we empirically investigate how the regularization terms affect the multi-head attention. To this end, we compare the disagreement scores on subspaces (“Sub.”), attended positions (“Pos.”), and outputs (“Out.”). Since the scores are negative values, we list $\exp(D)$ for readability, which has a maximum value of 1.0. Table 4 lists the results of encoder-side multi-head attention on the Zh⇒En validation set. As seen, the disagreement score on each individual component indeed increases with the corresponding regularization term. For example, the disagreement of outputs increases to almost 1.0 when using the Output regularization, which means that the output vectors are almost perpendicular to each other, as we measure the cosine distance as the disagreement.
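
To make the scale of these listed scores concrete, the following sketch computes the exponentiated score for a cosine-based disagreement under two assumptions that are made here for illustration: the average runs over all H×H ordered head pairs with self-pairs included, and all distinct head pairs share the same cosine similarity. With H = 8 heads (Transformer-Base) and mutually orthogonal heads, the arithmetic gives about 0.882, which coincides with the baseline entries in Table 4; this is an observation about the numbers, not a statement from the authors.

```python
import math

def listed_score(H, off_diagonal_cos):
    """exp(D) when all distinct head pairs share one cosine similarity.

    D = -(1 / H^2) * (H self-pairs with cosine 1
                      + H * (H - 1) pairs with the given cosine).
    """
    mean_cos = (H * 1.0 + H * (H - 1) * off_diagonal_cos) / H**2
    return math.exp(-mean_cos)

print(round(listed_score(8, 0.0), 3))  # 0.882
```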

One interesting finding is that attending to different positions may not be the essential strength of multi-head attention on the translation task. As seen, the disagreement score on the attended positions for the standard multi-head attention is only 0.007, which indicates that almost all the heads attend to the same positions. Table 5 shows the disagreement scores on attended positions across encoder layers. Except for the layer that attends to the input word embeddings, the disagreement scores on the other layers (i.e., from the 2nd to the 6th layer) are very low, which confirms our hypothesis above.

Concerning the regularization terms, except for the one on position, the other two regularization terms (i.e., “Sub.” and “Out.”) do not increase the disagreement score on the attended positions. This can explain why the positional regularization term does not work well with the other two terms, as shown in Table 1. This is also consistent with the finding in Tu et al. (2016), which indicates that neural networks can model linguistic information in their own way. In contrast to attended positions, it seems that multi-head attention prefers to encode the differences among multiple heads in the learned representations.

6 Conclusion

In this work, we propose several disagreement regularizations to augment the multi-head attention model, which encourage diversity among attention heads so that different heads can learn distinct features. Experimental results across language pairs validate the effectiveness of the proposed approaches.

The models also suggest a wide range of potential advantages and extensions, from being able to improve the performance of multi-head attention in other tasks such as reading comprehension and language inference, to being able to combine with other techniques Shaw et al. (2018); Shen et al. (2018); Dou et al. (2018); Yang et al. (2018) to further improve performance.


The work was supported by the National Natural Science Foundation of China (Project No. 61332010 and No. 61472338), the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14234416 of the General Research Fund), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative Research Award). We thank the anonymous reviewers for their insightful comments.