
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

by   Elena Voita, et al.

Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.



1 Introduction

The Transformer Vaswani et al. (2017) has become the dominant modeling paradigm in neural machine translation. It follows the encoder-decoder framework using stacked multi-head self-attention and fully connected layers. Multi-head attention was shown to make more efficient use of the model’s capacity: performance of the model with 8 heads is almost 1 BLEU point higher than that of a model of the same size with single-head attention Vaswani et al. (2017). The Transformer achieved state-of-the-art results in recent shared translation tasks Bojar et al. (2018); Niehues et al. (2018).

Despite the model’s widespread adoption and recent attempts to investigate the kinds of information learned by the model’s encoder Raganato and Tiedemann (2018), the analysis of multi-head attention and its importance for translation is challenging. Previous analysis of multi-head attention either assumed that all heads are equally important by looking at the average of attention weights over all heads at a given position or focused only on the maximum attention weights Voita et al. (2018); Tang et al. (2018). We argue that this obscures the roles played by individual heads which, as we show, influence the generated translations to differing extents.

We attempt to answer the following questions:

  • To what extent does translation quality depend on individual encoder heads?

  • Do individual encoder heads play consistent and interpretable roles? If so, which are the most important ones for translation quality?

  • Which types of model attention (encoder self-attention, decoder self-attention or decoder-encoder attention) are most sensitive to the number of attention heads and on which layers?

  • Can we significantly reduce the number of attention heads while preserving translation quality?

We start by identifying the most important heads in each encoder layer using layer-wise relevance propagation Ding et al. (2017). For heads judged to be important, we then attempt to characterize the roles they perform. We observe the following types of role: positional (heads attending to an adjacent token), syntactic (heads attending to tokens in a specific syntactic dependency relation) and attention to rare words (heads pointing to the least frequent tokens in the sentence).

To understand whether the remaining heads perform vital but less easily defined roles, or are simply redundant to the performance of the model as measured by translation quality, we introduce a method for pruning heads based on Louizos et al. (2018). While we cannot easily incorporate the number of active heads as a penalty term in our learning objective (i.e. the regularizer), we can use a differentiable relaxation. We prune attention heads in a continuous learning scenario starting from the converged full model and identify the roles of those which remain in the model. These experiments corroborate the findings of layer-wise relevance propagation; in particular, heads with clearly identifiable positional and syntactic functions are pruned last and hence shown to be most important for the translation task.

Our key findings are as follows:

  • Only a small subset of heads are important for translation;

  • Important heads have one or more specialized and interpretable functions in the model;

  • The functions correspond to attention to neighbouring words and to tokens in specific syntactic dependency relations.

2 Transformer Architecture

In this section, we briefly describe the Transformer architecture Vaswani et al. (2017) introducing the terminology used in the rest of the paper.

The Transformer is an encoder-decoder model that uses stacked self-attention and fully connected layers for both the encoder and decoder. The encoder consists of N layers, each containing two sub-layers: (a) a multi-head attention mechanism, and (b) a feed-forward network. The multi-head attention mechanism relies on scaled dot-product attention, which operates on a query Q, a key K and a value V:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (1)

where d_k is the key dimensionality.

The multi-head attention mechanism obtains h (i.e. one per head) different representations of (Q, K, V), computes scaled dot-product attention for each representation, concatenates the results, and projects the concatenation through a feed-forward layer. This can be expressed in the same notation as Equation (1):

    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

    MultiHead(Q, K, V) = Concat_i(head_i) W^O    (3)

where the W_i^Q, W_i^K, W_i^V and W^O are parameter matrices.
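The scaled dot-product and multi-head attention computations described above can be sketched in a few lines of numpy. This is a minimal single-sentence sketch under stated assumptions: the toy dimensions and random weights are placeholders (far smaller than the real model), and batching, masking and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head(x, Wq, Wk, Wv, Wo):
    # one projection triple per head; concatenate the heads, project with Wo
    heads = [attention(x @ wq, x @ wk, x @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# toy dimensions (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head(x, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 16)
```

Each head sees the full d_model-dimensional input but projects it down to d_model / h dimensions, so the total computation is roughly that of single-head attention.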

The second component of each layer of the Transformer network is a feed-forward network. The authors propose using a two-layer network with a ReLU activation.

Analogously, each layer of the decoder contains the two sub-layers mentioned above as well as an additional multi-head attention sub-layer. This additional sub-layer receives the output of the encoder as its input.

The Transformer uses multi-head attention in three different ways: encoder self-attention, decoder self-attention and decoder-encoder attention. In this work, we concentrate primarily on encoder self-attention.

3 Data and setting

We focus on English as a source language and consider three target languages: Russian, German and French. For each language pair, we use the same number of sentence pairs from WMT data to control for the amount of training data and train Transformer models with the same numbers of parameters. We use 2.5m sentence pairs, corresponding to the amount of English–Russian parallel training data (excluding UN and Paracrawl). In Section 5.2 we use the same held-out data for all language pairs; these are 50k English sentences taken from the WMT EN-FR data not used in training.

For English-Russian, we perform additional experiments using the publicly available OpenSubtitles2018 corpus Lison et al. (2018) to evaluate the impact of domain on our results.

In Section 6 we concentrate on English-Russian and two domains: WMT and OpenSubtitles.

Model hyperparameters, preprocessing and training details are provided in the supplementary materials.

Figure 1: Importance according to LRP (a), confidence (b), and function (c) of self-attention heads. In each layer, heads are sorted by their relevance according to LRP. Model trained on 6m OpenSubtitles EN-RU data.
Figure 2: Importance (according to LRP) and function of self-attention heads: (a) LRP (EN-DE), (b) head functions (EN-DE), (c) LRP (EN-FR), (d) head functions (EN-FR). In each layer, heads are sorted by their relevance according to LRP. Models trained on 2.5m WMT EN-DE (a, b) and EN-FR (c, d).

4 Identifying Important Heads

Previous work analyzing how representations are formed by the Transformer’s multi-head attention mechanism has implicitly assumed that all heads are equally important by taking either the average or the maximum attention weights over all heads Voita et al. (2018); Tang et al. (2018). We argue that this obscures the roles played by individual heads which, as we will show, influence the generated translations to differing extents.

We define the “confidence” of a head as the average of its maximum attention weight, excluding the end of sentence symbol (we exclude EOS on the grounds that it is not a real token). A confident head is one that usually assigns a high proportion of its attention to a single token. Intuitively, we might expect confident heads to be important to the translation task.
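The confidence statistic is easy to compute given per-head attention maps. Below is a hedged sketch assuming a hypothetical input format (a list of per-sentence attention weight matrices, with EOS as the last source position); the numbers are toy data, not the paper's.

```python
import numpy as np

def head_confidence(attn_maps, eos_index=-1):
    """Average max attention weight of one head, excluding the EOS column.

    attn_maps: list of (tgt_len, src_len) attention weight matrices,
    one per sentence (hypothetical input format).
    """
    maxima = []
    for a in attn_maps:
        a = np.delete(a, eos_index, axis=1)  # drop attention *to* EOS
        maxima.append(a.max(axis=1).mean())  # max over source positions
    return float(np.mean(maxima))

# toy example: a head that concentrates on one token is "confident",
# a head with uniform attention is not
sharp = [np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05]])]
flat = [np.full((2, 3), 1.0 / 3)]
print(head_confidence(sharp), head_confidence(flat))
```

A confidence near 1 means the head almost always puts nearly all of its mass on a single non-EOS token.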

Layer-wise relevance propagation (LRP) Ding et al. (2017) is a method for computing the relative contribution of neurons at one point in a network to neurons at another. Here we propose to use LRP to evaluate the degree to which different heads at each layer contribute to the top-1 logit predicted by the model. Heads whose outputs have a higher relevance value may be judged to be more important to the model’s predictions.
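LRP is defined by layer-specific redistribution rules. As an illustration only, the generic epsilon-rule for a small feed-forward network can be sketched as follows; this is not the exact set of rules Ding et al. (2017) use for attention layers, just the core idea of propagating a relevance budget backwards while (approximately) conserving its total.

```python
import numpy as np

def lrp_linear(a, w, r_out, eps=1e-6):
    """Epsilon-rule LRP for one linear layer z = a @ w.

    a: input activations (n_in,), w: weights (n_in, n_out),
    r_out: relevance of the outputs (n_out,). Returns input relevance.
    """
    z = a @ w                         # pre-activations
    denom = z + eps * np.sign(z)      # stabilised denominator
    s = r_out / denom                 # relevance per unit of activation
    return a * (w @ s)                # collect each input's contribution

rng = np.random.default_rng(1)
a0 = rng.normal(size=4)
w1 = rng.normal(size=(4, 3))
a1 = np.maximum(a0 @ w1, 0)           # ReLU hidden layer
w2 = rng.normal(size=(3, 2))
logits = a1 @ w2

# start from the top-1 logit and propagate relevance down the network
r2 = np.zeros(2)
r2[logits.argmax()] = logits.max()
r1 = lrp_linear(a1, w2, r2)
r0 = lrp_linear(a0, w1, r1)
print(r0)  # per-input relevance for the predicted logit
```

The key property used in the paper is conservation: the relevance assigned to the top logit is redistributed over lower layers (here, over inputs), so heads can be ranked by how much of that budget flows through them.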

The results of LRP are shown in Figures 1(a), 2(a) and 2(c). In each layer, LRP ranks a small number of heads as much more important than all others.

The confidence for each head is shown in Figure 1(b). We can observe that the relevance of a head as computed by LRP agrees to a reasonable extent with its confidence. The only clear exception to this pattern is the head judged by LRP to be the most important in the first layer: despite its high relevance, its average maximum attention weight is low. We will discuss this head further in Section 5.3.

5 Characterizing heads

We now turn to investigating whether heads play consistent and interpretable roles within the model.

We examined some attention matrices paying particular attention to heads ranked highly by LRP and identified three functions which heads might be playing:

  1. positional: the head points to an adjacent token,

  2. syntactic: the head points to tokens in a specific syntactic relation,

  3. rare words: the head points to the least frequent tokens in a sentence.

Now we discuss the criteria used to determine if a head is performing one of these functions and examine properties of the corresponding heads.

5.1 Positional heads

We refer to a head as “positional” if at least 90% of the time its maximum attention weight is assigned to a specific relative position (in practice either -1 or +1).

Such heads are shown in purple in Figure 1(c) for English-Russian, Figure 2(b) for English-German and Figure 2(d) for English-French, and are marked with the relative position.

As can be seen, the positional heads correspond to a large extent to the most confident heads and the most important heads as ranked by LRP. In fact, the average maximum attention weight exceeds 0.8 for every positional head, for all language pairs considered here.
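The 90% criterion above is straightforward to check given a head's attention maps. A hedged sketch, assuming a hypothetical input format of one weight matrix per sentence:

```python
import numpy as np

def is_positional(attn_maps, offset, threshold=0.9):
    """True if the head's max attention falls at relative position `offset`
    (e.g. -1 or +1) at least `threshold` of the time, over all positions
    for which that offset exists."""
    hits, total = 0, 0
    for a in attn_maps:
        argmax = a.argmax(axis=1)             # attended position per query
        for q, k in enumerate(argmax):
            if 0 <= q + offset < a.shape[1]:  # offset target must exist
                total += 1
                hits += (k == q + offset)
    return bool(hits / total >= threshold)

# toy head that always looks one token to the left
left = np.eye(5, k=-1)
left[0, 0] = 1.0  # first token has no left neighbour; keep rows normalised
print(is_positional([left], offset=-1))
```

Positions where the offset falls outside the sentence are excluded from the count rather than treated as misses; this detail is an assumption of the sketch.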

5.2 Syntactic heads

We hypothesize that, when used to perform translation, the Transformer’s encoder may be responsible for disambiguating the syntactic structure of the source sentence. We therefore wish to know whether a head attends to tokens corresponding to any of the major syntactic relations in a sentence. In our analysis, we looked at the following dependency relations: nominal subject (nsubj), direct object (dobj), adjectival modifier (amod) and adverbial modifier (advmod). These include the main verbal arguments of a sentence and some other common relations. They also include those relations which might inform morphological agreement or government in one or more of the target languages considered here.

5.2.1 Methodology

We evaluate to what extent each head in the Transformer’s encoder accounts for a specific dependency relation by comparing its attention weights to a predicted dependency structure generated using CoreNLP Manning et al. (2014) on a large number of held-out sentences. We calculate for each head how often it assigns its maximum attention weight (excluding EOS) to a token with which it is in one of the aforementioned dependency relations. We count each relation separately and allow the relation to hold in either direction between the two tokens.

We refer to this relative frequency as the “accuracy” of a head on a specific dependency relation in a specific direction. Note that under this definition, we may evaluate the accuracy of a head for multiple dependency relations.

Many dependency relations are frequently observed in specific relative positions (for example, they often hold between adjacent tokens; see Figure 3). We say that a head is “syntactic” if its accuracy is at least 10% higher than the accuracy of a baseline that always attends to the most frequent relative position for this dependency relation.
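The accuracy and its positional baseline can be sketched as follows. The data format is hypothetical (a per-sentence map from token position to the head's most-attended position, plus parsed (dependent, head) pairs for one relation and direction); the numbers are toy data.

```python
from collections import Counter

def head_accuracy(attn_argmax, relation_pairs):
    """Fraction of relation instances where the attention head's max
    weight lands on the related token. attn_argmax: {position: attended
    position}; relation_pairs: list of (dependent_pos, head_pos) for one
    dependency relation and direction (hypothetical format)."""
    hits = sum(attn_argmax[d] == h for d, h in relation_pairs)
    return hits / len(relation_pairs)

def positional_baseline(relation_pairs):
    """Accuracy of always guessing the most frequent relative offset."""
    offsets = Counter(h - d for d, h in relation_pairs)
    _, count = offsets.most_common(1)[0]
    return count / len(relation_pairs)

# toy verb -> subject instances: the head finds 3 of 4 related tokens,
# while the best fixed offset (-1) only covers 2 of 4
pairs = [(2, 1), (5, 4), (7, 3), (9, 6)]
attn = {2: 1, 5: 4, 7: 3, 9: 2}
print(head_accuracy(attn, pairs), positional_baseline(pairs))
```

Under the paper's criterion, a head would count as syntactic for this relation only if its accuracy beats the baseline by the stated margin, which rules out heads that merely track a fixed relative position.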

Figure 3: Distribution of the relative position of dependent for different dependency relations (WMT).

5.2.2 Results

dep.     direction          best head / baseline
                            WMT        OpenSubtitles
nsubj    v → s              45 / 35    77 / 45
nsubj    s → v              52 / 35    70 / 45
dobj     v → o              78 / 41    61 / 46
dobj     o → v              73 / 41    84 / 46
amod     noun → adj.mod.    74 / 72    81 / 80
amod     adj.mod. → noun    82 / 72    81 / 80
advmod   v → adv.mod.       48 / 46    38 / 33
advmod   adv.mod. → v       52 / 46    42 / 33

Table 1: Dependency scores for EN-RU. Models trained on 2.5m WMT data and 6m OpenSubtitles data.
Figure 4: Dependency scores for EN-RU, EN-DE, EN-FR each trained on 2.5m WMT data.

Table 1 shows the accuracy of the most accurate head for each of the considered dependency relations on the two domains for English-Russian. Figure 4 compares the scores of the models trained on WMT with different target languages.

Clearly certain heads learn to detect syntactic relations with accuracies significantly higher than the baseline. This supports the hypothesis that the encoder does indeed perform some amount of syntactic disambiguation of the source sentence.

Several heads appear to be responsible for the same dependency relation. These heads are shown in green in Figures 0(c), 1(b), 1(d).

Unfortunately, it is not possible to draw any strong conclusions from these results regarding the impact of target language morphology on the accuracy of the syntactic attention heads, although relations with strong target-side morphology are among those that are most accurately learned.

Note the difference in accuracy of the verb-subject relation heads across the two domains for English-Russian. We hypothesize that this is due to the greater variety of grammatical person in the Subtitles data (the proportions of first, second and third person subjects differ considerably between the WMT and OpenSubtitles data), which requires more attention to this relation. However, we leave proper analysis of this to future work.

5.3 Rare words

In all models (EN-RU, EN-DE, EN-FR on WMT and EN-RU on OpenSubtitles), we find that one head in the first layer is judged to be much more important to the model’s predictions than any other heads in this layer.

We find that this head points to the least frequent tokens in a sentence. For models trained on OpenSubtitles, among sentences where the least frequent token in a sentence is not in the top-500 most frequent tokens, this head points to the rarest token in 66% of cases, and to one of the two least frequent tokens in 83% of cases. For models trained on WMT, this head points to one of the two least frequent tokens in more than 50% of such cases. This head is shown in orange in Figures 1(c), 2(b) and 2(d). Examples of attention maps for this head for models trained on WMT data with different target languages are shown in Figure 5.
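The rare-words statistic follows the same pattern as the previous checks; a hedged sketch with a hypothetical input format (per-sentence attention matrices and per-token corpus frequencies, toy numbers):

```python
import numpy as np

def rare_token_hits(attn_maps, sent_freqs, top_k=2):
    """Fraction of sentences in which the head's overall most-attended
    token is among the sentence's top_k least frequent tokens.
    attn_maps / sent_freqs: hypothetical per-sentence weight matrices
    and per-token corpus frequencies."""
    hits = 0
    for a, freqs in zip(attn_maps, sent_freqs):
        rarest = np.argsort(freqs)[:top_k]             # least frequent tokens
        attended = np.bincount(a.argmax(axis=1), minlength=len(freqs))
        hits += attended.argmax() in rarest            # head's favourite token
    return hits / len(attn_maps)

# toy sentence: token 2 is by far the rarest and draws the head's attention
a = np.array([[0.1, 0.1, 0.8],
              [0.2, 0.1, 0.7],
              [0.3, 0.2, 0.5]])
freqs = [1000, 500, 3]
print(rare_token_hits([a], [freqs]))
```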

Figure 5: Attention maps of the rare words head. Models trained on WMT: (a) EN-RU, (b) EN-DE, (c) EN-FR

6 Pruning Attention Heads

We have identified certain functions of the most relevant heads at each layer and showed that to a large extent they are interpretable. What of the remaining heads? Are they redundant to translation quality or do they play equally vital but simply less easily defined roles? We introduce a method for pruning attention heads to try to answer these questions. Our method is based on Louizos et al. (2018): whereas they pruned individual neural network weights, we prune entire model components (i.e. heads). We start by describing our method and then examine how performance changes as we remove heads, identifying the functions of heads retained in the sparsified models.

6.1 Method

Figure 6: Concrete distribution: (a) Concrete and its stretched and rectified version (Hard Concrete); (b) Hard Concrete distributions with different parameters.

We modify the original Transformer architecture by multiplying the representation computed by each head_i by a scalar gate g_i. Equation (3) turns into:

    MultiHead(Q, K, V) = Concat_i(g_i · head_i) W^O    (4)

Unlike usual gates, g_i are parameters specific to heads and are independent of the input (i.e. the sentence). As we would like to disable less important heads completely rather than simply downweighting them, we would ideally apply L0 regularization to the scalars g_i. The L0 norm equals the number of non-zero components and would push the model to switch off less important heads:

    L_0(g_1, …, g_h) = sum_{i=1}^{h} (1 − [[g_i = 0]]),    (5)

where h is the number of heads, and [[·]] denotes the indicator function.

Unfortunately, the L0 norm is non-differentiable and so cannot be directly incorporated as a regularization term in the objective function. Instead, we use a stochastic relaxation: each gate g_i is now a random variable drawn independently from a head-specific distribution (in training, we resample gate values for each batch). We use the Hard Concrete distributions Louizos et al. (2018), a parameterized family of mixed discrete-continuous distributions over the closed interval [0, 1]; see Figure 6(a). The distributions have non-zero probability mass at 0 and 1, P(g_i = 0 | φ_i) and P(g_i = 1 | φ_i), where φ_i are the distribution parameters. Intuitively, the Hard Concrete distribution is obtained by stretching the binary version of the Concrete (aka Gumbel-Softmax) distribution Maddison et al. (2017); Jang et al. (2017) from the original support of (0, 1) to a wider interval (γ, ζ) with γ < 0 and ζ > 1, and then collapsing the probability mass assigned to (γ, 0] and [1, ζ) to the single points 0 and 1, respectively. These stretching and rectification operations yield a mixed discrete-continuous distribution over [0, 1]. Now the sum of the probabilities of heads being non-zero can be used as a relaxation of the L0 norm:

    L_C(φ) = sum_{i=1}^{h} (1 − P(g_i = 0 | φ_i)).    (6)

The new training objective is

    L(θ, φ) = L_xent(θ, φ) + λ L_C(φ),    (7)

where θ are the parameters of the original Transformer, L_xent(θ, φ) is cross-entropy loss for the translation model, and L_C(φ) is the regularizer described above. The objective is easy to optimize: the reparameterization trick Kingma and Welling (2014); Rezende et al. (2014) can be used to backpropagate through the sampling process for each g_i, whereas the regularizer and its gradients are available in closed form. Interestingly, we observe that the model converges to solutions where gates are either almost completely closed (i.e. the head is pruned) or completely open, the latter not being explicitly encouraged (the ‘noise’ pushes the network not to use middle values; the combination of noise and rectification has been previously used to achieve discretization, e.g. Kaiser and Bengio (2018)). This means that at test time we can treat the model as a standard Transformer and use only a subset of heads.

When applying this regularizer, we start from the converged model trained without the L0 penalty (i.e. parameters θ are initialized with the parameters of the converged model) and then add the gates and continue training the full objective. By varying the coefficient λ in the optimized objective, we obtain models with different numbers of heads retained. (The code of the model will be made freely available at the time of publication.)

6.2 Pruning encoder heads

To determine which head functions are most important in the encoder and how many heads the model needs, we conduct a series of experiments with gates applied only to encoder self-attention. Here we prune a model by fine-tuning a trained model with the regularized objective. (In preliminary experiments, we observed that fine-tuning a trained model gives slightly better results, by 0.2–0.6 BLEU, than applying the regularized objective, or training a model with the same number of self-attention heads, from scratch.) During pruning, the parameters of the decoder are fixed and only the encoder parameters and head gates are fine-tuned. By not fine-tuning the decoder, we ensure that the functions of the pruned encoder heads do not migrate to the decoder.

6.2.1 Quantitative results: BLEU score

Figure 7: BLEU score as a function of number of retained encoder heads (EN-RU). Regularization applied by fine-tuning trained model.

BLEU scores are provided in Figure 7. Surprisingly, for OpenSubtitles, we lose only 0.25 BLEU when we prune all but 4 heads out of 48. For the more complex WMT task, 10 heads in the encoder are sufficient to stay within 0.15 BLEU of the full model.

6.2.2 Functions of retained heads

Results in Figure 7 suggest that the encoder remains effective even with only a few heads. In this section, we investigate the function of those heads that remain in the encoder during pruning. Figure 8 shows all heads color-coded for their function in a pruned model. Each column corresponds to a model with a particular number of heads retained after pruning. Heads from all layers are ordered by their function. Some heads can perform several functions (e.g., tracking both a syntactic dependency and a relative position); in this case the number of functions is shown.

Figure 8: Functions of encoder heads retained after pruning. Each column represents all remaining heads after varying amount of pruning (EN-RU; Subtitles).

First, we note that the model with 17 heads retains heads with all the functions that we identified in Section 5, even though 2/3 of the heads have been pruned.

This indicates that these functions are indeed the most important. Furthermore, when we have fewer heads in the model, some functions “drift” to other heads: for example, we see positional heads starting to track syntactic dependencies; hence some heads are assigned more than one color at certain stages in Figure 8.

6.3 Pruning all types of attention heads

We found our pruning technique to be efficient at reducing the number of heads in the encoder without a major drop in translation quality. Now we investigate the effect of pruning all types of attention heads in the model (not just in the encoder). This allows us to evaluate the importance of different types of attention in the model for the task of translation. In these experiments, we add gates to all multi-head attention heads in the Transformer, i.e. encoder and decoder self-attention and attention from the decoder to the encoder.

6.3.1 Quantitative results: BLEU score

Results of experiments pruning heads in all attention layers are provided in Table 2. For models trained on WMT data, we are able to prune almost 3/4 of encoder heads and more than 1/3 of heads in decoder self-attention and decoder-encoder attention without any noticeable loss in translation quality (sparse heads, row 1). We can also prune more than half of all heads in the model and lose no more than 0.25 BLEU.

While these results show clearly that the majority of attention heads can be removed from the fully trained model without significant loss in translation quality, it is not clear whether a model can be trained from scratch with such a small number of heads. In the rightmost column in Table 2 we provide BLEU scores for models trained with exactly the same number and configuration of heads in each layer as the corresponding pruned models but starting from a random initialization of parameters. Here the degradation in translation quality is more significant than for pruned models with the same number of heads.

attention heads           BLEU            BLEU
(enc / dec / dec-enc)     from trained    from scratch

WMT, 2.5m
baseline      48/48/48    29.6
sparse heads  14/31/30    29.62           29.47
              12/21/25    29.36           28.95
              8/13/15     29.06           28.56
              5/9/12      28.90           28.41

OpenSubtitles, 6m
baseline      48/48/48    32.4
sparse heads  27/31/46    32.24           32.23
              13/17/31    32.23           31.98
              6/9/13      32.27           31.84

Table 2: BLEU scores for gates in all attention types, EN-RU. The number of attention heads is provided in the following order: encoder self-attention, decoder self-attention, decoder-encoder attention.

6.3.2 Head importance

Figure 9 shows the number of retained heads for each attention type at different pruning rates. We can see that the model prefers to prune encoder self-attention heads first, while decoder-encoder attention heads appear to be the most important for both datasets. Obviously, without decoder-encoder attention no translation can happen.

The importance of decoder self-attention heads, which function primarily as a target-side language model, varies across domains. These heads appear to be almost as important as decoder-encoder attention heads for WMT data with its long sentences (24 tokens on average), and slightly more important than encoder self-attention heads for the OpenSubtitles dataset, where sentences are shorter (8 tokens on average).

Figure 9: Number of active heads of different attention type for models with different sparsity rate
Figure 10: Number of active heads in different layers of the decoder for models with different sparsity rate (EN-RU, WMT)

Figure 10 shows the number of active self-attention and decoder-encoder attention heads at different layers in the decoder for models with different sparsity rate (to reduce noise, we plot the sum of heads remaining in pairs of adjacent layers). It can be seen that self-attention heads are retained more readily in the lower layers, while decoder-encoder attention heads are retained in the higher layers. This suggests that lower layers of the Transformer’s decoder are mostly responsible for language modeling, while higher layers are mostly responsible for conditioning on the source sentence. These observations are similar for both datasets we use.

7 Related work

One popular approach to the analysis of NMT representations is to evaluate how informative they are for various linguistic tasks. Different levels of linguistic analysis have been considered including morphology Belinkov et al. (2017a); Dalvi et al. (2017); Bisazza and Tump (2018), syntax Shi et al. (2016) and semantics Hill et al. (2017); Belinkov et al. (2017b); Raganato and Tiedemann (2018).

Bisazza and Tump (2018) showed that target language determines which information gets encoded. This agrees with our results for different domains on the English-Russian translation task in Section 5.2.2. There we observed that attention heads are more likely to track syntactic relations requiring more complex agreement in the target language (in this case the subject-verb relation).

A line of work studying the ability of language models to capture hierarchical information Linzen et al. (2016); Gulordava et al. (2018) was extended by Tang et al. (2018) and Tran et al. (2018) to machine translation models.

There are several works analyzing attention weights of different NMT models Ghader and Monz (2017); Voita et al. (2018); Tang et al. (2018); Raganato and Tiedemann (2018). Raganato and Tiedemann (2018) use the self-attention weights of the Transformer’s encoder to induce a tree structure for each sentence and compute the unlabeled attachment score of these trees. However, they do not evaluate specific syntactic relations (i.e. labeled attachment scores) or consider how different heads specialize to specific dependency relations.

Recently Bau et al. (2019) proposed a method for identifying important individual neurons in NMT models. They show that similar important neurons emerge in different models. Rather than verifying the importance of individual neurons, we identify the importance of entire attention heads using layer-wise relevance propagation and verify our findings by observing which heads are retained when pruning the model.

8 Conclusions

We evaluate the contribution made by individual attention heads to Transformer model performance on translation. We use layer-wise relevance propagation to show that the relative contribution of heads varies: only a small subset of heads appear to be important for the translation task. Important heads have one or more interpretable functions in the model, including attending to adjacent words and tracking specific syntactic relations. To determine if the remaining less-interpretable heads are crucial to the model’s performance, we introduce a new approach to pruning attention heads.

We observe that specialized heads are the last to be pruned, confirming their importance directly. Moreover, the vast majority of heads, especially the encoder self-attention heads, can be removed without seriously affecting performance. In future work, we would like to investigate how our pruning method compares to alternative methods of model compression in NMT See et al. (2016).


Acknowledgments

We would like to thank the anonymous reviewers for their comments. We thank Wilker Aziz and Joost Bastings for their helpful suggestions. The authors also thank the Yandex Machine Translation team for helpful discussions and inspiration. Ivan Titov acknowledges support of the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518).


Appendix A Experimental setup

A.1 Data preprocessing

Sentences were encoded using byte-pair encoding Sennrich et al. (2016), with source and target vocabularies of about 32000 tokens. For OpenSubtitles data, we pick only sentence pairs with a relative time overlap of subtitle frames between source and target language subtitles of at least 0.9 to reduce noise in the data. Translation pairs were batched together by approximate sequence length. Each training batch contained a set of translation pairs containing approximately 16000 source tokens. (This batch size can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update.) It has been shown that the Transformer’s performance depends heavily on batch size Popel and Bojar (2018), and we chose a large batch size to ensure that models show their best performance.
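The gradient-accumulation trick mentioned above amounts to summing gradients over several small batches before making one update, which for plain averaged gradients is equivalent to a single large batch. A framework-free sketch, with a hypothetical grad_fn standing in for backpropagation:

```python
def accumulated_update(params, batches, grad_fn, lr=0.1):
    """One optimizer step using gradients summed over several batches.
    grad_fn(params, batch) -> list of per-parameter average gradients
    for that batch (hypothetical interface)."""
    total = [0.0] * len(params)
    n = 0
    for batch in batches:
        g = grad_fn(params, batch)
        # re-weight by batch size so the final result is the full-data average
        total = [t + gi * len(batch) for t, gi in zip(total, g)]
        n += len(batch)
    return [p - lr * t / n for p, t in zip(params, total)]

# toy check: mean-squared-error gradient on scalar data; one big batch and
# two accumulated half-batches give the same update
grad = lambda params, batch: [sum(2 * (params[0] - x) for x in batch) / len(batch)]
big = accumulated_update([0.0], [[1.0, 2.0, 3.0, 4.0]], grad)
small = accumulated_update([0.0], [[1.0, 2.0], [3.0, 4.0]], grad)
print(big, small)
```

Note this exact equivalence holds for plain SGD-style averaged gradients; adaptive optimizers such as Adam apply their statistics per update, so accumulation there approximates rather than exactly reproduces large-batch training.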

A.2 Model parameters

We follow the setup of the Transformer base model Vaswani et al. (2017). More precisely, the number of layers in both the encoder and the decoder is N = 6. We employ h = 8 parallel attention layers, or heads. The dimensionality of input and output is d_model = 512, and the inner layer of the feed-forward networks has dimensionality d_ff = 2048.

We use regularization as described in Vaswani et al. (2017).

A.3 Optimizer

The optimizer we use is the same as in Vaswani et al. (2017): the Adam optimizer Kingma and Ba (2015) with β1 = 0.9, β2 = 0.98 and ε = 10^{-9}. We vary the learning rate over the course of training, according to the formula:

    lrate = d_model^{-0.5} · min(step_num^{-0.5}, step_num · warmup_steps^{-1.5})

We use warmup_steps = 16000.
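The inverse-square-root schedule with linear warmup from Vaswani et al. (2017) can be sketched directly; the d_model and warmup_steps defaults below are the base-model settings as reconstructed here and should be treated as assumptions.

```python
def lrate(step, d_model=512, warmup_steps=16000):
    """Linear warmup to the peak at `warmup_steps`, then inverse-square-root
    decay (the schedule of Vaswani et al., 2017)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# the rate rises until warmup ends, peaks, then decays
print(lrate(1), lrate(16000), lrate(64000))
```

The two branches of the min() cross exactly at step = warmup_steps, which is where the learning rate peaks.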