
Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter – which controls the shape and sparsity of α-entmax – allowing attention heads to choose between focused or spread-out behavior. Compared to softmax Transformers on machine translation datasets, our adaptively sparse Transformer improves interpretability and head diversity. Our quantitative and qualitative analysis shows that heads in different layers learn different sparsity preferences, that the resulting attention distributions are more diverse than those of softmax Transformers, and that, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.





1 Introduction

The Transformer architecture (Vaswani et al., 2017) for deep neural networks has quickly risen to prominence in NLP through its efficiency and performance, leading to improvements in the state of the art of Neural Machine Translation (NMT; Junczys-Dowmunt et al., 2018; Ott et al., 2018), as well as inspiring other powerful general-purpose models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019). At the heart of the Transformer lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Mareček and Rosa, 2018; Tenney et al., 2019; Voita et al., 2019).


Figure 1: Attention distributions of different self-attention heads for the time step of the token “over”, shown to compare our model to other related work. While the sparse Transformer (Child et al., 2019) and the adaptive span Transformer (Sukhbaatar et al., 2019) only attend to words within a contiguous span of the past tokens, our model is not only able to obtain different and not necessarily contiguous sparsity patterns for each attention head, but is also able to tune its support over which tokens to attend adaptively.

The attention distribution of each head is typically predicted using the softmax normalizing transform. As a result, all context words have non-zero attention weight. Recent work on single attention architectures suggests that using sparse normalizing transforms in attention mechanisms, such as sparsemax – which can yield exactly zero probabilities for irrelevant words – may improve performance and interpretability (Malaviya et al., 2018; Deng et al., 2018; Peters et al., 2019). Qualitative analysis of attention heads (Vaswani et al., 2017, Figure 5) suggests that, depending on what phenomena they capture, heads tend to favor flatter or more peaked distributions.

Recent works have proposed sparse Transformers and adaptive span Transformers (Child et al., 2019; Sukhbaatar et al., 2019). However, the “sparsity” of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous. Figure 1 shows the relationship of these methods with ours.

Our contributions are the following:

  • We introduce sparse attention into the Transformer architecture, showing that it eases interpretability and leads to slight accuracy gains.

  • We propose an adaptive version of sparse attention, where the shape of each attention head is learnable and can vary continuously and dynamically between the dense limit case of softmax and the sparse, piecewise-linear sparsemax case. (Code and pip package are publicly available.)

  • We make an extensive analysis of the added interpretability of these models, identifying both crisper examples of attention head behavior observed in previous work, as well as novel behaviors unraveled thanks to the sparsity and adaptivity of our proposed model.

2 Background

2.1 The Transformer

In NMT, the Transformer (Vaswani et al., 2017) is a sequence-to-sequence (seq2seq) model which maps an input sequence to an output sequence through hierarchical multi-head attention mechanisms, yielding a dynamic, context-dependent strategy for propagating information within and across sentences. It contrasts with previous seq2seq models, which usually rely either on costly gated recurrent operations (often LSTMs: Bahdanau et al., 2015; Luong et al., 2015) or static convolutions (Gehring et al., 2017).

Given query contexts and sequence items under consideration, attention mechanisms compute, for each query, a weighted representation of the items. The particular attention mechanism used in Vaswani et al. (2017) is called scaled dot-product attention, and it is computed in the following way:

Att(Q, K, V) = π(Q K^T / √d) V,    (1)

where Q ∈ ℝ^{n×d} contains representations of the queries, K, V ∈ ℝ^{m×d} are the keys and values of the items attended over, and d is the dimensionality of these representations. The mapping π normalizes row-wise using softmax, π(Z)_{ij} = softmax(z_i)_j, where

softmax(z)_j = exp(z_j) / Σ_k exp(z_k).    (2)
In words, the keys are used to compute a relevance score between each item and query. Then, normalized attention weights are computed using softmax, and these are used to weight the values of each item at each query context.
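The two equations above can be sketched in a few lines of NumPy. This is an illustrative toy with made-up shapes and random inputs, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax (Equation 2).
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation 1: Att(Q, K, V) = pi(Q K^T / sqrt(d)) V, with pi = softmax.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # relevance of each item to each query
    weights = softmax(scores)       # normalized attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))         # 2 queries, dimension 4
K = rng.normal(size=(5, 4))         # 5 items attended over
V = rng.normal(size=(5, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Note that every weight in `w` is strictly positive, illustrating the denseness of softmax attention discussed above.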

However, for complex tasks, different parts of a sequence may be relevant in different ways, motivating multi-head attention in Transformers. This is simply the application of Equation 1 in parallel h times, each with a different, learned linear transformation that allows specialization:

Head_i(Q, K, V) = Att(Q W_i^Q, K W_i^K, V W_i^V),
MultiHead(Q, K, V) = Concat(Head_1, …, Head_h) W^O.    (3)
In the Transformer, there are three separate multi-head attention mechanisms for distinct purposes:

  • Encoder self-attention: builds rich, layered representations of each input word, by attending on the entire input sentence.

  • Context attention: selects a representative weighted average of the encodings of the input words, at each time step of the decoder.

  • Decoder self-attention: attends over the partial output sentence fragment produced so far.

Together, these mechanisms enable the contextualized flow of information between the input sentence and the sequential decoder.
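A minimal NumPy sketch of multi-head attention (Equation 3); the projection matrices here are random placeholders standing in for the learned parameters, and the shapes are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X_q, X_kv, Wq, Wk, Wv, Wo):
    # Equation 3: run scaled dot-product attention once per head, each with
    # its own learned projections, then concatenate and mix with W^O.
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X_q @ Wq_i, X_kv @ Wk_i, X_kv @ Wv_i
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        heads.append((e / e.sum(axis=-1, keepdims=True)) @ V)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, m, d_model, h, d_head = 3, 5, 8, 2, 4   # toy sizes, not the paper's
X_q = rng.normal(size=(n, d_model))        # query-side representations
X_kv = rng.normal(size=(m, d_model))       # item-side representations
Wq = rng.normal(size=(h, d_model, d_head))
Wk = rng.normal(size=(h, d_model, d_head))
Wv = rng.normal(size=(h, d_model, d_head))
Wo = rng.normal(size=(h * d_head, d_model))
Y = multi_head_attention(X_q, X_kv, Wq, Wk, Wv, Wo)
```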

2.2 Sparse Attention

The softmax mapping (Equation 2) is elementwise proportional to exp, so it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al., 2019). We focus on a recently-introduced flexible family of transformations, α-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:

α-entmax(z) := argmax_{p ∈ Δ^d} p^T z + H_α^T(p),    (4)

where Δ^d := {p ∈ ℝ^d : p ≥ 0, Σ_j p_j = 1} is the probability simplex, and, for α ≥ 1, H_α^T is the Tsallis continuous family of entropies (Tsallis, 1988):

H_α^T(p) := (1/(α(α−1))) Σ_j (p_j − p_j^α)  if α ≠ 1,  and  H^S(p) := −Σ_j p_j log p_j  if α = 1.    (5)

This family contains the well-known Shannon and Gini entropies, corresponding to the cases α = 1 and α = 2, respectively.

Equation 4 involves a convex optimization subproblem. Using the definition of H_α^T, the optimality conditions may be used to derive the following form for the solution (Appendix B.2):

α-entmax(z) = [(α − 1)z − τ1]_+^{1/(α−1)},    (6)

where [·]_+ is the positive part (ReLU) function, 1 denotes the vector of all ones, and τ – which acts like a threshold – is the Lagrange multiplier corresponding to the constraint Σ_i p_i = 1.

Properties of α-entmax.

The appeal of α-entmax for attention rests on the following properties. For α = 1 (i.e., when H_α^T becomes the Shannon entropy), it exactly recovers the softmax mapping (we provide a short derivation in Appendix B.3). For all α > 1 it permits sparse solutions, in stark contrast to softmax. In particular, for α = 2, it recovers the sparsemax mapping (Martins and Astudillo, 2016), which is piecewise linear. In-between, as α increases, the mapping continuously becomes sparser as its curvature changes.

To compute the value of α-entmax, one must find the threshold τ such that the r.h.s. in Equation 6 sums to one. Blondel et al. (2019) propose a general bisection algorithm. Peters et al. (2019) introduce a faster, exact algorithm for α = 1.5, and enable using α-entmax with fixed α within a neural network by showing that the α-entmax Jacobian w.r.t. z for p* = α-entmax(z) is

∂α-entmax(z)/∂z = diag(s) − (1/Σ_j s_j) s s^T,  where  s_i = (p_i*)^{2−α}.    (7)
Our work furthers the study of α-entmax by providing a derivation of the gradient w.r.t. the hyper-parameter α (Section 3), thereby allowing the shape and sparsity of the mapping to be learned automatically. This is particularly appealing in the context of multi-head attention mechanisms, where we shall show in Section 5.1 that different heads tend to learn different sparsity behaviors.
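The threshold-finding view of Equation 6 can be sketched with a generic bisection, in the spirit of Blondel et al. (2019); this is a simplified NumPy illustration with hypothetical inputs, not the paper's (faster) exact algorithms:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    # Generic alpha-entmax (alpha > 1) via bisection on the threshold tau of
    # Equation 6:  p_i = [(alpha - 1) z_i - tau]_+ ^ (1 / (alpha - 1)),
    # choosing tau so that p sums to one.
    z = (alpha - 1) * np.asarray(z, dtype=float)
    # Bracket: at tau = max(z) the sum is 0; at tau = max(z) - 1 it is >= 1.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(z - (lo + hi) / 2, 0.0, None) ** (1.0 / (alpha - 1))
    return p / p.sum()   # absorb the tiny residual bisection error

scores = np.array([1.2, 0.8, -0.1, -2.0])   # made-up attention scores
p15 = entmax_bisect(scores, alpha=1.5)      # sparse: low scores get exact zeros
p20 = entmax_bisect(scores, alpha=2.0)      # alpha = 2 recovers sparsemax
```

For these scores, the α = 2 case matches the sparsemax projection, and both sparse variants prune the lowest-scoring item to exactly zero, which softmax cannot do.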

3 Adaptively Sparse Transformers with α-entmax

We now propose a novel Transformer architecture wherein we simply replace softmax with α-entmax in the attention heads. Concretely, we replace the row normalization π in Equation 1 by

π(Z)_i = α-entmax(z_i),    (8)

applied to each row z_i of Z. This change immediately leads to sparse attention weights, as long as α > 1; in particular, α = 1.5 is a sensible starting point (Peters et al., 2019).

Different α per head.

Unlike LSTM-based seq2seq models, where α can be more easily tuned by grid search, in a Transformer there are many attention heads in multiple layers. Crucial to the power of such models, the different heads capture different linguistic phenomena, some of them isolating important words, others spreading out attention across phrases (Vaswani et al., 2017, Figure 5). This motivates using different, adaptive α values for each attention head, such that some heads may learn to be sparser and others may become closer to softmax. We propose doing so by treating the α values as neural network parameters, optimized via stochastic gradients along with the other weights.

Gradients w.r.t. α.

In order to optimize α automatically via gradient methods, we must compute the gradient of the entmax output w.r.t. α. Since entmax is defined through an optimization problem, this is non-trivial and cannot be simply handled through automatic differentiation; it falls within the domain of argmin differentiation, an active research topic in optimization (Gould et al., 2016; Amos and Kolter, 2017).

One of our key contributions is the derivation of a closed-form expression for this gradient. The next proposition provides this expression, enabling entmax layers with adaptive α. To the best of our knowledge, ours is the first neural network module that can automatically and continuously vary in shape away from softmax and toward sparse mappings like sparsemax.

Proposition 1.

Let p* = α-entmax(z) be the solution of Equation 4. Denote the distribution p̃_i := (p_i*)^{2−α} / Σ_j (p_j*)^{2−α} and let h_i := −p_i* log p_i*, both defined over the support of p*. For α > 1, the i-th component of the gradient g = ∂p*/∂α is

g_i = (p_i* − p̃_i)/(α − 1)^2 + (h_i − p̃_i Σ_j h_j)/(α − 1),    (9)

with g_i = 0 for coordinates outside the support.
The proof uses implicit function differentiation and is given in Appendix C.

Proposition 1 provides the remaining missing piece needed for training adaptively sparse Transformers. In the following section, we evaluate this strategy on neural machine translation, and analyze the behavior of the learned attention heads.
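As a sanity check, the closed-form gradient of Proposition 1 can be compared against a finite-difference estimate. This NumPy sketch reuses a generic bisection solver for Equation 6 and uses made-up scores; it is illustrative only:

```python
import numpy as np

def entmax(z, alpha, n_iter=60):
    # alpha-entmax (alpha > 1) via bisection on the threshold of Equation 6.
    z = (alpha - 1) * np.asarray(z, dtype=float)
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(z - (lo + hi) / 2, 0.0, None) ** (1.0 / (alpha - 1))
    return p / p.sum()

def grad_alpha(p, alpha):
    # Proposition 1: closed-form d(entmax)/d(alpha) on the support, 0 elsewhere.
    g = np.zeros_like(p)
    s = p > 0                       # support of the solution
    ps = p[s]
    h = -ps * np.log(ps)            # h_i = -p_i log p_i
    pt = ps ** (2.0 - alpha)
    pt = pt / pt.sum()              # the distribution p~ of Proposition 1
    g[s] = (ps - pt) / (alpha - 1) ** 2 + (h - pt * h.sum()) / (alpha - 1)
    return g

z = np.array([1.2, 0.8, 0.1, -0.5])   # toy scores
alpha, eps = 1.5, 1e-5
g = grad_alpha(entmax(z, alpha), alpha)
g_fd = (entmax(z, alpha + eps) - entmax(z, alpha - eps)) / (2 * eps)
```

The gradient sums to zero, since the output stays on the simplex for every α, and it matches the central finite difference closely.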

4 Experiments

We apply our adaptively sparse Transformers on four machine translation tasks. For comparison, a natural baseline is the standard Transformer architecture using the softmax transform in its multi-head attention mechanisms. We consider two other model variants in our experiments that make use of different normalizing transformations:

  • 1.5-entmax: a Transformer with sparse entmax attention with a fixed α = 1.5 for all heads. This is a novel model, since 1.5-entmax had only been proposed for RNN-based NMT models (Peters et al., 2019), but never in Transformers, where attention modules are not just one single component of the seq2seq model but rather an integral part of all of its components.

  • α-entmax: an adaptive Transformer with sparse entmax attention, with a different, learned α for each head.

The adaptive model has an additional scalar α parameter per attention head per layer for each of the three attention mechanisms (encoder self-attention, context attention, and decoder self-attention), i.e., 3LH additional parameters in total for L layers and H heads; these are initialized randomly (cf. Section 5.1). All or some of the α values can be tied if desired, but we keep them independent for analysis purposes.

activation   de→en   ja→en   ro→en   en→de
softmax      29.79   21.57   32.70   26.02
1.5-entmax   29.83   22.13   33.10   25.89
α-entmax     29.90   21.74   32.89   26.93
Table 1: Machine translation tokenized BLEU test results on IWSLT 2017 de→en, KFTT ja→en, WMT 2016 ro→en and WMT 2014 en→de, respectively.


Our models were trained on 4 machine translation datasets of different training sizes:


  • IWSLT 2017 German→English (de→en; Cettolo et al., 2017): ≈ 200K sentence pairs.

  • KFTT Japanese→English (ja→en; Neubig, 2011): ≈ 300K sentence pairs.

  • WMT 2016 Romanian→English (ro→en; Bojar et al., 2016): ≈ 600K sentence pairs.

  • WMT 2014 English→German (en→de; Bojar et al., 2014): ≈ 4.5M sentence pairs.

All of these datasets were preprocessed with byte-pair encoding (BPE; Sennrich et al., 2016), using joint segmentations of 32k merge operations.


We follow the dimensions of the Transformer-Base model of Vaswani et al. (2017): the number of layers is L = 6 and the number of heads is H = 8 in the encoder self-attention, the context attention, and the decoder self-attention. We use a mini-batch size of 8192 tokens and warm up the learning rate linearly until 20k steps, after which it decays according to an inverse square root schedule. All models were trained until convergence of validation accuracy; evaluation was done every 10k steps for ro→en and en→de, and every 5k steps for de→en and ja→en. The end-to-end computational overhead of our methods, when compared to standard softmax, is relatively small: in training tokens per second, the models using 1.5-entmax and α-entmax remain close to the speed of the softmax model.
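The warmup-then-inverse-square-root schedule can be sketched as below; the `scale` knob and the exact base factor are assumptions for illustration, not values from the paper:

```python
def transformer_lr(step, warmup=20_000, d_model=512, scale=1.0):
    # Inverse square root schedule with linear warmup (Vaswani et al., 2017):
    # the rate rises linearly for `warmup` steps, then decays as step**-0.5.
    # `scale` is a hypothetical knob, not a value from the paper.
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup`, so the rate peaks there.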


We report test-set tokenized BLEU (Papineni et al., 2002) results in Table 1. Replacing softmax with entmax does not hurt performance on any of the datasets; indeed, the sparse attention Transformers tend to have slightly higher BLEU, and their sparsity leads to better potential for analysis. In the next section, we use this potential to explore the learned internal mechanics of the self-attention heads.

5 Analysis

We analyze the attention of the sparse adaptive Transformer model (α-entmax) on the higher-resource dataset, WMT 2014 English→German, at multiple levels: we examine high-level statistics as well as individual head behavior, and we qualitatively analyze the interpretability capabilities of our models.

5.1 High-Level Statistics

What kind of α values are learned?

Figure 2 shows the learning trajectories of the α parameters of a selected subset of heads. We generally observe a tendency for the randomly-initialized α parameters to decrease initially, suggesting that softmax-like behavior may be preferable while the model is still very uncertain. After around one thousand steps, some heads change direction and become sparser, perhaps as they become more confident and specialized. This shows that the initialization of α does not predetermine its sparsity level or the role the head will have throughout training. In particular, one head in the encoder self-attention first drops to a low α before becoming one of the sparsest heads, with α close to 2.

The overall distribution of α values at convergence can be seen in Figure 3. We can observe that the encoder self-attention blocks learn to concentrate the α values in two modes: a very sparse one near α = 2, and a dense one between softmax and 1.5-entmax. The decoder self-attention and context attention, however, concentrate these parameters in a single mode. We show next that this is reflected in the average density of attention weight vectors as well.


Figure 2: Trajectories of α values for a subset of the heads during training. Initialized at random, most heads become denser at the beginning, before converging. This suggests that dense attention may be more beneficial while the network is still uncertain, being replaced by sparse attention afterwards.


Figure 3: Distribution of learned α values per attention block. While the encoder self-attention has a bimodal distribution of α values, the decoder self-attention and context attention have a single mode.

Attention weight density when translating.

For any value of α, it would still be possible for the weight matrices in Equation 3 to learn re-scalings so as to make attention sparser or denser. To visualize the impact of adaptive α values, we compare the empirical attention weight density (the average number of tokens receiving non-zero attention) within each module, against sparse Transformers with a fixed α = 1.5.
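The density statistic can be computed as below; a toy NumPy sketch with made-up attention matrices:

```python
import numpy as np

def attention_density(weights):
    # Average number of tokens receiving non-zero attention weight, taken
    # over all query rows of an attention matrix (the statistic of Figure 4).
    return float((np.asarray(weights) > 0).sum(axis=-1).mean())

softmax_like = np.full((4, 6), 1 / 6)   # dense: all 6 tokens attended
peaked = np.eye(4, 6)                   # fully sparse: 1 token per query
```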

Figure 4 shows that, with fixed α = 1.5, heads tend to be sparse and similarly-distributed in all three attention modules. With learned α, there are two notable changes: (i) a prominent mode corresponding to fully dense probabilities, showing that our models learn to combine sparse and dense attention, and (ii) a distinction between the encoder self-attention – whose background distribution tends toward extreme sparsity – and the other two modules, which exhibit more uniform background distributions. This suggests that entirely sparse Transformers may be suboptimal.

The fact that the decoder seems to prefer denser attention distributions might be attributed to it being auto-regressive, only having access to past tokens and not the full sentence. We speculate that it might lose too much information if it assigned weights of zero to too many tokens in the self-attention, since there are fewer tokens to attend to in the first place.

Breaking this down into separate layers, Figure 5 shows the average (sorted) density of each head for each layer. We observe that α-entmax is able to learn different sparsity patterns at each layer, leading to more variance in individual head behavior, to clearly-identified dense and sparse heads, and overall to different tendencies compared to the fixed case α = 1.5.


Figure 4: Distribution of attention densities (average number of tokens receiving non-zero attention weight) for all attention heads and all validation sentences. When compared to 1.5-entmax, α-entmax distributes the sparsity in a more uniform manner, with a clear mode at fully dense attention, corresponding to the heads with low α. In the softmax case, this distribution would collapse into a single bar at density 1.


Figure 5: Head density per layer for fixed (1.5-entmax) and learned (α-entmax) α. Each line corresponds to an attention head; lower values mean that the attention head is sparser. Learned α has higher variance.

Head diversity.

To measure the overall disagreement between attention heads, as a measure of head diversity, we use the following generalization of the Jensen-Shannon divergence:

JS = H^S( (1/h) Σ_{j=1}^{h} p_j ) − (1/h) Σ_{j=1}^{h} H^S(p_j),    (10)

where h is the number of heads, p_j is the vector of attention weights assigned by head j to each word in the sequence, and H^S is the Shannon entropy, base-adjusted to the dimension of p_j so that the uniform distribution has entropy 1. We average this measure over the entire validation set. The higher this metric is, the more the heads take different roles in the model.
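The disagreement measure can be sketched as follows; the base adjustment divides by log d so that a uniform distribution has entropy 1 (our reading of the text), and the head distributions below are toy inputs:

```python
import numpy as np

def head_disagreement(P):
    # Generalized Jensen-Shannon divergence between attention heads: entropy
    # of the mean attention distribution minus the mean per-head entropy.
    # Entropies use base d (the sequence length), bounding the result by 1.
    P = np.asarray(P, dtype=float)
    d = P.shape[1]

    def H(p):
        nz = p[p > 0]
        return float(-(nz * np.log(nz)).sum() / np.log(d))

    return H(P.mean(axis=0)) - np.mean([H(p) for p in P])

same = np.tile([0.5, 0.5, 0.0, 0.0], (3, 1))   # three identical heads
apart = np.eye(3, 4)                            # heads attend to distinct tokens
```

Identical heads score zero; heads that attend to disjoint tokens score close to one.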

Figure 6 shows that both sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.

The statistics shown in this section can be found for the other language pairs in Appendix A.


Figure 6: Jensen-Shannon Divergence between heads at each layer. Measures the disagreement between heads: the higher the value, the more the heads are disagreeing with each other in terms of where to attend. Models using sparse entmax have more diverse attention than the softmax baseline.

5.2 Identifying Head Specializations

Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Tang et al., 2018; Voita et al., 2019). The specialization of a head can be identified by observing the type of tokens or sequences to which the head often assigns most of its attention weight; this is facilitated by sparsity.

Positional heads.

One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found for each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head's confidence as the average attention weight assigned to the previous token. The softmax model has three heads attending to position −1; the 1.5-entmax model also has three heads for this position, with a slightly higher median confidence. The adaptive model has four such heads, with the highest median confidence; its lowest-confidence head is dense, with α close to 1, while its highest-confidence head is sparse, with α close to 2.

For position +1, the models each dedicate one head, with high confidence, slightly higher for entmax. The adaptive model learns a sparse α for this head.
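The confidence measure used here (average weight on the token at a fixed offset) can be sketched as below, with a toy attention matrix; skipping border positions without a previous token is an assumption on our part:

```python
import numpy as np

def positional_confidence(weights, offset=-1):
    # Average attention weight each query assigns to the token at a fixed
    # relative offset (offset=-1: previous token). Queries whose target falls
    # outside the sequence are skipped.
    w = np.asarray(weights)
    n = w.shape[0]
    pairs = [(q, q + offset) for q in range(n) if 0 <= q + offset < n]
    return float(np.mean([w[q, t] for q, t in pairs]))

# Toy matrix: most mass on the previous token, a little spread elsewhere.
peaked = np.eye(5, k=-1) * 0.9 + 0.02
```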


Figure 7: Self-attention from the most confidently previous-position head in each model. The learned α parameter in the α-entmax model is close to 2. The adaptive head is quantitatively more confident, and visual inspection confirms that it behaves more consistently.

BPE-merging head.

Due to the sparsity of our models, we are able to identify other head specializations, easily singling out which heads should be further analyzed. In Figure 8 we show one such head where the α value is particularly high (in the encoder, layer 1, head 4, depicted in Figure 2). We found that this head most often looks at the current time step with high confidence, making it a positional head with offset 0. However, this head often spreads weight sparsely over 2-3 neighboring tokens when the tokens are part of the same BPE cluster (BPE-segmented words are specially marked in the figures) or hyphenated words. As this head is in the first layer, it provides a useful service to the higher layers by combining information evenly within some BPE clusters.

For each BPE cluster or cluster of hyphenated words, we computed a score between 0 and 1, equal to the maximum attention mass assigned by any token to the rest of the tokens inside the cluster, in order to quantify the BPE-merging capabilities of these heads. (If the cluster has size 1, the score is the weight the token assigns to itself.) No attention head in the softmax model obtains a high score under this measure, while 1.5-entmax and α-entmax each have two heads that do.
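The cluster score described above can be sketched as below, on a made-up 3-token attention matrix where tokens 0 and 1 form one BPE cluster:

```python
import numpy as np

def cluster_score(weights, cluster):
    # Maximum, over tokens in the cluster, of the attention mass that token
    # assigns to the *rest* of the cluster; for a singleton cluster, the
    # weight the token assigns to itself (the footnote convention).
    cluster = list(cluster)
    if len(cluster) == 1:
        i = cluster[0]
        return float(weights[i, i])
    return max(
        float(weights[i, [j for j in cluster if j != i]].sum()) for i in cluster
    )

w = np.array([[0.1, 0.9, 0.0],    # token 0 sends 0.9 to its cluster-mate
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # token 3 is its own cluster
```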


Figure 8: BPE-merging head discovered in the α-entmax model. Found in the first encoder layer, this head learns to discover some subword units and combine their information, leaving most words intact. It places most of its probability mass within the same BPE cluster as the current token: more than any head in any other model.



Figure 9: Interrogation-detecting heads in the three models. The top sentence is interrogative, while the bottom one is declarative but includes the interrogative word “what”. In the top example, these interrogation heads assign a high probability to the question mark at the time step of the interrogative word, while in the bottom example, since there is no question mark, the same head does not assign a high probability to the last token in the sentence at the interrogative word's time step. Surprisingly, this head prefers a low α, as can be seen from the dense weights. This allows the head to identify the noun phrase “Armani Polo” better.


Figure 10: Example of two sentences of similar length where the same head exhibits different sparsity. The longer phrase in the example on the right, “a sexually transmitted disease”, is handled with higher confidence, leading to more sparsity.

Interrogation head.

On the other hand, in Figure 9 we show a head for which our adaptively sparse model chose an α close to 1, making it closer to softmax (also shown as encoder, layer 1, head 3 in Figure 2). We observe that this head assigns a high probability to question marks at the end of the sentence in time steps where the current token is interrogative, making it an interrogation-detecting head. We also observe this type of head in the other models, which we likewise depict in Figure 9. The average attention weight placed on the question mark when the current token is an interrogative word is high for all three models.

Furthermore, we can examine sentences where some typically sparse heads become less so, thus identifying sources of ambiguity where the head is less confident in its prediction. An example is shown in Figure 10, where sparsity in the same head differs for sentences of similar length.

6 Related Work

Sparse attention.

Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism controlled by a fixed α. In contrast, ours is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous α parameter. We also provide the first results using sparse attention in a Transformer model.

Fixed sparsity patterns.

Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively-sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 9). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns within the same span. Interestingly, some of their findings mirror ours – we found that attention heads in the last layers tend to be denser on average than those in the first layers, while their work found that lower layers tend to have a shorter attention span compared to higher layers.

Transformer interpretability.

The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made of the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline. Regarding redundancy of heads, Voita et al. (2019) develop a method that is able to prune heads of the multi-head attention module and make an empirical study of the role that each head has in self-attention (positional, syntactic and rare words). Li et al. (2018) also aim to reduce head redundancy by adding a regularization term to the loss that maximizes head disagreement and obtain improved results. While not considering Transformer attentions, Jain and Wallace (2019) show that traditional attention mechanisms do not necessarily improve interpretability since softmax attention is vulnerable to an adversarial attack leading to wildly different model predictions for the same attention weights. Sparse attention may mitigate these issues; however, our work focuses mostly on a more mechanical aspect of interpretation by analyzing head behavior, rather than on explanations for predictions.

7 Conclusion and Future Work

We contribute a novel strategy for adaptively sparse attention and, in particular, for adaptively sparse Transformers. We present the first empirical analysis of Transformers with sparse attention mappings (i.e., entmax), showing potential in both translation accuracy and model interpretability.

In particular, we analyzed how the attention heads in the proposed adaptively sparse Transformer can specialize more, and with higher confidence. Our adaptivity strategy relies only on gradient-based optimization, side-stepping costly per-head hyper-parameter searches. Further speed-ups are possible by leveraging more parallelism in the bisection algorithm for computing α-entmax.

Finally, some of the automatically-learned behaviors of our adaptively sparse Transformers, for instance, the near-deterministic positional heads or the subword joining head, may provide new ideas for designing static variations of the Transformer architecture.


Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We are grateful to Ben Peters for the α-entmax code, and to Erick Fonseca, Marcos Treviso, Pedro Martins, and Tsvetomila Mihaylova for insightful group discussion. We thank Mathieu Blondel for the idea to learn α. We would also like to thank the anonymous reviewers for their helpful feedback.


Appendix A High-Level Statistics Analysis of Other Language Pairs


(a) WMT 2016 ro→en.


(b) KFTT ja→en.


(c) WMT 2014 en→de.


(d) IWSLT 2017 de→en.
Figure 11: Histograms of learned α values.


(a) WMT 2016 ro→en.


(b) KFTT ja→en.


(c) WMT 2014 en→de.


(d) IWSLT 2017 de→en.
Figure 12: Histograms of head densities.


(a) WMT 2016 ro→en.


(b) KFTT ja→en.


(c) WMT 2014 en→de.


(d) IWSLT 2017 de→en.
Figure 13: Jensen-Shannon divergence over layers.


(a) WMT 2016 ro→en.


(b) KFTT ja→en.


(c) WMT 2014 en→de.


(d) IWSLT 2017 de→en.
Figure 14: Head densities over layers.

Appendix B Background

B.1 Regularized Fenchel-Young prediction functions

Definition 1 (Blondel et al., 2019).

Let Ω : Δ^d → ℝ be a strictly convex regularization function. We define the prediction function π_Ω as

π_Ω(z) := argmax_{p ∈ Δ^d} ( p^T z − Ω(p) ).    (11)

B.2 Characterizing the α-entmax mapping

Lemma 1 (Peters et al. 2019).

For any $\mathbf{z}$, there exists a unique $\tau^\star$ such that

$$\alpha\text{-entmax}(\mathbf{z}) = \left[(\alpha - 1)\mathbf{z} - \tau^\star \mathbf{1}\right]_+^{1/(\alpha - 1)}. \tag{15}$$
Proof: From the definition of α-entmax,

$$\alpha\text{-entmax}(\mathbf{z}) := \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^T_\alpha(\mathbf{p}),$$

we may easily identify it with a regularized prediction function (Def. 1):

$$\alpha\text{-entmax}(\mathbf{z}) = \pi_{-H^T_\alpha}(\mathbf{z}).$$

We first note that for all $\mathbf{p} \in \triangle^d$,

$$-H^T_\alpha(\mathbf{p}) = \frac{1}{\alpha(\alpha - 1)} \sum_j p_j^\alpha - \frac{1}{\alpha(\alpha - 1)}.$$

From the constant invariance and scaling properties of $\pi_\Omega$ (Blondel et al., 2019, Proposition 1, items 4–5),

$$\pi_{-H^T_\alpha}(\mathbf{z}) = \pi_{\Omega}\big((\alpha - 1)\mathbf{z}\big), \qquad \text{with } \Omega(\mathbf{p}) = \sum_j \frac{p_j^\alpha}{\alpha}.$$

Using (Blondel et al., 2019, Proposition 5), noting that $g(t) = t^\alpha / \alpha$, $g'(t) = t^{\alpha - 1}$, and $(g')^{-1}(u) = u^{1/(\alpha - 1)}$, yields

$$\pi_{\Omega}\big((\alpha - 1)\mathbf{z}\big) = \left[(\alpha - 1)\mathbf{z} - \tau^\star \mathbf{1}\right]_+^{1/(\alpha - 1)}. \tag{16}$$

Since $-H^T_\alpha$ is strictly convex on the simplex, the problem has a unique solution $\mathbf{p}^\star$. Equation 16 implicitly defines a one-to-one mapping between $\tau^\star$ and $\mathbf{p}^\star$ as long as $\mathbf{p}^\star \in \triangle^d$, therefore $\tau^\star$ is also unique. ∎

B.3 Connections to softmax and sparsemax

The Euclidean projection onto the simplex, sometimes referred to, in the context of neural attention, as sparsemax (Martins and Astudillo, 2016), is defined as

$$\operatorname{sparsemax}(\mathbf{z}) := \operatorname*{argmin}_{\mathbf{p} \in \triangle^d} \|\mathbf{p} - \mathbf{z}\|_2^2. \tag{17}$$

The solution can be characterized through the unique threshold $\tau$ such that $\sum_j \operatorname{sparsemax}(\mathbf{z})_j = 1$ and (Held et al., 1974)

$$\operatorname{sparsemax}(\mathbf{z}) = \left[\mathbf{z} - \tau \mathbf{1}\right]_+. \tag{18}$$
Thus, each coordinate of the sparsemax solution is a piecewise-linear function of $\mathbf{z}$. Visibly, this expression is recovered when setting $\alpha = 2$ in the α-entmax closed form of Lemma 1 (Equation 15); for other values of $\alpha$, the exponent $1/(\alpha - 1)$ induces curvature.
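For this α = 2 case, the threshold admits an exact sort-based solution, so no iterative search is needed. The snippet below is our sketch of the standard procedure (the threshold characterization cited to Held et al., 1974); the function name is hypothetical:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection onto the simplex via the exact sort-based
    threshold (a sketch of ours, not the paper's code)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # descending
    cssv = np.cumsum(z_sorted) - 1.0         # cumulative sum minus total mass
    k = np.arange(1, len(z) + 1)
    # Largest k with z_sorted[k-1] > (cumsum_k - 1) / k gives the support size.
    support = k[z_sorted - cssv / k > 0][-1]
    tau = cssv[support - 1] / support
    return np.clip(z - tau, 0.0, None)
```

Because the projection only depends on $\mathbf{z} - \tau\mathbf{1}$, adding a constant to all scores leaves the output unchanged, mirroring the shift invariance of softmax.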

On the other hand, the well-known softmax is usually defined through the expression

$$\operatorname{softmax}(\mathbf{z})_j := \frac{\exp(z_j)}{\sum_i \exp(z_i)}, \tag{19}$$

which can be shown to be the unique solution of the optimization problem

$$\operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \; \mathbf{p}^\top \mathbf{z} + H^S(\mathbf{p}), \tag{20}$$

where $H^S(\mathbf{p}) := -\sum_j p_j \log p_j$ is the Shannon entropy. Indeed, setting the gradient to $\mathbf{0}$ yields the condition $\log p_j = z_j - 1 - \tau + \nu_j$, where $\tau$ and $\boldsymbol{\nu} \geq \mathbf{0}$ are Lagrange multipliers for the simplex constraints $\sum_i p_i = 1$ and $p_i \geq 0$, respectively. Since the l.h.s. is only finite for $p_j > 0$, we must have $\nu_j = 0$ for all $j$, by complementary slackness. Thus, the solution must have the form $p_j = \exp(z_j)/Z$, yielding Equation 19.
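A quick numerical sanity check of this variational view (our sketch, not from the paper) is that no other point of the simplex attains a better objective value than softmax:

```python
import numpy as np

def objective(p, z):
    """The variational objective p^T z + H^S(p) from the problem above."""
    return p @ z - np.sum(p * np.log(p))

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()
```

Sampling random interior points of the simplex (e.g., from a Dirichlet distribution) and comparing their objective values against `objective(softmax(z), z)` should never find an improvement, consistent with softmax being the unique maximizer.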

Appendix C Gradient of α-entmax w.r.t. the shape parameter α: Proof of Proposition 1

Recall that the entmax transformation is defined as:

$$\alpha\text{-entmax}(\mathbf{z}) := \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^T_\alpha(\mathbf{p}), \tag{21}$$

where $\alpha \geq 1$ and $H^T_\alpha$ is the Tsallis entropy,

$$H^T_\alpha(\mathbf{p}) := \begin{cases} \dfrac{1}{\alpha(\alpha - 1)} \displaystyle\sum_j \left(p_j - p_j^\alpha\right), & \alpha \neq 1, \\[1.5ex] H^S(\mathbf{p}), & \alpha = 1, \end{cases} \tag{22}$$

and $H^S(\mathbf{p}) := -\sum_j p_j \log p_j$ is the Shannon entropy.

In this section, we derive the gradient of $\alpha\text{-entmax}(\mathbf{z})$ with respect to the scalar parameter $\alpha$.

C.1 General case of $\alpha > 1$

From the KKT conditions associated with the optimization problem in Eq. 21, we have that the solution $\mathbf{p}^\star$ has the following form, coordinate-wise:

$$p^\star_i = \left[(\alpha - 1)(z_i - \tau)\right]_+^{1/(\alpha - 1)}, \tag{23}$$

where $\tau$ is a scalar Lagrange multiplier that ensures that $\mathbf{p}^\star$ normalizes to 1, i.e., it is defined implicitly by the condition:

$$\sum_j \left[(\alpha - 1)(z_j - \tau)\right]_+^{1/(\alpha - 1)} = 1. \tag{24}$$
For general values of $\alpha$, Eq. 24 lacks a closed form solution. This makes the computation of the gradient

$$\frac{\partial\, \alpha\text{-entmax}(\mathbf{z})}{\partial \alpha} \tag{25}$$

non-trivial. Fortunately, we can use the technique of implicit differentiation to obtain this gradient.

The gradient exists almost everywhere, and the expressions we derive yield a generalized gradient (Clarke, 1990) at any non-differentiable points that may occur for certain ($\mathbf{z}$, $\alpha$) pairs. We begin by noting that $\partial p^\star_i / \partial \alpha = 0$ if $p^\star_i = 0$, because increasing $\alpha$ keeps sparse coordinates sparse (this follows from the margin property of $H^T_\alpha$; Blondel et al., 2019). Therefore we need to worry only about coordinates that are in the support of $\mathbf{p}^\star$. We will assume hereafter that the $i$th coordinate of $\mathbf{p}^\star$ is non-zero. We have:

$$\begin{aligned}
\frac{\partial p^\star_i}{\partial \alpha}
&= \frac{\partial}{\partial \alpha} \left[(\alpha - 1)(z_i - \tau)\right]^{1/(\alpha - 1)} \\
&= p^\star_i \, \frac{\partial}{\partial \alpha} \left\{ \frac{\log\left[(\alpha - 1)(z_i - \tau)\right]}{\alpha - 1} \right\} \\
&= p^\star_i \left[ \frac{1}{(\alpha - 1)^2} - \frac{\log p^\star_i}{\alpha - 1} - \frac{\partial \tau / \partial \alpha}{(p^\star_i)^{\alpha - 1}} \right]. \tag{26}
\end{aligned}$$
We can see that this gradient depends on $\partial \tau / \partial \alpha$, which we now compute using implicit differentiation.

Let $\mathcal{S} := \{j : p^\star_j > 0\}$ denote the support of $\mathbf{p}^\star$. By differentiating both sides of Eq. 24, re-using some of the steps in Eq. 26, and recalling Eq. 23, we get

$$\sum_{j \in \mathcal{S}} p^\star_j \left[ \frac{1}{(\alpha - 1)^2} - \frac{\log p^\star_j}{\alpha - 1} - \frac{\partial \tau / \partial \alpha}{(p^\star_j)^{\alpha - 1}} \right] = 0, \tag{27}$$

from which we obtain:

$$\frac{\partial \tau}{\partial \alpha} = \frac{\dfrac{1}{(\alpha - 1)^2} + \dfrac{H^S(\mathbf{p}^\star)}{\alpha - 1}}{\displaystyle\sum_{j \in \mathcal{S}} (p^\star_j)^{2 - \alpha}}. \tag{28}$$
Finally, plugging Eq. 28 into Eq. 26, we get:

$$\frac{\partial p^\star_i}{\partial \alpha} = \frac{p^\star_i - \tilde{p}_i(\alpha)}{(\alpha - 1)^2} - \frac{p^\star_i \log p^\star_i + \tilde{p}_i(\alpha) \, H^S(\mathbf{p}^\star)}{\alpha - 1}, \tag{29}$$

where we denote by

$$\tilde{p}_i(\alpha) := \frac{(p^\star_i)^{2 - \alpha}}{\displaystyle\sum_{j \in \mathcal{S}} (p^\star_j)^{2 - \alpha}}. \tag{30}$$
The distribution $\tilde{\mathbf{p}}(\alpha)$ can be interpreted as a "skewed" distribution obtained from $\mathbf{p}^\star$, which appears in the gradient of $\alpha\text{-entmax}(\mathbf{z})$ w.r.t. $\mathbf{z}$ as well (Peters et al., 2019).
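Equation 29 can be checked numerically against finite differences. The sketch below is ours (function names hypothetical; entmax computed by bisection, not with the paper's released code) and implements the $\alpha > 1$ expression on the support:

```python
import numpy as np

def entmax_bisect(z, alpha, n_iter=80):
    """alpha-entmax (alpha > 1) via bisection on tau; our sketch."""
    zs = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(zs - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def grad_alpha(p, alpha):
    """d p*/d alpha: zero off the support; on the support,
    (p - p_tilde)/(alpha-1)^2 - (p log p + p_tilde H^S)/(alpha-1)."""
    g = np.zeros_like(p)
    s = p > 0
    ps = p[s]
    skew = ps ** (2.0 - alpha)
    p_tilde = skew / skew.sum()        # the "skewed" distribution
    H = -np.sum(ps * np.log(ps))       # Shannon entropy of p*
    beta = alpha - 1.0
    g[s] = (ps - p_tilde) / beta**2 - (ps * np.log(ps) + p_tilde * H) / beta
    return g
```

Away from support changes the analytic gradient should match a central finite difference in $\alpha$ to high precision, and its coordinates should sum to zero (the output stays on the simplex).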

C.2 Solving the indetermination for $\alpha = 1$

We can write Eq. 29 as

$$\frac{\partial p^\star_i}{\partial \alpha} = \frac{p^\star_i - \tilde{p}_i(\alpha) - (\alpha - 1)\left[p^\star_i \log p^\star_i + \tilde{p}_i(\alpha) H^S(\mathbf{p}^\star)\right]}{(\alpha - 1)^2}. \tag{31}$$

When $\alpha \to 1$, we have $\tilde{\mathbf{p}}(\alpha) \to \mathbf{p}^\star$, which leads to a $\frac{0}{0}$ indetermination.

To solve this indetermination, we will need to apply L'Hôpital's rule twice; throughout, $\mathbf{p}^\star$ is held fixed at its value for $\alpha = 1$, i.e., $\operatorname{softmax}(\mathbf{z})$, and only the explicit dependence on $\alpha$ is differentiated. Let us first compute the derivative of $\tilde{p}_i(\alpha)$ with respect to $\alpha$. We have

$$\frac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} = \tilde{p}_i(\alpha) \left( \sum_j \tilde{p}_j(\alpha) \log p^\star_j - \log p^\star_i \right), \tag{32}$$

and therefore

$$\lim_{\alpha \to 1} \frac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} = -p^\star_i \log p^\star_i - p^\star_i \, H^S(\mathbf{p}^\star). \tag{33}$$
Differentiating the numerator and denominator in Eq. 31, we get:

$$\frac{\partial p^\star_i}{\partial \alpha}\bigg|_{\alpha = 1} = \lim_{\alpha \to 1} \frac{ -\dfrac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} - p^\star_i \log p^\star_i - \tilde{p}_i(\alpha) H^S(\mathbf{p}^\star) - (\alpha - 1) \dfrac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} H^S(\mathbf{p}^\star) }{2(\alpha - 1)}, \tag{34}$$

whose last summand converges to

$$\lim_{\alpha \to 1} \, -\frac{1}{2} \frac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} H^S(\mathbf{p}^\star) = \frac{1}{2} \left( p^\star_i \log p^\star_i + p^\star_i H^S(\mathbf{p}^\star) \right) H^S(\mathbf{p}^\star), \tag{35}$$

while the remaining term is

$$\lim_{\alpha \to 1} \frac{ -p^\star_i \log p^\star_i - \tilde{p}_i(\alpha) H^S(\mathbf{p}^\star) - \dfrac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} }{2(\alpha - 1)}. \tag{36}$$
When $\alpha \to 1$, Eq. 36 becomes again a $\frac{0}{0}$ indetermination, which we can solve by applying L'Hôpital's rule again. Differentiating the numerator and denominator in Eq. 36:

$$\lim_{\alpha \to 1} \frac{ -\dfrac{\partial \tilde{p}_i(\alpha)}{\partial \alpha} H^S(\mathbf{p}^\star) - \dfrac{\partial^2 \tilde{p}_i(\alpha)}{\partial \alpha^2} }{2} = \frac{p^\star_i}{2} \left( \sum_j p^\star_j (\log p^\star_j)^2 - (\log p^\star_i)^2 - \log p^\star_i \, H^S(\mathbf{p}^\star) - H^S(\mathbf{p}^\star)^2 \right), \tag{37}$$

where we used $\displaystyle \lim_{\alpha \to 1} \frac{\partial^2 \tilde{p}_i(\alpha)}{\partial \alpha^2} = p^\star_i \left[ \big(\log p^\star_i + H^S(\mathbf{p}^\star)\big)^2 + H^S(\mathbf{p}^\star)^2 - \sum_j p^\star_j (\log p^\star_j)^2 \right].$
Finally, summing Eq. 35 and Eq. 37, we get

$$\frac{\partial p^\star_i}{\partial \alpha}\bigg|_{\alpha = 1} = \frac{p^\star_i}{2} \left( \sum_j p^\star_j (\log p^\star_j)^2 - (\log p^\star_i)^2 \right). \tag{38}$$
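The limit just derived can be verified numerically: evaluating the $\alpha > 1$ expression (Eq. 29) at $\alpha$ close to 1, with $\mathbf{p}^\star$ held fixed as in the L'Hôpital computation, should approach the closed form. A small sketch of ours (names hypothetical):

```python
import numpy as np

def g_limit(p):
    """The alpha -> 1 limit above: p_i (sum_j p_j log^2 p_j - log^2 p_i) / 2."""
    L = np.log(p)
    return p * (np.sum(p * L**2) - L**2) / 2.0

def g_general(p, alpha):
    """The alpha > 1 expression (Eq. 29), with p* held fixed; it should
    approach g_limit(p) as alpha -> 1."""
    beta = alpha - 1.0
    skew = p ** (2.0 - alpha)
    p_tilde = skew / skew.sum()
    H = -np.sum(p * np.log(p))
    return (p - p_tilde) / beta**2 - (p * np.log(p) + p_tilde * H) / beta
```

Both branches have coordinates summing to zero, as any gradient of a simplex-valued map must.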
C.3 Summary

To sum up, we have the following expression for the gradient of $\alpha\text{-entmax}(\mathbf{z})$ with respect to $\alpha$:

$$\frac{\partial p^\star_i}{\partial \alpha} = \begin{cases} \dfrac{p^\star_i - \tilde{p}_i(\alpha)}{(\alpha - 1)^2} - \dfrac{p^\star_i \log p^\star_i + \tilde{p}_i(\alpha) H^S(\mathbf{p}^\star)}{\alpha - 1}, & \alpha > 1, \\[2ex] \dfrac{p^\star_i}{2} \left( \displaystyle\sum_j p^\star_j (\log p^\star_j)^2 - (\log p^\star_i)^2 \right), & \alpha = 1, \end{cases} \tag{39}$$

for $i \in \mathcal{S}$, and $\partial p^\star_i / \partial \alpha = 0$ for $i \notin \mathcal{S}$.
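As a final sanity check (our sketch, not the paper's code), the $\alpha = 1$ branch can be compared against a one-sided finite difference between α-entmax at $\alpha$ slightly above 1 and softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for stability
    return e / e.sum()

def entmax_bisect(z, alpha, n_iter=80):
    """alpha-entmax (alpha > 1) via bisection on tau; our sketch."""
    zs = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(zs - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def grad_alpha_at_one(z):
    """The alpha = 1 branch: p_i (sum_j p_j log^2 p_j - log^2 p_i) / 2."""
    p = softmax(z)
    L = np.log(p)
    return p * (np.sum(p * L**2) - L**2) / 2.0
```

Since softmax is dense, the support is full at $\alpha = 1$ and the finite difference is well-behaved for small perturbations of $\alpha$.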