1 Introduction
The Transformer architecture (Vaswani et al., 2017) for deep neural networks has quickly risen to prominence in NLP through its efficiency and performance, leading to improvements in the state of the art of Neural Machine Translation (NMT; Junczys-Dowmunt et al., 2018; Ott et al., 2018), as well as inspiring other powerful general-purpose models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019). At the heart of the Transformer lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Mareček and Rosa, 2018; Tenney et al., 2019; Voita et al., 2019).
The attention distribution of each head is typically predicted using the softmax normalizing transform. As a result, all context words have nonzero attention weight. Recent work on single attention architectures suggests that using sparse normalizing transforms in attention mechanisms such as sparsemax – which can yield exactly zero probabilities for irrelevant words – may improve performance and interpretability (Malaviya et al., 2018; Deng et al., 2018; Peters et al., 2019). Qualitative analysis of attention heads (Vaswani et al., 2017, Figure 5) suggests that, depending on what phenomena they capture, heads tend to favor flatter or more peaked distributions.
Recent works have proposed sparse Transformers and adaptive span Transformers (Child et al., 2019; Sukhbaatar et al., 2019). However, the “sparsity” of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous. Figure 1 shows the relationship of these methods with ours.
Our contributions are the following:

We introduce sparse attention into the Transformer architecture, showing that it eases interpretability and leads to slight accuracy gains.

We propose an adaptive version of sparse attention, where the shape of each attention head is learnable and can vary continuously and dynamically between the dense limit case of softmax and the sparse, piecewise-linear sparsemax case. (Footnote 1: Code and pip package available at https://github.com/deepspin/entmax.)

We make an extensive analysis of the added interpretability of these models, identifying both crisper examples of attention head behavior observed in previous work, as well as novel behaviors unraveled thanks to the sparsity and adaptivity of our proposed model.
2 Background
2.1 The Transformer
In NMT, the Transformer (Vaswani et al., 2017) is a sequence-to-sequence (seq2seq) model which maps an input sequence to an output sequence through hierarchical multi-head attention mechanisms, yielding a dynamic, context-dependent strategy for propagating information within and across sentences. It contrasts with previous seq2seq models, which usually rely either on costly gated recurrent operations (often LSTMs: Bahdanau et al., 2015; Luong et al., 2015) or static convolutions (Gehring et al., 2017).
Given $n$ query contexts and $m$ sequence items under consideration, attention mechanisms compute, for each query, a weighted representation of the items. The particular attention mechanism used in Vaswani et al. (2017) is called scaled dot-product attention, and it is computed in the following way:
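$$\mathsf{Att}(Q, K, V) = \pi\left(\frac{Q K^\top}{\sqrt{d}}\right) V, \tag{1}$$
where $Q \in \mathbb{R}^{n \times d}$ contains representations of the queries, $K, V \in \mathbb{R}^{m \times d}$ are the keys and values of the items attended over, and $d$ is the dimensionality of these representations. The mapping $\pi$ normalizes row-wise using softmax, $\pi(Z)_{ij} = \mathrm{softmax}(z_i)_j$, where
$$\mathrm{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{j'} \exp(z_{j'})}. \tag{2}$$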
In words, the keys are used to compute a relevance score between each item and query. Then, normalized attention weights are computed using softmax, and these are used to weight the values of each item at each query context.
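As a minimal illustration of Equations 1 and 2, the following pure-Python sketch computes scaled dot-product attention with softmax normalization for small list-of-lists inputs. The function names and the toy matrices are illustrative assumptions; this is not the authors' implementation.

```python
import math

def softmax(z):
    """Exponentiate and normalize a list of scores (Equation 2)."""
    m = max(z)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: Q is n x d, K and V are m x d (lists of lists)."""
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Relevance score between this query and every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = scaled_dot_product_attention(Q, K, V)
```

Every entry of `weights[0]` is strictly positive in this toy example, including the weight on the key that points away from the query; this is the dense behavior of softmax that sparse mappings relax.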
However, for complex tasks, different parts of a sequence may be relevant in different ways, motivating multi-head attention in Transformers. This is simply the application of Equation 1 in parallel $H$ times, each with a different, learned linear transformation that allows specialization:
$$\mathsf{Head}_i(Q, K, V) = \mathsf{Att}(Q W_i^Q, K W_i^K, V W_i^V). \tag{3}$$
In the Transformer, there are three separate multi-head attention mechanisms for distinct purposes:

Encoder self-attention: builds rich, layered representations of each input word, by attending on the entire input sentence.

Context attention: selects a representative weighted average of the encodings of the input words, at each time step of the decoder.

Decoder self-attention: attends over the partial output sentence fragment produced so far.
Together, these mechanisms enable the contextualized flow of information between the input sentence and the sequential decoder.
2.2 Sparse Attention
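The softmax mapping (Equation 2) is elementwise proportional to $\exp$, therefore it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al., 2019). We focus on a recently introduced flexible family of transformations, $\alpha$-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:
$$\alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + \mathsf{H}^{\mathsf{T}}_{\alpha}(p), \tag{4}$$
where $\triangle^d := \{p \in \mathbb{R}^d : p \ge 0, \sum_i p_i = 1\}$ is the probability simplex, and, for $\alpha \ge 1$, $\mathsf{H}^{\mathsf{T}}_{\alpha}$ is the Tsallis continuous family of entropies (Tsallis, 1988):
$$\mathsf{H}^{\mathsf{T}}_{\alpha}(p) := \begin{cases} \frac{1}{\alpha(\alpha - 1)} \sum_j \left(p_j - p_j^{\alpha}\right), & \alpha \neq 1, \\ -\sum_j p_j \log p_j, & \alpha = 1. \end{cases} \tag{5}$$
This family contains the well-known Shannon and Gini entropies, corresponding to the cases $\alpha = 1$ and $\alpha = 2$, respectively.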
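The two special cases can be checked numerically. The sketch below assumes the standard Tsallis form, $\frac{1}{\alpha(\alpha-1)} \sum_j (p_j - p_j^{\alpha})$ for $\alpha \neq 1$, with the Shannon entropy taken at $\alpha = 1$; the function name `tsallis_entropy` and the example vector are hypothetical.

```python
import math

def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy of a probability vector p, with Shannon entropy at alpha = 1."""
    if abs(alpha - 1.0) < 1e-12:
        # Shannon entropy, using the convention 0 * log 0 = 0.
        return -sum(pj * math.log(pj) for pj in p if pj > 0.0)
    return sum(pj - pj ** alpha for pj in p) / (alpha * (alpha - 1.0))

p = [0.5, 0.25, 0.25]
shannon = tsallis_entropy(p, 1.0)
gini = tsallis_entropy(p, 2.0)               # reduces to 0.5 * sum_j p_j * (1 - p_j)
near_shannon = tsallis_entropy(p, 1.0 + 1e-6)  # approaches the Shannon value as alpha -> 1
```

Evaluating just above $\alpha = 1$ gives a value close to the Shannon entropy, which illustrates why the family is described as continuous in $\alpha$.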
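Equation 4 involves a convex optimization subproblem. Using the definition of $\mathsf{H}^{\mathsf{T}}_{\alpha}$, the optimality conditions may be used to derive the following form for the solution (Appendix B.2):
$$\alpha\text{-entmax}(z) = \left[(\alpha - 1) z - \tau \mathbf{1}\right]_+^{1/(\alpha - 1)}, \tag{6}$$
where $[\cdot]_+$ is the positive part (ReLU) function, $\mathbf{1}$ denotes the vector of all ones, and $\tau$ – which acts like a threshold – is the Lagrange multiplier corresponding to the $\sum_i p_i = 1$ constraint.
Properties of $\alpha$-entmax.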
The appeal of $\alpha$-entmax for attention rests on the following properties. For $\alpha = 1$ (i.e., when $\mathsf{H}^{\mathsf{T}}_{\alpha}$ becomes the Shannon entropy), it exactly recovers the softmax mapping (we provide a short derivation in Appendix B.3). For all $\alpha > 1$ it permits sparse solutions, in stark contrast to softmax. In particular, for $\alpha = 2$, it recovers the sparsemax mapping (Martins and Astudillo, 2016), which is piecewise linear. In between, as $\alpha$ increases, the mapping continuously gets sparser as its curvature changes.
To compute the value of $\alpha$-entmax, one must find the threshold $\tau$ such that the r.h.s. in Equation 6 sums to one. Blondel et al. (2019) propose a general bisection algorithm. Peters et al. (2019) introduce a faster, exact algorithm for $\alpha = 1.5$, and enable using $\alpha$-entmax with fixed $\alpha$ within a neural network by showing that the $\alpha$-entmax Jacobian w.r.t. $z$ for $p^\star = \alpha\text{-entmax}(z)$ is
$$\frac{\partial\, \alpha\text{-entmax}(z)}{\partial z} = \operatorname{diag}(s) - \frac{1}{\sum_j s_j} s s^\top, \quad \text{where} \quad s_i = \begin{cases} (p_i^\star)^{2 - \alpha}, & p_i^\star > 0, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
Our work furthers the study of $\alpha$-entmax by providing a derivation of the gradient w.r.t. the hyperparameter $\alpha$ (Section 3), thereby allowing the shape and sparsity of the mapping to be learned automatically. This is particularly appealing in the context of multi-head attention mechanisms, where we shall show in Section 5.1 that different heads tend to learn different sparsity behaviors.
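The threshold search described in this section can be sketched with a generic bisection, assuming $\alpha > 1$ and the thresholded-power form $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$ of Equation 6. This is an illustrative pure-Python version, not the exact algorithm of Blondel et al. (2019) or Peters et al. (2019), and not the API of the `entmax` package; the name `entmax_bisect` and the iteration count are assumptions.

```python
def entmax_bisect(z, alpha, n_iter=60):
    """Approximate alpha-entmax (alpha > 1) by bisection on the threshold tau."""
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1; use softmax for alpha = 1")
    s = [(alpha - 1.0) * v for v in z]
    power = 1.0 / (alpha - 1.0)

    def p_of(tau):
        return [max(v - tau, 0.0) ** power for v in s]

    # At tau = max(s) - 1 the largest coordinate alone contributes 1, so the sum is >= 1.
    # At tau = max(s) every coordinate is clipped to 0, so the sum is 0.
    lo, hi = max(s) - 1.0, max(s)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(p_of(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = p_of(lo)
    total = sum(p)
    return [v / total for v in p]  # renormalize away the residual bisection error

z = [2.0, 1.0, 0.1, -1.0]
p15 = entmax_bisect(z, 1.5)  # two nonzero entries for these scores
p20 = entmax_bisect(z, 2.0)  # the sparsemax case: all mass on the top score
```

For these scores the two lowest-scoring items receive exactly zero weight at $\alpha = 1.5$, and at $\alpha = 2$ only the top item remains in the support, which matches the qualitative picture of sparsity increasing with $\alpha$.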
3 Adaptively Sparse Transformers with $\alpha$-entmax
We now propose a novel Transformer architecture wherein we simply replace softmax with $\alpha$-entmax in the attention heads. Concretely, we replace the row normalization $\pi$ in Equation 1 by
$$\pi(Z)_{ij} = \alpha\text{-entmax}(z_i)_j. \tag{8}$$
This change immediately leads to sparse attention weights, as long as $\alpha > 1$; in particular, $\alpha = 1.5$ is a sensible starting point (Peters et al., 2019).
Different $\alpha$ per head.
Unlike LSTM-based seq2seq models, where $\alpha$ can be more easily tuned by grid search, in a Transformer, there are many attention heads in multiple layers. Crucial to the power of such models, the different heads capture different linguistic phenomena, some of them isolating important words, others spreading out attention across phrases (Vaswani et al., 2017, Figure 5). This motivates using different, adaptive $\alpha$ values for each attention head, such that some heads may learn to be sparser, and others may become closer to softmax. We propose doing so by treating the $\alpha$ values as neural network parameters, optimized via stochastic gradients along with the other weights.
Gradients w.r.t. $\alpha$.
In order to optimize $\alpha$ automatically via gradient methods, we must compute the gradient of the entmax output w.r.t. $\alpha$. Since entmax is defined through an optimization problem, this is nontrivial and cannot be simply handled through automatic differentiation; it falls within the domain of argmin differentiation, an active research topic in optimization (Gould et al., 2016; Amos and Kolter, 2017).
One of our key contributions is the derivation of a closed-form expression for this gradient. The next proposition provides the expression of this gradient, enabling entmax layers with adaptive $\alpha$. To the best of our knowledge, ours is the first neural network module that can automatically, continuously vary in shape away from softmax and toward sparse mappings like sparsemax.
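Proposition 1.
Let $p^\star := \alpha\text{-entmax}(z)$ be the solution of Equation 4. Denote the distribution $\tilde{p}_i(\alpha) := (p_i^\star)^{2 - \alpha} / \sum_j (p_j^\star)^{2 - \alpha}$ and let $h_i := -p_i^\star \log p_i^\star$. The $i$-th component of the gradient $g := \partial\, \alpha\text{-entmax}(z) / \partial \alpha$ is
$$g_i = \begin{cases} \dfrac{p_i^\star - \tilde{p}_i(\alpha)}{(\alpha - 1)^2} + \dfrac{h_i - \tilde{p}_i(\alpha) \sum_j h_j}{\alpha - 1}, & \alpha > 1, \\[2ex] \dfrac{h_i \log p_i^\star - p_i^\star \sum_j h_j \log p_j^\star}{2}, & \alpha = 1. \end{cases} \tag{9}$$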
The proof uses implicit function differentiation and is given in Appendix C.
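Proposition 1 can be sanity-checked numerically. The sketch below assumes the $\alpha > 1$ branch in the form $(p_i - \tilde{p}_i)/(\alpha - 1)^2 + (h_i - \tilde{p}_i \sum_j h_j)/(\alpha - 1)$, with $\tilde{p}_i \propto p_i^{2-\alpha}$ and $h_i = -p_i \log p_i$ on the support, and compares it against central finite differences of a bisection-based entmax. All function names are hypothetical, and the solver is a generic stand-in rather than the authors' code.

```python
import math

def entmax(z, alpha, n_iter=100):
    """alpha-entmax for alpha > 1 via bisection on the threshold (generic solver)."""
    s = [(alpha - 1.0) * v for v in z]
    power = 1.0 / (alpha - 1.0)
    lo, hi = max(s) - 1.0, max(s)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(max(v - mid, 0.0) ** power for v in s) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(v - lo, 0.0) ** power for v in s]
    total = sum(p)
    return [v / total for v in p]

def entmax_grad_alpha(p, alpha):
    """Gradient of alpha-entmax w.r.t. alpha (alpha > 1), evaluated at the solution p."""
    on = [pj > 0.0 for pj in p]
    # Coordinates outside the support contribute nothing (and avoid 0 ** 0 at alpha = 2).
    q = [pj ** (2.0 - alpha) if keep else 0.0 for pj, keep in zip(p, on)]
    q_sum = sum(q)
    p_tilde = [qj / q_sum for qj in q]
    h = [-pj * math.log(pj) if keep else 0.0 for pj, keep in zip(p, on)]
    h_sum = sum(h)
    a1 = alpha - 1.0
    return [(pj - ptj) / a1 ** 2 + (hj - ptj * h_sum) / a1
            for pj, ptj, hj in zip(p, p_tilde, h)]

z, alpha, eps = [1.0, 0.6, 0.2, -1.0], 1.5, 1e-5
analytic = entmax_grad_alpha(entmax(z, alpha), alpha)
numeric = [(up - down) / (2.0 * eps)
           for up, down in zip(entmax(z, alpha + eps), entmax(z, alpha - eps))]
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

Because the output always sums to one, the gradient components sum to zero, and coordinates outside the support have zero gradient; both are quick consistency checks when wiring such a gradient into a custom backward pass.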
Proposition 1 provides the remaining missing piece needed for training adaptively sparse Transformers. In the following section, we evaluate this strategy on neural machine translation, and analyze the behavior of the learned attention heads.
4 Experiments
We apply our adaptively sparse Transformers on four machine translation tasks. For comparison, a natural baseline is the standard Transformer architecture using the softmax transform in its multi-head attention mechanisms. We consider two other model variants in our experiments that make use of different normalizing transformations:

1.5-entmax: a Transformer with sparse entmax attention with fixed $\alpha = 1.5$ for all heads. This is a novel model, since 1.5-entmax had only been proposed for RNN-based NMT models (Peters et al., 2019), but never in Transformers, where attention modules are not just one single component of the seq2seq model but rather an integral part of all of the model components.

$\alpha$-entmax: an adaptive Transformer with sparse entmax attention with a different, learned $\alpha^t_{i,j}$ for each head.
The adaptive model has an additional scalar parameter per attention head per layer for each of the three attention mechanisms (encoder self-attention, context attention, and decoder self-attention), i.e.,
$$\left\{a^t_{i,j} \in \mathbb{R} : i \in \{1, \dots, L\},\; j \in \{1, \dots, H\},\; t \in \{\mathrm{enc}, \mathrm{ctx}, \mathrm{dec}\}\right\}, \tag{10}$$
and we set $\alpha^t_{i,j} = 1 + \mathrm{sigmoid}(a^t_{i,j}) \in \,]1, 2[$. All or some of the $\alpha$ values can be tied if desired, but we keep them independent for analysis purposes.
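One way to keep each learned $\alpha$ strictly between the softmax and sparsemax limits can be sketched as follows, assuming a sigmoid reparameterization $\alpha = 1 + \mathrm{sigmoid}(a)$ of one unconstrained scalar $a$ per head. The dictionary layout, the sizes, and the zero initialization are illustrative assumptions only.

```python
import math

def alpha_from_param(a):
    """Map an unconstrained scalar a to alpha = 1 + sigmoid(a), which lies in (1, 2)."""
    return 1.0 + 1.0 / (1.0 + math.exp(-a))

def dalpha_da(a):
    """Chain-rule factor d alpha / d a = sigmoid(a) * (1 - sigmoid(a))."""
    sig = 1.0 / (1.0 + math.exp(-a))
    return sig * (1.0 - sig)

# One scalar per (attention type, layer, head). The sizes 6 and 8 are assumed here,
# and the parameters are set to 0.0 purely for illustration.
n_layers, n_heads = 6, 8
params = {(t, i, j): 0.0
          for t in ("enc", "ctx", "dec")
          for i in range(n_layers)
          for j in range(n_heads)}
alphas = {key: alpha_from_param(a) for key, a in params.items()}
```

During training, the gradient of the loss with respect to $a$ would be the gradient with respect to $\alpha$ multiplied by `dalpha_da(a)`, so the only entmax-specific ingredient is the gradient with respect to $\alpha$ itself.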
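Table 1: Test set tokenized BLEU results.
activation        | de→en | ja→en | ro→en | en→de
softmax           | 29.79 | 21.57 | 32.70 | 26.02
1.5-entmax        | 29.83 | 22.13 | 33.10 | 25.89
$\alpha$-entmax   | 29.90 | 21.74 | 32.89 | 26.93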
Datasets.
Our models were trained on 4 machine translation datasets of different training sizes:

IWSLT 2017 German → English (de→en, Cettolo et al., 2017): 200K sentence pairs.

KFTT Japanese → English (ja→en, Neubig, 2011): 300K sentence pairs.

WMT 2016 Romanian → English (ro→en, Bojar et al., 2016): 600K sentence pairs.

WMT 2014 English → German (en→de, Bojar et al., 2014): 4.5M sentence pairs.
All of these datasets were preprocessed with byte-pair encoding (BPE; Sennrich et al., 2016), using joint segmentations of 32k merge operations.
Training.
We follow the dimensions of the Transformer-Base model of Vaswani et al. (2017): the number of layers is $L = 6$ and the number of heads is $H = 8$ in the encoder self-attention, the context attention, and the decoder self-attention. We use a mini-batch size of 8192 tokens and warm up the learning rate linearly until 20k steps, after which it decays according to an inverse square root schedule. All models were trained until convergence of validation accuracy, and evaluation was done every 10k steps for ro→en and en→de and every 5k steps for de→en and ja→en. The end-to-end computational overhead of our methods, when compared to standard softmax, is relatively small; in training tokens per second, the models using entmax and entmax are, respectively, and the speed of the softmax model.
Results.
We report test set tokenized BLEU (Papineni et al., 2002) results in Table 1. We can see that replacing softmax by entmax does not hurt performance in any of the datasets; indeed, sparse attention Transformers tend to have slightly higher BLEU, and their sparsity additionally offers a better potential for analysis. In the next section, we make use of this potential by exploring the learned internal mechanics of the self-attention heads.
5 Analysis
We conduct an analysis for the higher-resource dataset WMT 2014 English → German of the attention in the sparse adaptive Transformer model ($\alpha$-entmax) at multiple levels: we analyze high-level statistics as well as individual head behavior. Moreover, we make a qualitative analysis of the interpretability capabilities of our models.
5.1 High-Level Statistics
What kind of $\alpha$ values are learned?
Figure 2 shows the learning trajectories of the $\alpha$ parameters of a selected subset of heads. We generally observe a tendency for the randomly-initialized $\alpha$ parameters to decrease initially, suggesting that softmax-like behavior may be preferable while the model is still very uncertain. After around one thousand steps, some heads change direction and become sparser, perhaps as they become more confident and specialized. This shows that the initialization of $\alpha$ does not predetermine its sparsity level or the role the head will have throughout. In particular, head in the encoder self-attention layer first drops to around before becoming one of the sparsest heads, with .
The overall distribution of $\alpha$ values at convergence can be seen in Figure 3. We can observe that the encoder self-attention blocks learn to concentrate the $\alpha$ values in two modes: a very sparse one around $\alpha \to 2$, and a dense one between softmax and 1.5-entmax. However, the decoder self and context attention only learn to distribute these parameters in a single mode. We show next that this is reflected in the average density of attention weight vectors as well.
Attention weight density when translating.
For any $\alpha > 1$, it would still be possible for the weight matrices in Equation 3 to learn re-scalings so as to make attention sparser or denser. To visualize the impact of adaptive $\alpha$ values, we compare the empirical attention weight density (the average number of tokens receiving nonzero attention) within each module, against sparse Transformers with fixed $\alpha = 1.5$.
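The density statistic can be sketched as below, assuming each attention row is a list of weights over the tokens of one sentence. Reporting a fraction rather than a raw count is a normalization choice made here for illustration, and the function name is hypothetical.

```python
def attention_density(rows, normalize=True):
    """Average number (or fraction, if normalize) of nonzero weights per attention row."""
    values = []
    for row in rows:
        nonzero = sum(1 for w in row if w > 0.0)
        values.append(nonzero / len(row) if normalize else nonzero)
    return sum(values) / len(values)

sparse_rows = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
dense_rows = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
sparse_density = attention_density(sparse_rows)
dense_density = attention_density(dense_rows)
```

Softmax rows always have density one under this definition, so the statistic is only informative for mappings that can produce exact zeros.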
Figure 4 shows that, with fixed $\alpha = 1.5$, heads tend to be sparse and similarly distributed in all three attention modules. With learned $\alpha$, there are two notable changes: (i) a prominent mode corresponding to fully dense probabilities, showing that our models learn to combine sparse and dense attention, and (ii) a distinction between the encoder self-attention – whose background distribution tends toward extreme sparsity – and the other two modules, which exhibit more uniform background distributions. This suggests that perhaps entirely sparse Transformers are suboptimal.
The fact that the decoder seems to prefer denser attention distributions might be attributed to it being autoregressive, only having access to past tokens and not the full sentence. We speculate that it might lose too much information if it assigned weights of zero to too many tokens in the self-attention, since there are fewer tokens to attend to in the first place.
Teasing this down into separate layers, Figure 5 shows the average (sorted) density of each head for each layer. We observe that $\alpha$-entmax is able to learn different sparsity patterns at each layer, leading to more variance in individual head behavior, to clearly identified dense and sparse heads, and overall to different tendencies compared to the fixed case of $\alpha = 1.5$.
Head diversity.
To measure the overall disagreement between attention heads, as a measure of head diversity, we use the following generalization of the Jensen-Shannon divergence:
$$\mathrm{JS} = \mathsf{H}^{\mathsf{S}}\left(\frac{1}{H} \sum_{j=1}^{H} p_j\right) - \frac{1}{H} \sum_{j=1}^{H} \mathsf{H}^{\mathsf{S}}(p_j), \tag{11}$$
where $p_j$ is the vector of attention weights assigned by head $j$ to each word in the sequence, and $\mathsf{H}^{\mathsf{S}}$ is the Shannon entropy, base-adjusted based on the dimension of $p$ such that $\mathrm{JS} \le 1$. We average this measure over the entire validation set. The higher this metric is, the more the heads are taking different roles in the model.
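The disagreement measure can be sketched as the entropy of the average head distribution minus the average of the per-head entropies. The sketch assumes that the base adjustment means taking logarithms in base equal to the sequence length, so that each entropy is at most one; that reading, and the function names, are assumptions.

```python
import math

def shannon_entropy(p, base):
    """Shannon entropy with logarithms in the given base (0 * log 0 treated as 0)."""
    return -sum(pj * math.log(pj, base) for pj in p if pj > 0.0)

def head_disagreement(heads):
    """Entropy of the mean attention distribution minus the mean of the head entropies.

    `heads` is a list of attention vectors over the same sequence (length > 1).
    """
    n_heads, dim = len(heads), len(heads[0])
    mean = [sum(p[k] for p in heads) / n_heads for k in range(dim)]
    mean_entropy = sum(shannon_entropy(p, dim) for p in heads) / n_heads
    return shannon_entropy(mean, dim) - mean_entropy

identical = head_disagreement([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]])
disjoint = head_disagreement([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Heads that attend identically give a value of zero, while heads that each put all their mass on a different token reach the maximum of one under this base choice.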
Figure 6 shows that both sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.
The statistics shown in this section can be found for the other language pairs in Appendix A.
5.2 Identifying Head Specializations
Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Tang et al., 2018; Voita et al., 2019). Identifying the specialization of a head can be done by observing the type of tokens or sequences to which the head often assigns most of its attention weight; this is facilitated by sparsity.
Positional heads.
One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found for each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head’s confidence as the average attention weight assigned to the previous token. The softmax model has three heads for position $-1$, with median confidence . The 1.5-entmax model also has three heads for this position, with median confidence . The adaptive model has four heads, with median confidences , the lowest-confidence head being dense with , while the highest-confidence head being sparse (). For position $+1$, the models each dedicate one head, with confidence around , slightly higher for entmax. The adaptive model sets for this head.
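The confidence measure for a positional head can be sketched as follows, assuming a square attention matrix whose row $i$ holds the weights that query position $i$ assigns to every position of the same sequence. The function name, the offset convention, and the toy matrix are hypothetical.

```python
def positional_confidence(attn, offset=-1):
    """Average weight that each query position i puts on position i + offset.

    Positions whose target i + offset falls outside the sequence are skipped.
    """
    picked = [row[i + offset] for i, row in enumerate(attn)
              if 0 <= i + offset < len(row)]
    return sum(picked) / len(picked)

attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
]
conf_prev = positional_confidence(attn, offset=-1)  # weight on the previous token
conf_next = positional_confidence(attn, offset=+1)  # weight on the next token
```

A per-head summary such as a median would then be taken over these values across heads that share the same offset.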
BPE-merging head.
Due to the sparsity of our models, we are able to identify other head specializations, easily identifying which heads should be further analyzed. In Figure 8 we show one such head where the $\alpha$ value is particularly high (in the encoder, layer 1, head 4 depicted in Figure 2). We found that this head most often looks at the current time step with high confidence, making it a positional head with offset $0$. However, this head often spreads weight sparsely over 2-3 neighboring tokens, when the tokens are part of the same BPE cluster or hyphenated words. (Footnote 2: BPE-segmented words are denoted by in the figures.) As this head is in the first layer, it provides a useful service to the higher layers by combining information evenly within some BPE clusters.
For each BPE cluster or cluster of hyphenated words, we computed a score between 0 and 1 that corresponds to the maximum attention mass assigned by any token to the rest of the tokens inside the cluster in order to quantify the BPE-merging capabilities of these heads. (Footnote 3: If the cluster has size 1, the score is the weight the token assigns to itself.) There are not any attention heads in the softmax model that are able to obtain a score over , while for 1.5-entmax and $\alpha$-entmax there are two heads in each ( and for 1.5-entmax and and for $\alpha$-entmax).
Interrogation head.
On the other hand, in Figure 9 we show a head for which our adaptively sparse model chose an $\alpha$ close to 1, making it closer to softmax (also shown in encoder, layer 1, head 3 depicted in Figure 2). We observe that this head assigns a high probability to question marks at the end of the sentence in time steps where the current token is interrogative, thus making it an interrogation-detecting head. We also observe this type of head in the other models, which we also depict in Figure 9. The average attention weight placed on the question mark when the current token is an interrogative word is for softmax, for 1.5-entmax, and for $\alpha$-entmax.
Furthermore, we can examine sentences where some typically sparse heads become less so, thus identifying sources of ambiguity where the head is less confident in its prediction. An example is shown in Figure 10 where sparsity in the same head differs for sentences of similar length.
6 Related Work
Sparse attention.
Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism which is controlled by a fixed $\alpha$. In contrast, this is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous $\alpha$ parameter. We also provide the first results using sparse attention in a Transformer model.
Fixed sparsity patterns.
Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 9). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns in the same span. Interestingly, some of their findings mirror ours – we found that attention heads in the last layers tend to be denser on average when compared to the ones in the first layers, while their work has found that lower layers tend to have a shorter attention span compared to higher layers.
Transformer interpretability.
The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made of the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline. Regarding redundancy of heads, Voita et al. (2019) develop a method that is able to prune heads of the multi-head attention module and make an empirical study of the role that each head has in self-attention (positional, syntactic and rare words). Li et al. (2018) also aim to reduce head redundancy by adding a regularization term to the loss that maximizes head disagreement and obtain improved results. While not considering Transformer attentions, Jain and Wallace (2019) show that traditional attention mechanisms do not necessarily improve interpretability since softmax attention is vulnerable to an adversarial attack leading to wildly different model predictions for the same attention weights. Sparse attention may mitigate these issues; however, our work focuses mostly on a more mechanical aspect of interpretation by analyzing head behavior, rather than on explanations for predictions.
7 Conclusion and Future Work
We contribute a novel strategy for adaptively sparse attention, and, in particular, for adaptively sparse Transformers. We present the first empirical analysis of Transformers with sparse attention mappings (i.e., entmax), showing potential in both translation accuracy as well as in model interpretability.
In particular, we analyzed how the attention heads in the proposed adaptively sparse Transformer can specialize more and with higher confidence. Our adaptivity strategy relies only on gradient-based optimization, side-stepping costly per-head hyper-parameter searches. Further speed-ups are possible by leveraging more parallelism in the bisection algorithm for computing $\alpha$-entmax.
Finally, some of the automatically-learned behaviors of our adaptively sparse Transformers, for instance, the near-deterministic positional heads or the subword joining head, may provide new ideas for designing static variations of the Transformer architecture.
Acknowledgments
This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We are grateful to Ben Peters for the $\alpha$-entmax code and Erick Fonseca, Marcos Treviso, Pedro Martins, and Tsvetomila Mihaylova for insightful group discussion. We thank Mathieu Blondel for the idea to learn $\alpha$. We would also like to thank the anonymous reviewers for their helpful feedback.
References
 Amos and Kolter (2017) Brandon Amos and J. Zico Kolter. 2017. OptNet: Differentiable optimization as a layer in neural networks. In Proc. ICML.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.
 Blondel et al. (2019) Mathieu Blondel, André FT Martins, and Vlad Niculae. 2019. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In Proc. AISTATS.
 Bojar et al. (2014) Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proc. Workshop on Statistical Machine Translation.
 Bojar et al. (2016) Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In Proc. WMT.
 Cettolo et al. (2017) M Cettolo, M Federico, L Bentivogli, J Niehues, S Stüker, K Sudoh, K Yoshino, and C Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proc. IWSLT.
 Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. preprint arXiv:1904.10509.
 Clarke (1990) Frank H Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.
 Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Proc. NeurIPS.
 Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. ICML.
 Gould et al. (2016) Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. preprint arXiv:1607.05447.
 Held et al. (1974) Michael Held, Philip Wolfe, and Harlan P Crowder. 1974. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.
 Jain and Wallace (2019) Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proc. NAACL-HLT.
 Junczys-Dowmunt et al. (2018) Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proc. WNMT.
 Li et al. (2018) Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In Proc. EMNLP.
 Louizos et al. (2018) Christos Louizos, Max Welling, and Diederik P Kingma. 2018. Learning sparse neural networks through $L_0$ regularization. In Proc. ICLR.
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. EMNLP.
 Malaviya et al. (2018) Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. 2018. Sparse and constrained attention for neural machine translation. In Proc. ACL.
 Mareček and Rosa (2018) David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from Transformer encoder self-attentions. In Proc. BlackboxNLP.
 Martins and Astudillo (2016) André FT Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. ICML.
 Maruf et al. (2019) Sameen Maruf, André FT Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788.
 Neubig (2011) Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
 Niculae and Blondel (2017) Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Proc. NeurIPS.
 Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. WMT.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.
 Peters et al. (2019) Ben Peters, Vlad Niculae, and André FT Martins. 2019. Sparse sequence-to-sequence models. In Proc. ACL.
 Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. preprint.
 Raganato and Tiedemann (2018) Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proc. BlackboxNLP.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. ACL.
 Shao et al. (2019) Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. 2019. SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR.
 Sukhbaatar et al. (2019) Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive Attention Span in Transformers. In Proc. ACL.
 Tang et al. (2018) Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. EMNLP.
 Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. ACL.
 Tsallis (1988) Constantino Tsallis. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.
 Voita et al. (2018) Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proc. ACL.
 Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. ACL.
 Wu et al. (2019) Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. ICLR.
Appendix A High-Level Statistics Analysis of Other Language Pairs
Appendix B Background
B.1 Regularized Fenchel-Young prediction functions
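Definition 1 (Blondel et al. 2019).
Let $\Omega \colon \triangle^d \to \mathbb{R} \cup \{\infty\}$ be a strictly convex regularization function. We define the prediction function $\pi_\Omega$ as
$$\pi_\Omega(z) = \operatorname*{argmax}_{p \in \triangle^d} \left(p^\top z - \Omega(p)\right). \tag{12}$$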
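B.2 Characterizing the $\alpha$-entmax mapping
Lemma 1 (Peters et al. 2019).
For any $z$, there exists a unique $\tau^\star$ such that
$$\alpha\text{-entmax}(z) = \left[(\alpha - 1) z - \tau^\star \mathbf{1}\right]_+^{1/(\alpha - 1)}. \tag{13}$$
Proof: From the definition of $\alpha$-entmax,
$$\alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + \mathsf{H}^{\mathsf{T}}_{\alpha}(p), \tag{14}$$
we may easily identify it with a regularized prediction function (Def. 1): $\alpha\text{-entmax}(z) \equiv \pi_{-\mathsf{H}^{\mathsf{T}}_{\alpha}}(z)$.
We first note that for all $p \in \triangle^d$,
$$-(\alpha - 1) \mathsf{H}^{\mathsf{T}}_{\alpha}(p) = \frac{1}{\alpha} \sum_{i=1}^{d} p_i^{\alpha} + \text{const}. \tag{15}$$
From the constant invariance and scaling properties of $\pi_\Omega$ (Blondel et al., 2019, Proposition 1, items 4–5), $\pi_{-\mathsf{H}^{\mathsf{T}}_{\alpha}}(z) = \pi_\Omega((\alpha - 1) z)$, with $\Omega(p) = \sum_{j=1}^{d} g(p_j)$ and $g(t) = t^{\alpha} / \alpha$.
Using (Blondel et al., 2019, Proposition 5), noting that $g'(t) = t^{\alpha - 1}$ and $(g')^{-1}(u) = u^{1/(\alpha - 1)}$, yields
$$\pi_\Omega(z) = \left[z - \tau^\star \mathbf{1}\right]_+^{1/(\alpha - 1)}, \quad \text{and therefore} \quad \alpha\text{-entmax}(z) = \left[(\alpha - 1) z - \tau^\star \mathbf{1}\right]_+^{1/(\alpha - 1)}. \tag{16}$$
Since $-\mathsf{H}^{\mathsf{T}}_{\alpha}$ is strictly convex on the simplex, $\alpha$-entmax has a unique solution $p^\star$. Equation 16 implicitly defines a one-to-one mapping between $p^\star$ and $\tau^\star$ as long as $p^\star \in \triangle^d$, therefore $\tau^\star$ is also unique.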
B.3 Connections to softmax and sparsemax
The Euclidean projection onto the simplex, sometimes referred to, in the context of neural attention, as sparsemax (Martins and Astudillo, 2016), is defined as
$$\mathrm{sparsemax}(z) := \operatorname*{argmin}_{p \in \triangle^d} \|p - z\|_2^2. \tag{17}$$
The solution can be characterized through the unique threshold $\tau$ such that $\sum_i \mathrm{sparsemax}(z)_i = 1$ and (Held et al., 1974)
$$\mathrm{sparsemax}(z) = [z - \tau \mathbf{1}]_+. \tag{18}$$
Thus, each coordinate of the sparsemax solution is a piecewise-linear function. Visibly, this expression is recovered when setting $\alpha = 2$ in the $\alpha$-entmax expression (Equation 21); for other values of $\alpha$, the exponent induces curvature.
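The threshold characterization can be made concrete with the well-known sort-based procedure for projecting onto the simplex. The sketch below is a generic pure-Python version under the assumption that sparsemax has the thresholded form $[z - \tau \mathbf{1}]_+$; the function name is illustrative.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex via a sorted threshold."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, v in enumerate(z_sorted, start=1):
        cumsum += v
        if 1.0 + k * v > cumsum:
            # The k largest entries are all in the support; update the threshold.
            tau = (cumsum - 1.0) / k
        else:
            break  # the support condition fails for every later k as well
    return [max(v - tau, 0.0) for v in z]

p_peaked = sparsemax([2.0, 1.0, 0.1, -1.0])  # a single item keeps all the mass
p_spread = sparsemax([0.5, 0.3, 0.1])        # all three items stay in the support
```

When the scores are close together, the output is simply a shifted copy of the input, which is the piecewise-linear behavior noted in the text; when one score dominates, the remaining coordinates are clipped to exactly zero.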
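On the other hand, the well-known softmax is usually defined through the expression
$$\mathrm{softmax}(z)_i := \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \tag{19}$$
which can be shown to be the unique solution of the optimization problem
$$\mathrm{softmax}(z) = \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + \mathsf{H}^{\mathsf{S}}(p), \tag{20}$$
where $\mathsf{H}^{\mathsf{S}}(p) := -\sum_i p_i \log p_i$ is the Shannon entropy. Indeed, setting the gradient to $0$ yields the condition $\log p_i = z_i + \nu_i - \tau - 1$, where $\tau$ and $\nu \ge 0$ are Lagrange multipliers for the simplex constraints $\sum_i p_i = 1$ and $p_i \ge 0$, respectively. Since the l.h.s. is only finite for $p_i > 0$, we must have $\nu_i = 0$ for all $i$, by complementary slackness. Thus, the solution must have the form $p_i = \exp(z_i) / Z$, yielding Equation 19.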
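Appendix C Gradient of $\alpha$-entmax w.r.t. the shape parameter $\alpha$: Proof of Proposition 1
Recall that the entmax transformation is defined as:
$$\alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + \mathsf{H}^{\mathsf{T}}_{\alpha}(p), \tag{21}$$
where $\triangle^d := \{p \in \mathbb{R}^d : p \ge 0, \sum_i p_i = 1\}$ and $\mathsf{H}^{\mathsf{T}}_{\alpha}$ is the Tsallis entropy,
$$\mathsf{H}^{\mathsf{T}}_{\alpha}(p) := \begin{cases} \frac{1}{\alpha(\alpha - 1)} \sum_j \left(p_j - p_j^{\alpha}\right), & \alpha \neq 1, \\ \mathsf{H}^{\mathsf{S}}(p), & \alpha = 1, \end{cases} \tag{22}$$
and $\mathsf{H}^{\mathsf{S}}(p) := -\sum_j p_j \log p_j$ is the Shannon entropy.
In this section, we derive the gradient of $p^\star := \alpha\text{-entmax}(z)$ with respect to the scalar parameter $\alpha$.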
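C.1 General case of $\alpha > 1$
From the KKT conditions associated with the optimization problem in Eq. 21, we have that the solution $p^\star$ has the following form, coordinate-wise:
$$p_i^\star = \left[(\alpha - 1) z_i - \tau\right]_+^{1/(\alpha - 1)}, \tag{23}$$
where $\tau$ is a scalar Lagrange multiplier that ensures that $p^\star$ normalizes to 1, i.e., it is defined implicitly by the condition:
$$\sum_i \left[(\alpha - 1) z_i - \tau\right]_+^{1/(\alpha - 1)} = 1. \tag{24}$$
For general values of $\alpha$, Eq. 24 lacks a closed form solution. This makes the computation of the gradient
$$\frac{\partial\, \alpha\text{-entmax}(z)}{\partial \alpha} \tag{25}$$
nontrivial. Fortunately, we can use the technique of implicit differentiation to obtain this gradient.
The gradient exists almost everywhere, and the expressions we derive yield a generalized gradient (Clarke, 1990) at any non-differentiable points that may occur for certain ($\alpha$, $z$) pairs. We begin by noting that $\partial p_i^\star / \partial \alpha = 0$ if $p_i^\star = 0$, because increasing $\alpha$ keeps sparse coordinates sparse. (Footnote 4: This follows from the margin property of $\mathsf{H}^{\mathsf{T}}_{\alpha}$ (Blondel et al., 2019).) Therefore we need to worry only about coordinates that are in the support of $p^\star$. We will assume hereafter that the $i$-th coordinate of $p^\star$ is nonzero. We have:
$$\frac{\partial p_i^\star}{\partial \alpha} = \frac{\partial}{\partial \alpha} \left((\alpha - 1) z_i - \tau\right)^{\frac{1}{\alpha - 1}} = p_i^\star \frac{\partial}{\partial \alpha} \left[\frac{\log\left((\alpha - 1) z_i - \tau\right)}{\alpha - 1}\right] = \frac{(p_i^\star)^{2 - \alpha}}{\alpha - 1} \left(z_i - \frac{\partial \tau}{\partial \alpha}\right) - \frac{p_i^\star \log p_i^\star}{\alpha - 1}. \tag{26}$$
We can see that this gradient depends on $\partial \tau / \partial \alpha$, which we now compute using implicit differentiation.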
C.2 Solving the indetermination for $\alpha = 1$
To solve this indetermination, we will need to apply L’Hôpital’s rule twice. Let us first compute the derivative of with respect to . We have
(32) 
therefore
(33)  
Differentiating the numerator and denominator in Eq. 31, we get:
(34)  
with
(35)  
and
(36) 
When , becomes again a indetermination, which we can solve by applying again L’Hôpital’s rule. Differentiating the numerator and denominator in Eq. 36:
(37)  
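Finally, summing Eq. 35 and Eq. 37, we get
$$\lim_{\alpha \to 1^+} \frac{\partial\, \alpha\text{-entmax}(z)_i}{\partial \alpha} = \frac{h_i \log p_i^\star - p_i^\star \sum_j h_j \log p_j^\star}{2}, \quad \text{where } h_i := -p_i^\star \log p_i^\star. \tag{38}$$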
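C.3 Summary
To sum up, we have the following expression for the gradient of $\alpha\text{-entmax}(z)$ with respect to $\alpha$, where $\tilde{p}_i(\alpha) := (p_i^\star)^{2 - \alpha} / \sum_j (p_j^\star)^{2 - \alpha}$ and $h_i := -p_i^\star \log p_i^\star$:
$$\frac{\partial\, \alpha\text{-entmax}(z)_i}{\partial \alpha} = \begin{cases} \dfrac{p_i^\star - \tilde{p}_i(\alpha)}{(\alpha - 1)^2} + \dfrac{h_i - \tilde{p}_i(\alpha) \sum_j h_j}{\alpha - 1}, & \alpha > 1, \\[2ex] \dfrac{h_i \log p_i^\star - p_i^\star \sum_j h_j \log p_j^\star}{2}, & \alpha = 1. \end{cases} \tag{39}$$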