
Sparse Sequence-to-Sequence Models

by   Ben Peters, et al.
Unbabel Inc.

Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of α-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any α > 1. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.



1 Introduction

Attention-based sequence-to-sequence (seq2seq) models have proven useful for a variety of NLP applications, including machine translation Bahdanau et al. (2015); Vaswani et al. (2017), speech recognition Chorowski et al. (2015), abstractive summarization Chopra et al. (2016), and morphological inflection generation Kann and Schütze (2016), among others. In part, their strength comes from their flexibility: many tasks can be formulated as transducing a source sequence into a target sequence of possibly different length.

However, conventional seq2seq models are dense: they compute both attention weights and output probabilities with the softmax function (Bridle, 1990), which always returns positive values. This results in dense attention alignments, in which each source position is attended to at each target position, and in dense output probabilities, in which each vocabulary type always has nonzero probability of being generated. This contrasts with traditional statistical machine translation systems, which are based on sparse, hard alignments, and decode by navigating through a sparse lattice of phrase hypotheses. Can we transfer such notions of sparsity to modern neural architectures? And if so, do they improve performance?

In this paper, we provide an affirmative answer to both questions by proposing neural sparse seq2seq models that replace the softmax transformations (both in the attention and output) by sparse transformations. Our innovations are rooted in the recently proposed sparsemax transformation (Martins and Astudillo, 2016) and Fenchel-Young losses (Blondel et al., 2019). Concretely, we consider a family of transformations (dubbed $\alpha$-entmax), parametrized by a scalar $\alpha$, based on the Tsallis entropies (Tsallis, 1988). This family includes softmax ($\alpha = 1$) and sparsemax ($\alpha = 2$) as particular cases. Crucially, entmax transforms are sparse for all $\alpha > 1$.













Figure 1: The full beam search of our best performing morphological inflection model when generating the past participle of the verb “draw”. The model gives nonzero probability to exactly three hypotheses, including the correct form (“drawn”) and the form that would be correct if “draw” were regular (“drawed”).
Figure 2: Forced decoding using sparsemax attention and 1.5-entmax output for the German source sentence, “Dies ist ein weiterer Blick auf den Baum des Lebens.” Predictions with nonzero probability are shown at each time step. All other target types have probability exactly zero. When consecutive predictions consist of a single word, we combine their borders to showcase auto-completion potential. The selected gold targets are in boldface.

Our models are able to produce both sparse attention, a form of inductive bias that increases focus on relevant source words and makes alignments more interpretable, and sparse output probabilities, which together with auto-regressive models can lead to probability distributions that are nonzero only for a finite subset of all possible strings. In certain cases, a short list of plausible outputs can be enumerated without ever exhausting the beam (Figure 1), rendering beam search exact. Sparse output seq2seq models can also be used for adaptive, sparse next word suggestion (Figure 2).

Overall, our contributions are as follows:

  • We propose an entmax sparse output layer, together with a natural loss function. In large-vocabulary settings, sparse outputs avoid wasting probability mass on unlikely outputs, substantially improving accuracy. For tasks with little output ambiguity, entmax losses, coupled with beam search, can often produce exact finite sets with only one or a few sequences. To our knowledge, this is the first study of sparse output probabilities in seq2seq problems.

  • We construct entmax sparse attention, improving interpretability at no cost in accuracy. We show that the entmax gradient has a simple form (Proposition 2), revealing an insightful missing link between softmax and sparsemax.

  • We derive a novel exact algorithm for the case of 1.5-entmax, achieving processing speed close to softmax on the GPU, even with large vocabulary sizes. For arbitrary $\alpha$, we investigate a GPU-friendly approximate algorithm. [1]

    [1] Our PyTorch code is available at

We experiment on two tasks: one character-level with little ambiguity (morphological inflection generation) and another word-level, with more ambiguity (neural machine translation). The results show clear benefits of our approach, both in terms of accuracy and interpretability.

2 Background

The underlying architecture we focus on is an RNN-based seq2seq with global attention and input-feeding (Luong et al., 2015). We provide a brief description of this architecture, with an emphasis on the attention mapping and the loss function.


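Scalars, vectors, and matrices are denoted respectively as $a$, $\mathbf{a}$, and $\mathbf{A}$. We denote the $d$-probability simplex (the set of vectors representing probability distributions over $d$ choices) by $\triangle^d := \{\mathbf{p} \in \mathbb{R}^d : \mathbf{p} \ge 0, \|\mathbf{p}\|_1 = 1\}$. We denote the positive part as $[a]_+ := \max\{a, 0\}$, and by $[\mathbf{a}]_+$ its elementwise application to vectors. We denote the indicator vector $\mathbf{e}_y := [0, \dots, 0, 1, 0, \dots, 0]$, with the single 1 in position $y$.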


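Given an input sequence of tokens $x := [x_1, \dots, x_J]$, the encoder applies an embedding lookup followed by layered bidirectional LSTMs (Hochreiter and Schmidhuber, 1997), resulting in encoder states $[\mathbf{h}_1, \dots, \mathbf{h}_J]$.


The decoder generates output tokens $y_1, \dots, y_T$, one at a time, terminated by a stop symbol. At each time step $t$, it computes a probability distribution for the next generated word $y_t$, as follows. Given the current state $\mathbf{s}_t$ of the decoder LSTM, an attention mechanism (Bahdanau et al., 2015) computes a focused, fixed-size summary of the encodings $[\mathbf{h}_1, \dots, \mathbf{h}_J]$, using $\mathbf{s}_t$ as a query vector. This is done by computing token-level scores $z_j := \mathbf{s}_t^\top \mathbf{W}^{(z)} \mathbf{h}_j$, then taking a weighted average

$$\mathbf{c}_t := \sum_{j=1}^{J} \pi_j \mathbf{h}_j, \quad \text{where } \boldsymbol{\pi} := \mathrm{softmax}(\mathbf{z}). \tag{1}$$

The contextual output is the non-linear combination $\mathbf{o}_t := \tanh(\mathbf{W}^{(o)}[\mathbf{s}_t; \mathbf{c}_t] + \mathbf{b}^{(o)})$, yielding the predictive distribution of the next word

$$p(y_t = \cdot \mid x, y_1, \dots, y_{t-1}) := \mathrm{softmax}(\mathbf{V}\mathbf{o}_t + \mathbf{b}). \tag{2}$$

The output $\mathbf{o}_t$, together with the embedding of the predicted $\hat{y}_t$, feed into the decoder LSTM for the next step, in an auto-regressive manner. The model is trained to maximize the likelihood of the correct target sentences, or equivalently, to minimize

$$\mathcal{L} = \sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \big(-\log \mathrm{softmax}(\mathbf{V}\mathbf{o}_t)\big)_{y_t}. \tag{3}$$

A central building block in the architecture is the transformation $\mathrm{softmax}\colon \mathbb{R}^d \to \triangle^d$,

$$\mathrm{softmax}(\mathbf{z})_j := \frac{\exp(z_j)}{\sum_i \exp(z_i)}, \tag{4}$$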


which maps a vector of scores $\mathbf{z}$ into a probability distribution (i.e., a vector in $\triangle^d$). As seen above, the softmax mapping plays two crucial roles in the decoder: first, in computing normalized attention weights (Eq. 1), second, in computing the predictive probability distribution (Eq. 2). Since $\exp(\cdot) > 0$, softmax never assigns a probability of zero to any word, so we may never fully rule out non-important input tokens from attention, nor unlikely words from the generation vocabulary. While this may be advantageous for dealing with uncertainty, it may be preferable to avoid dedicating model resources to irrelevant words. In the next section, we present a strategy for differentiable sparse probability mappings. We show that our approach can be used to learn powerful seq2seq models with sparse outputs and sparse attention mechanisms.
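The density of softmax is easy to observe numerically. The following pure-Python sketch implements Eq. 4 with the usual max-shift for numerical stability; the function name and the example scores are illustrative choices rather than part of the released code. It checks that even a very low-scoring entry keeps strictly positive probability.

```python
import math

def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores: the last entry is far below the best one.
scores = [3.0, 1.0, -2.0, -8.0]
probs = softmax(scores)

print(probs)
print(all(p > 0.0 for p in probs))  # softmax never returns an exact zero here
```

However small the last probability gets, it is never exactly zero, which is the property the sparse mappings of the next section remove.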

3 Sparse Attention and Outputs

3.1 The sparsemax mapping and loss

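To pave the way to a more general family of sparse attention and losses, we point out that softmax (Eq. 4) is only one of many possible mappings from $\mathbb{R}^d$ to $\triangle^d$. Martins and Astudillo (2016) introduce sparsemax: an alternative to softmax which tends to yield sparse probability distributions:

$$\mathrm{sparsemax}(\mathbf{z}) := \operatorname*{argmin}_{\mathbf{p} \in \triangle^d} \|\mathbf{p} - \mathbf{z}\|^2. \tag{5}$$

Since Eq. 5 is a projection onto $\triangle^d$, which tends to yield sparse solutions, the predictive distribution $\mathbf{p}^\star := \mathrm{sparsemax}(\mathbf{z})$ is likely to assign exactly zero probability to low-scoring choices. They also propose a corresponding loss function to replace the negative log likelihood loss (Eq. 3):

$$L_{\mathrm{sparsemax}}(y, \mathbf{z}) := \frac{1}{2}\left(\|\mathbf{e}_y - \mathbf{z}\|^2 - \|\mathbf{p}^\star - \mathbf{z}\|^2\right). \tag{6}$$

This loss is smooth and convex on $\mathbf{z}$ and has a margin: it is zero if and only if $z_y \ge z_{y'} + 1$ for any $y' \ne y$ (Martins and Astudillo, 2016, Proposition 3). Training models with the sparsemax loss requires its gradient (cf. Appendix A.2):

$$\nabla_{\mathbf{z}} L_{\mathrm{sparsemax}}(y, \mathbf{z}) = -\mathbf{e}_y + \mathbf{p}^\star.$$

For using the sparsemax mapping in an attention mechanism, Martins and Astudillo (2016) show that it is differentiable almost everywhere, with

$$\frac{\partial\, \mathrm{sparsemax}(\mathbf{z})}{\partial \mathbf{z}} = \mathrm{diag}(\mathbf{s}) - \frac{1}{\|\mathbf{s}\|_1} \mathbf{s}\mathbf{s}^\top,$$

where $s_j = 1$ if $p^\star_j > 0$, otherwise $s_j = 0$.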
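To make the projection in Eq. 5 concrete, here is a minimal pure-Python sketch of the classic sorting-based simplex projection. The function name and test vectors are our own illustrative choices, and the code is meant as a readable reference rather than an efficient or batched implementation.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # The k-th largest score stays in the support while
        # 1 + k * z_[k] exceeds the sum of the k largest scores.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    return [max(v - tau, 0.0) for v in z]

print(sparsemax([3.0, 1.0, -2.0, -8.0]))  # [1.0, 0.0, 0.0, 0.0]
print(sparsemax([1.0, 0.5, -1.0]))        # [0.75, 0.25, 0.0]
```

In the first call the best score leads by more than 1, so all mass goes to a single entry; in the second the two top scores are close and share the mass, while the third gets exactly zero.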

Entropy interpretation.

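At first glance, sparsemax appears very different from softmax, and a strategy for producing other sparse probability mappings is not obvious. However, the connection becomes clear when considering the variational form of softmax (Wainwright and Jordan, 2008):

$$\mathrm{softmax}(\mathbf{z}) = \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^{S}(\mathbf{p}), \tag{7}$$

where $H^{S}(\mathbf{p}) := -\sum_j p_j \log p_j$ is the well-known Gibbs-Boltzmann-Shannon entropy with base $e$.

Likewise, letting $H^{G}(\mathbf{p}) := \frac{1}{2}\sum_j p_j (1 - p_j)$ be the Gini entropy, we can rearrange Eq. 5 as

$$\mathrm{sparsemax}(\mathbf{z}) = \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^{G}(\mathbf{p}), \tag{8}$$

crystallizing the connection between softmax and sparsemax: they only differ in the choice of entropic regularizer.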

3.2 A new entmax mapping and loss family




Figure 3: Illustration of entmax in the two-dimensional case $\alpha\text{-entmax}([t, 0])_1$. All mappings except softmax saturate at $t = \pm 1/(\alpha - 1)$. While sparsemax is piecewise linear, mappings with $1 < \alpha < 2$ have smooth corners.

The parallel above raises a question: can we find interesting interpolations between softmax and sparsemax?

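We answer affirmatively, by considering a generalization of the Shannon and Gini entropies proposed by Tsallis (1988): a family of entropies parametrized by a scalar $\alpha > 1$ which we call Tsallis $\alpha$-entropies:

$$H^{T}_{\alpha}(\mathbf{p}) := \begin{cases} \dfrac{1}{\alpha(\alpha - 1)} \sum_j \left(p_j - p_j^{\alpha}\right), & \alpha \ne 1, \\ H^{S}(\mathbf{p}), & \alpha = 1. \end{cases} \tag{9}$$

This family is continuous, i.e., $\lim_{\alpha \to 1} H^{T}_{\alpha}(\mathbf{p}) = H^{S}(\mathbf{p})$ for any $\mathbf{p} \in \triangle^d$ (cf. Appendix A.1). Moreover, $H^{T}_{2} \equiv H^{G}$. Thus, Tsallis entropies interpolate between the Shannon and Gini entropies. Starting from the Tsallis entropies, we construct a probability mapping, which we dub entmax:

$$\alpha\text{-entmax}(\mathbf{z}) := \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^{T}_{\alpha}(\mathbf{p}), \tag{10}$$

and, denoting $\mathbf{p}^\star := \alpha\text{-entmax}(\mathbf{z})$, a loss function

$$L_{\alpha}(y, \mathbf{z}) := (\mathbf{p}^\star - \mathbf{e}_y)^\top \mathbf{z} + H^{T}_{\alpha}(\mathbf{p}^\star). \tag{11}$$

The motivation for this loss function resides in the fact that it is a Fenchel-Young loss (Blondel et al., 2019), as we briefly explain in Appendix A.2. Then, $1\text{-entmax} \equiv \mathrm{softmax}$ and $2\text{-entmax} \equiv \mathrm{sparsemax}$. Similarly, $L_1$ is the negative log likelihood, and $L_2$ is the sparsemax loss. For all $\alpha > 1$, entmax tends to produce sparse probability distributions, yielding a function family continuously interpolating between softmax and sparsemax, cf. Figure 3. The gradient of the entmax loss is

$$\nabla_{\mathbf{z}} L_{\alpha}(y, \mathbf{z}) = -\mathbf{e}_y + \mathbf{p}^\star. \tag{12}$$

Tsallis entmax losses have useful properties including convexity, differentiability, and a hinge-like separation margin property: the loss incurred becomes zero when the score of the correct class is separated from the rest by a margin of $1/(\alpha - 1)$. When separation is achieved, $\mathbf{p}^\star = \mathbf{e}_y$ (Blondel et al., 2019). This allows entmax seq2seq models to be adaptive to the degree of uncertainty present: decoders may make fully confident predictions at “easy” time steps, while preserving sparse uncertainty when a few choices are possible (as exemplified in Figure 2).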
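The margin property can be checked by hand in the sparsemax case ($\alpha = 2$, margin 1). The sketch below evaluates the entmax loss and its gradient for a distribution $\mathbf{p}^\star$ that is assumed to have been computed already by some entmax solver; here the two values of $\mathbf{p}^\star$ are simply the simplex projections of the example scores. All function names are hypothetical.

```python
def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy of a distribution p, for alpha > 1."""
    return sum(pj - pj ** alpha for pj in p) / (alpha * (alpha - 1.0))

def entmax_loss(p_star, y, z, alpha):
    """Entmax loss (p* - e_y)^T z + H_alpha(p*), given a precomputed p*."""
    inner = sum(pj * zj for pj, zj in zip(p_star, z)) - z[y]
    return inner + tsallis_entropy(p_star, alpha)

def entmax_loss_grad(p_star, y):
    """Gradient of the loss with respect to the scores z: -e_y + p*."""
    return [pj - (1.0 if j == y else 0.0) for j, pj in enumerate(p_star)]

# alpha = 2: for z = [1, 0.5, 0] the projection is p* = [0.75, 0.25, 0].
# The margin z_0 - z_1 = 0.5 is below 1/(alpha - 1) = 1, so the loss is positive.
loss_small_margin = entmax_loss([0.75, 0.25, 0.0], 0, [1.0, 0.5, 0.0], 2.0)
print(loss_small_margin)                          # 0.0625
print(entmax_loss_grad([0.75, 0.25, 0.0], 0))     # [-0.25, 0.25, 0.0]

# For z = [3, 0.5, 0] the margin is 2.5 >= 1, so p* = e_0 and the loss vanishes.
loss_large_margin = entmax_loss([1.0, 0.0, 0.0], 0, [3.0, 0.5, 0.0], 2.0)
print(loss_large_margin)                          # 0.0
```

The first value agrees with the sparsemax loss written as half the difference of squared distances, which gives the same 0.0625 for these inputs.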

Tsallis entmax probability mappings have not, to our knowledge, been used in attention mechanisms. They inherit the desirable sparsity of sparsemax, while exhibiting smoother, differentiable curvature, whereas sparsemax is piecewise linear.

3.3 Computing the entmax mapping

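Whether we want to use $\alpha$-entmax as an attention mapping, or $L_\alpha$ as a loss function, we must be able to efficiently compute $\mathbf{p}^\star = \alpha\text{-entmax}(\mathbf{z})$, i.e., to solve the maximization in Eq. 10. For $\alpha = 1$, the closed-form solution is given by Eq. 4. For $\alpha > 1$, given $\mathbf{z}$, we show that there is a unique threshold $\tau$ such that (Appendix C.1, Lemma 2):

$$\alpha\text{-entmax}(\mathbf{z}) = \left[(\alpha - 1)\mathbf{z} - \tau \mathbf{1}\right]_+^{1/(\alpha - 1)}, \tag{13}$$

i.e., entries with score $z_j \le \tau/(\alpha - 1)$ get zero probability. For sparsemax ($\alpha = 2$), the problem amounts to Euclidean projection onto $\triangle^d$, for which two types of algorithms are well studied:

  1. exact, based on sorting Held et al. (1974); Michelot (1986),

  2. iterative, bisection-based Liu and Ye (2009).

The bisection approach searches for the optimal threshold $\tau$ numerically. Blondel et al. (2019) generalize this approach in a way applicable to $\alpha$-entmax. The resulting algorithm is (cf. Appendix C.1 for details):

Algorithm 1 Compute $\alpha$-entmax by bisection.
1  Define $\mathbf{p}(\tau) := [\mathbf{z} - \tau]_+^{1/(\alpha - 1)}$, set $\mathbf{z} \leftarrow (\alpha - 1)\mathbf{z}$
2  $\tau_{\min} \leftarrow \max(\mathbf{z}) - 1$; $\tau_{\max} \leftarrow \max(\mathbf{z}) - d^{1 - \alpha}$
3  for $t \in 1, \dots, T$ do
4      $\tau \leftarrow (\tau_{\min} + \tau_{\max})/2$
5      $Z \leftarrow \sum_j p_j(\tau)$
6      if $Z < 1$ then $\tau_{\max} \leftarrow \tau$ else $\tau_{\min} \leftarrow \tau$
7  return $\frac{1}{Z}\mathbf{p}(\tau)$

Algorithm 1 works by iteratively narrowing the interval containing the exact solution by exactly half. Line 7 ensures that approximate solutions are valid probability distributions, i.e., that $\mathbf{p}^\star \in \triangle^d$.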
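The bisection procedure can be sketched in a few lines of pure Python. This is our own illustrative transcription under the assumption that the search interval is $[\max(\mathbf{z}) - 1,\ \max(\mathbf{z}) - d^{1-\alpha}]$ after rescaling the scores by $\alpha - 1$; the function name and the default number of iterations are hypothetical, and the code handles a single vector rather than a batch.

```python
def entmax_bisect(z, alpha, n_iter=50):
    """Approximate alpha-entmax (alpha > 1) by bisection on the threshold tau."""
    d = len(z)
    zs = [(alpha - 1.0) * v for v in z]          # rescaled scores

    def p_of(tau):
        # Unnormalized probabilities for a candidate threshold.
        return [max(v - tau, 0.0) ** (1.0 / (alpha - 1.0)) for v in zs]

    tau_min = max(zs) - 1.0                      # here the total mass is >= 1
    tau_max = max(zs) - d ** (1.0 - alpha)       # here the total mass is <= 1
    for _ in range(n_iter):
        tau = (tau_min + tau_max) / 2.0
        total = sum(p_of(tau))
        if total < 1.0:
            tau_max = tau                        # threshold too high
        else:
            tau_min = tau                        # threshold too low (or exact)
    p = p_of(tau)
    total = sum(p)
    return [pj / total for pj in p]              # final renormalization

print(entmax_bisect([1.0, 0.5, -1.0], 2.0))      # close to sparsemax: [0.75, 0.25, 0.0]
print(entmax_bisect([1.0, 0.5, 0.0], 1.5))
```

Each iteration halves the interval, so 50 iterations already pin the threshold down to roughly double-precision resolution for scores of this magnitude.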

Although bisection is simple and effective, an exact sorting-based algorithm, like for sparsemax, has the potential to be faster and more accurate. Moreover, as pointed out by Condat (2016), when exact solutions are required, it is possible to construct inputs for which bisection requires arbitrarily many iterations. To address these issues, we propose a novel, exact algorithm for 1.5-entmax, halfway between softmax and sparsemax.


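Algorithm 2 Compute 1.5-entmax$(\mathbf{z})$ exactly.
1  Sort $\mathbf{z}$, yielding $z_{[d]} \le \dots \le z_{[1]}$; set $\mathbf{z} \leftarrow \mathbf{z}/2$
2  for $\rho \in 1, \dots, d$ do
3      $M(\rho) \leftarrow \frac{1}{\rho}\sum_{j=1}^{\rho} z_{[j]}$
4      $S(\rho) \leftarrow \sum_{j=1}^{\rho} \left(z_{[j]} - M(\rho)\right)^2$
5      $\tau(\rho) \leftarrow M(\rho) - \sqrt{\left(1 - S(\rho)\right)/\rho}$
6      if $z_{[\rho+1]} \le \tau(\rho) \le z_{[\rho]}$ then
7          return $\mathbf{p}^\star = [\mathbf{z} - \tau\mathbf{1}]_+^2$

We give a full derivation in Appendix C.2. As written, Algorithm 2 is $O(d \log d)$ because of the sort; however, in practice, when the solution $\mathbf{p}^\star$ has no more than $k$ nonzeros, we do not need to fully sort $\mathbf{z}$, just to find the $k$ largest values. Our experiments in §4.2 reveal that a partial sorting approach can be very efficient and competitive with softmax on the GPU, even for large $d$. Further speed-ups might be available following the strategy of Condat (2016), but our simple incremental method is very easy to implement on the GPU using primitives available in popular libraries (Paszke et al., 2017).

Our algorithm resembles the aforementioned sorting-based algorithm for projecting onto the simplex (Michelot, 1986). Both algorithms rely on the optimality conditions implying an analytically-solvable equation in $\tau$: for sparsemax ($\alpha = 2$), this equation is linear, for $\alpha = 1.5$ it is quadratic (Eq. 36 in Appendix C.2). Thus, exact algorithms may not be available for general values of $\alpha$.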
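For illustration, the sorting-based procedure for $\alpha = 1.5$ can be written as the following pure-Python sketch. It uses a full sort rather than the partial sort discussed above, processes one vector at a time, and the function name is our own; it should be read as a reference transcription under those assumptions, not as the GPU implementation.

```python
import math

def entmax15(z):
    """Exact 1.5-entmax of a list of scores via sorting."""
    half = [v / 2.0 for v in z]                  # (alpha - 1) * z with alpha = 1.5
    srt = sorted(half, reverse=True)
    total = 0.0
    total_sq = 0.0
    for rho, z_rho in enumerate(srt, start=1):
        total += z_rho
        total_sq += z_rho * z_rho
        mean = total / rho
        ss = total_sq - rho * mean * mean        # sum of squared deviations
        # Smaller root of the quadratic sum_j (z_j - tau)^2 = 1 over the top rho.
        tau = mean - math.sqrt(max(1.0 - ss, 0.0) / rho)
        z_next = srt[rho] if rho < len(srt) else -math.inf
        if z_next <= tau <= z_rho:               # support has exactly rho entries
            return [max(v - tau, 0.0) ** 2 for v in half]
    raise RuntimeError("no valid threshold found")

print(entmax15([3.0, 0.5, 0.0]))   # [1.0, 0.0, 0.0]
print(entmax15([1.0, 0.5, 0.0]))
```

In the first call the best score leads by more than the margin of 2, so the output is one-hot; in the second all three entries stay in the support and the output sums to 1.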

3.4 Gradient of the entmax mapping

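The following result shows how to compute the backward pass through $\alpha$-entmax, a requirement when using $\alpha$-entmax as an attention mechanism.

Proposition 1.

Let $\alpha \ge 1$. Assume we have computed $\mathbf{p}^\star = \alpha\text{-entmax}(\mathbf{z})$, and define the vector

$$s_i = \begin{cases} (p^\star_i)^{2 - \alpha}, & p^\star_i > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad \text{Then,} \quad \frac{\partial\, \alpha\text{-entmax}(\mathbf{z})}{\partial \mathbf{z}} = \mathrm{diag}(\mathbf{s}) - \frac{1}{\|\mathbf{s}\|_1}\mathbf{s}\mathbf{s}^\top. \tag{14}$$

Proof: The result follows directly from the more general Proposition 2, which we state and prove in Appendix B, noting that $\frac{d^2}{dt^2}\left(\frac{t^{\alpha} - t}{\alpha(\alpha - 1)}\right) = t^{\alpha - 2}$.

The gradient expression recovers the softmax and sparsemax Jacobians with $\alpha = 1$ and $\alpha = 2$, respectively (Martins and Astudillo, 2016, Eqs. 8 and 12), thereby providing another relationship between the two mappings. Perhaps more interestingly, Proposition 1 shows why the sparsemax Jacobian depends only on the support and not on the actual values of $\mathbf{p}^\star$: the sparsemax Jacobian is equal for $\mathbf{p}^\star = [.99, .01, 0]$ and $\mathbf{p}^\star = [.5, .5, 0]$. This is not the case for $\alpha$-entmax with $\alpha \ne 2$, suggesting that the gradients obtained with other values of $\alpha$ may be more informative. Finally, we point out that the gradient of entmax losses involves the entmax mapping (Eq. 12), and therefore Proposition 1 also gives the Hessian of the entmax loss.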
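The statement about the support can be verified with a direct transcription of the Jacobian formula, $\mathrm{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top/\|\mathbf{s}\|_1$ with $s_i = (p^\star_i)^{2-\alpha}$ on the support. The sketch below builds the dense Jacobian as nested lists from an already-computed $\mathbf{p}^\star$; the function name and the two test distributions are illustrative assumptions, and a practical backward pass would use a Jacobian-vector product instead of materializing the matrix.

```python
def entmax_jacobian(p, alpha):
    """Dense Jacobian of alpha-entmax, given its output p (Proposition 1)."""
    s = [pj ** (2.0 - alpha) if pj > 0.0 else 0.0 for pj in p]
    norm = sum(s)                                   # ||s||_1, since s >= 0
    d = len(p)
    return [[(s[i] if i == j else 0.0) - s[i] * s[j] / norm for j in range(d)]
            for i in range(d)]

p_a = [0.99, 0.01, 0.0]
p_b = [0.5, 0.5, 0.0]

# Sparsemax (alpha = 2): the Jacobian only sees the support, so both agree.
print(entmax_jacobian(p_a, 2.0) == entmax_jacobian(p_b, 2.0))   # True

# alpha = 1.5: the Jacobian also depends on the values of p.
print(entmax_jacobian(p_a, 1.5) == entmax_jacobian(p_b, 1.5))   # False
```

With $\alpha = 1$ the same function returns $\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$, the familiar softmax Jacobian.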

4 Experiments

The previous section establishes the computational building blocks required to train models with entmax sparse attention and loss functions. We now put them to use for two important NLP tasks, morphological inflection and machine translation. These two tasks highlight the characteristics of our innovations in different ways. Morphological inflection is a character-level task with mostly monotonic alignments, but the evaluation demands exactness: the predicted sequence must match the gold standard. On the other hand, machine translation uses a word-level vocabulary orders of magnitude larger and forces a sparse output layer to confront more ambiguity: any sentence has several valid translations and it is not clear beforehand that entmax will manage this well.

Despite the differences between the tasks, we keep the architecture and training procedure as similar as possible. We use two layers for encoder and decoder LSTMs and apply dropout with probability 0.3. We train with Adam (Kingma and Ba, 2015), with a base learning rate of 0.001, halved whenever the loss increases on the validation set. At test time, we select the model with the best validation accuracy and decode with a beam size of 5. We implemented all models with OpenNMT-py (Klein et al., 2017).

In our primary experiments, we use three $\alpha$ values for the attention and loss functions: $\alpha = 1$ (softmax), $\alpha = 1.5$ (to which our novel Algorithm 2 applies), and $\alpha = 2$ (sparsemax). We also investigate the effect of tuning $\alpha$ with increased granularity.

4.1 Morphological Inflection

The goal of morphological inflection is to produce an inflected word form (such as “drawn”) given a lemma (“draw”) and a set of morphological tags ({verb, past, participle}). We use the data from task 1 of the CoNLL–SIGMORPHON 2018 shared task (Cotterell et al., 2018).


We train models under two data settings: high (approximately 10,000 samples per language in 86 languages) and medium (approximately 1,000 training samples per language in 102 languages). We depart from previous work by using multilingual training: each model is trained on the data from all languages in its data setting. This allows parameters to be shared between languages, eliminates the need to train language-specific models, and may provide benefits similar to other forms of data augmentation (Bergmanis et al., 2017). Each sample is presented as a pair: the source contains the lemma concatenated to the morphological tags and a special language identification token (Johnson et al., 2017; Peters et al., 2017), and the target contains the inflected form. As an example, the source sequence for Figure 1 is english ␣ verb ␣ participle ␣ past ␣ d ␣ r ␣ a ␣ w. Although the set of inflectional tags is not sequential, treating it as such is simple to implement and works well in practice (Kann and Schütze, 2016). All models use embedding and hidden state sizes of 300. We validate at the end of every epoch in the high setting and only once every ten epochs in medium because of its smaller size.

output α    attention α    high (avg.)    high (ens.)    medium (avg.)    medium (ens.)
1           1              93.15          94.20          82.55            85.68
1           1.5            92.32          93.50          83.20            85.63
1           2              90.98          92.60          83.13            85.65
1.5         1              94.36          94.96          84.88            86.38
1.5         1.5            94.44          95.00          84.93            86.55
1.5         2              94.05          94.74          84.93            86.59
2           1              94.59          95.10          84.95            86.41
2           1.5            94.47          95.01          85.03            86.61
2           2              94.32          94.89          84.96            86.47
UZH (2018)                                96.00                           86.64
Table 1: Average per-language accuracy on the test set (CoNLL–SIGMORPHON 2018 task 1) averaged or ensembled over three runs.


Results are shown in Table 1. We report the official metric of the shared task, word accuracy averaged across languages. In addition to the average results of three individual model runs, we use an ensemble of those models, where we decode by averaging the raw probabilities at each time step. Our best sparse loss models beat the softmax baseline by nearly a full percentage point with ensembling, and up to two and a half points in the medium setting without ensembling. The choice of attention has a smaller impact. In both data settings, our best model on the validation set outperforms all submissions from the 2018 shared task except for UZH (Makarov and Clematide, 2018), which uses a more involved imitation learning approach and larger ensembles. In contrast, our only departure from standard seq2seq training is the drop-in replacement of softmax by entmax.


Besides their accuracy, we observed that entmax models made very sparse predictions: the best configuration in Table 1 concentrates all probability mass into a single predicted sequence in 81% of validation samples in the high data setting, and 66% in the more difficult medium setting. When the model does give probability mass to more than one sequence, the predictions reflect reasonable ambiguity, as shown in Figure 1. Besides enhancing interpretability, sparsity in the output also has attractive properties for beam search decoding: when the beam covers all nonzero-probability hypotheses, we have a certificate of globally optimal decoding, rendering beam search exact. This is the case on 87% of validation set sequences in the high setting, and 79% in medium. To our knowledge, this is the first instance of a neural seq2seq model that can offer optimality guarantees.
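The exactness certificate can be illustrated on a toy example. In the sketch below, the next-token table `TOY_MODEL` is entirely invented for illustration (its tokens and probabilities are not taken from any trained model), and `enumerate_support` is a hypothetical helper that walks every nonzero-probability continuation. When the resulting finite set is no larger than the beam, a beam search of that size never has to prune a nonzero hypothesis, so its best output is the global optimum.

```python
def enumerate_support(next_probs, eos="</s>", max_len=20):
    """Enumerate every finished sequence that has nonzero probability.

    next_probs(prefix) must return a dict {token: prob} containing only the
    nonzero entries of the sparse next-token distribution for that prefix.
    """
    finished = {}
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        if prefix and prefix[-1] == eos:
            finished["".join(prefix[:-1])] = prob
            continue
        if len(prefix) >= max_len:
            raise ValueError("support not exhausted within max_len")
        for token, p in next_probs(prefix).items():
            if p > 0.0:
                stack.append((prefix + (token,), prob * p))
    return finished

# Invented sparse next-character distributions (illustration only).
TOY_MODEL = {
    (): {"d": 1.0},
    ("d",): {"r": 1.0},
    ("d", "r"): {"a": 0.9, "e": 0.1},
    ("d", "r", "a"): {"w": 1.0},
    ("d", "r", "a", "w"): {"n": 0.7, "e": 0.3},
    ("d", "r", "a", "w", "n"): {"</s>": 1.0},
    ("d", "r", "a", "w", "e"): {"d": 1.0},
    ("d", "r", "a", "w", "e", "d"): {"</s>": 1.0},
    ("d", "r", "e"): {"w": 1.0},
    ("d", "r", "e", "w"): {"</s>": 1.0},
}

support = enumerate_support(lambda prefix: TOY_MODEL[prefix])
beam_size = 5
is_exact = len(support) <= beam_size   # the beam can hold every nonzero hypothesis
best = max(support, key=support.get)

print(sorted(support))
print(is_exact, best)
```

With a dense softmax output layer the same enumeration would never terminate on its own, since every string would keep some probability mass.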

4.2 Machine Translation

method        de→en           en→de           ja→en           en→ja           ro→en           en→ro
softmax       25.70 ± 0.15    21.86 ± 0.09    20.22 ± 0.08    25.21 ± 0.29    29.12 ± 0.18    28.12 ± 0.18
1.5-entmax    26.17 ± 0.13    22.42 ± 0.08    20.55 ± 0.30    26.00 ± 0.31    30.15 ± 0.06    28.84 ± 0.10
sparsemax     24.69 ± 0.22    20.82 ± 0.19    18.54 ± 0.11    23.84 ± 0.37    29.20 ± 0.16    28.03 ± 0.16
Table 2: Machine translation comparison of softmax, sparsemax, and the proposed 1.5-entmax as both attention mapping and loss function. Reported is tokenized test BLEU averaged across three runs (higher is better).

We now turn to a highly different seq2seq regime in which the vocabulary size is much larger, there is a great deal of ambiguity, and sequences can generally be translated in several ways. We train models for three language pairs in both directions:

  • IWSLT 2017 German ↔ English (de↔en, Cettolo et al., 2017): training size 206,112.

  • KFTT Japanese ↔ English (ja↔en, Neubig, 2011): training size 329,882.

  • WMT 2016 Romanian ↔ English (ro↔en, Bojar et al., 2016): training size 612,422, diacritics removed (following Sennrich et al., 2016b).


We use byte pair encoding (BPE; Sennrich et al., 2016a) to ensure an open vocabulary. We use separate segmentations with 25k merge operations per language for ro↔en and a joint segmentation with 32k merges for the other language pairs. de↔en is validated once every 5k steps because of its smaller size, while the other sets are validated once every 10k steps. We set the maximum number of training steps at 120k for ro↔en and 100k for other language pairs. We use 500 dimensions for word vectors and hidden states.


Table 2 shows BLEU scores (Papineni et al., 2002) for the three models with $\alpha \in \{1, 1.5, 2\}$, using the same value of $\alpha$ for the attention mechanism and loss function. We observe that the 1.5-entmax configuration consistently performs best across all six choices of language pair and direction. These results support the notion that the optimal function is somewhere between softmax and sparsemax, which motivates a more fine-grained search for $\alpha$; we explore this next.

Fine-grained impact of $\alpha$.

Algorithm 1 allows us to further investigate the marginal effect of varying the attention $\alpha$ and the loss $\alpha$, while keeping the other fixed. We report de→en validation accuracy on a fine-grained $\alpha$ grid in Figure 4. On this dataset, moving from softmax toward sparser attention (left) has a very small positive effect on accuracy, suggesting that the benefit in interpretability does not hurt accuracy. The impact of the loss function (right) is much more visible: there is a distinct optimal value of $\alpha$, with performance decreasing for too large values. Interpolating between softmax and sparsemax thus inherits the benefits of both, and our novel Algorithm 2 for $\alpha = 1.5$ is confirmed to strike a good middle ground. This experiment also confirms that bisection is effective in practice, despite being inexact. Extrapolating beyond the sparsemax loss ($\alpha > 2$) does not seem to perform well.









Figure 4: Effect of tuning $\alpha$ on de→en, for attention (left) and for output (right), while keeping the other $\alpha = 1.5$.


In order to form a clearer idea of how sparse entmax becomes, we measure the average number of nonzero indices on the de→en validation set and show it in Table 3. As expected, 1.5-entmax is less sparse than sparsemax as both an attention mechanism and output layer. In the attention mechanism, 1.5-entmax’s increased support size does not come at the cost of much interpretability, as Figure 5 demonstrates. In the output layer, 1.5-entmax assigns positive probability to only 16.13 target types out of a vocabulary of 17,993, meaning that the supported set of words often has an intuitive interpretation. Figure 2 shows the sparsity of the 1.5-entmax output layer in practice: the support becomes completely concentrated when generating a phrase like “the tree of life”, but grows when presenting a list of synonyms (“view”, “look”, “glimpse”, and so on). This has potential practical applications as a predictive translation system (Green et al., 2014), where the model’s support set serves as a list of candidate auto-completions at each time step.

Figure 5: Attention weights produced by the de→en 1.5-entmax model. Nonzero weights are outlined.

method        # attended    # target words
softmax       24.25         17993
1.5-entmax    5.55          16.13
sparsemax     3.75          7.55
Table 3: Average number of nonzeros in the attention and output distributions for the de→en validation set.

Training time.

Importantly, the benefits of sparsity do not come at a high computational cost. Our proposed Algorithm 2 for 1.5-entmax runs on the GPU at near-softmax speeds (Figure 6). For other $\alpha$ values, bisection (Algorithm 1) is slightly more costly, but practical even for large vocabulary sizes. On de→en, bisection is capable of processing about 10,500 target words per second on a single Nvidia GeForce GTX 1080 GPU, compared to 13,000 words per second for 1.5-entmax with Algorithm 2 and 14,500 words per second with softmax. On the smaller-vocabulary morphology datasets, Algorithm 2 is nearly as fast as softmax.

5 Related Work

Sparse attention.

Sparsity in the attention and in the output have different, but related, motivations. Sparse attention can be justified as a form of inductive bias, since for tasks such as machine translation one expects only a few source words to be relevant for each translated word. Dense attention probabilities are particularly harmful for long sequences, as shown by Luong et al. (2015), who propose “local attention” to mitigate this problem. Combining sparse attention with fertility constraints has been recently proposed by Malaviya et al. (2018). Hard attention (Xu et al., 2015; Aharoni and Goldberg, 2017; Wu et al., 2018) selects exactly one source token. Its discrete, non-differentiable nature requires imitation learning or Monte Carlo policy gradient approximations, which drastically complicate training. In contrast, entmax is a differentiable, easy to use, drop-in softmax replacement. A recent study by Jain and Wallace (2019) tackles the limitations of attention probabilities to provide interpretability. They only study dense attention in classification tasks, where attention is less crucial for the final predictions. In their conclusions, the authors defer to future work exploring sparse attention mechanisms and seq2seq models. We believe our paper can foster interesting investigation in this area.









Figure 6: Training timing on three de→en runs. Markers show validation checkpoints for one of the runs.

Losses for seq2seq models.

Mostly motivated by the challenges of large vocabulary sizes in seq2seq, an important research direction tackles replacing the cross-entropy loss with other losses or approximations (Bengio and Senécal, 2008; Morin and Bengio, 2005; Kumar and Tsvetkov, 2019). While differently motivated, some of the above strategies (e.g., hierarchical prediction) could be combined with our proposed sparse losses. Niculae et al. (2018) use sparsity to predict interpretable sets of structures. Since auto-regressive seq2seq makes no factorization assumptions, their strategy cannot be applied without approximations, such as in Edunov et al. (2018).

6 Conclusion and Future Work

We proposed sparse sequence-to-sequence models and provided fast algorithms to compute their attention and output transformations. Our approach yielded consistent improvements over dense models on morphological inflection and machine translation, while inducing interpretability in both attention and output distributions. Sparse output layers also provide exactness when the number of possible hypotheses does not exhaust beam search.

Given the ubiquity of softmax in NLP, entmax has many potential applications. A natural next step is to apply entmax to self-attention (Vaswani et al., 2017). In a different vein, the strong morphological inflection results point to usefulness in other tasks where probability is concentrated in a small number of hypotheses, such as speech recognition.


This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2019. We thank Mathieu Blondel, Nikolay Bogoychev, Gonçalo Correia, Erick Fonseca, Pedro Martins, Tsvetomila Mihaylova, Miguel Rios, Marcos Treviso, and the anonymous reviewers, for helpful discussion and feedback.


Appendix A Background

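A.1 Tsallis entropies

Recall the definition of the Tsallis family of entropies in Eq. 9 for $\alpha \ge 1$,

$$H^{T}_{\alpha}(\mathbf{p}) := \begin{cases} \dfrac{1}{\alpha(\alpha - 1)} \sum_j \left(p_j - p_j^{\alpha}\right), & \alpha \ne 1, \\ H^{S}(\mathbf{p}), & \alpha = 1. \end{cases}$$

This family is continuous in $\alpha$, i.e., $\lim_{\alpha \to \alpha_0} H^{T}_{\alpha}(\mathbf{p}) = H^{T}_{\alpha_0}(\mathbf{p})$ for any $\alpha_0 \ge 1$ and $\mathbf{p} \in \triangle^d$. Proof: For simplicity, we rewrite $H^{T}_{\alpha}$ in separable form:

$$H^{T}_{\alpha}(\mathbf{p}) = \sum_j h_{\alpha}(p_j), \quad \text{with} \quad h_{\alpha}(t) := \begin{cases} \dfrac{t - t^{\alpha}}{\alpha(\alpha - 1)}, & \alpha > 1, \\ -t \log t, & \alpha = 1. \end{cases}$$

It suffices to show that $\lim_{\alpha \to 1} h_{\alpha}(t) = h_1(t)$ for $t \in [0, 1]$. Let $f(\alpha) := t - t^{\alpha}$, and $g(\alpha) := \alpha(\alpha - 1)$. Observe that $f(1)/g(1) = 0/0$, so we are in an indeterminate case. We take the derivatives of $f$ and $g$:

$$f'(\alpha) = -\left(\exp(\alpha \log t)\right)' = -t^{\alpha} \log t, \quad \text{and} \quad g'(\alpha) = 2\alpha - 1.$$

From l’Hôpital’s rule,

$$\lim_{\alpha \to 1} \frac{f(\alpha)}{g(\alpha)} = \lim_{\alpha \to 1} \frac{f'(\alpha)}{g'(\alpha)} = -t \log t = h_1(t).$$

Note also that, as $\alpha \to \infty$, the denominator grows unbounded, so $H^{T}_{\infty} \equiv 0$.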
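The two limiting cases can also be checked numerically. The short sketch below, with hypothetical function names and an arbitrary example distribution, evaluates the Tsallis entropy for values of $\alpha$ approaching 1 and compares it with the Shannon entropy, then checks that $\alpha = 2$ reproduces the Gini entropy $\frac{1}{2}\sum_j p_j(1 - p_j)$.

```python
import math

def shannon_entropy(p):
    """Shannon entropy with natural logarithm; zero entries contribute 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0)

def gini_entropy(p):
    """Gini entropy 1/2 * sum_j p_j (1 - p_j)."""
    return 0.5 * sum(pj * (1.0 - pj) for pj in p)

def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy, falling back to Shannon entropy at alpha = 1."""
    if alpha == 1.0:
        return shannon_entropy(p)
    return sum(pj - pj ** alpha for pj in p) / (alpha * (alpha - 1.0))

p = [0.5, 0.3, 0.2]   # arbitrary example distribution
for alpha in (1.1, 1.01, 1.001):
    # The gap to the Shannon entropy shrinks as alpha approaches 1.
    print(alpha, abs(tsallis_entropy(p, alpha) - shannon_entropy(p)))

print(abs(tsallis_entropy(p, 2.0) - gini_entropy(p)))   # zero up to rounding
```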

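A.2 Fenchel-Young losses

In this section, we recall the definitions and properties essential for our construction of $\alpha$-entmax. The concepts below were formalized by Blondel et al. (2019) in more generality; we present below a less general version, sufficient for our needs.

Definition 1 (Probabilistic prediction function regularized by $\Omega$).

Let $\Omega\colon \triangle^d \to \mathbb{R} \cup \{\infty\}$ be a strictly convex regularization function. We define the prediction function $\pi_\Omega$ as

$$\pi_\Omega(\mathbf{z}) := \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \left(\mathbf{p}^\top \mathbf{z} - \Omega(\mathbf{p})\right). \tag{16}$$

Definition 2 (Fenchel-Young loss generated by $\Omega$).

Let $\Omega\colon \triangle^d \to \mathbb{R} \cup \{\infty\}$ be a strictly convex regularization function. Let $\mathbf{y} \in \triangle^d$ denote a ground-truth label (for example, $\mathbf{y} = \mathbf{e}_y$ if there is a unique correct class $y$). Denote by $\mathbf{z} \in \mathbb{R}^d$ the prediction scores produced by some model, and by $\mathbf{p}^\star := \pi_\Omega(\mathbf{z})$ the probabilistic predictions. The Fenchel-Young loss $L_\Omega$ generated by $\Omega$ is

$$L_\Omega(\mathbf{z}; \mathbf{y}) := \Omega(\mathbf{y}) - \Omega(\mathbf{p}^\star) + \mathbf{z}^\top(\mathbf{p}^\star - \mathbf{y}).$$

This justifies our choice of entmax mapping and loss (Eqs. 10 and 11), as $\pi_{-H^{T}_{\alpha}} = \alpha\text{-entmax}$ and $L_{-H^{T}_{\alpha}}(\mathbf{z}; \mathbf{e}_y) = L_{\alpha}(y, \mathbf{z})$.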

Properties of Fenchel-Young losses.

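  1. Non-negativity. $L_\Omega(\mathbf{z}; \mathbf{y}) \ge 0$ for any $\mathbf{z} \in \mathbb{R}^d$ and $\mathbf{y} \in \triangle^d$.

  2. Zero loss. $L_\Omega(\mathbf{z}; \mathbf{y}) = 0$ if and only if $\mathbf{y} = \pi_\Omega(\mathbf{z})$, i.e., the prediction is exactly correct.

  3. Convexity. $L_\Omega$ is convex in $\mathbf{z}$.

  4. Differentiability. $L_\Omega$ is differentiable with $\nabla L_\Omega(\mathbf{z}; \mathbf{y}) = \mathbf{p}^\star - \mathbf{y}$.

  5. Smoothness. If $\Omega$ is strongly convex, then $L_\Omega$ is smooth.

  6. Temperature scaling. For any constant $t > 0$, $L_{t\Omega}(\mathbf{z}; \mathbf{y}) = t\, L_\Omega(\mathbf{z}/t; \mathbf{y})$.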

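Characterizing the solution of $\pi_\Omega(\mathbf{z})$.

To shed light on the generic probability mapping in Eq. 16, we derive below the optimality conditions characterizing its solution. The optimality conditions are essential not only for constructing algorithms for computing $\pi_\Omega$ (Appendix C), but also for deriving the Jacobian of the mapping (Appendix B). The Lagrangian of the maximization in Eq. 16 is

$$\mathcal{L}(\mathbf{p}, \boldsymbol{\nu}, \tau) = \Omega(\mathbf{p}) - (\mathbf{z} + \boldsymbol{\nu})^\top \mathbf{p} + \tau(\mathbf{1}^\top \mathbf{p} - 1),$$

with subgradient

$$\partial_{\mathbf{p}} \mathcal{L}(\mathbf{p}, \boldsymbol{\nu}, \tau) = \partial\Omega(\mathbf{p}) - \mathbf{z} - \boldsymbol{\nu} + \tau\mathbf{1}.$$

The subgradient KKT conditions are therefore:

$$\mathbf{z} + \boldsymbol{\nu} - \tau\mathbf{1} \in \partial\Omega(\mathbf{p}) \tag{20}$$
$$\langle \mathbf{p}, \boldsymbol{\nu} \rangle = 0 \tag{21}$$
$$\mathbf{p} \in \triangle^d \tag{22}$$
$$\boldsymbol{\nu} \ge 0. \tag{23}$$

Connection to softmax and sparsemax.

We may now directly see that, when $\Omega(\mathbf{p}) := \sum_j p_j \log p_j$, Eq. 20 becomes $\log p_j = z_j + \nu_j - (\tau + 1)$, which can only be satisfied if $p_j > 0$, thus $\boldsymbol{\nu} = \mathbf{0}$. Then, $p_j = \exp(z_j)/Z$, where $Z := \exp(\tau + 1)$. From Eq. 22, $Z$ must be such that $\mathbf{p}$ sums to 1, yielding the well-known softmax expression. In the case of sparsemax, note that for any $\mathbf{p} \in \triangle^d$, we have

$$\Omega(\mathbf{p}) = -H^{G}(\mathbf{p}) = \frac{1}{2}\sum_j p_j(p_j - 1) = \frac{1}{2}\|\mathbf{p}\|^2 - \frac{1}{2}, \quad \text{so that} \quad \operatorname*{argmax}_{\mathbf{p} \in \triangle^d} \mathbf{p}^\top \mathbf{z} + H^{G}(\mathbf{p}) = \operatorname*{argmin}_{\mathbf{p} \in \triangle^d} \|\mathbf{p} - \mathbf{z}\|^2.$$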

Appendix B Backward pass for generalized sparse attention mappings

When a mapping $\pi_\Omega$ is used inside the computation graph of a neural network, the Jacobian of the mapping has the important role of showing how to propagate error information, necessary when training with gradient methods. In this section, we derive a new, simple expression for the Jacobian of generalized sparse $\pi_\Omega$. We apply this result to obtain a simple form for the Jacobian of $\alpha$-entmax mappings.

The proof is in two steps. First, we prove a lemma that shows that Jacobians are zero outside of the support of the solution. Then, completing the result, we characterize the Jacobian at the nonzero indices.

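Lemma 1 (Sparse attention mechanisms have sparse Jacobians).

Let $\Omega$ be strongly convex. The attention mapping $\pi_\Omega$ is differentiable almost everywhere, with Jacobian $\frac{\partial \pi_\Omega}{\partial \mathbf{z}}$ symmetric and satisfying

$$\frac{\partial \pi_\Omega}{\partial z_j} = \mathbf{0} \quad \text{if } \pi_\Omega(\mathbf{z})_j = 0.$$

Proof: Since $\Omega$ is strictly convex, the $\operatorname{argmax}$ in Eq. 16 is unique. Using Danskin’s theorem (Danskin, 1966), we may write

$$\pi_\Omega(\mathbf{z}) = \nabla \max_{\mathbf{p} \in \triangle^d} \left(\mathbf{p}^\top \mathbf{z} - \Omega(\mathbf{p})\right) = \nabla \Omega^*(\mathbf{z}).$$

Since $\Omega$ is strongly convex, the gradient of its conjugate $\Omega^*$ is differentiable almost everywhere (Rockafellar, 1970). Moreover, $\frac{\partial \pi_\Omega}{\partial \mathbf{z}}$ is the Hessian of $\Omega^*$, therefore it is symmetric, proving the first two claims.
Recall the definition of a partial derivative,

$$\frac{\partial \pi_\Omega}{\partial z_j} = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon}\left(\pi_\Omega(\mathbf{z} + \varepsilon\mathbf{e}_j) - \pi_\Omega(\mathbf{z})\right).$$

Denote by $\mathbf{p}^\star := \pi_\Omega(\mathbf{z})$. We will show that for any $j$ such that $p^\star_j = 0$, and any $\varepsilon \ge 0$,

$$\pi_\Omega(\mathbf{z} - \varepsilon\mathbf{e}_j) = \pi_\Omega(\mathbf{z}) = \mathbf{p}^\star.$$

In other words, we consider only one side of the limit, namely subtracting a small non-negative $\varepsilon$. A vector $\mathbf{p}^\star$ solves the optimization problem in Eq. 16 if and only if there exists $\boldsymbol{\nu}^\star$ and $\tau^\star$ satisfying Eqs. 20-23. Let $\boldsymbol{\nu}_\varepsilon := \boldsymbol{\nu}^\star + \varepsilon\mathbf{e}_j$. We verify that $(\mathbf{p}^\star, \boldsymbol{\nu}_\varepsilon, \tau^\star)$ satisfies the optimality conditions for $\pi_\Omega(\mathbf{z} - \varepsilon\mathbf{e}_j)$, which implies that $\pi_\Omega(\mathbf{z} - \varepsilon\mathbf{e}_j) = \pi_\Omega(\mathbf{z})$. Since we add a non-negative quantity to $\boldsymbol{\nu}^\star$, which is non-negative to begin with, $\boldsymbol{\nu}_\varepsilon \ge 0$, and since $p^\star_j = 0$, we also satisfy $\langle \mathbf{p}^\star, \boldsymbol{\nu}_\varepsilon \rangle = 0$. Finally,

$$\mathbf{z} - \varepsilon\mathbf{e}_j + \boldsymbol{\nu}_\varepsilon - \tau^\star\mathbf{1} = \mathbf{z} + \boldsymbol{\nu}^\star - \tau^\star\mathbf{1} \in \partial\Omega(\mathbf{p}^\star).$$

It follows that $\lim_{\varepsilon \to 0^-} \frac{1}{\varepsilon}\left(\pi_\Omega(\mathbf{z} + \varepsilon\mathbf{e}_j) - \pi_\Omega(\mathbf{z})\right) = \mathbf{0}$. If $\pi_\Omega$ is differentiable at $\mathbf{z}$, this one-sided limit must agree with the derivative. Otherwise, the sparse one-sided limit is a generalized Jacobian.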


Proposition 2.

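Let $\mathbf{p}^\star := \pi_\Omega(\mathbf{z})$, with strongly convex and differentiable $\Omega$. Denote the support of $\mathbf{p}^\star$ by $\mathcal{S} = \{j \in \{1, \dots, d\} : p^\star_j > 0\}$. If the second derivative $h_{ij} = \frac{\partial^2 \Omega}{\partial p_i \partial p_j}(\mathbf{p}^\star)$ exists for any $i, j \in \mathcal{S}$, then, writing $\bar{\mathbf{H}} := [h_{ij}]_{i, j \in \mathcal{S}}$,

$$\frac{\partial \pi_\Omega}{\partial \mathbf{z}} = \mathbf{S} - \frac{1}{\|\mathbf{s}\|_1}\mathbf{s}\mathbf{s}^\top, \quad \text{where} \quad S_{ij} = \begin{cases} (\bar{\mathbf{H}}^{-1})_{ij}, & i, j \in \mathcal{S}, \\ 0, & \text{otherwise,} \end{cases} \quad \text{and} \quad \mathbf{s} = \mathbf{S}\mathbf{1}.$$

In particular, if $\Omega(\mathbf{p}) = \sum_j g(p_j)$ with $g$ twice differentiable on $(0, 1]$, we have

$$\frac{\partial \pi_\Omega}{\partial \mathbf{z}} = \mathrm{diag}(\mathbf{s}) - \frac{1}{\|\mathbf{s}\|_1}\mathbf{s}\mathbf{s}^\top, \quad \text{where} \quad s_i = \begin{cases} \left(g''(p^\star_i)\right)^{-1}, & i \in \mathcal{S}, \\ 0, & \text{otherwise.} \end{cases}$$

Proof: Lemma 1 verifies that $\frac{\partial \pi_\Omega}{\partial z_j} = \mathbf{0}$ for $j \notin \mathcal{S}$. It remains to find the derivatives w.r.t. $j \in \mathcal{S}$. Denote by $\bar{\mathbf{p}}^\star, \bar{\mathbf{z}}$ the restriction of the corresponding vectors to the indices in the support $\mathcal{S}$. The optimality conditions on the support are

$$\mathbf{g}(\bar{\mathbf{p}}) + \tau\mathbf{1} = \bar{\mathbf{z}}, \qquad \mathbf{1}^\top \bar{\mathbf{p}} = 1,$$

where $\mathbf{g}(\bar{\mathbf{p}}) := \left(\nabla\Omega(\mathbf{p})\right)\big|_{\mathcal{S}}$, so $\frac{\partial \mathbf{g}}{\partial \bar{\mathbf{p}}}(\bar{\mathbf{p}}^\star) = \bar{\mathbf{H}}$. Differentiating w.r.t. $\bar{\mathbf{z}}$ at $\bar{\mathbf{p}}^\star$ yields

$$\bar{\mathbf{H}}\frac{\partial \bar{\mathbf{p}}}{\partial \bar{\mathbf{z}}} + \mathbf{1}\frac{\partial \tau}{\partial \bar{\mathbf{z}}} = \mathbf{I}, \qquad \mathbf{1}^\top \frac{\partial \bar{\mathbf{p}}}{\partial \bar{\mathbf{z}}} = \mathbf{0}.$$

Since $\Omega$ is strictly convex, $\bar{\mathbf{H}}$ is invertible. From block Gaussian elimination (i.e., the Schur complement),

$$\left(\mathbf{1}^\top \bar{\mathbf{H}}^{-1}\mathbf{1}\right)\frac{\partial \tau}{\partial \bar{\mathbf{z}}} = \mathbf{1}^\top \bar{\mathbf{H}}^{-1},$$

which can then be used to solve for $\frac{\partial \bar{\mathbf{p}}}{\partial \bar{\mathbf{z}}}$ giving

$$\frac{\partial \bar{\mathbf{p}}}{\partial \bar{\mathbf{z}}} = \bar{\mathbf{H}}^{-1} - \frac{1}{\mathbf{1}^\top \bar{\mathbf{H}}^{-1}\mathbf{1}}\bar{\mathbf{H}}^{-1}\mathbf{1}\mathbf{1}^\top \bar{\mathbf{H}}^{-1},$$

yielding the desired result. When $\Omega$ is separable, $\bar{\mathbf{H}}$ is diagonal, with $h_{ii} = g''(p^\star_i)$, yielding the simplified expression which completes the proof.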

Connection to other differentiable attention results.

Our result is similar to, but simpler than, Niculae and Blondel (2017, Proposition 1), especially in the case of separable $\Omega$. Crucially, our result does not require that the second derivative exist outside of the support. As such, unlike the cited work, our result is applicable in the case of $\alpha$-entmax, where either $g''(t) = t^{\alpha - 2}$ or its reciprocal may not exist at $t = 0$.

Appendix C Algorithms for entmax

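C.1 General thresholded form for bisection algorithms.

The following lemma provides a simplified form for the solution of $\alpha$-entmax.

Lemma 2.

For any $\mathbf{z}$, there exists a unique $\tau^\star$ such that

$$\alpha\text{-entmax}(\mathbf{z}) = \left[(\alpha - 1)\mathbf{z} - \tau^\star\mathbf{1}\right]_+^{1/(\alpha - 1)}.$$

Proof: We use the regularized prediction functions defined in Appendix A.2. From both definitions,

$$\alpha\text{-entmax}(\mathbf{z}) \equiv \pi_{-H^{T}_{\alpha}}(\mathbf{z}).$$

We first note that for all $\mathbf{p} \in \triangle^d$,

$$-(\alpha - 1)H^{T}_{\alpha}(\mathbf{p}) = \frac{1}{\alpha}\sum_{i=1}^{d} p_i^{\alpha} + \text{const}.$$

From the constant invariance and scaling properties of $\pi_\Omega$ (Blondel et al., 2019, Proposition 1, items 4–5),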