1 Introduction
Modern neural networks are composed of multiple layers of nested functions. Although layers usually consist of elementary linear algebraic operations and simple nonlinearities, there is a growing need for layers that output the value or the solution
of an optimization problem. This can be used to design loss functions that capture relevant regularities in the input
(Lample et al., 2016; Cuturi & Blondel, 2017) or to create layers that impose prior structure on the output (Kim et al., 2017; Amos & Kolter, 2017; Niculae & Blondel, 2017; Djolonga & Krause, 2017). Among these works, several involve a convex optimization problem (Amos & Kolter, 2017; Niculae & Blondel, 2017; Djolonga & Krause, 2017)
; others solve certain combinatorial optimization problems by dynamic programming
(Lample et al., 2016; Kim et al., 2017; Cuturi & Blondel, 2017). However, because dynamic programs (Bellman, 1952) are usually non-differentiable, virtually all these works resort to the formalism of conditional random fields (CRFs) (Lafferty et al., 2001), which can be seen as changing the semiring used by the dynamic program — replacing all values by their exponentials and all (max, +) operations with (+, ×) operations (Verdu & Poor, 1987). While this modification smoothes the dynamic program, it loses the sparsity of solutions, since hard assignments become soft ones. Moreover, a general understanding of how to relax and differentiate dynamic programs is lacking. In this work, we propose to do so by leveraging smoothing (Moreau, 1965; Nesterov, 2005) and backpropagation (Linnainmaa, 1970). We make the following contributions.
1) We present a unified framework for turning a broad class of dynamic programs (DP) into differentiable operators. Unlike existing works, we propose to change the semiring to use (max_Ω, +) operations, where max_Ω is a max operator smoothed with a strongly convex regularizer Ω (§2).
2) We show that the resulting DP operators, that we call DP_Ω, are smoothed relaxations of the original DP algorithm and satisfy several key properties, chief among them convexity. In addition, we show that their gradient, ∇DP_Ω, is equal to the expected trajectory of a certain random walk and can be used as a sound relaxation to the original dynamic program’s solution. Using negative entropy for Ω recovers existing CRF-based works from a different perspective — we provide new arguments as to why this is a good choice. On the other hand, using the squared ℓ2 norm for Ω leads to new algorithms whose expected solution is sparse. We derive a clean and efficient method to backpropagate gradients, both through DP_Ω and ∇DP_Ω. This allows us to define differentiable DP layers that can be incorporated in neural networks trained end-to-end (§3).
3) We illustrate how to derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for supervised time-series alignment (§4). The latter is illustrated in Figure 1. Finally, we showcase these two instantiations on structured prediction tasks (§5) and on structured attention for neural machine translation (§6).
Notation.
We denote scalars, vectors and matrices using lowercase, bold lowercase and bold uppercase letters, e.g., a, a and A. We denote the elements of A by a_{i,j} and its rows by a_i. We denote the Frobenius inner product between A and B by ⟨A, B⟩. We denote the probability simplex by △^D := {λ ∈ ℝ_+^D : ‖λ‖_1 = 1}. We write conv(𝒴) the convex hull of 𝒴, [D] the set {1, …, D} and supp(x) the support of x. We denote the Shannon entropy by H(q) := −∑_i q_i log q_i.
We will release an optimized modular PyTorch implementation for reproduction and reuse.
2 Smoothed max operators
In this section, we introduce smoothed max operators (Nesterov, 2005; Beck & Teboulle, 2012; Niculae & Blondel, 2017), that will serve as a powerful and generic abstraction to define differentiable dynamic programs in §3. Formally, let Ω be a strongly convex regularizer on the simplex △^D and let x ∈ ℝ^D. We define the max operator smoothed by Ω as:
max_Ω(x) := max_{q ∈ △^D} ⟨q, x⟩ − Ω(q).   (1)
In other words, max_Ω is the convex conjugate of Ω, restricted to the simplex. From the duality between strong convexity and smoothness, max_Ω is smooth: differentiable everywhere and with Lipschitz continuous gradient. Since the argument that achieves the maximum in (1) is unique, from Danskin’s theorem (1966), it is equal to the gradient of max_Ω:
∇max_Ω(x) = argmax_{q ∈ △^D} ⟨q, x⟩ − Ω(q).   (2)
The gradient ∇max_Ω is differentiable almost everywhere for any strongly convex Ω (everywhere for the negentropy). Next, we state properties that will be useful throughout this paper.
Lemma 1.
Properties of max_Ω operators
Let x, y ∈ ℝ^D.

[topsep=0pt,itemsep=3pt,parsep=3pt]

Boundedness: if Ω is lower-bounded by L_{Ω,D} and upper-bounded by U_{Ω,D} on the simplex △^D, then max(x) − U_{Ω,D} ≤ max_Ω(x) ≤ max(x) − L_{Ω,D}.
Distributivity of + over max_Ω: max_Ω(x + c1) = max_Ω(x) + c, for any c ∈ ℝ.

Commutativity: if y = Px, where P is a permutation matrix, then max_Ω(y) = max_Ω(x).

Non-decreasingness in each coordinate: x ≤ y ⇒ max_Ω(x) ≤ max_Ω(y).

Insensitivity to −∞: x_j = −∞ ⇒ ∇max_Ω(x)_j = 0.
Proofs are given in §A.1. In particular, property 3 holds whenever Ω(q) = ∑_i ω(q_i), for some function ω. We focus in this paper on two specific regularizers Ω: the negentropy −H and the squared ℓ2 norm. For these choices, all properties above are satisfied and we can derive closed-form expressions for max_Ω, its gradient and its Hessian — see §B.1. When using the negentropy, max_Ω becomes the log-sum-exp and ∇max_Ω the softmax. The former satisfies associativity, which, as we shall see, makes it natural to use in dynamic programming. With the squared ℓ2 regularization, as observed by Martins & Astudillo (2016); Niculae & Blondel (2017), the gradient ∇max_Ω is sparse. This will prove useful to enforce sparsity in the models we study.
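To make these two regularizers concrete, here is a minimal NumPy sketch (our own illustration, with our own function names, not the paper's released implementation) of max_Ω and its gradient: the negentropy yields the log-sum-exp and the softmax, while the squared ℓ2 norm yields a gradient equal to the Euclidean projection onto the simplex (the sparsemax of Martins & Astudillo, 2016), which can be sparse:

```python
import numpy as np

def max_negentropy(x, gamma=1.0):
    """max_Omega with Omega = -gamma * H (negentropy): the smoothed max is
    gamma * logsumexp(x / gamma) and its gradient is the softmax."""
    z = x / gamma
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())
    return gamma * lse, np.exp(z - lse)   # (value, gradient on the simplex)

def max_sqnorm(x, gamma=1.0):
    """max_Omega with Omega = gamma/2 * ||q||^2: the gradient is the Euclidean
    projection of x / gamma onto the simplex (sparsemax), typically sparse."""
    z = np.sort(x / gamma)[::-1]
    cumsum = np.cumsum(z) - 1.0
    ind = np.arange(1, z.size + 1)
    k = ind[ind * z > cumsum].max()        # support size of the projection
    tau = cumsum[k - 1] / k
    q = np.maximum(x / gamma - tau, 0.0)   # sparse distribution on the simplex
    return q @ x - 0.5 * gamma * (q @ q), q   # <q, x> - Omega(q)
```

On x = (3, 1, 0), the softmax gradient is dense, while the ℓ2 gradient puts all its mass on the first coordinate, illustrating the sparsity discussed above.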
3 Differentiable DP layers
Dynamic programming (DP) is a generic way of solving combinatorial optimization problems by recursively solving problems on smaller sets. We first introduce this category of algorithms in a broad setting, then use smoothed max operators to define differentiable DP layers.
3.1 Dynamic programming on a DAG
Every problem solved by dynamic programming reduces to finding the highest-scoring path between a start node and an end node, on a weighted directed acyclic graph (DAG). We therefore introduce our formalism on this generic problem, and give concrete examples in §4.
Formally, let G = (𝒱, ℰ) be a DAG, with nodes 𝒱 and edges ℰ. We write N := |𝒱| the number of nodes. Without loss of generality, we number the nodes in topological order, from 1 (start) to N (end), and thus 𝒱 = [N]. Node 1 is the only node without parents, and node N the only node without children. Every directed edge (i, j) ∈ ℰ, from a parent node j to a child node i, has a weight θ_{i,j} ∈ ℝ. We gather the edge weights in a matrix θ ∈ ℝ^{N×N}, setting θ_{i,j} = −∞ if (i, j) ∉ ℰ. We consider the set 𝒴 of all paths in G from node 1 to node N. Any path Y ∈ 𝒴 can be represented as an N × N binary matrix, with Y_{i,j} = 1 if the path goes through the edge (i, j) and Y_{i,j} = 0 otherwise. In the sequel, paths will have a one-to-one correspondence with discrete structures such as sequences or alignments. Using this representation, ⟨Y, θ⟩ corresponds to the cumulated sum of edge weights along the path Y. The computation of the highest score among all paths amounts to solving the combinatorial problem
LP(θ) := max_{Y ∈ 𝒴} ⟨Y, θ⟩.   (3)
Although the size of 𝒴 is in general exponential in N, LP(θ) can be computed in one topologically-ordered pass over G using dynamic programming. We let 𝒫_i be the set of parent nodes of node i in the graph G and define recursively
v_1(θ) := 0,   (4)
v_i(θ) := max_{j ∈ 𝒫_i} θ_{i,j} + v_j(θ), for i = 2, …, N.   (5)
This algorithm outputs v_N(θ). We now show that this is precisely the highest score among all paths.
Proposition 1.
Optimality of dynamic programming
The optimality of the recursion (5) is a well-known result (Bellman, 1952). We prove it again with our formalism in §A.2, since it exhibits the two key properties that the max operator must satisfy to guarantee optimality: distributivity of + over max and associativity. The cost of computing LP(θ) = v_N(θ) is O(|ℰ|), which is exponentially better than enumerating the paths in 𝒴.
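As an illustration of recursion (4)–(5), the following sketch (ours; the diamond-shaped DAG below is a made-up toy example, with 0-based node indices) computes the highest score in one topological pass:

```python
def highest_score(theta, parents):
    """Bellman recursion on a DAG: v[0] = 0 and, in topological order,
    v[i] = max over parents j of theta[i][j] + v[j]; the result v[N-1]
    is the highest cumulated edge weight among all start-to-end paths.
    theta[i][j]: weight of edge j -> i; parents[i]: parent nodes of i."""
    N = len(parents)
    v = [0.0] * N
    for i in range(1, N):
        v[i] = max(theta[i][j] + v[j] for j in parents[i])
    return v[N - 1]

# Toy diamond DAG: node 0 (start), two middle nodes 1 and 2, node 3 (end).
theta = {1: {0: 1.0}, 2: {0: 2.0}, 3: {1: 5.0, 2: 1.0}}
parents = [[], [0], [0], [1, 2]]
# The best path is 0 -> 1 -> 3 with score 1 + 5 = 6.
```

Each node is visited once and each edge contributes one addition and one comparison, matching the O(|ℰ|) cost stated above.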
In many applications, we will often rather be interested in the argument that achieves the maximum, i.e., one of the highest-scoring paths
Y*(θ) ∈ argmax_{Y ∈ 𝒴} ⟨Y, θ⟩.   (6)
This argument can be computed by backtracking, which we now relate to computing subgradients of LP(θ).
Linear program, lack of differentiability.
Unfortunately, LP(θ) is not differentiable everywhere. To see why this is the case, notice that (3) can be rewritten as a linear program over the convex polytope conv(𝒴):
LP(θ) = max_{Y ∈ conv(𝒴)} ⟨Y, θ⟩.   (7)
From the generalized Danskin theorem (Bertsekas, 1971),
Y*(θ) ∈ ∂LP(θ),   (8)
where ∂LP(θ) denotes the subdifferential of LP(θ), i.e., the set of subgradients. When Y*(θ) is unique, ∂LP(θ) is a singleton and Y*(θ) is equal to the gradient of LP(θ), that we write ∇LP(θ). Unfortunately, Y*(θ) is not always unique, meaning that LP(θ) is not differentiable everywhere. This hinders optimization as we can only train models involving LP(θ) with subgradient methods. Worse, Y*(θ), a function from ℝ^{N×N} to 𝒴, is discontinuous and has null or undefined derivatives. It is thus impossible to use it in a model trained by gradient descent.
3.2 Smoothed max layers
To address the lack of differentiability of dynamic programming, we introduce the max_Ω operator, presented in §2, and consider two approaches.
Smoothing the linear program.
Let us define the smoothed maximum of a function f over a finite set 𝒮 using the following shorthand notation:
max_Ω_{i ∈ 𝒮} f(i) := max_Ω((f(i))_{i ∈ 𝒮}).   (9)
A natural way to circumvent the lack of differentiability of LP(θ) is then to replace the global max operator by max_Ω:
LP_Ω(θ) := max_Ω_{Y ∈ 𝒴} ⟨Y, θ⟩.   (10)
From §2, LP_Ω(θ) is convex and, as long as Ω is strongly convex, differentiable everywhere. In addition, ∇LP_Ω(θ) is Lipschitz continuous and thus differentiable almost everywhere. Unfortunately, solving (10) for a general strongly convex Ω is intractable when 𝒴 has an exponential size.
Smoothing the dynamic program.
As a tractable alternative, we propose an algorithmic smoothing. Namely, we replace max by max_Ω locally within the DP recursion. Omitting the dependence on θ, this defines a smoothed recursion over a new sequence (v_i)_{i ∈ [N]}:
v_1(θ) := 0,   (11)
v_i(θ) := max_Ω_{j ∈ 𝒫_i} θ_{i,j} + v_j(θ), for i = 2, …, N.   (12)
The new algorithm outputs DP_Ω(θ) := v_N(θ), the smoothed highest score. Smoothing the max operator locally brings the same benefit as before — DP_Ω(θ) is smooth and ∇DP_Ω(θ) is differentiable almost everywhere. However, computing DP_Ω(θ) is now always tractable, since it simply requires to evaluate (v_i(θ))_i in topological order, as in the original recursion (5). Although LP_Ω(θ) and DP_Ω(θ) are in general different, we now show that DP_Ω(θ) is a sensible approximation of LP(θ) in several respects.
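Concretely, the only change with respect to the hard recursion is the local max: with negentropic regularization, each max becomes γ·logsumexp(·/γ). A sketch of ours, on the same kind of toy diamond DAG as before (0-based indices):

```python
import numpy as np

def dp_omega(theta, parents, gamma=1.0):
    """Smoothed Bellman recursion (11)-(12) with Omega = -gamma * H: each
    hard max is replaced by the smoothed max gamma * logsumexp(. / gamma).
    theta[i][j]: weight of edge j -> i; parents[i]: parent nodes of i."""
    N = len(parents)
    v = [0.0] * N
    for i in range(1, N):
        z = np.array([theta[i][j] + v[j] for j in parents[i]]) / gamma
        m = z.max()
        v[i] = gamma * (m + np.log(np.exp(z - m).sum()))  # stable logsumexp
    return v[N - 1]

# Toy diamond DAG; the hard highest score is 1 + 5 = 6. Since logsumexp
# upper-bounds the max, DP_Omega upper-bounds it and converges to it as
# gamma -> 0.
theta = {1: {0: 1.0}, 2: {0: 2.0}, 3: {1: 5.0, 2: 1.0}}
parents = [[], [0], [0], [1, 2]]
```

The cost is still one pass over the edges; only the local operator changed.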
Proposition 2.
Properties of DP_Ω

[topsep=0pt,itemsep=3pt,parsep=3pt]

DP_Ω(θ) is convex.

The gap between DP_Ω(θ) and LP(θ) is bounded by constants that depend only on Ω and on the graph (cf. §A.3); in particular, DP_{γΩ}(θ) → LP(θ) as γ → 0.

When Ω is separable, DP_Ω(θ) = LP_Ω(θ) for all θ if and only if Ω = −γH, where H(q) := −∑_i q_i log q_i is the Shannon entropy.
Proofs are given in §A.3. The first claim can be surprising due to the recursive definition of DP_Ω(θ). The second claim implies that DP_{γΩ}(θ) converges to LP(θ) when the regularization vanishes; LP_{γΩ}(θ) also satisfies this property. The “if” direction of the third claim follows by showing that the entropy-smoothed max_Ω satisfies associativity. This recovers known results in the framework of message passing algorithms for probabilistic graphical models (e.g., Wainwright & Jordan, 2008, Section 4.1.3), with a more algebraic point of view. The key role that the distributive and associative properties play in breaking down large problems into smaller ones has long been noted (Verdu & Poor, 1987; Aji & McEliece, 2000). However, the “and only if” part of the claim is new to our knowledge. Its proof shows that the negentropy yields the only max_Ω satisfying associativity, exhibiting a functional equation from information theory (Horibe, 1988). While this provides an argument in favor of entropic regularization, ℓ2 regularization has different benefits in terms of sparsity of the solutions.
3.3 Relaxed argmax layers
It is easy to check that ∇LP_Ω(θ) belongs to conv(𝒴) and can be interpreted as an expected path under some distribution induced by Ω, over all possible Y ∈ 𝒴 — see §A.4 for details. This makes it interpretable as a continuous relaxation of the highest-scoring path Y*(θ) defined in (6). However, like LP_Ω(θ), computing ∇LP_Ω(θ) is intractable in the general case. Fortunately, we now show that ∇DP_Ω(θ) is always easily computable by backpropagation and enjoys similar properties.
Computing ∇DP_Ω(θ).
Computing ∇DP_Ω(θ) can be broken down into two steps. First, we compute DP_Ω(θ) and record the local gradients alongside the recursive step (12):
Q_{i,j} := ∇max_Ω((θ_{i,k} + v_k(θ))_{k ∈ 𝒫_i})_j, for i = 2, …, N and j ∈ 𝒫_i,   (14)
where the row Q_{i,·} is supported on 𝒫_i. Since we assume that θ_{i,j} = −∞ if (i, j) ∉ ℰ, we have Q_{i,j} = 0 for (i, j) ∉ ℰ. This ensures that, similarly to DP_Ω(θ), ∇DP_Ω(θ) exclusively depends on the weights of edges in ℰ. Let 𝒞_j be the children of node j. A straightforward application of backpropagation (cf. §A.5) yields a recursion run in reverse-topological order, starting from node N down to node 1:
ē_j := ∑_{i ∈ 𝒞_j} Q_{i,j} ē_i and E_{i,j} := Q_{i,j} ē_i,   (15)
where ē_N := 1 and E_{i,j} := 0 for (i, j) ∉ ℰ. The final output is ∇DP_Ω(θ) = E. Assuming ∇max_Ω can be computed in linear time, the total cost is O(|ℰ|), the same as DP_Ω(θ). Pseudocode is summarized in §A.5.
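The forward recording of local gradients (14) and the reverse accumulation (15) can be sketched as follows (our own illustration for the negentropy case, on a small DAG given as parent lists with 0-based indices):

```python
import numpy as np

def dp_omega_grad(theta, parents, gamma=1.0):
    """Returns the smoothed value and its gradient E. The forward pass
    stores the local softmax weights Q[i][j]; the reverse-topological pass
    accumulates ebar[i] (probability that the random walk visits node i)
    and E[i][j] = Q[i][j] * ebar[i], the edge marginals."""
    N = len(parents)
    v = np.zeros(N)
    Q = [dict() for _ in range(N)]
    for i in range(1, N):                        # forward, topological order
        ps = parents[i]
        z = np.array([theta[i][j] + v[j] for j in ps]) / gamma
        lse = z.max() + np.log(np.exp(z - z.max()).sum())
        v[i] = gamma * lse
        for j, q in zip(ps, np.exp(z - lse)):
            Q[i][j] = q                          # local gradient (eq. 14)
    ebar = np.zeros(N)
    ebar[N - 1] = 1.0
    E = [dict() for _ in range(N)]
    for i in range(N - 1, 0, -1):                # reverse topological order
        for j, q in Q[i].items():
            E[i][j] = q * ebar[i]                # eq. (15)
            ebar[j] += E[i][j]
    return v[N - 1], E, ebar
```

On the toy diamond DAG of the previous sketches, the edge marginals leaving the end node sum to one, the start node is visited with probability one, and the higher-scoring edge receives most of the mass.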
Associated path distribution.
The backpropagation we derived has a probabilistic interpretation. Indeed, Q can be interpreted as a transition matrix: it defines a random walk on the graph G, i.e., a finite Markov chain with states 𝒱 and transition probabilities supported by ℰ. The random walk starts from node N and, when at node i, hops to a parent node j with probability Q_{i,j}. It always ends at node 1, which is absorbing. The walk follows the path Y ∈ 𝒴 with a probability p_θ(Y), which is simply the product of the Q_{i,j} of visited edges. Thus, Q defines a path distribution p_θ. Our next proposition shows that ∇DP_Ω(θ) = E and is equal to the expected path under that distribution.
Proposition 3.
∇DP_Ω(θ) as an expected path
∇DP_Ω(θ) = E = 𝔼_{Y ∼ p_θ}[Y].   (16)
Proof is provided in §A.5. Moreover, ∇DP_Ω(θ) is a principled relaxation of the highest-scoring path Y*(θ), in the sense that it converges to a subgradient of LP(θ) as the regularization vanishes:
lim_{γ → 0} ∇DP_{γΩ}(θ) ∈ ∂LP(θ).   (17)
When Ω = −γH, the distributions underpinning LP_Ω(θ) and DP_Ω(θ) coincide and reduce to the Gibbs distribution p_θ(Y) ∝ exp(⟨Y, θ⟩/γ). The value DP_Ω(θ) is then equal to the log partition. When Ω = γ‖·‖², some transitions between nodes have zero probability and hence some paths have zero probability under the distribution p_θ. Thus, ∇DP_Ω(θ) is typically sparse — this will prove interesting to introspect the various models we consider (typically, the smaller γ, the sparser ∇DP_Ω(θ)).
3.4 Multiplication with the Hessian
Using ∇DP_Ω(θ) as a layer involves backpropagating through ∇DP_Ω(θ). This requires computing the Jacobian of ∇DP_Ω(θ), or in other words the Hessian ∇²DP_Ω(θ), a linear map from ℝ^{N×N} to ℝ^{N×N}. Fortunately, a practical implementation of backpropagation only requires to apply that map to a provided matrix Z, i.e., to compute ∇²DP_Ω(θ) Z. We therefore focus on that term. Recall that the directional derivative of DP_Ω at θ along Z can be computed by ⟨∇DP_Ω(θ), Z⟩. Our key technique, which is also at the heart of Pearlmutter’s method (1994), is to observe that ∇²DP_Ω(θ) Z is the gradient of the directional derivative at θ along Z. Namely,
∇²DP_Ω(θ) Z = ∇⟨∇DP_Ω(θ), Z⟩.   (18)
We therefore break down the computation of ∇²DP_Ω(θ) Z into two steps. First, we compute the directional derivative ⟨∇DP_Ω(θ), Z⟩ using the chain rule. It can be computed in one topologically-ordered pass over G. Similarly to the gradient computation, we record multiplications with the (generalized) local Hessian along the way. Second, we compute the gradient of the directional derivative using backpropagation. It yields a recursion for computing ∇²DP_Ω(θ) Z in reverse topological order over G. The complete derivation and the pseudocode are given in §A.7. The total computational cost is O(|ℰ|), as for the gradient computation.
Performance.
Using autodiff frameworks such as PyTorch (Paszke et al., 2017), it is possible to only implement DP_Ω(θ) and rely on tape-based gradient computation to obtain ∇DP_Ω(θ). Provided that we tape the backward pass as well, we can then backpropagate again through ∇DP_Ω(θ) to obtain ∇²DP_Ω(θ) Z. In practice, however, implementing backpropagation without resorting to autodiff software is crucial, since the DAG structure can be directly hardcoded in concrete cases — see §4. Moreover, our “reverse-over-forward” approach to compute the Hessian product (backpropagating over the directional derivative computation) yields a simpler computation graph than the “reverse-over-reverse” approach (backpropagation over taped backpropagation). In experiments, our approach is substantially faster than vanilla PyTorch on the Viterbi DAG. Note that our algorithms are readily vectorizable and can efficiently handle minibatches with varying input lengths.
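The reverse-over-forward idea in identity (18) is easiest to see on a single max_Ω node. For the log-sum-exp, the gradient is the softmax p and the Hessian has the closed form diag(p) − p pᵀ, so the Hessian-vector product is directly the gradient of the directional derivative ⟨p, z⟩. A sketch of ours, checked against a finite difference of the gradient:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hessian_vector_product(x, z):
    """Product of the Hessian of logsumexp at x with a direction z.
    By identity (18), this is the gradient of the directional derivative
    <softmax(x), z>; in closed form, (diag(p) - p p^T) z with p = softmax(x)."""
    p = softmax(x)
    return p * z - p * (p @ z)
```

In the full algorithm the same identity is applied node by node along the DAG, recording these local Hessian multiplications during the forward pass.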
Summary.
We have proposed DP_Ω(θ), a smooth, convex and tractable relaxation to the value of LP(θ). We have also shown that ∇DP_Ω(θ) belongs to conv(𝒴) and is therefore a sound relaxation to the solutions of LP(θ). To conclude this section, we formally define our proposed two layers.
Definition 1.
Differentiable dynamic programming layers
Value layer: θ ↦ DP_Ω(θ) = v_N(θ).   (19)
Gradient layer: θ ↦ ∇DP_Ω(θ) = E(θ).   (20)
4 Examples of computational graphs
We now illustrate two instantiations of our framework for specific computational graphs.
4.1 Sequence prediction
We demonstrate in this section how to instantiate DP_Ω to the computational graph of the Viterbi algorithm (Viterbi, 1967; Rabiner, 1990), one of the most famous instances of DP algorithm. We call the resulting operator Vit_Ω. We wish to tag a sequence of vectors X = (x_1, …, x_T) in ℝ^D (e.g., word representations) with the most probable output sequence (e.g., entity tags) y = (y_1, …, y_T). This problem can be cast as finding the highest-scoring path on a trellis G. While Y can always be represented as a sparse N × N binary matrix, it is convenient to represent it instead as a T × S × S binary tensor, such that Y_{t,i,j} = 1 if y transitions from state j to state i on time t, and Y_{t,i,j} = 0 otherwise (the initial state y_0 is fixed to a constant). The potentials can similarly be organized as a real T × S × S tensor, such that θ_{t,i,j} is the score of transitioning from state j at time t − 1 to state i at time t. Traditionally, the potential functions were human-engineered (Sutton et al., 2012, §2.5). In recent works and in this paper, they are learned end-to-end (Lample et al., 2016). Using the above binary tensor representation, the inner product ⟨Y, θ⟩ is equal to ∑_{t=1}^T θ_{t, y_t, y_{t−1}}, y’s cumulated score. This is illustrated in Figure 2 on the task of part-of-speech tagging. The bold arrows indicate one possible output sequence y, i.e., one possible path in G.
When Ω = −H, we recover linear-chain conditional random fields (CRFs) (Lafferty et al., 2001) and the probability of y (Y in tensor representation) given X is
p(Y | X) = exp(⟨Y, θ⟩ − Vit_{−H}(θ)).   (21)
From Prop. 3, the gradient is such that ∇Vit_Ω(θ) = 𝔼_{p_θ}[Y]. The marginal probability of state i at time t is simply ∑_j (∇Vit_Ω(θ))_{t,i,j}. Using a different Ω simply changes the distribution over state transitions. When Ω = ‖·‖², the marginal probabilities are typically sparse. Pseudocode for Vit_Ω, as well as gradient and Hessian-product computations, is provided in §B.2. The ℓ2 case is new to our knowledge.
When Ω = −H, the marginal probabilities are traditionally computed using the forward-backward algorithm (Baum & Petrie, 1966). In contrast, we compute ∇Vit_Ω(θ) directly using backpropagation. An advantage of our approach is that all operations are numerically stable. The relation between forward-backward and backpropagation has been noted before (e.g., Eisner (2016)). However, the analysis is led using (+, ×) operations, instead of (max_Ω, +) as we do. This Viterbi instantiation can immediately be generalized to graphical models with a tree structure, and to approximate inference in general graphical models, since unrolled loopy belief propagation (Pearl, 1988) yields a dynamic program.
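To make the Vit_Ω computation concrete, here is a NumPy sketch of ours for the negentropy case: the smoothed recursion runs forward on a small chain, and the state marginals are recovered by the backward accumulation of §3.3. On a chain small enough, the marginals can be checked against brute-force enumeration of all state sequences:

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def viterbi_marginals(theta, gamma=1.0):
    """Smoothed Viterbi on a chain with S states and T steps.
    theta[t, i, j]: potential of state j at time t-1 followed by state i at
    time t (theta[0, i, :] must not depend on j: initial potentials).
    Returns the smoothed value and marg[t, i] = p(y_t = i) under the
    induced Gibbs distribution."""
    T, S, _ = theta.shape
    v = np.zeros((T, S))
    Q = np.zeros((T, S, S))               # local transition gradients (eq. 14)
    v[0] = theta[0, :, 0]
    Q[0, :, 0] = 1.0
    for t in range(1, T):                 # forward pass
        for i in range(S):
            z = (theta[t, i] + v[t - 1]) / gamma
            l = logsumexp(z)
            v[t, i] = gamma * l
            Q[t, i] = np.exp(z - l)
    zT = v[T - 1] / gamma
    value = gamma * logsumexp(zT)
    marg = np.zeros((T, S))               # backward accumulation (eq. 15)
    marg[T - 1] = np.exp(zT - logsumexp(zT))
    for t in range(T - 1, 0, -1):
        marg[t - 1] = (Q[t] * marg[t][:, None]).sum(axis=0)
    return value, marg
```

All operations stay in the log domain, which illustrates the numerical stability argument above.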
4.2 Timeseries alignment
We now demonstrate how to instantiate DP_Ω to the computational graph of dynamic time warping (DTW) (Sakoe & Chiba, 1978), whose goal is to seek the minimal cost alignment between two time-series. We call the resulting operator DTW_Ω. Formally, let N_A and N_B be the lengths of two time-series, A and B. Let a_i and b_j be the i-th and j-th observations of A and B, respectively. Since edge weights only depend on child nodes, it is convenient to rearrange Y and θ as N_A × N_B matrices. Namely, we represent an alignment Y as a binary matrix, such that Y_{i,j} = 1 if a_i is aligned with b_j, and Y_{i,j} = 0 otherwise. Likewise, we represent θ as a matrix of pairwise scores. A classical example is θ_{i,j} = −δ(a_i, b_j), for some differentiable discrepancy measure δ. We write 𝒴 the set of all monotonic alignment matrices, such that the path that connects the upper-left matrix entry to the lower-right one uses only down, right and diagonal moves. The DAG associated with 𝒴 is illustrated in Figure 3 below.
Again, the bold arrows indicate one possible path from start to end in the DAG, and correspond to one possible alignment. Using this representation, the cost of an alignment (the cumulated cost along the path) is conveniently computed by ⟨Y, θ⟩. The value DTW_Ω(θ) can be used to define a loss between alignments or between time-series. Following Proposition 3, ∇DTW_Ω(θ) can be understood as a soft alignment matrix. This matrix is sparse when Ω = ‖·‖², as illustrated in Figure 1 (right).
Pseudocode to compute DTW_Ω(θ), as well as its gradient and its Hessian products, is provided in §B.3. When Ω = −γH, DTW_Ω defines a conditional random field known as soft-DTW, and the probability p_θ(Y) is a Gibbs distribution similar to §4.1 (Cuturi & Blondel, 2017). However, the case Ω = γ‖·‖² and the computation of ∇²DTW_Ω(θ) Z are new and allow new applications.
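A minimal NumPy sketch (ours) of the negentropy-smoothed DTW value: the recursion is the classical DTW one on the discrepancy matrix, with the hard min over the three allowed moves replaced by a soft-min, i.e. −γ·logsumexp(−·/γ):

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Negentropy-smoothed DTW value. D is the pairwise discrepancy matrix
    between the two time-series; the hard min over the three allowed moves
    (down, right, diagonal) is replaced by a soft-min."""
    A, B = D.shape
    R = np.full((A + 1, B + 1), np.inf)   # padded accumulated-cost table
    R[0, 0] = 0.0
    for i in range(1, A + 1):
        for j in range(1, B + 1):
            z = -np.array([R[i-1, j-1], R[i-1, j], R[i, j-1]]) / gamma
            m = z.max()
            softmin = -gamma * (m + np.log(np.exp(z - m).sum()))
            R[i, j] = D[i-1, j-1] + softmin
    return R[A, B]
```

Since the soft-min lower-bounds the min, the smoothed value lower-bounds the hard DTW cost and converges to it as γ → 0; the gradient of this value with respect to D is the soft alignment matrix discussed above.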
5 Differentiable structured prediction
We now apply the proposed layers, DP_Ω and ∇DP_Ω, to structured prediction (Bakır et al., 2007), whose goal is to predict a structured output Y associated with a structured input X. We define old and new structured losses, and demonstrate them on two structured prediction tasks: named entity recognition and time-series alignment.
5.1 Structured loss functions
Throughout this section, we assume that the potentials θ = θ(X) have already been computed using a function from the input space to ℝ^{N×N}, and let Δ(Y_true, Y) be a cost function between the ground-truth output Y_true and the predicted output Y.
Convex losses.
Because Δ(Y_true, Y*(θ)) is typically non-convex in θ, the cost-augmented structured hinge loss (Tsochantaridis et al., 2005) is often used instead for linear models:
ℓ(Y_true; θ) := max_{Y ∈ 𝒴} ⟨Y, θ⟩ + Δ(Y_true, Y) − ⟨Y_true, θ⟩.   (22)
This is a convex upper-bound on Δ(Y_true, Y*(θ)), where Y*(θ) is defined in (6). To make the cost-augmented decoding tractable, it is usually assumed that Δ is linear in Y, i.e., that it can be written as Δ(Y_true, Y) = ⟨C, Y⟩ for some matrix C depending on Y_true. We can then rewrite (22) using our notation as
ℓ(Y_true; θ) = LP(θ + C) − ⟨Y_true, θ⟩.   (23)
However, this loss function is non-differentiable. We therefore propose to relax LP by substituting it with DP_Ω:
ℓ_Ω(Y_true; θ) := DP_Ω(θ + C) − ⟨Y_true, θ⟩.   (24)
Losses in this class are convex, smooth, tractable for any Ω, and, by Proposition 2 property 2, a sensible approximation of (22). In addition, they only require to backpropagate through DP_Ω at training time. It is easy to check that we recover the structured hinge loss with Ω = 0 (Tsochantaridis et al., 2005) and the CRF loss with Ω = −H (Lafferty et al., 2001). The latter has been used on top of LSTMs in several recent works (Lample et al., 2016; Ma & Hovy, 2016). Minimizing the CRF loss is equivalent to maximizing the likelihood p(Y_true | X). However, minimizing the ℓ2-regularized loss is not equivalent to maximizing the corresponding likelihood. In fact, the former is convex while the latter is not.
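For intuition, when the path set is small enough to enumerate, the Ω = −H instance of (24) with zero cost is just a log-sum-exp over path scores minus the ground-truth path score, i.e. the CRF negative log-likelihood. A toy sketch of ours (real uses rely on the tractable DP recursion rather than enumeration, which is exponential in general):

```python
import numpy as np

def crf_loss(path_scores, true_idx):
    """CRF negative log-likelihood by explicit path enumeration:
    logsumexp over all path scores <Y, theta> minus the score of the
    ground-truth path. Convex in the scores and always nonnegative.
    (In practice, DP_Omega computes the logsumexp without enumeration.)"""
    z = np.asarray(path_scores, dtype=float)
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())
    return lse - z[true_idx]
```

The loss is small when the ground-truth path dominates all others, and grows by exactly the score margin when a different path is designated as ground truth.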
Nonconvex losses.
A direct approach that uses the output distribution p_θ consists in minimizing the risk 𝔼_{p_θ}[Δ(Y_true, Y)]. As shown by Stoyanov & Eisner (2012), this can be achieved by backpropagating through the minimum risk decoder. However, the risk is usually non-differentiable, piecewise constant (Smith & Eisner, 2006), and several smoothing heuristics are necessary to make the method work (Stoyanov & Eisner, 2012).
Another principled approach is to consider a differentiable approximation of the cost Δ. We can then relax Δ(Y_true, Y*(θ)) by Δ(Y_true, ∇DP_Ω(θ)). Unlike minimum risk training, this approach is differentiable everywhere when Ω is the negentropy. Both approaches require to backpropagate through ∇DP_Ω(θ), which is only roughly twice as costly as backpropagating through DP_Ω(θ) using the approach outlined in §3.4.
5.2 Named entity recognition
Let X = (x_1, …, x_T) be an input sentence, where each word is represented by a vector in ℝ^D, computed using a neural recurrent architecture trained end-to-end. We wish to tag each word with named entities, i.e., identify blocks of words that correspond to names, locations, dates, etc. We use the specialized operator Vit_Ω described in §4.1. In our experiments, we define the elements of the potential tensor θ(X) when t > 1 by
θ_{t,i,j}(X) := w_i^⊤ x_t + b_i + T_{i,j},   (25)
and θ_{1,i,j}(X) := w_i^⊤ x_1 + b_i, where
(w_i, b_i) is the linear classifier associated with tag i
and T is a transition matrix. We learn (w_i, b_i)_i and T along with the network producing X, and compare two losses:
Surrogate convex loss: Vit_Ω(θ(X)) − ⟨Y_true, θ(X)⟩.   (26)
Relaxed loss: Δ(Y_true, ∇Vit_Ω(θ(X))),
where Δ is the squared ℓ2 distance when Ω = ‖·‖²
and the Kullback–Leibler divergence when Ω = −H,
applied row-wise to the marginalizations of Y_true and ∇Vit_Ω(θ(X)).
Experiments.
We measure the performance of the different losses and regularizations on the four languages of the CoNLL 2003 dataset (Tjong Kim Sang & De Meulder, 2003). Following Lample et al. (2016), who use the Ω = −H loss, we use a character LSTM and pretrained embeddings computed using FastText (Joulin et al., 2016) on Wikipedia. Those are fed to a word bidirectional LSTM to obtain X. Architecture details are provided in §C.1. Results are reported in Table 1, along with the results of Lample et al. (2016), obtained with different pretrained embeddings. With proper parameter selection, all losses perform comparably in F1 score, although entropy-regularized losses perform slightly better on most languages. However, the ℓ2-regularized losses yield sparse predictions, whereas entropy regularization always yields dense probability vectors. Qualitatively, this makes it possible to identify ambiguous predictions more easily — this is illustrated in §C.1 with additional figures.
Loss  English  Spanish  German  Dutch  

Negentropy  Surrogate  90.80  86.68  77.35  87.56 
Relaxed  90.47  86.20  77.56  87.37  
ℓ2  Surrogate  90.86  85.51  76.01  86.58  
Relaxed  89.49  84.07  76.91  85.90  
(Lample et al., 2016)  90.96  85.75  78.76  81.74 
5.3 Supervised audiotoscore transcription
We use our framework to perform supervised audio-to-score alignment on the Bach 10 dataset (Duan & Pardo, 2011). The dataset consists of 10 music pieces with audio tracks, MIDI transcriptions, and annotated alignments between them. We transform the audio tracks into a sequence of audio frames using a feature extractor (see §C.2) to obtain a sequence A = (a_1, …, a_{N_A}), while the associated score sequence B = (b_1, …, b_{N_B}) is represented by one-hot vectors, each corresponding to one key. Each pair (A, B) is associated to an alignment Y_true. As described in §4.2, we need to define a discrepancy matrix between the elements of the two sequences. We set the cost between an audio frame and a key to be the log-likelihood of this key given a multinomial linear classifier. For all (i, j), we define
θ_{i,j}(W) := log p(b_j | a_i; W),   (27)
p(k | a; W) ∝ exp(w_k^⊤ a),   (28)
where W gathers the learned classifier parameters. We predict a soft alignment by Y_Ω := ∇DTW_Ω(θ(W)). Following (Garreau et al., 2014), we define the relaxed loss
Δ(Y_true, Y_Ω) := ‖(Y_true − Y_Ω) L‖_1,   (29)
where L is the lower-triangular matrix filled with 1. When Y_Ω is a true alignment matrix, Δ is the area between the paths of Y_true and Y_Ω, which corresponds to the mean absolute deviation in the audio literature. When Y_Ω is a soft alignment, it is a convex relaxation of the area. At test time, once θ is learned, we use the non-regularized DTW algorithm to output a hard alignment Y*(θ).
Results.
We perform a leave-one-out cross-validation of our model performance, learning the multinomial classifier on nine pieces and assessing the quality of the alignment on the remaining piece. We report the mean absolute deviation on both train and test sets. A solid baseline consists in learning the multinomial classifier beforehand, i.e., without end-to-end training. We then use this model to compute θ as in (27) and obtain the alignment. As shown in Table 2, our end-to-end technique outperforms this baseline by a large margin. We also demonstrate in §C.2 that the alignments obtained by end-to-end training are visibly closer to the ground truth. End-to-end training thus makes it possible to fine-tune the distance matrix for the alignment task at hand.
Linear model  Train  Test 

End-to-end trained  
Pretrained  
Random 
6 Structured and sparse attention
We show in this section how to apply our framework to neural sequence-to-sequence models augmented with an attention mechanism (Bahdanau et al., 2015). An encoder first produces a list of vectors X = (x_1, …, x_T) representing the input sequence. A decoder is then used to greedily produce the corresponding output sequence. To simplify the notation, we focus on one time step of the decoding procedure. Given the decoder’s current hidden state z and X as inputs, the role of the attention mechanism is to produce a distribution w over [T], for the current time step. This distribution is then typically used to produce a context vector c := ∑_t w_t x_t, that is in turn involved in the computation of the output sequence’s next element.
Structured attention layers.
Kim et al. (2017) proposed a segmentation attention layer, which is capable of taking into account the transitions between elements of X. They use a linear-chain CRF to model the probability of a sequence y = (y_1, …, y_T), where each y_t is either 1 (“pay attention”) or 0. They then propose to use the normalized marginal probabilities p(y_t = 1 | X) as attention weights. They show how to backpropagate gradients through the forward-backward algorithm, which they use to compute the marginal probabilities.
Generalizing structured attention.
We now show how to generalize segmentation attention layers to any Ω and how to backpropagate through them efficiently. Using the notation from §4.1, any y can be represented as a T × 2 × 2 tensor Y and the potentials as a tensor θ. As in (Kim et al., 2017), we define
θ_{t,i,j}(X, z) := i (x_t^⊤ W z) + T_{i,j},   (30)
where W is a learned bilinear form and T is a learned transition matrix. Following §4.1, the gradient ∇Vit_Ω(θ) is equal to the expected tensor 𝔼_{p_θ}[Y] and the marginals p(y_t = 1 | X) are obtained by marginalizing that tensor. Hence, we can set w_t ∝ p(y_t = 1 | X).
Experiments.
We demonstrate structured attention layers with an LSTM encoder and decoder to perform French to English translation using data from a 1 million sentence subset of the WMT14 FR–EN challenge. We illustrate an example of attention map obtained with negentropy and ℓ2 regularizations in Figure 4. Non-zero elements are underlined with borders: ℓ2-regularized attention maps are sparse and more interpretable — this provides a structured alternative to sparsemax attention (Martins & Astudillo, 2016). Results were all within one point of BLEU score on the newstest2014 dataset. For French to English, standard softmax attention obtained 27.96, while entropy-regularized and ℓ2-regularized attention obtained 27.96 and 27.19, respectively — introducing structure and sparsity therefore provides enhanced interpretability with comparable performance. We provide model details, full results and further visualizations in §C.3.
7 Conclusion
We proposed a theoretical framework for turning a broad class of dynamic programs into convex, differentiable and tractable operators, using the novel point of view of smoothed max operators. Our work sheds new light on how to transform dynamic programs that predict hard assignments (e.g., the maximum a posteriori estimator in a probabilistic graphical model or an alignment matrix between two time-series) into continuous and probabilistic ones. We provided a new argument in favor of negentropy regularization by showing that it is the only one to preserve associativity of the smoothed max operator. We showed that different regularizations induce different distributions over outputs and that ℓ2 regularization has other benefits, in terms of sparsity of the expected outputs. Generally speaking, performing inference in a graphical model and backpropagating through it reduces to computing the first and second-order derivatives of a relaxed maximum-likelihood estimation — leveraging this observation yields elegant and efficient algorithms that are readily usable in deep learning frameworks, with various promising applications.
Acknowledgements
MB thanks Vlad Niculae and Marco Cuturi for many fruitful discussions. AM thanks Julien Mairal, Inria Thoth and Inria Parietal for lending him the computational resources necessary to run the experiments. He thanks University Paris-Saclay and his Ph.D. supervisors Bertrand Thirion and Gaël Varoquaux for allowing him to do an internship at NTT, and Olivier Grisel for his insightful comments.
References
 Aji & McEliece (2000) Aji, Srinivas M and McEliece, Robert J. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.
 Amos & Kolter (2017) Amos, Brandon and Kolter, J. Zico. OptNet: Differentiable optimization as a layer in neural networks. In Proc. of ICML, pp. 136–145, 2017.
 Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. of ICLR, 2015.
 Bakır et al. (2007) Bakır, Gökhan, Hofmann, Thomas, Schölkopf, Bernhard, Smola, Alexander J, Taskar, Ben, and Vishwanathan, S. V. N. Predicting Structured Data. The MIT Press, 2007.
 Banderier & Schwer (2005) Banderier, Cyril and Schwer, Sylviane. Why Delannoy numbers? Journal of Statistical Planning and Inference, 135(1):40–54, 2005.
 Baum & Petrie (1966) Baum, Leonard E. and Petrie, Ted. Statistical inference for probabilistic functions of finite state markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
 Beck & Teboulle (2012) Beck, Amir and Teboulle, Marc. Smoothing and First Order Methods: A Unified Framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
 Bellman (1952) Bellman, Richard. On the theory of dynamic programming. Proc. of the National Academy of Sciences, 38(8):716–719, 1952.
 Bertsekas (1971) Bertsekas, Dimitri P. Control of uncertain systems with a setmembership description of the uncertainty. PhD thesis, Massachusetts Institute of Technology, 1971.
 Boyd & Vandenberghe (2004) Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004.
 Cuturi & Blondel (2017) Cuturi, Marco and Blondel, Mathieu. Soft-DTW: a Differentiable Loss Function for Time-Series. In Proc. of ICML, pp. 894–903, 2017.
 Danskin (1966) Danskin, John M. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
 Djolonga & Krause (2017) Djolonga, Josip and Krause, Andreas. Differentiable learning of submodular functions. In Proc. of NIPS, pp. 1014–1024, 2017.
 Duan & Pardo Duan, Zhiyao and Pardo, Bryan. Bach 10 dataset. http://music.cs.northwestern.edu/data/Bach10.html.
 Duan & Pardo (2011) Duan, Zhiyao and Pardo, Bryan. Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6):1205–1215, 2011.
 Duchi et al. (2008) Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proc. of ICML, pp. 272–279, 2008.
 Eisner (2016) Eisner, Jason. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proc. of the Workshop on Structured Prediction for NLP, pp. 1–17, 2016.
 Garreau et al. (2014) Garreau, Damien, Lajugie, Rémi, Arlot, Sylvain, and Bach, Francis. Metric learning for temporal sequence alignment. In Proc. of NIPS, pp. 1817–1825, 2014.
 Gselmann (2011) Gselmann, Eszter. Entropy functions and functional equations. Mathematical Communications, (16):347–357, 2011.
 Horibe (1988) Horibe, Yasuichi. Entropy of terminal distributions and the Fibonacci trees. The Fibonacci Quarterly, (26):135–140, 1988.
 Joulin et al. (2016) Joulin, Armand, Grave, Edouard, Bojanowski, Piotr, Douze, Matthijs, Jégou, Hervé, and Mikolov, Tomas. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
 Kim et al. (2017) Kim, Yoon, Denton, Carl, Hoang, Luong, and Rush, Alexander M. Structured Attention Networks. In Proc. of ICLR, 2017.
 Lafferty et al. (2001) Lafferty, John, McCallum, Andrew, and Pereira, Fernando CN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282–289, 2001.
 Lample et al. (2016) Lample, Guillaume, Ballesteros, Miguel, Subramanian, Sandeep, Kawakami, Kazuya, and Dyer, Chris. Neural architectures for named entity recognition. In Proc. of NAACL, pp. 260–270, 2016.
 Linnainmaa (1970) Linnainmaa, Seppo. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. PhD thesis, Univ. Helsinki, 1970.
 Luong et al. (2015) Luong, Thang, Pham, Hieu, and Manning, Christopher D. Effective Approaches to Attention-based Neural Machine Translation. In Proc. of EMNLP, pp. 1412–1421, 2015.
 Ma & Hovy (2016) Ma, Xuezhe and Hovy, Eduard. End-to-end sequence labeling via bidirectional LSTM-CNNs-CRF. In Proc. of ACL, pp. 1064–1074, 2016.
 Martins & Astudillo (2016) Martins, André F.T. and Astudillo, Ramón Fernandez. From softmax to sparsemax: A sparse model of attention and multilabel classification. In Proc. of ICML, pp. 1614–1623, 2016.
 Michelot (1986) Michelot, Christian. A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^n$. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
 Moreau (1965) Moreau, Jean-Jacques. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93(2):273–299, 1965.
 Nesterov (2005) Nesterov, Yurii. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
 Niculae & Blondel (2017) Niculae, Vlad and Blondel, Mathieu. A Regularized Framework for Sparse and Structured Neural Attention. In Proc. of NIPS, pp. 3340–3350, 2017.
 Paszke et al. (2017) Paszke, Adam, Gross, Sam, Chintala, Soumith, and Chanan, Gregory. Pytorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 2017.
 Pearl (1988) Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, 1988.
 Pearlmutter (1994) Pearlmutter, Barak A. Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160, 1994.

 Rabiner (1990) Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pp. 267–296. 1990.
 Sakoe & Chiba (1978) Sakoe, Hiroaki and Chiba, Seibi. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26:43–49, 1978.
 Smith & Eisner (2006) Smith, David A. and Eisner, Jason. Minimum risk annealing for training log-linear models. In Proc. of COLING/ACL, pp. 787–794, 2006.
 Stoyanov & Eisner (2012) Stoyanov, Veselin and Eisner, Jason. Minimum-risk training of approximate CRF-based NLP systems. In Proc. of NAACL, pp. 120–130, 2012.
 Sulanke (2003) Sulanke, Robert A. Objects counted by the central Delannoy numbers. Journal of Integer Sequences, 6(1):3, 2003.

 Sutton et al. (2012) Sutton, Charles, McCallum, Andrew, et al. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2012.
 Tjong Kim Sang & De Meulder (2003) Tjong Kim Sang, Erik F. and De Meulder, Fien. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL, pp. 142–147, 2003.
 Tsochantaridis et al. (2005) Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, and Altun, Yasemin. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
 Verdu & Poor (1987) Verdu, Sergio and Poor, H Vincent. Abstract dynamic programming models under commutativity conditions. SIAM Journal on Control and Optimization, 25(4):990–1006, 1987.
 Viterbi (1967) Viterbi, Andrew. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
 Wainwright & Jordan (2008) Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
Appendix A Proofs and detailed derivations
This section contains the proofs of the propositions and lemmas presented in the main text. It also contains derivations of gradient, directional derivative and Hessian-product computations.
A.1 Proof of Lemma 1 (properties of $\max_\Omega$)
Property 1 (boundedness).
Let $q^\star$ and $q^\star_\Omega$ be the solutions of $\max(x) = \max_{q \in \triangle^n} \langle q, x \rangle$ and $\max_\Omega(x) = \max_{q \in \triangle^n} \langle q, x \rangle - \Omega(q)$, respectively. Then, we have
(31) $\max_\Omega(x) = \langle q^\star_\Omega, x \rangle - \Omega(q^\star_\Omega) \geq \langle q^\star, x \rangle - \Omega(q^\star) = \max(x) - \Omega(q^\star)$
and
(32) $\max(x) = \langle q^\star, x \rangle \geq \langle q^\star_\Omega, x \rangle = \max_\Omega(x) + \Omega(q^\star_\Omega).$
Combining the two and using $L_{\Omega,n} \leq \Omega(q) \leq U_{\Omega,n}$ for all $q \in \triangle^n$, we obtain
(33) $\max(x) - U_{\Omega,n} \leq \max_\Omega(x) \leq \max(x) - L_{\Omega,n}.$
When $\Omega(q) = \gamma \langle q, \log q \rangle$ (negative entropy), we have the tight inequality $-\gamma \log n \leq \Omega(q) \leq 0$ and hence
(34) $\max(x) \leq \max_\Omega(x) \leq \max(x) + \gamma \log n.$
When $\Omega(q) = \frac{\gamma}{2} \|q\|_2^2$ (squared $\ell_2$ norm), we have the tight inequality $\frac{\gamma}{2n} \leq \Omega(q) \leq \frac{\gamma}{2}$ and hence
(35) $\max(x) - \frac{\gamma}{2} \leq \max_\Omega(x) \leq \max(x) - \frac{\gamma}{2n}.$
Note that the difference $U_{\Omega,n} - L_{\Omega,n}$ is equal to $\gamma \log n$ when $\Omega$ is the negative entropy and to $\frac{\gamma}{2}(1 - \frac{1}{n})$ when $\Omega$ is the squared $\ell_2$ norm. Since $\frac{1}{2}(1 - \frac{1}{n}) \leq \log n$ for all integers $n \geq 1$, we get a better approximation of the $\max$ operator using the squared $\ell_2$ norm than using the negative entropy, for the same value of $\gamma$.
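As a numerical sanity check (ours, not part of the paper), the bounds (34) and (35) can be verified with the closed form $\gamma \log \sum_i e^{x_i/\gamma}$ for the entropic case, and with a Euclidean projection onto the simplex for the squared $\ell_2$ case; the helper names below are our own.

```python
import numpy as np

def max_ent(x, gamma=1.0):
    # Entropy-regularized smoothed max: gamma * log-sum-exp(x / gamma).
    return gamma * np.log(np.sum(np.exp(x / gamma)))

def proj_simplex(v):
    # Euclidean projection onto the probability simplex,
    # via the sort-based algorithm (cf. Michelot, 1986; Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def max_l2(x, gamma=1.0):
    # l2-regularized smoothed max: the argmax is the projection
    # of x / gamma onto the simplex (a possibly sparse distribution).
    q = proj_simplex(x / gamma)
    return q @ x - 0.5 * gamma * np.sum(q ** 2)

x, gamma = np.array([1.0, 0.5, -2.0]), 1.0
m, n = x.max(), len(x)
assert m <= max_ent(x, gamma) <= m + gamma * np.log(n)           # bound (34)
assert m - gamma / 2 <= max_l2(x, gamma) <= m - gamma / (2 * n)  # bound (35)
assert (proj_simplex(x / gamma) == 0).any()  # l2 argmax is sparse
```

The last assertion illustrates the sparsity point above: on this example the $\ell_2$-regularized argmax assigns exactly zero weight to the smallest coordinate, whereas the entropic one never does.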
Property 2 (distributivity of $+$ over $\max_\Omega$).
This follows immediately from the fact that $\langle q, c \mathbf{1} \rangle = c$ for any $q \in \triangle^n$:
(36) $\max_\Omega(x + c \mathbf{1}) = \max_{q \in \triangle^n} \langle q, x \rangle + c - \Omega(q) = \max_\Omega(x) + c.$
Using our shorthand notation, this simply becomes $\max_\Omega(x + c \mathbf{1}) = \max_\Omega(x) + c$.
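For illustration (ours, not from the paper), distributivity is easy to check numerically in the entropic case, where $\max_\Omega$ reduces to a log-sum-exp:

```python
import numpy as np

def max_ent(x, gamma=1.0):
    # Entropy-regularized smoothed max (log-sum-exp).
    return gamma * np.log(np.sum(np.exp(x / gamma)))

x, c = np.array([0.3, -1.2, 2.0]), 5.0
lhs = max_ent(x + c)   # add the constant c to every coordinate
rhs = max_ent(x) + c   # distributivity: the constant moves out
assert np.isclose(lhs, rhs)
```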
Property 3 (commutativity).
Assume $\Omega(Pq) = \Omega(q)$ for all permutation matrices $P$. Let $P^\top$ be the inverse permutation matrix associated with $P$. Then we have
(37) $\max_\Omega(Px) = \max_{q \in \triangle^n} \langle q, Px \rangle - \Omega(q) = \max_{q \in \triangle^n} \langle P^\top q, x \rangle - \Omega(P^\top q)$
(38) $= \max_{q' \in \triangle^n} \langle q', x \rangle - \Omega(q') = \max_\Omega(x),$
where we used the change of variable $q' = P^\top q$, which leaves the simplex invariant.
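Commutativity can also be checked numerically (our sketch, not from the paper): both the negative entropy and the squared $\ell_2$ norm are permutation-invariant, so permuting the input leaves the smoothed max unchanged. We check the entropic case below.

```python
import numpy as np

def max_ent(x, gamma=1.0):
    # Entropy-regularized smoothed max (log-sum-exp).
    return gamma * np.log(np.sum(np.exp(x / gamma)))

rng = np.random.default_rng(0)
x = rng.normal(size=5)
P = np.eye(5)[rng.permutation(5)]  # a random permutation matrix
# Permuting the inputs leaves the smoothed max unchanged.
assert np.isclose(max_ent(P @ x), max_ent(x))
```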
Property 4 (non-decreasingness in each coordinate).
If $x \leq y$ coordinate-wise, then for all $q \in \triangle^n$,