Differentiable Dynamic Programming for Structured Prediction and Attention

02/11/2018 · Arthur Mensch, Mathieu Blondel

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows us to relax both the optimal value and the optimal solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.


1 Introduction

Modern neural networks are composed of multiple layers of nested functions. Although layers are usually built from elementary linear-algebraic operations and simple non-linearities, there is a growing need for layers that output the value or the solution of an optimization problem. This can be used to design loss functions that capture relevant regularities in the input (Lample et al., 2016; Cuturi & Blondel, 2017) or to create layers that impose prior structure on the output (Kim et al., 2017; Amos & Kolter, 2017; Niculae & Blondel, 2017; Djolonga & Krause, 2017).

Among these works, several involve a convex optimization problem (Amos & Kolter, 2017; Niculae & Blondel, 2017; Djolonga & Krause, 2017); others solve certain combinatorial optimization problems by dynamic programming (Lample et al., 2016; Kim et al., 2017; Cuturi & Blondel, 2017). However, because dynamic programs (Bellman, 1952) are usually non-differentiable, virtually all these works resort to the formalism of conditional random fields (CRFs) (Lafferty et al., 2001), which can be seen as changing the semiring used by the dynamic program — replacing all values by their exponentials and all (max, +) operations with (+, ×) operations (Verdu & Poor, 1987). While this modification smoothes the dynamic program, it loses the sparsity of solutions, since hard assignments become soft ones. Moreover, a general understanding of how to relax and differentiate dynamic programs is lacking. In this work, we propose to do so by leveraging smoothing (Moreau, 1965; Nesterov, 2005) and backpropagation (Linnainmaa, 1970). We make the following contributions.

1) We present a unified framework for turning a broad class of dynamic programs (DP) into differentiable operators. Unlike existing works, we propose to change the semiring to use (max_Ω, +) operations, where max_Ω is a max operator smoothed with a strongly convex regularizer Ω (§2).

2) We show that the resulting DP operators, which we call DP_Ω, are smoothed relaxations of the original DP algorithm and satisfy several key properties, chief among them convexity. In addition, we show that their gradient, ∇DP_Ω, is equal to the expected trajectory of a certain random walk and can be used as a sound relaxation to the original dynamic program's solution. Using the negative entropy for Ω recovers existing CRF-based works from a different perspective — we provide new arguments as to why this is a good choice. On the other hand, using the squared ℓ2 norm for Ω leads to new algorithms whose expected solution is sparse. We derive a clean and efficient method to backpropagate gradients, both through DP_Ω and ∇DP_Ω. This allows us to define differentiable DP layers that can be incorporated in neural networks trained end-to-end (§3).

3) We illustrate how to derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for supervised time-series alignment (§4). The latter is illustrated in Figure 1. Finally, we showcase these two instantiations on structured prediction tasks (§5) and on structured attention for neural machine translation (§6).

Notation.

We denote scalars, vectors and matrices using lower-case, bold lower-case and bold upper-case letters, e.g., y, y and Y. We denote the elements of Y by y_{i,j} and its rows by y_i. We denote the Frobenius inner product between A and B by ⟨A, B⟩ ≜ ∑_{i,j} a_{i,j} b_{i,j}. We denote the (D−1)-dimensional probability simplex by △^D ≜ {λ ∈ ℝ^D_+ : ‖λ‖_1 = 1}. We write conv(Y) the convex hull of Y, [N] the set {1, …, N} and supp(x) the support of x ∈ ℝ^D. We denote the Shannon entropy by H(q) ≜ −∑_i q_i log q_i.

We will release an optimized modular PyTorch implementation for reproduction and reuse.

Figure 1: DTW_Ω is an instantiation of the proposed smoothed dynamic programming operator, DP_Ω, to the dynamic time warping (DTW) computational graph. In this picture, θ is the squared Euclidean distance matrix between the observations of two time-series. The gradient ∇DTW_Ω(θ) is equal to the expected alignment under a certain random walk characterized in §3.3 and is a sound continuous relaxation to the hard DTW alignment between the two time-series (here depicted with a yellow path). Unlike negentropy regularization (left), squared ℓ2 regularization leads to exactly sparse alignments (right). Our framework allows us to backpropagate through both DTW_Ω and ∇DTW_Ω, which makes it possible to learn the distance matrix θ end-to-end.

2 Smoothed max operators

In this section, we introduce smoothed max operators (Nesterov, 2005; Beck & Teboulle, 2012; Niculae & Blondel, 2017), which will serve as a powerful and generic abstraction to define differentiable dynamic programs in §3. Formally, let Ω be a strongly convex regularizer on △^D and let x ∈ ℝ^D. We define the max operator smoothed by Ω as:

max_Ω(x) ≜ max_{q ∈ △^D} ⟨q, x⟩ − Ω(q).    (1)

In other words, max_Ω is the convex conjugate of Ω, restricted to the simplex. From the duality between strong convexity and smoothness, max_Ω is smooth: differentiable everywhere and with Lipschitz continuous gradient. Since the argument that achieves the maximum in (1) is unique, from Danskin's theorem (1966), it is equal to the gradient of max_Ω:

∇max_Ω(x) = argmax_{q ∈ △^D} ⟨q, x⟩ − Ω(q).    (2)

The gradient ∇max_Ω is differentiable almost everywhere for any strongly convex Ω (everywhere when Ω is the negentropy). Next, we state properties that will be useful throughout this paper.

Lemma 1.

Properties of the max_Ω operator

Let x ∈ ℝ^D.

  1. Boundedness: if Ω is lower-bounded by L_{Ω,D} and upper-bounded by U_{Ω,D} on the simplex △^D, then max(x) − U_{Ω,D} ≤ max_Ω(x) ≤ max(x) − L_{Ω,D}.

  2. Distributivity of + over max_Ω: max_Ω(x + c1) = max_Ω(x) + c for any c ∈ ℝ.

  3. Commutativity: if Ω(Pq) = Ω(q) for any permutation matrix P and q ∈ △^D, then max_Ω(Px) = max_Ω(x).

  4. Non-decreasingness in each coordinate: x ≤ y implies max_Ω(x) ≤ max_Ω(y).

  5. Insensitivity to −∞: x_j = −∞ implies (∇max_Ω(x))_j = 0.

Proofs are given in §A.1. In particular, property 3 holds whenever Ω(q) = ∑_{i=1}^D ω(q_i) for some function ω. We focus in this paper on two specific regularizers Ω: the negentropy −H and the squared ℓ2 norm. For these choices, all properties above are satisfied and we can derive closed-form expressions for max_Ω, its gradient and its Hessian — see §B.1. When using the negentropy, max_Ω becomes the log-sum-exp and ∇max_Ω the softmax. The former satisfies associativity, which, as we shall see, makes it natural to use in dynamic programming. With squared ℓ2 regularization, as observed by Martins & Astudillo (2016) and Niculae & Blondel (2017), the gradient ∇max_Ω is sparse. This will prove useful to enforce sparsity in the models we study.
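
To make these two cases concrete, here is a small NumPy sketch of max_Ω and its gradient for the negentropy and the squared ℓ2 norm, both with unit regularization strength; the simplex projection follows the classical sort-based algorithm (Duchi et al., 2008; Michelot, 1986), and the function names are ours rather than the paper's.

```python
import numpy as np

def max_entropy(x):
    """Negentropy-smoothed max: the log-sum-exp, whose gradient is the softmax."""
    m = x.max()
    e = np.exp(x - m)
    return m + np.log(e.sum()), e / e.sum()

def project_simplex(x):
    """Euclidean projection of x onto the probability simplex (sort-based algorithm)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > css - 1)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(x - tau, 0.0)

def max_l2(x):
    """Squared-l2-smoothed max: its gradient is the Euclidean projection of x onto
    the simplex, which is typically sparse."""
    q = project_simplex(x)
    return q @ x - 0.5 * q @ q, q
```

For example, max_entropy(np.array([1., 2., 3.])) returns the log-sum-exp together with a dense softmax, whereas max_l2 on the same input returns the weights (0, 0, 1), illustrating the sparsity of the ℓ2 case.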

3 Differentiable DP layers

Dynamic programming (DP) is a generic way of solving combinatorial optimization problems by recursively solving problems on smaller sets. We first introduce this category of algorithms in a broad setting, then use smoothed max operators to define differentiable DP layers.

3.1 Dynamic programming on a DAG

Every problem solved by dynamic programming reduces to finding the highest-scoring path between a start node and an end node, on a weighted directed acyclic graph (DAG). We therefore introduce our formalism on this generic problem, and give concrete examples in §4.

Formally, let G = (V, E) be a DAG, with nodes V and edges E. We write N ≜ |V| the number of nodes. Without loss of generality, we number the nodes in topological order, from 1 (start) to N (end), and thus V = [N]. Node 1 is the only node without parents, and node N the only node without children. Every directed edge (i, j) ∈ E, from a parent node j to a child node i, has a weight θ_{i,j} ∈ ℝ. We gather the edge weights in a matrix θ ∈ ℝ^{N×N}, setting θ_{i,j} = −∞ if (i, j) ∉ E. We consider the set Y of all paths in G from node 1 to node N. Any path Y ∈ Y can be represented as an N × N binary matrix, with y_{i,j} = 1 if the path goes through the edge (i, j) and y_{i,j} = 0 otherwise. In the sequel, paths will have a one-to-one correspondence with discrete structures such as sequences or alignments. Using this representation, ⟨Y, θ⟩ corresponds to the cumulated sum of edge weights along the path Y. The computation of the highest score among all paths amounts to solving the combinatorial problem

LP(θ) ≜ max_{Y ∈ Y} ⟨Y, θ⟩.    (3)

Although the size of Y is in general exponential in N, LP(θ) can be computed in one topologically-ordered pass over G using dynamic programming. We let P_i be the set of parent nodes of node i in the graph G and define recursively

v_1(θ) ≜ 0,    (4)
∀ i ∈ {2, …, N}:  v_i(θ) ≜ max_{j ∈ P_i} (θ_{i,j} + v_j(θ)).    (5)

This algorithm outputs DP(θ) ≜ v_N(θ). We now show that this is precisely the highest score among all paths.

Proposition 1.

Optimality of dynamic programming: for all θ, DP(θ) = LP(θ).

The optimality of the recursion (5) is a well-known result (Bellman, 1952). We prove it again with our formalism in §A.2, since the proof exhibits the two key properties that the max operator must satisfy to guarantee optimality: distributivity of + over it and associativity. The cost of computing DP(θ) is O(|E|), which is exponentially better than enumerating the paths in Y directly.
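
As an illustration, here is a minimal NumPy sketch of recursion (4)-(5), assuming the nodes are already numbered in topological order and that parents[i] lists the parent set P_i; the function name and data layout are ours.

```python
import numpy as np

def dp_forward_max(theta, parents):
    """Non-smoothed DP over a DAG: v[0] = 0 and v[i] = max_{j in parents[i]} theta[i, j] + v[j].
    Returns the highest path score v[N-1] and backpointers for recovering the best path."""
    N = theta.shape[0]
    v = np.full(N, -np.inf)
    v[0] = 0.0
    back = np.zeros(N, dtype=int)
    for i in range(1, N):
        scores = [theta[i, j] + v[j] for j in parents[i]]
        k = int(np.argmax(scores))
        v[i], back[i] = scores[k], parents[i][k]
    return v[N - 1], back
```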

In many applications, we are often interested instead in the argument that achieves the maximum, i.e., one of the highest-scoring paths

Y*(θ) ∈ argmax_{Y ∈ Y} ⟨Y, θ⟩.    (6)

This argument can be computed by backtracking, which we now relate to computing subgradients of LP(θ).

Linear program, lack of differentiability.

Unfortunately, LP(θ) is not differentiable everywhere. To see why this is the case, notice that (3) can be rewritten as a linear program over the convex polytope conv(Y):

LP(θ) = max_{Y ∈ conv(Y)} ⟨Y, θ⟩.    (7)

From the generalized Danskin theorem (Bertsekas, 1971),

Y*(θ) ∈ ∂LP(θ),    (8)

where ∂LP(θ) denotes the subdifferential of LP(θ), i.e., the set of subgradients. When Y*(θ) is unique, ∂LP(θ) is a singleton and Y*(θ) is equal to the gradient of LP, which we write ∇LP(θ). Unfortunately, Y*(θ) is not always unique, meaning that LP(θ) is not differentiable everywhere. This hinders optimization, as we can only train models involving LP(θ) with subgradient methods. Worse, Y*(θ), seen as a function from ℝ^{N×N} to Y, is discontinuous and has null or undefined derivatives. It is thus impossible to use it in a model trained by gradient descent.

3.2 Smoothed max layers

To address the lack of differentiability of dynamic programming, we introduce the operator max_Ω, presented in §2, and consider two approaches.

Smoothing the linear program.

Let us define the Ω-smoothed maximum of a function f over a finite set Y using the following shorthand notation:

max_Ω,{Y ∈ Y} f(Y) ≜ max_Ω((f(Y))_{Y ∈ Y}).    (9)

A natural way to circumvent the lack of differentiability of LP is then to replace the global max operator by max_Ω:

LP_Ω(θ) ≜ max_Ω,{Y ∈ Y} ⟨Y, θ⟩.    (10)

From §2, LP_Ω is convex and, as long as Ω is strongly convex, differentiable everywhere. In addition, ∇LP_Ω is Lipschitz continuous and thus differentiable almost everywhere. Unfortunately, solving (10) for a general strongly convex Ω is intractable when Y has exponential size.

Smoothing the dynamic program.

As a tractable alternative, we propose an algorithmic smoothing. Namely, we replace max by max_Ω locally within the DP recursion. Omitting the dependence on θ, this defines a smoothed recursion over a new sequence of values:

v_1(θ) ≜ 0,    (11)
∀ i ∈ {2, …, N}:  v_i(θ) ≜ max_Ω,{j ∈ P_i} (θ_{i,j} + v_j(θ)).    (12)

The new algorithm outputs DP_Ω(θ) ≜ v_N(θ), the smoothed highest score. Smoothing the max operator locally brings the same benefit as before — DP_Ω is smooth and ∇DP_Ω is differentiable almost everywhere. However, computing DP_Ω(θ) is now always tractable, since it simply requires evaluating max_Ω in topological order, as in the original recursion (5). Although LP_Ω(θ) and DP_Ω(θ) are generally different (they coincide for all θ essentially only in the negentropy case, cf. Proposition 2), we now show that DP_Ω(θ) is a sensible approximation of LP(θ) in several respects.
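
The smoothed recursion can be sketched in the same setting, reusing project_simplex from §2; the forward pass below also records the local gradients needed for backpropagation in §3.3 (a sketch under our own conventions, not the paper's released implementation).

```python
import numpy as np

def smoothed_max(x, omega="entropy", gamma=1.0):
    """max_Omega of a vector x with regularization strength gamma.
    Returns the smoothed value and its gradient (a point in the simplex)."""
    if omega == "entropy":                      # gamma * log-sum-exp(x / gamma); gradient = softmax(x / gamma)
        m = x.max()
        e = np.exp((x - m) / gamma)
        return gamma * np.log(e.sum()) + m, e / e.sum()
    q = project_simplex(x / gamma)              # squared l2: gradient is a simplex projection, often sparse
    return q @ x - 0.5 * gamma * (q @ q), q

def dp_forward_smoothed(theta, parents, omega="entropy", gamma=1.0):
    """Smoothed recursion (11)-(12): returns DP_Omega(theta) and the local gradients Q."""
    N = theta.shape[0]
    v = np.zeros(N)
    Q = [np.array([])] + [None] * (N - 1)       # Q[i][k] = weight given to parent parents[i][k]
    for i in range(1, N):
        scores = np.array([theta[i, j] + v[j] for j in parents[i]])
        v[i], Q[i] = smoothed_max(scores, omega, gamma)
    return v[N - 1], Q
```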

Proposition 2.

Properties of DP_Ω(θ)

  1. DP_Ω(θ) is convex.

  2. DP_Ω(θ) is bounded above and below:

    (13)

    where L_{Ω,D} and U_{Ω,D} are defined in Lemma 1.

  3. When Ω is separable, DP_Ω(θ) = LP_Ω(θ) for all θ if and only if Ω = −γH for some γ > 0.

Proofs are given in §A.3. The first claim can be surprising due to the recursive definition of DP_Ω(θ). The second claim implies that DP_Ω(θ) converges to LP(θ) when the regularization vanishes; LP_Ω(θ) also satisfies this property. The "if" direction of the third claim follows by showing that the negentropy-smoothed max satisfies associativity. This recovers known results in the framework of message-passing algorithms for probabilistic graphical models (e.g., Wainwright & Jordan, 2008, Section 4.1.3), with a more algebraic point of view. The key role that the distributive and associative properties play in breaking down large problems into smaller ones has long been noted (Verdu & Poor, 1987; Aji & McEliece, 2000). However, the "and only if" part of the claim is new to our knowledge. Its proof shows that −γH is the only separable regularizer whose smoothed max satisfies associativity, exhibiting a functional equation from information theory (Horibe, 1988). While this provides an argument in favor of entropic regularization, squared ℓ2 regularization has different benefits in terms of sparsity of the solutions.

3.3 Relaxed argmax layers

It is easy to check that ∇DP_Ω(θ) belongs to conv(Y) and can be interpreted as an expected path under some distribution over all possible Y ∈ Y induced by θ — see §A.4 for details. This makes ∇DP_Ω(θ) interpretable as a continuous relaxation of the highest-scoring path Y*(θ) defined in (6). However, like LP_Ω(θ), computing ∇LP_Ω(θ) is intractable in the general case. Fortunately, we now show that ∇DP_Ω(θ) is always easily computable by backpropagation and enjoys similar properties.

Computing ∇DP_Ω(θ).

Computing ∇DP_Ω(θ) can be broken down into two steps. First, we compute DP_Ω(θ), recording the local gradients alongside the recursive step (12):

q_{i,j} ≜ ∂v_i(θ) / ∂θ_{i,j} = ∇max_Ω((θ_{i,k} + v_k(θ))_{k ∈ P_i})_j,    (14)

where j ∈ P_i. Since we assume that θ_{i,j} = −∞ if (i, j) ∉ E, the insensitivity property of Lemma 1 gives q_{i,j} = 0 for such pairs. This ensures that, similarly to θ, the matrix Q ≜ (q_{i,j}) is supported by the edges of G. Let C_j be the children of node j. A straightforward application of backpropagation (cf. §A.5) yields a recursion run in reverse topological order, starting from node N down to node 1:

ē_j ≜ ∑_{i ∈ C_j} q_{i,j} ē_i   and   e_{i,j} ≜ q_{i,j} ē_i,    (15)

where ē_N ≜ 1 and e_{i,j} ≜ 0 for (i, j) ∉ E. The final output is E ≜ (e_{i,j}) = ∇DP_Ω(θ). Assuming ∇max_Ω can be computed in linear time, the total cost is O(|E|), the same as for DP_Ω(θ). Pseudo-code is summarized in §A.5.
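
Continuing the sketch from §3.2, the reverse pass below turns the recorded local gradients Q into the expected path matrix E = ∇DP_Ω(θ), following the bookkeeping of recursion (15).

```python
import numpy as np

def dp_backward(Q, parents, N):
    """Reverse pass of recursion (15): turns the local gradients Q into E = grad DP_Omega(theta)."""
    children = [[] for _ in range(N)]
    for i in range(1, N):
        for k, j in enumerate(parents[i]):
            children[j].append((i, k))          # node i is a child of j, through its k-th parent slot
    e_bar = np.zeros(N)                         # e_bar[j] = probability that the random walk of Section 3.3 visits node j
    e_bar[N - 1] = 1.0                          # the walk starts at the end node
    E = np.zeros((N, N))
    for j in reversed(range(N - 1)):            # reverse topological order
        for i, k in children[j]:
            E[i, j] = Q[i][k] * e_bar[i]
            e_bar[j] += E[i, j]
    return E
```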

Associated path distribution.

The backpropagation we derived has a probabilistic interpretation. Indeed, Q can be interpreted as a transition matrix: it defines a random walk on the graph G, i.e., a finite Markov chain with states V and transition probabilities supported by E. The random walk starts from node N and, when at node i, hops to a parent node j with probability q_{i,j}. It always ends at node 1, which is absorbing. The walk follows the path Y ∈ Y with a probability p(Y; θ, Ω), which is simply the product of the q_{i,j} of visited edges. Thus, Q defines a path distribution p(·; θ, Ω). Our next proposition shows that ∇DP_Ω(θ) is equal to the expected path under that distribution.

Proposition 3.

∇DP_Ω(θ) as an expected path

∇DP_Ω(θ) = E_{Y ∼ p(·; θ, Ω)}[Y] ∈ conv(Y).    (16)

Proof is provided in §A.5. Moreover, ∇DP_Ω(θ) is a principled relaxation of the highest-scoring path Y*(θ), in the sense that it converges to a subgradient of LP(θ) as the regularization vanishes:

(17)

When Ω = −H, the distributions underpinning DP_Ω and LP_Ω coincide and reduce to the Gibbs distribution p(Y; θ) ∝ exp ⟨Y, θ⟩. The value DP_{−H}(θ) is then equal to the log-partition. When Ω is the squared ℓ2 norm, some transitions between nodes have zero probability and hence some paths have zero probability under the distribution p(·; θ, Ω). Thus, ∇DP_Ω(θ) is typically sparse — this will prove interesting to introspect the various models we consider (typically, the smaller the regularization strength, the sparser ∇DP_Ω(θ)).
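
To make the path distribution concrete, the following sketch samples one path by running the random walk described above, using the local transition probabilities recorded by the forward pass (again our own helper, not part of the paper).

```python
import numpy as np

def sample_path(Q, parents, N, seed=0):
    """Samples one path Y ~ p(.; theta, Omega): start at the end node and repeatedly
    hop to a parent with the probabilities recorded in Q."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((N, N))
    i = N - 1
    while i != 0:
        k = rng.choice(len(parents[i]), p=Q[i])
        j = parents[i][k]
        Y[i, j] = 1.0
        i = j
    return Y
```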

3.4 Multiplication with the Hessian

Using ∇DP_Ω(θ) as a layer involves backpropagating through it. This requires computing its Jacobian, in other words the Hessian ∇²DP_Ω(θ), a linear map from ℝ^{N×N} to ℝ^{N×N}. Fortunately, a practical implementation of backpropagation only requires applying that map to a provided matrix Z, i.e., computing ∇²DP_Ω(θ) Z. We therefore focus on that term. Recall that the directional derivative of DP_Ω at θ along Z can be computed as ⟨∇DP_Ω(θ), Z⟩. Our key technique, which is also at the heart of Pearlmutter's method (1994), is to observe that ∇²DP_Ω(θ) Z is the gradient of that directional derivative. Namely,

∇²DP_Ω(θ) Z = ∇_θ ⟨∇DP_Ω(θ), Z⟩.    (18)

We therefore break down the computation of ∇²DP_Ω(θ) Z into two steps. First, we compute the directional derivative ⟨∇DP_Ω(θ), Z⟩ using the chain rule. It can be computed in one topologically-ordered pass over G. Similarly to the gradient computation, we record multiplications with the (generalized) local Hessian along the way. Second, we compute the gradient of the directional derivative using backpropagation. This yields a recursion for computing ∇²DP_Ω(θ) Z in reverse topological order over G. The complete derivation and the pseudo-code are given in §A.7. The total computational cost is O(|E|), as for the gradient computation.

Performance.

Using autodiff frameworks such as PyTorch (Paszke et al., 2017), it is possible to implement only DP_Ω(θ) and rely on tape-based gradient computation to obtain ∇DP_Ω(θ). Provided that we tape the backward pass as well, we can then backpropagate again through ∇DP_Ω(θ) to obtain the Hessian product. In practice, however, implementing backpropagation without resorting to autodiff software is crucial, since the DAG structure can be directly hardcoded in concrete cases — see §4. Moreover, our "reverse-over-forward" approach to compute the Hessian product (backpropagating over the directional-derivative computation) yields a simpler computation graph than the "reverse-over-reverse" approach (backpropagation over taped backpropagation). In experiments, our approach is substantially faster than vanilla PyTorch on the Viterbi DAG. Note that our algorithms are readily vectorizable and can efficiently handle mini-batches with varying input lengths.
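
For reference, the generic tape-based alternative mentioned above can be sketched in PyTorch as a "reverse-over-reverse" Hessian-vector product; it assumes dp_omega is any differentiable scalar-valued implementation of DP_Ω and is not the paper's optimized routine.

```python
import torch

def hessian_vector_product(dp_omega, theta, Z):
    """Computes grad = dp_omega'(theta) with a taped graph, then backpropagates through <grad, Z>."""
    theta = theta.detach().requires_grad_(True)
    value = dp_omega(theta)
    (grad,) = torch.autograd.grad(value, theta, create_graph=True)
    directional = (grad * Z).sum()               # <grad DP_Omega(theta), Z>, cf. eq. (18)
    (hvp,) = torch.autograd.grad(directional, theta)
    return grad.detach(), hvp
```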

Summary.

We have proposed DP_Ω(θ), a smooth, convex and tractable relaxation of the value of LP(θ). We have also shown that ∇DP_Ω(θ) belongs to conv(Y) and is therefore a sound relaxation of the solutions of LP(θ). To conclude this section, we formally define our proposed two layers.

Definition 1.

Differentiable dynamic programming layers

Value layer:    θ ↦ DP_Ω(θ) ∈ ℝ    (19)
Gradient layer: θ ↦ ∇DP_Ω(θ) ∈ conv(Y)    (20)

4 Examples of computational graphs

We now illustrate two instantiations of our framework for specific computational graphs.

4.1 Sequence prediction

We demonstrate in this section how to instantiate our framework to the computational graph of the Viterbi algorithm (Viterbi, 1967; Rabiner, 1990), one of the most famous instances of DP algorithms. We call the resulting operator Vit_Ω. We wish to tag a sequence of vectors (e.g., word representations) with the most probable output sequence (e.g., entity tags). This problem can be cast as finding the highest-scoring path on a trellis G. While Y ∈ Y can always be represented as a sparse N × N binary matrix, it is convenient to represent it instead as a T × S × S binary tensor Y, where T is the sequence length and S the number of states, such that y_{t,i,j} = 1 if the sequence transitions from state j to state i at time t, and 0 otherwise (with a suitable convention for the first time step). The potentials can similarly be organized as a real tensor θ ∈ ℝ^{T×S×S}, such that θ_{t,i,j} scores that transition. Traditionally, the potential functions were human-engineered (Sutton et al., 2012, §2.5). In recent works and in this paper, they are learned end-to-end (Lample et al., 2016).

Using the above binary tensor representation, the inner product ⟨Y, θ⟩ is equal to the cumulated score of the output sequence. This is illustrated in Figure 2 on the task of part-of-speech tagging. The bold arrows indicate one possible output sequence, i.e., one possible path in G.

Figure 2: Computational graph of the Viterbi algorithm.

When Ω = −H, we recover linear-chain conditional random fields (CRFs) (Lafferty et al., 2001), and the probability of an output sequence (Y in tensor representation) given the input X is

p(Y | X) ∝ exp ⟨Y, θ⟩.    (21)

From Prop. 3, the gradient ∇Vit_Ω(θ) is the expectation of Y under this distribution. The marginal probability of being in state i at time t is simply obtained by marginalizing that expected tensor over the previous state. Using a different Ω simply changes the distribution over state transitions. When Ω is the squared ℓ2 norm, the marginal probabilities are typically sparse. Pseudo-code for Vit_Ω, as well as gradient and Hessian-product computations, is provided in §B.2. The squared ℓ2 case is new to our knowledge.

When Ω = −H, the marginal probabilities are traditionally computed using the forward-backward algorithm (Baum & Petrie, 1966). In contrast, we compute ∇Vit_{−H}(θ) using backpropagation while efficiently maintaining the marginalization. An advantage of our approach is that all operations are numerically stable. The relation between forward-backward and backpropagation has been noted before (e.g., Eisner (2016)). However, that analysis is carried out using (+, ×) operations, instead of (max_Ω, +) as we do. This Viterbi instantiation can immediately be generalized to graphical models with a tree structure, and to approximate inference in general graphical models, since unrolled loopy belief propagation (Pearl, 1988) yields a dynamic program.
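
A minimal sketch of Vit_Ω on a trellis, reusing smoothed_max from §3.2; we assume theta[t, i, j] scores a transition into state i at time t from state j at time t-1 (our convention), and the expected transition tensor can then be obtained with the reverse pass of §3.3 or with autodiff.

```python
import numpy as np

def viterbi_smoothed(theta, omega="entropy", gamma=1.0):
    """Vit_Omega for a potential tensor theta of shape (T, S, S); returns the smoothed value."""
    T, S, _ = theta.shape
    v = np.zeros(S)                                  # v[j] = smoothed score of paths ending in state j
    for t in range(T):
        v = np.array([smoothed_max(theta[t, i] + v, omega, gamma)[0] for i in range(S)])
    return smoothed_max(v, omega, gamma)[0]          # smoothed max over the final state
```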

4.2 Time-series alignment

We now demonstrate how to instantiate our framework to the computational graph of dynamic time warping (DTW) (Sakoe & Chiba, 1978), whose goal is to seek the minimal-cost alignment between two time-series. We call the resulting operator DTW_Ω. Formally, let N_A and N_B be the lengths of two time-series A and B, and let a_i and b_j be their i-th and j-th observations, respectively. Since edge weights only depend on child nodes, it is convenient to rearrange Y and θ as N_A × N_B matrices. Namely, we represent an alignment Y as an N_A × N_B binary matrix, such that y_{i,j} = 1 if a_i is aligned with b_j, and 0 otherwise. Likewise, we represent θ as an N_A × N_B matrix of discrepancies. A classical example is θ_{i,j} = δ(a_i, b_j), for some differentiable discrepancy measure δ. We write Y the set of all monotonic alignment matrices, i.e., such that the path connecting the upper-left matrix entry to the lower-right one uses only down, right and diagonal moves. The DAG associated with Y is illustrated in Figure 3 below.

Figure 3: Computational graph of the DTW algorithm.

Again, the bold arrows indicate one possible path from start to end in the DAG, corresponding to one possible alignment. Using this representation, the cost of an alignment (the cumulated cost along the path) is conveniently computed by ⟨Y, θ⟩. The value DTW_Ω(θ) can be used to define a loss between alignments or between time-series. Following Proposition 3, ∇DTW_Ω(θ) can be understood as a soft alignment matrix. This matrix is sparse when using squared ℓ2 regularization, as illustrated in Figure 1 (right).

Pseudo-code to compute DTW_Ω(θ) as well as its gradient and its Hessian products is provided in §B.3. When Ω = −γH, DTW_Ω coincides with soft-DTW, and the induced probability over alignments is a Gibbs distribution similar to §4.1 (Cuturi & Blondel, 2017). However, the squared ℓ2 case and the computation of the Hessian products are new and allow new applications.
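
A sketch of the DTW_Ω recursion on a discrepancy matrix, again reusing smoothed_max; since DTW minimizes cost, each cell combines its three predecessors with a smoothed minimum, min_Ω(x) = −max_Ω(−x), and with negentropy this reduces to the soft-DTW recursion.

```python
import numpy as np

def dtw_smoothed(delta, omega="entropy", gamma=1.0):
    """DTW_Omega value for a discrepancy matrix delta of shape (NA, NB)."""
    NA, NB = delta.shape
    BIG = 1e18                                       # large finite value; np.inf would create nan in the l2 branch
    v = np.full((NA + 1, NB + 1), BIG)
    v[0, 0] = 0.0
    for i in range(1, NA + 1):
        for j in range(1, NB + 1):
            prev = np.array([v[i - 1, j - 1], v[i - 1, j], v[i, j - 1]])
            m, _ = smoothed_max(-prev, omega, gamma)  # max_Omega(-x) = -min_Omega(x)
            v[i, j] = delta[i - 1, j - 1] - m
    return v[NA, NB]
```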

5 Differentiable structured prediction

We now apply the proposed layers, DP_Ω and ∇DP_Ω, to structured prediction (Bakır et al., 2007), whose goal is to predict a structured output Y ∈ Y associated with a structured input X. We define old and new structured losses, and demonstrate them on two structured prediction tasks: named entity recognition and time-series alignment.

5.1 Structured loss functions

Throughout this section, we assume that the potentials θ have already been computed using a function from the inputs to ℝ^{N×N}, and we let Δ(Y_true, Y) be a cost function between the ground-truth output Y_true and a predicted output Y.

Convex losses.

Because the cost of the highest-scoring path, Δ(Y_true, Y*(θ)), is typically non-convex in θ, the cost-augmented structured hinge loss (Tsochantaridis et al., 2005) is often used instead for linear models:

max_{Y ∈ Y} (⟨Y, θ⟩ + Δ(Y_true, Y)) − ⟨Y_true, θ⟩.    (22)

This is a convex upper bound on Δ(Y_true, Y*(θ)), where Y*(θ) is defined in (6). To make the cost-augmented decoding tractable, it is usually assumed that Δ is linear in Y, i.e., it can be written as a linear function of Y for some matrix that depends only on Y_true. We can then rewrite (22) using our notation as

(23)

However, this loss function is non-differentiable. We therefore propose to relax LP by substituting it with DP_Ω:

(24)

Losses in this class are convex, smooth, tractable for any Ω and, by property 2 of Proposition 2, a sensible approximation of the structured hinge loss. In addition, they only require backpropagating through DP_Ω(θ) at training time. It is easy to check that we recover the structured hinge loss with Ω = 0 (Tsochantaridis et al., 2005) and the CRF loss with Ω = −H (Lafferty et al., 2001). The latter has been used on top of LSTMs in several recent works (Lample et al., 2016; Ma & Hovy, 2016). Minimizing the CRF loss is equivalent to maximizing the likelihood of the ground truth. However, minimizing the squared-ℓ2-regularized loss is not equivalent to maximizing the corresponding likelihood. In fact, the former is convex while the latter is not.
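
As an illustration, the sketch below instantiates a loss of this class with a hypothetical Hamming-style cost that is linear in Y (it adds a unit cost to every edge the ground truth does not use); dp_smoothed stands for any routine returning DP_Ω(θ) and its gradient E, e.g., a wrapper around the forward and backward sketches of §3.

```python
def surrogate_loss(theta, Y_true, dp_smoothed):
    """Relaxed cost-augmented loss (24) and its gradient with respect to theta."""
    C = (Y_true == 0).astype(float)          # +1 cost on every edge the ground truth does not use
    value, E = dp_smoothed(theta + C)        # DP_Omega of the cost-augmented potentials and its gradient
    loss = value - (Y_true * theta).sum()
    grad_theta = E - Y_true                  # valid because C does not depend on theta
    return loss, grad_theta
```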

Non-convex losses.

A direct approach that uses the output distribution p(·; θ, Ω) consists in minimizing the risk, i.e., the expected cost under that distribution. As shown by Stoyanov & Eisner (2012), this can be achieved by backpropagating through the minimum-risk decoder. However, the risk is usually non-differentiable and piecewise constant (Smith & Eisner, 2006), and several smoothing heuristics are necessary to make the method work (Stoyanov & Eisner, 2012).

Another principled approach is to consider a differentiable approximation of the cost Δ. We can then relax the cost of the predicted path by applying this approximation to ∇DP_Ω(θ) and Y_true. Unlike minimum-risk training, this approach is differentiable everywhere when using negentropy regularization. Both approaches require backpropagating through ∇DP_Ω(θ), which is only roughly twice as costly as backpropagating through DP_Ω(θ) using the approach outlined in §3.4.

5.2 Named entity recognition

Let X be an input sentence, where each word is represented by a vector computed using a neural recurrent architecture trained end-to-end. We wish to tag each word with named entities, i.e., identify blocks of words that correspond to names, locations, dates, etc. We use the specialized operator Vit_Ω described in §4.1. In our experiments, for t > 1 we define the elements of the potential tensor θ by

(25)

and define θ_{1,i,j} analogously without the transition term, where w_i (with bias b_i) is the linear classifier associated with tag i and a learned matrix provides the transition scores between consecutive tags. We learn the classifiers and the transition matrix along with the network producing X (a construction sketched in code after the loss definitions below), and compare two losses:

Surrogate convex loss: (26)
Relaxed loss:

where Δ is the squared ℓ2 distance when Ω is the squared ℓ2 norm and the Kullback–Leibler divergence when Ω = −H, applied row-wise to the marginalizations of ∇Vit_Ω(θ) and of the ground truth.
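
A sketch of the potential construction referred to above, under our own indexing convention (the paper's may differ): per-tag emission scores from linear classifiers plus a shared transition score.

```python
import numpy as np

def ner_potentials(X, W, b, trans):
    """Builds theta of shape (T, S, S) from token representations X (T, D), per-tag linear
    classifiers W (S, D) with bias b (S,), and a transition matrix trans (S, S)."""
    emissions = X @ W.T + b                            # emissions[t, i] = score of tag i at position t
    theta = emissions[:, :, None] + trans[None, :, :]  # theta[t, i, j] = emission(i) + transition j -> i
    theta[0] = emissions[0][:, None]                   # no transition into the first position
    return theta
```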

Experiments.

We measure the performance of the different losses and regularizations on the four languages of the CoNLL 2003 dataset (Tjong Kim Sang & De Meulder, 2003). Following Lample et al. (2016), who use the negentropy-based surrogate loss, we use a character LSTM and pretrained embeddings computed using FastText (Joulin et al., 2016) on Wikipedia. Those are fed to a word-level bidirectional LSTM to obtain the word representations. Architecture details are provided in §C.1. Results are reported in Table 1, along with the results of Lample et al. (2016), obtained with different pretrained embeddings. With proper parameter selection, all losses perform within a narrow F1-score range of each other, although entropy-regularized losses perform slightly better on three of the four languages. However, the squared-ℓ2-regularized losses yield sparse predictions, whereas entropy regularization always yields dense probability vectors. Qualitatively, this makes it easier to identify ambiguous predictions — this is illustrated in §C.1 with additional figures.

Loss                          English   Spanish   German   Dutch
Negentropy    Surrogate       90.80     86.68     77.35    87.56
              Relaxed         90.47     86.20     77.56    87.37
Squared ℓ2    Surrogate       90.86     85.51     76.01    86.58
              Relaxed         89.49     84.07     76.91    85.90
(Lample et al., 2016)         90.96     85.75     78.76    81.74
Table 1: F1-score comparison on the CoNLL 2003 NER datasets.

5.3 Supervised audio-to-score transcription

We use our framework to perform supervised audio-to-score alignment on the Bach 10 dataset (Duan & Pardo, 2011). The dataset consists of 10 music pieces with audio tracks, MIDI transcriptions, and annotated alignments between them. We transform each audio track into a sequence of audio frames using a feature extractor (see §C.2), while the associated score sequence is represented by one-hot vectors, each corresponding to one key. Each pair of sequences is associated with a ground-truth alignment. As described in §4.2, we need to define a discrepancy matrix θ between the elements of the two sequences. We set the cost between an audio frame and a key to be the log-likelihood of this key given a multinomial linear classifier. For all pairs of positions, we define

(27)
(28)

where the classifier parameters are learned. We predict a soft alignment by taking the gradient ∇DTW_Ω(θ). Following Garreau et al. (2014), we define the relaxed loss

(29)

where L is the lower-triangular matrix filled with ones. When the prediction is a true alignment matrix, this loss is the area between its path and that of the ground truth, which corresponds to the mean absolute deviation in the audio literature. When the prediction is a soft alignment, it is a convex relaxation of that area. At test time, once θ is learned, we use the non-regularized DTW algorithm to output a hard alignment.

Results.

We perform a leave-one-out cross-validation of our model's performance, learning the multinomial classifier on nine pieces and assessing the quality of the alignment on the remaining piece. We report the mean absolute deviation on both train and test sets. A solid baseline consists in learning the multinomial classifier beforehand, i.e., without end-to-end training. We then use this model to compute θ as in (27) and obtain the alignment. As shown in Table 2, our end-to-end technique outperforms this baseline by a large margin. We also demonstrate in §C.2 that the alignments obtained by end-to-end training are visibly closer to the ground truth. End-to-end training thus allows us to fine-tune the distance matrix for the alignment task at hand.

Linear model Train Test
End-to-end trained
Pretrained
Random
Table 2: Mean absolute deviation of alignment using an end-to-end trained multinomial classifier and a pre-trained one.

6 Structured and sparse attention

We show in this section how to apply our framework to neural sequence-to-sequence models augmented with an attention mechanism (Bahdanau et al., 2015). An encoder first produces a list of vectors representing the input sequence. A decoder is then used to greedily produce the corresponding output sequence. To simplify the notation, we focus on one time step of the decoding procedure. Given the decoder's current hidden state and the encoder outputs, the role of the attention mechanism is to produce a distribution over the input positions for the current time step. This distribution is then typically used to produce a context vector, which is in turn involved in the computation of the output sequence's next element.

Structured attention layers.

Kim et al. (2017) proposed a segmentation attention layer, which is capable of taking into account the transitions between elements of the input sequence. They use a linear-chain CRF to model the probability of a binary sequence, where each element is either 1 ("pay attention") or 0. They then propose to use normalized marginal probabilities as attention weights. They show how to backpropagate gradients through the forward-backward algorithm, which they use to compute the marginal probabilities.

Generalizing structured attention.

We now show how to generalize segmentation layers to any Ω and how to backpropagate through them efficiently. Using the notation from §4.1, any such binary sequence can be represented as a tensor Y and the potentials as a tensor θ. As in Kim et al. (2017), we define

(30)

where the potentials combine a learned bilinear form between the decoder state and the encoder outputs with a learned transition matrix. Following §4.1, the gradient ∇Vit_Ω(θ) is equal to the expected tensor and the marginal probabilities of "paying attention" are obtained by marginalizing that tensor. Hence, we can set the attention weights to these normalized marginals.

Backpropagating through ∇Vit_Ω(θ) can be carried out using our approach outlined in §3.4. This approach is not only more general, but also simpler and more robust to underflow problems than backpropagating through the forward-backward algorithm as done in Kim et al. (2017).
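
A compact PyTorch sketch of such a layer with negentropy regularization: Vit_Ω reduces to a log-sum-exp recursion over a binary-state trellis, autograd provides its gradient, and the marginal probabilities of "paying attention" are normalized into attention weights (our own simplification; gamma controls the smoothing).

```python
import torch

def structured_attention_weights(theta, gamma=1.0):
    """theta: tensor of shape (T, 2, 2), theta[t, i, j] scoring state i at step t after state j."""
    theta = theta.detach().requires_grad_(True)
    v = theta.new_zeros(2)
    for t in range(theta.shape[0]):
        v = gamma * torch.logsumexp((theta[t] + v) / gamma, dim=1)
    value = gamma * torch.logsumexp(v / gamma, dim=0)      # Vit_Omega(theta) with Omega = -gamma * H
    (E,) = torch.autograd.grad(value, theta)               # expected transition tensor, shape (T, 2, 2)
    marginals = E.sum(dim=2)[:, 1]                         # p(y_t = 1)
    return marginals / marginals.sum()
```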

Experiments.

We demonstrate structured attention layers with an LSTM encoder and decoder to perform French-to-English translation using data from a 1-million-sentence subset of the WMT14 FR-EN challenge. We illustrate an example of attention maps obtained with negentropy and squared ℓ2 regularization in Figure 4. Non-zero elements are underlined with borders: squared-ℓ2-regularized attention maps are sparse and more interpretable — this provides a structured alternative to sparsemax attention (Martins & Astudillo, 2016). Results were all within one BLEU point of each other on the newstest2014 dataset. For French to English, standard softmax attention obtained 27.96, while entropy- and squared-ℓ2-regularized attention obtained 27.96 and 27.19 — introducing structure and sparsity therefore provides enhanced interpretability with comparable performance. We provide model details, full results and further visualizations in §C.3.

Figure 4: Attention maps obtained with structured attention. Although both regularizations led to the same translation in this example, attention is sparse and more interpretable with squared ℓ2 regularization.

7 Conclusion

We proposed a theoretical framework for turning a broad class of dynamic programs into convex, differentiable and tractable operators, using the novel point of view of smoothed max operators. Our work sheds a new light on how to transform dynamic programs that predict hard assignments (e.g., the maximum a posteriori estimator in a probabilistic graphical model or an alignment matrix between two time-series) into continuous and probabilistic ones. We provided a new argument in favor of negentropy regularization by showing that it is the only one to preserve associativity of the smoothed max operator. We showed that different regularizations induce different distributions over outputs and that squared ℓ2 regularization has other benefits, in terms of sparsity of the expected outputs. Generally speaking, performing inference in a graphical model and backpropagating through it reduce to computing the first- and second-order derivatives of a relaxed maximum-likelihood estimation — leveraging this observation yields elegant and efficient algorithms that are readily usable in deep learning frameworks, with various promising applications.

Acknowledgements

MB thanks Vlad Niculae and Marco Cuturi for many fruitful discussions. AM thanks Julien Mairal, Inria Thoth and Inria Parietal for lending him the computational resources necessary to run the experiments. He thanks University Paris-Saclay and his Ph.D. supervisors Bertrand Thirion and Gaël Varoquaux for allowing him to do an internship at NTT, and Olivier Grisel for his insightful comments.

References

  • Aji & McEliece (2000) Aji, Srinivas M and McEliece, Robert J. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.
  • Amos & Kolter (2017) Amos, Brandon and Kolter, J. Zico. OptNet: Differentiable optimization as a layer in neural networks. In Proc. of ICML, pp. 136–145, 2017.
  • Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. of ICLR, 2015.
  • Bakır et al. (2007) Bakır, Gökhan, Hofmann, Thomas, Schölkopf, Bernhard, Smola, Alexander J, Taskar, Ben, and Vishwanathan, S. V. N. Predicting Structured Data. The MIT Press, 2007.
  • Banderier & Schwer (2005) Banderier, Cyril and Schwer, Sylviane. Why Delannoy numbers? Journal of Statistical Planning and Inference, 135(1):40–54, 2005.
  • Baum & Petrie (1966) Baum, Leonard E. and Petrie, Ted. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
  • Beck & Teboulle (2012) Beck, Amir and Teboulle, Marc. Smoothing and First Order Methods: A Unified Framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
  • Bellman (1952) Bellman, Richard. On the theory of dynamic programming. Proc. of the National Academy of Sciences, 38(8):716–719, 1952.
  • Bertsekas (1971) Bertsekas, Dimitri P. Control of uncertain systems with a set-membership description of the uncertainty. PhD thesis, Massachusetts Institute of Technology, 1971.
  • Boyd & Vandenberghe (2004) Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004.
  • Cuturi & Blondel (2017) Cuturi, Marco and Blondel, Mathieu. Soft-DTW: a Differentiable Loss Function for Time-Series. In Proc. of ICML, pp. 894–903, 2017.
  • Danskin (1966) Danskin, John M. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
  • Djolonga & Krause (2017) Djolonga, Josip and Krause, Andreas. Differentiable learning of submodular functions. In Proc. of NIPS, pp. 1014–1024, 2017.
  • Duan, Zhiyao and Pardo, Bryan. Bach 10 dataset. http://music.cs.northwestern.edu/data/Bach10.html.
  • Duan & Pardo (2011) Duan, Zhiyao and Pardo, Bryan. Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6):1205–1215, 2011.
  • Duchi et al. (2008) Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. of ICML, pp. 272–279, 2008.
  • Eisner (2016) Eisner, Jason. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proc. of the Workshop on Structured Prediction for NLP, pp. 1–17, 2016.
  • Garreau et al. (2014) Garreau, Damien, Lajugie, Rémi, Arlot, Sylvain, and Bach, Francis. Metric learning for temporal sequence alignment. In Proc. of NIPS, pp. 1817–1825, 2014.
  • Gselmann (2011) Gselmann, Eszter. Entropy functions and functional equations. Mathematical Communications, (16):347–357, 2011.
  • Horibe (1988) Horibe, Yasuichi. Entropy of terminal distributions and the Fibonacci trees. The Fibonacci Quarterly, (26):135–140, 1988.
  • Joulin et al. (2016) Joulin, Armand, Grave, Edouard, Bojanowski, Piotr, Douze, Matthijs, Jégou, Hervé, and Mikolov, Tomas. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
  • Kim et al. (2017) Kim, Yoon, Denton, Carl, Hoang, Luong, and Rush, Alexander M. Structured Attention Networks. In Proc. of ICLR, 2017.
  • Lafferty et al. (2001) Lafferty, John, McCallum, Andrew, and Pereira, Fernando CN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282–289, 2001.
  • Lample et al. (2016) Lample, Guillaume, Ballesteros, Miguel, Subramanian, Sandeep, Kawakami, Kazuya, and Dyer, Chris. Neural architectures for named entity recognition. In Proc. of NAACL, pp. 260–270, 2016.
  • Linnainmaa (1970) Linnainmaa, Seppo. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. PhD thesis, Univ. Helsinki, 1970.
  • Luong et al. (2015) Luong, Thang, Pham, Hieu, and Manning, Christopher D. Effective Approaches to Attention-based Neural Machine Translation. In Proc. of EMNLP, pp. 1412–1421, 2015.
  • Ma & Hovy (2016) Ma, Xuezhe and Hovy, Eduard. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of ACL, pp. 1064–1074, 2016.
  • Martins & Astudillo (2016) Martins, André F.T. and Astudillo, Ramón Fernandez. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML, pp. 1614–1623, 2016.
  • Michelot (1986) Michelot, Christian. A finite algorithm for finding the projection of a point onto the canonical simplex of ℝ^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
  • Moreau (1965) Moreau, Jean-Jacques. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93(2):273–299, 1965.
  • Nesterov (2005) Nesterov, Yurii. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
  • Niculae & Blondel (2017) Niculae, Vlad and Blondel, Mathieu. A Regularized Framework for Sparse and Structured Neural Attention. In Proc. of NIPS, pp. 3340–3350, 2017.
  • Paszke et al. (2017) Paszke, Adam, Gross, Sam, Chintala, Soumith, and Chanan, Gregory. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 2017.
  • Pearl (1988) Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, 1988.
  • Pearlmutter (1994) Pearlmutter, Barak A. Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160, 1994.
  • Rabiner (1990) Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pp. 267–296, 1990.
  • Sakoe & Chiba (1978) Sakoe, Hiroaki and Chiba, Seibi. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26:43–49, 1978.
  • Smith & Eisner (2006) Smith, David A. and Eisner, Jason. Minimum risk annealing for training log-linear models. In Proc. of COLING/ACL, pp. 787–794, 2006.
  • Stoyanov & Eisner (2012) Stoyanov, Veselin and Eisner, Jason. Minimum-risk training of approximate CRF-based NLP systems. In Proc. of NAACL, pp. 120–130, 2012.
  • Sulanke (2003) Sulanke, Robert A. Objects counted by the central Delannoy numbers. Journal of Integer Sequences, 6(1):3, 2003.
  • Sutton et al. (2012) Sutton, Charles, McCallum, Andrew, et al. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2012.
  • Tjong Kim Sang & De Meulder (2003) Tjong Kim Sang, Erik F. and De Meulder, Fien. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL, pp. 142–147, 2003.
  • Tsochantaridis et al. (2005) Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, and Altun, Yasemin. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
  • Verdu & Poor (1987) Verdu, Sergio and Poor, H Vincent. Abstract dynamic programming models under commutativity conditions. SIAM Journal on Control and Optimization, 25(4):990–1006, 1987.
  • Viterbi (1967) Viterbi, Andrew. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
  • Wainwright & Jordan (2008) Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Appendix A Proofs and detailed derivations

This section contains the proofs of the propositions and lemmas presented in the main text. It also contains derivations of gradient, directional derivative and Hessian-product computations.

A.1 Proof of Lemma 1 (properties of max_Ω)

Property 1 (boundedness).

Let and be the solutions of and , respectively. Then, we have

(31)

and

(32)

Combining the two and using , we obtain

(33)

When , we have the tight inequality and hence

(34)

When , we have the tight inequality and hence

(35)

Note that the difference U_{Ω,D} − L_{Ω,D} is equal to log D when Ω is the negative entropy and to (1 − 1/D)/2 when Ω is the squared ℓ2 norm. Since (1 − 1/D)/2 ≤ log D for all integers D ≥ 1, we get a better approximation of the max operator using the squared ℓ2 norm than using the negative entropy, whenever D ≥ 2.

Property 2 (distributivity of over ).

This follows immediately from

(36)

Using our shorthand notation, this simply becomes .

Property 3 (commutativity).

Assume Ω(Pq) = Ω(q) for all permutation matrices P. Let P⁻¹ be the inverse permutation matrix associated with P. Then we have

(37)
(38)

Property 4 (non-decreasingness in each coordinate).

If