
SparseMAP: Differentiable Sparse Structured Inference

by Vlad Niculae, et al.

Structured prediction requires searching over a combinatorial number of structures. To tackle it, we introduce SparseMAP, a new method for sparse structured inference, together with corresponding loss functions. SparseMAP inference is able to automatically select only a few global structures: it is situated between MAP inference, which picks a single structure, and marginal inference, which assigns probability mass to all structures, including implausible ones. Importantly, SparseMAP can be computed using only calls to a MAP oracle, hence it is applicable even to problems where marginal inference is intractable, such as linear assignment. Moreover, thanks to the solution sparsity, gradient backpropagation is efficient regardless of the structure. SparseMAP thus enables us to augment deep neural networks with generic and sparse structured hidden layers. Experiments in dependency parsing and natural language inference reveal competitive accuracy, improved interpretability, and the ability to capture natural language ambiguities, which is attractive for pipeline systems.



1 Introduction

Structured prediction involves the manipulation of discrete, combinatorial structures, e.g., trees and alignments (Bakır et al., 2007; Smith, 2011; Nowozin et al., 2014). Such structures arise naturally as machine learning outputs, and as intermediate representations in deep pipelines. However, the set of possible structures is typically prohibitively large. As such, inference is a core challenge, often sidestepped by greedy search, factorization assumptions, or continuous relaxations (Belanger & McCallum, 2016).

In this paper, we propose an appealing alternative: a new inference strategy, dubbed SparseMAP, which encourages sparsity in the structured representations. Namely, we seek solutions explicitly expressed as a combination of a small, enumerable set of global structures.


[Figure 1, left panel labels: argmax (1, 0, 0); softmax (.5, .3, .2); sparsemax (.6, .4, 0).]

Figure 1: Left: in the unstructured case, softmax and sparsemax can be interpreted as regularized, differentiable argmax approximations; softmax returns dense solutions while sparsemax favors sparse ones. Right: in this work, we extend this view to structured inference, which consists of optimizing over a polytope $\mathcal{M}$, the convex hull of all possible structures (depicted: the arborescence polytope, whose vertices are trees). We introduce SparseMAP as a structured extension of sparsemax: it is situated in between MAP inference, which yields a single structure, and marginal inference, which returns a dense combination of structures.

Our framework departs from the two most common inference strategies in structured prediction: maximum a posteriori (MAP) inference, which returns the highest-scoring structure, and marginal inference, which yields a dense probability distribution over structures. Neither of these strategies is fully satisfactory: for latent structure models, marginal inference is appealing, since it can represent uncertainty and, unlike MAP inference, it is continuous and differentiable, hence amenable for use in structured hidden layers in neural networks (Kim et al., 2017). It has, however, several limitations. First, there are useful problems for which MAP is tractable, but marginal inference is not, e.g., linear assignment (Valiant, 1979; Taskar, 2004). Second, even when marginal inference is available, case-by-case derivation of the backward pass is needed, sometimes producing fairly complicated algorithms, e.g., second-order expectation semirings (Li & Eisner, 2009). Finally, marginal inference is dense: it assigns nonzero probabilities to all structures and cannot completely rule out irrelevant ones. This can be statistically and computationally wasteful, as well as qualitatively harder to interpret.

In this work, we make the following contributions:

  • We propose SparseMAP: a new framework for sparse structured inference (§3.1). The main idea is illustrated in Figure 1. SparseMAP is a twofold generalization: first, as a structured extension of the sparsemax transformation (Martins & Astudillo, 2016); second, as a continuous yet sparse relaxation of MAP inference. MAP yields a single structure and marginal inference yields a dense distribution over all structures. In contrast, the SparseMAP solutions are sparse combinations of a small number of often-overlapping structures.

  • We show how to compute SparseMAP effectively, requiring only a MAP solver as a subroutine (§3.2), by exploiting the problem’s sparsity and quadratic curvature. Notably, the MAP oracle can be any solver, e.g., the Hungarian algorithm for linear assignment, which permits tackling problems for which marginal inference is intractable.

  • We derive expressions for gradient backpropagation through SparseMAP inference, which, unlike MAP, is differentiable almost everywhere (§3.3). The backward pass is fully general (applicable to any type of structure), and it is efficient, thanks to the sparsity of the solutions and to reusing quantities computed in the forward pass.

  • We introduce a novel SparseMAP loss for structured prediction, placing it into a family of loss functions which generalizes the CRF and structured SVM losses (§4). Inheriting the desirable properties of SparseMAP inference, the SparseMAP loss and its gradients can be computed efficiently, provided access to MAP inference.

Our experiments demonstrate that SparseMAP is useful both for predicting structured outputs and for learning latent structured representations. On dependency parsing (§5.1), structured output networks trained with the SparseMAP loss yield more accurate models with sparse, interpretable predictions, adapting to the ambiguity (or lack thereof) of test examples. On natural language inference (§5.2), we learn latent structured alignments, obtaining good predictive performance, as well as useful natural visualizations concentrated on a small number of structures. (Footnote 1: General-purpose dynet and pytorch implementations are available.)


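Given vectors $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$, $[a; b] \in \mathbb{R}^{m+n}$ denotes their concatenation; given matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{n \times k}$, we denote their row-wise stacking as $[A; B] \in \mathbb{R}^{(m+n) \times k}$. We denote the columns of a matrix $A$ by $a_j$; by extension, a slice of columns of $A$ is denoted $A_{\mathcal{I}}$ for a set of indices $\mathcal{I}$. We denote the canonical simplex by $\Delta^d := \{y \in \mathbb{R}^d : y \geq 0, \sum_i y_i = 1\}$, and the indicator function of a predicate $p$ as $\mathbb{I}[p]$, equal to $1$ if $p$ holds and $0$ otherwise.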

2 Preliminaries

2.1 Regularized Max Operators: Softmax, Sparsemax

As a basis for the more complex structured case, we first consider the simple problem of selecting the largest value in a vector $\theta \in \mathbb{R}^d$. We denote the vector mapping

$$\operatorname{argmax}(\theta) := \arg\max_{y \in \Delta^d} \theta^\top y.$$

When there are no ties, argmax has a unique solution $e_i$ peaking at the index $i$ of the highest value of $\theta$. When there are ties, argmax is set-valued. Even assuming no ties, argmax is piecewise constant, and thus is ill-suited for direct use within neural networks, e.g., in an attention mechanism. Instead, it is common to use softmax, a continuous and differentiable approximation to argmax, which can be seen as an entropy-regularized argmax:

$$\operatorname{softmax}(\theta) := \arg\max_{y \in \Delta^d} \theta^\top y + H(y) = \frac{\exp \theta}{\sum_{i=1}^{d} \exp \theta_i}, \tag{1}$$

where $H(y) := -\sum_i y_i \ln y_i$ is the Shannon entropy. Since $\exp(\cdot) > 0$ strictly, softmax outputs are dense.

By replacing the entropic penalty with a squared $\ell_2$ norm, Martins & Astudillo (2016) introduced a sparse alternative to softmax, called sparsemax, given by

$$\operatorname{sparsemax}(\theta) := \arg\max_{y \in \Delta^d} \theta^\top y - \tfrac{1}{2}\|y\|_2^2 = \arg\min_{y \in \Delta^d} \|y - \theta\|_2^2. \tag{2}$$

Both softmax and sparsemax are continuous and differentiable almost everywhere; however, sparsemax encourages sparsity in its outputs. This is because it corresponds to a Euclidean projection onto the simplex, which is likely to hit its boundary as the magnitude of $\theta$ increases. Both mechanisms, as well as variants with different penalties (Niculae & Blondel, 2017), have been successfully used in attention mechanisms, for mapping a score vector $\theta \in \mathbb{R}^d$ to a $d$-dimensional normalized discrete probability distribution over a small set of choices. The relationship between argmax, softmax, and sparsemax, illustrated in Figure 1, sits at the foundation of SparseMAP.
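To make the contrast between the two regularized operators concrete, here is a minimal pure-Python sketch. The sort-based projection onto the simplex is the standard way to evaluate sparsemax; the function names and the example scores are illustrative choices, not taken from the authors' implementation.

```python
import math

def softmax(theta):
    """Entropy-regularized argmax: exp(theta), normalized to sum to 1."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (sort-based)."""
    z = sorted(theta, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score is in the support while 1 + k*z_k > sum of top-k.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k  # threshold for the current support size
    return [max(t - tau, 0.0) for t in theta]

theta = [1.2, 1.0, -0.5]
print(softmax(theta))    # every entry is strictly positive
print(sparsemax(theta))  # approximately [0.6, 0.4, 0.0]; the last entry is exactly zero
```

With these scores, sparsemax reproduces the sparse distribution shown in the left panel of Figure 1, while softmax keeps some mass on the lowest-scoring option.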

2.2 Structured Inference

In structured prediction, the space of possible outputs is typically very large: for instance, all possible labelings of a length-$n$ sequence, spanning trees over $n$ nodes, or one-to-one alignments between two sets. We may still write optimization problems such as $\max_{s=1,\dots,D} \theta_s$, but it is impractical to enumerate all of the $D$ possible structures and, in turn, to specify the scores for each structure in $\theta$.

Instead, structured problems are often parametrized through structured log-potentials (scores) $\theta := A^\top \eta$, where $A \in \mathbb{R}^{k \times D}$ is a matrix that specifies the structure of the problem, and $\eta \in \mathbb{R}^k$ is a lower-dimensional parameter vector, i.e., $k \ll D$. For example, in a factor graph (Kschischang et al., 2001) with variables $U$ and factors $F$, $\theta$ is given by

$$\theta_s := \sum_{i \in U} \eta_{U,i}(s_i) + \sum_{f \in F} \eta_{F,f}(s_f),$$

where $\eta_U$ and $\eta_F$ are unary and higher-order log-potentials, and $s_i$ and $s_f$ are local configurations at variable and factor nodes. This can be written in matrix notation as $\theta = M^\top \eta_U + N^\top \eta_F$ for suitable matrices $M$ and $N$, fitting the assumption above with $A = [M; N]$ and $\eta = [\eta_U; \eta_F]$.

We can then rewrite the MAP inference problem, which seeks the highest-scoring structure, as a $k$-dimensional problem, by introducing variables $[u; v] \in \mathbb{R}^k$ to denote configurations at variable and factor nodes. (Footnote 2: We use the notation $\arg\max_{u:\,[u; v] \in \mathcal{M}}$ to convey that the maximization is over both $u$ and $v$, but only $u$ is returned. Separating the variables as $[u; v]$ loses no generality and allows us to isolate the unary posteriors $u$ as the return value of interest.)

$$\operatorname{MAP}_A(\eta) := \arg\max_{u := My,\; y \in \Delta^D} \theta^\top y = \arg\max_{u:\,[u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v, \tag{3}$$

where $\mathcal{M}_A := \{[u; v] : u = My,\ v = Ny,\ y \in \Delta^D\}$ is the marginal polytope (Wainwright & Jordan, 2008), with one vertex for each possible structure (Figure 1). However, as noted above, since it is equivalent to a $D$-dimensional argmax, MAP is piecewise constant and discontinuous.

Negative entropy regularization over $y$, on the other hand, yields marginal inference,

$$\operatorname{Marginal}_A(\eta) := \arg\max_{u := My,\; y \in \Delta^D} \theta^\top y + H(y) = \arg\max_{u:\,[u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v + H_A(u, v). \tag{4}$$

Marginal inference is differentiable, but may be more difficult to compute; the entropy $H_A$ itself lacks a closed form (Wainwright & Jordan, 2008, §4.1.2). Gradient backpropagation is available only for specialized problem instances, e.g. those solvable by dynamic programming (Li & Eisner, 2009). The entropic term regularizes $y$ toward more uniform distributions, resulting in strictly dense solutions, just like in the case of softmax (Equation 1).

Interesting types of structures, which we use in the experiments described in Section 5, include the following.

Sequence tagging. Consider a sequence of $n$ items, each assigned one out of a possible $m$ tags. In this case, a global structure $s$ is a joint assignment of tags $(t_1, \dots, t_n)$. The matrix $M$ is $nm$-by-$m^n$-dimensional, with columns $m_s := [e_{t_1}; \dots; e_{t_n}] \in \{0, 1\}^{nm}$ indicating which tag is assigned to each variable in the global structure $s$. $N$ is $nm^2$-by-$m^n$-dimensional, with $n_s$ encoding the transitions between consecutive tags, i.e., $n_s(i, a, b) := \mathbb{I}[t_{i-1} = a \text{ and } t_i = b]$. The Viterbi algorithm provides MAP inference and forward-backward provides marginal inference (Rabiner, 1989).
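As an illustration of a MAP oracle for this structure, the following is a minimal Viterbi sketch in pure Python. The argument layout (`unary[i][t]` for tag scores and a position-independent `transition[a][b]`) is an assumption made for brevity; the parametrization in the text allows position-dependent transition scores.

```python
def viterbi(unary, transition):
    """MAP tag sequence for a linear-chain model.

    unary[i][t]      : score of tag t at position i
    transition[a][b] : score of tag a being followed by tag b
    """
    n, m = len(unary), len(unary[0])
    best = [list(unary[0])]  # best[i][t]: best score of a prefix ending in tag t
    back = []                # back-pointers used to recover the argmax
    for i in range(1, n):
        scores, pointers = [], []
        for b in range(m):
            a_best = max(range(m), key=lambda a: best[-1][a] + transition[a][b])
            pointers.append(a_best)
            scores.append(best[-1][a_best] + transition[a_best][b] + unary[i][b])
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(range(m), key=lambda t: best[-1][t])
    tags = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        tags.append(tag)
    return tags[::-1]

unary = [[1.0, 0.0], [0.0, 0.5], [0.2, 1.0]]
transition = [[0.6, 0.0], [0.0, 0.6]]
print(viterbi(unary, transition))  # [0, 1, 1]
```

The returned tag sequence identifies one column of the structure matrix, i.e., one vertex of the marginal polytope.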

Non-projective dependency parsing. Consider a sentence of length $n$. Here, a structure $s$ is a dependency tree: a rooted spanning tree over the $n^2$ possible arcs (for example, the arcs above the sentences in Figure 3). Each column $m_s \in \{0, 1\}^{n^2}$ encodes a tree by assigning a $1$ to its arcs. $N$ is empty, and $\mathcal{M}_A$ is known as the arborescence polytope (Martins et al., 2009). MAP inference may be performed by maximal arborescence algorithms (Chu & Liu, 1965; Edmonds, 1967; McDonald et al., 2005), and the Matrix-Tree theorem (Kirchhoff, 1847) provides a way to perform marginal inference (Koo et al., 2007; Smith & Smith, 2007).

Linear assignment. Consider a one-to-one matching (linear assignment) between two sets of $n$ nodes. A global structure $s$ is an $n$-permutation, and a column $m_s \in \{0, 1\}^{n^2}$ can be seen as a flattening of the corresponding permutation matrix. Again, $N$ is empty. $\mathcal{M}_A$ is the Birkhoff polytope (Birkhoff, 1946), and MAP inference can be performed by, e.g., the Hungarian algorithm (Kuhn, 1955) or the Jonker-Volgenant algorithm (Jonker & Volgenant, 1987). Notably, marginal inference is known to be #P-complete (Valiant, 1979; Taskar, 2004, Section 3.5). This makes it an open problem how to use matchings as latent variables.
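The following sketch shows what a structure and its indicator column look like for linear assignment. It uses brute-force enumeration as a stand-in MAP oracle, which is only viable for very small problems; in practice one would call a polynomial-time solver such as the Hungarian algorithm. The helper names are hypothetical.

```python
from itertools import permutations

def flatten_permutation(perm):
    """Indicator column: the n-by-n permutation matrix of `perm`, flattened row by row."""
    n = len(perm)
    return [1 if perm[i] == j else 0 for i in range(n) for j in range(n)]

def map_matching(eta):
    """Brute-force MAP oracle: the permutation maximizing sum_i eta[i][perm[i]].

    Enumeration costs n! and is meant only for tiny n.
    """
    n = len(eta)
    return max(permutations(range(n)),
               key=lambda perm: sum(eta[i][perm[i]] for i in range(n)))

eta = [[2.0, 1.0, 0.0],
       [1.5, 2.0, 0.0],
       [0.0, 0.5, 1.0]]
best = map_matching(eta)
print(best)                       # (0, 1, 2)
print(flatten_permutation(best))  # [1, 0, 0, 0, 1, 0, 0, 0, 1]
```

Any convex combination of such flattened permutation matrices is a doubly stochastic matrix, i.e., a point of the Birkhoff polytope.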


3 SparseMAP

Armed with the parallel between structured inference and regularized max operators described in §2, we are now ready to introduce SparseMAP, a novel inference optimization problem which returns sparse solutions.

3.1 Definition

We introduce SparseMAP by regularizing the MAP inference problem in Equation 3 with a squared $\ell_2$ penalty on the returned posteriors, i.e., $\tfrac{1}{2}\|u\|_2^2$. Denoting, as above, $\theta := A^\top \eta$, the result is a quadratic optimization problem,

$$\operatorname{SparseMAP}_A(\eta) := \arg\max_{u := My,\; y \in \Delta^D} \theta^\top y - \tfrac{1}{2}\|My\|_2^2 = \arg\max_{u:\,[u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v - \tfrac{1}{2}\|u\|_2^2. \tag{5}$$

The quadratic penalty replaces the entropic penalty from marginal inference (Equation 4), which pushes the solutions to the strict interior of the marginal polytope. In consequence, SparseMAP favors sparse solutions from the faces of the marginal polytope $\mathcal{M}_A$, as illustrated in Figure 1. For the structured prediction problems mentioned in Section 2.2, SparseMAP would be able to return, for example, a sparse combination of sequence labelings, parse trees, or matchings. Moreover, the strongly convex regularization on $u$ ensures that SparseMAP has a unique solution and is differentiable almost everywhere, as we will see.
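A short worked identity may help build intuition for the quadratic penalty. Under the simplifying assumption that there are no higher-order potentials (as in the parsing and matching examples of Section 2.2), and writing $\eta_U$ for the unary scores, $u$ for the returned posteriors, and $\mathcal{M}$ for the marginal polytope (notation assumed here), completing the square shows that the regularized problem is a Euclidean projection of the scores onto the polytope:

```latex
% Assumption: no higher-order potentials, so the variables reduce to u alone.
\begin{aligned}
\eta_U^\top u - \tfrac{1}{2}\|u\|_2^2
  &= -\tfrac{1}{2}\|u - \eta_U\|_2^2 + \tfrac{1}{2}\|\eta_U\|_2^2, \\
\text{hence}\quad
\arg\max_{u \in \mathcal{M}} \; \eta_U^\top u - \tfrac{1}{2}\|u\|_2^2
  &= \arg\min_{u \in \mathcal{M}} \; \|u - \eta_U\|_2^2 .
\end{aligned}
```

When the polytope is the simplex, this is exactly the projection that defines sparsemax; projections onto a polytope tend to land on low-dimensional faces, which is where the sparsity comes from.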

3.2 Solving

Figure 2: Comparison of solvers on the SparseMAP optimization problem for a tree factor with 20 nodes. The active set solver converges much faster and to a much sparser solution.

We now tackle the optimization problem in Equation 5. Although SparseMAP is a QP over a polytope, even describing it in standard form is infeasible, because doing so requires enumerating the exponentially large set of vertices. This prevents direct application of, e.g., the generic differentiable QP solver of Amos & Kolter (2017). We instead focus on solvers that involve a sequence of MAP problems as a subroutine; this makes SparseMAP widely applicable, given the availability of MAP implementations for various structures. We discuss two such methods, one based on the conditional gradient algorithm and another based on the active set method for quadratic programming. We provide a full description of both methods in Appendix A.

Conditional gradient. One family of such solvers is based on the conditional gradient (CG) algorithm (Frank & Wolfe, 1956; Lacoste-Julien & Jaggi, 2015), considered in prior work for solving approximations of the marginal inference problem (Belanger et al., 2013; Krishnan et al., 2015). Each step must solve a linearized subproblem. Denote by $f$ the SparseMAP objective from Equation 5,

$$f(u, v) := \eta_U^\top u + \eta_F^\top v - \tfrac{1}{2}\|u\|_2^2.$$

The gradients of $f$ with respect to the two variables are

$$\nabla_u f(u', v') = \eta_U - u', \qquad \nabla_v f(u', v') = \eta_F.$$

A linear approximation to $f$ around a point $[u'; v'] \in \mathcal{M}_A$ is

$$\hat{f}(u, v) := (\nabla_u f)^\top u + (\nabla_v f)^\top v = (\eta_U - u')^\top u + \eta_F^\top v.$$

Maximizing $\hat{f}$ over $\mathcal{M}_A$ is exactly MAP inference with adjusted variable scores $\eta_U - u'$. Intuitively, at each step we seek a high-scoring structure while penalizing sharing variables with already-selected structures. Vanilla CG simply adds the new structure to the active set at every iteration. The pairwise and away-step variants trade off between the direction toward the new structure, and away from one of the already-selected structures. More sophisticated variants have been proposed (Garber & Meshi, 2016) which can provide sparse solutions when optimizing over a polytope.
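The vanilla variant can be sketched in a few lines of pure Python. This is an illustrative sketch under simplifying assumptions, not the authors' solver: it handles only unary scores (no higher-order potentials), uses the closed-form line search available for a quadratic objective, and takes the MAP oracle as a hypothetical callable `map_oracle(scores)` that returns the best vertex as a list. The demo uses the unstructured case, where vertices are one-hot vectors and the solution should coincide with sparsemax.

```python
def conditional_gradient(eta, map_oracle, n_iter=1000):
    """Vanilla conditional gradient for max_u eta.u - 0.5*||u||^2 over a polytope.

    map_oracle(scores) must return the vertex (a list) maximizing scores.u.
    """
    u = map_oracle(eta)  # start from the MAP vertex
    for _ in range(n_iter):
        grad = [e - ui for e, ui in zip(eta, u)]     # gradient of the objective: eta - u
        s = map_oracle(grad)                         # MAP call with adjusted scores
        d = [si - ui for si, ui in zip(s, u)]        # direction toward the new vertex
        gap = sum(g * di for g, di in zip(grad, d))  # linearized improvement (duality gap)
        norm = sum(di * di for di in d)
        if gap <= 1e-12 or norm == 0.0:
            break                                    # no vertex improves the linearization
        gamma = min(1.0, gap / norm)                 # exact line search, clipped to [0, 1]
        u = [ui + gamma * di for ui, di in zip(u, d)]
    return u

def one_hot_argmax(scores):
    """MAP oracle for the unstructured case: vertices are one-hot vectors."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

u = conditional_gradient([1.2, 1.0, -0.5], one_hot_argmax)
print(u)  # close to the sparsemax solution (0.6, 0.4, 0)
```

Swapping `one_hot_argmax` for a structured oracle (for instance a matching or parsing solver that returns indicator vectors) is all that is needed to change the structure, which is the modularity the text emphasizes. On harder instances vanilla CG can converge slowly, which motivates the away-step, pairwise, and active set alternatives.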

Active set method. Importantly, the problem in Equation 5 has quadratic curvature, which the general CG algorithms may not optimally leverage. For this reason, we consider the active set method for constrained QPs: a generalization of Wolfe’s min-norm point algorithm (Wolfe, 1976), also used in structured prediction for the quadratic subproblems by Martins et al. (2015). The active set algorithm, at each iteration, updates an estimate of the solution support by adding or removing one constraint to/from the active set; then it solves the Karush–Kuhn–Tucker (KKT) system of a relaxed QP restricted to the current support.

Comparison. Both algorithms enjoy global linear convergence with similar rates (Lacoste-Julien & Jaggi, 2015), but the active set algorithm also exhibits exact finite convergence, which allows it, for instance, to capture the optimal sparsity pattern (Nocedal & Wright, 1999, Ch. 16.4 & 16.5). Vinyes & Obozinski (2017) provide a more in-depth discussion of the connections between the two algorithms. We perform an empirical comparison on a dependency parsing instance with random potentials. Figure 2 shows that active set substantially outperforms all CG variants, both in objective value and in solution sparsity, suggesting that the quadratic curvature makes SparseMAP solvable to high accuracy in very few iterations. We therefore use the active set solver in the remainder of the paper.

3.3 Backpropagating Gradients through SparseMAP

In order to use SparseMAP as a neural network layer trained with backpropagation, one must compute products of the SparseMAP Jacobian with a vector $p$. Computing the Jacobian of an optimization problem is an active research topic known as argmin differentiation, and is generally difficult. Fortunately, as we show next, argmin differentiation is always easy and efficient in the case of SparseMAP.

Proposition 1

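Denote a SparseMAP solution by $y^\star$ and its support by $\mathcal{I} := \{s : y^\star_s > 0\}$. Then, SparseMAP is differentiable almost everywhere with Jacobian

$$\frac{\partial u^\star}{\partial \eta} = M\, D(\mathcal{I})\, A^\top,$$

where $D(\mathcal{I})$ is zero outside the rows and columns indexed by $\mathcal{I}$, and its restriction to the support is $Z - \frac{Z \mathbf{1}\mathbf{1}^\top Z}{\mathbf{1}^\top Z \mathbf{1}}$, with $Z := (M_{\mathcal{I}}^\top M_{\mathcal{I}})^{-1}$.

The proof, given in Appendix B, relies on the KKT conditions of the SparseMAP QP. Importantly, because $D(\mathcal{I})$ is zero outside of the support of the solution, computing the Jacobian only requires the columns of $M$ and $A$ corresponding to the structures in the active set. Moreover, when using the active set algorithm discussed in §3.2, the matrix $Z$ is readily available as a byproduct of the forward pass. The backward pass can, therefore, be computed in $O(k|\mathcal{I}|)$.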
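The KKT argument can be sketched as follows. This is an informal outline rather than the proof from Appendix B, and it assumes that the support $\mathcal{I}$ stays fixed under small perturbations and that $M_{\mathcal{I}}^\top M_{\mathcal{I}}$ is invertible; the subscript $\mathcal{I}$ (written $I$ below) restricts vectors and matrix columns to the selected structures, and $\tau$ is the multiplier of the sum-to-one constraint.

```latex
% Problem restricted to the support I (nonnegativity is inactive there).
\begin{aligned}
&\max_{y_I}\ \theta_I^\top y_I - \tfrac{1}{2}\, y_I^\top M_I^\top M_I\, y_I
  \quad \text{s.t.}\quad \mathbf{1}^\top y_I = 1, \\
&\text{stationarity:}\quad M_I^\top M_I\, y_I + \tau \mathbf{1} = \theta_I
  \ \Rightarrow\ y_I = Z(\theta_I - \tau \mathbf{1}),
  \quad Z := (M_I^\top M_I)^{-1}, \\
&\text{feasibility:}\quad \mathbf{1}^\top y_I = 1
  \ \Rightarrow\ \tau = \frac{\mathbf{1}^\top Z\, \theta_I - 1}{\mathbf{1}^\top Z\, \mathbf{1}}, \\
&\frac{\partial y_I}{\partial \theta_I}
  = Z - \frac{Z \mathbf{1}\mathbf{1}^\top Z}{\mathbf{1}^\top Z\, \mathbf{1}},
  \qquad
  \frac{\partial u}{\partial \eta}
  = M_I\, \frac{\partial y_I}{\partial \theta_I}\, A_I^\top .
\end{aligned}
```

The last step is the chain rule through $u = M_I y_I$ and $\theta_I = A_I^\top \eta$. A Jacobian-vector product therefore only touches the columns of the selected structures and one small linear system.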

Our approach for gradient computation draws its efficiency from the solution sparsity and does not depend on the type of structure considered. This is contrasted with two related lines of research. The first is “unrolling” iterative inference algorithms, for instance belief propagation (Stoyanov et al., 2011; Domke, 2013) and gradient descent (Belanger et al., 2017), where the backward pass complexity scales with the number of iterations. In the second, employed by Kim et al. (2017), when inference can be performed via dynamic programming, backpropagation can be performed using second-order expectation semirings (Li & Eisner, 2009) or more general smoothing (Mensch & Blondel, 2018), in the same time complexity as the forward pass. Moreover, in our approach, neither the forward nor the backward passes involve logarithms, exponentiations or log-domain classes, avoiding the slowdown and stability issues normally incurred.

In the unstructured case, since $M = I$, $Z$ is also an identity matrix, uncovering the sparsemax Jacobian (Martins & Astudillo, 2016). In general, structures are not necessarily orthogonal, but may have degrees of overlap.

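4 Structured Fenchel-Young Losses and the SparseMAP Loss

With the efficient algorithms derived above in hand, we switch gears to defining a SparseMAP loss function. Structured output prediction models are typically trained by minimizing a structured loss measuring the discrepancy between the desired structure (encoded, for instance, as an indicator vector $y = e_s$) and the prediction induced by the log-potentials $\eta$. We provide here a general family of structured prediction losses that will make the newly proposed SparseMAP loss arise as a very natural case. Below, we let $\Omega : \mathbb{R}^D \to \mathbb{R}$ denote a convex penalty function and denote by $\Omega_\Delta$ its restriction to $\Delta^D \subset \mathbb{R}^D$, i.e.,

$$\Omega_\Delta(y) := \begin{cases} \Omega(y), & y \in \Delta^D, \\ +\infty, & \text{otherwise.} \end{cases}$$

The Fenchel convex conjugate of $\Omega_\Delta$ is

$$\Omega_\Delta^\star(\theta) := \sup_{y \in \mathbb{R}^D} \theta^\top y - \Omega_\Delta(y) = \sup_{y \in \Delta^D} \theta^\top y - \Omega(y).$$

We next introduce a family of structured prediction losses, named after the corresponding Fenchel-Young duality gap.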

Definition 1 (Fenchel-Young losses)

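Given a convex penalty function $\Omega : \mathbb{R}^D \to \mathbb{R}$, and a $(k \times D)$-dimensional matrix $A = [M; N]$ encoding the structure of the problem, we define the following family of structured losses:

$$\ell_{\Omega, A}(\eta, y) := \Omega_\Delta^\star(A^\top \eta) + \Omega_\Delta(y) - \eta^\top A y. \tag{6}$$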
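This family, studied in more detail by Blondel et al. (2018), includes the commonly-used structured losses:

  • Structured perceptron (Collins, 2002): $\Omega \equiv 0$;

  • Structured SVM (Taskar et al., 2003; Tsochantaridis et al., 2004): $\Omega \equiv -\rho(\cdot, \bar{y})$ for a cost function $\rho$, where $\bar{y}$ is the true output;

  • CRF (Lafferty et al., 2001): $\Omega \equiv -H$;

  • Margin CRF (Gimpel & Smith, 2010): $\Omega \equiv -H - \rho(\cdot, \bar{y})$.

This leads to a natural way of defining SparseMAP losses, by plugging the following into Equation 6:

  • SparseMAP loss: $\Omega(y) = \tfrac{1}{2}\|My\|_2^2$,

  • Margin SparseMAP: $\Omega(y) = \tfrac{1}{2}\|My\|_2^2 - \rho(y, \bar{y})$.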

It is well-known that the subgradients of structured perceptron and SVM losses consist of MAP inference, while the CRF loss gradient requires marginal inference. Similarly, the subgradients of the SparseMAP loss can be computed via SparseMAP inference, which in turn only requires MAP. The next proposition states properties of structured Fenchel-Young losses, including a general connection between a loss and its corresponding inference method.

Proposition 2

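Consider a convex $\Omega$ and a structured model defined by the matrix $A \in \mathbb{R}^{k \times D}$. Denote the inference objective $f_\Omega(y) := \eta^\top A y - \Omega(y)$, and a solution $y^\star := \arg\max_{y \in \Delta^D} f_\Omega(y)$. Then, the following properties hold:

  • $\ell_{\Omega, A}(\eta, \bar{y}) \geq 0$, with equality when $f_\Omega(\bar{y}) = f_\Omega(y^\star)$;

  • $\ell_{\Omega, A}(\eta, \bar{y})$ is convex in $\eta$, and $\partial \ell_{\Omega, A}(\eta, \bar{y}) \ni A(y^\star - \bar{y})$;

  • $\ell_{t\Omega, A}(\eta, y) = t\, \ell_{\Omega, A}(\eta / t, y)$ for any $t > 0$.

The proof is given in Appendix C. Property 1 suggests that minimizing $\ell_{\Omega, A}$ aligns models with the true label. Property 2 shows how to compute subgradients of $\ell_{\Omega, A}$ provided access to the inference output $A y^\star$. Combined with our efficient procedure described in Section 3.2, it makes the SparseMAP losses promising for structured prediction. Property 3 suggests that the strength of the penalty $\Omega$ can be adjusted by simply scaling $\eta$. Finally, we remark that for a strongly-convex $\Omega$, $\ell_{\Omega, A}$ can be seen as a smoothed perceptron loss; other smoothed losses have been explored by Shalev-Shwartz & Zhang (2016).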
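To see the family at work on a case small enough to check by hand, the sketch below evaluates three Fenchel-Young losses in the unstructured setting, where the structure matrix is the identity and the true output is a one-hot vector. This is an illustrative simplification: the function name `fy_loss` and the string labels are hypothetical, and the structured case would replace the closed-form conjugates with calls to the corresponding inference routine.

```python
import math

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (sort-based)."""
    z = sorted(theta, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(t - tau, 0.0) for t in theta]

def fy_loss(theta, true_index, penalty):
    """Fenchel-Young loss: conjugate(theta) + penalty(e_true) - theta[true]."""
    if penalty == "perceptron":   # zero penalty: the conjugate is max(theta)
        conj = max(theta)
    elif penalty == "crf":        # negative entropy: the conjugate is log-sum-exp
        m = max(theta)
        conj = m + math.log(sum(math.exp(t - m) for t in theta))
    elif penalty == "sparsemap":  # half squared norm: evaluate at the sparsemax solution
        p = sparsemax(theta)
        conj = (sum(t * pi for t, pi in zip(theta, p))
                - 0.5 * sum(pi * pi for pi in p))
    else:
        raise ValueError(penalty)
    # The penalty of a one-hot vector is 0.5 for the squared norm and 0 otherwise.
    omega_true = 0.5 if penalty == "sparsemap" else 0.0
    return conj + omega_true - theta[true_index]

theta = [1.2, 1.0, -0.5]
for name in ("perceptron", "crf", "sparsemap"):
    print(name, fy_loss(theta, 0, name))
```

Consistent with Property 1, all three values are nonnegative; the squared-norm loss reaches exactly zero once the true label wins by a sufficient margin (for example with scores `[3.0, 0.0, 0.0]`), whereas the log-sum-exp loss stays strictly positive. By Property 2, a subgradient with respect to the scores is the inference output minus the one-hot target.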

5 Experimental Results

In this section, we experimentally validate SparseMAP on two natural language processing applications, illustrating the two main use cases presented: structured output prediction with the SparseMAP loss (§5.1) and structured hidden layers (§5.2). All models are implemented using the dynet library v2.0.2 (Neubig et al., 2017).

5.1 Dependency Parsing with the SparseMAP Loss

Loss              en     zh     vi     ro     ja
Structured SVM    87.02  81.94  69.42  87.58  96.24
CRF               86.74  83.18  69.10  87.13  96.09
SparseMAP         86.90  84.03  69.71  87.35  96.04
m-SparseMAP       87.34  82.63  70.87  87.63  96.03
UDPipe baseline   87.68  82.14  69.63  87.36  95.94

Table 1: Unlabeled attachment accuracy scores for dependency parsing, using a bi-LSTM model (Kiperwasser & Goldberg, 2016). SparseMAP and its margin version, m-SparseMAP, produce the best parser on 4/5 datasets. For context, we include the scores of the CoNLL 2017 UDPipe baseline, which is trained under the same conditions (Straka & Straková, 2017).

[Figure 3: dependency arcs drawn over the sentences "They did a vehicle wrap for my Toyota Venza that looks amazing ." and "the broccoli looks browned around the edges ."; ambiguous arcs are labeled with their weights.]

Figure 3: Example of ambiguous parses from the UD English validation set. SparseMAP selects a small number of candidate parses (left: three, right: two), differing from each other in a small number of ambiguous dependency arcs. In both cases, the desired gold parse is among the selected trees (depicted by the arcs above the sentence), but it is not the highest-scoring one.

Figure 4: Distribution of the tree sparsity (top) and arc sparsity (bottom) of SparseMAP solutions during training on the Chinese dataset. Shown are respectively the number of trees and the average number of parents per word with nonzero probability.

We evaluate the SparseMAP losses against the commonly used CRF and structured SVM losses. The task we focus on is non-projective dependency parsing: a structured output task consisting of predicting the directed tree of grammatical dependencies between words in a sentence (Jurafsky & Martin, 2018, Ch. 14). We use annotated Universal Dependency data (Nivre et al., 2016), as used in the CoNLL 2017 shared task (Zeman et al., 2017). To isolate the effect of the loss, we use the provided gold tokenization and part-of-speech tags. We follow closely the bidirectional LSTM arc-factored parser of Kiperwasser & Goldberg (2016), using the same model configuration; the only exception is not using externally pretrained embeddings. Parameters are trained using Adam (Kingma & Ba, 2015), tuning the learning rate on a grid that is expanded by a factor of 2 if the best model is at either end.

We experiment with 5 languages, diverse both in terms of family and in terms of the amount of training data (ranging from 1,400 sentences for Vietnamese to 12,525 for English). Test set results (Table 1) indicate that the SparseMAP losses outperform the SVM and CRF losses on 4 out of the 5 languages considered. This suggests that SparseMAP is a good middle ground between MAP-based and marginal-based losses in terms of smoothness and gradient sparsity.

Moreover, as illustrated in Figure 4, the SparseMAP loss encourages sparse predictions: models converge towards sparser solutions as they train, yielding very few ambiguous arcs. When confident, SparseMAP can predict a single tree. Otherwise, the small set of candidate parses returned can be easily visualized, often indicating genuine linguistic ambiguities (Figure 3). Returning a small set of parses, also sought concomitantly by Keith et al. (2018), is valuable in pipeline systems, e.g., when the parse is an input to a downstream application: error propagation is diminished in cases where the highest-scoring tree is incorrect (which is the case for the sentences in Figure 3). Unlike $k$-best heuristics, SparseMAP dynamically adjusts its output sparsity, which is desirable on realistic data where most instances are easy.

5.2 Latent Structured Alignment for Natural Language Inference

[Figure 5 panels: (a) softmax, (b) sequence, (c) matching.]

Figure 5: Latent alignments on an example from the SNLI validation set, correctly predicted as neutral by all compared models. The premise is on the $y$-axis, the hypothesis on the $x$-axis. Top: columns sum to 1; bottom: rows sum to 1. The matching alignment mechanism yields a symmetrical alignment, and is thus shown only once. Softmax yields a dense alignment (nonzero weights are marked with a border). The structures selected by sequential alignment are overlaid as paths; the selected matchings are displayed in the top right.
ESIM variant   MultiNLI       SNLI
softmax        76.05 (100%)   86.52 (100%)
sequential     75.54 (13%)    86.62 (19%)
matching       76.13 (8%)     86.05 (15%)

Table 2: Test accuracy scores for natural language inference with structured and unstructured variants of ESIM. In parentheses: the percentage of pairs of words with nonzero alignment scores.

In this section, we demonstrate SparseMAP for inferring latent structure in large-scale deep neural networks. We focus on the task of natural language inference, defined as the classification problem of deciding, given two sentences (a premise and a hypothesis), whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it.

We consider novel structured variants of the state-of-the-art ESIM model (Chen et al., 2017). Given a premise and a hypothesis, possibly of different lengths, ESIM:

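  • Encodes the premise and the hypothesis with an LSTM.

  • Computes a matrix of alignment scores, whose entries are the inner products between each encoded premise word and each encoded hypothesis word.

  • Computes premise-to-hypothesis and hypothesis-to-premise alignments using row-wise, respectively column-wise, softmax on the score matrix.

  • Augments premise words with the weighted average of their aligned hypothesis words, and vice versa.

  • Passes the result through another LSTM, then predicts.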
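The alignment steps in the middle of this list (scoring, normalizing, and averaging) can be sketched in pure Python as follows. This is a simplified illustration, not the ESIM implementation: encoded words are plain lists of floats, the function and variable names are hypothetical, and the augmentation step is reduced to computing the weighted averages themselves.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(premise, hypothesis):
    """Dense alignments between two lists of encoded words (lists of floats)."""
    # Inner-product alignment scores, one row per premise word.
    scores = [[sum(a * b for a, b in zip(p, h)) for h in hypothesis]
              for p in premise]
    # Row-wise softmax aligns each premise word to the hypothesis;
    # column-wise softmax aligns each hypothesis word to the premise.
    p_to_h = [softmax(row) for row in scores]
    h_to_p = [softmax(list(col)) for col in zip(*scores)]
    # Weighted averages of the aligned words.
    dim = len(premise[0])
    p_ctx = [[sum(w * h[d] for w, h in zip(weights, hypothesis)) for d in range(dim)]
             for weights in p_to_h]
    h_ctx = [[sum(w * p[d] for w, p in zip(weights, premise)) for d in range(dim)]
             for weights in h_to_p]
    return p_to_h, h_to_p, p_ctx, h_ctx

premise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hypothesis = [[1.0, 0.0], [0.0, 2.0]]
p_to_h, h_to_p, p_ctx, h_ctx = soft_align(premise, hypothesis)
print(p_to_h)  # every weight is nonzero: the alignment is dense
```

Because each row (or column) is normalized independently and softmax never outputs zeros, every word pair receives some weight; the structured variants replace these two softmax calls with a single sparse inference step over global alignments.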

We consider the following structured replacements for the independent row-wise and column-wise softmaxes (step 3):

Sequential alignment. We model the alignment of the premise to the hypothesis as a sequence tagging instance with one position per premise word, where the possible tags correspond to the words of the hypothesis. Through transition scores, we enable the model to capture continuity and monotonicity of alignments: we parametrize transitioning from one hypothesis word to another by binning the distance between them into 5 groups. We similarly parametrize the initial alignment and the final alignment using bins, allowing the model to express whether an alignment starts at the beginning or ends on the final word of the hypothesis.

We align the hypothesis to the premise by applying the same method in the other direction, with different transition scores. Overall, sequential alignment requires learning 18 additional scalar parameters.

Matching alignment. We now seek a symmetrical alignment in both directions simultaneously. To this end, we cast the alignment problem as finding a maximal weight bipartite matching. We recall from §2.2 that a solution can be found via the Hungarian algorithm (in contrast to marginal inference, which is #P-complete). When the premise and the hypothesis have the same length, maximal matchings can be represented as permutation matrices; when the lengths differ, some words remain unaligned. SparseMAP returns a weighted average of a few maximal matchings. This method requires no additional learned parameters.

We evaluate the two models alongside the softmax baseline on the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets. (Footnote 3: We split the MultiNLI matched validation set into equal validation and test sets; for SNLI we use the provided split.) All models are trained by SGD, with learning rate decay at epochs when the validation accuracy is not the best seen. We tune the learning rate on a grid, extending the range if the best model is at either end. The results in Table 2 show that structured alignments are competitive with softmax in terms of accuracy, but are far sparser: only 8% to 19% of word pairs receive nonzero alignment scores. This sparsity allows them to produce global alignment structures that are interpretable, as illustrated in Figure 5.

Interestingly, we observe computational advantages of sparsity. Despite the overhead of GPU memory copying, both training and validation in our latent structure models take roughly the same time as with softmax and become faster as the models grow more certain. For the sake of comparison, Kim et al. (2017) report a slow-down in their structured attention networks, where they use marginal inference.

6 Related Work

Structured attention networks. Kim et al. (2017) and Liu & Lapata (2018) take advantage of the tractability of marginal inference in certain structured models and derive specialized backward passes for structured attention. In contrast, our approach is modular and general: with , the forward pass only requires MAP inference, and the backward pass is efficiently computed based on the forward pass results. Moreover, unlike marginal inference, yields sparse solutions, which is an appealing property statistically, computationally, and visually.

-best inference. As it returns a small set of structures, brings to mind -best inference, often used in pipeline NLP systems for increasing recall and handling uncertainty (Yang & Cardie, 2013). -best inference can be approximated (or, in some cases, solved), roughly times slower than MAP inference (Yanover & Weiss, 2004; Camerini et al., 1980; Chegireddy & Hamacher, 1987; Fromer & Globerson, 2009). The main advantages of are convexity, differentiablity, and modularity, as can be computed in terms of MAP subproblems. Moreover, it yields a distribution, unlike -best, which does not reveal the gap between selected structures,

Learning permutations. A popular approach for differentiable permutation learning involves mean-entropic optimal transport relaxations (Adams & Zemel, 2011; Mena et al., 2018). Unlike , this does not apply to general structures, and solutions are not directly expressible as combinations of a few permutations.

Regularized inference. Ravikumar et al. (2010), Meshi et al. (2015), and Martins et al. (2015) proposed perturbations and penalties in various related ways, with the goal of solving LP-MAP approximate inference in graphical models. In contrast, the goal of our work is sparse structured prediction, which is not considered in the aforementioned work. Nevertheless, some of the formulations in their work share properties with SparseMAP; exploring the connections further is an interesting avenue for future work.

7 Conclusion

We introduced a new framework for sparse structured inference, SparseMAP, along with a corresponding loss function. We proposed efficient ways to compute the forward and backward passes of SparseMAP. Experimental results illustrate two use cases where sparse inference is well-suited. For structured prediction, the SparseMAP loss leads to strong models that make sparse, interpretable predictions, a good fit for tasks where local ambiguities are common, like many natural language processing tasks. For structured hidden layers, we demonstrated that SparseMAP leads to strong, interpretable networks trained end-to-end. Modular by design, SparseMAP can be applied readily to any structured problem for which MAP inference is available, including combinatorial problems such as linear assignment.


Acknowledgements

We thank Tim Vieira, David Belanger, Jack Hessel, Justine Zhang, Sydney Zink, the Unbabel AI Research team, and the three anonymous reviewers for their insightful comments. This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2013, PTDC/EEI-SII/7092/2014 (LearnBig), and CMUPERI/TIC/0046/2014 (GoLocal).


Appendix A Implementation Details for Solvers

A.1 Conditional Gradient Variants

We adapt the presentation of vanilla, away-step and pairwise conditional gradient of Lacoste-Julien & Jaggi (2015).

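Recall the SparseMAP optimization problem (Equation 5), which we rewrite below as a minimization, to align with the formulation in (Lacoste-Julien & Jaggi, 2015):

$$\operatorname{SparseMAP}_A(\eta_U, \eta_F) := \operatorname*{arg\,min}_{u \,:\, [u, v] \in \mathcal{M}_A} f(u, v), \quad \text{where} \quad f(u, v) := \tfrac{1}{2} \|u\|_2^2 - \eta_U^\top u - \eta_F^\top v.$$

The gradients of the objective function $f$ w.r.t. the two variables are

$$\nabla_u f(u', v') = u' - \eta_U, \qquad \nabla_v f(u', v') = -\eta_F.$$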

The ingredients required to apply conditional gradient algorithms are solving the linear minimization problem, selecting the away step, computing the Wolfe gap, and performing line search.

Linear minimization problem.

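For SparseMAP, this amounts to a MAP inference call, since

$$\operatorname*{arg\,min}_{[u, v] \in \mathcal{M}_A} \langle \nabla_u f(u', v'), u \rangle + \langle \nabla_v f(u', v'), v \rangle = \operatorname*{arg\,min}_{[u, v] \in \mathcal{M}_A} (u' - \eta_U)^\top u - \eta_F^\top v = \{ [m_s, n_s] : s \in \operatorname{MAP}_A(\eta_U - u', \eta_F) \},$$

where $[m_s, n_s]$ is the vertex of $\mathcal{M}_A$ associated with structure $s$, and we assume $\operatorname{MAP}_A$ yields the set of maximally-scoring structures.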
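As a minimal sketch of this reduction, assume the structures can be enumerated explicitly as the columns of two NumPy arrays, `M` (the part entering the quadratic term) and `N` (the remaining part). The brute-force `map_oracle` and the helper name `linear_minimization` are hypothetical stand-ins for a real combinatorial MAP solver, not part of any released implementation.

```python
import numpy as np

def map_oracle(M, N, eta_u, eta_f):
    """Brute-force MAP: index of the highest-scoring structure (column)."""
    scores = M.T @ eta_u + N.T @ eta_f
    return int(np.argmax(scores))

def linear_minimization(M, N, eta_u, eta_f, u_cur):
    """Vertex minimizing the linearized objective at u_cur, via one MAP call.

    The gradient of f at u_cur is (u_cur - eta_u, -eta_f), so minimizing the
    linearization equals maximizing the score under (eta_u - u_cur, eta_f).
    """
    s = map_oracle(M, N, eta_u - u_cur, eta_f)
    return s, M[:, s], N[:, s]
```

In a real model the enumeration is replaced by a dedicated solver (for example a maximum spanning tree or assignment algorithm); only the adjusted scores change between calls.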

Away step selection.

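This step involves searching the currently selected structures in the active set $\mathcal{I}$ with the opposite goal: finding the structure maximizing the linearization

$$\operatorname*{arg\,max}_{s \in \mathcal{I}} \langle \nabla_u f(u', v'), m_s \rangle + \langle \nabla_v f(u', v'), n_s \rangle = \operatorname*{arg\,max}_{s \in \mathcal{I}} (u' - \eta_U)^\top m_s - \eta_F^\top n_s.$$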

Wolfe gap.

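The gap at a point $d = [d_u, d_v]$ is given by

$$\operatorname{gap}(d, u') := \langle -\nabla_u f(u', v'), d_u \rangle + \langle -\nabla_v f(u', v'), d_v \rangle = \langle \eta_U - u', d_u \rangle + \langle \eta_F, d_v \rangle.$$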


Line search.

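Once we have picked a direction $d = [d_u, d_v]$, we can pick the optimal step size by solving a simple optimization problem. Let $u_\gamma := u' + \gamma d_u$, and $v_\gamma := v' + \gamma d_v$. We seek $\gamma$ so as to optimize

$$\min_{\gamma \in [0, \gamma_{\max}]} f(u_\gamma, v_\gamma).$$

Setting the gradient w.r.t. $\gamma$ to $0$ yields

$$0 = \frac{\partial}{\partial \gamma} f(u_\gamma, v_\gamma) = \langle d_u, \nabla_u f(u_\gamma, v_\gamma) \rangle + \langle d_v, \nabla_v f(u_\gamma, v_\gamma) \rangle = \langle d_u, u' + \gamma d_u - \eta_U \rangle - \langle d_v, \eta_F \rangle = \gamma \|d_u\|_2^2 + d_u^\top u' - \eta_U^\top d_u - \eta_F^\top d_v.$$

We may therefore compute the optimal step size as

$$\gamma = \max\left(0, \min\left(\gamma_{\max}, \frac{\eta_U^\top d_u + \eta_F^\top d_v - d_u^\top u'}{\|d_u\|_2^2}\right)\right). \tag{8}$$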
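The clipped closed-form step size from the line search can be sketched in a few lines of NumPy. The function name and argument layout are illustrative assumptions, and the handling of the degenerate case where the direction has no component in the quadratic variable is an addition not spelled out in the text.

```python
import numpy as np

def optimal_step_size(eta_u, eta_f, u_cur, d_u, d_v, gamma_max):
    """Minimize f(u_cur + gamma * d_u, v_cur + gamma * d_v) over [0, gamma_max]."""
    # Negative slope of f along the direction at gamma = 0.
    numer = eta_u @ d_u + eta_f @ d_v - u_cur @ d_u
    denom = d_u @ d_u
    if denom == 0.0:
        # f is linear along d: move as far as allowed if it decreases, else stay.
        return gamma_max if numer > 0 else 0.0
    # Unconstrained minimizer of the 1-d quadratic, clipped to the interval.
    return float(np.clip(numer / denom, 0.0, gamma_max))
```

Because the objective is a one-dimensional convex quadratic in the step size, clipping the unconstrained minimizer to the feasible interval gives the exact constrained minimizer.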

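1:  Initialization: $s^{(0)} \leftarrow \operatorname{MAP}_A(\eta_U, \eta_F)$; $\mathcal{I}^{(0)} = \{s^{(0)}\}$; $y^{(0)} = e_{s^{(0)}}$; $[u^{(0)}; v^{(0)}] = [m_{s^{(0)}}; n_{s^{(0)}}]$
2:  for $t = 0, \dots, t_{\max}$ do
3:     $s \leftarrow \operatorname{MAP}_A(\eta_U - u^{(t)}, \eta_F)$; $d^F \leftarrow [m_s; n_s] - [u^{(t)}; v^{(t)}]$  (forward direction)
4:     $w \leftarrow \operatorname*{arg\,max}_{w \in \mathcal{I}^{(t)}} (u^{(t)} - \eta_U)^\top m_w - \eta_F^\top n_w$; $d^W \leftarrow [u^{(t)}; v^{(t)}] - [m_w; n_w]$  (away direction)
5:     if $\operatorname{gap}(d^F, u^{(t)}) < \epsilon$ then
6:        return $u^{(t)}$ (Equation 7)
7:     end if
8:     if variant = vanilla then
9:        $d \leftarrow d^F$; $\gamma_{\max} \leftarrow 1$
10:     else if variant = pairwise then
11:        $d \leftarrow d^F + d^W$; $\gamma_{\max} \leftarrow y_w$
12:     else if variant = away-step then
13:        if $\operatorname{gap}(d^F, u^{(t)}) \geq \operatorname{gap}(d^W, u^{(t)})$ then
14:           $d \leftarrow d^F$; $\gamma_{\max} \leftarrow 1$
15:        else
16:           $d \leftarrow d^W$; $\gamma_{\max} \leftarrow y_w / (1 - y_w)$
17:        end if
18:     end if
19:     Compute step size $\gamma$ (Equation 8)
20:     $[u^{(t+1)}; v^{(t+1)}] \leftarrow [u^{(t)}; v^{(t)}] + \gamma d$
21:     Update $\mathcal{I}^{(t+1)}$ and $y^{(t+1)}$ accordingly.
22:  end for
Algorithm 1 Conditional gradient for SparseMAP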
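To make the loop concrete, here is a hedged sketch of the vanilla variant only, under the simplifying assumption that all structures are enumerated as columns of dense arrays `M` and `N`, so that the MAP oracle is a plain `argmax`. The function name, the tolerance, and the dense weight vector `y` are illustrative choices; the away-step and pairwise variants and any active-set bookkeeping are omitted.

```python
import numpy as np

def sparsemap_vanilla_cg(M, N, eta_u, eta_f, max_iter=1000, tol=1e-9):
    """Vanilla conditional gradient over explicitly enumerated structures.

    Columns of M and N hold the vectors m_s and n_s of each structure s.
    Returns the estimate u and the weights y over structures.
    """
    n_structures = M.shape[1]
    y = np.zeros(n_structures)
    y[np.argmax(M.T @ eta_u + N.T @ eta_f)] = 1.0  # start from the MAP structure
    for _ in range(max_iter):
        u = M @ y
        # Forward direction: one (brute-force) MAP call on adjusted scores.
        s = np.argmax(M.T @ (eta_u - u) + N.T @ eta_f)
        d_y = -y.copy()
        d_y[s] += 1.0
        d_u, d_v = M @ d_y, N @ d_y
        gap = (eta_u - u) @ d_u + eta_f @ d_v  # Wolfe gap of the forward direction
        if gap < tol:
            break
        sq_norm = d_u @ d_u
        # For the forward direction the line-search numerator equals the gap,
        # which is positive here, so only the upper clip at 1 is needed.
        gamma = 1.0 if sq_norm == 0.0 else min(1.0, gap / sq_norm)
        y += gamma * d_y
    return M @ y, y
```

With `M` set to the identity and `N` empty, the structures are the vertices of the simplex and the result should coincide with the Euclidean projection of the scores onto the simplex, which gives a convenient sanity check.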

A.2 The Active Set Algorithm

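We use a variant of the active set algorithm (Nocedal & Wright, 1999, Ch. 16.4 & 16.5) as proposed for the quadratic subproblems of the AD$^3$ algorithm; our presentation follows (Martins et al., 2015, Algorithm 3). At each step, the active set algorithm solves a relaxed variant of the SparseMAP QP, relaxing the non-negativity constraint on $y$, and restricting the solution to the current active set $\mathcal{I}$. Writing $M_\mathcal{I}$ for the matrix whose columns are the vectors $m_s$, and $\theta_\mathcal{I}$ for the vector of scores $\theta_s := \eta_U^\top m_s + \eta_F^\top n_s$, for $s \in \mathcal{I}$, the relaxed problem is

$$\min_{y_\mathcal{I} \in \mathbb{R}^{|\mathcal{I}|}} \tfrac{1}{2} \|M_\mathcal{I} y_\mathcal{I}\|_2^2 - \theta_\mathcal{I}^\top y_\mathcal{I} \quad \text{subject to} \quad \mathbf{1}^\top y_\mathcal{I} = 1,$$

whose solution can be found by solving the KKT system

$$\begin{bmatrix} M_\mathcal{I}^\top M_\mathcal{I} & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix} \begin{bmatrix} y_\mathcal{I} \\ \tau \end{bmatrix} = \begin{bmatrix} \theta_\mathcal{I} \\ 1 \end{bmatrix}. \tag{9}$$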


At each iteration, the (symmetric) design matrix in Equation 9 is updated by adding or removing a row and a column; therefore its inverse (or a decomposition) may be efficiently maintained and updated.
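As an illustration of the KKT system for the relaxed, equality-constrained QP, the sketch below assembles the bordered Gram matrix and solves it from scratch with `numpy.linalg.solve`. This is a simplification: it assumes the system is nonsingular and does not maintain an incrementally updated inverse or decomposition as described above. The names `solve_relaxed_qp`, `M_active`, and `theta_active` are hypothetical.

```python
import numpy as np

def solve_relaxed_qp(M_active, theta_active):
    """Solve min 0.5 * ||M_active @ y||^2 - theta_active @ y  s.t.  sum(y) = 1.

    M_active holds one column per structure in the active set and
    theta_active holds the corresponding scores. Returns (y_hat, tau_hat),
    where tau_hat is the multiplier of the sum-to-one constraint.
    """
    k = M_active.shape[1]
    kkt = np.zeros((k + 1, k + 1))
    kkt[:k, :k] = M_active.T @ M_active  # Gram matrix of active structures
    kkt[:k, k] = 1.0                     # border for the equality constraint
    kkt[k, :k] = 1.0
    rhs = np.append(theta_active, 1.0)
    solution = np.linalg.solve(kkt, rhs)
    return solution[:k], solution[k]
```

Note that the returned weights may be negative, since non-negativity is relaxed; the line search that follows is what restores feasibility.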

Line search.

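The optimal step size for moving a feasible current estimate $y'$ toward a solution $\hat{y}$ of Equation 9, while keeping feasibility, is given by (Martins et al., 2015, Equation 31)

$$\gamma = \min\left(1, \min_{s \in \mathcal{I} \,:\, y'_s > \hat{y}_s} \frac{y'_s}{y'_s - \hat{y}_s}\right). \tag{10}$$

When $\gamma < 1$ this update zeros out a coordinate of $y'$; otherwise, $\mathcal{I}$ remains the same.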
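The feasibility-preserving step can be sketched as follows, assuming the current weights `y` and the relaxed solution `y_hat` are dense NumPy vectors over the active set. The function name is hypothetical, and the formula is the standard ratio test: the largest step in `[0, 1]` toward `y_hat` that keeps every coordinate non-negative.

```python
import numpy as np

def active_set_step_size(y, y_hat):
    """Largest gamma in [0, 1] keeping (1 - gamma) * y + gamma * y_hat >= 0."""
    shrinking = y > y_hat  # only coordinates that decrease can hit zero
    if not np.any(shrinking):
        return 1.0
    ratios = y[shrinking] / (y[shrinking] - y_hat[shrinking])
    return min(1.0, float(ratios.min()))
```

For example, moving from `y = [0.5, 0.5]` toward the infeasible `y_hat = [1.5, -0.5]` stops at a step of one half, where the second coordinate reaches zero and its structure can be dropped from the active set.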

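1:  Initialization: $s^{(0)} \leftarrow \operatorname{MAP}_A(\eta_U, \eta_F)$; $\mathcal{I}^{(0)} = \{s^{(0)}\}$; $y^{(0)} = e_{s^{(0)}}$
2:  for $t = 0, \dots, t_{\max}$ do
3:     Solve the relaxed QP restricted to $\mathcal{I}^{(t)}$; get $\hat{y}$, $\hat{\tau}$, $\hat{u} = M \hat{y}$ (Equation 9)
4:     if $\hat{y} = y^{(t)}$ then
5:        $s \leftarrow \operatorname{MAP}_A(\eta_U - \hat{u}, \eta_F)$
6:        if $(\eta_U - \hat{u})^\top m_s + \eta_F^\top n_s \leq \hat{\tau}$ then
7:           return $\hat{u}$ (Equation 7)
8:        else
9:           $\mathcal{I}^{(t+1)} \leftarrow \mathcal{I}^{(t)} \cup \{s\}$
10:        end if
11:     else
12:        Compute step size $\gamma$ (Equation 10)
13:        $y^{(t+1)} \leftarrow (1 - \gamma) y^{(t)} + \gamma \hat{y}$  (sparse update)
14:        Update $\mathcal{I}^{(t+1)}$ if necessary
15:     end if
16:  end for
Algorithm 2 Active Set algorithm for SparseMAP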

Appendix B Computing the Jacobian: Proof of Proposition 1

Recall that SparseMAP is defined as the $u^\star$ that maximizes the value of the quadratic program (Equation 5),