1 Introduction
Structured prediction involves the manipulation of discrete, combinatorial structures, e.g., trees and alignments (Bakır et al., 2007; Smith, 2011; Nowozin et al., 2014). Such structures arise naturally as machine learning outputs, and as intermediate representations in deep pipelines. However, the set of possible structures is typically prohibitively large. As such, inference is a core challenge, often sidestepped by greedy search, factorization assumptions, or continuous relaxations (Belanger & McCallum, 2016).
In this paper, we propose an appealing alternative: a new inference strategy, dubbed SparseMAP, which encourages sparsity in the structured representations. Namely, we seek solutions explicitly expressed as a combination of a small, enumerable set of global structures.
Our framework departs from the two most common inference strategies in structured prediction: maximum a posteriori (MAP) inference, which returns the highest-scoring structure, and marginal inference, which yields a dense probability distribution over structures. Neither of these strategies is fully satisfactory: for latent structure models, marginal inference is appealing, since it can represent uncertainty and, unlike MAP inference, it is continuous and differentiable, hence amenable for use in structured hidden layers in neural networks (Kim et al., 2017). It has, however, several limitations. First, there are useful problems for which MAP is tractable, but marginal inference is not, e.g., linear assignment (Valiant, 1979; Taskar, 2004). Second, even when marginal inference is available, case-by-case derivation of the backward pass is needed, sometimes producing fairly complicated algorithms, e.g., second-order expectation semirings (Li & Eisner, 2009). Finally, marginal inference is dense: it assigns nonzero probabilities to all structures and cannot completely rule out irrelevant ones. This can be statistically and computationally wasteful, as well as qualitatively harder to interpret.
In this work, we make the following contributions:

We propose SparseMAP: a new framework for sparse structured inference (§3.1). The main idea is illustrated in Figure 1. SparseMAP is a twofold generalization: first, as a structured extension of the sparsemax transformation (Martins & Astudillo, 2016); second, as a continuous yet sparse relaxation of MAP inference. MAP yields a single structure and marginal inference yields a dense distribution over all structures. In contrast, the SparseMAP solutions are sparse combinations of a small number of often-overlapping structures.

We show how to compute SparseMAP effectively, requiring only a MAP solver as a subroutine (§3.2), by exploiting the problem's sparsity and quadratic curvature. Notably, the MAP oracle can be any arbitrary solver, e.g., the Hungarian algorithm for linear assignment, which permits tackling problems for which marginal inference is intractable.

We derive expressions for gradient backpropagation through SparseMAP inference, which, unlike MAP, is differentiable almost everywhere (§3.3). The backward pass is fully general (applicable to any type of structure), and it is efficient, thanks to the sparsity of the solutions and to reusing quantities computed in the forward pass.

We introduce a novel SparseMAP loss for structured prediction, placing it into a family of loss functions which generalizes the CRF and structured SVM losses (§4). Inheriting the desirable properties of SparseMAP inference, the SparseMAP loss and its gradients can be computed efficiently, provided access to MAP inference.
Our experiments demonstrate that SparseMAP is useful both for predicting structured outputs and for learning latent structured representations. On dependency parsing (§5.1), structured output networks trained with the SparseMAP loss yield more accurate models with sparse, interpretable predictions, adapting to the ambiguity (or lack thereof) of test examples. On natural language inference (§5.2), we learn latent structured alignments, obtaining good predictive performance, as well as useful natural visualizations concentrated on a small number of structures. (General-purpose dynet and pytorch implementations are available at https://github.com/vene/sparsemap.)
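Notation. Given vectors $a \in \mathbb{R}^m$, $b \in \mathbb{R}^n$, $[a; b] \in \mathbb{R}^{m+n}$ denotes their concatenation; given matrices $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{n \times k}$, we denote their row-wise stacking as $[A; B] \in \mathbb{R}^{(m+n) \times k}$. We denote the columns of a matrix $A$ by $a_j$; by extension, a slice of columns of $A$ is denoted $A_{\mathcal{I}}$ for a set of indices $\mathcal{I}$. We denote the canonical simplex by $\Delta^d := \{y \in \mathbb{R}^d : y \succeq 0, \sum_i y_i = 1\}$, and the indicator function of a predicate $p$ as $\mathbb{I}[p]$, equal to $1$ if $p$ holds and $0$ otherwise.
2 Preliminaries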
2.1 Regularized Max Operators: Softmax, Sparsemax
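As a basis for the more complex structured case, we first consider the simple problem of selecting the largest value in a vector $\theta \in \mathbb{R}^d$. We denote the vector mapping
$$\operatorname{argmax}(\theta) := \operatorname*{arg\,max}_{y \in \Delta^d} \theta^\top y.$$
When there are no ties, argmax has a unique solution $e_i$ peaking at the index $i$ of the highest value of $\theta$. When there are ties, argmax is set-valued. Even assuming no ties, argmax is piecewise constant, and thus is ill-suited for direct use within neural networks, e.g., in an attention mechanism. Instead, it is common to use softmax, a continuous and differentiable approximation to argmax, which can be seen as an entropy-regularized argmax:
$$\operatorname{softmax}(\theta) := \operatorname*{arg\,max}_{y \in \Delta^d} \theta^\top y + H(y) = \frac{\exp \theta}{\sum_{i=1}^d \exp \theta_i}, \tag{1}$$
where $H(y) := -\sum_i y_i \ln y_i$ is the Shannon entropy, so that the penalty is the negative Shannon entropy. Since $\exp(\cdot) > 0$ strictly, softmax outputs are dense.
By replacing the entropic penalty with a squared $\ell_2$ norm, Martins & Astudillo (2016) introduced a sparse alternative to softmax, called sparsemax, given by
$$\operatorname{sparsemax}(\theta) := \operatorname*{arg\,max}_{y \in \Delta^d} \theta^\top y - \tfrac{1}{2}\|y\|_2^2 = \operatorname*{arg\,min}_{y \in \Delta^d} \|y - \theta\|_2^2. \tag{2}$$
Both softmax and sparsemax are continuous and differentiable almost everywhere; however, sparsemax encourages sparsity in its outputs. This is because it corresponds to a Euclidean projection onto the simplex, which is likely to hit its boundary as the magnitude of $\theta$ increases. Both mechanisms, as well as variants with different penalties (Niculae & Blondel, 2017), have been successfully used in attention mechanisms, for mapping a score vector $\theta \in \mathbb{R}^d$ to a $d$-dimensional normalized discrete probability distribution over a small set of choices. The relationship between argmax, softmax, and sparsemax, illustrated in Figure 1, sits at the foundation of SparseMAP.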
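The contrast between the dense and sparse mappings can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the standard sort-based simplex projection for sparsemax, and the function names and the example score vector are illustrative.

```python
import numpy as np

def softmax(theta):
    """Entropy-regularized argmax: every output entry is strictly positive."""
    z = np.exp(theta - np.max(theta))  # shift scores for numerical stability
    return z / z.sum()

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (sort-based)."""
    theta = np.asarray(theta, dtype=float)
    z = np.sort(theta)[::-1]           # scores in decreasing order
    cssv = np.cumsum(z) - 1.0          # cumulative sums minus the unit budget
    ks = np.arange(1, len(z) + 1)
    support = z - cssv / ks > 0        # sorted coordinates that stay nonzero
    k = ks[support][-1]                # size of the support
    tau = cssv[k - 1] / k              # threshold subtracted from all scores
    return np.maximum(theta - tau, 0.0)

theta = np.array([1.5, 1.2, -0.3, -2.0])  # illustrative scores
p_soft = softmax(theta)
p_sparse = sparsemax(theta)
print(p_soft)    # dense: all entries positive
print(p_sparse)  # sparse: low-scoring entries are exactly zero
```

Both outputs lie on the simplex, but only the sparsemax output assigns exactly zero mass to the two low-scoring choices.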
2.2 Structured Inference
In structured prediction, the space of possible outputs is typically very large: for instance, all possible labelings of a length-$n$ sequence, spanning trees over $n$ nodes, or one-to-one alignments between two sets. We may still write optimization problems such as $\max_{s = 1, \dots, D} \theta_s$, but it is impractical to enumerate all of the $D$ possible structures and, in turn, to specify the scores for each structure in $\theta$.
Instead, structured problems are often parametrized through structured log-potentials (scores) $\theta := A^\top \eta$, where $A \in \mathbb{R}^{k \times D}$ is a matrix that specifies the structure of the problem, and $\eta \in \mathbb{R}^k$ is a lower-dimensional parameter vector, i.e., $k \ll D$. For example, in a factor graph (Kschischang et al., 2001) with variables $U$ and factors $F$, $\theta$ is given by
$$\theta_s := \sum_{i \in U} \eta_{U,i}(s_i) + \sum_{f \in F} \eta_{F,f}(s_f),$$
where $\eta_U$ and $\eta_F$ are unary and higher-order log-potentials, and $s_i$ and $s_f$ are local configurations at variable and factor nodes. This can be written in matrix notation as $\theta = M^\top \eta_U + N^\top \eta_F$ for suitable matrices $M$ and $N$, fitting the assumption above with $A = [M; N]$ and $\eta = [\eta_U; \eta_F]$.
We can then rewrite the MAP inference problem, which seeks the highest-scoring structure, as a $k$-dimensional problem, by introducing variables $[u; v] \in \mathbb{R}^k$ to denote configurations at variable and factor nodes:
$$\mathrm{MAP}_A(\eta) := \operatorname*{arg\,max}_{u := My,\; y \in \Delta^D} \theta^\top y = \operatorname*{arg\,max}_{u:\, [u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v, \tag{3}$$
where $\mathcal{M}_A := \{[u; v] : u = My,\ v = Ny,\ y \in \Delta^D\}$ is the marginal polytope (Wainwright & Jordan, 2008), with one vertex for each possible structure (Figure 1). The notation $\operatorname{arg\,max}_{u:\, [u; v] \in \mathcal{M}_A}$ conveys that the maximization is over both $u$ and $v$, but only $u$ is returned. Separating the variables as $[u; v]$ loses no generality and allows us to isolate the unary posteriors $u$ as the return value of interest. However, as previously said, since it is equivalent to a $D$-dimensional argmax, MAP is piecewise constant and discontinuous.
Negative entropy regularization over $y$, with $H$ the Shannon entropy, on the other hand, yields marginal inference,
$$\mathrm{Marg}_A(\eta) := \operatorname*{arg\,max}_{u := My,\; y \in \Delta^D} \theta^\top y + H(y) = \operatorname*{arg\,max}_{u:\, [u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v + H_A(u, v), \tag{4}$$
where $H_A(u, v)$ denotes the entropy expressed in terms of the marginals. Marginal inference is differentiable, but may be more difficult to compute; the entropy $H_A(u, v)$ itself lacks a closed form (Wainwright & Jordan, 2008, §4.1.2). Gradient backpropagation is available only for specialized problem instances, e.g., those solvable by dynamic programming (Li & Eisner, 2009). The entropic term regularizes $y$ toward more uniform distributions, resulting in strictly dense solutions, just like in the case of softmax (Equation 1).
Interesting types of structures, which we use in the experiments described in Section 5, include the following.
Sequence tagging. Consider a sequence of $n$ items, each assigned one out of a possible $m$ tags. In this case, a global structure $s$ is a joint assignment of tags $(t_1, \dots, t_n)$. The matrix $M$ is $nm$-by-$m^n$-dimensional, with columns $m_s \in \{0, 1\}^{nm}$ indicating which tag is assigned to each variable in the global structure $s$. The matrix $N$ likewise has $m^n$ columns, with $n_s$ encoding the transitions between consecutive tags, i.e., $n_s(i, a, b) := \mathbb{I}[t_{i-1} = a \text{ and } t_i = b]$. The Viterbi algorithm provides MAP inference and forward-backward provides marginal inference (Rabiner, 1989).
Non-projective dependency parsing. Consider a sentence of length $n$. Here, a structure is a dependency tree: a rooted spanning tree over the $n^2$ possible arcs (for example, the arcs above the sentences in Figure 3). Each column $m_s \in \{0, 1\}^{n^2}$ encodes a tree by assigning a $1$ to its arcs. $N$ is empty, and $\mathcal{M}_A$ is known as the arborescence polytope (Martins et al., 2009). MAP inference may be performed by maximal arborescence algorithms (Chu & Liu, 1965; Edmonds, 1967; McDonald et al., 2005), and the Matrix-Tree theorem (Kirchhoff, 1847) provides a way to perform marginal inference (Koo et al., 2007; Smith & Smith, 2007).
Linear assignment. Consider a one-to-one matching (linear assignment) between two sets of $n$ nodes. A global structure $s$ is an $n$-permutation, and a column $m_s \in \{0, 1\}^{n^2}$ can be seen as a flattening of the corresponding permutation matrix. Again, $N$ is empty. $\mathcal{M}_A$ is the Birkhoff polytope (Birkhoff, 1946), and MAP inference can be performed by, e.g., the Hungarian algorithm (Kuhn, 1955) or the Jonker-Volgenant algorithm (Jonker & Volgenant, 1987). Notably, marginal inference is known to be #P-complete (Valiant, 1979; Taskar, 2004, Section 3.5). This makes it an open problem how to use matchings as latent variables.
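To make the parametrization of structure scores through a structure matrix concrete, the sketch below builds the matrices for a toy sequence tagging problem by explicit enumeration and scores every structure at once. It is an illustration under stated assumptions, not the authors' code: the sizes are hypothetical, only the transitions between consecutive items are encoded (no start or end transitions), and the names `M`, `N`, `eta_U`, `eta_F` are chosen here for the unary and transition parts.

```python
import itertools
import numpy as np

n, m = 3, 2  # hypothetical toy sizes: 3 items, 2 tags

# Enumerate all m**n global structures (joint tag assignments).
structures = list(itertools.product(range(m), repeat=n))

# Column s of M marks which tag each item takes in structure s.
M = np.zeros((n * m, len(structures)))
# Column s of N marks which (previous tag, tag) pair occurs at each transition.
N = np.zeros(((n - 1) * m * m, len(structures)))
for s, tags in enumerate(structures):
    for i, t in enumerate(tags):
        M[i * m + t, s] = 1
    for i in range(1, n):
        N[(i - 1) * m * m + tags[i - 1] * m + tags[i], s] = 1

rng = np.random.default_rng(0)
eta_U = rng.normal(size=n * m)            # unary scores
eta_F = rng.normal(size=(n - 1) * m * m)  # transition scores

# Scores of all structures: theta = M^T eta_U + N^T eta_F.
theta = M.T @ eta_U + N.T @ eta_F

# MAP by brute force; feasible only because the toy problem is tiny.
best = int(np.argmax(theta))
print(structures[best], M[:, best])
```

In realistic problems the number of columns is exponential, which is why MAP is computed by a dedicated algorithm such as Viterbi rather than by enumeration.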
3 SparseMAP
Armed with the parallel between structured inference and regularized max operators described in §2, we are now ready to introduce SparseMAP, a novel inference optimization problem which returns sparse solutions.
3.1 Definition
We introduce SparseMAP by regularizing the MAP inference problem in Equation 3 with a squared $\ell_2$ penalty on the returned posteriors, i.e., $\tfrac{1}{2}\|u\|_2^2$. Denoting, as above, $\theta := A^\top \eta$, the result is a quadratic optimization problem,
$$\mathrm{SparseMAP}_A(\eta) := \operatorname*{arg\,max}_{u := My,\; y \in \Delta^D} \theta^\top y - \tfrac{1}{2}\|My\|_2^2 = \operatorname*{arg\,max}_{u:\, [u; v] \in \mathcal{M}_A} \eta_U^\top u + \eta_F^\top v - \tfrac{1}{2}\|u\|_2^2. \tag{5}$$
The quadratic penalty replaces the entropic penalty from marginal inference (Equation 4), which pushes the solutions to the strict interior of the marginal polytope. In consequence, SparseMAP favors sparse solutions from the faces of the marginal polytope $\mathcal{M}_A$, as illustrated in Figure 1. For the structured prediction problems mentioned in Section 2.2, SparseMAP would be able to return, for example, a sparse combination of sequence labelings, parse trees, or matchings. Moreover, the strongly convex regularization on $u$ ensures that SparseMAP has a unique solution and is differentiable almost everywhere, as we will see.
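As a worked special case, offered as our own illustration rather than a statement from the paper, suppose there are no factor variables (as in the parsing and matching examples), and write $\eta_U$ for the unary scores and $\mathcal{M}_A$ for the marginal polytope. Completing the square then shows that the quadratically regularized problem is a Euclidean projection onto the polytope, mirroring the way sparsemax projects onto the simplex:

```latex
\begin{aligned}
\operatorname*{arg\,max}_{u \in \mathcal{M}_A} \; \eta_U^\top u - \tfrac{1}{2}\|u\|_2^2
&= \operatorname*{arg\,max}_{u \in \mathcal{M}_A} \; -\tfrac{1}{2}\|u - \eta_U\|_2^2 + \tfrac{1}{2}\|\eta_U\|_2^2 \\
&= \operatorname*{arg\,min}_{u \in \mathcal{M}_A} \; \|u - \eta_U\|_2^2 .
\end{aligned}
```

The term $\tfrac{1}{2}\|\eta_U\|_2^2$ does not depend on $u$, so dropping it leaves the maximizer unchanged.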
3.2 Solving SparseMAP
We now tackle the optimization problem in Equation 5. Although SparseMAP is a QP over a polytope, even describing it in standard form is infeasible, since enumerating the exponentially large set of vertices is infeasible. This prevents direct application of, e.g., the generic differentiable QP solver of Amos & Kolter (2017). We instead focus on solvers that involve a sequence of MAP problems as a subroutine; this makes SparseMAP widely applicable, given the availability of MAP implementations for various structures. We discuss two such methods, one based on the conditional gradient algorithm and another based on the active set method for quadratic programming. We provide a full description of both methods in Appendix A.
Conditional gradient. One family of such solvers is based on the conditional gradient (CG) algorithm (Frank & Wolfe, 1956; Lacoste-Julien & Jaggi, 2015), considered in prior work for solving approximations of the marginal inference problem (Belanger et al., 2013; Krishnan et al., 2015). Each step must solve a linearized subproblem. Denote by $f$ the objective from Equation 5, negated so that it is minimized,
$$f(u, v) := -\eta_U^\top u - \eta_F^\top v + \tfrac{1}{2}\|u\|_2^2.$$
The gradients of $f$ with respect to the two variables are
$$\nabla_u f(u', v') = u' - \eta_U, \qquad \nabla_v f(u', v') = -\eta_F.$$
A linear approximation to $f$ around a point $[u'; v'] \in \mathcal{M}_A$ is
$$\hat{f}(u, v) := (\nabla_u f)^\top u + (\nabla_v f)^\top v = (u' - \eta_U)^\top u - \eta_F^\top v.$$
Minimizing $\hat{f}$ over $\mathcal{M}_A$ is exactly MAP inference with adjusted variable scores $\eta_U - u'$. Intuitively, at each step we seek a high-scoring structure while penalizing sharing variables with already-selected structures. Vanilla CG simply adds the new structure to the active set at every iteration. The pairwise and away-step variants trade off between the direction toward the new structure and the direction away from one of the already-selected structures. More sophisticated variants have been proposed (Garber & Meshi, 2016) which can provide sparse solutions when optimizing over a polytope.
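The vanilla conditional gradient iteration can be sketched on linear assignment, where there are no factor variables and the iterate is a single matrix of unary posteriors. This is a minimal illustration under stated assumptions, not the authors' solver: the MAP oracle is a brute-force search over permutations standing in for a real solver such as the Hungarian algorithm, the step size uses exact line search for the quadratic objective, and all function names are hypothetical.

```python
import itertools
import numpy as np

def map_oracle(scores):
    """Brute-force MAP for linear assignment: the best permutation matrix.

    Stand-in for a real solver such as the Hungarian algorithm."""
    n = scores.shape[0]
    best = max(itertools.permutations(range(n)),
               key=lambda perm: sum(scores[i, perm[i]] for i in range(n)))
    u = np.zeros_like(scores)
    u[np.arange(n), list(best)] = 1.0
    return u

def sparsemap_cg(eta, n_iter=200):
    """Vanilla conditional gradient on f(u) = -<eta, u> + 0.5 * ||u||^2."""
    u = map_oracle(eta)                       # start at a vertex
    for _ in range(n_iter):
        s = map_oracle(eta - u)               # MAP with adjusted scores
        d = s - u                             # direction toward the new structure
        gap = np.sum((eta - u) * d)           # conditional gradient duality gap
        if gap < 1e-9:
            break
        step = min(1.0, gap / np.sum(d * d))  # exact line search for a quadratic
        u = u + step * d
    return u

rng = np.random.default_rng(0)
eta = rng.normal(size=(3, 3))                 # illustrative scores
u = sparsemap_cg(eta)
print(np.round(u, 3))
```

The iterate stays a convex combination of permutation matrices, so its rows and columns sum to one. Unlike the active set method, this sketch does not track the selected structures or their weights explicitly.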
Active set method. Importantly, the SparseMAP problem in Equation 5 has quadratic curvature, which the general CG algorithms may not optimally leverage. For this reason, we consider the active set method for constrained QPs: a generalization of Wolfe's min-norm point algorithm (Wolfe, 1976), also used in structured prediction for the quadratic subproblems by Martins et al. (2015). The active set algorithm, at each iteration, updates an estimate of the solution support by adding or removing one constraint to/from the active set; then it solves the Karush–Kuhn–Tucker (KKT) system of a relaxed QP restricted to the current support.
Comparison. Both algorithms enjoy global linear convergence with similar rates (Lacoste-Julien & Jaggi, 2015), but the active set algorithm also exhibits exact finite convergence; this allows it, for instance, to capture the optimal sparsity pattern (Nocedal & Wright, 1999, Ch. 16.4 & 16.5). Vinyes & Obozinski (2017) provide a more in-depth discussion of the connections between the two algorithms. We perform an empirical comparison on a dependency parsing instance with random potentials. Figure 2 shows that active set substantially outperforms all CG variants, both in terms of objective value and in terms of solution sparsity, suggesting that the quadratic curvature makes SparseMAP solvable to high accuracy in very few iterations. We therefore use the active set solver in the remainder of the paper.
3.3 Backpropagating Gradients through SparseMAP
In order to use SparseMAP as a neural network layer trained with backpropagation, one must compute products of the SparseMAP Jacobian with a vector $p$. Computing the Jacobian of an optimization problem is an active research topic known as argmin differentiation, and is generally difficult. Fortunately, as we show next, argmin differentiation is always easy and efficient in the case of SparseMAP.
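Proposition 1
Denote a SparseMAP solution by $y^\star$ and its support by $\mathcal{I} := \{s : y^\star_s > 0\}$. Then, SparseMAP is differentiable almost everywhere with Jacobian
$$\frac{\partial u^\star}{\partial \eta} = M D(\mathcal{I}) A^\top, \quad \text{where } D(\mathcal{I}) = D(\mathcal{I})^\top \text{ has columns } d(\mathcal{I})_s := \begin{cases} \left(I - \frac{1}{\mathbf{1}^\top Z \mathbf{1}} Z \mathbf{1}\mathbf{1}^\top\right) z_s, & s \in \mathcal{I}, \\ 0, & s \notin \mathcal{I}, \end{cases} \qquad Z := (M_{\mathcal{I}}^\top M_{\mathcal{I}})^{-1}.$$
The proof, given in Appendix B, relies on the KKT conditions of the SparseMAP QP. Importantly, because $D(\mathcal{I})$ is zero outside of the support of the solution, computing the Jacobian only requires the columns of $M$ and $A$ corresponding to the structures in the active set. Moreover, when using the active set algorithm discussed in §3.2, the matrix $Z$ is readily available as a byproduct of the forward pass. The backward pass can, therefore, be computed in $\mathcal{O}(k|\mathcal{I}|)$.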
Our approach for gradient computation draws its efficiency from the solution sparsity and does not depend on the type of structure considered. This is contrasted with two related lines of research. The first is "unrolling" iterative inference algorithms, for instance belief propagation (Stoyanov et al., 2011; Domke, 2013) and gradient descent (Belanger et al., 2017), where the backward pass complexity scales with the number of iterations. In the second, employed by Kim et al. (2017), when inference can be performed via dynamic programming, backpropagation can be performed using second-order expectation semirings (Li & Eisner, 2009) or more general smoothing (Mensch & Blondel, 2018), in the same time complexity as the forward pass. Moreover, in our approach, neither the forward nor the backward pass involves logarithms, exponentiations, or log-domain classes, avoiding the slowdown and stability issues normally incurred.
In the unstructured case, since $M = I$, $Z$ is also an identity matrix, uncovering the sparsemax Jacobian (Martins & Astudillo, 2016). In general, structures are not necessarily orthogonal, but may have degrees of overlap.
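The reduction to the unstructured case can be verified numerically. The sketch below is an illustration, not the authors' code: it assumes the Jacobian takes the form $M_{\mathcal{I}} (Z - Z \mathbf{1}\mathbf{1}^\top Z / \mathbf{1}^\top Z \mathbf{1}) A_{\mathcal{I}}^\top$ with $Z = (M_{\mathcal{I}}^\top M_{\mathcal{I}})^{-1}$, which is what the KKT conditions of the QP restricted to the support give, and it compares the result with the known sparsemax Jacobian. The function name and the chosen support are hypothetical.

```python
import numpy as np

def sparsemap_jacobian(M_I, A_I):
    """Jacobian M_I (Z - Z 1 1^T Z / (1^T Z 1)) A_I^T on a given support.

    M_I, A_I: columns of the structure matrices for the selected structures."""
    Z = np.linalg.inv(M_I.T @ M_I)
    z1 = Z @ np.ones(Z.shape[0])          # the vector Z 1
    D = Z - np.outer(z1, z1) / z1.sum()   # symmetric, since Z is symmetric
    return M_I @ D @ A_I.T

# Unstructured check: with identity structure matrices, the result should be
# the sparsemax Jacobian diag(s) - s s^T / |S| for the support indicator s.
d = 4
support = [0, 1]                          # hypothetical support
I_d = np.eye(d)
J = sparsemap_jacobian(I_d[:, support], I_d[:, support])

s = np.zeros(d)
s[support] = 1.0
J_sparsemax = np.diag(s) - np.outer(s, s) / s.sum()
print(np.allclose(J, J_sparsemax))
```

Only the columns in the support enter the computation, which is the source of the efficiency of the backward pass.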
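4 Structured Fenchel-Young Losses and the SparseMAP Loss
With the efficient algorithms derived above in hand, we switch gears to defining a loss function. Structured output prediction models are typically trained by minimizing a structured loss measuring the discrepancy between the desired structure (encoded, for instance, as an indicator vector $y$) and the prediction induced by the log-potentials $\eta$. We provide here a general family of structured prediction losses that will make the newly proposed SparseMAP loss arise as a very natural case. Below, we let $\Omega : \mathbb{R}^D \to \mathbb{R}$ denote a convex penalty function and denote by $\Omega_\Delta$ its restriction to $\Delta^D \subset \mathbb{R}^D$, i.e.,
$$\Omega_\Delta(y) := \begin{cases} \Omega(y), & y \in \Delta^D, \\ \infty, & y \notin \Delta^D. \end{cases}$$
The Fenchel convex conjugate of $\Omega_\Delta$ is
$$\Omega_\Delta^\star(\theta) := \sup_{y \in \mathbb{R}^D} \theta^\top y - \Omega_\Delta(y) = \sup_{y \in \Delta^D} \theta^\top y - \Omega(y).$$
We next introduce a family of structured prediction losses, named after the corresponding Fenchel-Young duality gap.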
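Definition 1 (Fenchel-Young losses)
Given a convex penalty function $\Omega : \mathbb{R}^D \to \mathbb{R}$, and a $(k \times D)$-dimensional matrix $A = [M; N]$ encoding the structure of the problem, we define the following family of structured losses:
$$\ell_{\Omega, A}(\eta, y) := \Omega_\Delta^\star(A^\top \eta) + \Omega_\Delta(y) - \eta^\top A y. \tag{6}$$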
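This family, studied in more detail by Blondel et al. (2018), includes commonly used structured losses such as the structured perceptron, structured SVM, and CRF losses.
This leads to a natural way of defining SparseMAP losses, by plugging the following into Equation 6:
• SparseMAP loss: $\Omega(y) = \tfrac{1}{2}\|My\|_2^2$,
• Margin SparseMAP: the same quadratic penalty combined with a structured cost term, by analogy with the structured SVM.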
It is well-known that the subgradients of structured perceptron and SVM losses consist of MAP inference, while the CRF loss gradient requires marginal inference. Similarly, the subgradients of the SparseMAP loss can be computed via SparseMAP inference, which in turn only requires MAP. The next proposition states properties of structured Fenchel-Young losses, including a general connection between a loss and its corresponding inference method.
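Proposition 2
Consider a convex $\Omega$ and a structured model defined by the matrix $A \in \mathbb{R}^{k \times D}$. Denote the inference objective $f_\Omega(y) := \eta^\top A y - \Omega(y)$, and a solution $y^\star := \operatorname{arg\,max}_{y \in \Delta^D} f_\Omega(y)$. Then, the following properties hold for a true structure $\bar{y}$:
1. $\ell_{\Omega, A}(\eta, \bar{y}) \geq 0$, with equality when $f_\Omega(\bar{y}) = f_\Omega(y^\star)$;
2. $\ell_{\Omega, A}(\eta, \bar{y})$ is convex in $\eta$, with $\partial \ell_{\Omega, A}(\eta, \bar{y}) \ni A(y^\star - \bar{y})$;
3. $\ell_{t\Omega, A}(\eta, \bar{y}) = t\, \ell_{\Omega, A}(\eta / t, \bar{y})$ for any $t > 0$.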
The proof is given in Appendix C. Property 1 suggests that minimizing $\ell_{\Omega, A}$ aligns models with the true label. Property 2 shows how to compute subgradients of $\ell_{\Omega, A}$ provided access to the inference output $A y^\star$. Combined with our efficient procedure described in Section 3.2, it makes the SparseMAP losses promising for structured prediction. Property 3 suggests that the strength of the penalty $\Omega$ can be adjusted by simply scaling $\eta$. Finally, we remark that for a strongly convex $\Omega$, $\ell_{\Omega, A}$ can be seen as a smoothed perceptron loss; other smoothed losses have been explored by Shalev-Shwartz & Zhang (2016).
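The first two properties can be illustrated in the unstructured case, where the structure matrix is the identity and the quadratic penalty is half the squared norm, so that inference reduces to sparsemax. The sketch below is a hedged illustration rather than the authors' implementation: the function names, the score vector, and the choice of true label are all assumptions made for the example.

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (sort-based)."""
    theta = np.asarray(theta, dtype=float)
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    ks = np.arange(1, len(z) + 1)
    k = ks[z - cssv / ks > 0][-1]             # support size
    return np.maximum(theta - cssv[k - 1] / k, 0.0)

def fy_loss_and_grad(theta, y_true):
    """Fenchel-Young loss for Omega(y) = 0.5 * ||y||^2 with identity structure."""
    y_star = sparsemax(theta)                 # inference output

    def f(y):                                 # inference objective
        return theta @ y - 0.5 * (y @ y)

    loss = f(y_star) - f(y_true)              # conjugate + penalty - score of truth
    grad = y_star - y_true                    # gradient with respect to theta
    return loss, grad

theta = np.array([1.5, 1.2, -0.3])            # illustrative scores
y_true = np.array([0.0, 1.0, 0.0])            # one-hot true label (assumed)
loss, grad = fy_loss_and_grad(theta, y_true)
print(loss, grad)
```

The loss is nonnegative and vanishes once the prediction places all its mass on the true label, at which point the gradient is zero as well.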
5 Experimental Results
In this section, we experimentally validate SparseMAP on two natural language processing applications, illustrating the two main use cases presented: structured output prediction with the SparseMAP loss (§5.1) and structured hidden layers (§5.2). All models are implemented using the dynet library v2.0.2 (Neubig et al., 2017).
5.1 Dependency Parsing with the SparseMAP Loss
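Table 1: Dependency parsing test set results, by training loss and language.
Loss | en | zh | vi | ro | ja
Structured SVM | 87.02 | 81.94 | 69.42 | 87.58 | 96.24
CRF | 86.74 | 83.18 | 69.10 | 87.13 | 96.09
SparseMAP | 86.90 | 84.03 | 69.71 | 87.35 | 96.04
m-SparseMAP | 87.34 | 82.63 | 70.87 | 87.63 | 96.03
UDPipe baseline | 87.68 | 82.14 | 69.63 | 87.36 | 95.94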
We evaluate the SparseMAP losses against the commonly used CRF and structured SVM losses. The task we focus on is non-projective dependency parsing: a structured output task consisting of predicting the directed tree of grammatical dependencies between words in a sentence (Jurafsky & Martin, 2018, Ch. 14). We use annotated Universal Dependency data (Nivre et al., 2016), as used in the CoNLL 2017 shared task (Zeman et al., 2017). To isolate the effect of the loss, we use the provided gold tokenization and part-of-speech tags. We follow closely the bidirectional LSTM arc-factored parser of Kiperwasser & Goldberg (2016), using the same model configuration; the only exception is not using externally pretrained embeddings. Parameters are trained using Adam (Kingma & Ba, 2015), tuning the learning rate over a grid that is expanded by a factor of 2 if the best model is at either end.
We experiment with 5 languages, diverse both in terms of family and in terms of the amount of training data (ranging from 1,400 sentences for Vietnamese to 12,525 for English). Test set results (Table 1) indicate that the SparseMAP losses outperform the SVM and CRF losses on 4 out of the 5 languages considered. This suggests that SparseMAP is a good middle ground between MAP-based and marginal-based losses in terms of smoothness and gradient sparsity.
Moreover, as illustrated in Figure 4, the SparseMAP loss encourages sparse predictions: models converge towards sparser solutions as they train, yielding very few ambiguous arcs. When confident, SparseMAP can predict a single tree. Otherwise, the small set of candidate parses returned can be easily visualized, often indicating genuine linguistic ambiguities (Figure 3). Returning a small set of parses, also sought concomitantly by Keith et al. (2018), is valuable in pipeline systems, e.g., when the parse is an input to a downstream application: error propagation is diminished in cases where the highest-scoring tree is incorrect (which is the case for the sentences in Figure 3). Unlike k-best heuristics, SparseMAP dynamically adjusts its output sparsity, which is desirable on realistic data where most instances are easy.
5.2 Latent Structured Alignment for Natural Language Inference
Table 2: Results for ESIM variants on MultiNLI and SNLI.
ESIM variant | MultiNLI | SNLI
softmax | 76.05 (100%) | 86.52 (100%)
sequential | 75.54 (13%) | 86.62 (19%)
matching | 76.13 (8%) | 86.05 (15%)
In this section, we demonstrate SparseMAP for inferring latent structure in large-scale deep neural networks. We focus on the task of natural language inference, defined as the classification problem of deciding, given two sentences (a premise and a hypothesis), whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it.
We consider novel structured variants of the state-of-the-art ESIM model (Chen et al., 2017). Given a premise $P$ of length $m$ and a hypothesis $H$ of length $n$, ESIM:
1. Encodes $P$ and $H$ with an LSTM.
2. Computes alignment scores $G \in \mathbb{R}^{m \times n}$, with $g_{ij}$ the inner product between $P$ word $i$ and $H$ word $j$.
3. Computes $P$-to-$H$ and $H$-to-$P$ alignments using row-wise and column-wise softmax on $G$, respectively.
4. Augments $P$ words with the weighted average of their aligned $H$ words, and vice versa.
5. Passes the result through another LSTM, then predicts.
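The alignment steps in the middle of this pipeline can be sketched in a few lines of NumPy. This is a simplified illustration, not the ESIM implementation: random matrices stand in for the LSTM encodings, the augmentation is a plain concatenation (an assumption made here for brevity), and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Softmax along one axis of a score matrix."""
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, n, dim = 4, 3, 5                 # hypothetical lengths and encoding size
P = rng.normal(size=(m, dim))       # stand-in for encoded premise words
H = rng.normal(size=(n, dim))       # stand-in for encoded hypothesis words

G = P @ H.T                         # alignment scores, one per word pair
P_to_H = softmax(G, axis=1)         # row-wise: each premise word over hypothesis words
H_to_P = softmax(G, axis=0)         # column-wise: each hypothesis word over premise words

P_ctx = P_to_H @ H                  # weighted average of aligned hypothesis words
H_ctx = H_to_P.T @ P                # weighted average of aligned premise words
P_aug = np.concatenate([P, P_ctx], axis=1)  # simplified augmentation (assumed)
H_aug = np.concatenate([H, H_ctx], axis=1)
print(P_aug.shape, H_aug.shape)
```

The two softmax calls normalize each row and each column independently, which is the part of the pipeline that a structured alignment replaces.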
We consider the following structured replacements for the independent row-wise and column-wise softmaxes (step 3):
Sequential alignment. We model the alignment of the premise to the hypothesis as a sequence tagging instance whose length is that of the premise, with possible tags corresponding to the words of the hypothesis. Through transition scores, we enable the model to capture continuity and monotonicity of alignments: we parametrize transitioning from one hypothesis word to another by binning the distance between them into 5 groups. We similarly parametrize the initial alignment and the final alignment using bins, allowing the model to express whether an alignment starts at the beginning or ends on the final word of the hypothesis.
We align the hypothesis to the premise by applying the same method in the other direction, with different transition scores. Overall, sequential alignment requires learning 18 additional scalar parameters.
Matching alignment. We now seek a symmetrical alignment in both directions simultaneously. To this end, we cast the alignment problem as finding a maximal weight bipartite matching. We recall from §2.2 that a solution can be found via the Hungarian algorithm (in contrast to marginal inference, which is #P-complete). When the two sentences have the same length, maximal matchings can be represented as permutation matrices; when the lengths differ, some words remain unaligned. SparseMAP returns a weighted average of a few maximal matchings. This method requires no additional learned parameters.
We evaluate the two models alongside the softmax baseline on the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets. (We split the MultiNLI matched validation set into equal validation and test sets; for SNLI we use the provided split.) All models are trained by SGD, with learning rate decay at epochs when the validation accuracy is not the best seen. We tune the learning rate over a grid, extending the range if the best model is at either end. The results in Table 2 show that structured alignments are competitive with softmax in terms of accuracy, but are orders of magnitude sparser. This sparsity allows them to produce global alignment structures that are interpretable, as illustrated in Figure 5.
Interestingly, we observe computational advantages of sparsity. Despite the overhead of GPU memory copying, both training and validation in our latent structure models take roughly the same time as with softmax and become faster as the models grow more certain. For the sake of comparison, Kim et al. (2017) report a slowdown in their structured attention networks, where they use marginal inference.
6 Related Work
Structured attention networks. Kim et al. (2017) and Liu & Lapata (2018) take advantage of the tractability of marginal inference in certain structured models and derive specialized backward passes for structured attention. In contrast, our approach is modular and general: with SparseMAP, the forward pass only requires MAP inference, and the backward pass is efficiently computed based on the forward pass results. Moreover, unlike marginal inference, SparseMAP yields sparse solutions, which is an appealing property statistically, computationally, and visually.
k-best inference. As it returns a small set of structures, SparseMAP brings to mind k-best inference, often used in pipeline NLP systems for increasing recall and handling uncertainty (Yang & Cardie, 2013). k-best inference can be approximated (or, in some cases, solved) roughly k times slower than MAP inference (Yanover & Weiss, 2004; Camerini et al., 1980; Chegireddy & Hamacher, 1987; Fromer & Globerson, 2009). The main advantages of SparseMAP are convexity, differentiability, and modularity, as SparseMAP can be computed in terms of MAP subproblems. Moreover, it yields a distribution, unlike k-best, which does not reveal the gap between selected structures.
Learning permutations. A popular approach for differentiable permutation learning involves mean-entropic optimal transport relaxations (Adams & Zemel, 2011; Mena et al., 2018). Unlike SparseMAP, this does not apply to general structures, and solutions are not directly expressible as combinations of a few permutations.
Regularized inference. Ravikumar et al. (2010), Meshi et al. (2015), and Martins et al. (2015) proposed perturbations and penalties in various related ways, with the goal of solving LP-MAP approximate inference in graphical models. In contrast, the goal of our work is sparse structured prediction, which is not considered in the aforementioned work. Nevertheless, some of the formulations in their work share properties with SparseMAP; exploring the connections further is an interesting avenue for future work.
7 Conclusion
We introduced a new framework for sparse structured inference, SparseMAP, along with a corresponding loss function. We proposed efficient ways to compute the forward and backward passes of SparseMAP. Experimental results illustrate two use cases where sparse inference is well-suited. For structured prediction, the SparseMAP loss leads to strong models that make sparse, interpretable predictions, a good fit for tasks where local ambiguities are common, like many natural language processing tasks. For structured hidden layers, we demonstrated that SparseMAP leads to strong, interpretable networks trained end-to-end. Modular by design, SparseMAP can be applied readily to any structured problem for which MAP inference is available, including combinatorial problems such as linear assignment.
Acknowledgements
We thank Tim Vieira, David Belanger, Jack Hessel, Justine Zhang, Sydney Zink, the Unbabel AI Research team, and the three anonymous reviewers for their insightful comments. This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2013, PTDC/EEISII/7092/2014 (LearnBig), and CMUPERI/TIC/0046/2014 (GoLocal).
References
 Adams & Zemel (2011) Adams, R. P. and Zemel, R. S. Ranking via Sinkhorn propagation. arXiv e-prints, 2011.
 Amos & Kolter (2017) Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
 Bakır et al. (2007) Bakır, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and Vishwanathan, S. V. N. Predicting Structured Data. The MIT Press, 2007.
 Belanger & McCallum (2016) Belanger, D. and McCallum, A. Structured prediction energy networks. In ICML, 2016.
 Belanger et al. (2013) Belanger, D., Sheldon, D., and McCallum, A. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Opt., FW and Friends, 2013.
 Belanger et al. (2017) Belanger, D., Yang, B., and McCallum, A. End-to-end learning for structured prediction energy networks. In ICML, 2017.
 Birkhoff (1946) Birkhoff, G. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán Rev. Ser. A, 5:147–151, 1946.
 Blondel et al. (2018) Blondel, M., Martins, A. F., and Niculae, V. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. arXiv e-prints, 2018.
 Bowman et al. (2015) Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
 Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
 Camerini et al. (1980) Camerini, P. M., Fratta, L., and Maffioli, F. The k best spanning arborescences of a network. Networks, 10(2):91–109, 1980.
 Chegireddy & Hamacher (1987) Chegireddy, C. R. and Hamacher, H. W. Algorithms for finding K-best perfect matchings. Discrete Applied Mathematics, 18(2):155–165, 1987.
 Chen et al. (2017) Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., and Inkpen, D. Enhanced LSTM for natural language inference. In ACL, 2017.
 Chu & Liu (1965) Chu, Y.-J. and Liu, T.-H. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
 Clarke (1990) Clarke, F. H. Optimization and Nonsmooth Analysis. SIAM, 1990.
 Collins (2002) Collins, M. Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
 Domke (2013) Domke, J. Learning graphical model parameters with approximate marginal inference. IEEE T. Pattern. Anal., 35(10):2454–2467, 2013.
 Edmonds (1967) Edmonds, J. Optimum branchings. J. Res. Nat. Bur. Stand., 71B:233–240, 1967.
 Fenchel (1949) Fenchel, W. On conjugate convex functions. Canad. J. Math, 1:73–77, 1949.
 Frank & Wolfe (1956) Frank, M. and Wolfe, P. An algorithm for quadratic programming. Nav. Res. Log., 3(1-2):95–110, 1956.
 Fromer & Globerson (2009) Fromer, M. and Globerson, A. An LP view of the M-best MAP problem. In NIPS, 2009.
 Garber & Meshi (2016) Garber, D. and Meshi, O. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In NIPS, 2016.
 Gimpel & Smith (2010) Gimpel, K. and Smith, N. A. Softmax-margin CRFs: Training log-linear models with cost functions. In NAACL, 2010.
 Jonker & Volgenant (1987) Jonker, R. and Volgenant, A. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
 Jurafsky & Martin (2018) Jurafsky, D. and Martin, J. H. Speech and Language Processing (3rd ed.). draft, 2018.
 Keith et al. (2018) Keith, K., Blodgett, S. L., and O’Connor, B. Monte Carlo syntax marginals for exploring and using dependency parses. In NAACL, 2018.
 Kim et al. (2017) Kim, Y., Denton, C., Hoang, L., and Rush, A. M. Structured attention networks. In ICLR, 2017.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kiperwasser & Goldberg (2016) Kiperwasser, E. and Goldberg, Y. Simple and accurate dependency parsing using bidirectional LSTM feature representations. TACL, 4:313–327, 2016.
 Kirchhoff (1847) Kirchhoff, G. Ueber die auflösung der gleichungen, auf welche man bei der untersuchung der linearen vertheilung galvanischer ströme geführt wird. Annalen der Physik, 148(12):497–508, 1847.
 Koo et al. (2007) Koo, T., Globerson, A., Carreras Pérez, X., and Collins, M. Structured prediction models via the matrix-tree theorem. In EMNLP, 2007.
 Krishnan et al. (2015) Krishnan, R. G., Lacoste-Julien, S., and Sontag, D. Barrier Frank-Wolfe for marginal inference. In NIPS, 2015.
 Kschischang et al. (2001) Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. Factor graphs and the sum-product algorithm. IEEE T. Inform. Theory, 47(2):498–519, 2001.
 Kuhn (1955) Kuhn, H. W. The Hungarian method for the assignment problem. Nav. Res. Log., 2(1–2):83–97, 1955.
 Lacoste-Julien & Jaggi (2015) Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS, 2015.
 Lafferty et al. (2001) Lafferty, J. D., McCallum, A., and Pereira, F. C. N. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
 Li & Eisner (2009) Li, Z. and Eisner, J. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In EMNLP, 2009.
 Liu & Lapata (2018) Liu, Y. and Lapata, M. Learning structured text representations. TACL, 6:63–75, 2018.
 Martins & Astudillo (2016) Martins, A. F. and Astudillo, R. F. From softmax to sparsemax: A sparse model of attention and multilabel classification. In ICML, 2016.
 Martins et al. (2009) Martins, A. F., Smith, N. A., and Xing, E. P. Concise integer linear programming formulations for dependency parsing. In ACL-IJCNLP, 2009.
 Martins et al. (2015) Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., and Xing, E. P. AD3: Alternating directions dual decomposition for MAP inference in graphical models. JMLR, 16(1):495–545, 2015.
 McDonald et al. (2005) McDonald, R., Crammer, K., and Pereira, F. Online large-margin training of dependency parsers. In ACL, 2005.
 Mena et al. (2018) Mena, G., Belanger, D., Linderman, S., and Snoek, J. Learning latent permutations with Gumbel-Sinkhorn networks. In ICLR, 2018.
 Mensch & Blondel (2018) Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In ICML, 2018.
 Meshi et al. (2015) Meshi, O., Mahdavi, M., and Schwing, A. Smooth and strong: MAP inference with linear convergence. In NIPS, 2015.
 Neubig et al. (2017) Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. DyNet: The dynamic neural network toolkit. preprint arXiv:1701.03980, 2017.
 Niculae & Blondel (2017) Niculae, V. and Blondel, M. A regularized framework for sparse and structured neural attention. In NIPS, 2017.
 Nivre et al. (2016) Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N., et al. Universal Dependencies v1: A multilingual treebank collection. In LREC, 2016.
 Nocedal & Wright (1999) Nocedal, J. and Wright, S. Numerical Optimization. Springer New York, 1999.
 Nowozin et al. (2014) Nowozin, S., Gehler, P. V., Jancsary, J., and Lampert, C. H. Advanced Structured Prediction. MIT Press, 2014.
 Rabiner (1989) Rabiner, L. R. A tutorial on Hidden Markov Models and selected applications in speech recognition. P. IEEE, 77(2):257–286, 1989.
 Ravikumar et al. (2010) Ravikumar, P., Agarwal, A., and Wainwright, M. J. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. JMLR, 11:1043–1080, 2010.
 Shalev-Shwartz & Zhang (2016) Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program., 155(1):105–145, 2016.
 Smith & Smith (2007) Smith, D. A. and Smith, N. A. Probabilistic models of non-projective dependency trees. In EMNLP, 2007.
 Smith (2011) Smith, N. A. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, May 2011.
 Stoyanov et al. (2011) Stoyanov, V., Ropson, A., and Eisner, J. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.
 Straka & Straková (2017) Straka, M. and Straková, J. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL Shared Task, 2017.
 Taskar (2004) Taskar, B. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2004.
 Taskar et al. (2003) Taskar, B., Guestrin, C., and Koller, D. Max-Margin Markov Networks. In NIPS, 2003.
 Tsochantaridis et al. (2004) Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
 Valiant (1979) Valiant, L. G. The complexity of computing the permanent. Theor. Comput. Sci., 8(2):189–201, 1979.
 Vinyes & Obozinski (2017) Vinyes, M. and Obozinski, G. Fast column generation for atomic norm regularization. In AISTATS, 2017.
 Wainwright & Jordan (2008) Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1–2):1–305, 2008.
 Williams et al. (2018) Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2018.
 Wolfe (1976) Wolfe, P. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
 Yang & Cardie (2013) Yang, B. and Cardie, C. Joint inference for fine-grained opinion extraction. In ACL, 2013.
 Yanover & Weiss (2004) Yanover, C. and Weiss, Y. Finding the most probable configurations using loopy belief propagation. In NIPS, 2004.
 Zeman et al. (2017) Zeman, D., Popel, M., Straka, M., Hajic, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., et al. CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. CoNLL, 2017.
Appendix A Implementation Details for Solvers
A.1 Conditional Gradient Variants
We adapt the presentation of vanilla, away-step and pairwise conditional gradient of Lacoste-Julien & Jaggi (2015).
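Recall the SparseMAP optimization problem (Equation 5), which we rewrite below as a minimization, to align with the formulation of Lacoste-Julien & Jaggi (2015):
$$\operatorname{SparseMAP}_{\mathcal{A}}(\boldsymbol{\eta}_U, \boldsymbol{\eta}_F) = \operatorname*{arg\,min}_{[\mathbf{u}, \mathbf{v}] \in \mathcal{M}_{\mathcal{A}}} f(\mathbf{u}, \mathbf{v}), \quad \text{where} \quad f(\mathbf{u}, \mathbf{v}) := \tfrac{1}{2}\|\mathbf{u}\|_2^2 - \boldsymbol{\eta}_U^\top \mathbf{u} - \boldsymbol{\eta}_F^\top \mathbf{v}.$$
The gradients of the objective function $f$ w.r.t. the two variables are
$$\nabla_{\mathbf{u}} f(\mathbf{u}', \mathbf{v}') = \mathbf{u}' - \boldsymbol{\eta}_U, \qquad \nabla_{\mathbf{v}} f(\mathbf{u}', \mathbf{v}') = -\boldsymbol{\eta}_F.$$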
The ingredients required to apply conditional gradient algorithms are solving a linear minimization problem, selecting the away step, computing the Wolfe gap, and performing line search.
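Linear minimization problem.
For SparseMAP, this amounts to a MAP inference call, since
$$\operatorname*{arg\,min}_{[\mathbf{u}, \mathbf{v}] \in \mathcal{M}_{\mathcal{A}}} \langle \nabla_{\mathbf{u}} f(\mathbf{u}', \mathbf{v}'), \mathbf{u} \rangle + \langle \nabla_{\mathbf{v}} f(\mathbf{u}', \mathbf{v}'), \mathbf{v} \rangle = \operatorname*{arg\,max}_{[\mathbf{u}, \mathbf{v}] \in \mathcal{M}_{\mathcal{A}}} (\boldsymbol{\eta}_U - \mathbf{u}')^\top \mathbf{u} + \boldsymbol{\eta}_F^\top \mathbf{v} = \{ [\mathbf{m}_s, \mathbf{n}_s] : s \in \operatorname{MAP}_{\mathcal{A}}(\boldsymbol{\eta}_U - \mathbf{u}', \boldsymbol{\eta}_F) \},$$
where we assume $\operatorname{MAP}_{\mathcal{A}}$ yields the set of maximally-scoring structures.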
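As a minimal sketch of this step, assume a toy setting in which the structures are few enough to enumerate, each given as a pair of indicator vectors (unary part, additional part), and in which the linearized objective at the current unary estimate has gradient "current estimate minus unary scores" for the first block and "minus additional scores" for the second. The function names `map_oracle` and `linear_minimizer` and the toy data are hypothetical; a real implementation would replace the brute-force search with a combinatorial MAP solver.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def map_oracle(structures, score_u, score_v):
    """Brute-force MAP over an explicit list of (m_s, n_s) pairs.

    Stand-in for a real MAP solver; returns one highest-scoring
    structure (ties are broken by list order).
    """
    return max(structures, key=lambda s: dot(score_u, s[0]) + dot(score_v, s[1]))


def linear_minimizer(structures, eta_u, eta_v, u_cur):
    """Minimize the linearized objective at u_cur.

    This is a MAP call with the unary scores shifted to eta_u - u_cur.
    """
    shifted = [e - u for e, u in zip(eta_u, u_cur)]
    return map_oracle(structures, shifted, eta_v)


# Toy problem: three structures with one-hot unary indicators.
structures = [([1, 0, 0], [1]), ([0, 1, 0], [0]), ([0, 0, 1], [0])]
eta_u, eta_v = [1.0, 0.8, 0.1], [0.1]

print(linear_minimizer(structures, eta_u, eta_v, [0.0, 0.0, 0.0]))
print(linear_minimizer(structures, eta_u, eta_v, [1.0, 0.0, 0.0]))
```

In this toy run, the first call selects the structure with the largest raw score, while the second call, started at that structure's vertex, selects a different one because the quadratic term penalizes mass already placed on the first coordinate.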
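Away step selection.
This step involves searching the currently selected structures in the active set $\mathcal{I}$ with the opposite goal: finding the structure maximizing the linearization
$$\operatorname*{arg\,max}_{s \in \mathcal{I}} \langle \nabla_{\mathbf{u}} f(\mathbf{u}', \mathbf{v}'), \mathbf{m}_s \rangle + \langle \nabla_{\mathbf{v}} f(\mathbf{u}', \mathbf{v}'), \mathbf{n}_s \rangle = \operatorname*{arg\,max}_{s \in \mathcal{I}} (\mathbf{u}' - \boldsymbol{\eta}_U)^\top \mathbf{m}_s - \boldsymbol{\eta}_F^\top \mathbf{n}_s.$$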
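The away-step search can be sketched as follows, assuming the active set is stored as an explicit list of (unary, additional) indicator pairs and that the gradient of the objective is "current unary estimate minus unary scores" in the first block and "minus additional scores" in the second. The name `away_structure` and the toy data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def away_structure(active_set, eta_u, eta_v, u_cur):
    """Return the active structure that maximizes the linearized objective.

    This is the structure whose weight an away step would decrease.
    """
    grad_u = [u - e for u, e in zip(u_cur, eta_u)]

    def linearization(s):
        m_s, n_s = s
        # <grad_u, m_s> + <grad_v, n_s>, with grad_v = -eta_v
        return dot(grad_u, m_s) - dot(eta_v, n_s)

    return max(active_set, key=linearization)


# Current iterate: an even mixture of two one-hot structures.
active_set = [([1, 0, 0], [1]), ([0, 1, 0], [0])]
eta_u, eta_v = [1.0, 0.8, 0.1], [0.1]
u_cur = [0.5, 0.5, 0.0]

print(away_structure(active_set, eta_u, eta_v, u_cur))
```

Unlike the linear minimization step, this search only ranges over structures already in the active set, so no MAP call is needed.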
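Wolfe gap.
The gap at a point $\mathbf{d} = [\mathbf{d}_u, \mathbf{d}_v]$ is given by
$$\operatorname{gap}(\mathbf{d}, \mathbf{u}') := \langle -\nabla_{\mathbf{u}} f(\mathbf{u}', \mathbf{v}'), \mathbf{d}_u \rangle + \langle -\nabla_{\mathbf{v}} f(\mathbf{u}', \mathbf{v}'), \mathbf{d}_v \rangle = \langle \boldsymbol{\eta}_U - \mathbf{u}', \mathbf{d}_u \rangle + \langle \boldsymbol{\eta}_F, \mathbf{d}_v \rangle. \tag{7}$$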
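A short sketch of the gap computation, assuming the negative gradient is "unary scores minus current unary estimate" in the first block and "additional scores" in the second, and that the direction is supplied as two lists. The name `wolfe_gap` and the toy numbers are hypothetical; the gap is typically compared against a tolerance as a stopping criterion.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def wolfe_gap(eta_u, eta_v, u_cur, d_u, d_v):
    """Inner product of the negative gradient with the direction [d_u, d_v]."""
    neg_grad_u = [e - u for e, u in zip(eta_u, u_cur)]
    return dot(neg_grad_u, d_u) + dot(eta_v, d_v)


eta_u, eta_v = [1.0, 0.8, 0.1], [0.1]

# Direction from the vertex ([1, 0, 0], [1]) toward the vertex ([0, 1, 0], [0]).
u_cur, v_cur = [1.0, 0.0, 0.0], [1.0]
target_u, target_v = [0.0, 1.0, 0.0], [0.0]
d_u = [t - u for t, u in zip(target_u, u_cur)]
d_v = [t - v for t, v in zip(target_v, v_cur)]

print(wolfe_gap(eta_u, eta_v, u_cur, d_u, d_v))
```

A strictly positive gap along the chosen direction indicates that the objective can still be decreased by moving that way; a gap near zero for the best direction indicates convergence.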
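Line search.
Once we have picked a direction $\mathbf{d} = [\mathbf{d}_u, \mathbf{d}_v]$, we can pick the optimal step size by solving a simple optimization problem. Let $\mathbf{u}_\gamma := \mathbf{u}' + \gamma \mathbf{d}_u$, and $\mathbf{v}_\gamma := \mathbf{v}' + \gamma \mathbf{d}_v$. We seek $\gamma$ so as to optimize
$$\operatorname*{arg\,min}_{\gamma \in [0, \gamma_{\max}]} f(\mathbf{u}_\gamma, \mathbf{v}_\gamma).$$
Setting the gradient w.r.t. $\gamma$ to $0$ yields
$$0 = \frac{\partial}{\partial \gamma} f(\mathbf{u}_\gamma, \mathbf{v}_\gamma) = \langle \mathbf{d}_u, \nabla_{\mathbf{u}} f(\mathbf{u}_\gamma, \mathbf{v}_\gamma) \rangle + \langle \mathbf{d}_v, \nabla_{\mathbf{v}} f(\mathbf{u}_\gamma, \mathbf{v}_\gamma) \rangle = \langle \mathbf{d}_u, \mathbf{u}' + \gamma \mathbf{d}_u - \boldsymbol{\eta}_U \rangle - \langle \mathbf{d}_v, \boldsymbol{\eta}_F \rangle = \gamma \|\mathbf{d}_u\|_2^2 + \mathbf{u}'^\top \mathbf{d}_u - \boldsymbol{\eta}_U^\top \mathbf{d}_u - \boldsymbol{\eta}_F^\top \mathbf{d}_v.$$
We may therefore compute the optimal step size as
$$\gamma = \max\left(0, \min\left(\gamma_{\max}, \frac{-\mathbf{u}'^\top \mathbf{d}_u + \boldsymbol{\eta}_U^\top \mathbf{d}_u + \boldsymbol{\eta}_F^\top \mathbf{d}_v}{\|\mathbf{d}_u\|_2^2}\right)\right). \tag{8}$$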
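The closed-form step size can be sketched as below, assuming the objective is one half of the squared norm of the unary block minus the linear score terms, so that it is quadratic along any direction. The names `objective` and `optimal_step` are hypothetical, and the handling of a direction with zero unary component is an added assumption, not something specified in the text.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def objective(eta_u, eta_v, u, v):
    """Assumed minimization objective: 0.5 * ||u||^2 - <eta_u, u> - <eta_v, v>."""
    return 0.5 * dot(u, u) - dot(eta_u, u) - dot(eta_v, v)


def optimal_step(eta_u, eta_v, u_cur, d_u, d_v, gamma_max=1.0):
    """Exact line search along [d_u, d_v], clipped to [0, gamma_max]."""
    sq_norm = dot(d_u, d_u)
    # Negative directional derivative at gamma = 0.
    slope = dot(eta_u, d_u) - dot(u_cur, d_u) + dot(eta_v, d_v)
    if sq_norm == 0:
        # Assumption: the objective is linear along this direction, so the
        # minimizer sits at an endpoint of the interval.
        return gamma_max if slope > 0 else 0.0
    return max(0.0, min(gamma_max, slope / sq_norm))


eta_u, eta_v = [1.0, 0.8, 0.1], [0.1]
u_cur, v_cur = [1.0, 0.0, 0.0], [1.0]
d_u, d_v = [-1.0, 1.0, 0.0], [-1.0]

gamma = optimal_step(eta_u, eta_v, u_cur, d_u, d_v)
u_new = [u + gamma * d for u, d in zip(u_cur, d_u)]
v_new = [v + gamma * d for v, d in zip(v_cur, d_v)]
print(gamma, objective(eta_u, eta_v, u_new, v_new))
```

Because the objective is a convex quadratic in the step size, clipping the unconstrained minimizer to the feasible interval gives the exact constrained minimizer, so no iterative line search is needed.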
A.2 The Active Set Algorithm
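We use a variant of the active set algorithm (Nocedal & Wright, 1999, Ch. 16.4 & 16.5) as proposed for the quadratic subproblems of the AD$^3$ algorithm; our presentation follows Martins et al. (2015, Algorithm 3). At each step, the active set algorithm solves a relaxed variant of the SparseMAP QP, relaxing the non-negativity constraint on $\mathbf{y}$, and restricting the solution to the current active set $\mathcal{I}$:
$$\operatorname*{minimize}_{\mathbf{y}_{\mathcal{I}} \in \mathbb{R}^{|\mathcal{I}|}} \; \tfrac{1}{2}\|\mathbf{u}\|_2^2 - \boldsymbol{\eta}_U^\top \mathbf{u} - \boldsymbol{\eta}_F^\top \mathbf{v} \quad \text{subject to} \quad \mathbf{u} = \mathbf{M}_{\mathcal{I}} \mathbf{y}_{\mathcal{I}}, \quad \mathbf{v} = \mathbf{N}_{\mathcal{I}} \mathbf{y}_{\mathcal{I}}, \quad \mathbf{1}^\top \mathbf{y}_{\mathcal{I}} = 1,$$
whose solution can be found by solving the KKT system
$$\begin{bmatrix} \mathbf{M}_{\mathcal{I}}^\top \mathbf{M}_{\mathcal{I}} & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix} \begin{bmatrix} \mathbf{y}_{\mathcal{I}} \\ \tau \end{bmatrix} = \begin{bmatrix} \mathbf{M}_{\mathcal{I}}^\top \boldsymbol{\eta}_U + \mathbf{N}_{\mathcal{I}}^\top \boldsymbol{\eta}_F \\ 1 \end{bmatrix}. \tag{9}$$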
At each iteration, the (symmetric) design matrix in Equation 9 is updated by adding or removing a row and a column; therefore its inverse (or a decomposition) may be efficiently maintained and updated.
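A minimal sketch of one such solve, assuming the KKT system has the Gram matrix of the active unary indicator vectors bordered by a row and column of ones, with right-hand side given by each active structure's score followed by 1. For simplicity it rebuilds and solves the system from scratch with a small Gaussian elimination instead of maintaining an updated inverse or decomposition, and it assumes the system is nonsingular. The names `solve_linear_system` and `solve_kkt` are hypothetical; in practice a linear algebra library would be used.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting (assumes a is nonsingular)."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def solve_kkt(active_set, eta_u, eta_v):
    """Solve the equality-constrained QP restricted to the active set.

    Returns the weights over active structures (possibly negative, since
    non-negativity is relaxed) and the multiplier of the sum-to-one constraint.
    """
    k = len(active_set)
    gram = [[dot(s[0], t[0]) for t in active_set] for s in active_set]
    kkt = [gram[i] + [1.0] for i in range(k)] + [[1.0] * k + [0.0]]
    rhs = [dot(s[0], eta_u) + dot(s[1], eta_v) for s in active_set] + [1.0]
    solution = solve_linear_system(kkt, rhs)
    return solution[:k], solution[k]


active_set = [([1, 0, 0], [1]), ([0, 1, 0], [0])]
weights, tau = solve_kkt(active_set, [1.0, 0.8, 0.1], [0.1])
print(weights, tau)
```

Since only one row and column change between iterations, an efficient implementation would update a factorization incrementally rather than re-solve as done here.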
Line search.
Appendix B Computing the Jacobian: Proof of Proposition 1
Recall that SparseMAP is defined as the point that maximizes the value of the quadratic program (Equation 5),