1 Introduction
The toolbox of popular methods in computer science currently sees a split into two major components. On the one hand, there are classical algorithmic techniques from discrete optimization – graph algorithms, SATsolvers, integer programming solvers – often with heavily optimized implementations and theoretical guarantees on runtime and performance. On the other hand, there is the realm of deep learning allowing datadriven feature extraction as well as the flexible design of endtoend architectures. The fusion of deep learning with combinatorial optimization is desirable both for foundational reasons – extending the reach of deep learning to data with large combinatorial complexity – and in practical applications. These often occur for example in computer vision problems that require solving a combinatorial subtask on top of features extracted from raw input such as establishing global consistency in multiobject tracking from a sequence of frames.
The fundamental problem with constructing hybrid architectures is differentiability of the combinatorial components. Stateoftheart approaches pursue the following paradigm: introduce suitable approximations or modifications of the objective function or of a baseline algorithm that eventually yield a differentiable computation. The resulting algorithms are often suboptimal in terms of runtime, performance and optimality guarantees when compared to their unmodified counterparts. While the sources of suboptimality vary from example to example, there is a common theme: any differentiable algorithm in particular outputs continuous values and as such it solves a relaxation of the original problem. It is wellknown in combinatorial optimization theory that even strong and practical convex relaxations induce lower bounds on the approximation ratio for large classes of problems [38, 48] which makes them inherently suboptimal. This inability to incorporate the best implementations of the best algorithms is unsatisfactory.
In this paper, we propose a method that, at the cost of one hyperparameter, implements a backward pass for a
blackbox implementation of a combinatorial algorithm or a solver that optimizes a linear objective function. This effectively turns the algorithm or solver into a composable building block of neural network architectures, as illustrated in Fig. 1. Suitable problems with linear objective include classical problems such as shortestpath, travelingsalesman (TSP), mincostperfectmatching, various cut problems as well as entire frameworks such as integer programs (IP), Markov random fields (MRF) and conditional random fields (CRF).The main technical challenge boils down to providing an informative gradient of a piecewise constant function. To that end, we are able to heavily leverage the minimization structure of the underlying combinatorial problem and efficiently compute a gradient of a continuous interpolation. While the roots of the method lie in lossaugmented inference, the employed mathematical technique for continuous interpolation is novel. The computational cost of the introduced
backward pass matches the cost of the forward pass. In particular, it also amounts to one call to the solver.In experiments, we train architectures that contain unmodified implementations of the following efficient combinatorial algorithms: generalpurpose mixedinteger programming solver Gurobi [19], stateoftheart C implementation of mincostperfectmatching algorithm – Blossom V [25] and Dijkstra’s algorithm [11] for shortestpath. We demonstrate that the resulting architectures train without sophisticated tweaks and are able to solve tasks that are beyond the capabilities of conventional neural networks.
2 Related Work
Multiple lines of work lie at the intersection of combinatorial algorithms and deep learning. We primarily distinguish them by their motivation.
Motivated by applied problems.
Even though computer vision has seen a substantial shift from combinatorial methods to deep learning, some problems still have a strong combinatorial aspect and require hybrid approaches. Examples include multiobject tracking [42], semantic segmentation [8]
, multiperson pose estimation
[36, 45], stereo matching [24] and person reidentification [53]. The combinatorial algorithms in question are typically Markov random fields (MRF) [9], conditional random fields (CRF) [32], graph matching [53] or integer programming [42]. In recent years, a plethora of hybrid endtoend architectures have been proposed. The techniques used for constructing the backward pass range from employing various relaxations and approximations of the combinatorial problem [9, 54] over differentiating a fixed number of iterations of an iterative solver [35, 49, 29] all the way to relying on the structured SVM framework [50, 9].Motivated by “bridging the gap”.
Building links between combinatorics and deep learning can also be viewed as a foundational problem; for example, [6]
advocate that “combinatorial generalization must be a top priority for AI”. One such line of work focuses on designing architectures with algorithmic structural prior – for example by mimicking the layout of a Turing machine
[47, 51, 17, 18] or by promoting behaviour that resembles messagepassing algorithms as it is the case in Graph Neural Networks and related architectures [40, 27, 6]. Another approach is to provide neural network building blocks that are specialized to solve some types of combinatorial problems such as satisfiability (SAT) instances [52]. Some works have directly addressed the question of learning combinatorial optimization algorithms such as the travelingsalesmanproblem in [7].There are also efforts to bridge the gap in the opposite direction; to use deep learning methods to improve stateoftheart combinatorial solvers, typically by learning (otherwise handcrafted) heuristics. Some works have again targeted the
travelingsalesmanproblem [26, 10, 7] as well as other NPHard problems [28]. Also, more general solvers received some attention; this includes SATsolvers [43, 44], integer programming solvers [15] and SMTsolvers (satisfiability modulo theories)[5].3 Method
Let us first formalize the notion of a combinatorial solver. We expect the solver to receive continuous input (e.g. edge weights of a fixed graph) and return discrete output from some finite set (e.g. all traveling salesman tours on a fixed graph) that minimizes some cost (e.g. length of the tour). More precisely, the solver maps
(1) 
We will restrict ourselves to objective functions that are linear , namely may be represented as
(2) 
in which is an injective representation of in . For brevity, we omit the mapping and instead treat elements of as discrete points in .
Note that such definition of a solver is still very general as there are no assumptions on the set of constraints or on the structure of the output space .
Example 1 (Encoding shortestpath problem).
If is a given graph with vertices , the combinatorial solver for the shortestpath would take edge weights as input and produce the shortest path represented as
an indicator vector of the selected edges. The cost function is then indeed the inner product
.The task to solve during backpropagation is the following. We receive the gradient of the global loss with respect to solver output at a given point . We are expected to return , the gradient of the loss with respect to solver input at a point .
Since is finite, there are only finitely many values of . In other words, this function of is piecewise constant and the gradient is identically zero or does not exist (at points of jumps). This should not come as a surprise; if one does a small perturbation to edge weights of a graph, one usually does not change the optimal TSP tour and on rare occasions alters it drastically. This has an important consequence:
The fundamental problem with differentiating through combinatorial solvers is not the lack of differentiability; the gradient exists almost everywhere. However, this gradient is a constant zero and as such is unhelpful for optimization.
Accordingly, we will not rely on standard techniques for gradient estimation (see [33] for a comprehensive survey).
First, we simplify the situation by considering the linearization of at the point . Then for
and therefore it suffices to focus on differentiating the piecewise constant function
If the piecewise constant function at hand was arbitrary, we would be forced to use zeroorder gradient estimation techniques such as computing finite differences. These require prohibitively many function evaluations particularly for highdimensional problems.
However, the function is a result of a minimization process and it is known that for smooth spaces there are techniques for such “differentiation through argmin” [41, 39, 14, 12, 4, 3]. It turns out to be possible to build – with different mathematical tools – a viable discrete analogy. In particular, we can efficiently construct a function , a continuous interpolation of , whose gradient we return (see Fig. 2). The hyperparameter controls the tradeoff between “informativeness of the gradient” and “faithfulness to the original function”.
Before diving into the formalization, we present the final algorithm as listed in Algo. 1. It is simple to implement and the backward pass indeed only runs the solver once on modified input. Providing the justification, however, is not straightforward, and it is the subject of the rest of the section.
3.1 Construction and Properties of
Before we give the exact definition of the function , we formulate several requirements on it. This will help us understand why is a reasonable replacement for and, most importantly, why its gradient captures changes in the values of .
Property A1.
For each , is continuous and piecewise affine.
The second property describes the tradeoff induced by changing the value of . For , we define sets and as the sets where and coincide and where they differ, i.e.
Property A2.
The sets are monotone in and they vanish as , i.e.
In other words, Property A2 tells us that controls the size of the set where deviates from and where has meaningful gradient.
In the third and final property, we want to capture the interpolation behavior of . For that purpose, we define a interpolator of . We say that , defined on a set , is a interpolator of between and , if

is nonconstant affine function;

the image is an interval with endpoints and ;

attains the boundary values and at most far away from where does. In particular, there is a point for which and , where , for .
In the special case of a 0interpolator , the graph of connects (in a topological sense) two components of the graph of . In the general case, measures displacement of the interpolator (see also Fig. 2 for some examples). This displacement on the one hand loosens the connection to but on the other hand allows for less local interpolation which might be desirable.
Property A3.
The function consists of finitely many (possibly incomplete) interpolators of on where for some fixed . Equivalently, the displacement is linearly controlled by .
For defining the function , we need a solution of a perturbed optimization problem
(3) 
Theorem 1.
Let . The function defined by
(4) 
satisfies Properties A1, A2, A3.
Let us remark that already the continuity of is not apparent from its definition as the first term is still a piecewise constant function. Proof of this result, along with geometrical description of , can be found in section A.2. Fig. 3 visualizes for different values if .
Now, since is ensured to be differentiable, we have
(5) 
The second equality then holds due to (2). We then return as a loss gradient.
Remark 1.
The roots of the method we propose lie in lossaugmented inference. In fact, the update rule from (5) (but not the function or any of its properties) was already proposed in a different context in [21, 46] and was later used in [30, 34]. The main difference to our work is that only the case of is recommended and studied, which in our situation computes the correct but uninformative zero gradient. Our analysis implies that larger values of are not only sound but even preferable. This will be seen in experiments where we use values .
3.2 Efficient Computation of
Computing in (3) is the only potentially expensive part of evaluating (5). However, the linear interplay of the cost function and the gradient trivially gives a resolution.
Proposition 1.
Let be fixed. If we set , we can compute as
In other words, is the output of calling the solver on input .
4 Experiments
In this section, we experimentally validate a proof of concept: that architectures containing exact blackbox solvers (with backward pass provided by Algo. 1) can be trained by standard methods.
To that end, we solve three synthetic tasks as listed in Tab. 1. These tasks are designed to mimic practical examples from Section 2 and solving them anticipates a twostage process: 1) extract suitable features from raw input, 2) solve a combinatorial problem over the features. The dimensionalities of input and of intermediate representations also aim to mirror practical problems and are chosen to be prohibitively large for zeroorder gradient estimation methods. Guidelines of setting the hyperparameter are given in section A.1. The source code and datasets will be made public.
We include the performance of ResNet18 [22] as a sanity check to demonstrate that the constructed datasets are too complex for standard architectures.
Remark 2.
The included solvers have very efficient implementations and do not severely impact runtime. All models train in under two hours on a single machine with 1 GPU and no more than 24 utilized CPU cores. Only for the large TSP problems the solver’s runtime dominates.
Graph Problem  Solver  Solver instance size  Input format 
Shortest path  Dijkstra  up to vertices  (image) up to 
Min Cost PM  Blossom V  up to edges  (image) up to 
Traveling Salesman  Gurobi  up to edges  up to images () 
4.1 Warcraft Shortest Path
Problem input and output.
The training dataset for problem SP consists of 10000 examples of randomly generated images of terrain maps from the Warcraft II tileset [20]. The maps have an underlying grid of dimension where each vertex represents a terrain with a fixed cost that is unknown to the network. The shortest (minimum cost) path between top left and bottom right vertices is encoded as an indicator matrix and serves as a label (see also Fig. 4). We consider datasets SP for . More experimental details are provided in section A.3.
Architecture.
An image of the terrain map is presented to a convolutional neural network which outputs a
grid of vertex costs. These costs are then the input to the Dijkstra algorithm to compute the predicted shortest path for the respective map. The loss used for computing the gradient update is the Hamming distance between the true shortest path and the predicted shortest path.Embedding Dijkstra  ResNet18  
Train %  Test %  Train %  Test %  
12  
18  
24  
30 
Reported is the accuracy, i.e. percentage of paths with the optimal costs. Standard deviations are over five restarts.
Results.
Our method learns to predict the shortest paths with high accuracy and generalization capability, whereas the ResNet18 baseline unsurprisingly fails to generalize already for small grid sizes of . Since the shortest paths in the maps are often nonunique (i.e. there are multiple shortest paths with the same cost), we report the percentage of shortest path predictions that have optimal cost. The results are summarized in Tab. 2.
4.2 Globe Traveling Salesman Problem
Problem input and output.
The training dataset for problem TSP consists of 10000 examples where the input for each example is a element subset of fixed 100 country flags and the label is the shortest traveling salesman tour through the capitals of the corresponding countries. The optimal tour is represented by its adjacency matrix (see also Fig. 5). We consider datasets TSP for .
Architecture.
Each of the flags is presented to a convolutional network that produces threedimensional vectors. These vectors are projected onto the unit sphere in ; a representation of the globe. The TSP solver receives a matrix of pairwise distances of the computed locations. The loss of the network is the Hamming distance between the true and the predicted TSP adjacency matrix. The architecture is expected to learn the correct representations of the flags (i.e. locations of the respective countries’ capitals on Earth, up to rotations of the sphere). The employed Gurobi solver optimizes a mixedinteger programming formulation of TSP using the cutting plane method [31] for lazy subtour elimination.
Embedding TSP Solver  ResNet18  
Train %  Test %  Train %  Test %  
5  
10  
20  
40 
Results.
This architecture not only learns to extract the correct TSP tours but also learns the correct representations. Quantitative evidence is presented in Tab. 3, where we see that the learned locations generalize well and lead to correct TSP tours also on the test set and also on somewhat large instances (note that there are admissible TSP tours for ). The baseline architecture only memorizes the training set. Additionally, we can extract the suggested locations of world capitals and compare them with reality. To that end, we present Fig. 4(b), where the learned locations of 10 capitals in Southeast Asia are displayed.
4.3 MNIST Mincost Perfect Matching
Problem input and output.
The training dataset for problem PM consists of 10000 examples where the input to each example is a set of digits drawn from the MNIST dataset arranged in a grid. For computing the label, we consider the underlying grid graph (without diagonal edges) and solve a mincostperfectmatching problem, where edge weights are given simply by reading the two vertex digits as a twodigit number (we read downwards for vertical edges and from left to right for horizontal edges). The optimal perfect matching (i.e. the label) is encoded by an indicator vector for the subset of the selected edges, see example in Fig. 6.
Architecture.
The grid image is the input of a convolutional neural network which outputs a grid of vertex weights. These weights are transformed into edge weights as described above and given to the solver. The loss function is Hamming distance between solver output and the true label.
Embedding Blossom V  ResNet18  
Train %  Test %  Train %  Test %  
4  
8  
16  
24 
Results.
The architecture containing the solver is capable of good generalizations suggesting that the correct representation is learned. The performance is good even on larger instances and despite the presence of noise in supervision – often there are many optimal matchings. In contrast, the ResNet18 baseline only achieves reasonable performance for the simplest case PM. The results are summarized in Tab. 4.
5 Discussion
We provide a unified mathematically sound algorithm to embed combinatorial algorithms into neural networks. Its practical implementation is straightforward and training succeeds with standard deep learning techniques. The two main branches of future work are: 1) exploring the potential of newly enabled architectures, 2) addressing standing realworld problems. The latter case requires embedding approximate solvers (that are common in practice). This breaks some of our theoretical guarantees but given their strong empirical performance, the fusion might still work well in practice.
Acknowledgement
We thank the International Max Planck Research School for Intelligent Systems (IMPRSIS) for supporting Marin Vlastelica. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). Additionally, we would like to thank Paul Swoboda and Alexander Kolesnikov for valuable feedback on an early version of the manuscript.
References
 [1]
 [2] Google’s ortools, 2019, https://developers.google.com/optimization/.
 [3] B. Amos and J. Z. Kolter. Optnet: Differentiable optimization as a layer in neural networks. 2017, http://arxiv.org/abs/1703.00443.

[4]
B. Amos, L. Xu, and J. Z. Kolter.
Input convex neural networks.
In
34th International Conference on Machine Learning (ICML’17)
, pages 146–155. JMLR, 2017.  [5] M. Balunovic, P. Bielik, and M. Vechev. Learning to solve SMT formulas. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10317–10328. Curran Associates, Inc., 2018, http://papers.nips.cc/paper/8233learningtosolvesmtformulas.pdf.
 [6] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. J. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. 2018, http://arxiv.org/abs/1806.01261.

[7]
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio.
Neural combinatorial optimization with reinforcement learning.
In 5th International Conference on Learning Representations, ICLR 2017, Workshop Track Proceedings, 2017, http://openreview.net/forum?id=Bk9mxlSFx.  [8] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(04):834–848, 2018. doi: 10.1109/TPAMI.2017.2699184.
 [9] L.C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning deep structured models. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML’15, pages 1785–1794. JMLR, 2015, http://dl.acm.org/citation.cfm?id=3045118.3045308.
 [10] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.M. Rousseau. Learning heuristics for the tsp by policy gradient. In W.J. van Hoeve, editor, Proc. of Intl. Conf. on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 170–181. Springer, 2018.
 [11] E. W. Dijkstra. A note on two problems in connexion with graphs. Numer. Math., 1(1):269–271, Dec. 1959. doi: 10.1007/BF01386390.
 [12] J. Domke. Generic methods for optimizationbased modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
 [13] J. Edmonds. Paths, trees, and flowers. Canad. J. Math., 17:449–467, 1965, www.cs.berkeley.edu/~christos/classics/edmonds.ps.
 [14] C.s. Foo, C. B. Do, and A. Y. Ng. Efficient multiple hyperparameter learning for loglinear models. In Advances in neural information processing systems, pages 377–384, 2008.
 [15] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. 2019, http://arxiv.org/abs/1906.01629.
 [16] J. C. Gower and G. B. Dijksterhuis. Procrustes problems, volume 30 of Oxford Statistical Science Series. Oxford University Press, Oxford, UK, January 2004.
 [17] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. 2014, http://arxiv.org/abs/1410.5401.
 [18] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. GrabskaBarwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, Oct. 2016. doi: 10.1038/nature20101.
 [19] L. Gurobi Optimization. Gurobi optimizer reference manual, 2019, http://www.gurobi.com.

[20]
J. Guyomarch.
Warcraft ii opensource map editor, 2017,
http://github.com/war2/war2edit.  [21] T. Hazan, J. Keshet, and D. A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems 23, pages 1594–1602. Curran Associates, Inc., 2010, http://papers.nips.cc/paper/4069directlossminimizationforstructuredprediction.pdf.

[22]
K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, June 2016.  [23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014. cite arxiv:1412.6980Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
 [24] P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock. Endtoend training of hybrid cnncrf models for stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), July 2017.
 [25] V. Kolmogorov. Blossom V: a new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation, 1(1):43–67, Jul 2009. doi: 10.1007/s1253200900028.
 [26] W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations (ICLR’19), 2019, http://openreview.net/forum?id=ByxBFsRqYm.
 [27] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR’16), 2016, http://arxiv.org/abs/1511.05493.
 [28] Z. Li, Q. Chen, and V. Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, NeurIPS’18, pages 537–546, USA, 2018. Curran Associates Inc., http://dl.acm.org/citation.cfm?id=3326943.3326993.
 [29] Z. Liu, X. Li, P. Luo, C.C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In IEEE International Conference on Computer Vision, ICCV’15, pages 1377–1385. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.162.
 [30] G. Lorberbom, A. Gane, T. S. Jaakkola, and T. Hazan. Direct optimization through arg max for discrete variational autoencoder. 2018, http://arxiv.org/abs/1806.02867.
 [31] H. Marchand, A. Martin, R. Weismantel, and L. Wolsey. Cutting planes in integer and mixed integer programming. Discrete Appl. Math., 123(13):397–446, Nov. 2002. doi: 10.1016/S0166218X(01)003481.
 [32] D. Marin, M. Tang, I. B. Ayed, and Y. Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19), June 2019.
 [33] S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte carlo gradient estimation in machine learning. 2019, http://arxiv.org/abs/1906.10652.
 [34] P. Mohapatra, M. Rolínek, C. Jawahar, V. Kolmogorov, and M. Pawan Kumar. Efficient optimization for rankbased loss functions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18), June 2018.
 [35] D. Paschalidou, A. O. Ulusoy, C. Schmitt, L. Gool, and A. Geiger. Raynet: Learning volumetric 3d reconstruction with ray potentials. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18), 2018.

[36]
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler,
and B. Schiele.
Deepcut: Joint subset partition and labeling for multi person pose estimation.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), pages 4929–4937. IEEE, 2016.  [37] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, http://arxiv.org/abs/1511.06434.

[38]
P. Raghavendra.
Optimal algorithms and inapproximability results for every CSP?
In
Proceedings of the 40th Annual ACM Symposium on Theory of Computing
, STOC ’08, pages 245–254, New York, NY, USA, 2008. ACM. doi: 10.1145/1374376.1374414.  [39] K. G. Samuel and M. F. Tappen. Learning optimized map estimates in continuouslyvalued mrf models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09), pages 477–484, 2009.
 [40] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61–80, Jan. 2009. doi: 10.1109/TNN.2008.2005605.
 [41] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14), pages 2774–2781, 2014.
 [42] S. Schulter, P. Vernaza, W. Choi, and M. K. Chandraker. Deep network flow for multiobject tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), pages 2730–2739, 2017.
 [43] D. Selsam and N. Bjørner. Guiding highperformance SAT solvers with UnsatCore predictions. In M. Janota and I. Lynce, editors, Theory and Applications of Satisfiability Testing – SAT 2019, pages 336–353. Springer International Publishing, 2019.
 [44] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill. Learning a SAT solver from singlebit supervision. In International Conference on Learning Representations (ICLR’19), 2019, http://openreview.net/forum?id=HJMC_iA5tm.
 [45] J. Song, B. Andres, M. Black, O. Hilliges, and S. Tang. Endtoend learning for graph decomposition. 2018, http://arxiv.org/abs/1812.09737.
 [46] Y. Song, A. Schwing, Richard, and R. Urtasun. Training deep neural networks via direct loss minimization. In 33rd International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pages 2169–2177. PMLR, 2016, http://proceedings.mlr.press/v48/songb16.html.
 [47] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Endtoend memory networks. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2440–2448. Curran Associates, Inc., 2015.
 [48] J. Thapper and S. Živný. The limits of SDP relaxations for generalvalued CSPs. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’17, pages 27:1–27:12, Piscataway, NJ, USA, 2017. IEEE Press, http://dl.acm.org/citation.cfm?id=3329995.3330022.
 [49] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems 27 (NIPS’14), pages 1799–1807. Curran Associates, Inc., 2014.
 [50] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005, http://dl.acm.org/citation.cfm?id=1046920.1088722.
 [51] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems 28 (NIPS’15), pages 2692–2700. Curran Associates, Inc., 2015, http://papers.nips.cc/paper/5866pointernetworks.pdf.
 [52] P.W. Wang, P. L. Donti, B. Wilder, and Z. Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. 2019, http://arxiv.org/abs/1905.12149.
 [53] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen. Dynamic label graph matching for unsupervised video reidentification. In IEEE International Conference on Computer Vision (ICCV’17). IEEE Computer Society, Oct 2017.

[54]
S. Zheng, S. Jayasumana, B. RomeraParedes, V. Vineet, Z. Su, D. Du, C. Huang,
and P. H. S. Torr.
Conditional random fields as recurrent neural networks.
In IEEE International Conference on Computer Vision (ICCV’15), pages 1529–1537. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.179.
Appendix A Appendix
a.1 Guidelines for Setting the Values of .
In practice, has to be chosen appropriately, but we found its exact choice uncritical (no precise tuning was required). Nevertheless, note that should cause a noticeable disruption in the optimization problem from equation (3), otherwise it is too likely that resulting in a zero gradient. In other words, should roughly be of the magnitude that brings the two terms in the definition of in Prop. 1 to the same order:
where stands for the average. This again justifies that is a true hyperparameter and that there is no reason to expect values around .
a.2 Proofs
[Proof of Proposition 1.] Let us write and , for brevity. Thanks to the linearity of and the definition of , we have
where and as desired. The conclusion about the points of minima then follows.
Before we prove Theorem 1, we make some preliminary observations. To start with, due to the definition of the solver, we have the fundamental inequality
(6) 
Observation 1.
The function is continuous and piecewise linear.
[Proof.]Since ’s are linear and distinct, , as their pointwise minimum, has the desired properties.
Analogous fundamental inequality
(7) 
follows from the definition of the solution to the optimization problem (3).
A counterpart of Observation 1 reads as follows.
Observation 2.
The function is continuous and piecewise affine.
[Proof.]The function under inspection is a pointwise minimum of distinct affine functions as ranges .
As a consequence of abovementioned fundamental inequalities, we obtain the following twosided estimates on .
Observation 3.
The following inequalities hold for
[Proof.]Inequality (6) implies that and the first inequality then follows simply from the definition of . As for the second one, it suffices to apply (7) to .
Now, let us introduce few notions that will be useful later in the proofs. For a fixed , partitions into maximal connected sets on which is constant (see Fig. 7). We denote this collection of sets by and set .
For and , we denote
We write , for brevity. For technical reasons, we also allow negative values of here.
Note, that if , then
is a hyperplane since
’s are linear. In general, may just be a proper subset of and, in that case, is just the restriction of a hyperplane onto . Consequently, it may happen that will be empty for some pair of , and some . To emphasize this fact, we say “hyperplane in ”. Analogous considerations should be taken into account for all other linear objects. The note “in ” stands for the intersection of these linear object with the set .Observation 4.
Let and let for . Then is a convex polytope in , where the facets consist of parts of finitely many hyperplanes in for some .
[Proof.]Assume that . The values of may only change on hyperplanes of the form for some . Then is an intersection of corresponding halfspaces and therefore is a convex polytope. If is a proper subset of the claim follows by intersecting all the objects with .
Observation 5.
Let be distinct. If nonempty, the hyperplanes and are parallel and their distance is equal to , where
[Proof.]If we define a function and a constant , then our objects rewrite to
Since is linear, these sets are parallel and intersects the origin. Thus, the required distance is the distance of the hyperplane from the origin, which equals to .
As the set is finite, there is a uniform upper bound on all values of . Namely
(8) 
a.2.1 Proof of Theorem 1
[Proof of Property A1.] Now, Property A1 follows, since
and is a difference of continuous and piecewise affine functions.
[Proof of Property A2.] Let be given. We show that which is the same as showing . Assume that , that is, by the definition of and ,
(9) 
in which we denoted . Our goal is to show that
(10) 
where as this equality then guarantees that . Observe that (7) applied to and , yields the inequality “” in (10).
Let us show the reversed inequality. By Observation 3 applied to , we have
(11) 
We now use (7) with and , followed by equality (9) to obtain
where the last inequality holds due to (11).
Next, we have to show that as , i.e. that for almost every , there is a such that . To this end, let be given. We can assume that is a unique solution of solver (1), since two solutions, say and , coincide only on the hyperplane in , which is of measure zero. Thus, since is finite, the constant
is positive. Denote
(12) 
If , set . Then, for every such that , we have
which rewrites
(13) 
For the remaining ’s, (13) holds trivially for every . Therefore, is a solution of the minimization problem (3), whence . This shows that as we wished. If , then for every and (13) follows again.
[Proof of Property A3.] Let be given. We show that on the component of the set
(14) 
the function agrees with a interpolator, where and is an absolute constant. The claim follows as there are only finitely many sets and their components of the form (14) in .
Let us set
and
The condition on tells us that is a nonconstant affine function. It follows by the definition of and that
(15) 
and
(16) 
By Observation 5, the sets and are parallel hyperplanes. Denote by the nonempty intersection of their corresponding halfspaces in . We show that is a interpolator of on between and , with being linearly controlled by .
We have already observed that is the affine function ranging from – on the set – to – on the set . It remains to show that attains both the values and at most far from the sets and , respectively, where denotes a component of the set , .
Consider first. By Observation 4, there are , such that facets of are parts of hyperplanes in . Each of them separates into two halfspaces, say and , where is the halfspace which contains and is the other one. Let us denote
Every is a nonzero linear function which is negative on and positive on . By the definition of , we have
that is
Now, denote
Each is a halfspace in containing and hence . Let us set . Clearly, (see Fig. 8). By Observation 5, the distance of the hyperplane from is at most , where is given by (8). Therefore, since all the facets of are at most far from , there is a constant such that each point of is at most far from .
Finally, choose any . By (16), we have , and by the definition of , is no farther than away from .
Now, let us treat and define the set analogous to , where each occurrence of is replaced by . Any has desired properties. Indeed, (15) ensures that and is at most far away from .
a.3 Details of Experiments
a.3.1 Warcraft Shortest Path
The maps for the dataset have been generated with a custom random generation process by using 142 tiles from the Warcraft II tileset [20]. The costs for the different terrain types range from –. Some example maps of size are presented in Fig. 8(a)
together with a histogram of the shortest path lengths. We used the first five layers of ResNet18 followed by a maxpooling operation to extract the latent costs for the vertices.

Optimization was carried out via Adam optimizer [23] with scheduled learning rate drops dividing the learning rate by
at epochs
and . Hyperparameters and model details are listed in Tab. 5k  Optimizer(LR)  Architecture  Epochs  Batch Size  
12, 18, 24, 30  Adam()  subset of ResNet18 
a.3.2 MNIST Mincost Perfect Matching
The dataset consists of randomly generated grids of MNIST digits that are sampled from a subset of 1000 digits of the full MNIST dataset. We trained a fully convolutional neural network with two convolutional layers followed by a maxpooling operation that outputs a grid of vertex costs for each example. The vertex costs are transformed into the edge costs via the known cost function and the edge costs are then the inputs to the Blossom V solver [13] as implemented in [25].
Regarding the optimization procedure, we employed the Adam optimizer along with scheduled learning rate drops dividing the learning rate by at epochs and , respectively. Other training details are in Tab. 6. Lower batch sizes were used to reduce GPU memory requirements.
k  Optimizer(LR) 

Epochs  Batch Size  
4, 8  Adam()  
16  Adam()  
24  Adam() 
a.3.3 Globe Traveling Salesman Problem
For the Globe Traveling Salesman Problem we used a convolutional neural network architecture of three convolutional layers and two fully connected layers. The last layer outputs a vector of dimension containing the 3dimensional representations of the respective countries’ capital cities. These representations are projected onto the unit sphere and the matrix of pairwise distances is fed to the TSP solver.
The high combinatorial complexity of TSP has negative effects on the loss landscape and results in many local minima and high sensitivity to random restarts. For reducing sensitivity to restarts, we set Adam parameters to (as it is done for example in GAN training [37]) and .
The local minima correspond to solving planar TSP as opposed to spherical TSP. For example, if all cities are positioned to almost identical locations, the network can still make progress but it will never have the incentive to spread the cities apart in order to reach the global minimum. To mitigate that, we introduce a repellent force between epochs 15 and 30. In particular, we set
where for are the positions of the cities on the unit sphere. The regularization constants were chosen as and for .
For finetuning we also introduce scheduled learning rate drops where we divide the learning rate by at epochs and .
k  Optimizer(LR) 

Epochs  Batch Size  
5, 10, 20  Adam()  
40  Adam() 
In Fig. 4(b), we compare the true city locations with the ones learned by the hybrid architecture. Due to symmetries of the sphere, the architecture can embed the cities in any rotated or flipped fashion. We resolve this by computing “the most favorable” isometric transformation of the suggested locations. In particular, we solve the orthogonal Procrustes problem [16]
where are the suggested locations, the true locations, and the optimal transformation to apply. We report the resulting offsets in kilometers in Tab. 8.
k  5  10  20  40 
Location offset (km) 
a.4 Traveling Salesman with an Approximate Solver
Since approximate solvers often appear in practice where the combinatorial instances are too large to be solved exactly in reasonable time, we test our method also in this setup. In particular, we use the approximate solver (ORTools [2]) for the Globe TSP. We draw two conclusions from the numbers presented below in Tab. 9.

The choice of the solver matters. Even if ORTools is fed with the ground truth representations (i.e. true locations) it does not achieve perfect results on the test set (see the right column). We expect, that also in practical applications, running a suboptimal solver (e.g. a differentiable relaxation) substantially reduces the maximum attainable performance.

The suboptimality of the solver didn’t harm the feature extraction – the point of our method. Indeed, the learned locations yield performance that is close to the upper limit of what the solver allows (compare the middle and the right column).
Embedding ORtools  ORtools on GT locations  
Train %  Test %  Test %  
5  
10  
20  
40 
Comments
There are no comments yet.