1 Introduction
There is growing interest in developing neural models for solving combinatorial optimization problems. These problems often require complex reasoning over discrete symbols, and many of them can be expressed in the form of an underlying Integer Linear Program (ILP). Two kinds of problem (input) settings have been considered in the literature: (a) purely symbolic and (b) a combination of perceptual and symbolic. Solving an n-Queens problem given a partial assignment of queens on the board as input is an example of the former, and solving a sudoku puzzle given the image of a partially filled board as input is an example of the latter. While the first setting corresponds to a pure reasoning task, the second involves a combination of perception and reasoning tasks which need to be solved jointly. Existing literature proposes various approaches to handle one or both of these settings. One line of work proposes purely neural models (palm&al18; nandwani&al21; dong&al19) for solving these tasks, representing the underlying constraints and costs implicitly. While standard CNNs are used to solve the perceptual task, neural models such as Graph Neural Networks take care of the reasoning component. In an alternate view, one may want to solve these tasks by explicitly learning the constraints and cost of the underlying ILP. While the perceptual component would still be handled by modules such as a CNN, reasoning is taken care of by an explicit representation in the form of an ILP layer representing constraints and costs. Such an approach has the potential advantage of being more interpretable, and also more accurate if the underlying constraints can be learned precisely. Some recent approaches take this route, including works by (paulus&al21; vlastelica&al20; berthet&al20). We refer to the latter set of approaches as Neural ILP architectures. (Footnote 2: Although these methods can also train a Neural-ILP-Neural architecture, studying this is beyond our scope.)
Learning in Neural ILP architectures is complicated by the fact that there is a discrete optimization (in the form of an ILP layer) at the end of the network, which is typically non-differentiable, making end-to-end learning of the system difficult. One possible way is to instead use an iterative ILP solving algorithm such as a cutting-plane method (gomory10) that uses a continuous relaxation in each iteration, which is shown to be differentiable due to the introduction of continuous variables (ferber&al20; wilder&al19). Most of these works are concerned with learning only the cost function and assume that the constraints are given. Recent work by paulus&al21 has proposed an approach to directly pass the gradients through a black-box ILP solver. Specifically, they rely on Euclidean distances between the constraint hyperplanes and the current solution obtained by the ILP solver to produce gradients for backprop. Though this approach improves the quality of results compared to earlier works involving continuous relaxations, scalability is severely affected since an ILP has to be solved at every learning iteration, making the training extremely slow.
We are interested in answering the following question: is there a way to train a neural ILP architecture in an end-to-end manner, which does not require access to an underlying solver? Such an approach, if it exists, could presumably result in significantly faster training times, and hence better scalability. In response, we propose a novel technique to backpropagate through the learnable constraints as well as the learnable cost function of an unknown Integer Linear Program (ILP). During training, our technique does not solve the ILP to compute the gradients. Instead, we cast the learning of ILP constraints (and cost) as learning of a polyhedron, consisting of a set of hyperplanes, such that points inside the polyhedron are treated as positive, and points outside as negative. While a positive point needs to be classified as positive by each of the hyperplanes, a negative point needs to be classified as negative by at least one of the hyperplanes. Our formulation incorporates the learning of the ILP cost also as learning of one of the hyperplanes in the system. We formulate a novel margin-based loss to learn these hyperplanes jointly. A covariance-based regularizer that minimizes the cosine similarity between all pairs of hyperplanes ensures that learned constraints (hyperplanes) are not redundant. Since the training data comes only with positive examples, i.e., solutions to the respective optimization problems, we develop several techniques for sampling the negatives, each of which is central to the effective learning of our hyperplanes in our formulation.
We present several experiments on problems which require learning of ILP constraints and cost, with both symbolic as well as perceptual input. These include solving a symbolic sudoku (symbolic) as well as a visual sudoku, in which we are given an image of a partially filled board (wang&al19) (perceptual); ILPs with random constraints (symbolic); and Knapsack from sentence description (perceptual) and keypoint matching (perceptual) from (paulus&al21). Our closest competitor, CombOptNet (paulus&al21), cannot solve even the smallest of the sudoku boards, sized $4 \times 4$, whereas we can easily scale to $9 \times 9$, getting close to 100% accuracy. We are slightly better on keypoint matching, and obtain significantly better accuracy on random ILPs and knapsack, especially on large problem sizes. We also outperform purely neural baselines (wherever applicable).
2 Related Work
In the first line of work, a neural model such as a Graph Neural Network (GNN) replaces a combinatorial solver altogether, implicitly encodes the rules or constraints in its weights, and learns them from data. nandwani&al22; nandwani&al21; palm&al18 train a Recurrent Relational Network for learning to solve puzzles like sudoku, graph coloring, futoshiki, etc.; dong&al19 propose Neural Logic Machines (NLMs) that learn lifted first-order rules and experiment with blocks world, reasoning on family trees, etc.; ranjan&al22 use a siamese GNN architecture to learn the combinatorial problems of graph and subgraph distance computation; bajpai&al18; garg&al19; garg&al20; sharma&al22 use GNNs to train probabilistic planners for combinatorial domains with large, discrete state and action spaces; selsam&al19; amizadeh&al19a; amizadeh&al19b train a GNN for solving any CSP when constraints are explicitly given in a standard form such as CNF, DNF or Boolean Circuits. This is clearly different from our work, since we are interested in explicitly learning the constraints of the underlying optimization problem.
Inverse Optimization (chan&al21) aims to learn the constraints (cost) of a linear program (LP) from the observed optimal decisions. Recently, (tan&al19; tan&al20) use the notion of Parameterized Linear Programs (PLP), in which both the cost and the constraints of an LP are parameterized by unknown weights. (gould&al19) show how to differentiate through continuous constrained optimization problems using the notion of a ‘declarative node’ that uses the implicit function theorem (scarpello&al02). Similar to this are other works (amos&kolter17; agrawal&al19) that define methods for differentiating through other continuous problems like convex quadratic programs (QPs) or cone programs. While all these techniques are concerned with learning the constraints (cost) for optimization problems defined over continuous variables, our focus is on learning constraints (cost) for an ILP, which involves optimization over discrete variables and can be a significantly harder problem.
In the space of learning the cost for an ILP, several approaches have been proposed recently. (vlastelica&al20) give the gradient w.r.t. a linear cost function that is optimized over a given discrete space. (rolinek&al20a; rolinek&al20b) exploit it for the task of Deep Graph Matching, and for differentiating through rank-based metrics such as Average Precision/Recall, respectively. (berthet&al20) replace the black-box discrete solver with its smooth perturbed version during training and exploit ‘perturb-and-MAP’ (papandreou&yuille11) to compute the gradients. On similar lines, (niepert&al21) also exploit perturb-and-MAP but propose a different noise model for perturbation and also extend to the case when the output of the solver may feed into a downstream neural model. These methods assume the constraints to be given, and only learn the cost for an ILP.
Other methods use relaxations of specific algorithms for the task of learning the constraints and cost for an ILP. ferber&al20 backprop through the KKT conditions of the LP relaxation created iteratively while using a cutting-plane method (gomory10) for solving an MILP; wilder&al19 add a quadratic penalty term to the continuous relaxation and use the implicit function theorem to backprop through the KKT conditions, as done in amos&kolter17 for smooth programs; mandi&guns20 instead add a log-barrier term to get the LP relaxation and differentiate through its homogeneous self-dual formulation, linking it to the iterative steps in the interior point method. wang&al19 use a low-rank SDP relaxation of MAXSAT and define how the gradients flow during backpropagation. Presumably, these approaches are limited in their application since they rely on specific algorithmic relaxations.
Instead of working with a specific algorithm, some recent works differentiate through a black-box combinatorial solver for the task of learning constraints (paulus&al21). The solver is called at every learning step to compute the gradient. This becomes a bottleneck during training, since the constraints are being learned on the go, and the problem could become ill-behaved, resulting in a larger solver time and severely limiting the scalability of such approaches. In contrast, we would like to perform this task in a solver-free manner. (pan&al20) propose learning of constraints for the task of structured prediction. They represent the linear constraints compactly using a specific two-layer Rectifier Network (pan&srikumar16). A significant limitation is that their method can only be used for learning constraints, and not costs. Further, their approach does not do well experimentally on the benchmark domains that we compare with. (meng&chang21) propose a non-learning based approach for mining constraints in which, for each training pair $(c, y^*)$ of cost and target, a new constraint is added which essentially implies that no other target can have better cost than the given target $y^*$ for the corresponding cost $c$. We are specifically interested in learning not only the constraints, but also the cost of an ILP from data. Convex Polytope Machines (CPM) (kantchelian&al14) learn a non-linear binary classifier in the form of a polytope from a dataset of positive and negative examples. In contrast, our goal is to learn the cost as well as the constraints for an ILP where both the constraints and the cost could be parameterized by another neural network. Also, we do not have access to any negative samples.
3 Differentiable ILP Loss
3.1 Background and Task Description
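We are interested in learning how to solve combinatorial optimization problems that can be expressed as an Integer Linear Program (ILP) with equality as well as inequality constraints:
$$\arg\min_{z} \; c^{\top} z \quad \text{subject to} \quad Uz = v, \;\; Gz + h \ge 0, \;\; z \in \mathbb{Z}^{n} \tag{1}$$
Here $c \in \mathbb{R}^{n}$ represents an $n$ dimensional cost vector, the matrix $U \in \mathbb{R}^{m_1 \times n}$ and $v \in \mathbb{R}^{m_1}$ represent $m_1$ linear equality constraints, and the matrix $G \in \mathbb{R}^{m_2 \times n}$ and $h \in \mathbb{R}^{m_2}$ represent $m_2$ linear inequality constraints, together defining the feasible region. Without loss of generality, one can replace an equality constraint $u_i^{\top} z = v_i$ by two inequality constraints: $u_i^{\top} z \ge v_i$ and $u_i^{\top} z \le v_i$. Here $u_i$ and $v_i$ represent the $i$-th row of the matrix $U$ and the $i$-th element of the vector $v$ respectively. Using this, one can reduce the ILP in eq. 1 to an equivalent form with just inequality constraints:
$$\arg\min_{z} \; c^{\top} z \quad \text{subject to} \quad Az + b \ge 0, \;\; z \in \mathbb{Z}^{n} \tag{2}$$
Here matrix $A \in \mathbb{R}^{m \times n}$ is the row-wise concatenation of the matrices $G$, $U$ and $-U$. Vector $b \in \mathbb{R}^{m}$ is the row-wise concatenation of the vectors $h$, $-v$ and $v$, and $m = 2m_1 + m_2$ is the total number of inequality constraints. We represent the $i$-th constraint $a_i^{\top} z + b_i \ge 0$ as $[a_i | b_i]$. The integer constraints $z \in \mathbb{Z}^{n}$ make the problem NP-hard.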
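The reduction from the mixed equality/inequality form to the inequality-only form can be illustrated with a minimal NumPy sketch. The names `to_inequality_form` and `is_feasible`, the symbols `U, v, G, h, A, b`, and the two-variable toy instance are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def to_inequality_form(U, v, G, h):
    """Stack U z = v and G z + h >= 0 into one system A z + b >= 0.

    Each equality u^T z = v becomes the pair u^T z - v >= 0 and -u^T z + v >= 0.
    """
    A = np.vstack([G, U, -U])
    b = np.concatenate([h, -v, v])
    return A, b

def is_feasible(A, b, z, tol=1e-9):
    """Check whether the point z satisfies every row of A z + b >= 0."""
    return bool(np.all(A @ np.asarray(z, dtype=float) + b >= -tol))

# Hypothetical toy instance: z1 + z2 = 2 with z1 >= 0 and z2 >= 0.
U = np.array([[1.0, 1.0]])
v = np.array([2.0])
G = np.eye(2)
h = np.zeros(2)
A, b = to_inequality_form(U, v, G, h)
```

With one equality and two inequalities, the stacked system has four rows, matching the count of two rows per equality plus one row per inequality.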
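Neural ILP Architecture: In a neural ILP architecture, the constraint matrix $A$, vector $b$, and the cost vector $c$ are neural functions $f_A$, $f_b$ and $f_c$ (of input $x$) parameterized by learnable $\Theta = (\theta_A, \theta_b, \theta_c)$, i.e., $A = f_A(x; \theta_A)$; $b = f_b(x; \theta_b)$; $c = f_c(x; \theta_c)$. This results in a different ILP for each input $x$. For an input $x$, the solution given by a neural ILP model, $M_{\Theta}(x)$, is the optimum of the following parameterized ILP, where $A$, $b$, $c$ are replaced by the corresponding neural functions evaluated at input $x$:
$$M_{\Theta}(x) = \arg\min_{z} \; f_c(x; \theta_c)^{\top} z \quad \text{subject to} \quad f_A(x; \theta_A)\, z + f_b(x; \theta_b) \ge 0, \;\; z \in \mathbb{Z}^{n} \tag{3}$$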
Example of visual-sudoku: In $k \times k$ visual-sudoku, the input $x$ is an image of a sudoku puzzle and $y^* \in \{0,1\}^{n}$ is the corresponding solution, represented as an $n = k^3$ dimensional binary vector: the integer in each of the $k^2$ cells is represented by a $k$ dimensional one-hot binary vector. The function $f_c(x; \theta_c)$ parameterizing the cost is a neural digit classifier that classifies the content of each of the $k^2$ cells into one of the $k$ classes. The neural functions $f_A$ and $f_b$ are independent of the input $x$, as the constraints are the same for every $k \times k$ sudoku puzzle. Therefore, $f_A(x; \theta_A) = \theta_A$ and $f_b(x; \theta_b) = \theta_b$, where $\theta_A$ is just a learnable matrix of dimension $m \times n$ and $\theta_b$ is a learnable vector of dimension $m$. See bartlett&al08 for an ILP formulation of $k \times k$ sudoku with $4k^2$ equality constraints in an $n = k^3$ dimensional binary space $\{0,1\}^{n}$.
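Learning Task: Given a training dataset $D = \{(x_s, y^*_s) \mid s \in \{1, \dots, S\}\}$ with $S$ training samples, the task is to learn the parameters $\Theta$ such that $y^*_s = M_{\Theta}(x_s)$ for each $s$. To do so, one needs to define derivatives $\partial L / \partial A$, $\partial L / \partial b$ and $\partial L / \partial c$ w.r.t. $A$, $b$ and $c$ respectively, of an appropriate loss function $L$. Once such a derivative is defined, one can easily compute derivatives w.r.t. $\theta_A$, $\theta_b$ and $\theta_c$ using the chain rule: $\frac{\partial L}{\partial \theta_A} = \frac{\partial L}{\partial A} \frac{\partial f_A(x; \theta_A)}{\partial \theta_A}$; $\frac{\partial L}{\partial \theta_b} = \frac{\partial L}{\partial b} \frac{\partial f_b(x; \theta_b)}{\partial \theta_b}$; and $\frac{\partial L}{\partial \theta_c} = \frac{\partial L}{\partial c} \frac{\partial f_c(x; \theta_c)}{\partial \theta_c}$. Hence, in the formulation below, we only worry about computing gradients w.r.t. the constraint matrix $A$, vector $b$, and the cost vector $c$.
The existing approaches, e.g., (paulus&al21), explicitly need access to the current model prediction $M_{\Theta}(x)$. This requires solving the ILP in eq. 3, making the learning process extremely slow. In contrast, we present a ‘solver-free’ framework for computation of an appropriate loss and its derivatives w.r.t. $A$, $b$ and $c$. Our framework does not require solving any ILP while training, thereby making it extremely scalable as compared to existing approaches.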
3.2 A Solver-free Framework
Conversion to a constraint satisfaction problem: As a first step, we convert the constrained optimization problem in eq. 2 to an equivalent constraint satisfaction problem by introducing an additional ‘cost–constraint’: $c^{\top} z \le c^{\top} y^*$, equivalent to $a_{m+1}^{\top} z + b_{m+1} \ge 0$ with $a_{m+1} = -c$ and $b_{m+1} = c^{\top} y^*$. Note that the above cost–constraint separates the solution $y^*$ of the original ILP in eq. 2 from the other feasible integral points. This is because $y^*$ must achieve the minimum objective value amongst all the feasible integral points, and hence no other feasible integral point $z$ can obtain an objective value less than or equal to $c^{\top} y^*$. The new constraint together with the original $m$ constraints guarantees that $y^*$ is the only solution satisfying all of the $m+1$ constraints. This results in the following equivalent linear constraint satisfaction problem:
$$y^* = \arg\min_{z} \; 0^{\top} z \quad \text{subject to} \quad Az + b \ge 0, \;\; -c^{\top} z + c^{\top} y^* \ge 0, \;\; z \in \mathbb{Z}^{n} \tag{4}$$
Constructing such an equivalent satisfaction problem requires access to the solution $y^*$ of the original ILP in eq. 2, which is already available to us for the training data samples. By rolling up the cost vector $c$ into an additional constraint with the help of $y^*$, we have converted our original objective of learning both the cost and constraints to just learning of constraints in eq. 4.
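As a sanity check on the conversion above, the following sketch appends the cost–constraint to a tiny ILP and enumerates the integral points that satisfy the extended system. The helper names, the symbols `A, b, c, y_star`, and the toy instance (minimize $z_1 + 2 z_2$ subject to $z_1 + z_2 \ge 1$ over $\{0,1\}^2$) are illustrative assumptions; brute-force enumeration is used only because the example is tiny.

```python
import itertools
import numpy as np

def add_cost_constraint(A, b, c, y_star):
    """Append the cost-constraint -c^T z + c^T y* >= 0 as one extra row."""
    A_ext = np.vstack([A, -c])
    b_ext = np.append(b, c @ y_star)
    return A_ext, b_ext

def feasible_points(A, b, grid, tol=1e-9):
    """Enumerate points of grid^n that satisfy A z + b >= 0 (tiny n only)."""
    n = A.shape[1]
    return [z for z in itertools.product(grid, repeat=n)
            if np.all(A @ np.array(z, dtype=float) + b >= -tol)]

# Hypothetical toy ILP: minimise z1 + 2*z2 subject to z1 + z2 >= 1, z in {0,1}^2.
A = np.array([[1.0, 1.0]])
b = np.array([-1.0])
c = np.array([1.0, 2.0])
y_star = np.array([1.0, 0.0])   # unique optimum, with cost 1

before = feasible_points(A, b, (0, 1))
A_ext, b_ext = add_cost_constraint(A, b, c, y_star)
after = feasible_points(A_ext, b_ext, (0, 1))
```

In this instance three integral points are feasible for the original constraint, and only the ground truth remains once the cost–constraint is added.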
The main intuition behind our framework comes from the observation that each of the $m$ linear constraints defining the feasible region is essentially a linear binary classifier (hyperplane) in $\mathbb{R}^{n}$ separating the ground truth $y^*$ from the infeasible region. The additional cost–constraint in eq. 4 separates $y^*$ from other integral samples feasible w.r.t. the original $m$ constraints. Learning constraints of an ILP is akin to simultaneously learning $m$ linear binary classifiers separating $y^*$ from other infeasible points of the original ILP, along with learning a classifier for the cost–constraint, separating $y^*$ from other feasible integral points.
For a vector to lie inside the feasible region, all the classifiers need to classify it positively, and hence it acts as a positive data point for all the $m$ binary classifiers. In the absence of explicitly provided negative samples, we propose several strategies for sampling them from $\mathbb{Z}^{n}$ for each ground truth $y^*$. We discuss them in detail in section 3.3. For now, let $\mathcal{N}_{y^*}$ be the set of all the negative samples generated by all our sampling techniques for a ground truth positive sample $y^*$. In contrast to a positive point, which needs to satisfy all the constraints, a negative point becomes infeasible even if it violates any one of the constraints. As a result, it suffices if any one of the classifiers correctly classifies it as a negative sample (refer to appendix for an illustration). While learning, one should not assign a negative sample to any specific classifier, as the classifier parameters are being updated continuously. Instead, we make a soft assignment depending upon the distance of the negative sample from the hyperplane. With this intuition, we now formally define our ILP–Loss.
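Formulating solver-free ILP–Loss: Let $d(z; [a_i|b_i]) = (a_i^{\top} z + b_i) / \|a_i\|$ represent the signed Euclidean distance of a point $z$ from the hyperplane corresponding to the constraint $[a_i|b_i]$, and let $[a_{m+1}|b_{m+1}] = [-c \mid c^{\top} y^*]$ denote the cost–constraint of eq. 4. We want the signed distance from all the hyperplanes to be positive for the ground truth samples and negative from at least one hyperplane for all the sampled negative points. We operationalize this via a margin-based loss function:
$$L = \lambda_{pos} L_{+} + \lambda_{neg} L_{-} + \lambda_{cov} L_{o}, \quad \text{where} \tag{5}$$
$$L_{+} = \frac{1}{m} \sum_{i=1}^{m} \max\{0,\; \mu_{+} - d(y^*; [a_i|b_i])\} \tag{6}$$
$$L_{-} = \frac{1}{|\mathcal{N}_{y^*}|} \sum_{y^- \in \mathcal{N}_{y^*}} \sum_{i=1}^{m+1} w_i \max\{0,\; \mu_{-} + d(y^-; [a_i|b_i])\} \tag{7}$$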
$$L_{o} = \log \sum_{i \ne j} \exp\left( \frac{a_i^{\top} a_j}{\|a_i\| \, \|a_j\|} \right) \tag{8}$$
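$$w_i = \frac{\exp\left(-d(y^-; [a_i|b_i]) / \tau\right)}{\sum_{j=1}^{m+1} \exp\left(-d(y^-; [a_j|b_j]) / \tau\right)}, \quad i = 1, \dots, m+1 \tag{9}$$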
We call $L_{+}$, $L_{-}$ and $L_{o}$ the positive, negative and covariance loss respectively. The average in $L_{+}$ ensures that a ground truth sample is positively classified by all the classifiers. $\mu_{+}$ and $\mu_{-}$ are the hyperparameters representing the margins for the positive and the negative points respectively. $w_i$ in $L_{-}$ represents the soft assignment of $y^-$ to the $i$-th constraint $[a_i|b_i]$ and is computed by a temperature-annealed softmax over the negative distances in eq. 9. Softmax ensures that the hyperplane which is most confident in classifying $y^-$ as negative gets the maximum weight. When $y^-$ lies inside the feasible region, the most confident classifier is the one closest to $y^-$. To avoid the pathological behaviour of decreasing the loss by changing the weights, we ensure that gradients do not flow through $w_i$ in eq. 7.
The temperature parameter $\tau$ needs to be annealed as the training progresses. A high temperature initially can be seen as ‘exploration’ for the right constraint that will be violated by $y^-$. This is important as the constraints are also being learnt and are almost random initially, so a given negative should not commit to a particular hyperplane. Additionally, this encourages multiple constraints to be violated for each negative, which leads to a robust set of constraints. As the training progresses, we reduce the temperature $\tau$, ensuring that the most confident classifier with the least signed distance gets almost all the weight, which can be seen as ‘exploitation’ of the most confident classifier. If $y^-$ is correctly classified as a negative with a margin $\mu_{-}$ by any classifier, i.e., $d(y^-; [a_i|b_i]) \le -\mu_{-}$ for some $i$, then the corresponding negative loss, $\max\{0, \mu_{-} + d(y^-; [a_i|b_i])\}$, becomes zero, and a low value of $\tau$ ensures that it gets all the weight.
The last term $L_{o}$ acts as a regularizer and tries to ensure that no two learnt constraints are similar. We call it the covariance loss as it minimizes the covariance between the constraint unit vectors. Equivalently, it minimizes the cosine similarity between all pairs of constraints. The weights $\lambda_{pos}$, $\lambda_{neg}$ and $\lambda_{cov}$ are computed dynamically during training with a multi-loss weighting technique using coefficients of variation, as described in groenendijk&al21. Intuitively, the loss term with maximum variance over the learning iterations adaptively gets most of the weight.
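The three loss terms described above can be sketched in NumPy as a forward computation only. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, default margins and temperature are hypothetical; the cost–constraint is taken to be the extra row $[-c \mid c^{\top} y^*]$; the positive loss averages over the original constraints only; and the covariance term is assumed to be a log-sum-exp over pairwise cosine similarities. In an autodiff framework the softmax weights would additionally be detached so that no gradient flows through them.

```python
import numpy as np

def signed_distance(A, b, Z):
    """Signed distance of every row of Z from every hyperplane [a_i | b_i].

    Returns an array of shape (num_points, num_hyperplanes).
    """
    return (Z @ A.T + b) / np.linalg.norm(A, axis=1)

def ilp_loss_terms(A, b, c, y_star, negatives, mu_pos=0.01, mu_neg=0.01, tau=1.0):
    """Return (positive, negative, covariance) loss terms for one ground truth."""
    # Assumed form of the cost-constraint: extra row [-c | c^T y*].
    A_ext = np.vstack([A, -c])
    b_ext = np.append(b, c @ y_star)

    # Positive loss: y* should be inside every original constraint by mu_pos.
    d_pos = signed_distance(A, b, y_star[None, :])[0]
    loss_pos = np.mean(np.maximum(0.0, mu_pos - d_pos))

    # Negative loss: softmax over negated distances gives the soft assignment.
    d_neg = signed_distance(A_ext, b_ext, negatives)
    logits = -d_neg / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)                  # constants w.r.t. gradients
    hinge = np.maximum(0.0, mu_neg + d_neg)
    loss_neg = np.mean(np.sum(w * hinge, axis=1))

    # Covariance loss (assumed form): log-sum-exp of pairwise cosine similarities.
    unit = A_ext / np.linalg.norm(A_ext, axis=1, keepdims=True)
    cos = unit @ unit.T
    off_diagonal = ~np.eye(len(unit), dtype=bool)
    loss_cov = np.log(np.sum(np.exp(cos[off_diagonal])))
    return loss_pos, loss_neg, loss_cov

# Hypothetical toy instance: one constraint z1 + z2 >= 1, cost (1, 2).
A = np.array([[1.0, 1.0]])
b = np.array([-1.0])
c = np.array([1.0, 2.0])
y_star = np.array([1.0, 0.0])
negatives = np.array([[0.0, 0.0]])   # violates the original constraint

pos, neg_cold, cov = ilp_loss_terms(A, b, c, y_star, negatives, tau=0.01)
_, neg_hot, _ = ilp_loss_terms(A, b, c, y_star, negatives, tau=100.0)
```

In the toy instance the negative point violates the original constraint by more than the margin. At a low temperature almost all weight goes to that constraint and the negative loss vanishes, while at a high temperature the weight is shared with the cost–constraint, which the point satisfies, so the negative loss stays positive.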
Other details:
Parameterization of equality constraints: Recall that we replace an equality by two inequality constraints: $u_i^{\top} z \ge v_i$ and $u_i^{\top} z \le v_i$. In practice, to enhance learnability, we add a small margin of $\epsilon$ on both sides: $u_i^{\top} z \ge v_i - \epsilon$ and $u_i^{\top} z \le v_i + \epsilon$. We pick an $\epsilon$ small enough so that the probability of the new feasible region including an infeasible integral point is negligible, but higher than $\mu_{+}$ so that the positive point can be inside the polyhedron by the specified margin.
Known boundaries: In many cases, the boundary conditions on the output variables are known, i.e., $\alpha \le y^* \le \beta$, where $\alpha$ and $\beta$ are the lower and upper bounds on each dimension of $y^*$. We handle this by adding known boundary constraints: $z \ge \alpha$ and $z \le \beta$.
Overparameterization of constraints: As done in paulus&al21, we also overparameterize each constraint hyperplane by an additional learnable offset vector which can be viewed as its own local origin. Radius represents its distance from its own origin , resulting in the following hyperplane in the base coordinate system: .
Initialization of $A$: The way we initialize the constraints may have an impact on the learnability. While (paulus&al21) propose to sample each entry of $A$ uniformly (and independently) from a fixed interval, we also experiment with a standard Gaussian initialization. The latter results in initial hyperplanes with their normal directions ($a_i$’s) uniformly sampled from a unit hypersphere. In expectation, such an initialization achieves the minimum of $L_{o}$, which measures total pairwise covariance.
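The effect of the standard Gaussian initialization on the normal directions can be checked numerically with a short sketch. The function names, the seed, and the sizes (64 hyperplanes in 16 dimensions) are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def gaussian_init(m, n, rng):
    """Sample an m x n constraint matrix with i.i.d. standard Gaussian entries."""
    return rng.standard_normal((m, n))

def mean_pairwise_cosine(A):
    """Average cosine similarity over all ordered pairs of distinct rows of A."""
    unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    gram = unit @ unit.T
    m = len(A)
    return (gram.sum() - np.trace(gram)) / (m * (m - 1))

rng = np.random.default_rng(0)
A0 = gaussian_init(64, 16, rng)
avg_cos = mean_pairwise_cosine(A0)
```

Because normalized Gaussian vectors are uniformly distributed on the unit hypersphere, the average pairwise cosine similarity is close to zero, so the initial hyperplanes show no systematic alignment.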
3.3 Negative Sampling
A meaningful computation of ILP–Loss in eq. 5 depends crucially on the negative samples $\mathcal{N}_{y^*}$. Randomly sampling negatives from $\mathbb{Z}^{n}$ is one plausible strategy, but it may not be efficient from the learning perspective: any point which is far away from any of the classifiers will easily be classified as negative and will not contribute much to the loss function. In response, we propose multiple alternatives for sampling the negative points:
1. Integral k-hop neighbours: We sample the integral neighbours that are at an $L_1$ distance of $k$ from $y^*$. For small $k$, these form the hardest negatives as they are closest to the positive point. Note that it is possible for a few integral neighbours to be feasible w.r.t. the $m$ constraints, but they must have worse cost than the given ground truth $y^*$. Such samples contribute towards learning the cost parameters $\theta_c$.
2. Project and sample: We project the ground truth $y^*$ on each of the $m$ hyperplanes and then randomly sample an integral neighbour of the projection, generating a total of $m$ negatives. The sampling probability in the $j$-th dimension depends on the $j$-th coordinate of the projection: if the value of the $j$-th coordinate is $r$, then we sample $\lfloor r \rfloor$ and $\lceil r \rceil$ with probability $\lceil r \rceil - r$ and $r - \lfloor r \rfloor$ respectively. If $r$ is integral, then we sample $r$ with probability $1$. Projection samples are close to the boundary of the currently learnt polyhedron, thus taking the training progress into account. Further, each hyperplane is likely to be assigned to a close-by negative due to projection sampling.
3. Batch negatives: We consider every other ground truth $y^*_{s'}$ in the minibatch as a potential negative sample for $y^*_s$. This is particularly useful for learning the cost parameters $\theta_c$ when the learnable constraints are the same for all the ground truth samples, such as in sudoku. In such cases, a batch–negative $y^*_{s'}$, being a feasible point of the original ILP formulation, must always satisfy all of the learnable constraints of the original ILP in eq. 2. Hence, the only way for $y^*_{s'}$ to be correctly classified as a negative for $y^*_s$ is by violating the cost constraint in eq. 4. Learning cost parameters $\theta_c$ that result in violation of the cost constraint for every batch–negative helps in ensuring that the ground truth $y^*_s$ indeed has the minimum cost.
4. Solver based: Although our approach is motivated by the objective of avoiding solver calls, our framework is easily extensible to solver-based training by using the solution to the currently learnt ILP as a negative sample. This is useful when the underlying neural networks parameterizing $A$, $b$ or $c$ are the bottleneck instead of the ILP solver. While we do not use solver negatives by default, we demonstrate their effectiveness in one of our experiments where the network parameterizing $c$ indeed takes most of the computation time.
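The first two sampling strategies in the list above can be sketched as follows. This is an illustrative sketch, not the authors' sampler: the function names are hypothetical, the k-hop sampler makes the simplifying assumption of moving $k$ distinct coordinates by one unit each (one way of reaching $L_1$ distance exactly $k$), and neither sampler enforces known bounds on the output space.

```python
import numpy as np

def k_hop_neighbours(y, k, num_samples, rng):
    """Sample integral points at L1 distance exactly k from the integer vector y.

    Simplification: k distinct coordinates are each moved by +1 or -1.
    """
    y = np.asarray(y)
    out = np.tile(y, (num_samples, 1))
    for row in out:                      # each row is a view into `out`
        idx = rng.choice(y.size, size=k, replace=False)
        row[idx] += rng.choice([-1, 1], size=k)
    return out

def project_and_sample(A, b, y, rng):
    """Project y on each hyperplane a_i^T z + b_i = 0 and round stochastically.

    A coordinate r is rounded up with probability r - floor(r), so an integral
    coordinate is kept with probability 1. Returns one negative per hyperplane.
    """
    y = np.asarray(y, dtype=float)
    norms = np.linalg.norm(A, axis=1)
    dist = (A @ y + b) / norms                        # signed distances, shape (m,)
    unit = A / norms[:, None]
    proj = y[None, :] - dist[:, None] * unit          # projections, shape (m, n)
    floor = np.floor(proj)
    round_up = rng.random(proj.shape) < (proj - floor)
    return (floor + round_up).astype(int)

rng = np.random.default_rng(0)
y = np.array([1, 0, 1, 0, 0])
hop_negatives = k_hop_neighbours(y, 2, 10, rng)

# Hypothetical single hyperplane z1 = 0.5; the projection of (2, 3) is (0.5, 3).
proj_negatives = project_and_sample(np.array([[1.0, 0.0]]), np.array([-0.5]),
                                    np.array([2, 3]), rng)
```

In the projection example the second coordinate is already integral and is kept, while the first coordinate is rounded to 0 or 1 with equal probability.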
4 Experiments
The goal of our experiments is to evaluate the effectiveness and scalability of our proposed approach compared to the SOTA black-box solver-based approach, i.e., CombOptNet (paulus&al21). We experiment on 4 problems: symbolic and visual sudoku (wang&al19), and three problems from (paulus&al21): random constraints, knapsack from sentence description and keypoint matching. We also compare with an appropriately designed neural baseline for each of our datasets. To measure scalability, in each of the domains, we compare the performance of each of the algorithms in terms of training time, as well as accuracy obtained, for varying problem sizes. For each of the algorithms, we report the time until the epoch that achieves the best validation-set accuracy and exclude the time taken during validation. We set a maximum time limit of 12 hours for training each algorithm. We next describe details of each of these datasets, appropriate baselines, and our results. See the appendix for the details of the ILP solver used in our experiments, the hardware specifications, the hyperparameters, and various other design choices.
4.1 Symbolic and Visual Sudoku
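Table 1: Board accuracy (%) and training time (min) on symbolic and visual sudoku. For CombOptNet, "-" denotes timeout after 12 hours.

Symbolic Sudoku
| Method | Acc. 4x4 (%) | Acc. 6x6 (%) | Acc. 9x9 (%) | Time 4x4 (min) | Time 6x6 (min) | Time 9x9 (min) |
| Neural (RRN) | 100.0 | 99.1 | 91.3 | 5 | 7 | 110 |
| CombOptNet | 0.0 | 0.0 | 0.0 | - | - | - |
| SATNet | 100.0 | 96.8 | 28.5 | 1 | 74 | 299 |
| ILP–Loss (Ours) | 100.0 | 100.0 | 100.0 | 1 | 2 | 52 |

Visual Sudoku
| Method | Acc. 4x4 (%) | Acc. 6x6 (%) | Acc. 9x9 (%) | Time 4x4 (min) | Time 6x6 (min) | Time 9x9 (min) |
| Neural (RRN) | 99.8 | 97.5 | 71.1 | 120 | 65 | 97 |
| CombOptNet | 0.0 | 0.0 | 0.0 | - | - | - |
| SATNet | 98.0 | 80.8 | 17.8 | 79 | 89 | 205 |
| ILP–Loss (Ours) | 99.7 | 98.8 | 98.3 | 3 | 11 | 92 |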
This task involves learning the rules of the sudoku puzzle. For symbolic sudoku, the input $x$ is a matrix of digits, whereas for visual sudoku, the input $x$ is an image of a sudoku puzzle where a digit is replaced by a random MNIST image of the same class. A $k \times k$ sudoku puzzle can be viewed as an ILP with $4k^2$ equality constraints over $k^3$ binary variables (bartlett&al08). For symbolic sudoku, each of the $k^2$ cells of the puzzle is represented by a $k$ dimensional binary vector which takes a value of $1$ at dimension $i$ if and only if the cell is filled by digit $i$, resulting in a $k^3$ dimensional representation of the input $x \in \{0,1\}^{n}$, where $n = k^3$. On the other hand, for visual sudoku, each of the $k^2$ digit images in the cells is decoded (using a neural network) into a $k$ dimensional real vector, resulting in a $k^3$ dimensional learnable cost $c$. The $k^3$ dimensional binary output vector (solution) is created analogously to the symbolic input. Our objective is to learn the constraint matrix $A$ and vector $b$, representing the linear constraints of sudoku. For symbolic sudoku, the cost vector $c$ is known, whereas for visual sudoku, the cost vector $c$ is a function of the input image $x$, parameterized by $\theta_c$, which also needs to be learned.
Dataset: We experiment with different datasets for $k = 4$, $6$ and $9$. We first build symbolic sudoku puzzles with $k$ digits, and use MNIST to convert a puzzle into an image for visual sudoku. For $9 \times 9$ sudokus, we use a standard dataset from Kaggle (sudoku9), for $4 \times 4$ we use publicly available data from sudoku4, and for $6 \times 6$ we use the data generation process described in nandwani&al22. We randomly select samples for training and samples for testing for each $k$. To generate the input images for visual-sudoku, we use the official train and test split of MNIST (mnist12). Digits in our train and test splits are randomly replaced by images in the MNIST train and test sets respectively, avoiding any leakage.
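The one-hot representation of a board used in this subsection can be sketched as below. The function names are hypothetical, and the convention that an empty cell is written as 0 and encoded as an all-zero block is an assumption made for illustration; the text above does not specify how blanks are encoded.

```python
import numpy as np

def encode_board(board):
    """Encode a k x k board of digits 1..k (0 = empty) as a k^3 binary vector."""
    board = np.asarray(board)
    k = board.shape[0]
    onehot = np.zeros((k, k, k), dtype=int)
    rows, cols = np.nonzero(board)                 # filled cells only
    onehot[rows, cols, board[rows, cols] - 1] = 1  # digit i -> position i-1
    return onehot.reshape(-1)

def decode_board(vec, k):
    """Invert encode_board: all-zero blocks are decoded as empty cells (0)."""
    cube = np.asarray(vec).reshape(k, k, k)
    return np.where(cube.sum(axis=2) > 0, cube.argmax(axis=2) + 1, 0)

solved = np.array([[1, 2, 3, 4],
                   [3, 4, 1, 2],
                   [2, 1, 4, 3],
                   [4, 3, 2, 1]])
y_vec = encode_board(solved)
```

For a solved $4 \times 4$ board the encoding has $4^3 = 64$ entries, exactly $16$ of which are one.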
Baselines: Our neural baseline is Recurrent Relational Networks (RRN) (palm&al18): a purely neural approach for reasoning based on Graph Neural Networks. We use the default set of parameters provided in their paper for training RRNs. We note that the RRN baseline uses additional information in the form of graph structure which is not available to our solver. We make this comparison to see how well we perform compared to one of the SOTA techniques for this problem. We also compare against another neurosymbolic architecture using SATNet (wang&al19) as the reasoning layer. The neural component in all the neurosymbolic methods is a CNN that decodes images into digits.
Results: Table 1 compares the performance and training time of our method against the different baselines described above. CombOptNet fails on this problem, not being able to complete training for any of the sizes in the stipulated time. The purely neural model gives a decent performance and is competitive with ours for smaller board sizes. But for the larger board size of $9 \times 9$, we beat the RRN based model by a significant margin of about 25 points and 9 points in visual and symbolic sudoku, respectively. While SATNet performs comparably on smaller sudoku puzzles, its performance degrades drastically to 17.8% for board size $9 \times 9$ on visual sudoku. See appendix for a comparison on the dataset used in wang&al19.
4.2 Random constraints
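Table 2: Vector accuracy (%) and training time (min) on random constraints, for 1, 2, 4 and 8 ground truth constraints.

| Setting | Method | Acc. 1 (%) | Acc. 2 (%) | Acc. 4 (%) | Acc. 8 (%) | Time 1 (min) | Time 2 (min) | Time 4 (min) | Time 8 (min) |
| Binary | CombOptNet | 97.6 | 95.3 | 84.3 | 63.4 | 8.2 | 13.5 | 26.5 | 40.8 |
| Binary | ILP–Loss (Ours) | 97.8 | 96.0 | 92.8 | 87.8 | 7.3 | 11.6 | 18.1 | 27.5 |
| Dense | CombOptNet | 89.3 | 74.8 | 34.3 | 2.0 | 9.9 | 16.8 | 24.7 | 48.2 |
| Dense | ILP–Loss (Ours) | 96.6 | 86.3 | 74.0 | 41.5 | 7.3 | 15.6 | 17.6 | 20.6 |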
This is the synthetic dataset borrowed from paulus&al21. The training data is created by first generating a random polyhedron by sampling $m$ hyperplanes in an $n$ dimensional bounded continuous space. A cost vector $c$ is then randomly sampled and the corresponding $y^*$ is obtained by solving the resulting ILP over the output space $Y$. The objective is to learn a constraint matrix $A$ and vector $b$ such that for a ground truth pair $(c, y^*)$, $y^* = \arg\min_{z \in Y} c^{\top} z$ subject to $Az + b \ge 0$.
Dataset: Restricting the output space $Y$ in two different ways results in two variations, referred to as the ‘binary’ and ‘dense’ settings, respectively. For both of the output spaces, we experiment with four different settings in $n$ dimensional space by varying the number of ground truth constraints as $m = 1, 2, 4$ and $8$. For each $m$, we experiment with all the 10 datasets provided by paulus&al21 and report mean and standard error over the 10 models. The training data in each case consists of $(c, y^*)$ pairs, and model performance is tested on held-out cost vectors.
Baseline: Here we compare only against CombOptNet. A neural baseline (an MLP) is shown to perform badly for this symbolic problem in paulus&al21. Hence we exclude this in our experiments.
Results: Table 2 presents the comparison of the two algorithms in terms of accuracy as well as time taken for training. While we perform marginally better than CombOptNet in terms of training time, our accuracy is significantly better on almost all problems in the dense setting, and on the larger problems in the binary setting. We are roughly 24 and 40 accuracy points better than CombOptNet on the largest problem instance with 8 constraints in the binary and dense settings respectively. This result demonstrates that our approach is not only faster in terms of training time, but also results in better solutions compared to the baseline, validating the effectiveness of our approach.
4.3 Knapsack from Sentence Descriptions
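Table 3: Vector accuracy (mean in %) and training time (mean in min) on knapsack, for 10, 15, 20, 25 and 30 items. "-" denotes timeout.

| Method | Acc. 10 (%) | Acc. 15 (%) | Acc. 20 (%) | Acc. 25 (%) | Acc. 30 (%) | Time 10 (min) | Time 15 (min) | Time 20 (min) | Time 25 (min) | Time 30 (min) |
| CombOptNet | 63.2 | 48.2 | 30.1 | 2.6 | 0.0 | 41.0 | 61.4 | 153.0 | - | - |
| ILP–Loss (Ours) | 71.4 | 58.5 | 48.7 | 41.0 | 28.4 | 44.0 | 51.0 | 82.2 | 106.1 | 111.6 |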
This task, also borrowed from paulus&al21, is based on the classical NP-hard Knapsack problem. Each input consists of $N$ sentences describing the price and weight of $N$ items, and the objective is to select a subset of items that maximizes their total price while keeping their total weight no more than the knapsack’s capacity $C$: $\arg\max_{z \in \{0,1\}^{N}} p^{\top} z$ subject to $w^{\top} z \le C$, where $p, w \in \mathbb{R}^{N}$ are the price and weight vectors, respectively. Each sentence has been converted into a dense embedding using conneau&al17, so that each input $x$ is the concatenation of $N$ sentence embeddings. The corresponding output is $y^* \in \{0,1\}^{N}$. The knapsack capacity $C$ is fixed for all instances in a dataset. The task is to learn the parameters $\theta_A$, $\theta_b$ and $\theta_c$ of a neural network that extracts the cost and the constraints from $x$. Note that here both the cost and the constraints need to be inferred from the input.
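To make the target of learning concrete, the sketch below writes a knapsack instance as an ILP with a cost vector and a single inequality constraint, and recovers the ground truth by enumeration. The function names, the symbols `A, b, c`, and the three-item instance are illustrative assumptions; enumeration is only practical for very small numbers of items.

```python
import itertools
import numpy as np

def knapsack_as_ilp(prices, weights, capacity):
    """Express max p^T z s.t. w^T z <= C as min c^T z s.t. A z + b >= 0."""
    c = -np.asarray(prices, dtype=float)             # maximise price = minimise -price
    A = -np.asarray(weights, dtype=float)[None, :]   # -w^T z + C >= 0
    b = np.array([float(capacity)])
    return A, b, c

def brute_force_solve(A, b, c):
    """Enumerate {0,1}^N and return the feasible point with minimum cost."""
    best, best_cost = None, np.inf
    for bits in itertools.product((0, 1), repeat=c.size):
        z = np.array(bits, dtype=float)
        if np.all(A @ z + b >= 0) and c @ z < best_cost:
            best, best_cost = z, c @ z
    return best

# Hypothetical instance with three items and capacity 5.
A, b, c = knapsack_as_ilp(prices=[6, 10, 12], weights=[1, 2, 3], capacity=5)
y_star = brute_force_solve(A, b, c)
```

Here the second and third items exactly fill the knapsack and give the highest total price, so they form the ground truth selection.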
Dataset: The dataset in paulus&al21 has train and test instances with fixed items, extracted from a corpus containing sentences and their embeddings. In addition to experimenting with the original dataset, to demonstrate the scalability of our method, we also bootstrap new datasets with and , by randomly selecting sentences from the original corpus. Each bootstrapped dataset has train and test instances.
Results: Table 3 compares the training time and accuracy of our method against CombOptNet. We significantly outperform the baseline in terms of accuracy across all the datasets. For the smallest problem size, with only 10 items, our training time is comparable to the baseline. This is expected: for small problems, ILP solving may not be the bottleneck, and the relative speedup obtained by a solver-free method is offset by the increased number of iterations it may require to train. Our relative gain increases significantly with increasing problem size. On the largest problem, CombOptNet fails to complete even a single epoch in the stipulated time, resulting in 0% accuracy.
4.4 Keypoint Matching
Here we experiment on a real-world task of matching keypoints between two images of different objects of the same type (rolinek&al20b). The input consists of a pair of images of the same type along with a shuffled list of coordinates (pixels) of $k$ keypoints in both images. The task is to match the same keypoints in the two images with each other. The ground truth constraint enforces a bijection between the keypoints in the two images. The output is represented as a binary permutation matrix, i.e., the entries in each row and each column should sum to 1. The cost is a $k^2$-sized vector parameterized by a neural network. The goal is to learn the cost parameters together with the constraints.
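The bijection requirement described above can be written as linear equality constraints over the flattened matching matrix. The sketch below builds these ground truth constraints for a small number of keypoints and checks two candidate outputs. The function names and the row-major flattening are assumptions made for illustration; the form in which the model represents its learnt constraints may differ.

```python
def bijection_constraints(k):
    """Return (rows, rhs) such that rows[i] . y == rhs[i] for a k x k
    permutation matrix y flattened in row-major order (an assumed layout)."""
    rows = []
    for i in range(k):  # each keypoint of image 1 is matched exactly once
        rows.append([1 if idx // k == i else 0 for idx in range(k * k)])
    for j in range(k):  # each keypoint of image 2 is matched exactly once
        rows.append([1 if idx % k == j else 0 for idx in range(k * k)])
    return rows, [1] * (2 * k)


def is_bijection(y, rows, rhs):
    """Check every row-sum and column-sum equality."""
    return all(
        sum(a * v for a, v in zip(row, y)) == target
        for row, target in zip(rows, rhs)
    )


rows, rhs = bijection_constraints(3)

permutation = [0, 1, 0,
               1, 0, 0,
               0, 0, 1]
not_a_matching = [1, 1, 0,   # first keypoint matched twice
                  0, 0, 0,   # second keypoint unmatched
                  0, 0, 1]

print(is_bijection(permutation, rows, rhs))
print(is_bijection(not_a_matching, rows, rhs))
```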
Dataset: We experiment with image pairs from the SPair-71k dataset (min&al19), used for the task of keypoint matching in (rolinek&al20b). Since neural ILP models can only deal with a fixed-size output space, we artificially create 4 datasets for 4, 5, 6 and 7 keypoints from the original dataset. While generating samples for the $k$-keypoint dataset, we randomly sample any $k$ annotated pairs from input image pairs that have more than $k$ keypoints annotated. See the appendix for details on dataset size.
Baselines: We compare against a strong neural baseline, which is the same as the backbone model parameterizing the cost vector in (rolinek&al20b). It is trained by minimizing the BCE loss between the negative learnt cost and the target. We create an additional baseline by doing constrained inference with the ground truth constraints and the learnt cost (‘Neural + CI’ in table 4).
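Table 4: Keypoint matching results. Acc. = pointwise accuracy (in %); Time = training time (in min). The number after Acc. or Time is the number of keypoints.

| Method | Acc. 4 | Acc. 5 | Acc. 6 | Acc. 7 | Time 4 | Time 5 | Time 6 | Time 7 |
|---|---|---|---|---|---|---|---|---|
| Neural | 80.88 | 78.04 | 75.39 | 73.49 | 148 | 37 | 30 | 40 |
| Neural + CI | 82.42 | 79.99 | 77.64 | 75.88 | 148 | 37 | 30 | 40 |
| CombOptNet | 83.86 | 81.43 | 78.88 | 76.85 | 41 | 67 | 144 | 279 |
| ILP–Loss (Ours) | 81.76 | 79.59 | 77.84 | 76.18 | 115 | 92 | 106 | 109 |
| ILP–Loss + Solver (Ours) | 84.64 | 81.27 | 79.51 | 78.59 | 43 | 73 | 99 | 174 |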
Results: Table 4 presents the percentage of keypoints matched correctly by different models. In this experiment, the bottleneck w.r.t. time is the backbone neural model rather than the ILP solver. Therefore, we also experiment with solver-based negatives in our method (ILP–Loss + Solver). While ILP–Loss using solver-free negatives performs somewhat worse than CombOptNet in terms of accuracy, especially for smaller problem sizes, using solver-based negatives helps ILP–Loss surpass CombOptNet in terms of both accuracy and training efficiency for large problem sizes (and makes it comparable on the others). This is because a solver-based negative sample is guaranteed to be incorrectly classified (as positive) by the current hyperplanes, and hence provides a much stronger training signal than the solver-free negatives in ILP–Loss. We obtain a gain of up to 5 points over the Neural baseline, which is roughly double the gain obtained by Neural + CI.
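The idea of a solver-based negative mentioned above can be sketched as follows: solve the ILP defined by the current (learnt) constraints and, if the solution differs from the target, use it as a negative. By construction it satisfies every current constraint, so the current hyperplanes classify it as positive. This is a minimal illustration under stated assumptions: the feasibility convention `a.y + b >= 0`, the brute-force enumeration standing in for an ILP solver, and all names and numbers are hypothetical.

```python
from itertools import product


def is_feasible(A, b, y):
    """Assumed convention: y is feasible when a.y + b_i >= 0 for every row."""
    return all(
        sum(a * v for a, v in zip(row, y)) + bi >= 0
        for row, bi in zip(A, b)
    )


def solve_ilp(A, b, c, domain=(0, 1)):
    """Brute-force stand-in for an ILP solver: arg min of c.y over feasible y."""
    feasible = [y for y in product(domain, repeat=len(c)) if is_feasible(A, b, y)]
    return min(
        feasible,
        key=lambda y: sum(ci * v for ci, v in zip(c, y)),
        default=None,
    )


def solver_based_negative(A, b, c, target):
    """Return the current ILP solution if it is wrong, else None."""
    y_hat = solve_ilp(A, b, c)
    if y_hat is None or y_hat == tuple(target):
        return None
    return y_hat


# Hypothetical current model: one learnt constraint y1 + y2 >= 1, cost (1, 2).
A, b, c = [[1, 1]], [-1], [1, 2]
negative = solver_based_negative(A, b, c, target=(0, 1))
print(negative, is_feasible(A, b, negative))
```

The printed feasibility flag is what makes such a sample informative: at least one hyperplane has to move for it to be excluded.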
5 Conclusion and Future Work
We have presented a solver-free framework for scalable training of a neural ILP architecture. Our method learns the linear constraints by viewing them as linear binary classifiers that separate the positive points (inside the polyhedron) from the negative points (outside the polyhedron). While the given ground truth acts as the positive, we propose multiple strategies for sampling negatives. A simple trick, using the available ground truth outputs in the training data, converts the cost vector into a constraint, enabling us to learn the cost vector and the constraints in a similar fashion.
Future work involves extending our method to neural-ILP-neural architectures, i.e., where the output of the ILP is an input to a downstream neural model (see the appendix for a detailed discussion). Second, a neural ILP model works with a fixed-dimensional output space, even though the constraints for the same underlying problem are the same in first-order logic, e.g., the constraints for sudoku puzzles remain the same in first-order logic irrespective of the board size. Creating neural ILP models that can parameterize the constraints on the basis of the size of the input (or learnt) cost vector is a potential direction for future work. Lastly, the inference time with the learnt constraints can be high, especially for large problems like sudoku. In addition, the learnt constraints might not be interpretable even if the ground truth constraints are. In the future, we would like to develop methods that distill the learnt constraints into a human-interpretable form, which may address both these limitations.
Acknowledgements
We thank the IIT Delhi HPC facility (http://supercomputing.iitd.ac.in) for computational resources. We thank anonymous reviewers for their insightful comments that helped in further improving our paper. We also thank Ashish Chiplunkar, whose course on Mathematical Programming helped us gain insights into existing methods for ILP solving, and Daman Arora for highly stimulating discussions. Mausam is supported by grants from Google, Bloomberg, 1MG and Jai Gupta chair fellowship by IIT Delhi. Parag Singla was supported by the DARPA Explainable Artificial Intelligence (XAI) Program with number N660011724032. Both Mausam and Parag are supported by IBM AI Horizon Networks (AIHN) grant and IBM SUR awards. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies.
References
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work? In the Conclusion and Future Work section (section 5).

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results?

Did you include complete proofs of all theoretical results?


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Code is now available at https://github.com/dairiitd/ilploss

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Details are in experiments section 4 under ‘Training Methodology’ and in the appendix.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? In three out of four experiments.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? In the experiments section 4.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? Datasets are cited in the respective experiment sections.

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix
3 Differentiable ILP Loss
3.2 A Solver-free Framework
In fig. 1, we consider a 2-dimensional ILP with four ground truth constraints, a cost vector, and a target solution. The dots represent integral points. The solid lines represent the hyperplanes (here, lines) corresponding to the constraints, and the dashed line represents our cost constraint. The shaded area in fig. 1 is the feasible region of the constraint satisfaction problem obtained by adding the cost constraint to the ground truth constraints.
Figure 1 shows the ground truth ILP. The signs of the constraint coefficients are such that the closed polyhedron (here, a polygon) containing the blue points forms the feasible region. The green point is optimal w.r.t. the cost. The red points are infeasible. Note that the target solution (the point with the green border) is the only integral point in the shaded region: the other blue points violate the cost constraint, and the red points violate at least one of the four feasibility constraints.
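The observation that the target is the only integral point left once the cost constraint is added can be checked on a toy problem. The sketch below is illustrative only: the convention that a point satisfies a constraint when `a.y + b >= 0`, the helper names, and the two-variable problem are assumptions, not the setup of fig. 1.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cost_constraint(c, y_star):
    """Encode c.y <= c.y_star as a row (a, b) with a.y + b >= 0."""
    return [-ci for ci in c], dot(c, y_star)


def satisfies_all(A, b, y):
    return all(dot(row, y) + bi >= 0 for row, bi in zip(A, b))


# Hypothetical toy ILP: minimise c.y subject to y1 + y2 >= 1, y binary.
A = [[1, 1]]
b = [-1]
c = [1, 2]
y_star = (1, 0)  # optimal solution of the toy ILP

# Append the cost constraint to obtain a constraint satisfaction problem.
a_cost, b_cost = cost_constraint(c, y_star)
A_csp, b_csp = A + [a_cost], b + [b_cost]

positives = [
    y for y in product((0, 1), repeat=2) if satisfies_all(A_csp, b_csp, y)
]
print(positives)
```

Every other binary point fails either the feasibility constraint or the cost constraint, so it can serve as a negative, provided the optimum is unique as it is in this toy problem.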
Figure 1 shows a possible situation during learning. For simplicity, consider a temperature close to zero, so that only the closest hyperplane contributes to the negative loss for a negative sample. Also consider margins close to zero. In fig. 1, the point with the green border is outside the shaded region, whereas one red point is inside. The ground truth is on the wrong side of only the fourth hyperplane, which is hence the only contributor to the positive loss. The red negative point inside the shaded region is closest to the third hyperplane, which therefore contributes to the negative loss. The green and red dotted arrows indicate the directions of movement of the constraint hyperplanes on a weight update.
4 Experiments
Details of the ILP solver and the hardware used for experiments: To solve the learnt ILPs, we use the Gurobi ILP solver [gurobi], available under a ‘Named-user academic license’. All our experiments were run on 11 GB ‘GeForce GTX 1080 Ti’ GPUs installed on a machine with a 2.60GHz Intel(R) Xeon(R) Gold 6142 CPU. For each of the algorithms, we kept a maximum time limit of 12 hours for training.
4.1 Symbolic and Visual Sudoku
Hyperparameters and other design choices:
(a) Number of learnable constraints: the number of learnable equality constraints is set as a function of the number of binary variables.
(b) Margin: we find that a single margin value works well across domains and problem sizes.
(c) Temperature: we anneal the temperature by a constant factor whenever the performance on a small held-out set plateaus.
(d) Early stopping: we stop the training early based on validation set performance, with a timeout of 12 hours for each experiment. We bypass the validation bottleneck of solving the ILPs from scratch by providing the gold solutions as hints when invoking Gurobi.
(e) Negative sampling: for sampling neighbors, we select all the one-hop neighbors and an equal number of randomly selected 2-, 3- and 4-hop neighbors.
(f) Initialization: we use a standard Gaussian initialization for both CombOptNet and our method.
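The neighbor-based negative sampling described in the design choices above can be sketched as follows. This is a minimal illustration under an explicit assumption: for binary variables, a k-hop neighbor is taken to be a vector that differs from the target in exactly k positions. The function names, the fixed random seed and the toy target are hypothetical.

```python
import math
import random


def flip(y, positions):
    """Return a copy of the binary vector y with the given positions flipped."""
    out = list(y)
    for i in positions:
        out[i] = 1 - out[i]
    return tuple(out)


def one_hop_neighbors(y):
    return [flip(y, (i,)) for i in range(len(y))]


def sample_k_hop_neighbors(y, k, count, rng):
    """Sample `count` distinct vectors at Hamming distance exactly k from y."""
    n = len(y)
    count = min(count, math.comb(n, k))  # cannot draw more than exist
    chosen = set()
    while len(chosen) < count:
        chosen.add(tuple(sorted(rng.sample(range(n), k))))
    return [flip(y, positions) for positions in sorted(chosen)]


def sample_negatives(y, hops=(2, 3, 4), rng=None):
    """All one-hop neighbors plus an equal number of k-hop ones per k in hops."""
    rng = rng or random.Random(0)
    negatives = one_hop_neighbors(y)
    for k in hops:
        negatives += sample_k_hop_neighbors(y, k, len(y), rng)
    return negatives


y = (0, 1, 0, 1, 1, 0)  # hypothetical target vector
negatives = sample_negatives(y)
print(len(negatives))
```

Rejection sampling of flip positions avoids enumerating all k-subsets, which would be impractical for the number of variables in a sudoku board.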
Comparison with SATNet on their dataset: wang&al19 use a different set of puzzles for training and testing sudoku and report 63.2% accuracy on visual sudoku, which differs from what SATNet obtains on our dataset. Hence we trained both SATNet and our model on the dataset available in SATNet’s GitHub repository. Interestingly, on their visual sudoku dataset, which we believe to be easier (as shown by the performance numbers), our run of SATNet achieves 71.0% board accuracy whereas our method achieves 98.3%.
4.2 Random Constraints
Hyperparameters and other design choices: We keep the number of learnable constraints as twice the number of ground truth constraints, as paulus&al21 report the best performance in most of the settings with this choice. Here we use a uniform initialization. The rest of the hyperparameters are set as in the case of sudoku.
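Table 5: Random constraints. Acc. = vector accuracy (mean ± std. err., in %); Time = training time (mean ± std. err., in min). The number after Acc. or Time is the number of ground truth constraints.

| Setting | Method | Acc. 1 | Acc. 2 | Acc. 4 | Acc. 8 | Time 1 | Time 2 | Time 4 | Time 8 |
|---|---|---|---|---|---|---|---|---|---|
| Binary | CombOptNet | 97.6 ± 0.4 | 95.3 ± 0.5 | 84.3 ± 3.5 | 63.4 ± 4.0 | 8.2 ± 1.7 | 13.5 ± 1.6 | 26.5 ± 1.4 | 40.8 ± 4.0 |
| Binary | ILP–Loss | 97.8 ± 0.4 | 96.0 ± 0.3 | 92.8 ± 0.6 | 87.8 ± 3.4 | 7.3 ± 1.7 | 11.6 ± 1.9 | 18.1 ± 2.4 | 27.5 ± 4.8 |
| Dense | CombOptNet | 89.3 ± 1.1 | 74.8 ± 1.9 | 34.3 ± 5.6 | 2.0 ± 0.6 | 9.9 ± 1.4 | 16.8 ± 1.3 | 24.7 ± 2.0 | 48.2 ± 2.3 |
| Dense | ILP–Loss | 96.6 ± 0.3 | 86.3 ± 2.3 | 74.0 ± 5.4 | 41.5 ± 5.7 | 7.3 ± 1.1 | 15.6 ± 2.1 | 17.6 ± 2.6 | 20.6 ± 4.5 |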
Results: See table 5 for the standard error of the accuracy and training time over 10 random datasets for different number of ground truth constraints.
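For readers who want to reproduce summary statistics of this kind, the sketch below computes a mean and its standard error from a list of per-run results. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs; whether the reported numbers use exactly this estimator is an assumption, and the accuracy values are made up.

```python
import math
import statistics


def mean_and_std_err(values):
    """Mean and standard error (sample std / sqrt(n)) of per-run results."""
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, std_err


# Hypothetical accuracies (in %) from three runs.
runs = [96.0, 97.0, 98.0]
mean, std_err = mean_and_std_err(runs)
print(f"{mean:.1f} ± {std_err:.1f}")
```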
4.3 Knapsack from Sentence Descriptions
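Table 6: Knapsack. Acc. = vector accuracy (mean ± std. err., in %); Time = training time (mean ± std. err., in min). The number after Acc. or Time is the number of items. TO = timeout.

| Method | Acc. 10 | Acc. 15 | Acc. 20 | Acc. 25 | Acc. 30 | Time 10 | Time 15 | Time 20 | Time 25 | Time 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| CombOptNet | 63.2 ± 0.6 | 48.2 ± 0.4 | 30.1 ± 1.0 | 2.6 ± 0.3 | 0.0 ± 0.0 | 41.0 ± 4.1 | 61.4 ± 4.6 | 153.0 ± 8.8 | TO | TO |
| ILP–Loss | 71.4 ± 0.4 | 58.5 ± 0.3 | 48.7 ± 0.7 | 41.0 ± 0.5 | 28.4 ± 0.7 | 44.0 ± 5.7 | 51.0 ± 7.9 | 82.2 ± 9.8 | 106.1 ± 7.7 | 111.6 ± 11.0 |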
Hyperparameters and other design choices: Following paulus&al21, we use a two-layer MLP with hidden dimension 512 to extract the cost and the constraints from the sentence embeddings. The number of constraints is set to 4, a setting which achieves the best performance in paulus&al21. A notable difference is in the output layer of our MLP. paulus&al21 assume access to the ground truth price and weight ranges, and use a sigmoid output nonlinearity with suitable scale and shift to produce outputs in the correct range. We do the same for CombOptNet, but for ILP–Loss we simply use a linear activation at the output. We note that training CombOptNet with a linear activation, without access to the ground truth ranges, gives poorer results.
Results: See table 6 for standard error over 10 runs with different random seeds for varying knapsack sizes.
4.4 Keypoint Matching
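Table 7: Number of train and test samples in the keypoint matching datasets.

| Num Keypoints | 4 | 5 | 6 | 7 |
|---|---|---|---|---|
| #Test | 10,474 | 9,308 | 7,910 | 6,580 |
| #Train | 43,916 | 37,790 | 31,782 | 26,312 |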
Hyperparameters: For each number of keypoints, the number of learnable constraints is set equal to the number of ground truth constraints. For 5, 6 and 7 keypoints, in addition to random initialization, we also experiment with initializing the backbone cost parameters with those obtained by training on 4 keypoints, and pick the variant that obtains better accuracy on the validation set. The pretrained initialization is the one picked for all the methods for 6 and 7 keypoints.
For ILP–Loss with only solver and batch negatives (ILP–Loss + Sol.), we anneal the temperature by a constant factor whenever performance on a small validation set plateaus. For ILP–Loss with only solver-free negatives, we anneal the temperature by a constant factor at two fixed epochs. As done in paulus&al21, we use a uniform initialization. The rest of the hyperparameters are the same as those used for sudoku.
Dataset details: See table 7 for the number of train and test samples in the four datasets created for 4, 5, 6 and 7 keypoints.
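Table 8: Keypoint matching. Acc. = pointwise accuracy (mean ± std. err., in %); Time = training time (mean ± std. err., in min). The number after Acc. or Time is the number of keypoints.

| Method | Acc. 4 | Acc. 5 | Acc. 6 | Acc. 7 | Time 4 | Time 5 | Time 6 | Time 7 |
|---|---|---|---|---|---|---|---|---|
| Neural | 80.88 ± 0.87 | 78.04 ± 0.40 | 75.39 ± 0.50 | 73.49 ± 0.55 | 148 ± 26 | 37 ± 12 | 30 ± 9 | 40 ± 13 |
| Neural + CI | 82.42 ± 0.55 | 79.99 ± 0.16 | 77.64 ± 0.25 | 75.88 ± 0.43 | 148 ± 26 | 37 ± 12 | 30 ± 9 | 40 ± 13 |
| CombOptNet | 83.86 ± 0.62 | 81.43 ± 0.49 | 78.88 ± 0.65 | 76.85 ± 0.54 | 41 ± 15 | 67 ± 8 | 144 ± 31 | 279 ± 39 |
| ILP–Loss | 81.76 ± 1.71 | 79.59 ± 0.18 | 77.84 ± 0.36 | 76.18 ± 0.06 | 115 ± 13 | 92 ± 3 | 106 ± 1 | 109 ± 5 |
| ILP–Loss + Sol. | 84.64 ± 0.62 | 81.27 ± 1.12 | 79.51 ± 0.53 | 78.59 ± 0.55 | 43 ± 12 | 73 ± 10 | 99 ± 9 | 174 ± 25 |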
Results: See table 8 for the standard error of pointwise accuracy and training times over 3 runs with different random seeds for varying number of keypoints.
5 Future Work
Discussion on training Neural-ILP-Neural architectures: In the current formulation, the availability of the solution to the ground truth ILP is important for our solver-free approach to work. Specifically, it is required (1) to convert the constrained optimization problem into a constraint satisfaction problem by including the cost constraint (eq. 4), and (2) to calculate the positive loss (eq. 6). However, in a Neural-ILP-Neural architecture, intermediate supervision for the Neural-ILP part alone (i.e., the solution of the ground truth ILP) is not available.
On the other hand, solver-based methods such as CombOptNet do not require access to the solution of the ground truth ILP. Instead, they rely on the solution of the current intermediate ILP (during learning) to compute the gradients of the loss, and thus their approach is not solver-free. We note that even though in principle they can train Neural-ILP-Neural architectures, their experiments are only in the Neural-ILP setting.
Extending our current work to Neural-ILP-Neural architectures is an important direction for future work. One plausible approach could be to train an auxiliary inverse network that converts a given output of the Neural-ILP-Neural architecture into a predicted symbolic target for the Neural-ILP component. This predicted target can be used as a proxy for the ground truth solution of the Neural-ILP part. Similar ideas of using an inverse network have been explored in [agarwal&al21], albeit in a setting where the ILP is known and only the neural encoder and decoder need to be learnt.