1 Introduction
Mixed-integer linear programs (MILPs) are widely used in operations research to model decision problems like personnel rostering [22], timetabling [9], and routing [16], among many others. A MILP is a constrained optimization program of the form:
(1)  $\min_{x} \; c \cdot x$
(2)  $\text{s.t. } Ax \le b$
over variables $x \in \mathbb{Z}^k \times \mathbb{R}^{n-k}$. Like in regular linear programs, the objective function is linear with parameter $c \in \mathbb{R}^n$ (Eq. 1) and the feasible set is a polytope defined by a coefficient matrix $A \in \mathbb{R}^{m \times n}$ and a bias vector $b \in \mathbb{R}^m$ (Eq. 2). What makes MILPs special is that some of the variables are restricted to be integers. This allows MILPs to capture numerical optimization problems with a strong combinatorial component. For instance, 0–1 variables can be used to decide whether an action should be taken or not, like taking a particular route, buying a machine, or assigning a nurse to a shift. While MILPs are NP-hard to solve in general, practical solvers like Gurobi [11] and CPLEX [7] can readily handle large instances, making MILPs the formalism of choice in many scientific and industrial applications.

Designing MILPs, however, requires substantial modelling expertise, making it hard for non-experts to take advantage of this powerful framework. One option is then to use constraint learning [8] to induce MILPs from historical data, e.g., examples of high-quality solutions (positive examples) and subpar or infeasible configurations (negative examples). A major difficulty is that in practice historical data is collected under the effects of temporary restrictions and resource limitations. For instance, in a scheduling problem some workers may be temporarily unavailable due to sickness or parental leave. We use the term “context” to indicate such temporary constraints. Crucially, by restricting the feasible space, contexts can substantially alter the set of optimal solutions, biasing the data, cf. Figure 1. Learning algorithms that ignore the contextual nature of their supervision are bound to perform poorly [14]. Existing approaches for learning MILP programs suffer from this issue.
The aim of this work is to amend this situation. We formalize contextual learning of MILPs and show that the resulting learning problem contains both continuous and combinatorial elements that make it challenging for standard techniques. To address this, we introduce missle, a novel approach that combines ideas from combinatorial search and gradient-based optimization. In particular, missle relaxes the original loss to obtain a natural, smoother surrogate on which gradient information can be readily computed, and then uses the latter to guide a stochastic local search procedure. Our empirical evaluation on synthetic data shows that missle performs better than two natural competitors – namely, pure gradient descent and pure stochastic local search – in terms of quality of the acquired MILPs and computational efficiency.
2 Learning MILPs from Contextual Data
Our goal is to acquire a MILP – and specifically its parameters $c$, $A$, and $b$ – from examples of contextual solutions and non-solutions. Contexts can alter the set of solutions of a MILP: a configuration that is optimal in context $\pi$ might be arbitrarily suboptimal or even infeasible in a different context $\pi'$, cf. Figure 1. To see this, consider the following example:
Example 1.
A company makes two products ($P_1$ and $P_2$) using two machines ($M_1$ and $M_2$). Producing one unit of $P_1$ requires $a_{11}$ minutes of processing time on machine $M_1$ and $a_{21}$ minutes on machine $M_2$, while each unit of $P_2$ takes $a_{12}$ minutes on $M_1$ and $a_{22}$ minutes on $M_2$. Both machines can be run for a maximum of 4 hours (240 minutes) every day. The company makes a profit of $c_1$ on each unit of $P_1$ and $c_2$ on each unit of $P_2$. The aim is to decide the numbers of units $x_1$ and $x_2$ to produce for each product so that the profit is maximised. This problem can be written as a MILP:

$\max_{x_1, x_2} \; c_1 x_1 + c_2 x_2 \quad \text{s.t.} \quad a_{11} x_1 + a_{12} x_2 \le 240, \;\; a_{21} x_1 + a_{22} x_2 \le 240, \;\; x_1, x_2 \in \mathbb{Z}_{\ge 0}$
The optimal solution $x^* = (x_1^*, x_2^*)$ for this problem yields the maximum total profit $c_1 x_1^* + c_2 x_2^*$. Now, consider a temporary reduction in demand for $P_1$, so that the company wants to produce at most $k$ units of $P_1$ every day. This context can be added as a hard constraint $x_1 \le k$, leading to a different optimal solution. Notice that whenever $x_1^* > k$, the solution without context is infeasible in this context.
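To make the effect of a context concrete, the sketch below solves a tiny two-product instance by brute-force enumeration, with and without the extra demand constraint. The coefficients are our own illustrative choices (the example above is stated symbolically), and the brute-force routine stands in for a real MILP solver:

```python
from itertools import product

def solve_milp(c, A, b, ub=20):
    """Brute-force a tiny all-integer MILP: maximise c.x s.t. Ax <= b, 0 <= x_i <= ub.
    Only viable for illustrative instances; real MILPs need a solver such as Gurobi."""
    best_x, best_val = None, float("-inf")
    for x in product(range(ub + 1), repeat=len(c)):
        if all(sum(a * xi for a, xi in zip(row, x)) <= bj for row, bj in zip(A, b)):
            val = sum(ci * xi for ci, xi in zip(c, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# Our own illustrative coefficients: profits c, per-unit machine minutes A,
# machine capacities b.
c = [3, 2]
A = [[2, 1], [1, 3]]
b = [10, 15]

opt, profit = solve_milp(c, A, b)                           # optimum without context
opt_ctx, profit_ctx = solve_milp(c, A + [[1, 0]], b + [2])  # context: x1 <= 2
```

With these numbers the context-free optimum violates the context constraint, so it is infeasible in the restricted setting, exactly as in the example.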
We define the problem of learning MILPs from contextual examples as follows:
Definition 1.
Given a data set of context-specific examples $D = \{(x_i, \pi_i, y_i)\}_{i=1}^{n}$, where $y_i = 1$ iff $x_i$ is a high-quality solution for context $\pi_i$, find a MILP with parameters $\theta = (c, A, b)$ that can be used to obtain high-quality configurations in other contexts of interest.
By “high-quality” configurations, we mean configurations that are feasible and considered (close to) optimal. Going forward we will use $M_\theta$ to indicate the MILP defined by $\theta = (c, A, b)$.
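The objects in Definition 1 can be represented directly in code; a minimal sketch follows, where the class and field names are our own rather than taken from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MILPParams:
    """Parameters theta = (c, A, b) of a candidate MILP."""
    c: List[float]        # cost vector, one entry per variable
    A: List[List[float]]  # constraint coefficients, one row per hard constraint
    b: List[float]        # constraint offsets, one entry per hard constraint

@dataclass
class ContextualExample:
    """One context-specific example (x_i, pi_i, y_i) from the data set D."""
    x: Tuple[float, ...]                            # a configuration
    context: Tuple[List[List[float]], List[float]]  # extra constraints (A', b')
    label: int                                      # 1 iff x is high-quality in this context
```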
2.1 Contexts
In the following, we restrict ourselves to contexts $\pi = (A', b')$ that impose additional linear constraints on the problem. This choice is both convenient and flexible. It is convenient because injecting linear contexts into a MILP gives another MILP, namely:
(3)  $\min_{x} \; c \cdot x$
(4)  $\text{s.t. } Ax \le b, \;\; A'x \le b'$
It is also very flexible: the context polytope $A'x \le b'$ can naturally encode frequently used constraints like partial assignments (which fix the value of a subset of the variables) and bounding-box constraints (which restrict the domain of a subset of the variables to a particular range). Furthermore, it is often possible to encode complex non-linear and logical constraints into this form using linearization techniques [24].
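As a sketch of how such contexts can be built, the helpers below emit the extra rows $(A', b')$ for a partial assignment and a bounding-box constraint; the function names are our own:

```python
def fix_value(n, j, v):
    """Partial assignment x_j = v over n variables, as two rows:
    x_j <= v and -x_j <= -v."""
    row = [0.0] * n
    row[j] = 1.0
    neg = [-e for e in row]
    return [row, neg], [float(v), -float(v)]

def bounding_box(n, j, lo, hi):
    """Bounding-box constraint lo <= x_j <= hi over n variables, as two rows:
    x_j <= hi and -x_j <= -lo."""
    row = [0.0] * n
    row[j] = 1.0
    neg = [-e for e in row]
    return [row, neg], [float(hi), -float(lo)]
```

Both helpers return context matrices in exactly the $A'x \le b'$ form of Eq. 4, so a context is just a stack of such rows.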
2.2 An initial formulation
Learning a MILP amounts to searching for a program that fits the data well. Naturally, ignoring the contextual nature of the data – for instance, by treating all positive examples as if they were global, rather than contextual, optima – can dramatically reduce the quality of the learned program [14], cf. Figure 1.
Given a MILP $M_\theta$ and a context $\pi$, we define $f_\theta(x; \pi)$ to be a binary classifier that labels a configuration $x$ as positive if and only if it is a solution to $M_\theta$ in context $\pi$:

$f_\theta(x; \pi) := \mathbb{1}[x \in \mathrm{opt}(M_\theta, \pi)]$

Here $\mathbb{1}[\cdot]$ is the indicator function that evaluates to $1$ if its argument holds and to $0$ otherwise, and $\mathrm{opt}(M_\theta, \pi)$ is the set of contextual solutions of $M_\theta$ in context $\pi$. Given a dataset $D$ of contextual examples, we propose to learn a MILP by minimizing the following 0–1 loss:

$\ell(\theta; D) := \sum_{(x, \pi, y) \in D} \mathbb{1}[f_\theta(x; \pi) \ne y]$
This amounts to looking for a MILP that minimizes the number of misclassified training examples.
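The 0–1 loss above can be sketched as follows, assuming a hypothetical solver oracle `solve(theta, context)` that returns the optimal objective value of the candidate MILP in the given context (a point classified positive iff it is feasible and attains that value):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies(A, b, x):
    """Check Ax <= b (vacuously true for an empty constraint set)."""
    return all(dot(row, x) <= bj + 1e-9 for row, bj in zip(A, b))

def zero_one_loss(theta, data, solve):
    """Number of examples misclassified by the classifier induced by
    theta = (c, A, b); `data` holds (x, context, label) triples."""
    c, A, b = theta
    mistakes = 0
    for x, (A_ctx, b_ctx), label in data:
        feasible = satisfies(A, b, x) and satisfies(A_ctx, b_ctx, x)
        # positive iff feasible and cost-optimal in this context
        pred = 1 if feasible and abs(dot(c, x) - solve(theta, (A_ctx, b_ctx))) < 1e-9 else 0
        mistakes += int(pred != label)
    return mistakes
```

Note that the oracle is only invoked for feasible points, but every call still amounts to solving an NP-hard problem, which is exactly the computational issue discussed next.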
2.3 Advantages and Limitations
The above strategy is motivated by the work of Kumar et al. [14], who have shown that, under suitable assumptions, classifiers that make few mistakes on the training set correspond to programs that output high-quality configurations in (even unobserved) target contexts.^1

^1 These results were obtained for MAXSAT programs but do carry over to general constraint optimization programs.
A downside of this formulation is that it is computationally challenging. Solving the above optimization problem is equivalent to minimizing the 0–1 loss on the training set, which is known to be NP-hard [4]. Furthermore, the objective function is piecewise constant: there exist infinitely many pairs of MILP models that have the same set of solutions and that therefore lie on the same plateau of the empirical loss. This makes it hard to apply optimization procedures based on pure gradient descent, as the gradient is zero almost everywhere. Standard local search is also not very effective, as it proceeds by making small uninformed steps and therefore might spend a long time trapped on a plateau. A second issue is that evaluating the loss is hard: checking whether an instance is predicted as positive – that is, evaluating $f_\theta(x; \pi)$ – involves finding an optimum of $M_\theta$ in context $\pi$ and ensuring that it has the same value as $x$. The issue is that computing this optimum requires solving the candidate MILP in context $\pi$, which is NP-hard in general. This step needs to be carried out repeatedly when minimizing Eq. 2.2, so it is important to keep the number of optimization steps at a minimum.
3 Learning MILPs with MISSLE
Due to the difficulty of optimizing Eq. 2.2 directly, we employ a smoother surrogate loss that offers gradient information and use it to guide a stochastic local search procedure.
3.1 A Surrogate Loss
We build a surrogate loss by enumerating the various ways in which $M_\theta$ can mislabel a contextual example and defining a loss for each case. Below, we write $d_j(x)$ to indicate the Euclidean distance between $x$ and the $j$-th hyperplane $a_j \cdot x = b_j$ of the feasible polytope [21]:

$d_j(x) := \frac{|a_j \cdot x - b_j|}{\lVert a_j \rVert}$

Additionally, we write $\mathrm{reg}(x; \pi)$ to indicate the regret of $x$ in context $\pi$, which measures the difference in quality between $x$ and a truly optimal configuration $x^*_\pi$ according to the learned cost vector $c$:

$\mathrm{reg}(x; \pi) := c \cdot x - c \cdot x^*_\pi, \qquad x^*_\pi \in \mathrm{opt}(M_\theta, \pi)$
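Under the definitions above (the standard point-to-hyperplane distance, and regret with respect to the learned cost vector), both quantities are straightforward to compute; the helper names below are our own:

```python
from math import sqrt

def hyperplane_distance(a_j, b_j, x):
    """Euclidean distance from point x to the hyperplane a_j . x = b_j."""
    num = abs(sum(ai * xi for ai, xi in zip(a_j, x)) - b_j)
    return num / sqrt(sum(ai * ai for ai in a_j))

def regret(c, x, x_star):
    """Difference in estimated cost between x and a context-optimal x_star."""
    cx = sum(ci * xi for ci, xi in zip(c, x))
    cstar = sum(ci * xi for ci, xi in zip(c, x_star))
    return cx - cstar
```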
In the remainder of this section, let $(x, \pi, y)$ be a misclassified example. If it is positive (that is, $y = 1$), then it is predicted as either infeasible or suboptimal by $M_\theta$. If it is feasible, then it is sufficient to make it optimal. This can be achieved by either increasing the estimated cost of $x^*_\pi$, by excluding the actual optimum from the feasible region, or by bringing $x$ closer to the boundary. This leads to the following loss, where the $\min$ operations select the closest hyperplane:
On the other hand, if $(x, \pi, y)$ is positive but predicted infeasible, then we have to enlarge the feasible region to make $x$ feasible. This is accomplished by penalizing the distance between $x$ and all the hyperplanes that exclude $x$ from the feasible region, using the following loss:
Finally, if $(x, \pi, y)$ is negative ($y = 0$) but classified as positive, then we want to make it either suboptimal or infeasible. This can be achieved by increasing its cost or by adapting the feasible region so that it excludes this example, giving:
Note that the exponential is used only to ensure that the loss is bounded below by $0$. The full surrogate loss is obtained by combining the various cases, as follows:
This surrogate loss is inspired by, but not identical to, the one given by Paulus et al. [17]. There are three core differences. First, Eq. 3.1 includes a term whose aim is to move the boundaries closer to $x$, as an optimal solution of a MILP lies close to the boundary of the feasible region. Second, our surrogate loss includes a term for misclassified negative examples, integrated through Eq. 3.1. Third, we integrate the contextual information into the surrogate loss by defining $x^*_\pi$ as an optimal solution in the context $\pi$. Beyond these technical differences, the setting in which we use this loss function also differs: Paulus et al. focus on end-to-end learning, for instance learning constraints from given textual descriptions, whereas our focus is to learn MILPs from contextual examples.
3.2 Searching for a MILP
The goal of missle is to find a program that minimizes the 0–1 loss defined in Eq. 2.2. The major barrier here is the continuous search space. To tackle this, we propose a variant of stochastic local search (SLS) guided by the gradient of the surrogate loss defined above.
The backbone of all SLS strategies is a heuristic search procedure that iteratively picks a promising candidate in the neighborhood $N(\theta)$ of the current candidate $\theta$, while injecting randomness into the search to escape local optima. The challenge is to define a neighborhood that contains promising candidates and allows the search algorithm to quickly reach the low-loss regions of the search space. To this end, we exploit the fact that the surrogate loss is differentiable w.r.t. $\theta$ to design a small set of promising neighbors, as follows. Each $\theta = (c, A, b)$ has three neighbours defined by the following moves:
Perturb the cost vector by taking a single gradient descent step of length $\eta$ while leaving $A$ and $b$ untouched: $c' = c - \eta \nabla_c \tilde{\ell}$.

Similarly, rotate the hard constraints by performing gradient descent w.r.t. $A$: $A' = A - \eta \nabla_A \tilde{\ell}$.

Translate the hard constraints by updating the offsets $b$: $b' = b - \eta \nabla_b \tilde{\ell}$.
In each iteration of the SLS procedure, the next hypothesis is chosen greedily, picking the candidate with minimal true loss $\ell$. Notice that this is different from the gradient step, which is based on the surrogate loss instead. The intuition is twofold: on the one hand, the gradient is cheap to compute and points towards higher-quality candidates; on the other, the SLS procedure smartly picks among the candidates, bypassing any approximations induced by the surrogate loss. Another advantage of this solution is that it naturally supports exploring continuous search spaces, whereas most SLS strategies focus on discrete ones.
We make use of two other minor refinements. First, the learning rate $\eta$ is adapted during the search: it is set close to 1 when the loss is high and slowly decreased towards 0 as the loss decreases. The idea is to take larger steps so as to quickly escape high-loss regions while carefully exploring the more promising areas. Second, we normalize both the direction of optimisation $c$ and the hyperplanes $A$ and $b$ after each step, thus regularizing the learning problem.
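One iteration of this gradient-guided neighborhood search can be sketched as below. Finite differences stand in for the analytical gradient of the surrogate, the restart branch is a placeholder, and all function names are our own:

```python
import random

def numeric_grad(f, v, eps=1e-6):
    """Central finite-difference gradient of f at vector v (autodiff stand-in)."""
    g = []
    for i in range(len(v)):
        up, dn = list(v), list(v)
        up[i] += eps
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

def neighbours(theta, surrogate, eta):
    """Three gradient-step neighbours of theta = (c, A, b): move c, A, or b only."""
    c, A, b = theta
    n = len(c)
    flat_A = [a for row in A for a in row]
    def unflat(v):  # rebuild A's rows from a flat vector
        return [v[i * n:(i + 1) * n] for i in range(len(A))]
    gc = numeric_grad(lambda v: surrogate((v, A, b)), c)
    gA = numeric_grad(lambda v: surrogate((c, unflat(v), b)), flat_A)
    gb = numeric_grad(lambda v: surrogate((c, A, v)), b)
    return [
        ([ci - eta * gi for ci, gi in zip(c, gc)], A, b),
        (c, unflat([ai - eta * gi for ai, gi in zip(flat_A, gA)]), b),
        (c, A, [bi - eta * gi for bi, gi in zip(b, gb)]),
    ]

def sls_step(theta, surrogate, true_loss, eta, restart, rng=random):
    """One move: restart with probability `restart`, else greedily take the
    neighbour with the lowest *true* loss (gradients only propose candidates)."""
    if rng.random() < restart:
        return theta  # placeholder: a real implementation would re-initialize
    return min(neighbours(theta, surrogate, eta), key=true_loss)
```

Note how the surrogate only shapes the candidate set, while the selection among candidates uses the true loss, mirroring the two-level design described above.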
3.3 Missle
The pseudocode for missle is given in Algorithm 1. It starts with an initial candidate that is generated by first computing the convex hull of the positive points in the data, then randomly picking $m$ sides of this convex hull to define the initial feasible region, and finally choosing a random vector as the direction of optimisation. The rest of the algorithm simply iterates through the neighbourhood, picking the best candidate in each iteration, while randomly restarting with probability $p$, which helps the algorithm avoid local minima.

4 Experiments
We answer empirically the following research questions:

Q1: How does the performance change over time?

Q2: Does missle learn high quality models?

Q3: How does missle compare to pure SGD and SLS approaches?
To answer these questions, we used missle to recover a set of synthetic MILPs from contextual examples sampled from them, and compared its hybrid search strategy to two natural baselines: stochastic local search (SLS) and stochastic gradient descent (SGD). The SGD baseline can be seen as an extension of the ideas presented by Paulus et al. [17] to our setting.

4.1 Experiment Description
To generate the training data, we first randomly generated 5 different ground-truth models, each with 5 variables and 5 hard constraints. For each model, a dataset was collected by first sampling 250 random contexts and then taking 1 positive and 2 negative examples from each context. For each model, we generated 5 different sets of data by changing the random seed.
The quality of the learned model was captured by measuring recall, precision, infeasibility, and regret. Recall tells us what percentage of the true feasible region is covered by the learned model, while precision tells us what part of the learned feasible region is actually correct. Infeasibility gives the percentage of optimal solutions of the learned model that are infeasible for the ground-truth model, while regret measures the quality of those optima as defined in Equation 3.1. Naturally, a better model is characterised by higher recall and precision and lower infeasibility and regret.
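As a simplified sketch of the feasibility metrics, recall and precision can be estimated by comparing the two feasible regions on a finite grid of candidate points (the paper's exact evaluation protocol may differ, and the helper names are our own):

```python
def satisfies(A, b, x):
    """Check Ax <= b."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bj + 1e-9
               for row, bj in zip(A, b))

def feasibility_metrics(true_model, learned_model, grid):
    """Estimate recall (share of the true feasible region covered by the learned
    model) and precision (share of the learned feasible region that is truly
    feasible) on a finite grid of points. Models are (c, A, b) triples."""
    _, A_t, b_t = true_model
    _, A_l, b_l = learned_model
    true_feas = [x for x in grid if satisfies(A_t, b_t, x)]
    learned_feas = [x for x in grid if satisfies(A_l, b_l, x)]
    both = [x for x in true_feas if satisfies(A_l, b_l, x)]
    recall = len(both) / len(true_feas) if true_feas else 1.0
    precision = len(both) / len(learned_feas) if learned_feas else 1.0
    return recall, precision
```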
4.2 Results and Discussion
The results are shown in Figure 2. The x-axis in each plot represents the cutoff time, i.e., the maximum time an algorithm is allowed to run before producing a learnt model. The y-axis for regret is logarithmic to make the differences clearer. As expected, increasing the cutoff time leads to better results, especially in terms of infeasibility and regret, so Q1 can be answered in the affirmative. With a cutoff time of 60 minutes, missle achieves high recall and precision together with low infeasibility and regret, answering Q2 in the affirmative as well. To answer Q3, we compared missle with the two baselines based on SGD and SLS; the comparison clearly shows that missle outperforms both. The only exception is that SLS performs better in terms of infeasibility when the cutoff time is low; however, this is because it learns a very restrictive model, as suggested by its low recall. SLS also incurs more than three times the regret attained by missle. Hence, combining SLS and SGD in missle leads to better performance than using either individually.
5 Related Work
Our work is motivated by advancements in constraint learning [8] and in particular by Kumar et al. [14], who introduced the problem of learning MAXSAT models from contextual examples. Our work extends these ideas to continuous–discrete optimization. Kumar et al. cast the learning problem itself as a MILP. This strategy, while principled, can be computationally challenging. In contrast, missle leverages a hybrid strategy that combines elements of combinatorial and gradient-based optimization. One advantage is that missle naturally supports anytime learning.
Typical constraint learning methods [8] – like ConAcq [6] and ModelSeeker [3] – focus on acquiring constraints in higher-level languages than MILP, but lack support for cost functions. Those approaches that do support learning soft constraints require the candidate constraints to be enumerated upfront, which is not possible in our discrete–continuous setting [19].
Most approaches to learning MILPs from data either learn the cost function or the hard constraints, but not both. Pawlak and Krawiec [18] acquire the feasibility polytope from positive and negative examples by encoding the learning problem itself as a MILP. Schede et al. [20] acquire the feasibility polytope using a similar strategy, but – building on work on learning Satisfiability Modulo Theory formulas [13] and decision trees [5], and on syntax-guided synthesis [1] – implement an incremental learning loop that achieves much improved efficiency. Approaches for learning polyhedral classifiers find a convex polytope that separates positive and negative examples, often by a large margin [2, 15, 12, 10], but do not learn a cost function. This is not a detail: negative examples in their classification setting are known to be infeasible, while in ours they can be either infeasible or suboptimal. This introduces a credit attribution problem that these approaches are not designed to handle.

There are two notable exceptions. One is the work of Tan et al. [23], which acquires linear programs using stochastic gradient descent (SGD). Their technique, however, requires differentiating through the solver, and this cannot be done for MILPs. A second one is the work of Paulus et al. [17], which learns integer programs from examples using SGD and a surrogate loss. missle uses a similar surrogate loss that, however, explicitly supports negative examples, as discussed in Section 3. Another major difference is that – as shown by our experiments – our hybrid optimization strategy outperforms pure SGD.
Most importantly, none of the works mentioned above support contextual examples. Because of this, they are bound to learn subpar MILPs when applied to contextual data.
6 Conclusion
We introduced the problem of learning MILPs from contextual examples as well as missle, a practical approach that combines stochastic local search with gradientbased guidance. Our preliminary evaluation shows that missle outperforms two natural competitors in terms of scalability and model quality.
This work can be improved and extended in several directions. First and foremost, a more extensive evaluation is needed. Second, missle could be sped up by integrating the incremental learning strategy of [20, 13], in which a model is learned from progressively larger subsets of examples, stopping whenever the improvement in score saturates. Third, it would be interesting to study the respective contributions of positive and negative examples to the learning process: we believe positive examples carry much more information than negative ones, so it might be beneficial to learn from positive examples only, although this requires more extensive experiments. Finally, meta-information about which negatives are infeasible and which are suboptimal is often unavailable, and missle therefore does not use it; we plan to develop an algorithm that can exploit this extra information when present and to measure its impact on performance.
Acknowledgments
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. [694980] SYNTH: Synthesising Inductive Data Models). The research of Stefano Teso was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.
References
 [1] (2018) Searchbased program synthesis. Communications of the ACM 61 (12), pp. 84–93. Cited by: §5.
 [2] (2002) Polyhedral separability through successive LP. Journal of Optimization theory and applications 112 (2). Cited by: §5.
 [3] (2012) A Model Seeker: Extracting global constraint models from positive examples. In International Conference on Principles and Practice of Constraint Programming, Cited by: §5.
 [4] (2003) On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences 66 (3), pp. 496–514. Cited by: §2.3.
 [5] (2017) Optimal classification trees. Machine Learning. Cited by: §5.
 [6] (2017) Constraint acquisition. Artificial Intelligence 244, pp. 315–342. Cited by: §5.
 [7] (2021) User’s Manual for CPLEX. Cited by: §1.
 [8] (2018) Learning constraints from examples. In AAAI, Cited by: §1, §5, §5.
 [9] (1985) An introduction to timetabling. European Journal of Operational Research 19 (2), pp. 151–162. External Links: ISSN 03772217, Document, Link Cited by: §1.
 [10] (2018) Learning convex polytopes with margin. In NeurIPS, Cited by: §5.
 [11] (2021) Gurobi optimizer reference manual. External Links: Link Cited by: §1.
 [12] (2014) Largemargin convex polytope machine. NeurIPS. Cited by: §5.
 [13] (2018) Learning SMT(LRA) Constraints using SMT Solvers. In IJCAI, Cited by: §5, §6.
 [14] (2020) Learning MAXSAT from Contextual Examples for Combinatorial Optimisation. In AAAI, Cited by: §1, §2.2, §2.3, §5.
 [15] (2010) Learning polyhedral classifiers using logistic function. In ACML, Cited by: §5.
 [16] (2010) Traveling salesman problem: an overview of applications, formulations, and solution approaches. In Traveling Salesman Problem, D. Davendra (Ed.), External Links: Document, Link Cited by: §1.
 [17] (2020) Fit the Right NPHard Problem: Endtoend Learning of Integer Programming Constraints. In Learning Meets Combinatorial Algorithms Workshop at NeurIPS, Cited by: §3.1, §5.
 [18] (2017) Automatic synthesis of constraints from examples using mixed integer linear programming. EJOR. Cited by: §5.
 [19] (2004) Acquiring both constraint and solution preferences in interactive constraint systems. Constraints. Cited by: §5.
 [20] (2019) Learning linear programs from data. In ICTAI, Cited by: §5, §6.

 [21] (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. Cited by: §3.1.
 [22] (2013) Nurse rostering: a complex example of personnel scheduling with perspectives. In Automated Scheduling and Planning, pp. 129–153. Cited by: §1.
 [23] (2020) Learning linear programs from optimal decisions. In NeurIPS, Cited by: §5.
 [24] (2015) Mixed integer linear programming formulation techniques. Siam Review 57 (1), pp. 3–57. Cited by: §2.1.