Learning Mixed-Integer Linear Programs from Contextual Examples

by Mohit Kumar, et al.
Università di Trento

Mixed-integer linear programs (MILPs) are widely used in artificial intelligence and operations research to model complex decision problems like scheduling and routing. Designing such programs, however, requires both domain and modelling expertise. In this paper, we study the problem of acquiring MILPs from contextual examples, a novel and realistic setting in which examples capture solutions and non-solutions within a specific context. The resulting learning problem involves acquiring continuous parameters – namely, a cost vector and a feasibility polytope – but has a distinctly combinatorial flavor. To solve this complex problem, we also contribute MISSLE, an algorithm for learning MILPs from contextual examples. MISSLE uses a variant of stochastic local search that is guided by the gradient of a continuous surrogate loss function. Our empirical evaluation on synthetic data shows that MISSLE acquires better MILPs faster than alternatives based on stochastic local search and gradient descent.



1 Introduction

Mixed-integer linear programs (MILPs) are widely used in operations research to model decision problems like personnel rostering [22], timetabling [9], and routing [16], among many others. A MILP is a constrained optimization program of the form:

$$\max_{x} \; c^\top x \tag{1}$$
$$\text{s.t.} \; A x \le b \tag{2}$$

over variables $x$. Like in regular linear programs, the objective function is linear with parameter $c$ (Eq. 1) and the feasible set is a polytope defined by a coefficient matrix $A$ and a bias vector $b$ (Eq. 2). What makes MILPs special is that some of the variables are restricted to be integers. This allows MILPs to capture numerical optimization problems with a strong combinatorial component. For instance, integer variables can be used to decide whether an action should be taken or not, like taking a particular route, buying a machine, or assigning a nurse to a shift. While MILPs are NP-hard to solve in general, practical solvers like Gurobi [11] and CPLEX [7] can readily handle large instances, making MILPs the formalism of choice in many scientific and industrial applications.
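To make the program form above concrete, the following minimal sketch solves a tiny pure-integer instance by brute-force enumeration. The helper name `solve_tiny_ilp`, the toy numbers, the bounded integer grid, and the choice of a maximisation objective are illustrative assumptions rather than anything taken from the paper; realistic instances would be handed to a solver such as Gurobi or CPLEX.

```python
from itertools import product


def solve_tiny_ilp(c, A, b, bounds):
    """Maximise c.x subject to A x <= b over the integer points in `bounds`.

    Brute-force enumeration, only meant for toy instances.
    """
    best_x, best_val = None, float("-inf")
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    for x in product(*ranges):
        # Check every row of A x <= b.
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        val = sum(c_j * x_j for c_j, x_j in zip(c, x))
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical toy instance (illustrative numbers only).
c = [3, 2]              # cost vector
A = [[1, 1], [2, 1]]    # coefficient matrix
b = [4, 6]              # bias vector
x_opt, value = solve_tiny_ilp(c, A, b, bounds=[(0, 5), (0, 5)])
print(x_opt, value)  # (2, 2) 10
```

Enumeration is exponential in the number of variables, which is one way to see why the NP-hardness mentioned above matters in practice.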

Designing MILPs, however, requires substantial modelling expertise, making it hard for non-experts to take advantage of this powerful framework. One option is then to use constraint learning [8] to induce MILPs from historical data, e.g., examples of high-quality solutions (positive examples) and sub-par or infeasible configurations (negative examples). A major difficulty is that in practice historical data is collected under the effects of temporary restrictions and resource limitations. For instance, in a scheduling problem some workers may be temporarily unavailable due to sickness or parental leave. We use the term “context” to indicate such temporary constraints. Crucially, by restricting the feasible space, contexts can substantially alter the set of optimal solutions, biasing the data, cf. Figure 1. Learning algorithms that ignore the contextual nature of their supervision are bound to perform poorly [14]. Existing approaches for learning MILPs suffer from this issue.

The aim of this work is to amend this situation. In particular, we formalize contextual learning of MILPs and show that the resulting learning problem contains both continuous and combinatorial elements that make it challenging for standard techniques. To solve this issue, we introduce missle, a novel approach that combines ideas from combinatorial search and gradient-based optimization. In particular, missle relaxes the original loss to obtain a natural, smoother surrogate on which gradient information can be readily computed, and then uses the latter to guide a stochastic local search procedure. Our empirical evaluation on synthetic data shows that missle performs better than two natural competitors – namely, pure gradient descent and pure stochastic local search – in terms of quality of the acquired MILPs and computational efficiency.

2 Learning MILPs from Contextual Data

Our goal is to acquire a MILP – and specifically its parameters $c$, $A$, and $b$ – from examples of contextual solutions and non-solutions. Contexts can alter the set of solutions of a MILP: a configuration that is optimal in context $\psi$ might be arbitrarily sub-optimal or even infeasible in a different context $\psi'$, cf. Figure 1. To see this, consider the following example:

Example 1.

A company makes two products ( and ) using two machines ( and ). Producing one unit of requires minutes of processing time on machine and minutes on machine , while each unit of takes minutes on and minutes on . Both machines can be run for a maximum of 4 hours every day. The company makes a profit of on each unit of and on each unit of . The aim is to decide the number of units to produce for each product so that the profit is maximised. This problem can be written as a MILP:

The optimal solution for this problem is , giving a total profit of . Now, consider a temporary reduction in demand for and hence the company wants to produce maximum units every day. This context can be added as a hard constraint: , leading to a different optimal solution, namely . Notice that the solution without context is infeasible in this context.

We define the problem of learning MILPs from contextual examples as follows:

Definition 1.

Given a data set $S = \{(\psi_k, x_k, y_k)\}$ of context-specific examples, where $y_k$ is $1$ iff $x_k$ is a high-quality solution for context $\psi_k$, find a MILP with parameters $(c, A, b)$ that can be used to obtain high-quality configurations in other contexts of interest.

By “high-quality” configurations, we mean configurations that are feasible and considered (close to) optimal. Going forward we will use $M$ to indicate the MILP defined by $(c, A, b)$.

Figure 1: Toy MILP and two contexts. Left: MILP over two integer variables $x_1$ and $x_2$. The feasible polytope is in gray, the optimization direction goes top-right. The optimum is in green. Middle: Once a context (blue) is added, the optimum changes substantially (red). Right: same for a different context. A MILP learned using the three optima as positive examples while ignoring their contextual nature would be forced to consider all points in the plane as having the same cost (i.e., $c = 0$), which is clearly undesirable.

2.1 Contexts

In the following, we will restrict ourselves to handling contexts that impose additional linear constraints on the problem. This choice is both convenient and flexible. It is convenient because injecting a linear context $\psi$ into a MILP with parameters $(c, A, b)$ gives another MILP, namely:

$$\max_{x} \; c^\top x \quad \text{s.t.} \quad A x \le b, \;\; x \in \psi$$

It is also very flexible: the polytope $\psi$ can naturally encode frequently used constraints like partial assignments (which fix the value of a subset of the variables) and bounding-box constraints (which restrict the domain of a subset of the variables to a particular range). Furthermore, it is often possible to encode complex non-linear and logical constraints into this form using linearization techniques [24].
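As an illustration of how a linear context acts, the sketch below appends a context to the hard constraints of a toy integer program and re-solves it by enumeration. All numbers and helper names (`dot`, `feasible`, `argmax_ilp`) are hypothetical, and a maximisation objective is assumed; the point is only that the context-free optimum can become infeasible once the context is injected.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feasible(x, constraints):
    """True if row.x <= rhs holds for every (row, rhs) pair."""
    return all(dot(row, x) <= rhs for row, rhs in constraints)


def argmax_ilp(c, constraints, bounds):
    """Brute-force maximiser of c.x over the feasible integer points in `bounds`."""
    best_x, best_val = None, float("-inf")
    for x in product(*(range(lo, hi + 1) for lo, hi in bounds)):
        if feasible(x, constraints):
            val = dot(c, x)
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val


c = [3, 2]
base = [([1, 1], 4), ([2, 1], 6)]   # hard constraints A x <= b
context = [([1, 0], 1)]             # hypothetical context: x1 <= 1
bounds = [(0, 5), (0, 5)]

x_free, v_free = argmax_ilp(c, base, bounds)
x_ctx, v_ctx = argmax_ilp(c, base + context, bounds)
print(x_free, v_free)                      # (2, 2) 10
print(x_ctx, v_ctx)                        # (1, 3) 9
print(feasible(x_free, base + context))    # False
```

Because the context is just a list of extra rows, partial assignments and bounding boxes can be expressed the same way, for example by pairing `x1 <= 1` with `-x1 <= -1` to fix `x1 = 1`.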

2.2 An initial formulation

Learning a MILP amounts to searching for a program that fits the data well. Naturally, ignoring the contextual nature of the data – for instance, by treating all positive examples as if they were global, rather than contextual, optima – can dramatically reduce the quality of the learned program [14], cf. Figure 1.

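Given a MILP $M$ and a context $\psi$, we define $f_M(\psi, x)$ to be a binary classifier that labels configuration $x$ as positive if and only if it is a solution to $M$ in context $\psi$:

$$f_M(\psi, x) = \mathbb{1}\{x \in X^*_\psi(M)\}$$

Here $\mathbb{1}\{\cdot\}$ is the indicator function that evaluates to $1$ if its argument holds and to $0$ otherwise, and $X^*_\psi(M)$ is the set of contextual solutions of $M$ in context $\psi$. Given a dataset $S$ of contextual examples $(\psi_k, x_k, y_k)$, we propose to learn a MILP by minimizing the following 0-1 loss:

$$L(M, S) = \frac{1}{|S|} \sum_{k} \mathbb{1}\{f_M(\psi_k, x_k) \neq y_k\}$$

This amounts to looking for a MILP that minimizes the number of misclassified training examples.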
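The classifier and the 0-1 loss can be sketched as follows for toy integer programs small enough to solve by enumeration. The representation of a model as a `(c, constraints)` pair, the helper names, the normalisation by the data set size, and the example data are assumptions made for illustration; maximisation is assumed as well.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feasible(x, constraints):
    return all(dot(row, x) <= rhs for row, rhs in constraints)


def best_value(c, constraints, bounds):
    """Optimal objective value over feasible integer points (-inf if none)."""
    values = [dot(c, x)
              for x in product(*(range(lo, hi + 1) for lo, hi in bounds))
              if feasible(x, constraints)]
    return max(values, default=float("-inf"))


def predict(model, context, x, bounds):
    """1 iff x is feasible and optimal for the model within the context."""
    c, constraints = model
    combined = constraints + context
    if not feasible(x, combined):
        return 0
    return int(dot(c, x) >= best_value(c, combined, bounds))


def zero_one_loss(model, data, bounds):
    """Fraction of contextual examples (context, x, y) that are misclassified."""
    errors = sum(predict(model, ctx, x, bounds) != y for ctx, x, y in data)
    return errors / len(data)


bounds = [(0, 5), (0, 5)]
hard = [([1, 1], 4), ([2, 1], 6)]
cap = [([1, 0], 1)]                      # hypothetical context: x1 <= 1
data = [([], (2, 2), 1), (cap, (1, 3), 1), (cap, (2, 2), 0), ([], (1, 3), 0)]

loss_true = zero_one_loss(([3, 2], hard), data, bounds)
loss_wrong = zero_one_loss(([1, 0], hard), data, bounds)
print(loss_true, loss_wrong)  # 0.0 0.25
```

Note that every call to `predict` solves one optimisation problem, which is the evaluation cost discussed in the next section.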

2.3 Advantages and Limitations

The above strategy is motivated by the work of Kumar et al. [14], who have shown that, under suitable assumptions, classifiers that make few mistakes on the training set correspond to programs that output high-quality configurations in (even unobserved) target contexts. (These results were obtained for MAX-SAT programs but do carry over to general constraint optimization programs.)

A downside of this formulation is that it is computationally challenging. Solving the above optimization problem is equivalent to minimizing the 0-1 loss on the training set, which is known to be NP-hard [4]. Furthermore, the objective function is piece-wise constant: there exist infinitely many pairs of MILP models that have the same set of solutions and that therefore lie on the same plateau of the empirical loss. This makes it hard to apply optimization procedures based on pure gradient descent, as the gradient is zero almost everywhere. Standard local search is also not very effective, as it proceeds by making small uninformed steps and therefore might spend a long time trapped on a plateau. A second issue is that evaluating the loss is hard: checking whether an instance $x$ is predicted as positive in context $\psi$ involves finding an optimum $x^*$ of the candidate MILP in that context and ensuring that it has the same value as $x$. Computing $x^*$ requires solving the candidate MILP in context $\psi$, which is NP-hard in general. This step needs to be carried out repeatedly when minimizing Eq. 2.2, so it is important to keep the number of optimization steps at a minimum.

3 Learning MILPs with MISSLE

Due to the difficulty of optimizing Eq. 2.2 directly, we employ a smoother surrogate loss that offers gradient information and use it to guide a stochastic local search procedure.

3.1 A Surrogate Loss

We build a surrogate loss by enumerating the various ways in which the learned MILP can mislabel a contextual example and defining a loss for each case. Below, we write $d(x; A_i, b_i)$ to indicate the Euclidean distance between $x$ and the $i$-th hyperplane $A_i x = b_i$ of the feasible polytope [21]:

$$d(x; A_i, b_i) = \frac{|A_i x - b_i|}{\|A_i\|_2}$$

Additionally, we write $r_\psi(x)$ to indicate the regret of $x$ in context $\psi$, which measures the difference in quality between $x$ and a truly optimal configuration $x^*_\psi$ according to the learned cost vector $c$:

$$r_\psi(x) = c^\top x^*_\psi - c^\top x$$
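The two building blocks of the surrogate loss, the point-to-hyperplane distance and the regret, can be sketched as below. The function names are hypothetical, the distance is the standard unsigned Euclidean one, and the regret is computed by enumeration under an assumed maximisation objective and without any normalisation; the paper's exact definitions may differ in such details.

```python
import math
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hyperplane_distance(x, a, b):
    """Euclidean distance from point x to the hyperplane {z : a.z = b}."""
    return abs(dot(a, x) - b) / math.sqrt(dot(a, a))


def regret(x, c, constraints, bounds):
    """Gap c.x* - c.x, where x* maximises c over the feasible integer points."""
    best = max(
        dot(c, z)
        for z in product(*(range(lo, hi + 1) for lo, hi in bounds))
        if all(dot(row, z) <= rhs for row, rhs in constraints)
    )
    return best - dot(c, x)


constraints = [([1, 1], 4), ([2, 1], 6)]   # toy hard constraints plus context
bounds = [(0, 5), (0, 5)]

print(hyperplane_distance((0, 0), [3, 4], 10))           # 2.0
print(hyperplane_distance((2, 2), [1, 1], 4))            # 0.0
print(regret((1, 3), [3, 2], constraints, bounds))       # 1
```

A configuration with zero regret that also satisfies every constraint is exactly what the classifier of Section 2.2 labels as positive.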

In the remainder of this section, let $(\psi, x, y)$ be a misclassified example. If it is positive (that is, $y = 1$), then it is predicted as either infeasible or suboptimal by the learned MILP. If it is feasible, then it is sufficient to make it optimal. This can be achieved by either increasing the estimated cost of $x$, by excluding the actual optimum $x^*$ from the feasible region, or by bringing $x$ closer to the boundary. This leads to the following loss:

where the $\min$ operations select the closest hyperplane.

On the other hand, if $x$ is positive but infeasible, then we have to enlarge the feasible region to make it feasible. This is accomplished by penalizing the distance between $x$ and all the hyperplanes that exclude $x$ from the feasible region using the following loss:

Finally, if $x$ is negative ($y = 0$) – but classified as positive – then we want to make it either sub-optimal or infeasible. This can be achieved by increasing its cost or by adapting the feasible region so that it excludes this example, giving:

Note that we use exponential just to make sure that the loss is lower bounded by 0. The full surrogate loss is obtained by combining the various cases, as follows:

This surrogate loss is inspired by, but not identical to, the one given by Paulus et al. [17]. There are three core differences. First, Eq. 3.1 includes a term whose aim is to move the boundaries closer to $x$, as the optimal solutions of a MILP lie close to the boundary. Second, our surrogate loss includes a term for misclassified negative examples, integrated through Eq. 3.1. Finally, we integrate the contextual information in the surrogate loss by defining $x^*$ as an optimal solution in the context $\psi$. Apart from these technical differences, another major difference lies in the setting in which we use this loss function. Paulus et al. focus on end-to-end learning, for instance learning constraints from given textual descriptions, whereas our focus is on using this loss to learn MILPs from contextual examples.

3.2 Searching for a MILP

The goal of missle is to find a program that minimizes the 0-1 loss defined in Eq. 2.2. The major barrier here is the continuous search space. To tackle this, we propose a variant of stochastic local search (SLS) guided by the gradient of the surrogate loss defined above.

The backbone of all SLS strategies is a heuristic search procedure that iteratively picks a promising candidate in the neighborhood $\mathcal{N}(M)$ of the current candidate $M$, while injecting randomness in the search to escape local optima. The challenge is how to define a neighborhood that contains promising candidates and allows the search algorithm to quickly reach the low-loss regions of the search space. To this end, we exploit the fact that the surrogate loss is differentiable w.r.t. $c$, $A$, and $b$ to design a small set of promising neighbors, as follows. Writing $\tilde{L}$ for the surrogate loss, each $M$ has three neighbours defined by the following moves:

  • Perturb the cost vector $c$ by taking a single gradient descent step of length $\eta$ while leaving $A$ and $b$ untouched: $c \leftarrow c - \eta \nabla_c \tilde{L}$.

  • Similarly, rotate the hard constraints by performing gradient descent w.r.t. $A$: $A \leftarrow A - \eta \nabla_A \tilde{L}$.

  • Translate the hard constraints by updating the offsets $b$: $b \leftarrow b - \eta \nabla_b \tilde{L}$.

In each iteration of the SLS procedure, the next hypothesis is chosen greedily, picking the candidate with minimal true (0-1) loss. Notice that this is different from the gradient step, which is based on the surrogate loss instead. The intuition is twofold. On the one hand, the gradient is cheap to compute and points towards higher-quality candidates, while on the other the SLS procedure picks among the candidates using the true loss, bypassing any approximations induced by the surrogate loss. Another advantage of this solution is that it naturally supports exploring continuous search spaces, whereas most SLS strategies focus on discrete ones.
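The three moves and the greedy selection can be sketched as follows. The sketch assumes that the gradient of the surrogate loss is supplied by the caller as a `(grad_c, grad_A, grad_b)` triple (for example from automatic differentiation) and that each move is a plain gradient descent step; the function names, the tuple representation of a model, and the toy numbers are hypothetical.

```python
def neighbours(model, grad, eta):
    """Return the three neighbours of model = (c, A, b) for step length eta.

    Each neighbour updates exactly one parameter block with one gradient
    descent step on the surrogate loss and leaves the other two untouched.
    """
    c, A, b = model
    g_c, g_A, g_b = grad
    new_c = [ci - eta * gi for ci, gi in zip(c, g_c)]
    new_A = [[aij - eta * gij for aij, gij in zip(row, g_row)]
             for row, g_row in zip(A, g_A)]
    new_b = [bi - eta * gi for bi, gi in zip(b, g_b)]
    return [(new_c, A, b), (c, new_A, b), (c, A, new_b)]


def greedy_step(model, grad, eta, true_loss):
    """Pick the neighbour with minimal true loss (not surrogate loss)."""
    return min(neighbours(model, grad, eta), key=true_loss)


# Toy usage with a made-up gradient and a stand-in for the true loss.
model = ([1.0, 0.0], [[1.0, 0.0]], [2.0])
grad = ([0.5, -0.5], [[1.0, 0.0]], [-1.0])


def toy_true_loss(m):
    # Hypothetical stand-in: pretend the loss only depends on the offset.
    return abs(m[2][0] - 3.0)


chosen = greedy_step(model, grad, eta=0.1, true_loss=toy_true_loss)
print(chosen)
```

Keeping the selection criterion separate from the gradient mirrors the design described above: the surrogate only proposes candidates, while the true loss decides among them.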

We make use of two other minor refinements. First, the learning rate $\eta$ is adapted during the search: it is set close to 1 when the loss is high and slowly decreased towards 0 as the loss decreases. The idea is to take larger steps so as to quickly escape high-loss regions while carefully exploring the more promising areas. Second, we normalize both the direction of optimisation $c$ and the hyperplanes $A$ and $b$ after each step, thus regularizing the learning problem.

3.3 Missle

The pseudo-code for missle is given in Algorithm 1. It starts with an initial candidate, which is generated by first finding the convex hull of the positive points in the data and then randomly picking $m$ sides of this convex hull to define the initial feasible region, where $m$ is the maximum number of constraints, while a random vector is chosen to be the direction of optimisation. The rest of the code simply iterates through the neighbourhood, picking the best candidate in each iteration, while randomly restarting with probability $p$, which helps the algorithm escape local minima.

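M ← initial candidate built from the convex hull of the positive examples; M* ← M
while elapsed time < cutoff and accuracy of M* < max accuracy do
     with probability p do
        M ← random restart
     otherwise
        M ← best neighbour of M w.r.t. the true loss
     if M has lower true loss than M* then
        M* ← M      (track best-so-far)
return M*
Algorithm 1 The missle algorithm. Inputs: data set $S$, max constraints $m$, learning rate $\eta$, restart prob. $p$, cutoff, max accuracy.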
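The overall loop can be sketched generically as below. This is a reading of the textual description rather than the authors' implementation: the function `guided_sls`, its parameters, and the one-dimensional toy problem are hypothetical, the neighbourhood and loss are passed in as callables, and the "max accuracy" stopping rule is expressed as a target loss.

```python
import random
import time


def guided_sls(init, neighbours, true_loss, random_model,
               p_restart, cutoff_s, target_loss=0.0, seed=0):
    """Stochastic local search with greedy moves and random restarts.

    neighbours(model)  -> list of candidate models (e.g. gradient-based moves)
    true_loss(model)   -> loss used to select and to track the best model
    random_model(rng)  -> fresh candidate used when restarting
    """
    rng = random.Random(seed)
    current = init
    best, best_loss = current, true_loss(current)
    deadline = time.monotonic() + cutoff_s
    while time.monotonic() < deadline and best_loss > target_loss:
        if rng.random() < p_restart:
            current = random_model(rng)                        # random restart
        else:
            current = min(neighbours(current), key=true_loss)  # greedy move
        loss = true_loss(current)
        if loss < best_loss:                                   # track best-so-far
            best, best_loss = current, loss
    return best, best_loss


# Toy usage: walk along the integers towards 7, with restarts disabled.
best, loss = guided_sls(
    init=0,
    neighbours=lambda m: [m - 1, m + 1],
    true_loss=lambda m: abs(m - 7),
    random_model=lambda rng: rng.randint(-10, 10),
    p_restart=0.0,
    cutoff_s=5.0,
)
print(best, loss)  # 7 0
```

Because the best-so-far model is updated at every iteration, the loop can be interrupted at any cutoff and still return a usable model.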

4 Experiments

We answer empirically the following research questions:

  • Q1: How does the performance change over time?

  • Q2: Does missle learn high quality models?

  • Q3: How does missle compare to pure SGD and SLS approaches?

To answer these questions, we used missle to recover a set of synthetic MILPs from contextual examples sampled from them and compared its hybrid search strategy to two natural baselines: stochastic local search (SLS) and stochastic gradient descent (SGD). The baseline using SGD can be seen as an extension of the ideas presented by Paulus et al. [17] to learn in our setting.

4.1 Experiment Description

To generate the training data, we first randomly generated 5 different ground-truth models, each with 5 variables and 5 hard constraints. For each ground-truth model, a dataset was collected by first sampling 250 random contexts and then taking 1 positive and 2 negative examples from each context. For each model, we generated 5 different data sets by changing the random seed.

The quality of the learned model was captured by measuring recall, precision, infeasibility, and regret. Recall tells us what percentage of the true feasible region is covered by the learned model, while precision tells us what part of the learned feasible region is actually correct. Infeasibility gives us the percentage of the optimal solutions of the learned model that are infeasible in the ground-truth model, while regret measures the quality of the optima of the learned model, as defined in Equation 3.1. Naturally, a better model is characterised by higher values of recall and precision and lower values of infeasibility and regret.
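Recall and precision of a learned feasible region can be estimated along the following lines. The sketch counts integer grid points inside a bounding box, which is an assumption made here for simplicity (the evaluation protocol may instead sample the region), and the constraint sets and helper names are hypothetical.

```python
from itertools import product


def feasible_points(constraints, bounds):
    """Set of integer points in `bounds` satisfying row.x <= rhs for all rows."""
    return {
        x
        for x in product(*(range(lo, hi + 1) for lo, hi in bounds))
        if all(sum(a * v for a, v in zip(row, x)) <= rhs
               for row, rhs in constraints)
    }


def recall_precision(true_constraints, learned_constraints, bounds):
    """Recall: share of the true region covered by the learned region.
    Precision: share of the learned region that is truly feasible."""
    true_pts = feasible_points(true_constraints, bounds)
    learned_pts = feasible_points(learned_constraints, bounds)
    overlap = len(true_pts & learned_pts)
    recall = overlap / len(true_pts) if true_pts else 0.0
    precision = overlap / len(learned_pts) if learned_pts else 0.0
    return recall, precision


bounds = [(0, 5), (0, 5)]
true_model = [([1, 1], 4)]       # x1 + x2 <= 4
restrictive = [([1, 1], 3)]      # learned model that is too tight

recall, precision = recall_precision(true_model, restrictive, bounds)
print(round(recall, 3), precision)  # 0.667 1.0
```

The toy output shows the pattern discussed in the results: an overly restrictive model keeps perfect precision, and hence few infeasible optima, at the price of low recall.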

Figure 2: Combining gradient and SLS in missle leads to better performance (higher recall and precision, lower regret) compared to using them individually. The only exception is that SLS shows better performance in terms of infeasibility; however, this is because it learns a very restrictive model, as suggested by its low recall.

4.2 Results and Discussion

The results are shown in Figure 2. The x-axis in each plot represents the cutoff time, which is the maximum time an algorithm is allowed to run before producing a learnt model. The y-axis for regret is logarithmic to make the differences clearer. As expected, increasing the cutoff time leads to better results, especially in terms of infeasibility and regret. With a cutoff time of 60 minutes, missle achieves recall and precision with infeasibility and regret , hence Q2 can be answered in the affirmative. To answer Q3, we compared missle with the two baselines based on SGD and SLS; the comparison shows that missle outperforms both baselines. The only exception is that SLS shows better performance in terms of infeasibility when the cutoff time is low; however, this is because it learns a very restrictive model, as suggested by its low recall. SLS also has more than three times the regret attained by missle. Hence, combining SLS and SGD in missle leads to better performance than using each individually.

5 Related Work

Our work is motivated by advancements in constraint learning [8] and in particular by Kumar et al. [14], who introduced the problem of learning MAX-SAT models from contextual examples. Our work extends these ideas to continuous-discrete optimization. Kumar et al. cast the learning problem itself as a MILP. This strategy, while principled, can be computationally challenging. In contrast, missle leverages a hybrid strategy that combines elements of combinatorial and gradient-based optimization. One advantage is that missle naturally supports anytime learning.

Typical constraint learning methods [8] – like ConAcq [6] and ModelSeeker [3] – focus on acquiring constraints in higher-level languages than MILP, but lack support for cost functions. Those approaches that do support learning soft constraints require the candidate constraints to be enumerated upfront, which is not possible in our discrete-continuous setting [19].

Most approaches to learning MILPs from data either learn the cost function or the hard constraints, but not both. Pawlak and Krawiec [18] acquire the feasibility polytope from positive and negative examples by encoding the learning problem itself as a MILP. Schede et al. [20] acquire the feasibility polytope using a similar strategy, but – building on work on learning Satisfiability Modulo Theory formulas [13] and decision trees [5] and on syntax-guided synthesis [1] – implement an incremental learning loop that achieves much improved efficiency. Approaches for learning polyhedral classifiers find a convex polytope that separates positive and negative examples, often by a large margin [2, 15, 12, 10], but do not learn a cost function. This is not a detail: negative examples in their classification setting are known to be infeasible, while in ours they can be either infeasible or suboptimal. This introduces a credit attribution problem that these approaches are not designed to handle.

There are two notable exceptions. One is the work of Tan et al. [23], which acquires linear programs using stochastic gradient descent (SGD). Their technique, however, requires differentiating through the solver, and this cannot be done for MILPs. The second is the work of Paulus et al. [17], which learns integer programs from examples using SGD and a surrogate loss. missle uses a similar surrogate loss that, however, explicitly supports negative examples, as discussed in Section 3. Another major difference is that – as shown by our experiments – our hybrid optimization strategy outperforms pure SGD.

Most importantly, none of the works mentioned above support contextual examples. Because of this, they are bound to learn sub-par MILPs when applied to contextual data.

6 Conclusion

We introduced the problem of learning MILPs from contextual examples as well as missle, a practical approach that combines stochastic local search with gradient-based guidance. Our preliminary evaluation shows that missle outperforms two natural competitors in terms of scalability and model quality.

This work can be improved and extended in several directions. First and foremost, a more extensive evaluation is needed. Second, missle should be sped up by integrating the incremental learning strategy of [20, 13], in which a model is learned from progressively larger subsets of examples, stopping whenever the improvement in score saturates. Third, it would be interesting to study the respective contributions of positive and negative examples to the learning process. We believe that a positive example carries much more information than a negative one, so it might be beneficial to learn from positive examples only; confirming this requires more extensive experiments. Finally, meta-information about which negatives are infeasible and which ones are sub-optimal is often not known, and missle therefore does not use it. When such information is available, however, it could be exploited: we will work on a new algorithm that can utilise it and check the impact on performance.


This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. [694980] SYNTH: Synthesising Inductive Data Models). The research of Stefano Teso was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.


  • [1] R. Alur, R. Singh, D. Fisman, and A. Solar-Lezama (2018) Search-based program synthesis. Communications of the ACM 61 (12), pp. 84–93.
  • [2] A. Astorino and M. Gaudioso (2002) Polyhedral separability through successive LP. Journal of Optimization Theory and Applications 112 (2).
  • [3] N. Beldiceanu and H. Simonis (2012) A Model Seeker: Extracting global constraint models from positive examples. In International Conference on Principles and Practice of Constraint Programming.
  • [4] S. Ben-David, N. Eiron, and P. M. Long (2003) On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences 66 (3), pp. 496–514.
  • [5] D. Bertsimas and J. Dunn (2017) Optimal classification trees. Machine Learning.
  • [6] C. Bessiere, F. Koriche, N. Lazaar, and B. O’Sullivan (2017) Constraint acquisition. Artificial Intelligence 244, pp. 315–342.
  • [7] IBM ILOG CPLEX (2021) User’s Manual for CPLEX.
  • [8] L. De Raedt, A. Passerini, and S. Teso (2018) Learning constraints from examples. In AAAI.
  • [9] D. de Werra (1985) An introduction to timetabling. European Journal of Operational Research 19 (2), pp. 151–162.
  • [10] L. Gottlieb, E. Kaufman, A. Kontorovich, and G. Nivasch (2018) Learning convex polytopes with margin. In NeurIPS.
  • [11] Gurobi Optimization, LLC (2021) Gurobi optimizer reference manual.
  • [12] A. Kantchelian, M. Tschantz, L. Huang, P. Bartlett, A. Joseph, and D. Tygar (2014) Large-margin convex polytope machine. In NeurIPS.
  • [13] S. Kolb, S. Teso, A. Passerini, and L. De Raedt (2018) Learning SMT(LRA) Constraints using SMT Solvers. In IJCAI.
  • [14] M. Kumar, S. Kolb, S. Teso, and L. De Raedt (2020) Learning MAX-SAT from Contextual Examples for Combinatorial Optimisation. In AAAI.
  • [15] N. Manwani and P. Sastry (2010) Learning polyhedral classifiers using logistic function. In ACML.
  • [16] R. Matai, S. Singh, and M. L. Mittal (2010) Traveling salesman problem: an overview of applications, formulations, and solution approaches. In Traveling Salesman Problem, D. Davendra (Ed.).
  • [17] A. Paulus, M. Rolínek, V. Musil, B. Amos, and G. Martius (2020) Fit the Right NP-Hard Problem: End-to-end Learning of Integer Programming Constraints. In Learning Meets Combinatorial Algorithms Workshop at NeurIPS.
  • [18] T. Pawlak and K. Krawiec (2017) Automatic synthesis of constraints from examples using mixed integer linear programming. EJOR.
  • [19] F. Rossi and A. Sperduti (2004) Acquiring both constraint and solution preferences in interactive constraint systems. Constraints.
  • [20] E. A. Schede, S. Kolb, and S. Teso (2019) Learning linear programs from data. In ICTAI.
  • [21] B. Schölkopf and A. J. Smola (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond.
  • [22] P. Smet, P. De Causmaecker, B. Bilgin, and G. V. Berghe (2013) Nurse rostering: a complex example of personnel scheduling with perspectives. In Automated Scheduling and Planning, pp. 129–153.
  • [23] Y. Tan, D. Terekhov, and A. Delong (2020) Learning linear programs from optimal decisions. In NeurIPS.
  • [24] J. P. Vielma (2015) Mixed integer linear programming formulation techniques. SIAM Review 57 (1), pp. 3–57.