
Smart Predict-and-Optimize for Hard Combinatorial Optimization Problems

Combinatorial optimization assumes that all parameters of the optimization problem, e.g. the weights in the objective function, are fixed. Often, these weights are mere estimates, and increasingly machine learning techniques are used for their estimation. Recently, Smart Predict and Optimize (SPO) has been proposed for problems with a linear objective function over the predictions, more specifically linear programming problems. It takes the regret of the predictions on the linear problem into account by repeatedly solving it during learning. We investigate the use of SPO to solve more realistic discrete optimization problems. The main challenge is the repeated solving of the optimization problem. To this end, we investigate ways to relax the problem as well as warmstarting the learning and the solving. Our results show that even for discrete problems it often suffices to train by solving the relaxation in the SPO loss. Furthermore, this approach outperforms, for most instances, the state-of-the-art approach of Wilder, Dilkina, and Tambe. We experiment with weighted knapsack problems as well as complex scheduling problems and show for the first time that a predict-and-optimize approach can successfully be used on large-scale combinatorial optimization problems.






Combinatorial optimization aims to optimize an objective function over a set of feasible solutions defined on a discrete space. Numerous real-life decision-making problems can be formulated as combinatorial optimization problems [13, 21]. In the last decade, the development of time-efficient algorithms for combinatorial optimization problems paved the way for these algorithms to be widely utilized in industry, including, but not limited to, resource allocation [1], efficient energy scheduling [15], price optimization [11], and sales promotion planning [4].

The last decade has, in parallel, witnessed a tremendous growth in machine learning (ML) methods, which can produce very accurate predictions by leveraging historical and contextual data. In real-world applications, not all parameters of an optimization problem are known at the time of execution, and predictive ML models can be used to estimate those parameters from historical data. For instance, Cohen et al. [cohen2017impact] first predict the future demand of products using an ML model and then use the predicted demand to compute the optimal promotion pricing scheme over the products through non-linear integer programming.

When predictive ML is followed by optimization, it is often assumed that improvements in the quality of the predictions (with respect to some suitable evaluation metric) will result in better optimization outcomes. However, ML models make errors, and the impact of prediction errors is not uniform throughout the underlying solution space. For example, overestimating the highest-valued prediction might not change the outcome of a maximization problem, while underestimating it can. Hence, a better prediction model may not ensure a better outcome in the optimization stage. In this regard, Ifrim et al. [ifrim2012properties] observed that a better predictive model does not always translate to optimized energy-saving schedules.

The alternative is to take the effect of the errors on the optimization outcome into account during learning. In the context of linear programming problems, Elmachtoub and Grigas [elmachtoub2017smart] proposed an approach, called “Smart Predict and Optimize” (SPO), for training ML models by minimizing a convex surrogate loss function which considers the outcome of the optimization stage. Specifically, they consider optimization problems where predictions occur as weights that are linear in the objective.

In this work, we build on that approach and consider discrete combinatorial optimization problems. Indeed, the SPO loss is valid for any optimization problem with a linear objective over predictions, where the constraints implicitly define a convex region. Furthermore, any black box optimization method can be used as only its outcome is used to compute the (sub)gradient of the loss.

The main challenge is the computational cost of the repeated solving of the optimization problem during training, namely once for every evaluation of the loss function on an instance. For NP-hard problems, this may quickly become infeasible. In order to scale up to large problem instances, we investigate the importance of finding the optimal discrete solution for learning, showing that continuous relaxations are highly informative for learning. Furthermore, we investigate how to speed up the learning by transfer learning from easier-to-learn models, as well as methods for speeding up the solving by warmstarting from earlier solutions. Our approach outperforms the state-of-the-art Melding approach [23] in most cases, and for the first time we are able to show the applicability of predict-and-optimize on large-scale combinatorial instances, namely from the ICON energy-cost aware scheduling challenge [19].

Related Work

Predict-and-optimize problems arise in many applications. Current practice is to use a two-stage approach where the ML models are trained independently of the optimization problem. As a consequence, the ML models do not account for the optimization tasks [22, 14]. In recent years there has been a growing interest in decision-focused learning [9, 7, 23], which aims to couple ML and decision making.

In the context of portfolio optimization, Bengio [bengio1997using] reports that a deep learning model fails to improve future profit when trained with respect to a standard ML loss function, but that a profit-driven loss function turns out to be effective. Kao et al. [kao2009directed] consider an unconstrained quadratic optimisation problem, where the predicted values appear linearly in the objective. They train a linear model with respect to a combination of prediction error and optimization loss. They do not mention how this can be applied to optimization problems with constraints.

A number of works aim to exploit the fact that the KKT conditions of a quadratic program (QP) define a system of linear equations around the optimal points. For instance, Donti et al. [donti2017task] propose a framework which computes the gradient of the solution of the QP with respect to the predictions by applying the implicit function theorem to differentiate through the KKT conditions around the optimal point. Wilder, Dilkina, and Tambe [wilder2018melding] use the same approach, and propose its use for linear programs by adding a small quadratic term to convert the linear program into a concave QP. They also propose a specialisation of it for submodular functions.

Our work builds on the SPO approach of Elmachtoub and Grigas [elmachtoub2017smart], where the authors provide a framework to train an ML model that learns with respect to the error in the optimization problem. This is investigated for linear optimization problems with a convex feasible region. We will use the approach for discrete combinatorial problems with a linear objective. These are computationally expensive to solve, e.g. often NP-hard. The decision variables and search space of these problems are discrete, meaning gradients cannot be computed in a straightforward manner. However, the SPO approach remains applicable, as we will see.

Demirović et al. [demirovicinvestigation] investigate the prediction+optimisation problem for the knapsack problem, and prove that optimizing over predictions is as valid as stochastic optimisation over learned distributions, in case the predictions are used as weights in a linear objective. They further investigate possible learning approaches, and classify them into three groups: indirect approaches, which do not use knowledge of the optimisation problem; semi-direct approaches, which encode knowledge of the optimisation problem, such as the importance of ranking; and direct approaches, which encode or use the optimisation problem in the learning in some way [7]. Our approach is a direct approach and we examine how to combine the best of such techniques in order to scale to large and hard combinatorial problems.

Problem Formulation and Approach

Optimizing a parameterized problem

Traditional optimization algorithms work under the assumption that all the parameters are known precisely. But in a predict-and-optimize setup we assume some parameters of the optimization problem are not known.

We formalize a combinatorial optimization problem as follows:

$$v^*(\theta) \equiv \arg\min_{v} f(v, \theta) \quad \text{s.t.} \quad C(v, \theta) \qquad (1)$$

where $\theta$ defines the set of parameters (coefficients) of the optimization problem, $v$ are the decision variables, $f(v, \theta)$ is an objective to be minimized, and $C(v, \theta)$ is a (set of) constraints that determine(s) the feasible region; hence $v^*(\theta)$ is an oracle that returns the optimal solution.

Consider the 0-1 knapsack problem, where a set of items, with their values and weights, is provided. The objective is to select a subset of items respecting a capacity constraint on the sum of weights so that the total value of the subset is maximized. The parameter set $\theta$ of the problem consists of the value and weight of each item and the total capacity. The decision variables $v$ consist of one 0-1 decision variable for each item, the objective $f$ is a linear sum of the variables and the item values, and the constraint $C$ describes the capacity constraint.

We decompose $\theta = \theta_s \cup \theta_u$, where $\theta_s$ is the set of parameters that are observed (e.g. the weights and capacity of a knapsack) and $\theta_u$ is the set of unobserved parameters (the values of the knapsack items).

To predict the unobserved parameters $\theta_u$, some attributes correlated with them are observed. We are equipped with a training dataset $\{(x_1, \theta_{u1}), \ldots, (x_n, \theta_{un})\}$, where the $x_i$ are vectors of attributes correlated to $\theta_{ui}$. An ML model $m$ is trained to generate a prediction $\hat{\theta}_u = m(x; \omega)$. The model is characterized by a set of learning parameters $\omega$. E.g. in linear regression, $\omega$ encompasses the slope and the intercept of the regression line. Once the predicted $\hat{\theta}_u$ are obtained, they are used together with the observed parameters $\theta_s$ for the optimization problem.

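To ease notation, we will write $\hat{\theta} = \theta_s \cup \hat{\theta}_u$, containing both the observed parameters and predicted parameters. Recall from Equation (1) that $v^*(\hat{\theta})$ is the optimal solution using parameters $\hat{\theta}$. Then, the objective value of this solution is $f(v^*(\hat{\theta}), \theta)$. Whereas if the actual $\theta$ is known a priori, one could obtain the actual optimal solution $v^*(\theta)$. The difference in using the predicted instead of the actual values is hence measured by the regret

$$\text{regret}(\hat{\theta}, \theta) \equiv f(v^*(\hat{\theta}), \theta) - f(v^*(\theta), \theta) \qquad (2)$$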
Ideally the aim of the training should be to generate predictions which minimize this regret on unseen data.
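As an illustration of the regret definition, the following is a minimal sketch on a toy 0-1 knapsack. The brute-force `knapsack_oracle`, the helper names and the three-item instance are assumptions made for this example, not part of the paper's code. Because knapsack maximizes value, the regret is written here as the optimal value minus the achieved value, which is the sign-flipped form of the minimization regret above.

```python
from itertools import product

def knapsack_oracle(values, weights, capacity):
    """Exhaustive 0-1 knapsack solver (toy sizes only): best feasible selection."""
    best, best_value = None, float("-inf")
    for selection in product((0, 1), repeat=len(values)):
        load = sum(w * s for w, s in zip(weights, selection))
        if load > capacity:
            continue  # violates the capacity constraint
        value = sum(v * s for v, s in zip(values, selection))
        if value > best_value:
            best, best_value = selection, value
    return best

def objective(values, selection):
    """Total value of the selected items."""
    return sum(v * s for v, s in zip(values, selection))

def regret(predicted, actual, weights, capacity):
    """Value lost by optimizing over predictions, measured with the actual values."""
    chosen = knapsack_oracle(predicted, weights, capacity)
    ideal = knapsack_oracle(actual, weights, capacity)
    return objective(actual, ideal) - objective(actual, chosen)

def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical toy instance: three items, capacity 4
actual = [10, 7, 4]
weights = [3, 2, 2]
capacity = 4
close_pred = [10.5, 6, 4]  # small prediction error, but it flips the selected items
rough_pred = [2, 9, 9]     # large prediction error, same selection as the optimum

for name, pred in (("close", close_pred), ("rough", rough_pred)):
    print(name, "MSE:", round(mse(pred, actual), 3),
          "regret:", regret(pred, actual, weights, capacity))
```

In this toy case the prediction with the lower MSE incurs positive regret while the one with the higher MSE incurs none, which is the mismatch between prediction accuracy and decision quality discussed above.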

Two Stage Learning

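We first formalize a two-stage approach, which is widely used in industry and where the prediction and the optimization are performed in a decoupled manner. The predictive model is first trained with the objective of minimizing the expected loss $\mathbb{E}[\mathcal{L}(\theta_u, \hat{\theta}_u)]$ for a suitable choice of loss function $\mathcal{L}$.

For regression, if we use squared-error as the loss function, the model parameters $\omega$ are estimated by minimizing the Mean Squared Error (MSE) on the training data:

$$\min_{\omega} \frac{1}{n} \sum_{i=1}^{n} \left(\theta_{ui} - m(x_i; \omega)\right)^2$$

where $m(x_i; \omega)$ is the model's prediction for the attributes $x_i$ and $\theta_{ui}$ is the corresponding true value.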
Training by gradient descent

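repeat
    Sample $N$ training datapoints
    for $i$ in $1, \ldots, N$ do
        predict $\hat{\theta}_{ui}$ using current $\omega$
        $\omega \leftarrow \omega - \alpha \frac{\partial \mathcal{L}}{\partial \hat{\theta}_{ui}} \frac{\partial \hat{\theta}_{ui}}{\partial \omega}$    // gradient of $\mathcal{L}$
    end for
until convergence
Algorithm 1: Stochastic batch gradient descent for the two-stage learning for regression tasks (batch size $N$, learning rate $\alpha$)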

The process of estimating $\omega$ to minimize the loss function is executed through stochastic gradient descent (SGD) algorithms (there are different variations of SGD; for a detailed discussion, refer to [18]), where at each epoch $\omega$ is updated after calculating the gradient of the loss function with respect to the predictions, as shown in Algorithm 1.

The advantage of the two-stage approach is that training by gradient descent is straightforward. In the prediction stage the objective is to minimize the MSE-loss to generate accurate predictions without considering their impact on the final solution. Clearly this does not require solving the optimization problem during training.
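The first stage of the two-stage approach can be sketched as plain mini-batch gradient descent on the squared error. This is a minimal illustration in pure Python rather than the paper's PyTorch setup; the function name `train_mse`, the single-feature linear model and the noise-free toy data are assumptions made for this example.

```python
import random

def train_mse(xs, ys, lr=0.05, epochs=500, batch_size=4, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on the squared error.
    No optimization problem is solved anywhere in this loop."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                err = (w * x + b) - y               # prediction minus target
                grad_w += 2 * err * x / len(batch)  # d(MSE)/dw via the chain rule
                grad_b += 2 * err / len(batch)      # d(MSE)/db
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3.0 * x + 1.0 for x in xs]
w, b = train_mse(xs, ys)
print("slope:", round(w, 3), "intercept:", round(b, 3))
```

The predictions of such a model would only be passed to the optimization oracle after training has finished, which is what makes this stage cheap.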

Model validation and early stopping with regret

It is common practice to perform model validation on a separate validation set while training the model. Early stopping [3] is the practice of choosing the epoch with the smallest loss on the validation set as the final output of the training procedure. The objective is to avoid overfitting on training data. The performance on the validation set is also used to select hyperparameters of the ML models.

Since the final task in our setup is minimizing the regret, we modify the two-stage learning by measuring regret on the validation set for early stopping and hyperparameter selection. We call this the MSE-r approach. It is more computationally expensive than MSE, given that computing regret on the validation data for every epoch requires solving the optimization problems each time.
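The difference between plain early stopping and the MSE-r variant comes down to which validation metric selects the epoch. A minimal sketch, assuming a hypothetical `history` list with one record per epoch; the numbers below are made-up toy values for illustration, not results from the paper.

```python
def pick_epoch(history, key):
    """Return the index of the epoch whose validation metric `key` is smallest."""
    return min(range(len(history)), key=lambda epoch: history[epoch][key])

# Made-up validation curves: MSE keeps improving while regret gets worse again
history = [
    {"mse": 9.0, "regret": 5.0},
    {"mse": 6.0, "regret": 3.0},
    {"mse": 4.0, "regret": 4.5},
    {"mse": 3.5, "regret": 6.0},
]

mse_epoch = pick_epoch(history, "mse")        # standard early stopping
regret_epoch = pick_epoch(history, "regret")  # MSE-r style selection
print("epoch chosen by MSE:", mse_epoch, "| epoch chosen by regret:", regret_epoch)
```

Filling the `regret` entries is the expensive part, since each one requires solving the validation optimization instances with the current predictions.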

Smart Predict then Optimize (SPO)

The major drawback of the two-stage approach is that it does not aim to minimize the regret, but minimizes the error between the true values $\theta_u$ and the predictions $\hat{\theta}_u$ directly. As Figure 1 shows, minimizing the loss between $\theta_u$ and $\hat{\theta}_u$ does not necessarily result in minimization of the regret. Early stopping with regret can avoid worsening results, but cannot improve the learning. The SPO framework proposed by Elmachtoub and Grigas [elmachtoub2017smart] addresses this by integrating prediction and optimization.

Figure 1: MSE (left axis) versus Regret (right axis) while training a knapsack instance; with no correlation and worsening regret.

Note that to minimize the regret directly we have to find its gradient with respect to the predictions $\hat{\theta}$, which requires differentiating the argmin operator $v^*(\hat{\theta})$ in Eq. (1). This differentiation may not be feasible, as $v^*(\hat{\theta})$ can be discontinuous in $\hat{\theta}$ and exponential in size. Consequently, we cannot train an ML model to minimize the regret through gradient descent.

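The SPO framework integrates the optimization problem into the loop of gradient descent algorithms in a clever way. In their work, Elmachtoub and Grigas [elmachtoub2017smart] consider an optimization problem with a convex feasible region $S$ and a linear objective:

$$v^*(\theta) \equiv \arg\min_{v} \theta^\top v \quad \text{s.t.} \quad v \in S$$

where the cost vector $\theta$ is not known beforehand. Following Eq. (2), the regret for such a problem when using predicted values $\hat{\theta}$ instead of actual $\theta$ is $\theta^\top \left(v^*(\hat{\theta}) - v^*(\theta)\right)$, which as discussed is not differentiable. To make it differentiable, they use a convex surrogate upper bound of the regret function, which they name the SPO+ loss function $\mathcal{L}^+_{SPO}(\hat{\theta}, \theta)$. The gradient of $\mathcal{L}^+_{SPO}$ may not exist as it also involves the argmin operator. However, they have shown that $v^*(\theta) - v^*(2\hat{\theta} - \theta)$ is, up to a constant factor of 2 that can be absorbed into the learning rate, a subgradient of $\mathcal{L}^+_{SPO}$, that is

$$2\left(v^*(\theta) - v^*(2\hat{\theta} - \theta)\right) \in \partial \mathcal{L}^+_{SPO}(\hat{\theta}, \theta)$$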
The subgradient formulation is the key to bringing the optimization problem into the loop of gradient descent, as shown in Algorithm 2.

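repeat
    Sample $N$ training datapoints
    for $i$ in $1, \ldots, N$ do
        predict $\hat{\theta}_i$ using current $\omega$
        compute $v^*(2\hat{\theta}_i - \theta_i)$
        $\omega \leftarrow \omega - \alpha \left(v^*(\theta_i) - v^*(2\hat{\theta}_i - \theta_i)\right) \frac{\partial \hat{\theta}_i}{\partial \omega}$    // sub-gradient
    end for
until convergence
Algorithm 2: Stochastic batch gradient descent for the SPO approach for regression tasks (batch size $N$, learning rate $\alpha$)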

The difference between Algorithm 1 and Algorithm 2 is in their (sub)gradients. In Algorithm 1, the MSE gradient is the signed difference between the actual values and the predicted ones; in Algorithm 2, the SPO subgradient is the difference between an optimization solution obtained using the actual parameter values and another solution obtained using a linear combination of the predicted values and the true values, namely $2\hat{\theta} - \theta$.

For the 0-1 knapsack problem, the solution of a knapsack instance is a 0-1 vector of length equal to the size of the item set, where 1 represents that the corresponding item is selected. In this case, the subgradient is the element-wise difference between the two solutions, and if the solution using the transformed predicted values is the same as the solution using the actual values, all entries of the subgradient are zero. In essence, the non-zero entries in the subgradient indicate places where the two solutions contradict each other.

Note that to compute the subgradient for this SPO approach, the optimization problem $v^*(2\hat{\theta} - \theta)$ needs to be solved for each training instance, while the true-parameter solution $v^*(\theta)$ can be precomputed and cached. Moreover, one training instance typically contains multiple predictions. For example, if we consider a 0-1 knapsack problem with 10 items, then one training instance always contains 10 value predictions, one for each item. Furthermore, the dataset may contain thousands of training instances of 10 values each. Hence, one iteration over the training data (one epoch) requires solving thousands of knapsack problems. As an ML model is trained over several epochs, the training process is clearly computationally expensive.
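The subgradient computation for the knapsack case can be sketched as follows. This is an illustration under stated assumptions: `knapsack_oracle` is a brute-force stand-in for a real solver, the toy instance is hypothetical, and the update is applied to the predictions directly rather than to model parameters (for a model, the subgradient would be multiplied by the derivative of the predictions with respect to $\omega$). Since knapsack maximizes value, the subgradient with respect to the predicted values is the solution for the transformed values minus the solution for the actual values, the sign-flipped form of the cost-minimization expression.

```python
from itertools import product

def knapsack_oracle(values, weights, capacity):
    """Brute-force 0-1 knapsack (toy sizes only); values may be negative."""
    best, best_value = None, float("-inf")
    for selection in product((0, 1), repeat=len(values)):
        if sum(w * s for w, s in zip(weights, selection)) > capacity:
            continue
        value = sum(v * s for v, s in zip(values, selection))
        if value > best_value:
            best, best_value = selection, value
    return list(best)

def spo_subgradient(predicted, actual, weights, capacity):
    """SPO+ subgradient w.r.t. predicted item values (constant factor dropped)."""
    transformed = [2 * p - a for p, a in zip(predicted, actual)]
    sol_transformed = knapsack_oracle(transformed, weights, capacity)
    sol_actual = knapsack_oracle(actual, weights, capacity)  # can be cached
    return [t - a for t, a in zip(sol_transformed, sol_actual)]

# Hypothetical toy instance
actual = [10, 7, 4]
weights = [3, 2, 2]
capacity = 4
predicted = [10.5, 6.0, 4.0]

grad = spo_subgradient(predicted, actual, weights, capacity)
learning_rate = 0.5
updated = [p - learning_rate * g for p, g in zip(predicted, grad)]
grad_after = spo_subgradient(updated, actual, weights, capacity)
print("subgradient:", grad, "updated predictions:", updated, "next subgradient:", grad_after)
```

The non-zero entries mark the items on which the two solutions disagree: the wrongly selected item has its predicted value lowered, and the wrongly omitted items have theirs raised.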

Combinatorial problems and scaling up

We observe that the SPO approach and its corresponding loss function place no restriction on the type of oracle used. Given that our target task is to minimize the regret of the combinatorial problem, an oracle that solves the combinatorial optimisation problem is the most natural choice. We call this approach SPO-full.

Weaker oracles

Repeatedly solving combinatorial problems is computationally expensive. Hence for large and hard problems, it is necessary to look at ways to reduce the solving time.

As there is no restriction on the oracle used, we consider using weaker oracles during training. NP-hard problems that have a polynomial (bounded) approximation algorithm could use the approximation algorithm in the loss as a proxy instead; for example, in the case of knapsack, the greedy algorithm [5]. For mixed integer programming (MIP) formulations, a natural weaker oracle to use is the continuous relaxation of the problem. While disregarding the discrete part, relaxations can often separate the part of the problem that is trivial (variable assignments close to 0 or 1) from the part that is non-trivial. For example, for knapsack, the continuous relaxation leads to very similar solutions compared to the greedy algorithm. Note that we always use the same oracle for both $v^*(\theta)$ and $v^*(2\hat{\theta} - \theta)$ when computing the loss. We call the approach of using the continuous relaxation as oracle SPO-relax.

In case of weak MIP relaxations, one can also use a cutting plane algorithm in the root node and use the resulting tighter relaxation thereof [10]. Other weaker oracles could also be used, for example setting a time-limit on an any-time solver and using the best solution found, or a node-limit on search algorithms. In case of mixed integer programming, we can also set a gap tolerance, which means the solver does not have to prove optimality. We call this SPO-gap. For stability of the learning, it is recommended that the solution returned by the oracle does not vary much when called with (near) identical input.
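For the knapsack case, the continuous relaxation has a well-known closed-form solution: sort items by value-to-weight ratio and fill the capacity, taking a fraction of the last item that does not fit. The sketch below assumes strictly positive weights and a hypothetical toy instance; the function name `knapsack_relaxation` is chosen for this example, and the paper itself solves relaxations with a MIP solver.

```python
def knapsack_relaxation(values, weights, capacity):
    """Solve the LP relaxation of 0-1 knapsack (variables in [0, 1]).
    Assumes all weights are strictly positive."""
    x = [0.0] * len(values)
    remaining = capacity
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    for i in order:
        # Items with non-positive value never help; with positive weights,
        # every later item in this order is non-positive as well.
        if values[i] <= 0 or remaining <= 0:
            break
        take = min(1.0, remaining / weights[i])  # fractional for the last item
        x[i] = take
        remaining -= take * weights[i]
    return x

# Hypothetical toy instance
values = [10, 7, 4]
weights = [3, 2, 2]
capacity = 4
x_relaxed = knapsack_relaxation(values, weights, capacity)
relaxed_value = sum(v * xi for v, xi in zip(values, x_relaxed))
print("relaxed solution:", x_relaxed, "relaxed value:", round(relaxed_value, 3))
```

At most one variable is fractional in this solution, which matches the observation that the relaxation already fixes the trivial part of the problem close to 0 or 1.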

Apart from changing what is being solved, we also investigate ways to warmstart the learning, and to warmstart across solver calls:

Warmstarting the learning

We consider warmstarting the learning by transfer learning [17], that is, to train the model with an easy to compute loss function, and then continue training it with a more difficult one. In our case, we can pre-train the model using MSE as loss, which means the predictions will already be more informed when we start using an SPO loss afterwards.

More elaborate learning schemes are possible, such as curriculum learning [20, 17] where we gradually move from easier to harder to compute loss functions, e.g. by moving from SPO-relax to SPO-gap for decreasing gaps to SPO-full. As we will see later, this is not needed for the cases we studied.

Warmstarting the solving

When computing the loss, we must solve both $v^*(\theta)$, using the true values $\theta$, and $v^*(2\hat{\theta} - \theta)$. Furthermore, we know that an optimal solution to $v^*(\theta)$ is also a valid (but potentially suboptimal) solution to $v^*(2\hat{\theta} - \theta)$, as only the coefficients of the objective differ. Moreover, if this is an optimal solution to the latter then we would achieve 0-regret, hence we can expect the solution of $v^*(\theta)$ to be of decent quality for $v^*(2\hat{\theta} - \theta)$ too.

We hence want to aid the solving of $v^*(2\hat{\theta} - \theta)$ by using the optimal solution of $v^*(\theta)$. One way to do this for CP solvers is solution-guided search [6]. For MIP/LP we can use warmstarting [24, 25], that is, to use the previous solution as a starting point for MIP. In case of linear programming (and hence the relaxation), we can reuse the basis of the solution.

An alternative is to use the true solution to compute a bound on the objective function. Indeed, the solution $v^*(\theta)$ is valid for the problem with parameters $2\hat{\theta} - \theta$, where it has an objective value of $f(v^*(\theta), 2\hat{\theta} - \theta)$. Hence, we can use this value as a bound on the objective and potentially cut away a large part of the search space.

While the true solutions $v^*(\theta)$ can be cached, we must compute this solution once for each training instance $(x_i, \theta_i)$, which may already take significant time for large problems. We observe that only the objective changes between calls to the oracle $v^*$, and hence any previously computed solution is also a candidate solution for the other calls. We can hence use warmstarting for any solver call after the first, and from any previously computed solution so far.
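The reasoning behind warmstarting from the cached true solution can be checked on a toy knapsack: the cached solution stays feasible when only the objective changes, so its value under the transformed objective bounds the new optimum. This sketch uses a brute-force search with an incumbent as a stand-in for a real solver; the helper names and the instance are assumptions for this example, and it does not reproduce any solver-specific warmstart API.

```python
from itertools import product

def is_feasible(selection, weights, capacity):
    return sum(w * s for w, s in zip(weights, selection)) <= capacity

def objective(values, selection):
    return sum(v * s for v, s in zip(values, selection))

def solve_with_incumbent(values, weights, capacity, incumbent):
    """Exhaustive maximization that starts from a known feasible solution."""
    best, best_value = tuple(incumbent), objective(values, incumbent)
    for selection in product((0, 1), repeat=len(values)):
        if not is_feasible(selection, weights, capacity):
            continue
        value = objective(values, selection)
        if value > best_value:  # only strictly better solutions replace the incumbent
            best, best_value = selection, value
    return best, best_value

# Hypothetical toy instance
actual = [10, 7, 4]
weights = [3, 2, 2]
capacity = 4
cached_true_solution = (0, 1, 1)  # optimal for the actual values
predicted = [10.5, 6, 4]
transformed = [2 * p - a for p, a in zip(predicted, actual)]

# The cached solution is still feasible, so its value is a valid bound
bound = objective(transformed, cached_true_solution)
best, best_value = solve_with_incumbent(transformed, weights, capacity,
                                        cached_true_solution)
print("bound from cached solution:", bound, "optimum:", best_value, "solution:", best)
```

In a real solver the same idea would be expressed by passing the cached solution as a start solution or by adding the bound as an objective cut; the experiments later in the text report that the cut variant slowed down solving the relaxation.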

Experimental Evaluation

We consider three types of combinatorial problems: unweighted and weighted knapsack and energy-cost aware scheduling. Below we briefly discuss the problem formulations:

Unweighted/weighted knapsack problem

The knapsack problem can be formalized as $\max_{v} \sum_i p_i v_i$ subject to $\sum_i w_i v_i \le c$ and $v_i \in \{0, 1\}$. The values $p_i$ will be predicted from data, and the weights $w_i$ and capacity $c$ are given. In the unweighted knapsack, all weights are 1 and the problem is polynomial time solvable. Weighted knapsacks are NP-hard and it is known that the computational difficulty increases with the correlation between weights and values [16]. We generated mildly correlated knapsacks as follows: for each of the 48 half-hour slots we assign a weight by sampling from a fixed set of candidate weights; we then include some randomness by adding Gaussian noise to each profit value, and multiply it by its corresponding weight.

In the unweighted case, we consider 9 different capacity values $c$, namely from 5 to 45 increasing by 5. For the weighted knapsack experiment, we consider 7 different capacity values from 30 to 210 increasing by 30.

Energy-cost aware scheduling

This is a resource-constrained job scheduling problem where the goal is to minimize (predicted) energy cost; it is described in CSPLib as problem 059 [19]. In summary, we are given a number of machines and have to schedule a given number of tasks, where each task has a duration, an earliest start and a latest end, a resource requirement and a power usage. Each machine has a resource capacity constraint. We omit startup/shutdown costs. No task is allowed to stretch over midnight between two days, and a task cannot be interrupted once started, nor migrate to another machine. Time is discretized in timeslots and a schedule has to be made over all timeslots at once. For each timeslot, a (predicted) energy cost is given and the objective is to minimize the total energy cost of running the tasks on the machines.

We consider two variants of the problem: easy instances consisting of 30-minute timeslots (i.e. 48 timeslots per day), and hard instances as used in the ICON energy challenge consisting of 5-minute timeslots (288 timeslots per day). The easy instances have 3 machines and respectively 10, 15 and 20 tasks. The hard instances each have 10 machines and 200 tasks.


Our data is drawn from the Irish Single Electricity Market Operator (SEMO) [12]. This dataset consists of historical energy price data at 30-minute intervals starting from midnight on 1st November 2011 and running to 31st December 2013. Each instance of the data has calendar attributes; day-ahead estimates of weather characteristics; SEMO day-ahead forecasted energy-load, wind-energy production and prices; and actual wind-speed, temperature, intensity and price. Of the actual attributes, we keep only the actual price, and use it as the target for prediction. For the hard scheduling instances, each 5-minute timeslot has the same price as the 30-minute timeslot it belongs to, following the ICON challenge.

Experimental setup

For all our experiments, we use a linear model without any hidden layer as the underlying predictive model. Note that the SPO approach is a model-free approach and is compatible with any deep neural network, but earlier work [12] showed that accuracy in predictions is not effective for the downstream optimization task. For the experiments, we divide our data into three sets: training (70%), validation (10%) and test (20%), and evaluate the performance by measuring regret on the test set. Training, validation and test data cover 552, 60 and 177 days of energy data respectively.

Our model is trained by batch gradient descent, where each batch corresponds to one day, that is, 48 consecutive training instances, namely one for each half hour of that day. Together, the instances in a batch form one set of parameters of one optimisation instance, e.g. knapsack values or schedule costs. The learning rate and momentum for each model are selected through a grid search based on the regret on the validation dataset. The best combination of parameters is then used in the experiments shown.

Solving the knapsack instances takes sub-second time per optimisation instance, solving the easy scheduling problems takes 0.1 to 2 seconds per optimisation instance and for the hard scheduling problems solving just the relaxation already costs 30 to 150 seconds per optimisation instance. For the latter, this means that merely evaluating regret on the test-set of 177 optimisation instances takes about 3 hours. With 552 training instances, one epoch of training requires 9 hours.
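The runtime figures for the hard instances can be cross-checked with a few lines of arithmetic. The instance counts and the 30 to 150 second range are taken from the text; the implied average solve time is derived here from the stated 3 hours and is an inference, not a number reported by the authors.

```python
TEST_INSTANCES = 177
TRAIN_INSTANCES = 552
SOLVE_SECONDS_LOW, SOLVE_SECONDS_HIGH = 30, 150
SECONDS_PER_HOUR = 3600

# Range for one pass over the test set if every solve hit the extremes
test_hours_low = TEST_INSTANCES * SOLVE_SECONDS_LOW / SECONDS_PER_HOUR
test_hours_high = TEST_INSTANCES * SOLVE_SECONDS_HIGH / SECONDS_PER_HOUR

# Average solve time implied by "about 3 hours" for the test set
implied_seconds = 3 * SECONDS_PER_HOUR / TEST_INSTANCES

# One training epoch at that implied average, one solve per instance
epoch_hours = TRAIN_INSTANCES * implied_seconds / SECONDS_PER_HOUR

print(f"test pass: {test_hours_low:.2f} to {test_hours_high:.2f} hours")
print(f"implied average solve time: {implied_seconds:.0f} seconds")
print(f"one training epoch: {epoch_hours:.1f} hours")
```

An implied average of roughly one minute per solve sits inside the stated 30 to 150 second range and is consistent with the quoted 9 hours per epoch.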

For all experiments except the hard instances, we repeat each experiment 10 times and report the mean and standard deviation.

We use the Gurobi optimization solver for solving the combinatorial problems, the nn module in PyTorch to implement the predictive models and the optim module in PyTorch for training the models with the corresponding loss functions. Experiments were run on Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz processors with 32GB memory. (The code of our experiments is available at _predit_then_optimize.git.)

Figure 2: Per epoch learning curves of MSE-r, SPO-full and SPO-relax. Panels: (a) Unweighted knapsack, (b) Weighted knapsack (cap: 60), (c) Energy scheduling (1st instance).
Figure 3: Per second learning curves of MSE-r, SPO-full and SPO-relax. Panels: (a) Unweighted knapsack, (b) Weighted knapsack, (c) Energy scheduling (1st instance).

RQ1: exact versus weaker oracles

The first research question is what loss in accuracy is incurred by solving only the relaxation (SPO-relax) instead of the full discrete problems (SPO-full) during training. Together with this, we look at what the gain is in terms of reduction-in-regret over time. We visualise both through the learning curves as evaluated on the test set. For all methods, we compute the exact regret on the test set, i.e. by fully solving the instances with the predictions.

Figure 2 shows the learning curves over epochs, where one epoch is one iteration over all instances of the training data, for three problem instances. We also include the regret of MSE-r as a baseline. In all three cases we see that MSE-r is worse, and stagnates or slightly decreases in performance over the epochs. This validates the use of MSE-r, where the best epoch is chosen retrospectively based on a validation set. It also validates that the SPO surrogate loss function captures the essence of the intended loss, namely regret.

We also see, surprisingly, that SPO-relax achieves very similar performance to SPO-full on all problem instances in our experiments. This means that even though SPO can reason over exact discrete solutions, reasoning over these continuous relaxation solutions is sufficient for the loss function to guide the learning to where it matters.

The real advantage of SPO-relax over SPO-full is evident from Figure 3. Here we present the test regret not against the number of epochs but against the model run-time. MSE-r here includes the time to compute the regret on the validation set, as needed to choose the best epoch. These figures show that SPO-relax runs, and hence converges, much quicker in time than SPO-full. This is thanks to the fact that solving the continuous relaxation can be done in polynomial time while the discrete problem is worst-case exponential. SPO-relax is slower than MSE-r but the difference in quality clearly justifies it.

In the subsequent experiments, we will use SPO-relax only.

RQ2: benefits of warmstarting

Instance | Baseline      | MSE-warmstart | Warmstart from earlier basis
1        | 6.5 (1.5) sec | 8 (0.5) sec   | 1.5 (0.2) sec
2        | 7 (1.5) sec   | 6 (1.0) sec   | 1 (0.2) sec
3        | 10 (0.5) sec  | 12 (1.0) sec  | 2.5 (0.1) sec
Table 1: Comparison of per epoch average (sd) runtime of warmstart strategies

As baseline we use the standard SPO-relax approach. We test warmstarting the learning by first training the network with MSE as loss function for 6 epochs, after which we continue learning with SPO-relax. We refer to this approach as MSE-warmstart. We summarize the effect of warmstarting in Table 1. We observe that warmstarting from MSE results in a slightly faster start in the initial seconds, but this has neither benefit nor penalty over the longer run. Warmstarting from an earlier basis, after the MIP pre-solving, did result in runtime improvements overall.

We also attempted warmstarting by adding objective cuts, but this slowed down the solving of the relaxation, often doubling it, because more iterations were needed.

RQ3: SPO versus QPTL

Next, we compare our approach against the state-of-the-art QP approach (QPTL) of Wilder, Dilkina, and Tambe [wilder2018melding], which proposes to transform the discrete linear integer program into a continuous QP by taking the continuous relaxation and adding a squared L2-norm of the decision variables, $\|v\|_2^2$. This makes the problem quadratic and twice differentiable, allowing them to use the differentiable QP solver [8].

Figure 4 shows the average regret on all unweighted and weighted knapsack instances and easy scheduling instances. We can see that for unweighted knapsack, SPO-relax almost always outperforms the other methods, while QPTL performs worse than MSE-r. For weighted knapsacks, SPO-relax is best for all but the lower capacities. For these lower capacities, QPTL is better though its results worsen for higher capacities.

The same narrative is reflected in Figure 5. In Figure 5(c) (weighted knapsack, capacity: 60), QPTL converges to a better solution than SPO-relax. In all other cases SPO-relax produces solutions of better quality, and in most cases QPTL converges slower than SPO-relax. The poor quality of QPTL at higher capacities may stem from the squared norm, which favors sparse solutions, while at high capacities most items will be included and the best solutions are those that identify which items not to include.

On two energy-scheduling instances SPO-relax performs better, whereas for the other instance the regrets of SPO-relax and QPTL are similar. From Figures 5(e) and 5(f), we can see, again, that SPO-relax converges faster than QPTL.

Figure 4: MSE-r, SPO-relax and QPTL, all instances. Panels: (a) Unweighted knapsack, (b) Weighted knapsack, (c) Energy scheduling.
Figure 5: Learning curves of SPO-relax vs QPTL. Panels: (a) Unweighted (cap: 10), (b) Unweighted (cap: 20), (c) Weighted (cap: 60), (d) Weighted (cap: 120), (e) Energy scheduling (I), (f) Energy scheduling (II).

RQ4: Suitability on large, hard optimisation instances

            MSE-r                                     SPO-relax
Instance    2 epochs  4 epochs  6 epochs  8 epochs    2 hours   4 hours   6 hours
1           90,769    88,952    86,059    86,464      72,662    74,572    79,990
2           128,067   124,450   124,280   123,738     120,800   110,944   114,800
3           129,761   128,400   122,956   119,000     108,748   102,203   112,970
4           135,398   132,366   132,167   126,755     109,694   99,657    97,351
5           122,310   120,949   122,116   123,443     118,946   116,960   118,460
Table 2: Relaxed regret on hard ICON challenge instances

While SPO-relax performs well across the combinatorial instances used so far, these are still toy-level problems with relatively few decision variables that can be solved in a few seconds.

We will use the large-scale optimization instances of the ICON challenge, for which no exact solutions are known. Hence, for this experiment we will report the regret when solving the relaxation of the problem for the test instances, rather than solving the discrete problem during testing as in the previous experiments.
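The notion of relaxed regret can be illustrated with a minimal sketch. It assumes a toy knapsack (a maximization problem) rather than the ICON scheduling instances, because the LP relaxation of a knapsack has a simple greedy optimum; the function names are hypothetical and positive weights are assumed.

```python
def fractional_knapsack(values, weights, capacity):
    """Optimum of the knapsack LP relaxation: fill greedily by value/weight ratio."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    x = [0.0] * len(values)
    remaining = capacity
    for i in order:
        if remaining <= 0 or values[i] <= 0:
            break  # knapsack is full, or no remaining item adds value
        take = min(1.0, remaining / weights[i])  # possibly a fractional item
        x[i] = take
        remaining -= take * weights[i]
    return x


def relaxed_regret(true_values, predicted_values, weights, capacity):
    """True objective lost by optimizing the relaxation on predicted values."""
    x_pred = fractional_knapsack(predicted_values, weights, capacity)
    x_true = fractional_knapsack(true_values, weights, capacity)

    def true_objective(x):
        return sum(v * xi for v, xi in zip(true_values, x))

    return true_objective(x_true) - true_objective(x_pred)
```

Both solutions are evaluated under the true values, so the regret is zero whenever the predictions induce the same relaxed solution as the true values, even if the predicted values themselves are inaccurate.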

We impose a time limit of 2, 4, and 6 hours on the total time budget that SPO-relax can spend on calling the solver. This includes the time to compute and cache the ground-truth solution of a training instance, and the time spent solving for each backpropagation of a training instance. The remaining time spent on handling the data and backpropagation is negligible with respect to the solving time.
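One way such a solver-time budget could be tracked is sketched below. This is an illustration under assumptions, not the setup used in the experiments: `BudgetedSolver`, `train_within_budget`, and the `step` callback are hypothetical names, and the budget is only checked between training instances, so the last instance may slightly overrun it.

```python
import time


class BudgetedSolver:
    """Wraps a solver callable and accumulates the time spent inside it."""

    def __init__(self, solve, budget_seconds, clock=time.perf_counter):
        self.solve = solve
        self.budget = budget_seconds
        self.clock = clock
        self.spent = 0.0

    def exhausted(self):
        return self.spent >= self.budget

    def __call__(self, cost):
        start = self.clock()
        try:
            return self.solve(cost)
        finally:
            # Only solver time counts; data handling is not charged.
            self.spent += self.clock() - start


def train_within_budget(instances, solver, step, max_epochs=100):
    """Train until the solver budget runs out; returns indices of instances used.

    `instances` is a list of (features, true_cost) pairs. `step` performs one
    model update and may call `solver` itself (e.g. for the backward pass).
    """
    cache = {}   # ground-truth solutions, computed once per instance
    seen = set()
    for _ in range(max_epochs):
        for idx, (features, true_cost) in enumerate(instances):
            if solver.exhausted():
                return seen
            if idx not in cache:
                cache[idx] = solver(true_cost)  # charged to the budget
            step(features, true_cost, cache[idx], solver)
            seen.add(idx)
    return seen
```

Passing the clock as a parameter keeps the accounting testable; with the default `time.perf_counter` a 2-hour budget would be `budget_seconds=2 * 3600`.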

The results are shown in Table 2 for the hard scheduling instances. First, we show the test (relaxed) regret after 2, 4, 6, and 8 MSE-r epochs. The results show that the test regret slightly decreases over the epochs; thereafter, we observed, regret tends to increase.

With SPO-relax, the time budget allowed training on only a small fraction of the training instances. Table 2 shows that even for a limited solving budget of 2 hours and without MSE-warmstarting, it already outperforms the MSE-learned models.

This shows that even on very large and hard instances that are computationally expensive to solve, training with SPO-relax on a limited time budget is better than training in a two-stage approach with a non-optimisation-directed loss.

Conclusions and future work

Smart “Predict and Optimize” methods have been shown to be able to learn from, and improve, task loss. Extending these techniques beyond toy problems, more specifically to hard combinatorial problems, is essential to the applicability of this promising idea.

SPO is able to outperform QPTL and lends itself to wide applicability, as it allows for the use of black-box oracles in its loss computation. We investigated the use of weaker oracles and showed that, for the problems studied, learning with the SPO loss while solving the relaxation leads to the same performance as solving the discrete combinatorial problem. We have shown how this opens the way to solving larger and more complex combinatorial problems, for which finding the exact solution may not be possible, let alone doing so repeatedly.
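The role of the black-box oracle can be sketched with the SPO+ subgradient of Elmachtoub and Grigas [9]. The sketch assumes a minimization problem and uses a deliberately trivial, hypothetical oracle (`pick_cheapest`, which selects exactly one item); a relaxation solver or an exact solver could be passed in its place without changing the gradient code.

```python
def spo_plus_subgradient(true_cost, predicted_cost, oracle):
    """SPO+ subgradient w.r.t. the predicted cost vector (minimization form).

    `oracle(cost)` returns a solution vector minimizing cost . x; it is
    treated as a black box and may be exact or a relaxation.
    """
    x_true = oracle(true_cost)
    shifted = [2 * p - t for p, t in zip(predicted_cost, true_cost)]
    x_shifted = oracle(shifted)
    return [2 * (a - b) for a, b in zip(x_true, x_shifted)]


def pick_cheapest(cost):
    """Toy oracle (an assumption for illustration): choose the single cheapest item."""
    best = min(range(len(cost)), key=cost.__getitem__)
    return [1 if i == best else 0 for i in range(len(cost))]
```

For true costs `[1, 2]` and predictions `[3, 0]`, the subgradient is `[2, -2]`: a gradient-descent step lowers the predicted cost of the first item and raises that of the second, moving the prediction toward the correct decision.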

In case of problems with weaker relaxations, one could consider adding cutting planes prior to solving [10]. Moreover, further improvements could be achieved by exploiting the fact that all previously computed solutions are valid candidates. So far we have only used this for warmstarting the solver.
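One possible reading of this idea is sketched below, under the assumption that only the objective coefficients change between solver calls while the constraints stay fixed, so every cached solution remains feasible. The function name and the pool representation are hypothetical.

```python
def best_cached_solution(solution_pool, cost, maximize=False):
    """Return the cached feasible solution with the best objective under `cost`.

    The result is not necessarily optimal, but it is feasible, so it can
    serve as an incumbent (a bound on the optimum) or as a cheap surrogate
    for a full solver call.
    """
    def objective(x):
        return sum(c * xi for c, xi in zip(cost, x))

    pick = max if maximize else min
    return pick(solution_pool, key=objective)
```

For example, with a pool `[(1, 0, 0), (0, 1, 0), (0, 0, 1)]` and costs `[3, 1, 2]`, the minimizing candidate is `(0, 1, 0)`.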

Our work hence encourages more research into the use of weak oracles and relaxation methods, especially those that can benefit from repeated solving. One interesting direction is local search and other iterative refinement methods, as they can improve the solutions during the loss computation. With respect to exact methods, knowledge compilation methods such as (relaxed) BDDs could offer both a runtime improvement from faster repeated solving and the option of employing a relaxation.


We would like to thank the anonymous reviewers for the valuable comments and suggestions. This research is supported by Data-driven logistics (FWO-S007318N).


  • [1] M. Angalakudati, S. Balwani, J. Calzada, B. Chatterjee, G. Perakis, N. Raad, and J. Uichanco (2014) Business analytics for flexible resource allocation under random emergencies. Management Science 60 (6), pp. 1552–1573. Cited by: Introduction.
  • [2] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: Model validation and early stopping with regret.
  • [3] C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: Model validation and early stopping with regret.
  • [4] M. C. Cohen, N. Z. Leung, K. Panchamgam, G. Perakis, and A. Smith (2017) The impact of linear optimization on promotion planning. Operations Research 65 (2), pp. 446–468. Cited by: Introduction.
  • [5] G. B. Dantzig (1957) Discrete-variable extremum problems. Operations Research 5 (2), pp. 266–288. Cited by: Weaker oracles.
  • [6] E. Demirović, G. Chu, and P. J. Stuckey (2018) Solution-based phase saving and large neighbourhood search. In Proceedings of the 24th International Conference on Principles and Practice of Constraint Programming, J. Hooker (Ed.), LNCS, Vol. 11008, pp. 99–108. Cited by: Warmstarting the solving.
  • [7] E. Demirović, P. J. Stuckey, J. Bailey, J. Chan, C. Leckie, K. Ramamohanarao, and T. Guns (2019) An investigation into prediction + optimisation for the knapsack problem. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, L. Rousseau and K. Stergiou (Eds.), pp. 241–257. External Links: ISBN 978-3-030-19212-9 Cited by: Related Work, Related Work.
  • [8] P. Donti, B. Amos, and J. Z. Kolter (2017) Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pp. 5484–5494. Cited by: RQ3: SPO versus QPTL.
  • [9] A. N. Elmachtoub and P. Grigas (2017) Smart “predict, then optimize”. arXiv preprint arXiv:1710.08005. Cited by: Related Work.
  • [10] A. Ferber, B. Wilder, B. Dilkina, and M. Tambe (2019) MIPaaL: mixed integer program as a layer. In Proceedings IJCAI 2019, Cited by: Weaker oracles, Conclusions and future work.
  • [11] K. J. Ferreira, B. H. A. Lee, and D. Simchi-Levi (2015) Analytics for an online retailer: demand forecasting and price optimization. Manufacturing & Service Operations Management 18 (1), pp. 69–88. Cited by: Introduction.
  • [12] G. Ifrim, B. O’Sullivan, and H. Simonis (2012) Properties of energy-price forecasts for scheduling. In International Conference on Principles and Practice of Constraint Programming, pp. 957–972. Cited by: Experimental setup, Data.
  • [13] B. Korte and J. Vygen (2012) Combinatorial optimization. Vol. 2, Springer. Cited by: Introduction.
  • [14] A. Mukhopadhyay, Y. Vorobeychik, A. Dubey, and G. Biswas (2017) Prioritized allocation of emergency responders based on a continuous-time incident prediction model. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 168–177. Cited by: Related Work.
  • [15] S. U. Ngueveu, C. Artigues, and P. Lopez (2016) Scheduling under a non-reversible energy source: an application of piecewise linear bounding of non-linear demand/cost functions. Discrete Applied Mathematics 208, pp. 98–113. Cited by: Introduction.
  • [16] D. Pisinger (2005) Where are the hard knapsack problems?. Computers & Operations Research 32 (9), pp. 2271–2284. Cited by: Unweighted/weighted knapsack problem.
  • [17] L. Pratt and B. Jennings (1996) A survey of connectionist network reuse through transfer. In Learning to learn, pp. 19–43. Cited by: Warmstarting the learning, Warmstarting the learning.
  • [18] S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: footnote 1.
  • [19] H. Simonis, B. O’Sullivan, D. Mehta, B. Hurley, and M. D. Cauwer. CSPLib problem 059: energy-cost aware scheduling. C. Jefferson, I. Miguel, B. Hnich, T. Walsh, and I. P. Gent (Eds.). Cited by: Introduction, Energy-cost aware scheduling.
  • [20] S. Thrun and L. Pratt (2012) Learning to learn. Springer Science & Business Media. Cited by: Warmstarting the learning.
  • [21] L. Trevisan (2011) Combinatorial optimization: exact and approximate algorithms. Stanford University. Cited by: Introduction.
  • [22] H. Wang, H. Xie, L. Qiu, Y. R. Yang, Y. Zhang, and A. Greenberg (2006) COPE: traffic engineering in dynamic networks. In Sigcomm, Vol. 6, pp. 194. Cited by: Related Work.
  • [23] B. Wilder, B. Dilkina, and M. Tambe (2019) Melding the data-decisions pipeline: decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1658–1665. Cited by: Introduction, Related Work.
  • [24] E. A. Yildirim and S. J. Wright (2002) Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization 12 (3), pp. 782–810. Cited by: Warmstarting the solving.
  • [25] M. N. Zeilinger, C. N. Jones, and M. Morari (2011) Real-time suboptimal model predictive control using a combination of explicit mpc and online optimization. IEEE Transactions on Automatic Control 56 (7), pp. 1524–1534. Cited by: Warmstarting the solving.