Introduction
Combinatorial optimization aims to optimize an objective function over a set of feasible solutions defined on a discrete space. Numerous real-life decision-making problems can be formulated as combinatorial optimization problems [13, 21]. In the last decade, the development of time-efficient algorithms for combinatorial optimization problems has paved the way for these algorithms to be widely used in industry, including, but not limited to, resource allocation [1], efficient energy scheduling [15], price optimization [11] and sales promotion planning [4].
The last decade has, in parallel, witnessed tremendous growth in machine learning (ML) methods, which can produce very accurate predictions by leveraging historical and contextual data. In real-world applications, not all parameters of an optimization problem are known at the time of execution, and predictive ML models can be used to estimate those parameters from historical data. For instance, Cohen et al. [4] first predicted future demand of products using an ML model and then used the predicted demand to compute the optimal promotion pricing scheme over the products through nonlinear integer programming.
When predictive ML is followed by optimization, it is often assumed that improvements in the quality of the predictions (with respect to some suitable evaluation metric) will result in better optimization outcomes. However, ML models make errors, and the impact of prediction errors is not uniform throughout the underlying solution space. For example, overestimating the highest-valued prediction might not change the outcome of a maximization problem, while underestimating it can. Hence, a better prediction model may not ensure a better outcome in the optimization stage. In this regard, Ifrim et al. [12] observed that a better predictive model does not always translate to optimized energy-saving schedules.
The alternative is to take the effect of the errors on the optimization outcome into account during learning. In the context of linear programming problems, Elmachtoub and Grigas [9] proposed an approach, called “Smart Predict and Optimize” (SPO), for training ML models by minimizing a convex surrogate loss function which considers the outcome of the optimization stage. Specifically, they consider optimization problems where predictions occur as weights that are linear in the objective.
In this work, we build on that approach and consider discrete combinatorial optimization problems. Indeed, the SPO loss is valid for any optimization problem with a linear objective over predictions, where the constraints implicitly define a convex region. Furthermore, any black box optimization method can be used as only its outcome is used to compute the (sub)gradient of the loss.
The main challenge is the computational cost of repeatedly solving the optimization problem during training, namely once for every evaluation of the loss function on an instance. For NP-hard problems, this may quickly become infeasible. In order to scale up to large problem instances, we investigate the importance of finding the optimal discrete solution for learning, showing that continuous relaxations are highly informative for learning. Furthermore, we investigate how to speed up the learning by transfer learning from easier-to-learn models, as well as methods for speeding up the solving by warmstarting from earlier solutions. Our approach outperforms the state-of-the-art Melding approach [23] in most cases, and for the first time we are able to show the applicability of predict-and-optimize on large-scale combinatorial instances, namely from the ICON energy-cost aware scheduling challenge [19].
Related Work
Predict-and-optimize problems arise in many applications. Current practice is to use a two-stage approach where the ML models are trained independently of the optimization problem. As a consequence, the ML models do not account for the optimization tasks [22, 14]. In recent years there has been growing interest in decision-focused learning [9, 7, 23], which aims to couple ML and decision making.
In the context of portfolio optimization, Bengio (1997) reports that a deep learning model fails to improve future profit when trained with respect to a standard ML loss function, but that a profit-driven loss function turns out to be effective. Kao et al. (2009) consider an unconstrained quadratic optimisation problem, where the predicted values appear linearly in the objective. They train a linear model with respect to a combination of prediction error and optimization loss. They do not mention how this can be applied to optimization problems with constraints.
A number of works aim to exploit the fact that the KKT conditions of a quadratic program (QP) define a system of linear equations around the optimal points. For instance, Donti et al. [8] propose a framework which computes the gradient of the solution of the QP with respect to the predictions by applying the implicit function theorem to differentiate through the KKT conditions around the optimal point. Wilder et al. [23] use the same approach, and propose its use for linear programs by adding a small quadratic term to convert the program into a concave QP. They also propose a specialisation of it for submodular functions.
Our work builds on the SPO approach of Elmachtoub and Grigas [9], where the authors provide a framework to train an ML model which learns with respect to the error in the optimization problem. This is investigated for linear optimization problems with a convex feasible region. We will use the approach for discrete combinatorial problems with a linear objective. These are computationally expensive to solve, e.g. often NP-hard. The decision variables and search space of these problems are discrete, meaning gradients cannot be computed in a straightforward manner. However, the SPO approach remains applicable, as we will see.
Demirović et al. [7] investigate the prediction+optimisation problem for the knapsack problem, and prove that optimizing over predictions is as valid as stochastic optimisation over learned distributions, in case the predictions are used as weights in a linear objective. They further investigate possible learning approaches, and classify them into three groups: indirect approaches, which do not use knowledge of the optimisation problem; semi-direct approaches, which encode knowledge of the optimisation problem, such as the importance of ranking; and direct approaches, which encode or use the optimisation problem in the learning in some way [7]. Our approach is a direct approach and we examine how to combine the best of such techniques in order to scale to large and hard combinatorial problems.
Problem Formulation and Approach
Optimizing a parameterized problem
Traditional optimization algorithms work under the assumption that all the parameters are known precisely. But in a predict-and-optimize setup we assume some parameters of the optimization problem are not known.
We formalize a combinatorial optimization problem as follows:
(1) $v^*(\theta) \equiv \arg\min_{v} f(v, \theta) \quad \text{s.t.} \quad C(v, \theta)$
where $\theta$ defines the set of parameters (coefficients) of the optimization problem, $v$ are the decision variables, $f(v, \theta)$ is an objective to be minimized, and $C(v, \theta)$ is a (set of) constraints that determine(s) the feasible region; hence $v^*(\theta)$ is an oracle that returns the optimal solution.
Consider the 0-1 knapsack, where a set of items, with their values and weights, is provided. The objective is to select a subset of items respecting a capacity constraint on the sum of weights so that the total value of the subset is maximized. The parameter set $\theta$ of the problem consists of the value and weight of each item and the total capacity. The decision variable set $v$ consists of one 0-1 decision variable for each item, $f$ is a linear sum of the variables and the item values, and $C$ describes the capacity constraint.
We decompose $\theta = \theta_s \cup \theta_y$, where $\theta_s$ is the set of parameters that are observed (e.g. the weights and capacity of a knapsack) and $\theta_y$ is the set of unobserved parameters (the values of the knapsack items).
To predict the unobserved parameters $\theta_y$, some attributes correlated with them are observed. We are equipped with a training dataset $D \equiv \{(x_i, \theta_{y_i})\}_{i=1}^{n}$, where the $x_i$ are vectors of attributes correlated to $\theta_{y_i}$. An ML model $m(x, \omega)$ is trained to generate a prediction $\hat{\theta}_y = m(x, \omega)$. The model is characterized by a set of learning parameters $\omega$. E.g. in linear regression, $\omega$ encompasses the slope and the intercept of the regression line. Once the predicted $\hat{\theta}_y$ are obtained, $\theta_s \cup \hat{\theta}_y$ is used for the optimization problem.
To ease notation, we will write $\hat{\theta} = \theta_s \cup \hat{\theta}_y$, containing both the observed parameters and predicted parameters. Recall from Equation (1) that $v^*(\hat{\theta})$ is the optimal solution using parameters $\hat{\theta}$. Then, the objective value of this solution is $f(v^*(\hat{\theta}), \theta)$. Whereas if the actual $\theta$ is known a priori, one could obtain the actual optimal solution $v^*(\theta)$. The difference in using the predicted instead of the actual values is hence measured by
(2) $\text{regret}(\hat{\theta}, \theta) \equiv f(v^*(\hat{\theta}), \theta) - f(v^*(\theta), \theta)$
Ideally the aim of the training should be to generate predictions which minimize this regret on unseen data.
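To make the regret concrete, the following minimal sketch computes it for a tiny 0-1 knapsack. It is an illustration under stated assumptions, not the implementation used in the experiments: the brute-force oracle `knapsack_oracle`, the helper names and the three-item instance are all hypothetical. Because knapsack is a maximization problem, the regret is written here as the true optimal value minus the true value of the solution obtained from the predictions.

```python
from itertools import product


def knapsack_oracle(values, weights, capacity):
    """Brute-force 0-1 knapsack oracle: feasible 0-1 vector with maximal total value."""
    best, best_val = None, float("-inf")
    for x in product((0, 1), repeat=len(values)):
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            val = sum(v * xi for v, xi in zip(values, x))
            if val > best_val:
                best, best_val = x, val
    return best


def objective(x, values):
    """Total value of the selected items."""
    return sum(v * xi for v, xi in zip(values, x))


def regret(pred_values, true_values, weights, capacity):
    """True value lost by optimizing over predictions instead of the true values."""
    x_pred = knapsack_oracle(pred_values, weights, capacity)
    x_true = knapsack_oracle(true_values, weights, capacity)
    return objective(x_true, true_values) - objective(x_pred, true_values)


# Hypothetical instance: three items, capacity 4.
true_values, weights, capacity = [10, 6, 5], [3, 2, 2], 4
accurate = [12, 6, 5]    # small prediction error, but changes the chosen items
inaccurate = [4, 3, 2]   # large prediction error, but preserves the chosen items
regret_accurate = regret(accurate, true_values, weights, capacity)
regret_inaccurate = regret(inaccurate, true_values, weights, capacity)
```

On this instance the first prediction has the smaller squared error yet incurs a regret of 1, whereas the second is far less accurate but incurs no regret.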
Two Stage Learning
First we formalize a two-stage approach, which is widely used in industry and in which the prediction and the optimization are performed in a decoupled manner. The predictive model is first trained with the objective of minimizing the expected loss $\mathbb{E}[L(\theta_y, \hat{\theta}_y)]$ for a suitable choice of loss function $L$.
For regression, if we use squared error as the loss function, the model parameters $\omega$ are estimated by minimizing the Mean Squared Error (MSE) on the training data:
(3) $\text{MSE} \equiv \frac{1}{n} \sum_{i=1}^{n} \lVert \theta_{y_i} - \hat{\theta}_{y_i} \rVert^2$
Training by gradient descent
The process of estimating $\omega$ to minimize the loss function is executed through stochastic gradient descent (SGD) algorithms (there are different variations of SGD; for a detailed discussion refer to [18]), where at each epoch $\omega$ is updated after calculating the gradient of the loss function with respect to the predictions, as shown in Algorithm 1.
The advantage of the two-stage approach is that training by gradient descent is straightforward. In the prediction stage the objective is to minimize the MSE loss to generate accurate predictions without considering their impact on the final solution. Clearly this does not require solving the optimization problem during training.
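The gradient-descent loop of the two-stage approach can be sketched as follows. This is a minimal illustration, assuming a one-feature linear model and a per-instance update; the function name `train_mse`, the learning rate and the toy data are hypothetical and are not taken from the experiments.

```python
def train_mse(xs, ys, lr=0.05, epochs=500):
    """Fit y ~ slope * x + intercept by per-instance gradient descent on squared error."""
    slope, intercept = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = slope * x + intercept
            grad_pred = pred - y           # derivative of 0.5 * (pred - y)^2 w.r.t. pred
            slope -= lr * grad_pred * x    # chain rule: d pred / d slope = x
            intercept -= lr * grad_pred    # chain rule: d pred / d intercept = 1
    return slope, intercept


# Toy, noise-free data generated from y = 2x + 1.
slope, intercept = train_mse([0, 1, 2, 3], [1, 3, 5, 7])
```

Note that no optimization problem is solved anywhere in this loop: the update only depends on the signed difference between prediction and target.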
Model validation and early stopping with regret
It is common practice to perform model validation on a separate validation set while training the model. Early stopping [3] is the practice of choosing the epoch with the smallest loss on the validation set as the final output of the training procedure. The objective is to avoid overfitting on training data. The performance on the validation set is also used to select hyperparameters of the ML models [2].
Considering that the final task in our setup is minimizing the regret, we modify the two-stage learning by measuring regret on the validation set for early stopping and hyperparameter selection. We call this the MSE-r approach. It is more computationally expensive than MSE, given that computing regret on the validation data for every epoch requires solving the optimization problems each time.
Smart Predict then Optimize (SPO)
The major drawback of the two-stage approach is that it does not aim to minimize the regret, but minimizes the error between $\theta_y$ and $\hat{\theta}_y$ directly. As Figure 1 shows, minimizing the loss between $\theta_y$ and $\hat{\theta}_y$ does not necessarily result in minimization of the regret. Early stopping with regret can avoid worsening results, but cannot improve the learning. The SPO framework proposed by Elmachtoub and Grigas [9] addresses this by integrating prediction and optimization.
Note that to minimize $\text{regret}(\hat{\theta}, \theta)$ directly we have to find its gradient with respect to $\hat{\theta}$, which requires differentiating the argmin operator $v^*(\hat{\theta})$ in Eq. (1). This differentiation may not be feasible as $v^*(\hat{\theta})$ can be discontinuous in $\hat{\theta}$ and exponential in size. Consequently we cannot train an ML model to minimize the regret through gradient descent.
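The SPO framework integrates the optimization problem into the loop of gradient descent algorithms in a clever way. In their work, Elmachtoub and Grigas [9] consider an optimization problem with a convex feasible region $S$ and a linear objective:
(4) $v^*(\theta) \equiv \arg\min_{v} \theta^\top v \quad \text{s.t.} \quad v \in S$
where the cost vector $\theta$ is not known beforehand. Following Eq. (2), the regret for such a problem when using predicted values $\hat{\theta}$ instead of the actual $\theta$ is $\theta^\top (v^*(\hat{\theta}) - v^*(\theta))$, which as discussed is not differentiable. To make it differentiable, they use a convex surrogate upper bound of the regret function, which they name the SPO+ loss function $L_{SPO+}(\hat{\theta}, \theta)$. The gradient of $L_{SPO+}(\hat{\theta}, \theta)$ may not exist as it also involves the argmin operator. However, they have shown that $v^*(\theta) - v^*(2\hat{\theta} - \theta)$ is a subgradient of $L_{SPO+}(\hat{\theta}, \theta)$, that is
(5) $v^*(\theta) - v^*(2\hat{\theta} - \theta) \in \partial L_{SPO+}(\hat{\theta}, \theta)$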
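For reference, the SPO+ surrogate can be written out explicitly. The block below restates it with $S$ the feasible region, $\theta$ the true cost vector, $\hat{\theta}$ the predicted cost vector and $v^*(\cdot)$ the minimizing oracle. It follows our reading of the definition given by Elmachtoub and Grigas and should be checked against [9] before being relied upon.

```latex
L_{SPO+}(\hat{\theta}, \theta)
  = \max_{v \in S} \left\{ \theta^\top v - 2\hat{\theta}^\top v \right\}
  + 2\hat{\theta}^\top v^*(\theta) - \theta^\top v^*(\theta)
```

The inner maximum is attained at $v^*(2\hat{\theta} - \theta)$, so taking a subgradient with respect to $\hat{\theta}$ yields $2\,(v^*(\theta) - v^*(2\hat{\theta} - \theta))$. Under this reading the difference in Eq. (5) points in the same direction, and the constant factor 2 can be absorbed into the learning rate.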
The subgradient formulation is the key to bringing the optimization problem into the loop of gradient descent, as shown in Algorithm 2.
The difference between Algorithm 1 and Algorithm 2 is in their (sub)gradients. In Algorithm 1, the MSE gradient is the signed difference between the actual values and the predicted ones; in Algorithm 2, the SPO subgradient is the difference between an optimization solution obtained using the actual parameter values and another solution obtained using a linear combination of the predicted values and the true values, namely $2\hat{\theta} - \theta$.
For the 0-1 knapsack problem, the solution of a knapsack instance is a 0-1 vector of length equal to the size of the item set, where 1 represents that the corresponding item is selected. In this case, the subgradient is the element-wise difference between the two solutions, and if the solution using the transformed predicted values is the same as the solution using actual values, all entries of the subgradient are zero. In essence, the non-zero entries in the subgradient indicate places where the two solutions contradict.
Note that to compute the subgradient for this SPO approach, the optimization problem $v^*(2\hat{\theta} - \theta)$ needs to be solved for each training instance, while $v^*(\theta)$ can be precomputed and cached. Moreover, one training instance typically contains multiple predictions. For example, if we consider a 0-1 knapsack problem with 10 items, then one training instance always contains 10 value predictions, one for each item. Furthermore, the dataset may contain thousands of training instances of 10 values each. Hence, one iteration over the training data (one epoch) requires solving thousands of knapsack problems. As an ML model is trained over several epochs, the training process is clearly computationally expensive.
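The subgradient computation described above can be sketched in a few lines. This is a hedged illustration rather than the authors' Algorithm 2: the toy oracle (pick the `k` cheapest items, a minimization problem with a linear objective), the linear model and the names `oracle`, `spo_subgradient` and `spo_step` are all assumptions made for the example.

```python
from itertools import combinations


def oracle(costs, k):
    """Toy oracle: 0-1 vector selecting exactly k items with minimal total cost."""
    n = len(costs)
    best = min(combinations(range(n), k), key=lambda s: sum(costs[i] for i in s))
    return [1 if i in best else 0 for i in range(n)]


def spo_subgradient(pred, true, k):
    """Difference v*(theta) - v*(2 * theta_hat - theta) used as SPO+ subgradient."""
    v_true = oracle(true, k)                       # can be precomputed and cached
    shifted = [2 * p - t for p, t in zip(pred, true)]
    v_shifted = oracle(shifted, k)                 # must be solved at every step
    return [a - b for a, b in zip(v_true, v_shifted)]


def spo_step(w, X, true, k, lr):
    """One subgradient step for a linear model pred_i = sum_j w[j] * X[i][j]."""
    pred = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    g = spo_subgradient(pred, true, k)
    # Chain rule: d pred_i / d w[j] = X[i][j].
    return [wj - lr * sum(gi * row[j] for gi, row in zip(g, X))
            for j, wj in enumerate(w)]


true_costs = [1, 5, 3]
g_wrong = spo_subgradient([4, 2, 3], true_costs, k=1)   # prediction ranks item 1 cheapest
g_right = spo_subgradient([2, 6, 4], true_costs, k=1)   # prediction agrees with the truth
w_new = spo_step([1.0], [[4], [2], [3]], true_costs, k=1, lr=0.1)
```

When the two solutions agree the subgradient is all zeros and the model is left unchanged; otherwise the non-zero entries push the predicted cost of the truly optimal item down and that of the wrongly selected item up.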
Combinatorial problems and scaling up
We observe that the SPO approach and its corresponding loss function place no restriction on the type of oracle used. Given that our target task is to minimize the regret of the combinatorial problem, an oracle that solves the combinatorial optimisation problem is the most natural choice. We call this approach SPO-full.
Weaker oracles
Repeatedly solving combinatorial problems is computationally expensive. Hence for large and hard problems, it is necessary to look at ways to reduce the solving time.
As there is no restriction on the oracle used, we consider using weaker oracles during training. NP-hard problems that have a polynomial (bounded) approximation algorithm could use the approximation algorithm in the loss as a proxy instead, for example the greedy algorithm [5] in case of knapsack. For mixed integer programming (MIP) formulations, a natural weaker oracle to use is the continuous relaxation of the problem. While disregarding the discrete part, relaxations can often separate the part of the problem that is trivial (variable assignments close to 0 or 1) from the part that is non-trivial. For example, for knapsack the continuous relaxation leads to very similar solutions compared to the greedy algorithm. Note that we always use the same oracle for $v^*(\theta)$ and $v^*(2\hat{\theta} - \theta)$ when computing the loss. We call the approach of using the continuous relaxation as oracle SPO-relax.
In case of weak MIP relaxations, one can also use a cutting plane algorithm in the root node and use the resulting tighter relaxation [10]. Other weaker oracles could also be used, for example setting a time limit on an anytime solver and using the best solution found, or a node limit on search algorithms. In case of mixed integer programming, we can also set a gap tolerance, which means the solver does not have to prove optimality. We call this SPO-gap. For stability of the learning, it is recommended that the solution returned by the oracle does not vary much when called with (near) identical input.
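As an example of a weaker oracle, the continuous relaxation of a 0-1 knapsack can be solved greedily by value-to-weight ratio, with at most one fractional item. The sketch below assumes strictly positive weights; the function name `knapsack_relaxation` and the instance are hypothetical, and in the experiments the relaxation is solved by an LP solver rather than by this routine.

```python
def knapsack_relaxation(values, weights, capacity):
    """Optimal solution of the LP relaxation (0 <= x_i <= 1) of a 0-1 knapsack.

    Assumes strictly positive weights. Items with non-positive value are never
    taken, which matters when the oracle is called with shifted predictions.
    """
    x = [0.0] * len(values)
    remaining = capacity
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    for i in order:
        if values[i] <= 0 or remaining <= 0:
            break                                   # nothing left worth adding
        take = min(1.0, remaining / weights[i])     # fractional for the last item
        x[i] = take
        remaining -= take * weights[i]
    return x


relaxed = knapsack_relaxation([10, 6, 5], [3, 2, 2], 4)
```

Here the relaxed solution takes the first item fully and half of the second. The entries equal to 0 or 1 correspond to the "trivial" part of the problem, while the fractional entry marks where the discrete decision is non-trivial.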
Apart from changing what is being solved, we also investigate ways to warmstart the learning, and to warmstart across solver calls:
Warmstarting the learning
We consider warmstarting the learning by transfer learning [17], that is, to train the model with an easy to compute loss function, and then continue training it with a more difficult one. In our case, we can pretrain the model using MSE as loss, which means the predictions will already be more informed when we start using an SPO loss afterwards.
Warmstarting the solving
When computing the loss, we must solve both $v^*(\theta)$, using the true values $\theta$, and $v^*(2\hat{\theta} - \theta)$. We know that an optimal solution to $v^*(\theta)$ is also a valid (but potentially suboptimal) solution to $v^*(2\hat{\theta} - \theta)$, as only the coefficients of the objective differ. Furthermore, if this is an optimal solution to the latter then we would achieve zero regret, hence we can expect the solution of $v^*(\theta)$ to be of decent quality for $v^*(2\hat{\theta} - \theta)$ too.
We hence want to aid the solving of $v^*(2\hat{\theta} - \theta)$ by using the optimal solution of $v^*(\theta)$. One way to do this for CP solvers is solution-guided search [6]. For MIP/LP we can use warmstarting [24, 25], that is, use the previous solution as starting point for the MIP. In case of linear programming (and hence the relaxation), we can reuse the basis of the solution.
An alternative is to use the true solution to compute a bound on the objective function. Indeed, the solution to $v^*(\theta)$ is valid for $v^*(2\hat{\theta} - \theta)$ and has an objective value of $f(v^*(\theta), 2\hat{\theta} - \theta)$. Hence, we can use this as a bound on the objective and potentially cut away a large part of the search space.
While the true solutions $v^*(\theta)$ can be cached, we must compute this solution once for each training instance $(x_i, \theta_{y_i})$, which may already take significant time for large problems. We observe that only the objective changes between calls to the oracle $v^*$, and hence any previously computed solution is also a candidate solution for the other calls. We can hence use warmstarting for any solver call after the first, and from any previously computed solution so far.
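The idea of reusing earlier solutions can be sketched without reference to a particular solver. The snippet below keeps a pool of previously computed feasible solutions and, for a new cost vector, picks the cheapest one as starting solution and objective bound. The names `pool` and `best_incumbent` and the numbers are hypothetical; how the start and the bound are handed to the solver depends on the solver used (in Gurobi, to our knowledge, the relevant hooks are the `Start` variable attribute and the `Cutoff` parameter).

```python
def objective(costs, solution):
    """Linear objective value of a solution under the given cost vector."""
    return sum(c * v for c, v in zip(costs, solution))


def best_incumbent(pool, costs):
    """Cheapest pooled feasible solution under the new costs (minimization)."""
    best = min(pool, key=lambda sol: objective(costs, sol))
    return objective(costs, best), best


# Solutions cached from earlier oracle calls over the same feasible region.
pool = [(1, 0, 0), (0, 1, 0)]
shifted_costs = [7, -1, 3]    # e.g. 2 * predicted costs - true costs
bound, start = best_incumbent(pool, shifted_costs)
```

Because only the objective changes between calls, every pooled solution stays feasible, so `bound` is a valid upper bound on the optimum of the new minimization problem and `start` is a valid warmstart.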
Experimental Evaluation
We consider three types of combinatorial problems: unweighted knapsack, weighted knapsack and energy-cost aware scheduling. Below we briefly discuss the problem formulations:
Unweighted/weighted knapsack problem
The knapsack problem can be formalized as $\arg\max_{v \in \{0,1\}^n} \sum_i p_i v_i$ s.t. $\sum_i w_i v_i \le c$. The values $p_i$ will be predicted from data, and the weights $w_i$ and capacity $c$ are given. In the unweighted knapsack, all weights are 1 and the problem is polynomial-time solvable. Weighted knapsacks are NP-hard and it is known that the computational difficulty increases with the correlation between weights and values [16]. We generated mildly correlated knapsacks as follows: for each of the 48 half-hour slots we assign a weight by sampling from a fixed set of weights; we then add Gaussian noise to each profit value and multiply the result by the corresponding weight.
In the unweighted case, we consider 9 different capacity values $c$, namely from 5 to 45 in steps of 5. For the weighted knapsack experiment, we consider 7 different capacity values, from 30 to 210 in steps of 30.
Energycost aware scheduling
This is a resource-constrained job scheduling problem where the goal is to minimize the (predicted) energy cost; it is described in CSPLib as problem 059 [19]. In summary, we are given a number of machines and have to schedule a given number of tasks, where each task has a duration, an earliest start and a latest end, a resource requirement and a power usage. Each machine has a resource capacity constraint. We omit startup/shutdown costs. No task is allowed to stretch over midnight between two days, and a task cannot be interrupted once started, nor migrate to another machine. Time is discretized in timeslots and a schedule has to be made over all timeslots at once. For each timeslot, a (predicted) energy cost is given and the objective is to minimize the total energy cost of running the tasks on the machines.
We consider two variants of the problem: easy instances consisting of 30-minute timeslots (i.e. 48 timeslots per day), and hard instances as used in the ICON energy challenge consisting of 5-minute timeslots (288 timeslots per day). The easy instances have 3 machines and respectively 10, 15 and 20 tasks. The hard instances each have 10 machines and 200 tasks.
Data
Our data is drawn from the Irish Single Electricity Market Operator (SEMO) [12]. This dataset consists of historical energy price data at 30-minute intervals, from midnight on 1st November 2011 to 31st December 2013. Each instance of the data has calendar attributes; day-ahead estimates of weather characteristics; SEMO day-ahead forecasted energy-load, wind-energy production and prices; and actual wind-speed, temperature, intensity and price. Of the actual attributes, we keep only the actual price, and use it as a target for prediction. For the hard scheduling instances, each 5-minute timeslot has the same price as the 30-minute timeslot it belongs to, following the ICON challenge.
Experimental setup
For all our experiments, we use a linear model without any hidden layer as the underlying predictive model. Note that the SPO approach is model-free and is compatible with any deep neural network; however, earlier work [12] showed that accuracy in predictions is not effective for the downstream optimization task. For the experiments, we divide our data into three sets: training (70%), validation (10%) and test (20%), and evaluate the performance by measuring regret on the test set. Training, validation and test data cover 552, 60 and 177 days of energy data respectively.
Our model is trained by batch gradient descent, where each batch corresponds to one day, that is, 48 consecutive training instances, namely one for each half hour of that day. Each batch together forms one set of parameters of one optimisation instance, e.g. knapsack values or schedule costs. The learning rate and momentum for each model are selected through a grid search based on the regret on the validation dataset. The best combination of parameters is then used in the experiments shown.
Solving the knapsack instances takes sub-second time per optimisation instance, solving the easy scheduling problems takes 0.1 to 2 seconds per optimisation instance, and for the hard scheduling problems solving just the relaxation already costs 30 to 150 seconds per optimisation instance. For the latter, this means that merely evaluating regret on the test set of 177 optimisation instances takes about 3 hours. With 552 training instances, one epoch of training requires about 9 hours.
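These totals can be cross-checked with a few lines of arithmetic. The sketch below assumes an average relaxation solve time of about 60 seconds per instance, a value backed out from the reported 3 hours and not stated in the text.

```latex
\begin{aligned}
177 \times 30\,\text{s} &\approx 1.5\,\text{h}, & 177 \times 150\,\text{s} &\approx 7.4\,\text{h}, \\
177 \times 60\,\text{s} &\approx 3\,\text{h},   & 552 \times 60\,\text{s}  &\approx 9.2\,\text{h}.
\end{aligned}
```

The first row gives the range implied by the per-instance solve times; the second row shows that a single average of roughly one minute per instance is consistent with both reported totals.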
For all experiments except the hard instances, we repeat each experiment 10 times and report the mean and standard deviation.
We use the Gurobi optimization solver for solving the combinatorial problems, the nn module in PyTorch to implement the predictive models and the optim module in PyTorch for training the models with the corresponding loss functions. Experiments were run on Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz processors with 32GB memory. The code of our experiments is available at https://github.com/JayMan91/aaai_predit_then_optimize.git.
RQ1: exact versus weaker oracles
The first research question is what the loss in accuracy is of solving only the relaxation (SPO-relax) rather than the full discrete problems (SPO-full) during training. Together with this, we look at what the gain is in terms of reduction in regret over time. We visualise both through the learning curves as evaluated on the test set. For all methods, we compute the exact regret on the test set, i.e. by fully solving the instances with the predictions.
Figure 2 shows the learning curves over epochs, where one epoch is one iteration over all instances of the training data, for three problem instances. We also include the regret of MSE-r as baseline. In all three cases we see that MSE-r is worse, and stagnates or slightly decreases in performance over the epochs. This validates the use of MSE-r, where the best epoch is chosen retrospectively based on a validation set. It also validates that the SPO surrogate loss function captures the essence of the intended loss, namely regret.
We also see, surprisingly, that SPO-relax achieves very similar performance to SPO-full on all problem instances in our experiments. This means that even though SPO can reason over exact discrete solutions, reasoning over the continuous relaxation solutions is sufficient for the loss function to guide the learning to where it matters.
The real advantage of SPO-relax over SPO-full is evident from Figure 3. Here we present the test regret not against the number of epochs but against the model runtime. MSE-r here includes the time to compute the regret on the validation set as needed to choose the best epoch. These figures show that SPO-relax runs, and hence converges, much quicker than SPO-full. This is thanks to the fact that solving the continuous relaxation can be done in polynomial time while the discrete problem is worst-case exponential. SPO-relax is slower than MSE-r but the difference in quality clearly justifies it.
In the subsequent experiments, we will use SPO-relax only.
RQ2: benefits of warmstarting
Instance   Baseline         MSE-warmstart    Warmstart from earlier basis
1          6.5 (1.5) sec    8 (0.5) sec      1.5 (0.2) sec
2          7 (1.5) sec      6 (1.0) sec      1 (0.2) sec
3          10 (0.5) sec     12 (1.0) sec     2.5 (0.1) sec
As baseline we use the standard SPO-relax approach. We test warmstarting the learning by first training the network with MSE as loss function for 6 epochs, after which we continue learning with SPO-relax. We indicate this approach by MSE-warmstart. We summarize the effect of warmstarting in Table 1. We observe that warmstarting from MSE results in a slightly faster start in the initial seconds, but this has neither benefit nor penalty over the longer run. Warmstarting from an earlier basis, after the MIP presolving, did result in runtime improvements overall.
We also attempted warmstarting by adding objective cuts, but this slowed down the solving of the relaxation, often doubling the solving time, because more iterations were needed.
RQ3: SPO versus QPTL
Next, we compare our approach against the state-of-the-art QP approach (QPTL) of Wilder et al. [23], who propose to transform the discrete linear integer program into a continuous QP by taking the continuous relaxation and adding a squared L2-norm of the decision variables, $\lVert v \rVert_2^2$. This makes the problem quadratic and twice differentiable, allowing them to use the differentiable QP solver [8].
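Written out, the transformation amounts to the following sketch, stated for a minimization problem. The constraint form $Av \le b$ for the relaxed feasible region and the regularization weight $\gamma > 0$ are notation assumed here for illustration; they are not taken from the text.

```latex
v^*_{\gamma}(\theta) \equiv \arg\min_{v} \; \theta^\top v + \gamma \lVert v \rVert_2^2
\quad \text{s.t.} \quad A v \le b
```

For $\gamma > 0$ the objective is strictly convex, so the minimizer is unique and varies continuously with $\theta$, which is what allows gradients to be obtained by differentiating through the KKT conditions.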
Figure 4 shows the average regret on all unweighted and weighted knapsack instances and easy scheduling instances. We can see that for unweighted knapsack, SPO-relax almost always outperforms the other methods, while QPTL performs worse than MSE-r. For weighted knapsacks, SPO-relax is best for all but the lower capacities. For these lower capacities, QPTL is better, though its results worsen for higher capacities.
The same narrative is reflected in Figure 5. In Figure 5(c) (weighted knapsack, capacity 60) QPTL converges to a better solution than SPO-relax. In all other cases SPO-relax produces better-quality solutions, and in most cases QPTL converges slower than SPO-relax. The poor quality of QPTL at higher capacities may stem from the squared norm, which favors sparse solutions, while at high capacities most items will be included and the best solutions are those that identify which items not to include.
On two energy-scheduling instances SPO-relax performs better, whereas for the other instance the regrets of SPO-relax and QPTL are similar. From Figures 5(e) and 5(f), we can see, again, that SPO-relax converges faster than QPTL.
Figure 5 panels: (a) Unweighted (cap: 10), (b) Unweighted (cap: 20), (c) Weighted (cap: 60), (d) Weighted (cap: 120), (e) Energy Scheduling (I), (f) Energy Scheduling (II).
RQ4: Suitability on large, hard optimisation instances
           MSE-r                                         SPO-relax
Instance   2 epochs   4 epochs   6 epochs   8 epochs    2 hour     4 hour     6 hour
1          90,769     88,952     86,059     86,464      72,662     74,572     79,990
2          128,067    124,450    124,280    123,738     120,800    110,944    114,800
3          129,761    128,400    122,956    119,000     108,748    102,203    112,970
4          135,398    132,366    132,167    126,755     109,694    99,657     97,351
5          122,310    120,949    122,116    123,443     118,946    116,960    118,460
While SPO-relax performs well across the combinatorial instances used so far, these are still toy-level problems with relatively few decision variables that can be solved in a few seconds.
We will use the large-scale optimization instances of the ICON challenge, for which no exact solutions are known. Hence, for this experiment we will report the regret when solving the relaxation of the problem for the test instances, rather than solving the discrete problem during testing as in the previous experiments.
We impose a time limit, in hours, on the total time budget that SPO-relax can spend on calling the solver; the budgets considered are listed in Table 2. This includes the time to compute and cache the ground-truth solution of a training instance, and the time spent solving for each backpropagation of a training instance. The remaining time spent on handling the data and backpropagation is negligible with respect to the solving time.
The results are shown in Table 2 for the hard scheduling instances. First, we show the test (relaxed) regret after 2, 4, 6 and 8 MSE-r epochs. The results show that the test regret slightly decreases over the epochs; thereafter, we observed, regret tends to increase.
With SPO-relax, within these time budgets it was possible to train on only a small fraction of the training instances. Table 2 shows that even for the smallest solving budget of 2 hours and without MSE-warmstarting, it already outperforms the MSE-learned models.
This shows that even on very large and hard instances that are computationally expensive to solve, training with SPO-relax on a limited time budget is better than training in a two-stage approach with a loss that is not optimisation-directed.
Conclusions and future work
Smart “Predict and Optimize” methods have been shown to be able to learn from, and improve, task loss. Extending these techniques to be applicable beyond toy problems, more specifically to hard combinatorial problems, is essential to the applicability of this promising idea.
SPO is able to outperform QPTL and lends itself to wide applicability, as it allows for the use of black-box oracles in its loss computation. We investigated the use of weaker oracles and showed that, for the problems studied, learning with the SPO loss while solving the relaxation leads to performance comparable to solving the discrete combinatorial problem. We have shown how this opens the way to solving larger and more complex combinatorial problems, for which finding the exact solution may not be possible, let alone doing so repeatedly.
In case of problems with weaker relaxations, one could consider adding cutting planes prior to solving [10]. Moreover, further improvements could be achieved by exploiting the fact that all previously computed solutions are valid candidates. So far we have only used this for warmstarting the solver.
Our work hence encourages more research into the use of weak oracles and relaxation methods, especially those that can benefit from repeated solving. One interesting direction is local search methods and other iterative refinement methods, as they can improve the solutions during the loss computation. With respect to exact methods, knowledge compilation methods such as (relaxed) BDDs could offer a runtime improvement both from faster repeated solving and from employing a relaxation.
Acknowledgments
We would like to thank the anonymous reviewers for the valuable comments and suggestions. This research is supported by Data-driven logistics (FWO-S007318N).
References
 [1] (2014) Business analytics for flexible resource allocation under random emergencies. Management Science 60 (6), pp. 1552–1573. Cited by: Introduction.
 [2] (2012) Random search for hyperparameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: Model validation and early stopping with regret.
 [3] (2006) Pattern recognition and machine learning. Springer. Cited by: Model validation and early stopping with regret.
 [4] (2017) The impact of linear optimization on promotion planning. Operations Research 65 (2), pp. 446–468. Cited by: Introduction.
 [5] (1957) Discrete-variable extremum problems. Operations Research 5 (2), pp. 266–288. Cited by: Weaker oracles.
 [6] (2018) Solution-based phase saving and large neighbourhood search. In Proceedings of the 24th International Conference on Principles and Practice of Constraint Programming, J. Hooker (Ed.), LNCS, Vol. 11008, pp. 99–108. Cited by: Warmstarting the solving.

 [7] (2019) An investigation into prediction + optimisation for the knapsack problem. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, L. Rousseau and K. Stergiou (Eds.), pp. 241–257. External Links: ISBN 9783030192129 Cited by: Related Work, Related Work.
 [8] (2017) Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pp. 5484–5494. Cited by: RQ3: SPO versus QPTL.
 [9] (2017) Smart “predict, then optimize”. arXiv preprint arXiv:1710.08005. Cited by: Related Work.
 [10] (2019) MIPaaL: mixed integer program as a layer. In Proceedings IJCAI 2019, Cited by: Weaker oracles, Conclusions and future work.
 [11] (2015) Analytics for an online retailer: demand forecasting and price optimization. Manufacturing & Service Operations Management 18 (1), pp. 69–88. Cited by: Introduction.
 [12] (2012) Properties of energy-price forecasts for scheduling. In International Conference on Principles and Practice of Constraint Programming, pp. 957–972. Cited by: Experimental setup, Data.
 [13] (2012) Combinatorial optimization. Vol. 2, Springer. Cited by: Introduction.
 [14] (2017) Prioritized allocation of emergency responders based on a continuous-time incident prediction model. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 168–177. Cited by: Related Work.
 [15] (2016) Scheduling under a non-reversible energy source: an application of piecewise linear bounding of non-linear demand/cost functions. Discrete Applied Mathematics 208, pp. 98–113. Cited by: Introduction.
 [16] (2005) Where are the hard knapsack problems?. Computers & Operations Research 32 (9), pp. 2271–2284. Cited by: Unweighted/weighted knapsack problem.
 [17] (1996) A survey of connectionist network reuse through transfer. In Learning to learn, pp. 19–43. Cited by: Warmstarting the learning, Warmstarting the learning.
 [18] (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: footnote 1.
 [19] C. Jefferson, I. Miguel, B. Hnich, T. Walsh, and I. P. Gent (Eds.) CSPLib problem 059: energy-cost aware scheduling. Note: http://www.csplib.org/Problems/prob059 Cited by: Introduction, Energycost aware scheduling.
 [20] (2012) Learning to learn. Springer Science & Business Media. Cited by: Warmstarting the learning.
 [21] (2011) Combinatorial optimization: exact and approximate algorithms. Stanford University. Cited by: Introduction.
 [22] (2006) COPE: traffic engineering in dynamic networks. In Sigcomm, Vol. 6, pp. 194. Cited by: Related Work.
 [23] (2019) Melding the data-decisions pipeline: decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1658–1665. Cited by: Introduction, Related Work.
 [24] (2002) Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization 12 (3), pp. 782–810. Cited by: Warmstarting the solving.
 [25] (2011) Real-time suboptimal model predictive control using a combination of explicit MPC and online optimization. IEEE Transactions on Automatic Control 56 (7), pp. 1524–1534. Cited by: Warmstarting the solving.