A learning-based algorithm to quickly compute good primal solutions for Stochastic Integer Programs

We propose a novel approach using supervised learning to obtain near-optimal primal solutions for two-stage stochastic integer programming (2SIP) problems with constraints in the first and second stages. The goal of the algorithm is to predict a "representative scenario" (RS) for the problem such that, deterministically solving the 2SIP with the random realization equal to the RS, gives a near-optimal solution to the original 2SIP. Predicting an RS, instead of directly predicting a solution ensures first-stage feasibility of the solution. If the problem is known to have complete recourse, second-stage feasibility is also guaranteed. For computational testing, we learn to find an RS for a two-stage stochastic facility location problem with integer variables and linear constraints in both stages and consistently provide near-optimal solutions. Our computing times are very competitive with those of general-purpose integer programming solvers to achieve a similar solution quality.



1 Introduction

Two-stage stochastic integer programming (2SIP) is a standard framework for modeling decision making under uncertainty. In this framework, the so-called first-stage decisions are made first. Then, the values of some uncertain parameters in the problem are realized, as if sampled from a known distribution. Finally, a second set of decisions, the so-called second-stage or recourse decisions, is made depending upon the realized value of the uncertain parameters. The decision maker, in this setting, minimizes the sum of (i) a linear function of the first-stage decision variables and (ii) the expected value of the second-stage optimization problem.

2SIP has been studied extensively in the literature [4, 15, 14, 11, 20, 24, 7, 17, 13, 18, 19] owing to its applicability in various decision-making situations under uncertainty, such as stochastic unit-commitment problems for electricity generation [18, 19], stochastic facility location problems [14], and stochastic supply chain network design [21], among others. Given the overwhelming importance of 2SIP, a wide range of solution algorithms have been proposed, for example, [2, 15, 23, 1, 22, 6].

In this paper we are interested in using machine learning (ML) to obtain good primal solutions to 2SIP. Along this line, Nair et al. [16] proposed a reinforcement learning-based heuristic solver to quickly find solutions to 2SIP. Given that the agent can be trained offline, the algorithm provided solutions much faster for some classes of problems compared to an open-source general-purpose mixed-integer programming (MIP) solver, in their case, SCIP [9, 10]. However, their method is based on the following restrictive assumptions:

  1. All first-stage variables are required to be binary. General integer variables or continuous variables in the first stage cannot be handled.

  2. Every binary vector is required to be feasible for the first stage of the problem, i.e., no constraints are allowed in the first stage.

The first assumption is intrinsic to the method in [16], as both the initialization policy and the local-move policy of the method involve flipping the bits of the first-stage decision vector. Hence, one cannot easily handle general integer or continuous variables. The second assumption is again crucial to the algorithm in [16], as flipping a bit in the first stage could potentially make the new decision infeasible, and it might require a more complicated feedback mechanism to check and discard infeasible solutions. In fact, if there are constraints, it is NP-hard to decide whether there exists a flip that keeps the decision feasible. Alternatively, one could empirically penalize the infeasible solutions, but tuning the penalty might be a hard problem in itself.

In contrast, our method does not require either of these two restrictive assumptions. We allow binary, general integer, as well as continuous variables in both the first and the second stage of the problem. We also allow constraints in both stages. Furthermore, we have a simple and direct approach to handle the first-stage constraints, without requiring any empirical penalties.

We make the following common assumption to exclude pathological cases in which an uncertain realization can render a feasible first-stage decision infeasible.

Assumption 1.

The 2SIP has complete recourse, i.e., if a first-stage decision is feasible with respect to the first-stage constraints, then it is feasible for all the second-stage problems as well.

We make another assumption, namely that the uncertainty has finite support, so that we have a proper benchmark to compare our solution against. However, this assumption can be readily removed without affecting the proposed algorithm.

Assumption 2.

The uncertainty distribution in the 2SIP has a finite support.

2 Problem definition

We formally define a 2SIP as follows:

$$\min_{x} \quad c^\top x + \mathbb{E}_{\xi}\left[Q(x, \xi)\right] \tag{1a}$$
$$\text{subject to} \quad Ax \leq b, \quad x \in \mathbb{Z}^{p} \times \mathbb{R}^{n-p}, \tag{1b}$$

where $Q(x, \xi) = \min_{y} \left\{ d_\xi^\top y \;:\; W_\xi y \leq h_\xi - T_\xi x,\; y \in \mathbb{Z}^{q} \times \mathbb{R}^{m-q} \right\}$, $x$ and $y$ are the first- and second-stage decisions respectively, $c \in \mathbb{R}^{n}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^{k}$, and, for each realization $\xi$ of the uncertainty, $d_\xi \in \mathbb{R}^{m}$, $T_\xi \in \mathbb{R}^{\ell \times n}$, $W_\xi \in \mathbb{R}^{\ell \times m}$, $h_\xi \in \mathbb{R}^{\ell}$.

When Assumption 2 holds, with scenarios $\xi_1, \ldots, \xi_S$ occurring with probabilities $p_1, \ldots, p_S$, the 2SIP described above can also be expressed as a single deterministic MIP as follows:

$$\min_{x,\, y_1, \ldots, y_S} \quad c^\top x + \sum_{s=1}^{S} p_s\, d_{\xi_s}^\top y_s \tag{2a}$$
$$\text{subject to} \quad Ax \leq b, \quad T_{\xi_s} x + W_{\xi_s} y_s \leq h_{\xi_s} \;\; \forall s, \quad x \in \mathbb{Z}^{p} \times \mathbb{R}^{n-p}, \;\; y_s \in \mathbb{Z}^{q} \times \mathbb{R}^{m-q} \;\; \forall s. \tag{2b}$$

When Assumption 2 does not hold, the formulation (2) could be a finite-sample approximation of (1), which is extensively studied in the stochastic programming literature. Imitating [16], we compare our algorithm against solving (2) with a general-purpose MIP solver.

3 Methodology

In this section, we discuss the algorithmic contribution of the paper.

3.1 Surrogate formulation

We first define the objective value function (OVF), mapping each feasible first-stage decision $x$ to $F(x) = c^\top x + \mathbb{E}_{\xi}\left[Q(x, \xi)\right]$, the function we are trying to optimize over the mixed-integer set defined in (1).

Given (2), we define the surrogate problem associated with a scenario $\hat{\xi}$ as follows:

$$\min_{x,\, y} \quad c^\top x + d_{\hat{\xi}}^\top y \tag{3a}$$
$$\text{subject to} \quad Ax \leq b, \quad T_{\hat{\xi}} x + W_{\hat{\xi}} y \leq h_{\hat{\xi}}, \quad x \in \mathbb{Z}^{p} \times \mathbb{R}^{n-p}, \;\; y \in \mathbb{Z}^{q} \times \mathbb{R}^{m-q}. \tag{3b}$$

In other words, should the value of the uncertain parameters be deterministically known to be $\hat{\xi}$, the decision maker could simply solve the surrogate problem associated with $\hat{\xi}$. Now, the idea behind the algorithm proposed in this paper is captured by Conjecture 1.

Conjecture 1.

Let (2) (and hence (1)) have an optimal objective value of $z^\star$. There exists $\hat{\xi}$ in the domain of the uncertain parameters such that if $(\hat{x}, \hat{y})$ solves the (much smaller) surrogate problem defined by $\hat{\xi}$, then $F(\hat{x}) = z^\star$.

Observe that, by construction, $\hat{x}$ is feasible to the original problem in (1). Also, Conjecture 1 asserts that there exists a realization of the uncertainty ($\hat{\xi}$) such that if one deterministically optimizes for that realization $\hat{\xi}$, then the resulting solution is optimal for the original 2SIP. Each such $\hat{\xi}$ is called a representative scenario (RS) for the given 2SIP.
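On a small instance with finite support, the existence of an RS can be checked exhaustively. The snippet below is a self-contained sketch on a toy two-facility problem with illustrative numbers (none of them from the paper): it enumerates the support and keeps every scenario whose surrogate solution attains the stochastic optimum.

```python
from itertools import product

COSTS = (3.0, 2.0)          # first-stage opening costs (illustrative)
SCENARIOS = (1, 2)          # finite support of the demand
PROBS = (0.5, 0.5)          # scenario probabilities
SERVE, PENALTY = 1.0, 10.0  # second-stage unit serving / shortfall costs

def Q(x, xi):
    """Second-stage cost: serve up to the open capacity, penalize unmet demand."""
    served = min(sum(x), xi)
    return SERVE * served + PENALTY * (xi - served)

def first_stage_cost(x):
    return sum(c * xj for c, xj in zip(COSTS, x))

def full_objective(x):
    """Objective of the original 2SIP (1)."""
    return first_stage_cost(x) + sum(p * Q(x, s) for p, s in zip(PROBS, SCENARIOS))

def surrogate_solution(xi):
    """First-stage solution of the surrogate problem (3) for scenario xi."""
    return min(product((0, 1), repeat=len(COSTS)),
               key=lambda x: first_stage_cost(x) + Q(x, xi))

z_star = min(full_objective(x) for x in product((0, 1), repeat=len(COSTS)))
# Every scenario whose surrogate solution attains z_star is representative.
representative = [xi for xi in SCENARIOS
                  if full_objective(surrogate_solution(xi)) == z_star]
```

Here the search finds `representative == [2]`: deterministically planning for the high-demand scenario reproduces the stochastic optimum, while planning for the low-demand one does not.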

Now, given adequate computing resources, one can solve the following bilevel program to obtain an RS:

$$\min_{\hat{\xi}} \quad F(\hat{x}) = c^\top \hat{x} + \mathbb{E}_{\xi}\left[Q(\hat{x}, \xi)\right] \tag{4a}$$
$$\text{subject to} \quad (\hat{x}, \hat{y}) \in \arg\min \left\{ \text{the surrogate problem (3) associated with } \hat{\xi} \right\}. \tag{4b}$$

If the optimal value of this problem matches the optimal value of the original 2SIP, then the corresponding value of $\hat{\xi}$ forms the RS. Note that if $\hat{\xi}$ enters the surrogate problem linearly, then (4) is a mixed-integer bilevel linear program (MIBLP) and can hopefully be solved faster than the general case.

3.2 Learning algorithm

The goal of the ML algorithm is to predict an optimal $\hat{\xi}$ for (4), given the data of the 2SIP. On the one hand, we are expecting the ML algorithm to predict the solution of a seemingly much harder optimization problem than the original 2SIP. On the other hand, the task is easier for ML because there are no constraints on the predicted variables $\hat{\xi}$. Supervised learning is a natural tool to achieve this goal.

Supervised learning can be used if there is a training dataset of problem instances and their corresponding RSs. Since the RS is real-valued, the task of predicting it can be formulated as a regression task. The algorithm tries to minimize the mean squared error (MSE) between the true and predicted RS. The prediction can also be evaluated on optimization metrics, comparing the solutions and objective values obtained from the true and predicted RS.
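As a minimal illustration of this regression setup, the sketch below fits a least-squares line in pure Python and scores it with the MSE. A single synthetic feature stands in for the instance description (the paper's actual input is 190-dimensional), and the data is made up for the example.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(ys, ys_pred):
    """Mean squared error, the training criterion used for the RS regressors."""
    return sum((a - b) ** 2 for a, b in zip(ys, ys_pred)) / len(ys)

# Synthetic training pairs: one feature (e.g., an aggregated demand statistic)
# versus one component of the representative scenario.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # lies exactly on y = 2x + 1
slope, intercept = fit_linear(xs, ys)
preds = [slope * x + intercept for x in xs]
```

Since the synthetic data is exactly linear, the fit recovers slope 2 and intercept 1 with zero MSE; on real instances, as discussed in Section 6, the training loss does not vanish.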

4 Computational study

This section discusses the computational study performed to support Conjecture 1.

4.1 Problem definition

In this work, we consider a version of the two-stage stochastic capacitated facility location problem (S-CFLP) for computational analysis. The problem is enhanced such that both the first and the second stage have integer as well as continuous variables. More precisely, the first stage consists of deciding (i) the locations where a facility has to be opened (binary decisions), and (ii) if a facility is open, the maximum demand that the facility can serve (continuous decisions). There are also constraints that impose bounds on the total number of facilities that can be opened. The uncertainty in the problem pertains to the demand values at various locations in the second stage, which are sampled from a finite distribution. Once the demand is realized, the second stage consists of deciding (i) whether a given open facility should serve the demand at a location (binary decisions), and (ii) if a facility serves the demand at a location, what quantity of demand is to be served (continuous decisions). These decisions have to ensure that the demand and supply constraints are met. The problem formulation is presented formally in Section 0.A.1.

4.2 Data generation

Instance generation.

We generate 50K instances of S-CFLP, with 10 facilities, 10 clients and 50 scenarios. We provide details on how the data for these instances are generated in Section 0.A.2. The generated instances are solved using Gurobi, running 2 threads, to a gap of at most 2% or a 10 min time limit.

Next, we compute an RS for each of the 50K instances. As stated earlier, one could solve the mixed-integer bilevel programs (4) to obtain the RSs. However, due to the computational burden, we instead use heuristics that exploit the (nearly) optimal solutions to the 2SIP already obtained from Gurobi. These heuristics are detailed in Section 0.A.3. Out of the 50K instances, they find an RS for 49,290 instances. We believe that a more thorough search would enable us to find an RS for all the problems.

4.3 Learning algorithm

We formulate the task of predicting the RS as a regression task. The dataset, which comprises instances and their corresponding RSs, contains 49,290 examples. It is split into training and test sets of size 45K and 4,290, respectively. Further, a validation set of 5K is carved out from the training set. We use linear regression (LR) and an artificial neural network (ANN) to minimize the MSE between the true and predicted RS.

Feature engineering.

It is well known that features describing the interactions between variables, constraints, and other problem components help ML methods perform better than plain data matrices do [12, 5, 8, 3]. In this spirit, along with the fixed and variable costs of opening facilities at different locations, we also provide aggregated features over the set of scenarios. These features give information about each of the potential facility locations in S-CFLP, as well as about the way different locations interact through the demands at adjacent nodes. The detailed set of features is given in Section 0.A.4.

4.4 Comparison

In order to evaluate the ML-based prediction of the RS, which we denote $\hat{\xi}_{\text{ML}}$, we compare the solution obtained by solving the surrogate problem associated with $\hat{\xi}_{\text{ML}}$ against solutions obtained by various algorithms.

We use LR and ANN to predict the RS. We compare these predictions against (i) the solution obtained using Gurobi by solving (2) (GRB); (ii) the solution obtained by solving the surrogate associated with the average scenario (AVG); (iii) the solution obtained by solving the surrogate associated with a scenario chosen uniformly at random among the instance's scenarios (RND); and (iv) the solution obtained by solving the surrogate associated with a scenario sampled from the distribution of the scenarios predicted by LR (DIST). Note that GRB produces better solutions (in most cases) than the ML methods, however at the cost of significantly longer computing times. We therefore also assess the time it takes GRB to reach a solution of quality comparable to LR and ANN; we refer to these as GRB-L and GRB-A, respectively.

5 Results

Table 1 reports the objective value difference ratio, defined as (obj. val. by a method − GRB obj. val.) / (GRB obj. val.), for each method, and Table 2 reports statistics on computing times. Before analyzing the results in more detail, we note a key finding that emerges: LR and ANN perform almost as well as GRB (in terms of objective value) in a fraction of the time taken by GRB. Figure 1 captures the trade-off between the quality of the solutions obtained by the different methods and the time taken to obtain them.
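For clarity, the reported ratio can be written as a one-line helper (a restatement of the definition above, not code from the paper):

```python
def obj_value_diff_ratio(method_obj, grb_obj):
    """Objective value difference ratio in %, as reported in Table 1.

    Positive values mean the method is worse than GRB; negative values mean
    it found a better incumbent than GRB's (2%-gap) solution.
    """
    return 100.0 * (method_obj - grb_obj) / grb_obj
```

For instance, `obj_value_diff_ratio(102.64, 100.0)` returns approximately 2.64, the worst-case LR entry, while a method that beats GRB yields a negative ratio.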

Figure 1: Objective value difference ratio vs. avg time in seconds

We observe from Table 1 that LR and ANN produce decisions that are, on average (and in median), as good as those of GRB, and in some cases the ML-based methods even beat GRB, i.e., produce solutions whose objective value is better than that of GRB. This is possible because GRB does not necessarily solve the problem to optimality, but only up to a gap of 2%. Further, even in the worst of the 4,290 test cases, LR is at most 2.64% away from GRB. To show that this is not easily achieved, we also compare GRB against AVG, RND and DIST. We observe that these methods perform much more poorly than GRB, unlike LR and ANN.

           AVG     RND     DIST    LR      ANN
Min        3.36   -0.33   -0.17   -0.62   -0.45
Max       14.23   94.42   49.87    2.64    7.85
Avg.       8.10   12.10    5.41    0.64    1.02
Median     8.08    8.24    3.54    0.60    0.90
Std. dev.  1.59   12.11    5.39    0.41    0.70
Table 1: Objective value difference ratio, GRB vs. the five other methods (in %)
            GRB      AVG     RND     DIST    GRB-L     GRB-A     LR      ANN
Min        0.4354   0.0041  0.0041  0.0040    0.4049    0.4268  0.0194  0.0197
Max      559.32     0.0077  0.0081  0.0065  140.43    140.43    0.0454  0.0457
Avg.       8.7713   0.0043  0.0045  0.0043    4.2650    3.0398  0.0206  0.0209
Median     2.2621   0.0043  0.0044  0.0042    1.4613    1.2898  0.0200  0.0204
Std. dev. 17.919    0.0001  0.0003  0.0001    8.5576    6.8362  0.0019  0.0019
Table 2: Statistics on computing times of the different methods (in seconds)

Analyzing the time improvement from using LR and ANN, we observe from Table 2 that these methods solve the S-CFLP orders of magnitude faster than GRB. Indeed, GRB takes over 8 seconds on average to solve these problems, while the maximum time is 0.046 seconds using LR and ANN. We emphasize that the time reported for the ML methods includes both the time to compute the feature values and the time to solve the surrogate associated with the predicted scenario. Recall that GRB-L and GRB-A denote the time it takes GRB to produce a solution of quality comparable to LR and ANN, respectively. The results show that GRB cannot produce a solution of the same quality as LR and ANN in a comparable time; in fact, LR and ANN are still orders of magnitude faster.

6 Discussion

In this paper, we present an algorithm to solve 2SIP using ML-based methods. The method hinges on the existence of the RS conjectured in Section 3.1. Computationally, we see that the methods proposed in this paper consistently provide good-quality solutions to S-CFLP orders of magnitude faster than a general-purpose solver.

An important observation made while training the models is that we were never able to drive the training loss close to zero. Naturally, the predicted RSs on the test dataset were quite different from the RSs estimated using our heuristics. The differences between the predicted components of the RS and those obtained using the heuristics are documented in Figure 2 in the Appendix. Despite this, the solution values to the 2SIP determined by our algorithm were near-optimal, as shown in the results, and significantly better than those obtained with the other baseline methods. The mismatch between ML metrics and those characterizing discrete optimization problems is a known issue [3] requiring extensive research, and, in our context, we believe that exploring this avenue might produce better solutions.

Another interesting observation is that LR beats ANN in this task. We suspect that this is partly due to the parsimony of LR. It is also encouraging news, as it suggests that the sample complexity of the learning task might be relatively small in general. We believe that a natural extension of this work is to carry out these analyses more formally.

Further, we believe that computational tests assessing the performance of the algorithm on different datasets of 2SIPs are crucial to show how much, and where, our method generalizes. This might involve learning solutions to other forms of 2SIP, like the stochastic unit-commitment problem, the stochastic supply chain network design problem, etc. These are cases where we believe that Conjecture 1 still holds, but we do not have computational validation.

Finally, we would also be interested in extending the theory to settings where Conjecture 1 is not expected to hold at all, or holds only with weaker guarantees; for example, when both stages are mixed-integer nonlinear programming (MINLP) problems. In such cases, it will be useful to understand the reach of ML-based solution techniques as opposed to traditional MINLP solvers.


  • Ahmed [2013] Shabbir Ahmed. A scenario decomposition algorithm for 0–1 stochastic programs. Operations Research Letters, 41(6):565–569, 2013.
  • Ahmed et al. [2004] Shabbir Ahmed, Mohit Tawarmalani, and Nikolaos V Sahinidis. A finite branch-and-bound algorithm for two-stage stochastic integer programs. Mathematical Programming, 100(2):355–377, 2004.
  • Bengio et al. [2018] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv preprint arXiv:1811.06128, 2018.
  • Birge and Louveaux [2011] John R Birge and Francois Louveaux. Introduction to stochastic programming. Springer Science & Business Media, 2011.
  • Bonami et al. [2018] Pierre Bonami, Andrea Lodi, and Giulia Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
  • Carøe and Tind [1998] Claus C Carøe and Jørgen Tind. L-shaped decomposition of two-stage stochastic programs with integer recourse. Mathematical Programming, 83(1-3):451–464, 1998.
  • Dupačová et al. [2003] Jitka Dupačová, Nicole Gröwe-Kuska, and Werner Römisch. Scenario reduction in stochastic programming. Mathematical programming, 95(3):493–511, 2003.
  • Gasse et al. [2019] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. arXiv preprint arXiv:1906.01629, 2019.
  • Gleixner et al. [2018a] Ambros Gleixner, Michael Bastubbe, Leon Eifler, Tristan Gally, Gerald Gamrath, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Marco E. Lübbecke, Stephen J. Maher, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Franziska Schlösser, Christoph Schubert, Felipe Serrano, Yuji Shinano, Jan Merlin Viernickel, Matthias Walter, Fabian Wegscheider, Jonas T. Witt, and Jakob Witzig. The SCIP Optimization Suite 6.0. Technical report, Optimization Online, July 2018a. URL http://www.optimization-online.org/DB_HTML/2018/07/6692.html.
  • Gleixner et al. [2018b] Ambros Gleixner, Michael Bastubbe, Leon Eifler, Tristan Gally, Gerald Gamrath, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Marco E. Lübbecke, Stephen J. Maher, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Franziska Schlösser, Christoph Schubert, Felipe Serrano, Yuji Shinano, Jan Merlin Viernickel, Matthias Walter, Fabian Wegscheider, Jonas T. Witt, and Jakob Witzig. The SCIP Optimization Suite 6.0. ZIB-Report 18-26, Zuse Institute Berlin, July 2018b. URL http://nbn-resolving.de/urn:nbn:de:0297-zib-69361.
  • Kall et al. [1994] Peter Kall, Stein W Wallace, and Peter Kall. Stochastic programming. Springer, 1994.
  • Khalil et al. [2016] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Linderoth et al. [2006] Jeff Linderoth, Alexander Shapiro, and Stephen Wright. The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142(1):215–241, 2006.
  • Louveaux and Peeters [1992] François V Louveaux and Dominique Peeters. A dual-based procedure for stochastic facility location. Operations research, 40(3):564–573, 1992.
  • Lulli and Sen [2004] Guglielmo Lulli and Suvrajeet Sen. A branch-and-price algorithm for multistage stochastic integer programming with application to stochastic batch-sizing problems. Management Science, 50(6):786–796, 2004.
  • Nair et al. [2018] Vinod Nair, Dj Dvijotham, Iain Dunning, and Oriol Vinyals. Learning fast optimizers for contextual stochastic integer programs. In UAI, pages 591–600, 2018.
  • Nemirovski et al. [2009] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
  • Powell and Meisel [2015a] Warren B Powell and Stephan Meisel. Tutorial on stochastic optimization in energy—part i: Modeling and policies. IEEE Transactions on Power Systems, 31(2):1459–1467, 2015a.
  • Powell and Meisel [2015b] Warren B Powell and Stephan Meisel. Tutorial on stochastic optimization in energy—part ii: An energy storage illustration. IEEE Transactions on Power Systems, 31(2):1468–1475, 2015b.
  • Prékopa [2013] András Prékopa. Stochastic programming, volume 324. Springer Science & Business Media, 2013.
  • Santoso et al. [2005] Tjendera Santoso, Shabbir Ahmed, Marc Goetschalckx, and Alexander Shapiro. A stochastic programming approach for supply chain network design under uncertainty. European Journal of Operational Research, 167(1):96–115, 2005.
  • Sen [2010] Suvrajeet Sen. Stochastic mixed-integer programming algorithms: Beyond benders’ decomposition. Wiley Encyclopedia of Operations Research and Management Science, 2010.
  • Sen and Higle [2005] Suvrajeet Sen and Julia L Higle. The c 3 theorem and a d 2 algorithm for large scale stochastic mixed-integer programming: set convexification. Mathematical Programming, 104(1):1–20, 2005.
  • Shapiro et al. [2009] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.

Appendix 0.A Computational test details

0.A.1 Problem formulation

We provide below the problem considered in this work for the computational study:

$$\min_{x,\, u} \quad \sum_{i=1}^{n} \left( f_i x_i + c_i u_i \right) + \mathbb{E}_{\xi}\left[ Q(x, u, \xi) \right] \tag{5a}$$
$$\text{subject to} \quad \left\lceil n/10 \right\rceil \leq \sum_{i=1}^{n} x_i \leq \left\lfloor 3n/4 \right\rfloor, \tag{5b}$$
$$u_i \leq M x_i, \quad x_i \in \{0, 1\}, \quad u_i \geq 0 \quad \forall i, \tag{5c}$$

where, for a demand realization $\xi$,

$$Q(x, u, \xi) = \min_{v,\, w} \quad \sum_{i=1}^{n} \sum_{j=1}^{n} \left( g_{ij} v_{ij} + q_{ij} w_{ij} \right) \tag{5d}$$
$$\text{subject to} \quad \sum_{j=1}^{n} w_{ij} \leq u_i \quad \forall i, \tag{5e}$$
$$\sum_{i=1}^{n} w_{ij} = \xi_j \quad \forall j, \tag{5f}$$
$$w_{ij} \leq M v_{ij}, \quad v_{ij} \in \{0, 1\}, \quad w_{ij} \geq 0 \quad \forall i, j. \tag{5g}$$
In this problem, we minimize the fixed and variable costs of opening facilities along with the fixed and variable costs of transportation between the facilities and the demand nodes. There are $n$ potential locations where a facility could be opened. A fixed cost of $f_i$ is incurred if a facility is opened at location $i$, and a variable cost of $c_i$ is incurred per unit of capacity of the facility opened at location $i$. The binary variable $x_i$ tracks whether a facility is opened at location $i$, and the continuous variable $u_i$ indicates the size of the facility at location $i$. The constraint in (5c), along with the binary constraint on $x_i$, ensures that the costs are incurred in the right way. Finally, (5b) is a complicating constraint, which says that at least a tenth of the locations must have a facility open and not more than three-quarters of the locations may have a facility open.

In the second stage, $g_{ij}$ is the fixed cost incurred in transporting from location $i$ to $j$, and $q_{ij}$ is the per-unit variable cost incurred in transporting from $i$ to $j$. The binary variable $v_{ij}$ denotes whether any transportation happens from $i$ to $j$, and the continuous variable $w_{ij}$ denotes the actual quantity transported from $i$ to $j$. The second-stage objective in (5d) minimizes the transportation cost incurred under a random demand scenario parameterized by $\xi$. Then, (5e) ensures that the total quantity transported out of a facility is not greater than the capacity of the facility, while (5f) ensures that the total quantity supplied to a location $j$ equals the (random) demand $\xi_j$ at $j$. Finally, constraints (5g) link the $v$ and $w$ variables appropriately.
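The complicating constraint on the number of open facilities translates into simple integer bounds. A small helper (the symbol $n$ and the tenth/three-quarter fractions are taken from the description above; the rounding direction is the natural one for integer counts) computes them:

```python
import math

def facility_count_bounds(n):
    """Bounds on the number of open facilities implied by the first-stage
    cardinality constraint: at least a tenth and at most three-quarters
    of the n candidate locations."""
    return math.ceil(n / 10), math.floor(3 * n / 4)
```

For the instances used in the computational study (n = 10 locations), this gives bounds (1, 7): at least one and at most seven facilities open.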

0.A.2 Data generation

Instance generation.

We generate 50K instances of S-CFLP, with 10 facilities, 10 clients and 50 scenarios. The random parameters $f$, $c$ and $\xi$ vary across the instances, whereas $g$ and $q$ are fixed across all instances. Moreover, $f_i$ and $c_i$ are sampled from discrete uniform distributions over [15, 20) and [5, 10), respectively. The demand matrix $\xi$ is generated by first evaluating a nominal demand for each client; each demand scenario $\xi_s$ is then generated by sampling from a Poisson distribution with mean equal to this nominal demand.
The generated instances are solved using Gurobi, running 2 threads, to a gap of at most 2% or a 10 min time limit. We store the objective value, the gap closed and the first-stage solution $(x^\star, u^\star)$. We were able to solve all the instances up to the specified gap within the specified time limit.

We follow Algorithm 1 for generating $\hat{\xi}$.

Data: 2SIP with objective value $z^\star$ and first-stage solution $x^\star$, max iterations maxIter
Result: $\hat{\xi}$ or NULL
1 $\hat{\xi} \leftarrow$ average scenario; iter $\leftarrow 0$;
2 while iter $<$ maxIter do
3       Formulate surrogate problem using $\hat{\xi}$;
4       Solve and extract first-stage solution $\hat{x}$;
5       if $\hat{x} = x^\star$ then
6             return $\hat{\xi}$;
7       else
8             Perturb $\hat{\xi}$ using heuristics based on $x^\star$ and $\hat{x}$;
9       iter $\leftarrow$ iter $+ 1$;
return NULL
Algorithm 1 GenerateXiHat
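Algorithm 1 can be sketched as a small generic driver in which the surrogate solver and the perturbation heuristic are passed in as callables. The concrete stopping test (matching first-stage solutions) and the toy perturbation below are assumptions standing in for the heuristics of Section 0.A.3.

```python
def generate_xi_hat(solve_surrogate, perturb, x_star, xi_init, max_iter=100):
    """Search for a representative scenario, mirroring Algorithm 1.

    solve_surrogate(xi) -> first-stage solution of the surrogate for xi.
    perturb(xi, x_star, x_hat) -> a new candidate scenario.
    Returns a scenario whose surrogate reproduces x_star, or None.
    """
    xi = xi_init
    for _ in range(max_iter):
        x_hat = solve_surrogate(xi)        # steps 3-4
        if x_hat == x_star:                # step 5
            return xi                      # step 6
        xi = perturb(xi, x_star, x_hat)    # step 8
    return None

# Toy usage (illustrative): scenarios are scalars, and the surrogate "solver"
# opens both facilities only when the scenario demand is at least 2.
solver = lambda xi: (1, 1) if xi >= 2 else (0, 1)
bump = lambda xi, x_opt, x_hat: xi + 1     # naive perturbation: raise demand
found = generate_xi_hat(solver, bump, x_star=(1, 1), xi_init=0)
```

In this toy run, the loop stops at `found == 2`, the first candidate whose surrogate reproduces the optimal first stage; in the paper, the perturbations instead zero out or rescale demands as described next.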

The heuristics for updating the RS, based on $x^\star$ and $\hat{x}$, are described in Section 0.A.3.

0.A.3 Heuristics

Let $x^\star$ and $\hat{x}$ be the first-stage optimal solution and the surrogate solution associated with the scenario $\hat{\xi}$, respectively. We use three heuristics in tandem to generate $\hat{\xi}$. The first heuristic is based on comparing which facilities are open or closed in the optimal and surrogate solutions: the demand in $\hat{\xi}$ is zeroed out at clients for which the facility is closed in the optimal solution but open in the surrogate solution.

The remaining two heuristics are based on comparing the capacities installed at the facilities in the optimal and surrogate solutions. First, the client with the maximum absolute difference between the capacities installed in the optimal and surrogate solutions is identified, i.e., $j^\star = \arg\max_i \left| u^\star_i - \hat{u}_i \right|$. The demand at this client in $\hat{\xi}$ is then updated either by a fixed percentage of the current demand, or by a fraction of the difference between the capacities installed in the optimal and surrogate solutions.
0.A.4 Feature engineering

Figure 2: Demand histogram for different methods.

The inputs of the models are $f$, $c$ and $\xi$. We do not provide $g$ and $q$ in the input, as they are fixed across all instances. We perform feature engineering on $\xi$, instead of providing it as a raw input, to extract information useful for predicting $\hat{\xi}$. Let $\Xi$ be an $S \times n$ matrix, where $S$ is the number of scenarios and $n$ is the number of clients. We calculate the minimum, maximum, average, standard deviation, median, lower quantile, and upper quantile of $\Xi_j$ (the $j$-th column of $\Xi$) for each client $j$.

We also compute the percentage of scenarios in which a fraction $\alpha$ of the demand of a client is greater than, and the percentage in which it is less than, the demand at all the other nodes. We set $\alpha$ to different values (0.9, 1, 1.1, 1.2 and 1.5), and we thus end up with an input vector of size 190, combining $f$, $c$ and the features extracted from $\xi$. The feature extraction performs an aggregation over the number of scenarios.
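This aggregation can be reproduced in a short sketch. The specific quantiles and the exact encoding of the $\alpha$-features below are assumptions, chosen so that the feature count works out to $10 + 10 + 10 \times (7 + 2 \times 5) = 190$ as stated above.

```python
import statistics as st

ALPHAS = (0.9, 1.0, 1.1, 1.2, 1.5)

def column_features(col):
    """Seven summary statistics of one client's demand across scenarios."""
    s = sorted(col)
    lower_q, upper_q = s[len(s) // 4], s[(3 * len(s)) // 4]
    return [min(col), max(col), st.mean(col), st.pstdev(col),
            st.median(col), lower_q, upper_q]

def build_input(f, c, xi):
    """Assemble the model input: costs f and c, plus features aggregated
    over the S x n demand matrix xi (a list of scenario rows)."""
    S, n = len(xi), len(xi[0])
    feats = list(f) + list(c)
    for j in range(n):
        col = [xi[s][j] for s in range(S)]
        feats.extend(column_features(col))
        for a in ALPHAS:
            # share of scenarios where alpha * (demand at j) dominates,
            # or is dominated by, the demand at every other node
            dominates = sum(all(a * xi[s][j] > xi[s][k]
                                for k in range(n) if k != j) for s in range(S))
            dominated = sum(all(a * xi[s][j] < xi[s][k]
                                for k in range(n) if k != j) for s in range(S))
            feats.extend([100.0 * dominates / S, 100.0 * dominated / S])
    return feats
```

With 10 facilities, 10 clients and 50 scenarios, `build_input` produces a vector of length 190, independent of the number of scenarios, since all scenario information is aggregated.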