1 Introduction
Two-stage stochastic integer programming (2SIP) is a standard framework to model decision making under uncertainty. In this framework, first the so-called first-stage decisions are made. Then, the values of some uncertain parameters in the problem are determined, as if sampled from a known distribution. Finally, the second set of decisions is made depending upon the realized value of the uncertain parameters, the so-called second-stage or recourse decisions of the problem. The decision maker, in this setting, minimizes the sum of (i) a linear function of the first-stage decision variables and (ii) the expected value of the second-stage optimization problem.
2SIP is studied extensively in the literature [4, 15, 14, 11, 20, 24, 7, 17, 13, 18, 19] owing to its applicability in various decision-making situations under uncertainty, like the stochastic unit-commitment problems for electricity generation [18, 19], stochastic facility location problems [14], and stochastic supply chain network design [21], among others. Given the importance of 2SIP, a wide range of solution algorithms has been proposed, for example, [2, 15, 23, 1, 22, 6].
In this paper, we are interested in using machine learning (ML) to obtain good primal solutions to 2SIP. Along this line, Nair et al. [16] proposed a reinforcement-learning-based heuristic solver to quickly find solutions to 2SIP. Given that the agent can be trained offline, the algorithm provided solutions much faster, for some classes of problems, than an open-source general-purpose mixed-integer programming (MIP) solver, in their case SCIP [9, 10]. However, their method is based on the following restrictive assumptions:
(a) All first-stage variables are required to be binary; general integer or continuous variables in the first stage cannot be handled.

(b) Every binary vector is required to be feasible for the first stage of the problem, i.e., no constraints are allowed in the first stage.
Assumption (a) above is intrinsic to the method in [16], as both the initialization policy and the local-move policy of the method involve flipping the bits of the first-stage decision vector; hence, one cannot easily accommodate general integer or continuous variables. Assumption (b) is again crucial to the algorithm in [16], as flipping a bit in the first stage could potentially make the new decision infeasible, which would require a more complicated feedback mechanism to check for and discard infeasible solutions. In fact, if there are constraints, it is hard to decide whether there exists a flip that keeps the decision feasible. Alternatively, one could empirically penalize the infeasible solutions, but tuning the penalty might be a hard problem in itself.
In contrast, our method does not require either of these two restrictive assumptions. We allow binary, general integer and continuous variables in both the first and the second stage of the problem, and we allow constraints in both stages. Furthermore, we have a simple and direct approach to handle the first-stage constraints, without requiring any empirical penalties.
We make the following common assumption to exclude pathological cases where an uncertain realization can turn a feasible first-stage decision infeasible.
Assumption 1.
The 2SIP has complete recourse, i.e., if a first-stage decision is feasible given the first-stage constraints, then it is feasible for all the second-stage problems as well.
We also assume that the uncertainty has finite support, so that we have a proper benchmark to compare our solution against. However, this assumption can be readily removed without affecting the proposed algorithm.
Assumption 2.
The uncertainty distribution in the 2SIP has a finite support.
2 Problem definition
We formally define a 2SIP as follows:
\begin{align}
\min_{x}\quad & c^\top x + \mathbb{E}_{\xi}\big[Q(x, \xi)\big] \tag{1a}\\
\text{subject to}\quad & Ax \le b, \tag{1b}\\
& x \in \mathbb{Z}^{p_1} \times \mathbb{R}^{n_1 - p_1}, \tag{1c}
\end{align}
where
\[
Q(x, \xi) \;=\; \min_{y} \left\{ d_\xi^\top y \;:\; T_\xi x + W_\xi y \le h_\xi,\; y \in \mathbb{Z}^{p_2} \times \mathbb{R}^{n_2 - p_2} \right\},
\]
$x$ and $y$ are the first- and second-stage decisions respectively, $c \in \mathbb{R}^{n_1}$, $A \in \mathbb{R}^{m_1 \times n_1}$, $b \in \mathbb{R}^{m_1}$, and, for each realization $\xi$ of the uncertainty, $d_\xi \in \mathbb{R}^{n_2}$, $T_\xi \in \mathbb{R}^{m_2 \times n_1}$, $W_\xi \in \mathbb{R}^{m_2 \times n_2}$, $h_\xi \in \mathbb{R}^{m_2}$.
When Assumption 2 holds, with support $\{\xi_1, \dots, \xi_S\}$ and probabilities $p_1, \dots, p_S$, the 2SIP described above can also be expressed as a single deterministic MIP as follows:
\begin{align}
\min_{x,\, y_1, \dots, y_S}\quad & c^\top x + \sum_{s=1}^{S} p_s\, d_{\xi_s}^\top y_s \tag{2a}\\
\text{subject to}\quad & Ax \le b, \tag{2b}\\
& T_{\xi_s} x + W_{\xi_s} y_s \le h_{\xi_s}, \qquad s = 1, \dots, S, \tag{2c}\\
& x \in \mathbb{Z}^{p_1} \times \mathbb{R}^{n_1 - p_1}, \tag{2d}\\
& y_s \in \mathbb{Z}^{p_2} \times \mathbb{R}^{n_2 - p_2}, \qquad s = 1, \dots, S. \tag{2e}
\end{align}
When Assumption 2 does not hold, the formulation (2) could be a finite-sample approximation of (1), which is extensively studied in the stochastic programming literature. Following [16], we compare our algorithm against solving (2) with a general-purpose MIP solver.
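To make the deterministic equivalent concrete, the following minimal sketch solves a tiny, hypothetical 2SIP by brute-force enumeration of its extensive form. All data (costs, scenarios, the single first-stage constraint) are invented for illustration; a realistic instance would of course be handed to a MIP solver instead.

```python
from itertools import product

# Hypothetical toy 2SIP: 2 binary first-stage variables, 2 equally likely
# scenarios, and a binary recourse. All data are invented for illustration.
c = [3.0, 2.0]                       # first-stage costs
scenarios = [                        # (probability, second-stage costs d, rhs h)
    (0.5, [1.0, 4.0], 2),
    (0.5, [5.0, 1.0], 2),
]

def second_stage(x, d, h):
    # Q(x, xi): min d^T y  s.t.  y_1 + y_2 >= h - (x_1 + x_2),  y binary
    return min(d[0] * y[0] + d[1] * y[1]
               for y in product([0, 1], repeat=2)
               if y[0] + y[1] >= h - (x[0] + x[1]))

def total_cost(x):
    # objective of the extensive form: c^T x + sum_s p_s * Q(x, xi_s)
    return (sum(ci * xi for ci, xi in zip(c, x))
            + sum(p * second_stage(x, d, h) for p, d, h in scenarios))

# first-stage feasibility ("Ax <= b"): open at least one of the two options
feasible = [x for x in product([0, 1], repeat=2) if x[0] + x[1] >= 1]
x_star = min(feasible, key=total_cost)
```

Enumerating every first-stage point is only viable at this toy scale, but it mirrors exactly what the extensive form (2) encodes: one copy of the second stage per scenario, weighted by its probability.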
3 Methodology
In this section, we discuss the algorithmic contribution of the paper.
3.1 Surrogate formulation
We first define the objective value function (OVF) $F : \mathbb{Z}^{p_1} \times \mathbb{R}^{n_1 - p_1} \to \mathbb{R}$, mapping $x \mapsto c^\top x + \mathbb{E}_{\xi}\big[Q(x, \xi)\big]$, i.e., the function we are trying to optimize over the mixed-integer set defined in (1).
Given (2), we define the surrogate problem associated with a scenario $\hat{\xi}$ as follows:
\begin{align}
\min_{x,\, y}\quad & c^\top x + d_{\hat{\xi}}^\top y \tag{3a}\\
\text{subject to}\quad & Ax \le b, \tag{3b}\\
& T_{\hat{\xi}}\, x + W_{\hat{\xi}}\, y \le h_{\hat{\xi}}, \tag{3c}\\
& x \in \mathbb{Z}^{p_1} \times \mathbb{R}^{n_1 - p_1}, \quad y \in \mathbb{Z}^{p_2} \times \mathbb{R}^{n_2 - p_2}. \tag{3d}
\end{align}
In other words, should the value of the uncertain parameters be deterministically known to be $\hat{\xi}$, the decision maker could simply solve the surrogate problem associated with $\hat{\xi}$. Now, the idea behind the algorithm proposed in this paper is captured by Conjecture 1.
Conjecture 1.
Let (2) (and hence (1)) have an optimal objective value of $z^{\ast}$. There exists a scenario $\hat{\xi}$ such that, if $\hat{x}$ solves the (much smaller) surrogate problem defined by $\hat{\xi}$, then $F(\hat{x}) = z^{\ast}$.
Observe that, by construction, $\hat{x}$ is feasible to the original problem in (1). Also, Conjecture 1 asserts that there exists a realization of the uncertainty ($\hat{\xi}$) such that, if one deterministically optimizes for that realization $\hat{\xi}$, then the resulting solution is optimal for the original 2SIP. Each such $\hat{\xi}$ is called a representative scenario (RS) for the given 2SIP.
Now, given adequate computing resources, one can solve the following bilevel program to obtain an RS:
\begin{align}
\min_{\hat{\xi},\, \hat{x}}\quad & c^\top \hat{x} + \sum_{s=1}^{S} p_s\, Q(\hat{x}, \xi_s) \tag{4a}\\
\text{subject to}\quad & (\hat{x}, \hat{y}) \in \arg\min_{x,\, y} \Big\{\, c^\top x + d_{\hat{\xi}}^\top y \;: \tag{4b}\\
& \qquad Ax \le b,\;\; T_{\hat{\xi}}\, x + W_{\hat{\xi}}\, y \le h_{\hat{\xi}}, \tag{4c}\\
& \qquad x \in \mathbb{Z}^{p_1} \times \mathbb{R}^{n_1 - p_1},\;\; y \in \mathbb{Z}^{p_2} \times \mathbb{R}^{n_2 - p_2} \,\Big\}. \tag{4d}
\end{align}
If the optimal value of this problem matches the optimal value of the original 2SIP, then the corresponding value of $\hat{\xi}$ forms the RS. Note that, under additional structure on the surrogate, (4) is a mixed-integer bilevel linear program (MIBLP) and can hopefully be solved faster than the general case.
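Under Assumption 2, one can also look for an RS without the bilevel machinery by simply trying every support point. The sketch below does this for a hypothetical toy 2SIP (all data invented): it solves the surrogate for each scenario and keeps those whose solutions attain the true optimal value.

```python
from itertools import product

# Brute-force search for a representative scenario (RS) over the finite
# support of a hypothetical toy 2SIP. All data are invented for illustration.
c = [3.0, 2.0]                                      # first-stage costs
scenarios = [(0.5, [1.0, 4.0]), (0.5, [5.0, 1.0])]  # (probability, d)
X = [x for x in product([0, 1], repeat=2) if x[0] + x[1] >= 1]  # feasible x

def Q(x, d):
    # second stage: min d^T y  s.t.  y_1 + y_2 >= 2 - (x_1 + x_2),  y binary
    return min(d[0] * y[0] + d[1] * y[1]
               for y in product([0, 1], repeat=2)
               if y[0] + y[1] >= 2 - (x[0] + x[1]))

def F(x):
    # true 2SIP objective (the OVF)
    return (sum(ci * xi for ci, xi in zip(c, x))
            + sum(p * Q(x, d) for p, d in scenarios))

def surrogate_solution(d):
    # solve the deterministic surrogate associated with one scenario d
    return min(X, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)) + Q(x, d))

z_star = min(F(x) for x in X)               # optimal value of the 2SIP
rs = [d for _, d in scenarios if F(surrogate_solution(d)) == z_star]
```

Restricting the search to the support is a simplification: Conjecture 1 does not require the RS to be a support point, which is exactly why the bilevel program (4) optimizes over $\hat{\xi}$ freely.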
3.2 Learning algorithm
The goal of the ML algorithms is to predict an optimal $\hat{\xi}$ for (4), given the data of the 2SIP. On the one hand, we are expecting the ML algorithms to predict the solution of a seemingly much harder optimization problem than the original 2SIP. On the other hand, the task is easier for ML since there are no constraints on the predicted variables $\hat{\xi}$. Supervised learning is a natural tool to achieve this goal.
Supervised learning can be used if there is a training dataset of problem instances and their corresponding RS. Since the RS is real-valued, predicting it can be formulated as a regression task. The algorithm tries to minimize the mean squared error (MSE) between the true and the predicted RS. The prediction can also be evaluated on the merits of optimization metrics, comparing the solution and the objective value obtained from the true and the predicted RS.
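As a minimal illustration of this regression step, the sketch below fits a linear model from instance features to a 3-dimensional RS by ordinary least squares; all data, dimensions and the noise level are synthetic inventions.

```python
import numpy as np

# Fit a linear model mapping synthetic instance features to a synthetic
# 3-dimensional RS by minimizing MSE (ordinary least squares).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                       # 200 instances, 8 features
W_true = rng.normal(size=(8, 3))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 3))   # noisy "true" RS labels

X_train, X_test = X[:160], X[160:]
Y_train, Y_test = Y[:160], Y[160:]

# closed-form least-squares fit (no intercept, for brevity)
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
mse = float(np.mean((X_test @ W - Y_test) ** 2))
```

A neural network would replace the closed-form fit with gradient descent on the same MSE objective; the input/output interface stays identical.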
4 Computational study
This section discusses the computational study performed to support Conjecture 1.
4.1 Problem definition
In this work, we consider a version of the two-stage stochastic capacitated facility location problem (SCFLP) for computational analysis. The problem is enhanced such that both the first and the second stage have integer as well as continuous variables. More precisely, the first stage consists of deciding (i) the locations where a facility has to be opened (binary decisions), and (ii) if a facility is open, the maximum demand that the facility can serve (continuous decisions). There are also constraints which impose bounds on the total number of facilities that can be opened. The uncertainty in the problem pertains to the demand values at various locations in the second stage, which are sampled from a finite distribution. Once the demand is realized, the second stage consists of deciding (i) whether a given open facility should serve the demand at a location (binary decisions), and (ii) if a facility serves the demand at a location, the quantity of demand to be served (continuous decisions). These decisions have to ensure that the demand and supply constraints are met. The problem formulation is presented formally in Section 0.A.1.
4.2 Data generation
Instance generation.
We generate 50K instances of SCFLP, with 10 facilities, 10 clients and 50 scenarios. We provide details on how the data for these instances are generated in Section 0.A.2. The generated instances are solved using Gurobi, running 2 threads, to at most a 2% gap or a 10-minute time limit.
Next, we compute an RS for each of the 50K instances. As stated earlier, one could solve the mixed-integer bilevel program (4) to obtain the RSs. However, due to the computational burden, we use heuristics that exploit the knowledge of the (nearly) optimal solutions to the 2SIP already obtained from Gurobi. These heuristics are detailed in Section 0.A.3. Out of the 50K instances, they find an RS for 49,290. We believe that a more thorough search would enable us to find an RS for all the problems.
4.3 Learning algorithm
We formulate the task of predicting the RS as a regression task. The dataset, which comprises instances and their corresponding RS, has 49,290 entries. It is split into training and test sets of size 45K and 4,290, respectively. Further, a validation set of 5K is carved out from the training set. We use linear regression (LR) and an artificial neural network (ANN) to minimize the MSE between the true and the predicted RS.
Feature engineering.
It is well known that features describing the connections between variables, constraints and other interactions help ML perform well, as opposed to just providing plain data matrices [12, 5, 8, 3]. In this spirit, along with the fixed and variable costs of opening facilities at different locations, we also provide aggregated features on the set of scenarios. These features give information about each of the potential facility locations in SCFLP, as well as about the way different locations interact through the demands at adjacent nodes. A detailed set of the features is given in Section 0.A.4.
4.4 Comparison
In order to evaluate the ML-based prediction of the RS, we compare the solution obtained by solving the surrogate problem associated with the predicted scenario against solutions obtained by various algorithms.
We use LR and ANN to predict the RS. We compare these predictions against (i) the solution obtained using Gurobi by solving (2) (GRB); (ii) the solution obtained by solving the surrogate associated with the average scenario (AVG); (iii) the solution obtained by solving the surrogate associated with a scenario chosen at random from the support (RND); (iv) the solution obtained by solving the surrogate associated with a scenario sampled at random from the distribution of the scenario predicted by LR (DIST). Note that GRB produces better solutions (in most cases) than the ML methods, however taking a significantly longer time. We therefore assess the time it takes GRB to get a solution of quality comparable to LR and ANN. We refer to these times as GRBL and GRBA, respectively.
5 Results
Table 1 reports the objective value difference ratio, defined as $(z_{\text{method}} - z_{\text{GRB}})/z_{\text{GRB}}$, for each method, and Table 2 reports statistics on computing times. Before analyzing the results in more detail, we note a key finding that emerges: LR and ANN perform almost as well as GRB (in terms of the quality of the objective value) in a fraction of the time taken by GRB. Figure 1 captures the trade-off between the quality of the solutions obtained by the different methods and the time taken to obtain them.
We observe from Table 1 that LR and ANN produce decisions that are as good as the GRB ones on average (and by the median value), and in some cases the ML-based methods even beat GRB, i.e., produce solutions whose objective value is better than that of GRB. This is possible because GRB does not necessarily solve the problem to optimality, but only up to a gap of 2%. Further, even in the worst of the 4,290 test cases, LR is at most 2.64% away from GRB. To show that this is not easily achieved, we also compare GRB against AVG, RND and DIST; these methods perform much more poorly than GRB, unlike LR and ANN.
Table 1: Objective value difference ratio (in %) of each method with respect to GRB.

GRB vs.      AVG      RND      DIST     LR      ANN
Min          3.36     0.33     0.17     0.62    0.45
Max         14.23    94.42    49.87     2.64    7.85
Avg.         8.10    12.10     5.41     0.64    1.02
Median       8.08     8.24     3.54     0.60    0.90
Std. dev.    1.59    12.11     5.39     0.41    0.70
Table 2: Statistics on computing times (in seconds).

Method       GRB       AVG      RND      DIST     GRBL      GRBA      LR       ANN
Min          0.4354    0.0041   0.0041   0.0040   0.4049    0.4268    0.0194   0.0197
Max        559.32      0.0077   0.0081   0.0065 140.43    140.43      0.0454   0.0457
Avg.         8.7713    0.0043   0.0045   0.0043   4.2650    3.0398    0.0206   0.0209
Median       2.2621    0.0043   0.0044   0.0042   1.4613    1.2898    0.0200   0.0204
Std. dev.   17.919     0.0001   0.0003   0.0001   8.5576    6.8362    0.0019   0.0019
Analyzing the time improvement of LR and ANN, we observe from Table 2 that these methods solve the SCFLP orders of magnitude faster than GRB. Indeed, GRB takes over 8 seconds on average to solve these problems, while the maximum time using LR or ANN is 0.046 seconds. We emphasize that the time taken by the ML methods includes the time elapsed in computing the feature values used by the models and the time elapsed in solving the surrogate associated with the corresponding predicted scenario. Recall that GRBL and GRBA denote the time it takes GRB to produce a solution of quality comparable to LR and ANN, respectively. The results show that GRB cannot produce a solution of the same quality as LR and ANN in a comparable time; in fact, LR and ANN are still orders of magnitude faster.
6 Discussion
In this paper, we present an algorithm to solve 2SIP using ML-based methods. The method hinges on the existence of the RS conjectured in Section 3.1. Computationally, we see that the methods proposed in this paper consistently provide good-quality solutions to SCFLP orders of magnitude faster than solving the deterministic equivalent with Gurobi.
An important observation made while training the models is that we were never able to get the training loss close to zero. Naturally, the RS predicted on the test dataset were quite different from the RS estimated using our heuristics. The differences between the predicted values of the components of the RS and those obtained using the heuristics are documented in Figure 2 in the Appendix. Despite this, the solutions to the 2SIP determined by our algorithm were near-optimal, as shown in the results, and significantly better than those obtained with the other methods. The mismatch between ML metrics and those characterizing discrete optimization problems is a known issue [3] requiring extensive research and, in our context, we believe that exploring this avenue might produce better solutions.

Another interesting observation is that LR beats ANN in this task. We suspect that this is partly caused by the parsimony offered by LR. However, this is also encouraging news, as it suggests that the sample complexity of the learning task might be relatively small in general. We believe that a natural extension of this work is to carry out these analyses more formally.
Further, we believe that computational tests assessing the performance of the algorithm on different datasets of 2SIP are crucial to show how much, and where, our method generalizes. This might involve learning solutions to other forms of 2SIP, like the stochastic unit-commitment problem or the stochastic supply chain network design problem. These are cases where we believe that Conjecture 1 still holds, but for which we do not have computational validation.
Finally, we would also be interested in extending the theory to settings where Conjecture 1 is not expected to hold at all, or holds only with weaker guarantees; for example, when both stages are mixed-integer nonlinear programming (MINLP) problems. In such cases, it will be useful to understand the reach of ML-based solution techniques as opposed to traditional MINLP solvers.
References
 Ahmed [2013] Shabbir Ahmed. A scenario decomposition algorithm for 0–1 stochastic programs. Operations Research Letters, 41(6):565–569, 2013.
 Ahmed et al. [2004] Shabbir Ahmed, Mohit Tawarmalani, and Nikolaos V Sahinidis. A finite branchandbound algorithm for twostage stochastic integer programs. Mathematical Programming, 100(2):355–377, 2004.
 Bengio et al. [2018] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv preprint arXiv:1811.06128, 2018.
 Birge and Louveaux [2011] John R Birge and Francois Louveaux. Introduction to stochastic programming. Springer Science & Business Media, 2011.
 Bonami et al. [2018] Pierre Bonami, Andrea Lodi, and Giulia Zarpellon. Learning a classification of mixedinteger quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
 Carøe and Tind [1998] Claus C Carøe and Jørgen Tind. L-shaped decomposition of two-stage stochastic programs with integer recourse. Mathematical Programming, 83(1–3):451–464, 1998.
 Dupačová et al. [2003] Jitka Dupačová, Nicole GröweKuska, and Werner Römisch. Scenario reduction in stochastic programming. Mathematical programming, 95(3):493–511, 2003.
 Gasse et al. [2019] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. arXiv preprint arXiv:1906.01629, 2019.
 Gleixner et al. [2018a] Ambros Gleixner, Michael Bastubbe, Leon Eifler, Tristan Gally, Gerald Gamrath, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Marco E. Lübbecke, Stephen J. Maher, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Franziska Schlösser, Christoph Schubert, Felipe Serrano, Yuji Shinano, Jan Merlin Viernickel, Matthias Walter, Fabian Wegscheider, Jonas T. Witt, and Jakob Witzig. The SCIP Optimization Suite 6.0. Technical report, Optimization Online, July 2018a. URL http://www.optimizationonline.org/DB_HTML/2018/07/6692.html.
 Gleixner et al. [2018b] Ambros Gleixner, Michael Bastubbe, Leon Eifler, Tristan Gally, Gerald Gamrath, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Marco E. Lübbecke, Stephen J. Maher, Matthias Miltenberger, Benjamin Müller, Marc E. Pfetsch, Christian Puchert, Daniel Rehfeldt, Franziska Schlösser, Christoph Schubert, Felipe Serrano, Yuji Shinano, Jan Merlin Viernickel, Matthias Walter, Fabian Wegscheider, Jonas T. Witt, and Jakob Witzig. The SCIP Optimization Suite 6.0. ZIBReport 1826, Zuse Institute Berlin, July 2018b. URL http://nbnresolving.de/urn:nbn:de:0297zib69361.
 Kall and Wallace [1994] Peter Kall and Stein W Wallace. Stochastic programming. Springer, 1994.
 Khalil et al. [2016] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Linderoth et al. [2006] Jeff Linderoth, Alexander Shapiro, and Stephen Wright. The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142(1):215–241, 2006.
 Louveaux and Peeters [1992] François V Louveaux and Dominique Peeters. A dualbased procedure for stochastic facility location. Operations research, 40(3):564–573, 1992.
 Lulli and Sen [2004] Guglielmo Lulli and Suvrajeet Sen. A branchandprice algorithm for multistage stochastic integer programming with application to stochastic batchsizing problems. Management Science, 50(6):786–796, 2004.
 Nair et al. [2018] Vinod Nair, Dj Dvijotham, Iain Dunning, and Oriol Vinyals. Learning fast optimizers for contextual stochastic integer programs. In UAI, pages 591–600, 2018.
 Nemirovski et al. [2009] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
 Powell and Meisel [2015a] Warren B Powell and Stephan Meisel. Tutorial on stochastic optimization in energy—part i: Modeling and policies. IEEE Transactions on Power Systems, 31(2):1459–1467, 2015a.
 Powell and Meisel [2015b] Warren B Powell and Stephan Meisel. Tutorial on stochastic optimization in energy—part ii: An energy storage illustration. IEEE Transactions on Power Systems, 31(2):1468–1475, 2015b.
 Prékopa [2013] András Prékopa. Stochastic programming, volume 324. Springer Science & Business Media, 2013.
 Santoso et al. [2005] Tjendera Santoso, Shabbir Ahmed, Marc Goetschalckx, and Alexander Shapiro. A stochastic programming approach for supply chain network design under uncertainty. European Journal of Operational Research, 167(1):96–115, 2005.
 Sen [2010] Suvrajeet Sen. Stochastic mixed-integer programming algorithms: Beyond Benders' decomposition. Wiley Encyclopedia of Operations Research and Management Science, 2010.
 Sen and Higle [2005] Suvrajeet Sen and Julia L Higle. The C3 theorem and a D2 algorithm for large scale stochastic mixed-integer programming: set convexification. Mathematical Programming, 104(1):1–20, 2005.
 Shapiro et al. [2009] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
Appendix 0.A Computational test details
0.A.1 Problem formulation
We provide below the problem considered in this work for the computational study.
\begin{align}
\min_{x,\, u}\quad & \sum_{i=1}^{n} \big( f_i x_i + c_i u_i \big) + \mathbb{E}_{\xi}\big[ Q(x, u, \xi) \big] \tag{5a}\\
\text{subject to}\quad & \lceil n/10 \rceil \le \sum_{i=1}^{n} x_i \le \lfloor 3n/4 \rfloor, \tag{5b}\\
& 0 \le u_i \le M x_i, \quad x_i \in \{0, 1\}, \qquad i = 1, \dots, n, \tag{5c}
\end{align}
where
\begin{align}
Q(x, u, \xi) \;=\; \min_{w,\, v}\quad & \sum_{i=1}^{n} \sum_{j=1}^{n} \big( g_{ij} w_{ij} + q_{ij} v_{ij} \big) \tag{5d}\\
\text{subject to}\quad & \sum_{j=1}^{n} v_{ij} \le u_i, \qquad i = 1, \dots, n, \tag{5e}\\
& \sum_{i=1}^{n} v_{ij} = d_j(\xi), \qquad j = 1, \dots, n, \tag{5f}\\
& 0 \le v_{ij} \le d_j(\xi)\, w_{ij}, \quad w_{ij} \in \{0, 1\}, \qquad i, j = 1, \dots, n. \tag{5g}
\end{align}
In this problem, we minimize the fixed and variable costs of opening a facility, along with the fixed and variable costs of transportation between the facilities and the demand nodes. There are $n$ potential locations where a facility could be opened. A fixed cost of $f_i$ is incurred if a facility is opened in location $i$, and a variable cost of $c_i$ is incurred per unit capacity of the facility opened in location $i$. The binary variable $x_i$ tracks whether a facility is opened in location $i$, and the continuous variable $u_i$ indicates the size of the facility at location $i$. The constraint in (5c), along with the binary constraint on $x_i$, ensures that the costs are incurred in the right way. Finally, (5b) is a complicating constraint, which says that at least a tenth of the locations must have a facility open and not more than three-quarters of the locations may have one.

In the second stage, $g_{ij}$ is the fixed cost incurred in transporting from location $i$ to $j$, and $q_{ij}$ is the per-unit variable cost incurred in transporting from $i$ to $j$. The binary variable $w_{ij}$ denotes whether any transportation happens from $i$ to $j$, and the continuous variable $v_{ij}$ denotes the actual quantity transported from $i$ to $j$. The second-stage objective in (5d) minimizes the transportation cost incurred under a random demand scenario parameterized by $\xi$. Then, (5e) ensures that the total quantity transported out of a facility is not greater than the capacity of the facility, while (5f) ensures that the total quantity supplied to a location $j$ equals the (random) demand at $j$. Finally, constraints (5g) link the $w$ and $v$ variables appropriately.
0.A.2 Data generation
Instance generation.
We generate 50K instances of SCFLP, with 10 facilities, 10 clients and 50 scenarios. The cost and demand parameters of the first stage vary across instances, whereas the transportation costs are fixed across all instances. The fixed and variable costs of opening a facility are sampled from discrete uniform distributions over [15, 20) and [5, 10), respectively. The demand scenarios are generated by first computing a mean demand for each client; each demand scenario is then generated by sampling from a Poisson distribution with mean equal to this value.

The generated instances are solved using Gurobi, running 2 threads, to optimality (less than 2% gap) or a 10-minute time limit. We store the objective value, the gap closed and the first-stage solution. We were able to solve all the instances to the specified gap within the specified time limit.
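The demand-generation step can be sketched as follows, under the assumption that each client's scenario demands are Poisson draws around a per-client mean; the exact recipe for the means is not reproduced here, so the range below is a placeholder.

```python
import numpy as np

# Sample scenario demands: each client gets a mean demand, and each of the
# 50 scenarios draws that client's demand from a Poisson distribution with
# that mean. The recipe for the means is an assumption for illustration.
rng = np.random.default_rng(42)
n_clients, n_scenarios = 10, 50
mean_demand = rng.integers(5, 10, size=n_clients)   # hypothetical per-client means
D = rng.poisson(lam=mean_demand, size=(n_scenarios, n_clients))
```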
We follow Algorithm 1 for generating the RS of each instance. The heuristics for updating the RS, based on the optimal and the surrogate solutions, are described in Section 0.A.3.
0.A.3 Heuristics
Let $(x^{\ast}, u^{\ast})$ and $(\hat{x}, \hat{u})$ be the first-stage optimal solution and the surrogate solution associated with the scenario $\hat{\xi}$, respectively. There are three heuristics that we use in tandem to generate the RS. The first heuristic is based on comparing which facilities are open or closed in the optimal and the surrogate solutions. The demand in the candidate RS is zeroed out at clients for which the facility is closed in the optimal solution and open in the surrogate one:
\begin{equation}
\hat{\xi}_i \leftarrow 0 \quad \text{if } x^{\ast}_i = 0 \text{ and } \hat{x}_i = 1. \tag{6}
\end{equation}
The remaining two heuristics are based on the comparison of the capacities installed at the facilities in the optimal and the surrogate solutions. First, the client with the maximum absolute difference between the capacities installed in the optimal and the surrogate solutions is identified, i.e., $i^{\ast} = \arg\max_i |u^{\ast}_i - \hat{u}_i|$. The demand at this client in the candidate RS is updated either by a fixed percentage of the current demand,
\begin{equation}
\hat{\xi}_{i^{\ast}} \leftarrow (1 \pm \alpha)\, \hat{\xi}_{i^{\ast}}, \tag{7}
\end{equation}
or by a fraction of the difference between the capacities installed in the optimal and the surrogate solutions,
\begin{equation}
\hat{\xi}_{i^{\ast}} \leftarrow \hat{\xi}_{i^{\ast}} + \beta\, \big(u^{\ast}_{i^{\ast}} - \hat{u}_{i^{\ast}}\big). \tag{8}
\end{equation}
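The three updates can be sketched as follows; the function name and the array-based notation (`x_opt`/`u_opt` for the optimal open/capacity decisions, `x_sur`/`u_sur` for the surrogate ones, `xi` for the candidate RS, `alpha`/`beta` for the tuning fractions) are assumptions for illustration.

```python
import numpy as np

# Three candidate updates to the RS demand vector, following the heuristics
# (6)-(8) described above. All names are illustrative assumptions.
def heuristic_updates(xi, x_opt, x_sur, u_opt, u_sur, alpha=0.1, beta=0.5):
    candidates = []
    # (6): zero out demand where the facility is closed in the optimal
    # solution but open in the surrogate one
    xi1 = xi.copy()
    xi1[(x_opt == 0) & (x_sur == 1)] = 0
    candidates.append(xi1)
    # client with the largest capacity mismatch
    i = int(np.argmax(np.abs(u_opt - u_sur)))
    # (7): scale that client's demand by a fixed percentage
    xi2 = xi.copy()
    xi2[i] *= (1 + alpha)
    candidates.append(xi2)
    # (8): shift it by a fraction of the capacity gap
    xi3 = xi.copy()
    xi3[i] += beta * (u_opt[i] - u_sur[i])
    candidates.append(xi3)
    return candidates
```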
0.A.4 Feature engineering
The inputs to the models are the facility opening costs $f$ and $c$, together with features extracted from the demand scenarios. We do not provide the transportation costs $g$ and $q$ in the input, as they are fixed across all instances. We do feature engineering on the demand data, instead of providing it as a raw input, to extract information which can be useful in predicting the RS. Let $D$ be an $S \times n$ matrix, where $S$ is the number of scenarios and $n$ is the number of clients. We calculate the minimum, maximum, average, standard deviation, median, and two further quantiles of $D_j$ (the $j$-th column of $D$) for each client $j$. We also find the percentage of scenarios in which some fraction $\kappa$ of the demand for a client is greater than, and less than, the demand at all the other nodes. We set $\kappa$ to different values (0.9, 1, 1.1, 1.2 and 1.5), and we thus end up with an input vector of size 190, combining $f$, $c$ and the features extracted from $D$. The feature extraction performs an aggregation over the number of scenarios.
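The per-client aggregation can be sketched as follows; the choice of the two quantile levels (0.25 and 0.75) is an assumption, as the original levels are not reproduced here.

```python
import numpy as np

# Per-client summary statistics of the S x n scenario-demand matrix D:
# min, max, mean, std, median and two quantiles per column, aggregated
# over scenarios. The quantile levels (0.25, 0.75) are an assumption.
def scenario_features(D):
    stats = [D.min(axis=0), D.max(axis=0), D.mean(axis=0), D.std(axis=0),
             np.median(D, axis=0),
             np.quantile(D, 0.25, axis=0), np.quantile(D, 0.75, axis=0)]
    return np.concatenate(stats)          # 7 features per client

rng = np.random.default_rng(0)
D = rng.poisson(lam=8, size=(50, 10))     # 50 scenarios, 10 clients
feats = scenario_features(D)
```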