Machine learning and discrete optimization have both seen a significant proliferation of research and application. These two analytical domains are frequently used in a single business decision-making problem, but for different purposes. Machine learning techniques have typically been used to predict what is likely to happen in the future, while optimization methods have been used to strategically search through feasible solutions.
However, the integration of the two analytics streams is often disjointed. As an example, suppose a neural network is built to predict customer churn for a telecommunications company, with features consisting of demographic information together with the service price offered to the customer (the lower the price, the lower the probability of customer churn). How then should the company set the service price in order to maximize revenue? The features of the predictive model are variables in the decision problem. In other words, the outputs of neural networks whose features (e.g., service price) are decision variables are part of the objective function. This makes the optimization problem of maximizing revenue particularly challenging. There are few, if any, techniques, much less tools, available for solving such an optimization problem. This paper seeks to fill that void by introducing JANOS, a modeling framework for joint predictive-and-prescriptive analytics.
JANOS allows a user to specify portions of an objective function as commonly utilized predictive models, whose features are fixed or are variables in the optimization model. JANOS currently supports predictive models of the following three forms: linear regression, logistic regression, and neural networks (NNs) with rectified linear (ReLU) activation functions. These models are commonly used in practice, and the framework is easily extensible to other predictive models.
To embed these predictive models in an optimization model, we utilize linearization techniques. For linear regression this is straightforward. For logistic regression, we employ a piecewise linear approximation; details are provided in Section 4.2. For NNs, we make use of recent work that formulates a NN as a mixed integer programming (MIP) problem (Serra et al. 2017, Bienstock et al. 2018, Fischetti and Jo 2017); we do not use the MIP to learn the NN, but rather utilize the network reformulation to produce outputs of the NN based on the input features. Details are provided in Section 4.1.
A key advantage of JANOS is that it automates the transcription of common predictive models into constraints that can be handled by mixed integer programming solvers. Thus, researchers and practitioners are relieved of the onerous task of reformulating the predictive models into tractable constraints. A different predictive model can quickly be swapped in without much effort from the modeler, enabling the user to compare the optimal decisions obtained under different predictive models. Additionally, one need not transcribe every parameter of a predictive model into the optimization model by hand; for example, a neural network with 3 hidden layers of 10 nodes each has hundreds of parameters, and JANOS automates that transformation. In future releases, additional predictive models will be added, as well as more advanced reformulations and algorithmic implementations.
The framework, which we call JANOS (a play on Janus, who according to ancient Roman mythology is the god of beginnings, gates, transitions, time, duality, and doorways, and is usually depicted with two faces, one looking to the past (predictive) and one to the future (prescriptive); source: Wikipedia), is built in Python and calls Gurobi (Gurobi Optimization 2019) to solve MIPs. A user specifies an optimization model through standard modeling constructs that share similarities with those in other common optimization modeling languages, for example Gurobi's Python interface, Pyomo (Hart et al. 2011), or Julia (Bezanson et al. 2017).
We partition the variables in the model into two sets: the regular variables and the predicted variables. The regular variables model operational constraints and objective function terms, as typical variables in a MIP. The predicted variables are specified through predictive models wherein some of the features depend on regular variables. The predictive models are pre-trained by the user, who can load any of the three permissible predictive model forms, together with a mapping between the regular variables and the features. We eventually plan to integrate automated machine learning (Feurer et al. 2015) and have JANOS determine the best predictive model to associate with a given data frame. To exhibit how the framework can be used, we present as an example the allocation of scholarship offers to admitted students in order to maximize expected enrollment.
The rest of the paper is organized as follows. We first review the literature related to the joint reasoning in predictive and prescriptive analysis in Section 2. The general problem addressed by JANOS is provided in Section 3. The algorithmic details of how we optimize over linear regression models, logistic regression models and NNs are given in Section 4. The student enrollment example and a collection of experiments designed to test the efficiency of JANOS are described in Section 5. We then describe how JANOS can be downloaded and installed in Section 6. We conclude in Section 7.
2 Literature Review
Existing studies on the combination of predictive and prescriptive analytics take predictions as fixed and then make choices based on those fixed predictions; for example, the predictions become parameters in an optimization model (Ferreira et al. 2015).
Ferreira et al. (2015) first predict sales based on a chosen number of price values and then use those fixed estimates to determine the optimal price.
In their application, the predicted sales are parameters in the optimization model. This modeling approach adapts predictions to decisions, but still falls short of full integration between the two analytics disciplines. Sales will generally depend on price, and analytical methods by which one can optimize decisions that ultimately affect the objective function are currently lacking in the literature.
Approaches using fixed-point estimates of parameters are feasible when full enumeration or partial enumeration of the collection of feasible solutions is practical.
However, in instances where enumeration is not possible, facets of the optimization algorithm need to be integrated into the predictive model. For example, Huang et al. (2019) model a real-world problem in such a way, but only propose an exact optimization approach when simple linear regression models are used for prediction.
There are additional streams of research that combine predictive modeling and optimization.
First of all, machine learning algorithms are powered by optimization techniques. For example, when fitting a simple linear regression model, the ordinary least squares method determines the unknown parameters by minimizing the sum of squared differences between the observed values of the dependent variable and the fitted values.
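For instance, the least-squares fit of a simple linear regression has a closed form; a minimal sketch in pure Python (the toy data below are made up):

```python
# Ordinary least squares for simple linear regression y ~ a + b*x:
# minimize sum_i (y_i - a - b*x_i)^2, which has a closed-form minimizer.

def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Toy data lying exactly on y = 1 + 2x, so OLS recovers a = 1, b = 2.
a, b = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)  # → 1.0 2.0
```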
In machine learning, various optimization techniques are applied so that the learning process is efficient and achieves desired accuracy (Boyd et al. 2011).
Koh et al. (2007) propose an interior-point method for solving large-scale l1-regularized logistic regression problems;
ALAMO (Cozad et al. 2014) uses mixed integer optimization to learn algebraic models from data;
and linear programming (LP) and integer programming (IP) based methods have been proposed to assist in training NNs (Serra et al. 2017, Bienstock et al. 2018, Fischetti and Jo 2017). Additionally, there have been recent efforts on using machine learning to improve optimization algorithms. For example, Cappart et al. (2019) utilize deep reinforcement learning to improve optimization bounds, and Khalil et al. (2017) propose a generic method for learning combinatorial optimization over graphs, with a plethora of papers arising in this area (Nazari et al. 2018, Lemos et al. 2019, Bengio et al. 2018). These papers either leverage machine learning to improve optimization, or leverage optimization to improve machine learning; in our setting, we integrate the two into a unified decision-making framework.
The impact of the JANOS solver on research is twofold. First, framing a problem as optimizing over standard predictive models where the decision variables are features of the predictive models requires much attention. Suppose the predictive model is a support vector regression with a radial kernel. How does one solve the optimization problem? Each predictive model added to JANOS will require research effort to investigate linearizations or advanced optimization methodology that lead to efficient solution times.
More broadly, we envision that JANOS will be used by researchers in fields other than optimization to solve the problems they are interested in to derive the insights they desire. Currently, a researcher familiar with optimization who has a machine learning model that they would like to optimize over is unable to solve the problem of interest. Contrast that with a researcher in logistics that requires the solution of a routing problem to identify how many trucks are needed by the company. This researcher can use a standard commercial integer programming solver, solve the routing problem to identify the number of trucks that are needed by the company, and then make decisions based on that output. The logistics researcher need not know how the optimization model is solved, just that the problem of interest can be solved. We envision the same thing for the former researcher. JANOS allows the researcher to solve an optimization problem with machine learning models so that optimal decisions can be identified for real-world decision making problems that previously could not.
3 Problem Description
JANOS seeks to solve problems formulated as:

  max  c^T x + d^T y
  s.t. Ax <= b
       y_i = g_i(alpha_1^i, ..., alpha_{p_i}^i)   for i = 1, ..., m
       x_j in D_j                                 for j = 1, ..., n

The variables x_1, ..., x_n are regular variables and the variables y_1, ..., y_m are predicted variables. Each regular variable x_j belongs to a finite or continuous set D_j, and the regular variables are jointly constrained via linear inequalities Ax <= b. Each predicted variable y_i is associated with a predictive model g_i, with features alpha_1^i, ..., alpha_{p_i}^i. Each g_i is assumed to be a pre-trained predictive model (a linear regression, logistic regression, or neural network with a rectified linear activation function), so that its parameters are fit prior to optimization by the user. Model g_i has p_i features. The first q_i features of predictive model g_i are fixed and given, while each of the remaining p_i - q_i features is a regular variable, linked through a binary unit vector of length n with a 1 in the coordinate of the associated regular variable. Note that a single pre-trained model can be used as multiple g_i's.
4 Algorithmic Details
In this section, we summarize how JANOS handles linear regression, logistic regression, and NN models.
If y_i is determined by a linear regression model, the function g_i is linear and the construction is straightforward. We construct the model directly and feed it to Gurobi.
If y_i is the predicted value of a NN, y_i is obtained from a network flow model. The details are in Section 4.1.
If y_i is the predicted value of a logistic regression model, we partition the range of the log-odds (the linear function of the features inside the logistic function) into a collection of intervals, and use the mean value of the logistic function within the corresponding interval to approximate y_i in the objective function. The details are in Section 4.2.
4.1 Optimization over Neural Networks
For every predicted variable that is determined by a NN prediction, we associate a distinct network flow model. A NN can be viewed as a multi-source, single-terminal, arc-weighted acyclic layered digraph N = (V, A). (A NN does not always have a single terminal; in our case we have only one output, so the output layer of our trained NN has a single node.) The node set V is partitioned into a collection of layers V_1, ..., V_L. There is a one-to-one mapping between the input features of the predictive model and the nodes in V_1; for any node v in V_1, we denote the corresponding feature by alpha(v). Set V_L consists of a single terminal node t. For every layer other than the input layer, each node v has a bias B(v) learned during training. Each arc a in A is directed from a node in layer V_l to a node in layer V_{l+1} for some l. Every arc a has a weight w(a) learned during training.
Given values alpha(v) for all nodes v in V_1, the prediction from a NN with a ReLU activation function is calculated recursively by assigning a value val(v) to every node in the NN via the following iterative procedure and returning val(t):

  for v in V_1:  val(v) = alpha(v);
  for l = 2, ..., L and v in V_l:  in(v) = B(v) + sum over a = (u, v) in A of w(a) * val(u),
  with val(v) = max{0, in(v)} at hidden-layer nodes and val(t) = in(t) at the terminal.

Here in(v) is the input of the ReLU function at node v, and val(v) is its output.
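This forward pass can be sketched in a few lines; the toy weights and biases below are made up, and, consistent with the formulation in this section, the ReLU is not applied at the terminal node:

```python
# Forward pass of a trained ReLU network stored as per-layer weight matrices
# and bias vectors. weights[l][j][k] is the arc weight from node k of layer l
# to node j of layer l+1; biases[l][j] is the bias B(v) of node j of layer l+1.

def relu_forward(features, weights, biases):
    vals = list(features)  # values val(v) of the input layer V_1
    for i, (W, B) in enumerate(zip(weights, biases)):
        ins = [b + sum(w * v for w, v in zip(row, vals)) for row, b in zip(W, B)]
        if i == len(weights) - 1:
            vals = ins  # no ReLU at the terminal node
        else:
            vals = [max(0.0, s) for s in ins]  # ReLU on hidden layers
    return vals[0]  # value of the single terminal node t

# Toy 2-input network with one hidden layer of two nodes (weights made up).
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]
W2, B2 = [[1.0, 2.0]], [0.1]
out = relu_forward([1.0, 0.5], [W1, W2], [B1, B2])
print(out)  # → 1.6
```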
We further define z_v as a binary variable indicating whether in(v) is positive. With this interpretation, one can formulate the following model (MOD-NN) to calculate val(t) based on the inputs alpha(v) for v in V_1, which are the features and can be fixed constants or functions of the decision variables.
Constraints (6) guarantee that the ReLU activation function is not enforced in the input and output layers. Constraints (7) enforce that the values of the nodes in the input layer are the values of the input features of the predictive model; these will either be fixed constants or the value determined by the optimization model for a single problem variable. Constraints (8) compute the input of the ReLU function of each node that is not in the first layer. Constraints (9) to (11) enforce the ReLU activation function on each hidden layer.
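The constraint numbering above refers to the displayed formulation, which is not reproduced in this text. One standard big-M encoding matching that description, with M a valid upper bound on |in(v)|, can be sketched as:

```latex
\begin{align*}
& \mathrm{val}(v) = \mathrm{in}(v) && v \in V_1 \cup V_L \\
& \mathrm{in}(v) = \alpha(v) && v \in V_1 \\
& \mathrm{in}(v) = B(v) + \textstyle\sum_{a=(u,v) \in A} w(a)\,\mathrm{val}(u) && v \in V \setminus V_1 \\
& \mathrm{val}(v) \ge \mathrm{in}(v), \quad \mathrm{val}(v) \ge 0 && v \in V \setminus (V_1 \cup V_L) \\
& \mathrm{val}(v) \le \mathrm{in}(v) + M(1 - z_v) && v \in V \setminus (V_1 \cup V_L) \\
& \mathrm{val}(v) \le M z_v, \quad z_v \in \{0, 1\} && v \in V \setminus (V_1 \cup V_L)
\end{align*}
```

When in(v) > 0, the constraints force z_v = 1 and val(v) = in(v); when in(v) < 0, they force z_v = 0 and val(v) = 0, reproducing val(v) = max{0, in(v)}.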
In the overall model, each input feature alpha(v) can be a regular variable or a constant, and there will be a separate network flow formulation for each of the predicted variables that are outcomes of neural networks. We test the impact of the size of the neural network and of the number of such predicted variables on the performance of JANOS in Section 5.
4.2 Optimization over Logistic Regression Models
JANOS provides a parameterized discretization for handling logistic regression prediction. Specifically, it represents the nonlinear function of a logistic regression model using a piecewise linear approximation, partitioning the range of the log-odds (the linear function of the features inside the logistic function) into a collection of mutually exclusive and collectively exhaustive intervals. The number of intervals is a parameter that can be specified by users. This idea is illustrated in Figure 1.
We partition the range of the log-odds into r intervals [theta_l, theta_{l+1}], for l = 1, ..., r, where theta_1 < theta_2 < ... < theta_{r+1}, and we assume that the length of each interval is uniform. The range of the log-odds is computed based on the bounds of the features. We use the mean of the function value within an interval to serve as a piecewise linear approximation of the actual predicted value of the logistic regression model. Specifically, for interval l, let

  F_l = (1 / (theta_{l+1} - theta_l)) * integral from theta_l to theta_{l+1} of 1 / (1 + e^{-t}) dt.

The value F_l is the average value of the logistic function over all values between theta_l and theta_{l+1}, where the integral evaluates in closed form to ln(1 + e^{theta_{l+1}}) - ln(1 + e^{theta_l}).
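The interval means therefore have a closed form, since the antiderivative of the logistic function is the softplus ln(1 + e^t). A minimal sketch, with a hypothetical log-odds range of [-6, 6] and 10 intervals:

```python
import math

def interval_means(lo, hi, r):
    """Mean of sigma(t) = 1/(1 + e^{-t}) on each of r uniform intervals of [lo, hi]."""
    def softplus(t):
        return math.log1p(math.exp(t))
    # integral of sigma(t) dt = softplus(t), so the mean over [a, b] is
    # (softplus(b) - softplus(a)) / (b - a).
    edges = [lo + (hi - lo) * l / r for l in range(r + 1)]
    return [(softplus(b) - softplus(a)) / (b - a) for a, b in zip(edges, edges[1:])]

means = interval_means(-6.0, 6.0, 10)
# Like sigma itself, the means lie in (0, 1) and increase with the interval.
print([round(m, 3) for m in means])
```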
We define z_l as a binary variable indicating whether we select a value for the log-odds in interval l. Let alpha be the vector of features and beta the vector of estimated coefficients of the logistic regression model. With this interpretation, one can formulate a model that maximizes over a logistic regression model approximately, using the following transformation:
The value assumed by the predicted variable y will be approximately equal to 1 / (1 + e^{-beta^T alpha}), the value predicted by the logistic regression model.
Constraint (15) ensures that exactly one interval is selected. Constraints (16) to (17) select the interval that contains the linear combination beta^T alpha. Constraints (18) to (19) ensure that y equals the mean outcome value F_l of the selected interval. Recall that alpha is determined partially through fixed features and partially through decision variables.
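Constraints (15) to (19) are referenced but not displayed in this text; a formulation matching their description, with M a sufficiently large constant, can be sketched as:

```latex
\begin{align*}
& \textstyle\sum_{l=1}^{r} z_l = 1 \\
& \beta^\top \alpha \ge \theta_l - M(1 - z_l) && l = 1, \dots, r \\
& \beta^\top \alpha \le \theta_{l+1} + M(1 - z_l) && l = 1, \dots, r \\
& y \le F_l + M(1 - z_l) && l = 1, \dots, r \\
& y \ge F_l - M(1 - z_l) && l = 1, \dots, r \\
& z_l \in \{0, 1\} && l = 1, \dots, r
\end{align*}
```

Whenever z_l = 1, the big-M terms vanish: the log-odds beta^T alpha must lie in [theta_l, theta_{l+1}] and y is pinned to F_l; for the unselected intervals the constraints are slack.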
5 Example Applications
In this section, we explore an example of allocating scholarship offers to admitted college students to exhibit the capability and flexibility of JANOS. All predictive models were built in Python 3.7 using scikit-learn 0.21.3 (Pedregosa et al. 2011), and all optimization models were solved with Gurobi Optimizer v8.0 (Gurobi Optimization 2019). All experiments were run on macOS Mojave 10.14.5 on a 2.8 GHz Intel Core i7-4980HQ processor with 16 GB RAM.
5.1 Problem Description
The Admission Office of a university wants to offer scholarships to its admitted students in order to bolster the class profile, for example by improving academic metrics, and often simply to maximize the expected class size (Maltz et al. 2007).
The Admission Office has collected, from previous enrollment years, each applicant's SAT, GPA, scholarship offer, and matriculation result, i.e., whether the student accepted the offer or not. This year, suppose the school is issuing a given number of offers of admission, and suppose the total budget available for scholarships, denoted BUDGET henceforth, is fixed. The amount of scholarship that can be assigned to any particular applicant is between $0 and $25,000. The Admission Office wants to maximize the incoming class size.
To solve this optimization problem using JANOS, one can pre-train a model to predict the probability of a candidate accepting an offer given the student's SAT, GPA, and scholarship offered. The decision to make is the third feature: the amount of scholarship to offer each student.
We model this problem with the following notation:
- x_i is the decision variable, i.e., the amount of scholarship offered to admitted student i;
- SAT_i is the SAT score of applicant i (standardized using z-scores);
- GPA_i is the GPA of applicant i (standardized using z-scores);
- y_i is a predicted variable for each admitted student i: the outcome of a predictive model representing the probability that the candidate accepts the offer; and
- g is a predictive model pre-trained to predict any candidate's probability of accepting an offer. The quantities SAT_i, GPA_i, and x_i are the predictive model's inputs. The vector theta represents the parameters of the predictive model, which we assume to be the same for each applicant. The function g can be any of the permissible predictive models, with theta determined prior to optimization.
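With this notation, and n denoting the number of admitted students, the enrollment-maximization model can be sketched as:

```latex
\begin{align*}
\max\ & \sum_{i=1}^{n} y_i \\
\text{s.t.}\ & \sum_{i=1}^{n} x_i \le \mathrm{BUDGET} \\
& y_i = g\big(\mathrm{SAT}_i, \mathrm{GPA}_i, x_i; \theta\big) && i = 1, \dots, n \\
& 0 \le x_i \le 25{,}000 && i = 1, \dots, n
\end{align*}
```

The objective sums the predicted enrollment probabilities, so its optimal value is the expected incoming class size.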
5.2 Experimental Results
We utilize randomly generated realistic student records to train predictive models and test the efficiency of the solver when solving different-sized problems with variations in parameters as well. We build the three permissible models (linear regression, logistic regression, and neural networks) with various parameter settings, i.e., the number of intervals for logistic regression models and the hidden layer sizes for neural networks. Each of the models is trained on 20,000 randomly generated student records.
We then generate sets of random student records of four sizes (50, 100, 500, and 1,000 admitted students) to test the scalability of JANOS. These experimental instances are produced using the same data-generating scheme as was used for building the training set. We document how long it takes JANOS to solve problems of different sizes with different predictive models.
We first provide an analysis of the total runtime using various parameters for each of the models. For the linear regression there are no configurable parameters, and so we have only one setting, LinReg. For the logistic regression model, the main parameter of interest is the number of intervals in the discretization; for a fixed number of intervals r, let LogReg(r) refer to solving the model with logistic regression prediction with r intervals. For neural networks, there are several parameters one might tune, the most apparent being the configuration of the neurons. We fix 10 nodes per hidden layer and vary the number of hidden layers h over 1, 2, and 3; let NN(h) refer to solving the model with a neural network with h hidden layers.
For each of the predictive model specifications, we run 5 instances and take the mean runtime. Figure 2 depicts the runtimes. On the x-axis we indicate the number of admitted students in the admitted pool; on the y-axis we report the runtime in seconds. LinReg yields the most efficient model, taking up to a second to solve. LogReg(r) takes an increasing amount of time as r grows, as does NN(h) for increasing h, but the runtimes are not prohibitively large. Note that the largest instances have 1,000 logistic regression approximations or 1,000 neural network flow models.
We also evaluate how well the approximation of the logistic regression performs at obtaining optimal solutions. We apply the logistic regression approximation to 10 instances with 50 students. The results are reported in Figure 3. On the x-axis we indicate the number of intervals in the approximation, and on the y-axis we report the root mean squared error (RMSE) between the probability estimates given by the approximation and the actual learned logistic regression evaluated at the optimal solutions obtained by the approximation. As the number of intervals increases, the approximation becomes stronger but, as discussed earlier, runtime increases. Note that even with only 5 intervals, the average error in the estimated probability of enrollment is less than 0.02, or 2%.
In order to evaluate the expected improvement in solutions obtained from JANOS over what might be done in practice, we compare the solution obtained by JANOS with the following heuristic, which can be employed for any predictive model for this application:
- Sort the accepted students in order of their sensitivity to scholarship, most sensitive first.
- Following this order, allocate the maximum permissible aid (i.e., $25,000) to each student in turn until the budget is exhausted.
This is a realistic heuristic because it greedily assigns scholarship to the students in the order of those that are most sensitive to scholarship.
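A minimal sketch of this heuristic (the student IDs and sensitivity scores below are hypothetical; in practice, the sensitivities would come from the trained predictive model):

```python
def greedy_allocate(students, budget, max_aid=25000.0):
    """students: list of (student_id, sensitivity) pairs, where sensitivity
    measures how much scholarship raises the predicted enrollment probability.
    Greedily gives max_aid to the most sensitive students until budget runs out."""
    allocation = {sid: 0.0 for sid, _ in students}
    remaining = budget
    for sid, _ in sorted(students, key=lambda s: -s[1]):  # most sensitive first
        if remaining <= 0:
            break
        aid = min(max_aid, remaining)  # last funded student may get a partial award
        allocation[sid] = aid
        remaining -= aid
    return allocation

# Three students, $40,000 budget: the two most sensitive get aid, the second
# of them only the remaining $15,000.
alloc = greedy_allocate([("a", 0.10), ("b", 0.30), ("c", 0.20)], 40000.0)
print(alloc)  # → {'a': 0.0, 'b': 25000.0, 'c': 15000.0}
```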
Table 1 reports results from the experimental evaluation. In particular, for the logistic regression and neural network prediction models, we report, for 500 and 1,000 admitted students, the expected number of enrolled students based on the allocation determined by JANOS and by the heuristic described above. We also report, for both models and for each instance size, the percent reduction in admitted students declining admission. The results indicate that simply by a more careful assignment of scholarships, and making no other changes, JANOS can provide a substantial improvement in expected matriculation rates. This example demonstrates the improvement in decision-making capability that JANOS can deliver.
[Table 1: expected enrollment under JANOS versus the heuristic, and expected percent reduction in declination, for logistic regression and neural network prediction.]
There are other ways that a practitioner who is well-versed in both optimization and predictive modeling might address a decision problem of this sort. For example, in this application, one could discretize the domain of the decision variables to a finite set of values and then evaluate the predictive model at each value for each admitted student. However, such a transformation results in a pseudo-polynomial-size model and admits only an approximation. Note that, if desired, one can model the problem this way directly in JANOS by declaring the decision variables as discrete and setting their domain to the chosen finite set.
6 Accessing the Solver
JANOS works with Python3 and currently requires Gurobi for optimization and sklearn for predictive modeling. You also must have numpy and matplotlib installed. Please refer to JANOS’s website (http://janos.opt-operations.com) for more information, where a user manual, quick start guide, and examples are provided.
7 Conclusion

We propose a modeling framework, JANOS, that integrates predictive modeling and prescriptive analytics. JANOS is a useful tool both for practitioners and for researchers who seek to integrate machine learning models within a discrete optimization model.
References

- Bengio et al. (2018). Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128.
- Bezanson et al. (2017). Julia: a fresh approach to numerical computing. SIAM Review 59(1), pp. 65–98.
- Bienstock et al. (2018). Principled deep neural network training through linear programming. arXiv preprint arXiv:1810.03218.
- Boyd et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), pp. 1–122.
- Cappart et al. (2019). Improving optimization bounds using machine learning: decision diagrams meet deep reinforcement learning. arXiv e-prints.
- Cozad et al. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal 60(6), pp. 2211–2227.
- Ferreira et al. (2015). Analytics for an online retailer: demand forecasting and price optimization. Manufacturing & Service Operations Management 18(1), pp. 69–88.
- Feurer et al. (2015). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pp. 2962–2970.
- Fischetti and Jo (2017). Deep neural networks as 0-1 mixed integer linear programs: a feasibility study. arXiv preprint arXiv:1712.06174.
- Gurobi Optimization (2019). Gurobi optimizer reference manual.
- Hart et al. (2011). Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation 3(3), p. 219.
- Huang et al. (2019). Predictive and prescriptive analytics for location selection of add-on retail products. Production and Operations Management 28(7), pp. 1858–1877.
- Khalil et al. (2017). Learning to run heuristics in tree search. In IJCAI, pp. 659–666.
- Koh et al. (2007). An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research 8, pp. 1519–1555.
- Lemos et al. (2019). Graph colouring meets deep learning: effective graph neural network models for combinatorial problems. arXiv preprint arXiv:1903.04598.
- Maltz et al. (2007). Decision support for university enrollment management: implementation and experience. Decision Support Systems 44(1), pp. 106–123.
- Nazari et al. (2018). Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9839–9849.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Serra et al. (2017). Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114.