1 Introduction
A decision tree is a form of predictive model used for predicting a dependent variable $Y$ using a collection of independent variables $X_1, \dots, X_n$. To make a prediction, we start at the root of the tree and check a query (e.g., "Is $X_i \leq a$?"); depending on the answer to the query (true/false), we proceed to another node. We then check the new node's query; the process continues until we reach a leaf node, where the tree outputs a prediction. A generalization of this type of model, called a tree ensemble model, involves making this type of prediction from each of a collection of trees and aggregating the individual predictions into a single prediction (for example, by taking a weighted average of the predictions in a regression setting, or by taking a majority vote of the predictions in a classification setting). Many types of tree ensemble models have been proposed in the machine learning literature; the most notable examples are random forests and boosted trees. Tree ensemble models are extremely attractive due to their ability to model complex, nonlinear relationships between the independent variables and the dependent variable. As a result, tree ensemble models have gained widespread popularity in a number of application areas; examples include chemistry (Svetnik et al. 2003), genomics (Díaz-Uriarte and De Andrés 2006), ecology (Elith et al. 2008, Cutler et al. 2007), economics (Varian 2014, Bajari et al. 2015), marketing (Lemmens and Croux 2006) and operations management (Ferreira et al. 2015).
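To make the prediction mechanism concrete, the following sketch (our own illustration, not code from the paper; the dict-based tree representation and function names are hypothetical) traverses each tree from the root to a leaf and then aggregates the per-tree predictions by a weighted average, as in the regression setting described above.

```python
# Illustrative sketch (not from the paper): a regression tree as nested dicts,
# and an ensemble prediction as a weighted average of per-tree predictions.

def predict_tree(node, x):
    """Traverse a tree: at each split, go left if the query is true, else right."""
    while "leaf" not in node:
        if x[node["var"]] <= node["threshold"]:  # split query "Is x_i <= a?"
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

def predict_ensemble(trees, weights, x):
    """Aggregate the individual tree predictions (here: weighted average)."""
    total_w = sum(weights)
    return sum(w * predict_tree(t, x) for t, w in zip(trees, weights)) / total_w

# A toy two-tree ensemble over a single numeric variable x[0].
tree1 = {"var": 0, "threshold": 5.0,
         "left": {"leaf": 1.0}, "right": {"leaf": 3.0}}
tree2 = {"var": 0, "threshold": 2.0,
         "left": {"leaf": 0.0}, "right": {"leaf": 2.0}}

print(predict_ensemble([tree1, tree2], [1.0, 1.0], {0: 4.0}))  # (1.0 + 2.0)/2 = 1.5
```

For a classification ensemble, the aggregation step would instead take a majority vote over the per-tree predicted classes.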
In many applications of tree ensemble models, and predictive models in general, the independent variables that are used for prediction are exogenous and beyond our control as the decision maker. For example, one might build a random forest model to predict whether a patient is at risk of developing a disease based on the patient’s age, blood pressure, family history of the disease and so on. Clearly, features like age and family history are not amenable to intervention. Such models are typically used for some form of post hoc action or prioritization. In the disease risk example, one can use the random forest model to rank patients by decreasing predicted risk of an imminent acute event, and this ranking can be used to determine which patients should receive some intervention (e.g., closer monitoring, treatment with a particular drug and so on).
In an increasing number of predictive modeling applications, however, some of the independent variables are controllable; that is to say, those independent variables are also decision variables. We provide a couple of examples:

Design of drug therapies. In a healthcare context, one might be interested in building a model to predict patient response given a particular drug therapy, and then using such a model to find the optimal drug therapy. The dependent variable is some metric of the patient’s health, and the independent variables may specify the drug therapy (which drugs and in what doses) and characteristics of the patient. A recent example of such an approach is the paper of Bertsimas et al. (2016b)
that considers the design of clinical trials for combination drug chemotherapy for gastric cancer; the first step involves estimating a model (specifically, a ridge regression model) of patient survival and toxicity using information about the drug therapy, and the second step involves solving an optimization problem to find the drug therapy that maximizes the predicted survival of the given patient group subject to a constraint on the predicted toxicity.

Pricing/promotion planning. In marketing and operations management, a fundamental problem is that of deciding which products should be promoted when and at what price. In such a context, the data might consist of weekly sales of each product in a category, and the prices of the products in that category during that week and in previous weeks; one might use this data to build a predictive model of demand as a function of the prices, and then optimize such a model to decide on a promotion schedule (for a recent example see, e.g., Cohen et al. 2017).
In this paper, we seek to unlock the prescriptive potential of tree ensemble models by considering the problem of tree ensemble optimization. This problem is concerned with the following question: given a tree ensemble model that predicts some quantity of interest $Y$ using a set of controllable independent variables $X_1, \dots, X_n$, how should we set the independent variables so as to maximize the predicted value of $Y$? This problem is of significant practical interest because it allows us to leverage the high accuracy afforded by tree ensemble models to obtain high-quality decisions. At the same time, the problem is also challenging: it is not obvious how to solve it, due to the highly nonlinear and large-scale nature of tree ensemble models.
We make the following contributions:

We propose the problem of tree ensemble optimization, and we show how to formulate this problem as a finite mixed-integer optimization (MIO) problem. The formulation can accommodate independent variables that are discrete, categorical variables as well as continuous, numeric variables. To the best of our knowledge, the problem of optimizing an objective function described as the prediction of a tree ensemble has not been previously proposed in either the machine learning or operations research communities.

From a theoretical standpoint, we develop a number of results that concern the tractability of the formulation. First, we prove that the tree ensemble optimization problem is in general NP-Hard. We then show that our proposed MIO formulation offers a tighter relaxation of the problem than an alternate MIO formulation, obtained by applying a standard linearization to a binary polynomial formulation of the problem. We develop a hierarchy of approximate formulations for the problem, obtained by truncating each tree in the ensemble to a depth $d$ from the root node. We prove that the objective value of such an approximate formulation is an upper bound that improves as $d$ increases, and show how to construct a complementary a priori lower bound that depends on the variability of each tree's prediction below the truncation depth $d$.

From a solution methodology standpoint, we present two different algorithms for tackling large-scale instances of our MIO formulation. The first is based on solving a Benders reformulation of the problem using constraint generation. Here, we analyze the structure of the Benders subproblem and show that it can be solved efficiently. The second is based on applying lazy constraint generation directly to our MIO formulation. For this approach, we propose an efficient procedure for identifying violated constraints, which involves simply traversing each tree in the ensemble, and we prove its correctness.

We evaluate the effectiveness of our formulation and solution methods computationally using an assortment of real data sets. We show that the full MIO formulation can be solved to full optimality for small- to medium-sized instances within minutes, and that our formulation is significantly stronger, in terms of relaxation bound and solution time, than the aforementioned alternate formulation. We also show that our approach often significantly outperforms a simple local search heuristic that does not guarantee optimality. Lastly, we show that our customized solution methods can drastically reduce the solution time of our formulation.

We provide a deeper showcase of the utility of our approach in two applications. The first is a case study in drug design, using a publicly available data set from Merck Research Labs (Ma et al. 2015). Here, we show that our approach can optimize large-scale tree ensemble models with thousands of independent variables to full or near optimality within a two-hour time limit, and can be used to construct a Pareto frontier of compounds that efficiently trade off predicted performance and similarity to existing, already-tested compounds. The second is a case study in customized pricing using a supermarket scanner data set (Montgomery 1997). Here, we show that a random forest model leads to considerable improvements in out-of-sample prediction accuracy over two state-of-the-art models based on hierarchical Bayesian regression, and that our optimization approach can be used to efficiently determine provably optimal prices at the individual store level within minutes.
The rest of the paper is organized as follows. In Section 2, we survey the literature related to this work. In Section 3, we present our formulation of the tree ensemble optimization problem as an MIO problem, and provide theoretical results on the structure of this problem. In Section 4, we present two solution approaches for large-scale instances of the tree ensemble optimization problem. In Section 5, we present the results of our computational experiments with real data sets, and in Sections 6 and 7 we present the results of our two case studies in drug design and customized pricing, respectively. In Section 8, we conclude.
2 Literature review
Decision tree models became popular in machine learning with the introduction of two algorithms, ID3 (iterative dichotomiser; see Quinlan 1986) and CART (classification and regression tree; see Breiman et al. 1984)
. Decision tree models gained popularity due to their interpretability, but were found to be generally less accurate than other models such as linear and logistic regression. A number of ideas were consequently proposed for improving the accuracy of decision tree models, all of which are based on constructing an ensemble of tree models. Breiman (1996) proposed the idea of bootstrap aggregation, or bagging, where one builds a collection of predictive models, each trained with a bootstrapped sample of the original training set; the predictions of each model are then aggregated into a single prediction (for classification, this is by majority vote; for regression, this is by averaging). The motivation for bagging is that it reduces the prediction error for predictive models that are unstable/highly sensitive to the training data (such as CART); indeed, Breiman (1996) showed that bagged regression trees can be significantly better than ordinary regression trees in out-of-sample prediction error. Later, Breiman (2001) proposed the random forest model, where one builds a collection of bagged CART trees for which the subset of features selected for splitting at each node of each tree is randomly sampled from the set of all features
(the so-called random subspace method; see Ho 1998). Concurrently, a separate stream of literature has considered the idea of boosting (Schapire and Freund 2012), wherein one iteratively builds a weighted collection of basic predictive models (such as CART trees), with the goal of reducing the prediction error with each iteration.

Tree ensembles occupy a central place in machine learning because they generally work very well in practice. In a systematic comparison of 179 different prediction methods on a broad set of benchmark data sets, Fernández-Delgado et al. (2014) found that random forests achieved best or near-best performance over all of these data sets. Boosted trees have been similarly successful: on the data science competition website Kaggle, one popular implementation of boosted trees, XGBoost, was used in more than half of the winning solutions in the year 2015 (Chen and Guestrin 2016). There exist robust and open-source implementations of many tree ensemble models. For boosted trees, the R package gbm (Ridgeway 2006) and XGBoost are widely used; for random forests, the R package randomForest (Liaw and Wiener 2002) is extremely popular.

At the same time, there has been a significant effort in the machine learning research community to develop a theoretical foundation for tree ensemble methods; we briefly survey some of this work for random forests. The original paper of Breiman (2001) developed an upper bound on the generalization error of a random forest. Later research studied the consistency of both simplified versions of the random forest model (for example, Biau et al. 2008) as well as the original random forest model (for example, Scornet et al. 2015). Recently, Wager and Walther (2015) showed that certain forms of regression trees and random forests converge uniformly over the feature space to the true regression function, while Wager and Athey (2015) considered how to use random forests for causal inference. For an excellent overview of recent theoretical advances in random forests, the reader is referred to Biau and Scornet (2016).
At the same time, there is an increasing number of papers originating in operations research where a predictive model is used to represent the effect of the decision, and one solves an optimization problem to find the best decision with respect to this predictive model. Aside from the papers already mentioned in clinical trials and promotion planning, we mention a few other examples. In pricing, data on historical prices and demand observed at those prices is often used to build a predictive model of demand as a function of price and to then determine the optimal price (see for example Besbes et al. 2010, Bertsimas and Kallus 2016). In assortment optimization, the paper of Bertsimas and Mišić (2016) considers the problem of selecting an assortment (a set of products) given data on historical assortments; the paper proposes an approach based on first estimating a ranking-based model of choice, and then solving an MIO model to find an assortment that maximizes the expected revenue predicted by this ranking-based model of choice.
In the research literature where predictive models are used to understand and subsequently optimize decisions, the closest paper conceptually to this one is the paper of Ferreira et al. (2015). This paper considers the problem of pricing weekly sales for an online fashion retailer. To solve the problem, the paper builds a random forest model of the demand of each style included in a sale as a function of the style’s price and the average price of the other styles. The paper then formulates an MIO problem to determine the optimal prices for the styles to be included in the sale, where the revenue is based on the price and the demand (as predicted by the random forest) of each style. The MIO formulation of Ferreira et al. (2015) does not explicitly model the random forest prediction mechanism – instead, one computes the prediction of the random forest for each style at each of its possible prices and at each possible average price, and these predictions enter the model as coefficients in the objective function. (The predictions could just as easily have come from a different form of predictive model, without changing the structure of the optimization problem.) In contrast, our MIO formulation explicitly represents the structure of each tree in the ensemble, allowing the prediction of each tree to be determined through the variables and constraints of the MIO. Although the modeling approach of Ferreira et al. (2015) is feasible for their pricing problem, it is difficult to extend this approach when there are many independent variables, as one would need to enumerate all possible combinations of values for them and compute the tree ensemble’s prediction for each combination of values. To the best of our knowledge, our paper is the first to conceptualize the problem of how to optimize an objective function that is given by a tree ensemble.
Methodologically, the present paper is most related to the paper of Bertsimas and Mišić (2016). It turns out that the rankingbased model considered in Bertsimas and Mišić (2016) can be understood as a type of tree ensemble model; as such, the MIO formulation of Bertsimas and Mišić (2016) can be regarded as a special case of the more general formulation that we analyze here. Some of the theoretical results found in Bertsimas and Mišić (2016) – specifically, those results on the structure of the Benders cuts – are generalized in the present paper to tree ensemble models. Despite this similarity, the goals of the two papers are different. Bertsimas and Mišić (2016) considers an estimation and optimization approach specifically for assortment decisions, whereas in the present paper, we focus solely on an optimization framework that can be applied to any tree ensemble model, thus spanning a significantly broader range of application domains. Indeed, later in the paper we will present two different case studies – one on drug design (Section 6) and one on customized pricing (Section 7) – to illustrate the broad applicability of tree ensemble optimization.
Finally, we note that there is a growing literature on the use of mixed-integer optimization for the purpose of estimating decision tree models and other forms of statistical models. Specifically, Bertsimas and Dunn (2017) consider an exact MIO approach to constructing CART trees. Outside of decision tree models, Bertsimas and King (2015) consider an MIO approach to model selection in linear regression, and Bertsimas et al. (2016a) consider an MIO approach to best subset selection in linear regression. While the present paper is related to this previous work in that it also leverages the technology of MIO, the goal of the present paper is different. The above papers focus on the estimation of trees and other statistical models, whereas our paper is focused on optimization, namely, how to determine the optimal decision with respect to a given, fixed tree ensemble model.

3 Model
We begin by providing some background on tree ensemble models in Section 3.1 and defining the tree ensemble optimization problem. We then present our mixed-integer optimization model in Section 3.2. We provide results on the strength of our formulation in Section 3.3. Finally, in Section 3.4, we describe a hierarchy of approximate formulations based on depth truncation.
3.1 Background
In this section, we provide some background on tree-based models. We are given the task of predicting a dependent variable $Y$ using the independent variables $X_1, \dots, X_n$; for convenience, we use $\mathbf{X} = (X_1, \dots, X_n)$ to denote the vector of independent variables. We let $\mathcal{X}_i$ denote the domain of independent variable $X_i$ and let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ denote the domain of $\mathbf{X}$. An independent variable may be a numeric variable or a categorical variable.

A decision tree is a model for predicting the dependent variable $Y$ using the independent variables $\mathbf{X}$ by checking a collection of splits. A split is a condition or query on a single independent variable that is either true or false. More precisely, for a numeric variable $X_i$, a split is a query of the form

"Is $X_i \leq a$?"

for some $a \in \mathbb{R}$. For a categorical variable $X_i$, a split is a query of the form

"Is $X_i \in A$?"

where $A \subseteq \mathcal{X}_i$ is a set of levels of the categorical variable. The splits are arranged in the form of a tree, with each split node having two child nodes. The left child corresponds to the split condition being true, while the right child corresponds to the condition being false. To make a prediction for an observation with independent variable $\mathbf{X}$, we start at the split at the root of the tree and check whether $\mathbf{X}$ satisfies the split condition; if it is true, we move to the left child, and if it is false, we move to the right child. At the new node, we check the split again and move to the corresponding child node. This process continues until we reach a leaf of the tree. The prediction that we make is the value corresponding to the leaf we have reached. An example of a decision tree and a prediction being made is given in Figure 1.
In this paper, we will focus on predictive models that are ensembles, or collections, of decision trees. We assume that there are $T$ trees, indexed from 1 to $T$. Each tree $t$ has a weight $\lambda_t$, and its prediction is denoted by the function $f_t$; for the independent variable $\mathbf{X}$, the prediction of tree $t$ is $f_t(\mathbf{X})$. For an observation with independent variable $\mathbf{X}$, the prediction of the ensemble of trees is given by

$\sum_{t=1}^{T} \lambda_t f_t(\mathbf{X}).$

The optimization problem that we would like to solve is to find the value of the independent variable $\mathbf{X}$ that maximizes the ensemble prediction:

$\max_{\mathbf{X} \in \mathcal{X}} \; \sum_{t=1}^{T} \lambda_t f_t(\mathbf{X}).$  (1)
We shall make several assumptions about the tree ensemble model and our tree ensemble optimization problem. First, we assume that we are only making a single, one-time decision and that the tree model is fixed. Extending our approach to the setting of multistage decisions is an interesting direction for future research. Second, we assume that the tree model is an accurate representation of the outcome when we make the decision $\mathbf{X}$. In practice, some care must be taken here because, depending on how the tree ensemble model is estimated and the nature of the data, the prediction may not necessarily be an accurate estimate of the causal effect of setting the independent variables to $\mathbf{X}$. This issue has been the focus of some recent work in prescriptive analytics (see Bertsimas and Kallus 2016, Kallus 2016). Our goal in this paper is to address only the question of optimization – how to efficiently and scalably optimize a tree ensemble function – which is independent of such statistical questions. As such, we will assume that the tree ensemble model we are given at the outset is beyond suspicion.
Problem (1) is very general, and one question we may have is whether it is theoretically tractable or not. Our first theoretical result answers this question in the negative.

Proposition 1. The tree ensemble optimization problem (1) is NP-Hard.

The proof of Proposition 1, given in Section 9.2 of the e-companion, follows by reducing the minimum vertex cover problem to problem (1).
3.2 Optimization model
We now present an MIO formulation of problem (1). Before we present the model, we will require some additional notation. We let $\mathcal{N} \subseteq \{1, \dots, n\}$ denote the set of numeric variables and $\mathcal{C} \subseteq \{1, \dots, n\}$ denote the set of categorical variables; we have that $\mathcal{N} \cup \mathcal{C} = \{1, \dots, n\}$.
For each numeric variable $i \in \mathcal{N}$, let $\mathcal{A}_i$ denote the set of unique split points, that is, the set of values $a$ such that $X_i \leq a$ is a split condition in some tree of the ensemble. Let $K_i = |\mathcal{A}_i|$ be the number of unique split points. Let $a_{i,j}$ denote the $j$th smallest split point of variable $i$, so that $a_{i,1} < a_{i,2} < \cdots < a_{i,K_i}$.
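As an illustration of this notation, the following sketch (our own; the dict-based tree encoding is a hypothetical representation, not the paper's code) collects the sorted set of unique split points of each numeric variable over all trees in the ensemble.

```python
# Sketch (our own names, not the paper's): collect the unique split points for
# each numeric variable i from a dict-based tree ensemble, sorted so that
# a_{i,1} < a_{i,2} < ... < a_{i,K_i}.

def collect_split_points(trees):
    """Return {variable index: sorted unique thresholds} over the ensemble."""
    points = {}
    def walk(node):
        if "leaf" in node:
            return
        points.setdefault(node["var"], set()).add(node["threshold"])
        walk(node["left"])
        walk(node["right"])
    for t in trees:
        walk(t)
    return {i: sorted(vals) for i, vals in points.items()}

tree1 = {"var": 0, "threshold": 5.0, "left": {"leaf": 1.0}, "right": {"leaf": 3.0}}
tree2 = {"var": 0, "threshold": 2.0,
         "left": {"leaf": 0.0},
         "right": {"var": 1, "threshold": 0.5, "left": {"leaf": 2.0}, "right": {"leaf": 4.0}}}
print(collect_split_points([tree1, tree2]))  # {0: [2.0, 5.0], 1: [0.5]}
```

Here variable 0 has two unique split points across the two trees, so its $K_i$ equals 2.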
For each categorical variable $i \in \mathcal{C}$, recall that $\mathcal{X}_i$ is the set of possible values of $X_i$. For convenience, let us also use $K_i$ in this case to denote the size of $\mathcal{X}_i$ (i.e., $K_i = |\mathcal{X}_i|$) and use the values $1, \dots, K_i$ to denote the possible levels of variable $i$.
Let $\mathbf{leaves}(t)$ be the set of leaves, or terminal nodes, of tree $t$. Let $\mathbf{splits}(t)$ denote the set of splits of tree $t$ (non-terminal nodes). Recall that the left branch of a split corresponds to "yes" or "true" to the split query, and the right branch corresponds to "no" or "false". Therefore, for each split $s$ in $\mathbf{splits}(t)$, we let $\mathbf{left}(s)$ be the set of leaves that are accessible from the left branch (all of the leaves for which the condition of split $s$ must be true), and $\mathbf{right}(s)$ be the set of leaves that are accessible from the right branch (all of the leaves for which the condition of split $s$ must be false). For each split $s$, we let $\mathbf{V}(s)$ denote the variable that participates in split $s$, and let $\mathbf{C}(s)$ denote the set of values of variable $\mathbf{V}(s)$ that participate in the split query of $s$. Specifically, if $\mathbf{V}(s) = i$ is numeric, then $\mathbf{C}(s) = \{1, \dots, j\}$ for some $j \in \{1, \dots, K_i\}$, which corresponds to the split query $X_i \leq a_{i,j}$. If $\mathbf{V}(s) = i$ is categorical, then $\mathbf{C}(s) \subseteq \{1, \dots, K_i\}$, which corresponds to the query $X_i \in \mathbf{C}(s)$.
Recall that $\lambda_t$ is the weight of tree $t$ and that $\mathbf{leaves}(t)$ is the set of leaves of tree $t$. For each leaf $\ell \in \mathbf{leaves}(t)$, we use $p_{t,\ell}$ to denote the prediction that tree $t$ makes when an observation reaches leaf $\ell$.
We now define the decision variables of the problem. There are two sets of decision variables. The first set is used to specify the independent variable value $\mathbf{X}$. For each categorical independent variable $i \in \mathcal{C}$ and each category/level $j \in \{1, \dots, K_i\}$, we let $x_{i,j}$ be 1 if independent variable $i$ is set to level $j$, and 0 otherwise. For each numeric independent variable $i \in \mathcal{N}$ and each $j \in \{1, \dots, K_i\}$, we let $x_{i,j}$ be 1 if independent variable $i$ is set to a value less than or equal to the $j$th split point, and 0 otherwise. Mathematically,

$x_{i,j} = \mathbb{I}\{X_i = j\}$ for $i \in \mathcal{C}$, and $x_{i,j} = \mathbb{I}\{X_i \leq a_{i,j}\}$ for $i \in \mathcal{N}$.

We use $\mathbf{x}$ to denote the vector of $x_{i,j}$ values.
The second set of decision variables is used to specify the prediction of each tree $t$. For each tree $t$ and each leaf $\ell \in \mathbf{leaves}(t)$, we let $y_{t,\ell}$ be a binary decision variable that is 1 if the observation encoded by $\mathbf{x}$ belongs to/falls into leaf $\ell$ of tree $t$, and 0 otherwise.
With these definitions, the mixed-integer optimization problem can be written as follows:

$\underset{\mathbf{x},\mathbf{y}}{\text{maximize}} \quad \sum_{t=1}^{T} \lambda_t \sum_{\ell \in \mathbf{leaves}(t)} p_{t,\ell}\, y_{t,\ell}$  (2a)

subject to $\sum_{\ell \in \mathbf{leaves}(t)} y_{t,\ell} = 1, \quad \forall\, t \in \{1, \dots, T\}$,  (2b)

$\sum_{\ell \in \mathbf{left}(s)} y_{t,\ell} \leq \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, t \in \{1, \dots, T\},\ s \in \mathbf{splits}(t)$,  (2c)

$\sum_{\ell \in \mathbf{right}(s)} y_{t,\ell} \leq 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, t \in \{1, \dots, T\},\ s \in \mathbf{splits}(t)$,  (2d)

$\sum_{j=1}^{K_i} x_{i,j} = 1, \quad \forall\, i \in \mathcal{C}$,  (2e)

$x_{i,j} \leq x_{i,j+1}, \quad \forall\, i \in \mathcal{N},\ j \in \{1, \dots, K_i - 1\}$,  (2f)

$x_{i,j} \in \{0, 1\}, \quad \forall\, i \in \{1, \dots, n\},\ j \in \{1, \dots, K_i\}$,  (2g)

$y_{t,\ell} \geq 0, \quad \forall\, t \in \{1, \dots, T\},\ \ell \in \mathbf{leaves}(t)$.  (2h)
In order of appearance, the constraints have the following meaning. Constraint (2b) ensures that the observation falls in exactly one of the leaves of each tree $t$. Constraint (2c) ensures that, if $\sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} = 0$, then $y_{t,\ell}$ is forced to zero for all $\ell \in \mathbf{left}(s)$; in words, the condition of the split is false, so the observation cannot fall into any leaf to the left of split $s$, as this would require the condition to be true. Similarly, constraint (2d) ensures that if the condition of split $s$ is satisfied, then $y_{t,\ell}$ is forced to zero for all $\ell \in \mathbf{right}(s)$; in words, the condition of the split is true, so the observation cannot fall into any leaf to the right of split $s$, as this would require the condition to be false. Constraint (2e) ensures that for each categorical variable $i \in \mathcal{C}$, exactly one of the $K_i$ levels is selected. Constraint (2f) requires that if the numeric independent variable $i$ is less than or equal to the $j$th lowest split point, then it must also be less than or equal to the $(j+1)$th lowest split point. Constraint (2g) defines each $x_{i,j}$ to be binary, while constraint (2h) defines each $y_{t,\ell}$ to be nonnegative. The objective represents the prediction of the ensemble of trees on the observation that is encoded by $\mathbf{x}$.
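To build intuition for what the formulation encodes, the following brute-force sketch (our own illustration, not a solver for formulation (2); all names are hypothetical) exploits the same observation the formulation rests on: for a single numeric variable, each feasible $\mathbf{x}$ corresponds to one interval between consecutive split points, so the piecewise constant ensemble objective can be evaluated interval by interval.

```python
# Brute-force illustration (our own sketch) of the search space behind
# formulation (2) for one numeric variable: every feasible x vector satisfying
# (2f) has the form (0,...,0,1,...,1) and corresponds to one interval between
# split points, so we enumerate intervals and evaluate the ensemble on each.

def tree_value(node, v):
    """Evaluate one tree at scalar value v of the single numeric variable."""
    while "leaf" not in node:
        node = node["left"] if v <= node["threshold"] else node["right"]
    return node["leaf"]

def optimize_single_var(trees, weights, split_points):
    """Enumerate one representative value per interval and pick the best;
    piecewise constancy makes the choice of representative irrelevant."""
    a = split_points
    reps = [a[0]] + [(a[j] + a[j + 1]) / 2 for j in range(len(a) - 1)] + [a[-1] + 1.0]
    def ensemble(v):
        return sum(w * tree_value(t, v) for t, w in zip(trees, weights))
    best = max(reps, key=ensemble)
    return best, ensemble(best)

tree1 = {"var": 0, "threshold": 5.0, "left": {"leaf": 1.0}, "right": {"leaf": 3.0}}
tree2 = {"var": 0, "threshold": 2.0, "left": {"leaf": 0.0}, "right": {"leaf": 2.0}}
x_star, z_star = optimize_single_var([tree1, tree2], [1.0, 1.0], [2.0, 5.0])
print(x_star, z_star)  # 6.0 5.0: setting the variable above every split point
```

Of course, this enumeration grows exponentially in the number of variables, which is precisely why the MIO formulation above is needed.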
We now comment on several features of the model. The first is that the $y_{t,\ell}$ variables, despite having a binary meaning, are defined as continuous variables. The reason for this is that when $\mathbf{x}$ is binary, the constraints automatically force $\mathbf{y}$ to be binary. We will formally state this result later (Proposition 4.1 of Section 4.1). As a result, the only binary variables are those in $\mathbf{x}$, of which there are $\sum_{i=1}^{n} K_i$. Recall that for categorical independent variables, $K_i$ is the number of levels, whereas for numeric independent variables, $K_i$ is the number of unique split points found in the tree ensemble.

The second is that our formulation does not model the exact value of each numeric independent variable $X_i$. Instead, the formulation only models where the variable lies in relation to the unique split points in $\mathcal{A}_i$ – for example, $x_{i,1} = 1$ indicates that independent variable $i$ is set to a value less than or equal to the first (lowest) split point. The reason for this is that each decision tree function $f_t$ is a piecewise constant function, and therefore the ensemble function $\sum_{t=1}^{T} \lambda_t f_t$ is also piecewise constant. Thus, for the purpose of optimizing this function, it is not necessary to explicitly maintain the value of each numeric independent variable $X_i$.
The third is that numeric independent variables are modeled in terms of an inequality, that is, $x_{i,j} = \mathbb{I}\{X_i \leq a_{i,j}\}$. Alternatively, one could model numeric independent variables by using $x_{i,j}$ to represent whether $X_i$ is between two consecutive split points, e.g., $x_{i,j} = \mathbb{I}\{a_{i,j-1} < X_i \leq a_{i,j}\}$. One would then redefine the set $\mathbf{C}(s)$ for each split $s$ involving the variable so as to include all of the relevant values under this new encoding. The advantage of our choice of encoding – using $x_{i,j} = \mathbb{I}\{X_i \leq a_{i,j}\}$ – is that the resulting formulation enhances the power of branching on fractional values of $\mathbf{x}$ and leads to more balanced branch-and-bound trees (Vielma 2015). This type of encoding has been used successfully in scheduling and transportation applications (so-called "by" variables, representing an event happening by some period $t$; see for example Bertsimas et al. 2011); for further details, the reader is referred to Vielma (2015).
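The inequality encoding can be illustrated as follows (our own sketch; the function name is hypothetical). Note how every encoded value is automatically a nondecreasing 0/1 vector, which is exactly what constraint (2f) enforces.

```python
# Sketch of the "x_{i,j} = 1 iff X_i <= a_{i,j}" encoding discussed above
# (our own illustration, not from the paper).

def encode_value(v, split_points):
    """Encode value v against sorted split points a_1 < ... < a_K."""
    return [1 if v <= a else 0 for a in split_points]

a = [2.0, 5.0, 9.0]
print(encode_value(1.0, a))   # [1, 1, 1]: at or below every split point
print(encode_value(6.0, a))   # [0, 0, 1]: above a_2, at or below a_3
print(encode_value(10.0, a))  # [0, 0, 0]: above every split point

# Every encoding is nondecreasing, i.e., feasible for constraint (2f):
for v in [1.0, 3.0, 6.0, 10.0]:
    x = encode_value(v, a)
    assert all(x[j] <= x[j + 1] for j in range(len(x) - 1))
```

Under the alternative interval encoding, feasible vectors would instead have exactly one nonzero entry, which branches less effectively.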
3.3 Theoretical properties
One question that we can ask, having described formulation (2), is whether there exist alternate MIO formulations for problem (1). We now describe one such alternate formulation, which involves relating the $\mathbf{x}$ and $\mathbf{y}$ variables in a different way.

In particular, for any leaf $\ell$ of any tree $t$, let $\mathbf{LS}(\ell)$ be the set of splits for which leaf $\ell$ is on the left side (i.e., splits $s$ such that $\ell \in \mathbf{left}(s)$), and $\mathbf{RS}(\ell)$ be the set of splits for which leaf $\ell$ is on the right side (i.e., splits $s$ such that $\ell \in \mathbf{right}(s)$). The tree ensemble optimization problem can then be formulated as the following problem:
$\underset{\mathbf{x}}{\text{maximize}} \quad \sum_{t=1}^{T} \lambda_t \sum_{\ell \in \mathbf{leaves}(t)} p_{t,\ell} \prod_{s \in \mathbf{LS}(\ell)} \Big( \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} \Big) \prod_{s \in \mathbf{RS}(\ell)} \Big( 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} \Big)$  (3a)

subject to constraints (2e), (2f) and (2g).  (3b)
The above problem is a binary polynomial problem. Note that the product term, $\prod_{s \in \mathbf{LS}(\ell)} \big( \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} \big) \prod_{s \in \mathbf{RS}(\ell)} \big( 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} \big)$, is exactly 1 if the observation is mapped to leaf $\ell$ of tree $t$, and 0 otherwise. The standard linearization of problem (3) (see Crama 1993) is the following MIO:
$\underset{\mathbf{x},\mathbf{y}}{\text{maximize}} \quad \sum_{t=1}^{T} \lambda_t \sum_{\ell \in \mathbf{leaves}(t)} p_{t,\ell}\, y_{t,\ell}$  (4a)

subject to

$y_{t,\ell} \leq \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, t,\ \ell \in \mathbf{leaves}(t),\ s \in \mathbf{LS}(\ell)$,  (4b)

$y_{t,\ell} \leq 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, t,\ \ell \in \mathbf{leaves}(t),\ s \in \mathbf{RS}(\ell)$,  (4c)

$y_{t,\ell} \geq \sum_{s \in \mathbf{LS}(\ell)} \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} + \sum_{s \in \mathbf{RS}(\ell)} \Big( 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j} \Big) - \big( |\mathbf{LS}(\ell)| + |\mathbf{RS}(\ell)| - 1 \big), \quad \forall\, t,\ \ell \in \mathbf{leaves}(t)$,  (4d)

$y_{t,\ell} \geq 0, \quad \forall\, t,\ \ell \in \mathbf{leaves}(t)$, together with constraints (2e), (2f) and (2g).  (4e)
Let $Z^{LO}_2$ be the optimal value of the LO relaxation of problem (2) and let $Z^{LO}_4$ be the optimal value of the LO relaxation of problem (4). The following result relates the two optimal values.

Proposition 4. We have $Z^{LO}_2 \leq Z^{LO}_4$.

The proof of Proposition 4 (see Section 9.3) follows by showing that the optimal solution of the LO relaxation of problem (2) is a feasible solution for the LO relaxation of problem (4) and achieves an objective value of exactly $Z^{LO}_2$. The significance of Proposition 4 is that it establishes that formulation (2) is a stronger formulation of the tree ensemble optimization problem than formulation (4). This is desirable from a practical perspective, as stronger MIO formulations are generally faster to solve than weaker MIO formulations. We shall see in Section 5.2 that the difference in relaxation bounds can be substantial, and that problem (4) is significantly less tractable than our problem (2).
3.4 Depth approximation
In this section, we describe a hierarchy of relaxations of problem (2) that are based on approximating each tree in the ensemble up to a particular depth $d$.

The motivation for this hierarchy of relaxations comes from the following observation regarding the size of problem (2). In particular, a key driver of the size of problem (2) is the number of left and right split constraints ((2c) and (2d), respectively); these constraints are enforced for every single split in each tree in the ensemble. For a large number of trees that are deep (and thus have many splits), the resulting number of left and right split constraints will be large. At the same time, it may be reasonable to expect that if we do not represent each tree to its full depth, but instead only represent each tree up to some depth $d$, including only the splits that occur up to (and including) depth $d$, then we might still obtain a reasonable solution to the original problem (2). In this section, we rigorously define this hierarchy of approximate formulations, and provide theoretical guarantees on how close such approximations are to the original formulation.
Let $\mathcal{S} = \{(t, s) : t \in \{1, \dots, T\},\ s \in \mathbf{splits}(t)\}$ be the set of tree-split pairs. Let $\Omega \subseteq \mathcal{S}$ be a subset of all possible tree-split pairs. The tree ensemble problem with split pairs $\Omega$ is defined as problem (2) where constraints (2c) and (2d) are restricted to the pairs in $\Omega$:
$\underset{\mathbf{x},\mathbf{y}}{\text{maximize}} \quad \sum_{t=1}^{T} \lambda_t \sum_{\ell \in \mathbf{leaves}(t)} p_{t,\ell}\, y_{t,\ell}$  (5a)

subject to $\sum_{\ell \in \mathbf{left}(s)} y_{t,\ell} \leq \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, (t, s) \in \Omega$,  (5b)

$\sum_{\ell \in \mathbf{right}(s)} y_{t,\ell} \leq 1 - \sum_{j \in \mathbf{C}(s)} x_{\mathbf{V}(s),j}, \quad \forall\, (t, s) \in \Omega$,  (5c)

constraints (2b), (2e), (2f), (2g) and (2h).  (5d)
Solving problem (5) for a fixed $\Omega$ will result in a solution that satisfies constraints (2c) and (2d), but only for those pairs in $\Omega$.
For any $d \in \{1, 2, \dots\}$, let $\Omega_d \subseteq \mathcal{S}$ be the set of all tree-split pairs $(t, s)$ where the split $s$ is at a depth of at most $d$ (a depth of 1 corresponds to the split at the root node). Let $Z_d$ denote the objective value of problem (5) with $\Omega = \Omega_d$, i.e., with all splits up to and including depth $d$, and let $d_{\max}$ be the maximum depth of any tree in the ensemble. The following simple result (see Section 9.4 for the proof) establishes that with increasing $d$, the objective value $Z_d$ is an increasingly tighter upper bound on $Z_{d_{\max}}$, the objective value of problem (2) (where the split constraints are enforced up to the full depth of each tree). We now show how to construct a complementary lower bound. Let $\mathbf{splits}_d(t)$ denote the set of splits of tree $t$ at depth $d$; if the depth of tree $t$ is strictly less than $d$, $\mathbf{splits}_d(t)$ is empty. Let us define the constant $\delta_{t,s}$ for each split $s$ of each tree $t$ as

$\delta_{t,s} = \max_{\ell \in \mathbf{left}(s) \cup \mathbf{right}(s)} p_{t,\ell} - \min_{\ell \in \mathbf{left}(s) \cup \mathbf{right}(s)} p_{t,\ell}.$  (6)

The constant $\delta_{t,s}$ is an upper bound on the maximum error possible (due to the depth truncation of the split constraints) in the prediction of tree $t$ for the observation encoded by $\mathbf{x}$, given that the observation reaches split $s$. We define $\Delta_t$ as the maximum of the $\delta_{t,s}$ values over all the depth-$d$ splits of tree $t$:

$\Delta_t = \max_{s \in \mathbf{splits}_d(t)} \delta_{t,s}.$

(In the case that $\mathbf{splits}_d(t)$ is empty, we set $\Delta_t = 0$.)
Before stating our approximation guarantee, we note that given a solution that solves problem (5) with , it is possible to find a solution that is a feasible solution for the full depth problem (2). This is a consequence of a result which we will state later (Proposition 4.1). Our approximation guarantee is given below. Suppose that for all and . Let be the optimal solution of problem (5) with . Let be the true objective of when embedded within the fulldepth problem (5). We then have
The above theorem, which we prove in Section 9.5, provides a guarantee on how suboptimal the solution derived from the depth-truncated problem (5) is for the true (full-depth) problem (2). Note that the requirement of nonnegative leaf predictions is not particularly restrictive, as the leaf predictions of a given tree can always be made nonnegative by shifting them by a constant. This result is of practical relevance because it allows the decision maker to judiciously trade off the complexity of the problem (represented by the truncation depth d) against an a priori guarantee on the quality of the approximation. Moreover, the quantity bounding the gap between the upper and lower bounds can be easily computed from each tree, allowing the bound to be readily implemented in practice. We shall see in Section 5.3 that although the lower bound can be rather conservative for small values of d, the true objective value of the truncated solution is often significantly better.
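To make the truncation bound concrete, the per-tree error constants can be computed by a simple recursion over each tree. The Python sketch below (the nested-dict tree representation is an assumption made purely for illustration) takes the spread of leaf values under each split at depth d + 1 as the bound on the truncation error; this mirrors the constants defined above, though it is a sketch rather than the paper's exact definition.

```python
def leaf_values(node):
    """All leaf predictions in the subtree rooted at node."""
    if "value" in node:
        return [node["value"]]
    return leaf_values(node["left"]) + leaf_values(node["right"])

def splits_at_depth(node, depth, target):
    """Collect split nodes at exactly depth `target` (root split = depth 1)."""
    if "value" in node:
        return []
    if depth == target:
        return [node]
    return (splits_at_depth(node["left"], depth + 1, target)
            + splits_at_depth(node["right"], depth + 1, target))

def truncation_bound(tree, d):
    """Worst-case prediction error of a tree when splits below depth d are
    dropped: the largest spread of leaf values under any depth-(d+1) split."""
    spreads = [max(leaf_values(s)) - min(leaf_values(s))
               for s in splits_at_depth(tree, 1, d + 1)]
    return max(spreads) if spreads else 0.0

# Toy tree: root split at depth 1, one further split at depth 2.
tree = {"var": 0, "threshold": 0.5,
        "left":  {"value": 1.0},
        "right": {"var": 1, "threshold": 2.0,
                  "left": {"value": 3.0}, "right": {"value": 5.0}}}
print(truncation_bound(tree, 1))  # 2.0 (leaf values 3.0 and 5.0 under the depth-2 split)
```

Truncating this toy tree at depth 1 can change its prediction by at most 2.0; truncating at depth 2 loses nothing, since no splits remain below that depth.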
4 Solution methods
The optimization model that we presented in Section 3.2, although tractable for small to medium instances, can be difficult to solve directly for large instances. In this section, we present two solution approaches for tackling large-scale instances of problem (2). In Section 4.1, we present an approach based on Benders decomposition. In Section 4.2, we present an alternate approach based on iteratively generating the split constraints.
4.1 Benders decomposition
The first solution approach that we will consider is Benders decomposition. Recall that problem (2) has two sets of variables: those encoding the observation and those encoding the leaves of the trees, with the latter partitioned into one collection of leaf variables per tree. For any two distinct trees, notice that their leaf variables do not appear together in any constraint; they are linked only through the observation variables.
The above observation suggests a Benders reformulation of problem (2). Let us rewrite problem (2) as follows:
(7a)  
subject to  (7b) 
where the value of each tree's subproblem is defined as the optimal value of the following problem:
(8a)  
subject to  (8b)  
(8c)  
(8d)  
(8e) 
The first result we will prove concerns the form of the optimal solution to problem (8). To do this, we first provide a procedure, given as Algorithm 1, for determining the leaf to which the observation encoded by a solution is mapped in a given tree. For ease of exposition, we will denote this procedure, applied to a particular encoded observation and a given tree, by GetLeaf. In Algorithm 1 we use standard notation for the left child and right child of a split node, and for the root split node of a tree.
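As a concrete illustration of the GetLeaf procedure, the following Python sketch walks an observation down a tree represented as nested dicts; this representation, and numeric "is x at most the threshold?" queries, are assumptions made for illustration rather than the paper's encoding.

```python
def get_leaf(tree, x):
    """Return the leaf reached by observation x: at each split, go left if
    the query 'x[var] <= threshold' is true, otherwise go right."""
    node = tree
    while "value" not in node:          # split nodes have no "value" key
        if x[node["var"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node

tree = {"var": 0, "threshold": 0.5,
        "left":  {"value": 1.0},
        "right": {"var": 1, "threshold": 2.0,
                  "left": {"value": 3.0}, "right": {"value": 5.0}}}
print(get_leaf(tree, {0: 0.9, 1: 1.5})["value"])  # 3.0
```

The observation fails the root query (0.9 > 0.5), proceeds right, passes the second query (1.5 <= 2.0), and lands in the leaf with prediction 3.0.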
Having defined GetLeaf, we now present our first theoretical result (see Section 9.6 for the proof). Let a feasible solution of problem (7) be given, let GetLeaf return the leaf into which the encoded observation falls in a given tree, and consider the solution to problem (8) defined as
This solution is the only feasible solution of problem (8), and is therefore its optimal solution. Since problem (8) is feasible and has a finite optimal value, by LO strong duality the optimal value of problem (8) is equal to the optimal value of its dual. The dual of subproblem (8) is
(9a)  
subject to  (9b)  
(9c) 
Letting the set of dual feasible solutions of each tree's subproblem be given, we can rewrite problem (7) as
(10a)  
subject to  
(10b)  
(10c) 
We can now solve problem (10) using constraint generation. In particular, we start with constraint (10b) enforced at a subset of dual solutions, and solve problem (10). We then solve problem (9) for each tree to determine whether there exists a dual solution for which constraint (10b) is violated. If so, we add the corresponding constraint and solve the problem again. Otherwise, if no such dual solution is found for any tree, the current solution is optimal.
In the above constraint generation scheme, the key step is to find a dual subproblem solution whose constraint is violated. With this motivation, we now prove a result complementary to Proposition 4.1 on the structure of an optimal solution to the dual problem (9) (see Section 9.7 for the proof). Let a feasible solution of problem (7) be given, along with the corresponding optimal solution of primal subproblem (8) and the leaf into which the encoded observation falls. An optimal solution of dual subproblem (9) is then given as follows:
The value of this result is that we can check for violated constraints in problem (10) through a simple calculation, without invoking an LO solver. In our numerical experiments, our Benders solution method will consist of solving problem (10) by adding the constraints (10b) through lazy constraint generation. In this approach, we solve problem (10) while maintaining a single branch-and-bound tree. At each node, we check the integer solution by solving problem (9) in closed form for each tree and determining whether any constraints are violated; if so, we add those constraints to that node only.
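The practical upshot is that, at an integer master solution, the subproblem value of each tree is simply that tree's prediction at the encoded observation, so detecting a violated Benders cut reduces to a tree traversal. The Python sketch below illustrates the check; the dict-based tree representation and the names theta and tol are assumptions made for illustration.

```python
def tree_value(tree, x):
    """Evaluate a tree at observation x by walking to a leaf."""
    node = tree
    while "value" not in node:
        node = node["left"] if x[node["var"]] <= node["threshold"] else node["right"]
    return node["value"]

def violated_trees(trees, x, theta, tol=1e-6):
    """Indices of trees whose Benders cut is violated at the current master
    solution: the master's estimate theta[t] overstates the true subproblem
    value, which is just the tree's prediction at x."""
    return [t for t, tree in enumerate(trees)
            if theta[t] > tree_value(tree, x) + tol]

trees = [{"var": 0, "threshold": 0.5,
          "left": {"value": 1.0}, "right": {"value": 4.0}}]
print(violated_trees(trees, {0: 0.2}, theta=[2.5]))  # [0], since 2.5 > 1.0
```

For each reported tree, the corresponding cut from the closed-form dual solution would then be added to the master problem.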
4.2 Split constraint generation
Recall from Section 3.4 that when there are a large number of trees and each tree is deep, the total number of splits will be large, and so the number of left and right split constraints will be large. However, for a given observation encoding, observe that we do not need all of the left and right split constraints in order for the leaf variables to be completely determined. As an example, suppose that for some tree we take the left branch of the root split; in this case, the right split constraint (2d) will force to zero every leaf variable to the right of the root split. It is clear that in this case it is not necessary to include any left or right split constraint for any split node to the right of the root split, because all of the leaf variables that could be affected by those constraints are already fixed to zero.
This suggests an alternate avenue to solving problem (2), based on iteratively generating the left and right split constraints. Rather than attempting to solve the full problem (2) with all of the left and right split constraints included in the model, we start with a subset of left split constraints and a subset of right split constraints and solve the corresponding restricted version of problem (2). For the resulting solution, we determine whether there exist any tree-split pairs for which the left split constraint (2c) or the right split constraint (2d) is violated. If a violated left or right split constraint is found, we add the corresponding constraint to the formulation and solve it again. We repeat the procedure until no violated constraints are found, at which point we terminate with the current solution as the optimal solution.
The key question in such a proposal is: how do we efficiently determine violated constraints? The answer comes from the following proposition. Let a candidate solution to problem (2) be given that satisfies constraint (2b) and constraints (2e) to (2h), and for a given tree let GetLeaf return the leaf into which the encoded observation falls. The solution satisfies constraints (2c) and (2d) for all splits of that tree if and only if it satisfies them for the splits on the path from the root to that leaf, checking constraint (2c) at splits where the path proceeds right and constraint (2d) at splits where it proceeds left. Proposition 4.2 (see Section 9.8 for the proof) states that, to check whether a solution satisfies the split constraints for a tree, it is only necessary to check the split constraints for those splits that are traversed when the encoded observation is mapped to a leaf by the action of GetLeaf. This is a simple but extremely useful result, because it implies that we can check for violated constraints simply by traversing the tree, in the same way that we do when we determine the leaf of the encoded observation.
Algorithm 2 provides the pseudocode of this procedure. The algorithm takes the observation encoded by the solution and walks it down the tree, following the splits along the way. For each split we encounter, we determine whether we should proceed to the left child or to the right child. If we are going to the left, then we check that the leaf variables are zero for all the leaves to the right of the split (constraint (2d)). If we are going to the right, then we check that the leaf variables are zero for all the leaves to the left of the split (constraint (2c)). In words, we are traversing the tree as we would to make a prediction, and we are simply checking that no positive leaf variable lies on the “wrong” side of any left or right split that we take. If we reach a leaf node, we can conclude that the current solution does not violate any of the split constraints of the tree.
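The traversal just described can be sketched in a few lines of Python; the dict-based tree with leaf "id" keys and the leaf-variable mapping y are representational assumptions made for illustration, not the paper's actual data structures.

```python
def leaves_under(node):
    """All leaf nodes in the subtree rooted at node."""
    if "value" in node:
        return [node]
    return leaves_under(node["left"]) + leaves_under(node["right"])

def find_violated_split(tree, x, y, tol=1e-6):
    """Walk x down the tree; at each split taken, verify that every leaf
    variable on the opposite side of the split is zero. Returns the first
    split with a violated constraint, or None if none is violated."""
    node = tree
    while "value" not in node:
        if x[node["var"]] <= node["threshold"]:
            off_side, nxt = node["right"], node["left"]   # going left: check (2d)
        else:
            off_side, nxt = node["left"], node["right"]   # going right: check (2c)
        if any(y.get(leaf["id"], 0.0) > tol for leaf in leaves_under(off_side)):
            return node
        node = nxt
    return None

tree = {"var": 0, "threshold": 0.5,
        "left":  {"id": "a", "value": 1.0},
        "right": {"id": "b", "value": 4.0}}
# x goes left at the root, but the right-side leaf variable y["b"] is
# positive, so the right split constraint at the root is violated.
print(find_violated_split(tree, {0: 0.2}, {"a": 1.0, "b": 0.5}) is not None)  # True
```

Each tree is checked independently in this way, and any violated constraints found are added to the restricted formulation.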
The above procedure can be used as part of a classical constraint generation scheme. Starting with some set of tree-split pairs, we solve problem (5) to obtain a candidate solution and check for violated constraints using Algorithm 2 for each tree. If we find any tree-split pairs for which a split constraint is violated, we add them to the set, re-solve problem (5) and repeat the process. If no violated tree-split pairs are found, we terminate with the candidate solution as the optimal solution of problem (2).
Alternatively, we can also generate constraints using Algorithm 2 as part of a lazy constraint generation scheme, analogous to our Benders approach. In this approach, we check the integer solution at each node of the branch-and-bound tree using Algorithm 2 and determine whether any split constraints are violated; if so, we add those constraints to that node only. We use this constraint generation approach in our numerical experiments in Section 5.4.
5 Computational experiments
In this section, we describe our first set of computational results. Section 5.1 provides the background of our experiments. Section 5.2 provides initial results on the full MIO formulation (2), while Section 5.3 provides results on the depth approximation scheme of Section 3.4. Finally, Section 5.4 compares the Benders and split generation solution methods against directly solving problem (2).
5.1 Background
We test our optimization formulation (2) and the associated solution methods from Section 4 using tree ensemble models estimated from real data sets. We consider several real data sets; the details are provided in Table 1. We wish to point out that in these data sets, the independent variables may in reality not be controllable. Despite this, the data sets are still useful in that they furnish us with real tree ensemble models for evaluating our optimization methodology.
Data set  Source  Num. Vars.  Num. Obs.  Description

winequalityred *  Cortez et al. (2009)  11  1599  Predict quality of (red) wine
concrete **  Yeh (1998)  8  1030  Predict strength of concrete
permeability **  Kansy et al. (1998)  1069  165  Predict permeability of compound
solubility **  Tetko et al. (2001), Huuskonen (2000)  228  951  Predict solubility of compound
We specifically focus on random forest models. Unless otherwise stated, all random forest models are estimated in R using the randomForest package (Liaw and Wiener 2002) with the default parameters. All linear and mixed-integer optimization models are formulated in the Julia programming language (Bezanson et al. 2012), using the JuMP package (Julia for Mathematical Programming; see Lubin and Dunning 2015), and solved using Gurobi 6.5 (Gurobi Optimization, Inc. 2015). All experiments were executed on a late 2013 Apple MacBook Pro Retina laptop, with a quad-core 2.6GHz Intel i7 processor and 16GB of memory.
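The experiments use R's randomForest together with a JuMP/Gurobi pipeline. As a rough illustration of the inputs the MIO formulation needs from a fitted ensemble (the splits and the leaf predictions of each tree), the following Python sketch extracts them from a scikit-learn forest, used here purely as a stand-in for the R workflow rather than as the paper's actual code.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a small forest on synthetic data (stand-in for the real data sets).
X, yv = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, yv)

splits, leaves = [], []
for t, est in enumerate(rf.estimators_):
    tr = est.tree_
    for n in range(tr.node_count):
        if tr.children_left[n] != tr.children_right[n]:   # internal split node
            splits.append((t, n, tr.feature[n], tr.threshold[n]))
        else:                                             # leaf: store its prediction
            leaves.append((t, n, tr.value[n][0][0]))
print(len(splits), len(leaves))
```

The tree-split pairs populate the left/right split constraints (2c) and (2d), while the leaf predictions enter the objective; in each binary tree the number of leaves is the number of splits plus one.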
5.2 Full MIO formulation experiments
As part of our first experiment, we consider solving the unconstrained tree ensemble problem for each data set. For each data set, we consider optimizing the default random forest model estimated in R, which uses 500 trees (the parameter ntree in randomForest is set to 500). For each data set, we also consider solving the tree ensemble problem using only the first T trees of the complete forest, where T ranges in {10, 50, 100, 200, 500}. For each data set and each value of T, we solve the MIO formulation (2), as well as its LO relaxation.
We compare our MIO formulation against two other approaches:

Standard linearization: We solve the tree ensemble problem using the standard linearization MIO formulation (4), subject to a 30 minute time limit. We test this approach to quantify the benefit of our formulation (2) over this more conventional formulation.

Local search: We solve the tree ensemble problem (1) using a local search heuristic. The details of this local search are provided in Section 10; at a high level, it starts from a randomly chosen initial solution, and iteratively improves the solution, one independent variable at a time, until a local optimum is reached. The heuristic is repeated from ten starting points, out of which we retain only the best (highest objective value) solution. We test this approach to establish the value of our MIO-based approach, which obtains a globally optimal solution, as opposed to a locally optimal solution.
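The local search just described can be sketched as a simple coordinate-ascent routine over discrete variable domains. The Python code below is an illustrative reimplementation of the idea only; the function names and the restart scheme are assumptions, with the actual details given in Section 10.

```python
import random

def local_search(f, domains, n_starts=10, seed=0):
    """Maximize f by coordinate ascent: from a random start, repeatedly set
    each variable to its best value with the others fixed, until no
    single-variable change improves the objective; keep the best of
    n_starts restarts."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(n_starts):
        x = [rng.choice(dom) for dom in domains]
        improved = True
        while improved:
            improved = False
            for i, dom in enumerate(domains):
                for v in dom:
                    cand = x[:i] + [v] + x[i + 1:]
                    if f(cand) > f(x):       # single-variable improvement
                        x, improved = cand, True
        if f(x) > best_val:
            best_x, best_val = x, f(x)
    return best_x, best_val

# Toy objective with a unique maximum at (2, -1).
obj = lambda z: -(z[0] - 2) ** 2 - (z[1] + 1) ** 2
print(local_search(obj, [range(-5, 6), range(-5, 6)]))  # ([2, -1], 0)
```

On a separable objective like this toy example, coordinate ascent finds the global maximum; on the tree ensemble objective it can stop at a local optimum, which is precisely the gap the MIO approach closes.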
We consider several metrics:

T_MIO: the time (in seconds) to solve the MIO (2).

T_StdLin: the time (in seconds) to solve the standard linearization MIO (4).

T_LS: the time (in seconds) to run the local search procedure (the value reported is the total for ten starting points).

G_LS: the gap of the local search solution; if Z_LS is the objective value of the local search solution and Z* is the optimal objective value of problem (2), then G_LS = 100 × (Z* − Z_LS) / Z*.

G_LO: the gap of the LO relaxation of problem (2); if Z_LO is the objective value of the LO relaxation and Z* is the optimal integer objective as before, then G_LO = 100 × (Z_LO − Z*) / Z*.

G_LO,StdLin: the gap of the LO relaxation of the standard linearization MIO (4); if Z_LO,StdLin is the optimal value of that relaxation, then G_LO,StdLin = 100 × (Z_LO,StdLin − Z*) / Z*.

Num. levels: the number of levels (i.e., the dimension of the variables encoding the observation), summed over all independent variables.

Num. leaves: the number of leaves (i.e., the dimension of the leaf variables), summed over all trees.
Table 2 shows the solution time and problem size metrics, while Table 3 shows the gap metrics. From these two tables, we can draw several conclusions. First, the time required to solve problem (2) is very reasonable; in the most extreme case (winequalityred, T = 500), problem (2) can be solved to full optimality in about 20 minutes. (Note that no time limit was imposed on problem (2); all values of T_MIO correspond to the time required to solve problem (2) to full optimality.) In contrast, the standard linearization problem (4) was solved to full optimality in only two out of twenty cases within the 30 minute time limit. In addition, for those instances where the solver reached the time limit, the optimality gap of the final integer solution, G_F,StdLin, was quite poor, ranging from 50% to over 100%.
Second, the integrality gap G_LO is quite small, on the order of a few percent in most cases. This suggests that the LO relaxation of problem (2) is quite tight. In contrast, the LO relaxation bound from the standard linearization problem (4) is weaker than that of problem (2), as predicted by Proposition 4, and strikingly so. The weakness of the relaxation explains why the corresponding integer problem cannot be solved to a high degree of optimality within the 30 minute time limit. These results, together with the results above on the MIO solution times and the final optimality gaps of problem (4), highlight the edge of our formulation (2) over the standard linearization formulation (4).
Third, although there are many cases where the local search solution performs quite well, there are also many where it is quite suboptimal, even when repeated with ten starting points. Moreover, while the local search time T_LS is generally smaller than the MIO time T_MIO, in some cases it is not substantially lower (for example, solubility with T = 500). The modest additional time required by the MIO formulation (2) may therefore be justified by the guarantee of provable optimality.
Data set  T  Num. levels  Num. leaves  T_MIO (s)  T_StdLin (s)  T_LS (s)

solubility  10  1253  3157  0.1  215.2  0.2
  50  2844  15933  0.8  1800.3  1.8
  100  4129  31720  1.7  1801.8  8.8
  200  6016  63704  4.5  1877.8  33.7
  500  9646  159639  177.9  1800.3  147.0
permeability  10  2138  604  0.0  122.6  1.0
  50  2138  3056  0.1  1800.0  1.9
  100  2138  6108  0.2  1800.3  3.1
  200  2138  12214  0.5  1800.0  6.1
  500  2138  30443  2.7  1800.0  19.1
winequalityred  10  1370  3246  1.8  1800.1  0.0
  50  2490  16296  18.5  1800.1  0.6
  100  3000  32659  51.6  1800.1  2.5
  200  3495  65199  216.0  1800.2  11.4
  500  3981  162936  1159.7  1971.8  34.6
concrete  10  1924  2843  0.2  1800.8  0.1
  50  5614  14547  22.7  1800.1  1.3
  100  7851  29120  67.8  1800.1  4.3
  200  10459  58242  183.8  1800.2  20.2
  500  13988  145262  846.9  1809.4  81.6
Data set  T  G_LS (%)  G_LO,StdLin (%)  G_F,StdLin (%)  G_LO (%)

solubility  10  0.0  485.5  0.0  18.6
  50  0.0  498.0  50.1  9.5
  100  0.0  481.2  70.5  0.3
  200  0.0  477.5  77.7  0.2
  500  0.0  501.3  103.2  0.2
permeability  10  0.0  589.5  0.0  6.1
  50  0.0  619.4  71.9  3.5
  100  0.0  614.1  75.0  1.8
  200  0.0  613.0  80.0  0.1
  500  0.0  610.4  85.9  0.0
winequalityred  10  1.5  11581.3  89.8  1.2
  50  3.4  11873.6  98.3  2.3
  100  4.3  12014.9  98.8  0.6
  200  4.3  12000.6  99.0  1.2
  500  4.5  12031.8  99.2  1.4
concrete  10  0.0  6210.6  72.5  0.0
  50  1.8  6657.1  95.0  0.0
  100  2.6  6706.6  98.3  0.0
  200  1.6  6622.2  98.5  0.0
  500  2.2  6652.6  98.8  0.0
5.3 Depth approximation experiments
In this section, we investigate the use of the depth-truncated tree ensemble problem (formulation (5) with splits restricted to those at depth at most d) for approximating the full-depth problem (2).
In this set of experiments, we focus on the same data sets as before. We solve problem (5) while varying the truncation depth d of the approximation. We consider the upper bound (denoted by “UB”), the actual objective value of the resulting solution in the full-depth problem (denoted by “Actual”) and the lower bound (denoted by “LB”).
Figures 2 and 3 plot the above three metrics for the winequalityred and concrete data sets, respectively. From these plots, we can see that the upper bound is decreasing in d, while the lower bound and the actual objective are increasing. We can also see that the lower bound is quite loose, and the depth needs to increase significantly in order for the lower bound to be close to the upper bound. However, even when the depth is small and the lower bound is loose, the actual objective of the solution produced by the approximation is very good. In the case of winequalityred, the solution is essentially optimal at a small truncation depth (compared to a maximum depth of 26); the same holds for concrete (compared to a maximum depth of 24).
To complement these results on the objective values of the formulations, Figures 4 and 5 show the computation time of the depth-truncated formulation as d varies for the same two data sets. Here we can see that the time required to solve the depth-truncated formulation initially increases in an exponential fashion as d increases; this is to be expected, because with each additional layer of splits, the number of splits roughly doubles. Interestingly, though, the solution time seems to plateau after a certain depth, and no longer continues to increase. Together with Figures 2 and 3, these plots suggest the potential of the depth-truncation approach to obtain near-optimal and optimal solutions with significantly reduced computation time relative to the full-depth problem.
5.4 Solution method experiments
In this final set of experiments, we evaluate the effectiveness of the two solution methods from Section 4, Benders decomposition and split constraint generation, on solving large instances of problem (2). We consider the same data sets as before with T ∈ {100, 200, 500}. For each instance, we report T_Benders, T_SplitGen and T_MIO, the times to solve problem (2) to full optimality using the Benders approach, using the split constraint generation approach and by directly solving problem (2), respectively.
Table 4 shows the results from this comparison. From this table, we can see that both approaches can lead to dramatic reductions in the solution time relative to solving problem (2) directly with all split constraints enforced at the start. In the most extreme case (concrete with T = 500), we observe a reduction from about 800 seconds for the standard solution method to about 32 seconds for split constraint generation and about 37 seconds for the Benders approach, a reduction in solution time of over 95%. In some cases, Benders decomposition is slightly faster than split generation; for example, for winequalityred with T = 500, the Benders approach requires just under 11 minutes whereas split generation requires just over 13 minutes. In other cases, split generation is faster (for example, solubility with T = 500).
Data set  T  Num. levels  Num. leaves  T_MIO (s)  T_Benders (s)  T_SplitGen (s)

solubility  100  4129  31720  1.7  1.1  0.8
  200  6016  63704  4.5  2.6  0.8
  500  9646  159639  177.9  28.6  13.7
permeability  100  2138  6108  0.2  0.0  0.0
  200  2138  12214  0.5  0.2  0.1
  500  2138  30443  2.7  0.4  0.7
winequalityred  100  3000  32659  51.6  41.1  56.5
  200  3495  65199  216.0  152.3  57.2
  500  3981  162936  1159.7  641.8  787.8
concrete  100  7851  29120  67.8  8.8  8.6
  200  10459  58242  183.8  12.5  15.4
  500  13988  145262  846.9  37.5  32.3
6 Case study 1: drug design
In this section, we describe our case study in drug design. Section 6.1 provides the background on the problem and the data. Section 6.2 reports the results for the unconstrained optimization problem, while Section 6.3 shows the results of our optimization model when the similarity to existing compounds is constrained.
6.1 Background
For this set of experiments, we use the data sets from Ma et al. (2015). These data sets were created for a competition sponsored by Merck and hosted by Kaggle. There are fifteen different data sets. In each data set, each observation corresponds to a compound/molecule. Each data set has a single dependent variable, which differs across data sets and represents some measure of “activity” (a property of the molecule or the performance of the molecule for some task). The independent variables in each data set are the so-called “atom pair” and “donor-acceptor pair” features, which describe the substructure of each molecule (see Ma et al. 2015 for further details). The goal of the competition was to develop a model to predict activity using the molecular substructure; such models are known as quantitative structure-activity relationship (QSAR) models.
The optimization problem that we will consider is to find the molecule that maximizes activity as predicted by a random forest model. Our interest in this problem is twofold. First, this is a problem of significant practical interest, as new drugs are extremely costly to develop. Moreover, these costs are rising: the number of drugs approved by the FDA per billion dollars of pharmaceutical R&D spending has been decreasing by about 50% every 10 years since 1950 (a phenomenon known as “Eroom’s Law” – Moore’s Law backwards; see Scannell et al. 2012). As a result, there has been growing interest in using analytics to identify promising drug candidates in academia as well as industry (see for example Atomwise Inc. 2017). With regard to random forests, we note that they are widely used in this domain: the QSAR community was one of the first to adopt them (Svetnik et al. 2003) and they have been considered a “gold standard” in QSAR modeling (Ma et al. 2015).
Second, the problem is of a very large scale. The number of independent variables ranges from about 4000 to just under 10,000, while the number of observations ranges from about 1500 to just over 37,000; in terms of file size, the smallest data set is approximately 15MB, while the largest weighs in at just over 700MB. Table 5 summarizes the data sets. Estimating a random forest model using the conventional randomForest package in R on any one of these data sets is a daunting task; to give an example, a single random forest tree on the largest data set required more than 5 minutes to estimate, which extrapolates to a total computation time of over 8 hours for a realistic ensemble of 100 trees.
Data set  Name  Num. Obs.  Num. Variables

1  3A4  37241  9491 
2  CB1  8716  5877 
3  DPP4  6148  5203 
4  HIVINT  1815  4306 
5  HIVPROT  3212  6274 
6  LOGD  37388  8921 
7  METAB  1569  4505 
8  NK1  9965  5803 
9  OX1  5351  4730 
10  OX2  11151  5790 
11  PGP  6399  5135 
12  PPB  8651  5470 
13  RAT_F  6105  5698 
14  TDI  4165  5945 
15  THROMBIN  5059  5552 
6.2 Unconstrained optimization results
In our first set of experiments, we proceed as follows. For each data set, we estimate a random forest model to predict the activity variable using all available independent variables. To reduce the computational burden posed by estimating random forest models from such large data sets, we deviate from our previous experiments by using the ranger package in R (Wright and Ziegler 2017), which is a faster implementation of the random forest algorithm suited for