1.1 Motivation & Related Work
nodes. Thus, each path from root to leaf represents a classification rule that assigns a unique label to all datapoints that reach that leaf. The goal in the design of optimal decision trees is to select the tests to perform at each internal node and the labels to assign to each leaf to maximize prediction accuracy (classification) or to minimize prediction error (regression). Not only are decision trees popular in their own right; they also form the backbone for more sophisticated machine learning models. For example, they are the building blocks for random forests, one of the most popular and stable machine learning techniques available, see e.g.,Liaw and Wiener . They have also proved useful to provide explanations for the solutions to optimization problems, see e.g., Bertsimas and Stellato .
. It can intuitively be viewed as a combinatorial optimization problem with an exponential number of decision variables: at each internal node of the tree, one can select what feature to branch on (and potentially the level of that feature), guiding each datapoint to the left or right using logical constraints.
Motivated by these hardness results, traditional algorithms for learning decision trees have relied on heuristics that employ very intuitive, yet ad-hoc, rules for constructing the decision trees. For example, CART uses the Gini Index to decide on the splitting, see Breiman ; ID3 employs entropy, see Quinlan ; and C4.5 leverages normalized information gain, see Quinlan . The high quality and speed of these algorithms combined with the availability of software packages in many popular languages such as R or Python has facilitated their popularization, see e.g., Therneau et al. , Kuhn et al. . They are now routinely used in commercial, medical, and other applications.
Mathematical Programming Techniques.
Motivated by the heuristic nature of traditional approaches, which provide no guarantees on the quality of the learned tree, several researchers have proposed algorithms for learning provably optimal
trees based on techniques from mathematical optimization. Approaches for learning optimal decision-trees rely on enumeration coupled with rules to prune-out the search space. For example,Nijssen and Fromont  use itemset mining algorithms and Narodytska et al.  use satisfiability (SAT) solvers. Verhaeghe et al.  propose a more elaborate implementation combining several ideas from the literature, including branch-and-bound, itemset mining techniques and caching. Hu et al. 
use analytical bounds (to aggressively prune-out the search space) combined with a tailored bit-vector based implementation.
The Special Case of MIP.
As an alternative approach to conducting the search for optimal trees, Bertsimas and Dunn  recently proposed to use mixed-integer programming (MIP) to learn optimal classification trees. Following this work, using MIP to learn decision trees gained a lot of traction in the literature with the works of Günlük et al. , Aghaei et al. , and Verwer and Zhang . This is no coincidence. First, MIP comes with a suit of off-the shelf solvers and algorithms that can be leveraged to effectively prune-out the search space. Indeed, solvers such as CPLEX  and Gurobi Optimization  have benefited from decades of research, see Bixby , and have been very successful at solving a broad class of MIP problems. Second, MIP comes with a highly expressive language that can be used to tailor the objective function of the problem or to augment the learning problem with constraints of practical interest. For example, Aghaei et al.  leverage the power of MIP to incorporate fairness and interpretability constraints into learned classification and regression trees. They also show how MIP technology can be exploited to learn decision trees with more sophisticated structure (linear branching and leafing rules). Similarly, Günlük et al.  use MIP to solve classification trees with combinatorial branching decisions. MIP formulations have also been leveraged to design decision trees for decision- and policy-making problems, see Azizi et al.  and Ciocan and Mišić , and for optimizing decisions over tree ensembles, see Mišic .
Discussion & Motivation.
The works of Bertsimas and Dunn , Günlük et al. , Aghaei et al. , and Verwer and Zhang  have served to showcase the modeling power of using MIP to learn decision trees and the potential suboptimality of traditional algorithms. Yet, we argue that they have not leveraged the power of MIP to its full extent. A critical component for efficiently solving MIPs is to pose good formulations, but determining such formulations is no simple task. The standard approach for solving MIP problems is the branch-and-bound method, which partitions the search space recursively and solves Linear Programming (LP) relaxations for each partition to produce lower bounds for fathoming sections of the search space. Thus, since solving a MIP requires solving a large sequence of LPs, small and compact formulations are desirable as they enable the LP relaxation to be solved faster. Moreover, formulations with tight LP relaxations, referred to as strong
formulations, are also desirable, as they produce higher quality lower bounds which lead to a faster pruning of the search space, ultimately reducing the number of LPs to be solved. Unfortunately, these two goals are at odds with one another, with stronger relaxations often requiring additional variables and constraints thanweak ones. For example, in the context of decision trees, Verwer and Zhang  propose a MIP formulation with significantly fewer variables and constraints than the formulation of Bertsimas and Dunn , but in the process weaken the LP relaxation. As a consequence, neither method consistently dominates the other.
We note that in the case of MIPs with large numbers of decision variables and constraints, classical decomposition techniques from the Operations Research literature may be leveraged to break the problem up into multiple tractable subproblems of benign complexity. A notable example of a decomposition algorithm is Benders’ [Benders, 1962]. Bender’s decomposition exploits the structure of mathematical programming problems with so-called complicating variables which couple constraints with one another and which, once fixed, result in an attractive decomposable structure that is leveraged to speed-up computation and alleviate memory consumption, allowing the solution of large-scale MIPs. To the best of our knowledge, existing approaches from the literature have not sought explicitly strong formulations, neither have they attempted to leverage the potentially decomposable structure of the problem. This is precisely the gap we fill with the present work.
1.2 Proposed Approach & Contributions
Our approach and main contributions in this paper are:
We propose an intuitive flow-based MIP formulation for the problem of learning optimal classification trees with binary data. Notably, our proposed formulation does not use big- constraints, which are known to lead to weak LP relaxations. We provide an intuitive proof to justify that our LP relaxation is stronger than existing alternatives.
Our proposed formulation is amenable to Bender’s decomposition. In particular, binary tests are selected in the master problem and each subproblem guides each datapoint through the tree via a max-flow formulation. We leverage the max-flow structure of the subproblems to solve them efficiently via min-cut procedures.
We conduct extensive computational studies on benchmark datasets, demonstrating that our formulations improve upon the state-of-the-art MIP algorithms, both in terms of in-sample solution quality (and speed) and out-of-sample performance.
The proposed modeling and solution paradigm can act as a building block for the faster and more accurate learning of more sophisticated trees. Continuous data can be discretized and binarized to address problems with continuous labels, seeBreiman . Regression trees can be obtained via minor modifications of the formulation, see e.g., Verwer and Zhang . Fairness and interpretability constraints can naturally be incorporated into the problem, see Aghaei et al. . We leave these studies to future work.
2 Decision Tree Formulation
2.1 Problem Formulation
We are given a training dataset consisting of datapoints indexed in the set . Each row of this dataset consists of binary features indexed in the set and collected in the vector and a label drawn from the finite set of classes. We consider the problem of designing an optimal decision tree that minimizes the misclassification rate based on MIP technology.
The key idea behind our model is to augment the decision tree with a single source node that is connected to the root node (node 1) of the tree and a single sink node connected to all nodes of the tree, see Figure 1. This modification enables us to think of the decision tree as a directed acyclic graph with a single source and sink node. Datapoints flow
from source to sink through a single path and only reach the sink if they are correctly classified (they will face a ‘‘road block’’ if incorrectly classified which will prevent the datapoint from traversing the graph at all). Similar to traditional algorithms for learning decision trees, we allow labels to be assigned to internal nodes of the tree. In that case, correctly classified datapoints that reach such nodes are directly routed to the sink node (as if we had a ‘‘short circuit’’).
Next, we introduce our notation and conventions that will be useful to present our model. We denote by and the sets of all internal and leaf nodes in the tree, respectively. For each node , we let be the direct ancestor of in the graph. For , let (resp. ) represent the left (resp. right) direct descendant of node in the graph. In particular, we have . We will say that we branch on feature at node if the binary test performed at asks ‘‘Is ’’? Datapoint will be directed left (right) if the answer is affirmative (negative).
The decision variables for our formulation are as follows. The variable indicates if we branch on (i.e., perform a binary test on) feature at node . If for some node , no feature is selected to branch on at that node, and a class is assigned to node . We let the variable indicate if the predicted class for node is . A datapoint is correctly classified iff it reaches some node such that with . Points that arrive at that node and that are correctly classified are directed to the sink. For each node and for each datapoint , we introduce a binary valued decision variable which equals 1 if and only if the th datapoint is correctly classified (i.e., reaches the sink) and traverses the edge between nodes and . We let be defined accordingly for each edge between node and sink .
The flow-based formulation for decision trees reads
where is a regularization weight. The objective (1a) maximizes the total number of correctly classified points while minimizing the number of splits . Thus, controls the trade-off between these competing objectives, with larger values of lambda corresponding to greater regularization. An interpretation of the constraints is as follows. Constraint (1b) ensures that at each node we either branch on a feature or assign a class label to it (but not both, the label is only used if we do not branch at that node). Constraint (1c) guarantees that each leaf has a unique predicted class label. Constraint (1d) is a flow conservation constraint for each datapoint and node : it ensures that if a datapoint arrives at a node, it must also leave the node through one of its descendants, or be correctly classified and routed to . Similarly, constraint (1e) enforces flow conservation for each node . The inequality constraint (1f) ensures that at most one unit of flow can enter the graph through the source. Constraints (1g) and (1h) ensure that if a datapoint is routed to the left (right) at node , then one of the features such that () must have been selected for branching at the node. Constraint (1i) ensures that datapoints routed to the sink node are correctly classified.
Given a choice of branching and labeling decisions, and , each datapoint is allotted one unit of flow which it attempts to guide through the graph from the source node to the sink node. If the datapoint cannot be correctly classified, the flow that will reach the sink (and by extension enter the source) will be zero. In particular note that once the and variables have been fixed, optimization of the flows can be done separately for each datapoint. This implies that the problem can be decomposed to speed-up computation, an idea that we leverage in Section 3. In particular, note that the optimization over flow variables can be cast as a max-flow problem for each datapoint, implying that the integrality constraint on the variables can be relaxed to yield an equivalent formulation. We leverage this idea in our computational experiments.
Formulation (1) has several distinguishing features relative to existing MIP formulations for training decision trees
It does not use big- constraints.
It includes flow variables indicating whether each datapoint is directed to the left or right at each branching node.
It only tracks datapoints that are correctly classified.
The number of variables and constraints in formulation (1) is , where is the tree depth. Thus, its size is of the same order as the one proposed by Bertsimas and Dunn . Nonetheless, as we discuss in §2.2, the LP relaxation of formulation (1) is tighter, and therefore results in a more aggressive pruning of the search space without incurring in significant additional costs.
2.2 Strength of the Flow-Based Formulation
We now argue that formulation (1), which we henceforth refer to as flow-based formulation, is stronger than existing formulations from the literature.
The BinOCT formulation of Verwer and Zhang  is obtained by aggregating constraints from the OCT formulation of Bertsimas and Dunn  (using big- constants). As a consequence, its relaxation is weaker. Thus, it suffices to argue that the proposed formulation is stronger than OCT. In the following, we work with a simplified version of the formulation of Bertsimas and Dunn  specialized to the case of binary data. We provide this formulation in the online companion A.
2.2.1 No big-s
In this section, we argue that the absence of big- constraints in our formulation induces a stronger formulation. In the OCT formulation, for and
, there are binary variablessuch that if datapoint is assigned to leaf node (regardless of whether that point is correctly classified or not), and otherwise. In addition, the authors introduce a variable that represents the number of missclassified points at leaf node , and this variable is defined via constraints and
Thus, the number of correctly classified points is . Note that this is a big- constraint, with , which is activated or deactivated depending on whether or not.
The LP relaxation induced from counting correctly classified points can be improved. The number of such points, using the variables above, is
The right hand side of (2) is nonlinear (quadratic). Nonetheless, the quadratic function is supermodular, see Nemhauser et al. , and its concave envelop can be described by introducing variables via the system
The additional variables are precisely the variables used in formulation (1). Note that a simple application of this idea would require the introduction of additional variables for each pair . However, by noting that the desired tree structure can be enforced using the new variables only, and the original variables can be dropped, we achieve this strengthening without incurring the cost of a larger formulation.
2.2.2 Improved branching constraints
To correctly enforce the branching structure of the decision-tree, Bertsimas and Dunn  use (after specializing their formulation to the case of binary data) constraints of the form
where denotes the set of ancestors of whose left branch was followed on the path from the root to . An intrepretation of this constraint is as follows: if datapoint reaches leaf node , then for all nodes in the path where took the left direction, no branching decision can be made that would cause the point to go right. Instead, we use constraint (1g).
where, following the notation of Bertsimas and Dunn , is the set of left descendants of . In particular, the left hand side of constraint (1g) is larger than the left hand side of (3). Now, we focus on the right hand side: from constraints (1b), we find that
In particular, the right hand side of (1g) is smaller than the right hand side of (3). Similar arguments can be made for constraint (1h). As a consequence, the linear inequalities for branching induced from formulation (1) dominate those proposed by Bertsimas and Dunn .
2.2.3 Further Strengthening of the Formulation
Formulation (1) can be strengthened even more through the addition of cuts.
Let be any node such that and . Also, let and define as any subset of the rows such that: a) , and b) . Intuitively, is a set of points belonging to different classes that would all be assigned to the right branch if feature is selected for branching. Then, the constraint
is valid: indeed, if , then none of the points in can be assigned to the left branch; and, if , then at most one of the points in can be correctly classified.
None of the constraint in (1) implies (4). As a matter of fact, if all constraints (4) are added for all possible combinations of sets , nodes and features , then variables with could be dropped from the formulation, along with constraints (1i) and (1c). Naturally, we do not add all constraints (4) a priori, but instead use cuts to enforce them as needed.
3 Benders’s Decomposition
The flow-based formulation (1) is effective at reducing the number of branch-and-bound nodes required to prove optimality when compared with existing formulations, and results in a substantial speedup in small- and medium- sized instances, see §4. However, in larger instances, the computational time required to solve the LP relaxations may become prohibitive, impairing its performance in branch-and-bound.
Recall from §2 that, if variables and are fixed, then the problem decomposes into independent subproblems, one for each datapoint. Additionally, each problem is a maximum flow problem, for which specialized polynomial-time methods exist. Due to these characteristics, formulation (1) can be naturally tackled using Benders’ decomposition, see Benders . In what follows, we explain the details of our implementation.
Observe that problem (1) can be written in an equivalent fashion by making the subproblems explicit as follows:
where, for any fixed , and , is defined as the optimal objective value of the max-flow problem
In formulation (6) we use the shorthand to represent upper bounds on the decision variables . These values can be interpreted as edge capacities in the flow problem, and are given as for all , and for all , and finally . Note that if point is correctly classified given the tree structure and class labels induced by .
From the well-known max-flow/min-cut duality, we find that also equals the optimal value of the dual of the above max-flow problem, which is expressible as
Problem (7) is a minimum cut problem, where variable is one if and only if node is in the source set (we implicitly fix ), and variable is one if and only if arc is part of the minimum cut. Note that the feasible region (7b)-(7h) of the minimum cut problem does not depend on the variables ; we denote this feasible region by .
We can now reformulate the master problem (5) as follows:
In the above formulation, we have added constraint (8e) to make sure we get bounded solutions in the relaxed master problem. Formulation (8) is implemented using row generation, where constraint (8b) is initially dropped but then enforced by adding cuts on the fly. Row generation can be implemented in modern MIP optimization solvers via callbacks, by adding lazy constraints at relevant nodes of the branch-and-bound tree. Identifying which constraint (8b) to add can in general be done by solving a minimum cut problem, for which specialized algorithms exist, see Goldberg and Tarjan  and Hochbaum . However, if all variables and are integer, due to the special structure of the network, it is possible to solve the minimum cut problem in time complexity linear in the depth of the tree – this depth is a constant for all practical purposes. We now describe the separation procedure used to find a most violated inequality (8b) to add, or to conclude that all constraints are satisfied.
Since is integer, all arcs capacities in formulations (6) and (7) are either 0 or 1. Moreover, since it is feasible for (5), it completely defines a decision-tree. We say that a node in this decision tree is terminal if its unique descendent is the sink .
Note the network defined by (6) has the following structure: (i) for each node , there is a unique outgoing arc with capacity 1 (corresponding to the direction the datapoint would go if it reaches ), and the remaining arcs have capacity ; (ii) there is a unique path of arcs with capacity 1 starting at , and this path passes through the terminal node datapoint is assigned to; (iii) if the class of is assigned to that node, then the path continues to the sink node ; otherwise, the path ends there.
If the path connects and , then the minimum cut has value 1, corresponding to any single arc on this path. Note that in this case constraint (8b) is satisfied as variables in (8e) have a trivial upper bound of 1. Thus cuts corresponding to this case are never added. If the path does not connect and , then a minimum cut with value 0 can be found by assigning the nodes in the path to the source set, and the remaining nodes to the sink set. The arcs in the minimum cut are all the arcs adjacent to nodes in the path, but not part of it. All those arcs have indeed capacity 0, as the single arc with capacity 1 is precisely the one in the path. Since the path has length at most and the out-degree of all nodes is 3, the minimum cut can be found in using any graph search algorithms. ∎
Approaches and Datasets.
We evaluate our two approaches on eight publicly available datasets. The number of rows (
), number of one-hot encoded features (), and number of class labels () for each dataset are given in Table 1. We compare the flow-based formulation (FlowOCT) and its Benders’ decomposition (Benders) against the formulations proposed by Bertsimas and Dunn  (OCT) and Verwer and Zhang  (BinOCT). As the code used for OCT is not publicly available, we implemented the corresponding formulation (adapted for the case of binary data). The details of this implementation are given in the online companion A.
Each dataset is split into three parts: the training set (50%), the validation set (25%), and the testing set (25%). The training and validation sets are used to tune the value of the hyperparameter. We repeat this process 5 times with 5 different samples. We test values of for . Finally, we use the best to train a tree using the training and evaluation sets from the previous step, which we then evaluate against the testing set to determine the out-of-sample accuracy. All approaches are implemented in Python programming language and solved by the Gurobi 8.1 solver. All problems are solved on a single core of SL250s Xeon CPUs by HPE and 4gb of memory with a one hour time limit.
In-Sample (Optimization) Performance.
Figure 6 summarizes the in-sample performance, i.e., how good the methods are at solving the optimization problems. Detailed results are provided in the online companion B. From Figure 6(a), we observe that for , BinOCT is able to solve 79 instances within the time limit (and outperforms OCT), but Benders solves the same quantity of instances in only 140 seconds, resulting in a speedup. Similarly, from Figure 6(b), it can be seen that for , OCT is able to solve 666 instances within the time limit111BinOCT does not include the option to have a regularization parameter, and is omitted., while Benders requires only 70 seconds to do so, resulting in a speedup. Finally, Figure 6(c) shows the optimality gaps proven as a function of the dimension. We observe that all methods result in a gap of in small instances. As the dimension increases, BinOCT (which relies on weak formulations but fast enumeration) yields 100% optimality gaps in most cases. OCT and BinOCT prove better gaps, but the performance degrades substantially as the dimension increases. Benders results in the best performance, proving optimality gaps of 20% or less regardless of dimension.
Out-of-Sample (Statistical) Performance.
Table 2 reports the out-of-sample accuracy after cross-validation. Each row represents the average over the five samples. We observe that the better optimization performance translates to superior statistical properties as well: OCT is the best method in two instances (excluding ties), BinOCT in six, while the new formulations FlowOCT and Benders are better in 13 (of which Benders accounts for 10, and is second after FlowOCT in an additional two).
- Aghaei et al.  Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. Learning optimal and fair decision trees for non-discriminative decision-making. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1418–1426, 2019.
- Azizi et al.  Mohammad Javad Azizi, Phebe Vayanos, Bryan Wilder, Eric Rice, and Milind Tambe. Designing fair, efficient, and interpretable policies for prioritizing homeless youth for housing resources. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 35–51. Springer, 2018.
- Benders  Jacques F Benders. Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik, 4(1):238–252, 1962.
- Bertsimas and Dunn  Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
- Bertsimas and Stellato  Dimitris Bertsimas and Bartolomeo Stellato. The voice of optimization. arXiv preprint arXiv:1812.09991, 2018.
- Bixby  Robert E Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2012.
- Breiman  Leo Breiman. Classification and regression trees. Technical report, 1984.
- Breiman  Leo Breiman. Classification and regression trees. Routledge, 2017.
- Ciocan and Mišić  Dragos Ciocan and Velibor Mišić. Interpretable optimal stopping. 2018.
- CPLEX  IBM ILOG CPLEX. V12. 1: User’s manual for cplex. International Business Machines Corporation, 46(53):157, 2009.
- Goldberg and Tarjan  Andrew V Goldberg and Robert E Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.
- Günlük et al.  Oktay Günlük, Jayant Kalagnanam, Matt Menickelly, and Katya Scheinberg. Optimal decision trees for categorical data via integer programming. arXiv preprint arXiv:1612.03225, 2018.
- Gurobi Optimization  Incorporate Gurobi Optimization. Gurobi optimizer reference manual. URL http://www. gurobi. com, 2015.
- Hochbaum  Dorit S Hochbaum. The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations research, 56(4):992–1009, 2008.
- Hu et al.  Xiyang Hu, Cynthia Rudin, and Margo Seltzer. Optimal sparse decision trees. arXiv preprint arXiv:1904.12847, 2019.
- Hyafil and Rivest  Laurent Hyafil and Ronald L Rivest. Constructing optimal binary search trees is NP complete. Information Processing Letters, 1976.
- Kuhn et al.  Max Kuhn, Steve Weston, Mark Culp, Nathan Coulter, and Ross Quinlan. Package ‘c50’, 2018.
- Liaw and Wiener  Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
- Mišic  Velibor V Mišic. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883, 2017.
- Narodytska et al.  Nina Narodytska, Alexey Ignatiev, Filipe Pereira, and Joao Marques-Silva. Learning optimal decision trees with SAT. In IJCAI, pages 1362–1368, 2018.
- Nemhauser et al.  George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
- Nijssen and Fromont  Siegfried Nijssen and Elisa Fromont. Optimal constraint-based decision tree induction from itemset lattices. Data Mining and Knowledge Discovery, 21(1):9–51, 2010.
- Quinlan  John Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
- Quinlan  John Ross Quinlan. C4. 5: programs for machine learning. Elsevier, 2014.
- Therneau et al.  Terry Therneau, Beth Atkinson, Brian Ripley, and Maintainer Brian Ripley. Package ‘rpart’. 2015.
- Verhaeghe et al.  Hélene Verhaeghe, Siegfried Nijssen, Gilles Pesant, Claude-Guy Quimper, and Pierre Schaus. Learning optimal decision trees using constraint programming. In The 25th International Conference on Principles and Practice of Constraint Programming (CP2019), 2019.
- Verwer and Zhang  Sicco Verwer and Yingqian Zhang. Learning decision trees with flexible constraints and objectives using integer optimization. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 94–103. Springer, 2017.
- Verwer and Zhang  Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation. In 33rd AAAI Conference on Artificial Intelligence, 2019.