Learning Optimal Classification Trees: Strong Max-Flow Formulations

02/21/2020 ∙ by Sina Aghaei, et al. ∙ University of Southern California

We consider the problem of learning optimal binary classification trees. Literature on the topic has burgeoned in recent years, motivated both by the empirical suboptimality of heuristic approaches and the tremendous improvements in mixed-integer programming (MIP) technology. Yet, existing approaches from the literature do not leverage the power of MIP to its full extent. Indeed, they rely on weak formulations, resulting in slow convergence and large optimality gaps. To fill this gap in the literature, we propose a flow-based MIP formulation for optimal binary classification trees that has a stronger linear programming relaxation. Our formulation presents an attractive decomposable structure. We exploit this structure and max-flow/min-cut duality to derive a Benders' decomposition method, which scales to larger instances. We conduct extensive computational experiments on standard benchmark datasets on which we show that our proposed approaches are 50 times faster than state-of-the-art MIP-based techniques and improve out-of-sample performance by up to 13.8%.







1 Introduction

1.1 Motivation & Related Work

Since their inception over 30 years ago, decision trees have become among the most popular techniques for interpretable machine learning (classification and regression), see Breiman [1984]. A decision tree takes the form of a binary tree. At each internal node of the tree, a binary test is performed on a specific feature. Two branches emanate from each internal node, each representing one outcome of the test. If a datapoint passes (resp. fails) the test, it is directed to the left (resp. right) branch. A predicted label is assigned to all leaf nodes. Thus, each path from root to leaf represents a classification rule that assigns a unique label to all datapoints that reach that leaf. The goal in the design of optimal decision trees is to select the tests to perform at each internal node and the labels to assign to each leaf so as to maximize prediction accuracy (classification) or minimize prediction error (regression). Not only are decision trees popular in their own right; they also form the backbone of more sophisticated machine learning models. For example, they are the building blocks of random forests, one of the most popular and stable machine learning techniques available, see e.g., Liaw and Wiener [2002]. They have also proved useful for explaining the solutions of optimization problems, see e.g., Bertsimas and Stellato [2018].

The problem of learning optimal decision trees is an NP-hard problem, see Hyafil and Rivest [1976] and Breiman [2017]. It can intuitively be viewed as a combinatorial optimization problem with an exponential number of decision variables: at each internal node of the tree, one can select which feature to branch on (and potentially the level of that feature), guiding each datapoint left or right using logical constraints.

Traditional Methods.

Motivated by these hardness results, traditional algorithms for learning decision trees have relied on heuristics that employ intuitive, yet ad-hoc, rules for constructing decision trees. For example, CART uses the Gini index to decide on the splitting, see Breiman [1984]; ID3 employs entropy, see Quinlan [1986]; and C4.5 leverages normalized information gain, see Quinlan [2014]. The high quality and speed of these algorithms, combined with the availability of software packages in many popular languages such as R and Python, have facilitated their popularization, see e.g., Therneau et al. [2015], Kuhn et al. [2018]. They are now routinely used in commercial, medical, and other applications.
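These splitting criteria are easy to state precisely. The sketch below is a minimal illustration (not the cited implementations): it computes the Gini impurity and entropy of a label multiset, and the weighted impurity of a candidate binary split that a greedy heuristic would minimize.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the empirical class distribution (in bits)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_score(labels_left, labels_right, impurity):
    """Weighted impurity of a candidate binary split (lower is better)."""
    n = len(labels_left) + len(labels_right)
    return (len(labels_left) / n) * impurity(labels_left) + \
           (len(labels_right) / n) * impurity(labels_right)
```

CART greedily picks the split minimizing `split_score` with `gini`; ID3 maximizes the information gain `entropy(parent) - split_score(left, right, entropy)`.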

Mathematical Programming Techniques.

Motivated by the heuristic nature of traditional approaches, which provide no guarantees on the quality of the learned tree, several researchers have proposed algorithms for learning provably optimal trees based on techniques from mathematical optimization. Approaches for learning optimal decision trees rely on enumeration coupled with rules to prune out the search space. For example, Nijssen and Fromont [2010] use itemset mining algorithms and Narodytska et al. [2018] use satisfiability (SAT) solvers. Verhaeghe et al. [2019] propose a more elaborate implementation combining several ideas from the literature, including branch-and-bound, itemset mining techniques, and caching. Hu et al. [2019] use analytical bounds (to aggressively prune out the search space) combined with a tailored bit-vector-based implementation.

The Special Case of MIP.

As an alternative approach to conducting the search for optimal trees, Bertsimas and Dunn [2017] recently proposed to use mixed-integer programming (MIP) to learn optimal classification trees. Following this work, using MIP to learn decision trees gained a lot of traction in the literature with the works of Günlük et al. [2018], Aghaei et al. [2019], and Verwer and Zhang [2019]. This is no coincidence. First, MIP comes with a suite of off-the-shelf solvers and algorithms that can be leveraged to effectively prune out the search space. Indeed, solvers such as CPLEX [2009] and Gurobi Optimization [2015] have benefited from decades of research, see Bixby [2012], and have been very successful at solving a broad class of MIP problems. Second, MIP comes with a highly expressive language that can be used to tailor the objective function of the problem or to augment the learning problem with constraints of practical interest. For example, Aghaei et al. [2019] leverage the power of MIP to incorporate fairness and interpretability constraints into learned classification and regression trees. They also show how MIP technology can be exploited to learn decision trees with more sophisticated structure (linear branching and leafing rules). Similarly, Günlük et al. [2018] use MIP to solve classification trees with combinatorial branching decisions. MIP formulations have also been leveraged to design decision trees for decision- and policy-making problems, see Azizi et al. [2018] and Ciocan and Mišić [2018], and for optimizing decisions over tree ensembles, see Mišic [2017].

Discussion & Motivation.

The works of Bertsimas and Dunn [2017], Günlük et al. [2018], Aghaei et al. [2019], and Verwer and Zhang [2019] have showcased the modeling power of MIP for learning decision trees and the potential suboptimality of traditional algorithms. Yet, we argue that they have not leveraged the power of MIP to its full extent. A critical component of efficiently solving MIPs is posing good formulations, and determining such formulations is no simple task. The standard approach for solving MIP problems is the branch-and-bound method, which partitions the search space recursively and solves linear programming (LP) relaxations for each partition to produce lower bounds for fathoming sections of the search space. Since solving a MIP requires solving a large sequence of LPs, small and compact formulations are desirable, as they enable the LP relaxation to be solved faster. Moreover, formulations with tight LP relaxations, referred to as strong formulations, are also desirable, as they produce higher-quality lower bounds which lead to faster pruning of the search space, ultimately reducing the number of LPs to be solved. Unfortunately, these two goals are at odds with one another, with stronger relaxations often requiring more variables and constraints than weak ones. For example, in the context of decision trees, Verwer and Zhang [2019] propose a MIP formulation with significantly fewer variables and constraints than the formulation of Bertsimas and Dunn [2017], but in the process weaken the LP relaxation. As a consequence, neither method consistently dominates the other.

We note that for MIPs with large numbers of decision variables and constraints, classical decomposition techniques from the operations research literature may be leveraged to break the problem into multiple tractable subproblems of benign complexity. A notable example is Benders' decomposition [Benders, 1962]. Benders' decomposition exploits the structure of mathematical programming problems with so-called complicating variables, which couple the constraints with one another and which, once fixed, yield an attractive decomposable structure that can be leveraged to speed up computation and alleviate memory consumption, allowing the solution of large-scale MIPs. To the best of our knowledge, existing approaches from the literature have not explicitly sought strong formulations, nor have they attempted to leverage the potentially decomposable structure of the problem. This is precisely the gap we fill with the present work.

1.2 Proposed Approach & Contributions

Our approach and main contributions in this paper are:

  1. We propose an intuitive flow-based MIP formulation for the problem of learning optimal classification trees with binary data. Notably, our proposed formulation does not use big-M constraints, which are known to lead to weak LP relaxations. We provide an intuitive proof to justify that our LP relaxation is stronger than existing alternatives.

  2. Our proposed formulation is amenable to Benders' decomposition. In particular, binary tests are selected in the master problem and each subproblem guides each datapoint through the tree via a max-flow formulation. We leverage the max-flow structure of the subproblems to solve them efficiently via min-cut procedures.

  3. We conduct extensive computational studies on benchmark datasets, demonstrating that our formulations improve upon the state-of-the-art MIP algorithms, both in terms of in-sample solution quality (and speed) and out-of-sample performance.

The proposed modeling and solution paradigm can act as a building block for faster and more accurate learning of more sophisticated trees. Continuous data can be discretized and binarized to fit our framework, see Breiman [2017]. Regression trees can be obtained via minor modifications of the formulation, see e.g., Verwer and Zhang [2017]. Fairness and interpretability constraints can naturally be incorporated into the problem, see Aghaei et al. [2019]. We leave these studies to future work.

The rest of the paper is organized as follows. We introduce our flow-based formulation and our Benders' decomposition method in §2 and §3, respectively. We report in §4 computational experiments with popular benchmark datasets.

2 Decision Tree Formulation

2.1 Problem Formulation

We are given a training dataset consisting of datapoints indexed in the set . Each row of this dataset consists of binary features indexed in the set and collected in the vector and a label drawn from the finite set of classes. We consider the problem of designing an optimal decision tree that minimizes the misclassification rate based on MIP technology.

The key idea behind our model is to augment the decision tree with a single source node  that is connected to the root node (node 1) of the tree and a single sink node  connected to all nodes of the tree, see Figure 1. This modification enables us to think of the decision tree as a directed acyclic graph with a single source and sink node. Datapoints flow

from source to sink through a single path and only reach the sink if they are correctly classified (they face a "road block" if incorrectly classified, which prevents the datapoint from traversing the graph at all). Similar to traditional algorithms for learning decision trees, we allow labels to be assigned to internal nodes of the tree. In that case, correctly classified datapoints that reach such nodes are directly routed to the sink node (as if through a "short circuit").

Next, we introduce our notation and conventions that will be useful to present our model. We denote by  and  the sets of all internal and leaf nodes in the tree, respectively. For each node , we let be the direct ancestor of in the graph. For , let (resp. ) represent the left (resp. right) direct descendant of node  in the graph. In particular, we have . We will say that we branch on feature at node if the binary test performed at  asks ‘‘Is ’’? Datapoint  will be directed left (right) if the answer is affirmative (negative).
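Because the paper's node-labeling symbols did not survive extraction, the following sketch assumes the standard heap-style indexing of a complete binary tree (root 1, children 2n and 2n+1) as an illustrative convention; under that assumption the ancestor and descendant maps are one-liners.

```python
def tree_layout(depth):
    """Internal (branching) and leaf nodes of a complete binary tree of
    the given depth, using heap indexing: node 1 is the root."""
    internal = list(range(1, 2 ** depth))
    leaves = list(range(2 ** depth, 2 ** (depth + 1)))
    return internal, leaves

def ancestor(n):
    """Direct ancestor a(n) of node n (undefined for the root)."""
    return n // 2

def left(n):
    """Left direct descendant of node n."""
    return 2 * n

def right(n):
    """Right direct descendant of node n."""
    return 2 * n + 1
```

For example, `tree_layout(2)` gives internal nodes `[1, 2, 3]` and leaves `[4, 5, 6, 7]`.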

The decision variables for our formulation are as follows. The variable indicates if we branch on (i.e., perform a binary test on) feature at node . If for some node , no feature is selected to branch on at that node, and a class is assigned to node . We let the variable indicate if the predicted class for node is . A datapoint is correctly classified iff it reaches some node such that with . Points that arrive at that node and that are correctly classified are directed to the sink. For each node and for each datapoint , we introduce a binary valued decision variable which equals 1 if and only if the th datapoint is correctly classified (i.e., reaches the sink) and traverses the edge between nodes and . We let be defined accordingly for each edge between node and sink .

Figure 1: A classification tree of depth 2 viewed as a directed acyclic graph with a single source and sink.

The flow-based formulation for decision trees reads

s.t. (1b)

where is a regularization weight. The objective (1a) maximizes the total number of correctly classified points while minimizing the number of splits. Thus, the regularization weight controls the trade-off between these competing objectives, with larger values corresponding to greater regularization. An interpretation of the constraints is as follows. Constraint (1b) ensures that at each node we either branch on a feature or assign a class label to it (but not both; the label is only used if we do not branch at that node). Constraint (1c) guarantees that each leaf has a unique predicted class label. Constraint (1d) is a flow conservation constraint for each datapoint and internal node: it ensures that if a datapoint arrives at a node, it must also leave the node through one of its descendants, or be correctly classified and routed to the sink. Similarly, constraint (1e) enforces flow conservation at each leaf node. The inequality constraint (1f) ensures that at most one unit of flow can enter the graph through the source. Constraints (1g) and (1h) ensure that if a datapoint is routed to the left (right) at a node, then a feature consistent with that direction must have been selected for branching at the node. Constraint (1i) ensures that datapoints routed to the sink node are correctly classified.
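The displayed formulation itself was lost in extraction. As an illustration only, writing z for the (assumed) flow variables, b for the branching variables, and x for the binary features, the flow-conservation constraint (1d) described above would take a form such as

```latex
z^{i}_{a(n),n} \;=\; z^{i}_{n,\ell(n)} + z^{i}_{n,r(n)} + z^{i}_{n,t}
\qquad \forall n,\; \forall i,
```

and the branching constraint (1g) would bound the left-going flow by the branching decisions consistent with going left, e.g. \(z^{i}_{n,\ell(n)} \le \sum_{f:\, x^{i}_{f}=0} b_{nf}\). The notation here is an assumption for illustration, not the paper's original symbols.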

Given a choice of branching and labeling decisions, each datapoint is allotted one unit of flow which it attempts to guide through the graph from the source node to the sink node. If the datapoint cannot be correctly classified, the flow that reaches the sink (and, by extension, enters the source) will be zero. In particular, note that once the branching and labeling variables have been fixed, the optimization of the flows can be done separately for each datapoint. This implies that the problem can be decomposed to speed up computation, an idea that we leverage in Section 3. In particular, note that the optimization over flow variables can be cast as a max-flow problem for each datapoint, implying that the integrality constraint on the flow variables can be relaxed to yield an equivalent formulation. We leverage this idea in our computational experiments.
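To see the per-datapoint decomposition concretely, the following sketch (hypothetical names; a heap-indexed tree with a left-on-0 test is assumed, since the paper's symbols were lost) computes the flow a single datapoint sends from source to sink once the branching decisions and node labels are fixed.

```python
def route(b, w, x, label):
    """Return 1 iff the datapoint reaches the sink, i.e. is correctly
    classified by the fixed tree; otherwise the flow value is 0.
    b: internal node -> feature branched on (absent if the node is labeled)
    w: node -> predicted class label
    x: feature -> value in {0, 1}"""
    n = 1  # start at the root
    while n in b:                       # follow the binary tests
        n = 2 * n if x[b[n]] == 0 else 2 * n + 1
    # a labeled node routes the point to the sink iff the label matches
    return 1 if w.get(n) == label else 0
```

For a depth-2 tree branching on `f1` at the root and `f2` at node 2, `route({1: 'f1', 2: 'f2'}, {3: 'yes', 4: 'no', 5: 'yes'}, {'f1': 0, 'f2': 1}, 'yes')` returns 1.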

Formulation (1) has several distinguishing features relative to existing MIP formulations for training decision trees:

  1. It does not use big-M constraints.

  2. It includes flow variables indicating whether each datapoint is directed to the left or right at each branching node.

  3. It only tracks datapoints that are correctly classified.

The number of variables and constraints in formulation (1) is , where is the tree depth. Thus, its size is of the same order as that of the formulation proposed by Bertsimas and Dunn [2017]. Nonetheless, as we discuss in §2.2, the LP relaxation of formulation (1) is tighter, and therefore results in a more aggressive pruning of the search space without incurring significant additional cost.

2.2 Strength of the Flow-Based Formulation

We now argue that formulation (1), which we henceforth refer to as flow-based formulation, is stronger than existing formulations from the literature.

The BinOCT formulation of Verwer and Zhang [2019] is obtained by aggregating constraints from the OCT formulation of Bertsimas and Dunn [2017] (using big-M constants). As a consequence, its relaxation is weaker. Thus, it suffices to argue that the proposed formulation is stronger than OCT. In what follows, we work with a simplified version of the formulation of Bertsimas and Dunn [2017] specialized to the case of binary data. We provide this formulation in the online companion A.

2.2.1 No big-Ms

In this section, we argue that the absence of big-M constraints in our formulation induces a stronger formulation. In the OCT formulation, for each datapoint and each leaf node there is a binary variable equal to one if the datapoint is assigned to that leaf node (regardless of whether the point is correctly classified or not), and zero otherwise. In addition, the authors introduce, for each leaf node, a variable representing the number of misclassified points at that node, defined via a pair of constraints. The number of correctly classified points is then the total number of points minus the sum of these variables. Note that the defining constraints are big-M constraints, activated or deactivated depending on whether the corresponding class label is assigned to the leaf or not.

The LP relaxation induced from counting correctly classified points can be improved. The number of such points, using the variables above, is


The right hand side of (2) is nonlinear (quadratic). Nonetheless, the quadratic function is supermodular, see Nemhauser et al. [1978], and its concave envelope can be described by introducing additional variables via a linear system.

The additional variables are precisely the flow variables used in formulation (1). Note that a naive application of this idea would require the introduction of additional variables for each such pair. However, by noting that the desired tree structure can be enforced using the new variables only, so that the original variables can be dropped, we achieve this strengthening without incurring the cost of a larger formulation.
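The mechanism at work is the standard concave envelope of a product of variables on the unit box. For two variables \(a, b \in [0,1]\) (generic symbols, since the paper's were lost in extraction), the concave envelope of the product \(ab\) is \(\min(a,b)\), described by the linear system

```latex
u \le a, \qquad u \le b, \qquad u \ge 0,
```

so that when \(u\) is maximized, as in an objective counting correctly classified points, one obtains \(u = \min(a,b)\), which coincides with \(ab\) whenever \(a\) and \(b\) are binary.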

2.2.2 Improved branching constraints

To correctly enforce the branching structure of the decision-tree, Bertsimas and Dunn [2017] use (after specializing their formulation to the case of binary data) constraints of the form


where denotes the set of ancestors of the leaf whose left branch is followed on the path from the root to that leaf. An interpretation of this constraint is as follows: if a datapoint reaches a leaf node, then at every node on the path where the datapoint took the left branch, no branching decision can be made that would cause the point to go right. Instead, we use constraint (1g).

We now show that (1g) induces a stronger LP relaxation. First, we focus on the left hand side of (1g): due to flow conservation constraints (1d), we find that

where, following the notation of Bertsimas and Dunn [2017], is the set of left descendants of . In particular, the left hand side of constraint (1g) is larger than the left hand side of (3). Now, we focus on the right hand side: from constraints (1b), we find that

In particular, the right hand side of (1g) is smaller than the right hand side of (3). Similar arguments can be made for constraint (1h). As a consequence, the linear inequalities for branching induced from formulation (1) dominate those proposed by Bertsimas and Dunn [2017].

2.2.3 Further Strengthening of the Formulation

Formulation (1) can be strengthened even more through the addition of cuts.

Let be any node such that and . Also, let and define as any subset of the rows such that: a) , and b) . Intuitively, is a set of points belonging to different classes that would all be assigned to the right branch if feature is selected for branching. Then, the constraint


is valid: indeed, if , then none of the points in can be assigned to the left branch; and, if , then at most one of the points in can be correctly classified.

None of the constraints in (1) implies (4). As a matter of fact, if all constraints (4) are added for all possible combinations of sets, nodes, and features, then some of the variables could be dropped from the formulation, along with constraints (1i) and (1c). Naturally, we do not add all constraints (4) a priori, but instead use cuts to enforce them as needed.

3 Benders’s Decomposition

The flow-based formulation (1) is effective at reducing the number of branch-and-bound nodes required to prove optimality when compared with existing formulations, and results in a substantial speedup on small- and medium-sized instances, see §4. However, on larger instances, the computational time required to solve the LP relaxations may become prohibitive, impairing its performance in branch-and-bound.

Recall from §2 that, once the branching and labeling variables are fixed, the problem decomposes into independent subproblems, one for each datapoint. Additionally, each subproblem is a maximum flow problem, for which specialized polynomial-time methods exist. Due to these characteristics, formulation (1) can naturally be tackled using Benders' decomposition, see Benders [1962]. In what follows, we explain the details of our implementation.

Observe that problem (1) can be written in an equivalent fashion by making the subproblems explicit as follows:

s.t. (5b)

where, for any fixed , and , is defined as the optimal objective value of the max-flow problem

s.t. (6b)

In formulation (6) we use the shorthand to represent upper bounds on the decision variables . These values can be interpreted as edge capacities in the flow problem, and are given as for all , and for all , and finally . Note that  if point is correctly classified given the tree structure and class labels induced by .

From the well-known max-flow/min-cut duality, we find that also equals the optimal value of the dual of the above max-flow problem, which is expressible as

s.t. (7b)

Problem (7) is a minimum cut problem, where variable is one if and only if node is in the source set (we implicitly fix ), and variable is one if and only if arc is part of the minimum cut. Note that the feasible region (7b)-(7h) of the minimum cut problem does not depend on the variables ; we denote this feasible region by .

We can now reformulate the master problem (5) as follows:


In the above formulation, constraint (8e) has been added to ensure bounded solutions in the relaxed master problem. Formulation (8) is implemented using row generation: constraint (8b) is initially dropped and then enforced by adding cuts on the fly. Row generation can be implemented in modern MIP solvers via callbacks, by adding lazy constraints at relevant nodes of the branch-and-bound tree. Identifying which constraint (8b) to add can in general be done by solving a minimum cut problem, for which specialized algorithms exist, see Goldberg and Tarjan [1988] and Hochbaum [2008]. However, if all master variables are integer, then due to the special structure of the network it is possible to solve the minimum cut problem in time linear in the depth of the tree – a constant for all practical purposes. We now describe the separation procedure used to find a most violated inequality (8b), or to conclude that all constraints are satisfied.

Proposition 1.

If the master solution is integer and satisfies (5b)-(5c), then the separation problem for inequalities (8b) can be solved in O(d) time, where d is the tree depth.


Since the solution is integer, all arc capacities in formulations (6) and (7) are either 0 or 1. Moreover, since it is feasible for (5), it completely defines a decision tree. We say that a node in this decision tree is terminal if its unique descendant is the sink.

Note that the network defined by (6) has the following structure: (i) for each node, there is a unique outgoing arc with capacity 1 (corresponding to the direction the datapoint would take if it reached that node), and the remaining arcs have capacity 0; (ii) there is a unique path of capacity-1 arcs starting at the source, and this path passes through the terminal node the datapoint is assigned to; (iii) if the class of the datapoint is assigned to that node, then the path continues to the sink; otherwise, the path ends there.

If the path connects source and sink, then the minimum cut has value 1, corresponding to any single arc on this path. Note that in this case constraint (8b) is satisfied, as the variables in (8e) have a trivial upper bound of 1; cuts corresponding to this case are never added. If the path does not connect source and sink, then a minimum cut with value 0 can be found by assigning the nodes on the path to the source set and the remaining nodes to the sink set. The arcs in the minimum cut are all arcs adjacent to nodes on the path but not part of it. All such arcs indeed have capacity 0, as the single capacity-1 arc leaving each path node is precisely the one on the path. Since the path has length linear in the depth and the out-degree of all nodes is 3, the minimum cut can be found in O(d) time using any graph search algorithm. ∎
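A sketch of this separation procedure (hypothetical names, heap indexing, and a left-on-0 test, all assumptions as before): follow the unique capacity-1 path from the source; if it reaches the sink, the minimum cut is 1 and no cut is added; otherwise the value-0 cut consists of the arcs leaving the path nodes that are not on the path.

```python
def separate(b, w, x, label):
    """Follow the unique capacity-1 path from the source. Returns
    (mincut_value, source_set): source_set is None when the minimum
    cut already has value 1, i.e. no inequality (8b) is violated."""
    path = ['s']                        # the source belongs to the source set
    n = 1
    while n in b:                       # branching nodes: follow the test
        path.append(n)
        n = 2 * n if x[b[n]] == 0 else 2 * n + 1
    path.append(n)                      # node carrying a class label
    if w.get(n) == label:               # path reaches the sink: cut value 1
        return 1, None
    return 0, path                      # cut value 0: a violated cut exists
```

In a full implementation, the returned source set would be translated into a lazy constraint (8b) and added to the master problem.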

Figure 2 illustrates graphically the algorithm outlined in the proof of Proposition 1.

Figure 2: Pictorial description of the separation procedure. Green arcs have capacity 1 (and others capacity 0), and red arcs are those in the minimum cut. On the left, examples with minimum cut value equal to one – constraint (8b) is satisfied. On the right, examples with minimum cut values of 0 – new constraints will be added.

4 Experiments

Approaches and Datasets.

We evaluate our two approaches on eight publicly available datasets. The number of rows, number of one-hot encoded features, and number of class labels for each dataset are given in Table 1. We compare the flow-based formulation (FlowOCT) and its Benders' decomposition (Benders) against the formulations proposed by Bertsimas and Dunn [2017] (OCT) and Verwer and Zhang [2019] (BinOCT). As the code used for OCT is not publicly available, we implemented the corresponding formulation (adapted to the case of binary data). The details of this implementation are given in the online companion A.

monk3 122 15 2
monk1 124 15 2
monk2 169 15 2
house-votes-84 232 16 2
balance-scale 625 20 3
tic-tac-toe 958 27 2
car_evaluation 1728 20 4
kr-vs-kp 3196 38 2
Table 1: Datasets used in the experiments.
Experimental Setup.

Each dataset is split into three parts: the training set (50%), the validation set (25%), and the testing set (25%). The training and validation sets are used to tune the value of the regularization hyperparameter; we test a range of values for it. We repeat this process 5 times with 5 different samples. Finally, we use the best hyperparameter value to train a tree on the combined training and validation sets from the previous step, which we then evaluate on the testing set to determine the out-of-sample accuracy. All approaches are implemented in Python and solved with Gurobi 8.1. All problems are solved on a single core of an HPE SL250s Xeon CPU with 4 GB of memory and a one-hour time limit.
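The setup above can be sketched in a few lines. This is a simplified stand-in: `train_and_score` is a placeholder for fitting a tree with the MIP solver and reporting held-out accuracy, and all names are hypothetical.

```python
import random

def split_dataset(rows, seed=0):
    """50% train / 25% validation / 25% test, as in the experiments."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    n = len(rows)
    return rows[: n // 2], rows[n // 2 : 3 * n // 4], rows[3 * n // 4 :]

def tune_and_evaluate(rows, lambdas, train_and_score, seed=0):
    """Pick the regularization weight with the best validation accuracy,
    retrain on train + validation, and report test accuracy."""
    train, val, test = split_dataset(rows, seed)
    best = max(lambdas, key=lambda lam: train_and_score(train, val, lam))
    return train_and_score(train + val, test, best), best
```

In the paper's protocol this whole procedure is repeated over five random samples and the results averaged.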

(a) Performance profile with .
(b) Performance profile with .
(c) Optimality gaps as a function of the size.
Figure 6: Summary of optimization performance.
In-Sample (Optimization) Performance.

Figure 6 summarizes the in-sample performance, i.e., how good the methods are at solving the optimization problems. Detailed results are provided in the online companion B. From Figure 6(a), we observe that BinOCT is able to solve 79 instances within the time limit (and outperforms OCT), while Benders solves the same number of instances in only 140 seconds, resulting in a substantial speedup. Similarly, from Figure 6(b), it can be seen that OCT is able to solve 666 instances within the time limit (BinOCT does not include the option to have a regularization parameter and is therefore omitted), while Benders requires only 70 seconds to do so. Finally, Figure 6(c) shows the optimality gaps proven as a function of the instance dimension. We observe that all methods close the gap in small instances. As the dimension increases, BinOCT (which relies on weak formulations but fast enumeration) yields 100% optimality gaps in most cases. OCT and FlowOCT prove better gaps, but their performance degrades substantially as the dimension increases. Benders delivers the best performance, proving optimality gaps of 20% or less regardless of dimension.

Out-of-Sample (Statistical) Performance.

Table 2 reports the out-of-sample accuracy after cross-validation. Each row represents the average over the five samples. We observe that the better optimization performance translates into superior statistical properties as well: OCT is the best method in two instances (excluding ties), BinOCT in six, while the new formulations FlowOCT and Benders are better in 13 (of which Benders accounts for 10, and is second after FlowOCT in an additional two).

Dataset Depth OCT BinOCT FlowOCT Benders
monk3 2 92.3 92.3 92.3 92.3
monk3 3 83.2 91 91 91
monk3 4 91 85.2 92.3 91
monk3 5 87.1 87.7 92.3 91.6
monk1 2 71 72.3 72.3 71
monk1 3 83.2 82.6 81.3 81.3
monk1 4 100 99.4 100 100
monk1 5 93.5 96.8 100 100
monk2 2 56.7 49.8 56.7 56.7
monk2 3 62.3 58.1 63.7 63.3
monk2 4 59.5 60.5 58.6 64.2
monk2 5 63.3 55.8 62.3 61.9
house-votes-84 2 79.3 96.2 97.2 97.2
house-votes-84 3 97.2 94.1 97.2 97.2
house-votes-84 4 96.9 94.8 96.9 95.5
house-votes-84 5 95.2 93.1 96.9 97.2
balance-scale 2 68.7 67.9 68.7 68.7
balance-scale 3 69 71.5 69.8 71
balance-scale 4 68.5 73.9 73.2 71.7
balance-scale 5 65.7 75.3 71.6 76.8
tic-tac-toe 2 66.7 65.9 66.7 66.7
tic-tac-toe 3 68.1 72.2 68.5 72.6
tic-tac-toe 4 70.4 80.3 68.7 77.1
tic-tac-toe 5 69.7 78.9 66.3 79.3
car_evaluation 2 76.5 76.5 76.5 76.5
car_evaluation 3 73.3 78.4 76.7 79.1
car_evaluation 4 75.2 80.3 71.6 79.7
car_evaluation 5 74.8 81.3 61.6 80.5
kr-vs-kp 2 73.7 87.2 70.5 87.2
kr-vs-kp 3 69.3 87.8 61.2 89.9
kr-vs-kp 4 64.7 90.8 54.3 91
kr-vs-kp 5 62.7 87.1 45.8 86.7
Table 2: Out-of-sample accuracy.




  • Aghaei et al. [2019] Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. Learning optimal and fair decision trees for non-discriminative decision-making. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1418–1426, 2019.
  • Azizi et al. [2018] Mohammad Javad Azizi, Phebe Vayanos, Bryan Wilder, Eric Rice, and Milind Tambe. Designing fair, efficient, and interpretable policies for prioritizing homeless youth for housing resources. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 35–51. Springer, 2018.
  • Benders [1962] Jacques F Benders. Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik, 4(1):238–252, 1962.
  • Bertsimas and Dunn [2017] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
  • Bertsimas and Stellato [2018] Dimitris Bertsimas and Bartolomeo Stellato. The voice of optimization. arXiv preprint arXiv:1812.09991, 2018.
  • Bixby [2012] Robert E Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2012.
  • Breiman [1984] Leo Breiman. Classification and regression trees. Technical report, 1984.
  • Breiman [2017] Leo Breiman. Classification and regression trees. Routledge, 2017.
  • Ciocan and Mišić [2018] Dragos Ciocan and Velibor Mišić. Interpretable optimal stopping. 2018.
  • CPLEX [2009] IBM ILOG CPLEX. V12.1: User's manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
  • Goldberg and Tarjan [1988] Andrew V Goldberg and Robert E Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.
  • Günlük et al. [2018] Oktay Günlük, Jayant Kalagnanam, Matt Menickelly, and Katya Scheinberg. Optimal decision trees for categorical data via integer programming. arXiv preprint arXiv:1612.03225, 2018.
  • Gurobi Optimization [2015] Gurobi Optimization, Inc. Gurobi optimizer reference manual. URL http://www.gurobi.com, 2015.
  • Hochbaum [2008] Dorit S Hochbaum. The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations research, 56(4):992–1009, 2008.
  • Hu et al. [2019] Xiyang Hu, Cynthia Rudin, and Margo Seltzer. Optimal sparse decision trees. arXiv preprint arXiv:1904.12847, 2019.
  • Hyafil and Rivest [1976] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 1976.
  • Kuhn et al. [2018] Max Kuhn, Steve Weston, Mark Culp, Nathan Coulter, and Ross Quinlan. Package ‘c50’, 2018.
  • Liaw and Wiener [2002] Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
  • Mišic [2017] Velibor V Mišic. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883, 2017.
  • Narodytska et al. [2018] Nina Narodytska, Alexey Ignatiev, Filipe Pereira, and Joao Marques-Silva. Learning optimal decision trees with SAT. In IJCAI, pages 1362–1368, 2018.
  • Nemhauser et al. [1978] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
  • Nijssen and Fromont [2010] Siegfried Nijssen and Elisa Fromont. Optimal constraint-based decision tree induction from itemset lattices. Data Mining and Knowledge Discovery, 21(1):9–51, 2010.
  • Quinlan [1986] John Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • Quinlan [2014] John Ross Quinlan. C4.5: Programs for machine learning. Elsevier, 2014.
  • Therneau et al. [2015] Terry Therneau, Beth Atkinson, Brian Ripley, and Maintainer Brian Ripley. Package ‘rpart’. 2015.
  • Verhaeghe et al. [2019] Hélene Verhaeghe, Siegfried Nijssen, Gilles Pesant, Claude-Guy Quimper, and Pierre Schaus. Learning optimal decision trees using constraint programming. In The 25th International Conference on Principles and Practice of Constraint Programming (CP2019), 2019.
  • Verwer and Zhang [2017] Sicco Verwer and Yingqian Zhang. Learning decision trees with flexible constraints and objectives using integer optimization. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 94–103. Springer, 2017.
  • Verwer and Zhang [2019] Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation. In 33rd AAAI Conference on Artificial Intelligence, 2019.