In classification problems, the goal is to decide the class membership of a set of observations, by using available information on features and class membership of a training data set. Decision trees are one of the most popular models for solving this problem, due to their effectiveness and high interpretability. In this work, we focus on constructing univariate binary decision trees of prespecified depth.
In a univariate binary decision tree, each internal node contains a test regarding the value of one single feature of the data set, while the leaves contain the target classes. The problem of constructing (learning) a classification tree (CTCP) is the problem of finding a set of optimal tests (decision checks) such that the assignment of target classes to rows satisfies a certain criterion. A commonly encountered objective is accuracy, measured as the number of correct predictions in a training set.
As the problem of learning optimal decision trees is NP-complete [HyaRi76], heuristics such as CART [Breiman84] and ID3 [Quinlan86] are widely used. These greedy algorithms build a tree recursively, starting from a single node. At each internal node, the (locally) best decision split is chosen by solving an optimization problem on a subset of the training data. This process is repeated at the child nodes until a stopping criterion is satisfied. Although greedy algorithms are computationally efficient, they do not guarantee finding an optimal tree. In recent years, constructing decision trees by using mathematical programming techniques, especially Integer Optimization, has become a hot topic among researchers (see Menick16, VerZhaYe17, BertDunn17, VerZha17, and Dash18).
In this paper, our contribution is threefold. Firstly, we propose a novel ILP formulation for constructing classification trees that is suitable for a Column Generation approach. Secondly, we show that by using only a subset of the feature checks (decision checks), solutions of good quality can be obtained within short computation time. Thirdly, we provide ILP-based solutions for large data sets that have not previously been solved via optimization techniques.
As a result, we can construct classification trees with higher performance in shorter computation times compared to the state-of-the-art approach of BertDunn17, and we are capable of handling much larger datasets than VerZha17.
This paper is organized as follows. Section 2 reviews the existing literature and discusses the state-of-the-art algorithms for constructing decision trees. Our basic notation and important concepts related to decision trees are introduced in Section 3. Sections 4 and 5 present the mathematical models and our solution approach. Section 6 reports the experimental results obtained with our method and compares them to recent results in the literature. Finally, our conclusions and further research directions are discussed in Section 7.
2 Related work
Finding optimal decision trees is known to be NP-hard (HyaRi76). This led to the development of heuristics that run in short time and output reasonably good solutions. An important limitation in constructing decision trees is that the decision splits at internal nodes do not contain any information regarding the quality of the solution, e.g. partial solutions or lower bounds on the objective. This results in a lack of guidance for constructive algorithms [Breiman84]. To alleviate this shortcoming, greedy algorithms use goodness measures for making the (local) split decisions. The most common measures are the Gini Index, used by CART (Breiman84), and Information Gain, used by ID3 (Quinlan86). In order to increase the generalization power of a decision tree, a pruning post-processing step is usually applied after a greedy construction.
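To make the role of a goodness measure concrete, the following minimal sketch ranks candidate thresholds on a single feature by weighted Gini impurity. It is an illustration only and not the CART implementation: the function names `gini` and `best_threshold`, the use of observed values as candidate thresholds, and the convention that a row goes left when its value is at most the threshold are all assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`.

    Candidate thresholds are the observed feature values, an assumption
    made here for simplicity.
    """
    n = len(values)
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

For example, `best_threshold([1, 2, 3, 4], [0, 0, 1, 1])` returns `(2, 0.0)`, since splitting at 2 separates the two classes perfectly. A greedy algorithm would repeat such a local search at every node, which is exactly why it cannot guarantee a globally optimal tree.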
Norton89 proposed adding a lookahead procedure to the greedy heuristics; however, no significant improvements are reported [Murthy95]. Other optimization techniques used in the literature to find decision trees are integer linear programming (ILP for short), dynamic programming [Payne77], and stochastic gradient descent based methods [Norouzi15]. Several ILP approaches have been recently proposed in the literature. BertDunn17 study constructing optimal classification trees with both univariate and multivariate decision splits. The authors do not assume a fixed tree topology, but control the topology through a tuned regularization parameter in the objective. As the magnitude of this parameter increases, more leaf nodes may have no samples routed to them, resulting in shallower trees. An improvement of w.r.t. CART is obtained on out-of-sample data for univariate tests, and an improvement of 3-5% for multivariate tests. The paper of BertDunn17 is the main reference for benchmarking our method (see Section 6).
By exploiting the discrete nature of the data, Gunluk18 propose an efficient MILP formulation for the problem of constructing classification trees for data with categorical features. At each node, decisions can be taken based on a subset of features (combinatorial checks). The number of integer variables in the obtained MILP is independent of the size of the training data. Besides the class estimations for the leaf nodes, a fixed tree topology is given as input to the ILP model. Four candidate topologies are considered, from which one is eventually chosen after cross-validation. Numerical experiments indicate that, when classification can be achieved via a small interpretable tree, their algorithm outperforms CART.
In another recent study, Dash18 propose an ILP model for learning boolean decision rules in disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) or conjunctive normal form (CNF, AND-of-ORs) as an interpretable model for classification. The proposed ILP takes into account the trade-off between accuracy and the simplicity of the chosen rules and is solved via the column generation method. The authors formulate an approximate pricing problem by randomly selecting a limited number of features and data instances. Computational experiments show that this CG procedure is highly competitive with other state-of-the-art algorithms.
Our ILP builds on the ideas in VerZha17, where an efficient encoding is proposed for constructing both classification and regression (binary) trees of univariate splits of depth . As a result, the number of decision variables in their ILP is reduced to , compared to variables used in BertDunn17. Preliminary results indicate that this method obtains good results on trees up to depth 5 and smaller data sets of size up to 1000.
Besides mathematical optimization, Satisfiability (SAT) and Constraint Programming (CP) techniques have also been recently employed to solve learning problems. Chang2012 introduce a constrained conditional model framework that considers domain knowledge in a conditional model for structured learning in the form of declarative constraints. They show how the proposed framework can be used to solve prediction problems. Bessiere09 focus on the problem of finding the smallest size decision tree that best separates the training data. They formulate this problem as a constraint program. In narodytska2018learning, the authors model the smallest-size decision tree learning problem as a satisfiability problem, and solve the model using a SAT solver. Many studies investigate using CP solvers for item set mining and pattern set mining, e.g., [Raedt2010, Guns11]. In these works, the learning problems are declaratively specified by means of the constraints that they need to satisfy. In [DuongVrain17] the authors introduce a model for solving the constrained clustering problem based on a CP framework. A great advantage of using CP based models is the flexibility to integrate different constraints and to choose different optimization criteria.
In this section we describe the basic concepts of our work and introduce the necessary notation.
3.1 Binary tree topology
The set of data instances (or rows) is denoted by and the set of features by . Each data row has a certain value for every feature in . In this paper, we consider numerical features. This is without loss of generality, since ordinal and categorical features in a dataset can be transformed into numerical ones, using for example one-hot encoding.
The numerical value of row for feature is denoted by . Besides the features in , every data row has an associated target class, which is the label to be predicted in a classification task. So denotes the target class of row and the predicted target class of row is denoted by .
In this paper we consider complete binary decision trees of prespecified depth . For a decision tree of depth , let denote the set of all nodes and let be the set of internal nodes. Every internal node has two child nodes: left and right, denoted by and respectively. Let denote the set of leaf nodes. A target class (prediction output) is assigned to every leaf node in the decision tree.
We denote the path from the root node to a leaf node by , where denotes the internal node at level () on the path. When we say that path makes a left turn at level and a right turn otherwise.
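The topology described above can be sketched in a few lines. The heap-style numbering used here (root 1, children 2j and 2j+1 of node j) and the function names are assumptions made for illustration; the paper's own node labels may differ.

```python
def internal_nodes(depth):
    """Internal nodes of a complete binary tree of the given depth."""
    return list(range(1, 2 ** depth))


def leaf_nodes(depth):
    """Leaf nodes of a complete binary tree of the given depth."""
    return list(range(2 ** depth, 2 ** (depth + 1)))


def path_to_leaf(leaf):
    """Internal nodes on the root-to-leaf path, each with the turn taken there.

    Returns a list of (node, turn) pairs ordered from the root downwards,
    where turn is "L" for a left turn and "R" for a right turn.
    """
    steps = []
    node = leaf
    while node > 1:
        parent = node // 2
        steps.append((parent, "L" if node == 2 * parent else "R"))
        node = parent
    return steps[::-1]
```

In a tree of depth 2 under this numbering, `path_to_leaf(6)` gives `[(1, "R"), (3, "L")]`: the path turns right at the root and left at node 3.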
3.2 Decision checks and decision paths
In a binary decision tree, a partition of the data is first obtained based on a feature value check at the root node. Depending on the result of the test, each element of the partition is directed to one of the children. This process is repeated at each internal node, on the data that was directed towards that node. A feature check (also called decision check or decision split) involving only one feature is called univariate; otherwise the split is called multivariate. In this work, we only consider decision trees with univariate splits.
We will denote a (univariate) decision check at an internal node by the triple , where denotes the feature and the threshold value for , that is, the value to which is compared to. The check is performed as follows: if, for a row , , is directed to the left child ; otherwise, it is directed to the right child . The triple is called a feasible decision check at internal node , if , and threshold value is an element of the set of all values of feature in the data, i.e., .
We denote the set of feasible decision checks at internal node as , . Once a tree is constructed using given feasible decision checks, the split decision at internal node is denoted by .
A direct formulation of constructing decision trees with respect to decision splits leads to a high number of decision variables (see [BertDunn17]). In this paper, we propose an alternative formulation, based on decision paths.
Given a leaf and the associated path in , we define a decision path to node , as a sequence of distinct univariate decision checks at the nodes of . In other words, , where is the decision check at node . We say that is the ending leaf of and denote it by . Figure 1 shows a highlighted decision path in a depth-2 tree.
Let denote the set of decision paths from the root node to the leaf node . Note that each path in corresponds to only one sequence , which will be denoted by . The reverse is not true, as given one path , one can associate many decision paths.
For a decision path , let denote the subset of rows directed through the nodes of to the leaf node . The prediction output and the number of correct predictions associated to are given by
Based on the above discussion, we can represent each decision tree as a collection of decision paths that have the same splits at common internal nodes:
where is the decision split at node in the decision tree . Although the cardinality of is exponential in the number of features and data size, (2) is critical to our solution approach, as it allows us to search for the optimal set of decision paths instead of the optimal set of decision splits. As we will see in the next section, this makes the ILP formulation suitable for column generation.
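The quantities attached to a decision path (the rows that reach its ending leaf, its prediction output, and its number of correct predictions) can be computed as in the sketch below. The data layout, the function names, and the convention that a row goes left when its value is at most the threshold are assumptions of this illustration; ties in the majority vote are broken by first occurrence.

```python
from collections import Counter


def rows_reaching_leaf(checks, turns, X):
    """Indices of the rows that follow every turn of the path.

    checks: list of (feature_index, threshold), one per level.
    turns:  list of "L"/"R", the direction the path takes at each level.
    Assumption: a row goes left when its value is <= threshold.
    """
    reached = []
    for r, row in enumerate(X):
        if all((row[f] <= t) == (turn == "L")
               for (f, t), turn in zip(checks, turns)):
            reached.append(r)
    return reached


def path_prediction(reached, y):
    """Majority class among the reached rows and the number of rows it predicts correctly."""
    if not reached:
        return None, 0
    label, count = Counter(y[r] for r in reached).most_common(1)[0]
    return label, count
```

With `X = [[1, 5], [2, 4], [1, 6], [3, 9], [0, 0]]`, `y = [1, 1, 0, 1, 0]`, checks `[(0, 2), (1, 3)]` and turns `["L", "R"]`, rows 0, 1 and 2 reach the leaf, the prediction is class 1, and two rows are predicted correctly.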
3.3 Problem definition
Informally, the classification tree construction problem (CTCP) we are interested in can be viewed as: given a dataset with features and a set of decision checks for each internal node in a tree of a given depth, find a collection of decision paths such that the number of correct predictions at the leaf nodes is maximized. It is stated formally as follows.
Problem: Classification Tree Constructing Problem (CTCP)
Instance: A tree of depth with its topology as defined in this section, a set of data rows , a set of features , a set of decision check alternatives for every of tree .
Find a collection of decision paths such that (i) decision paths in satisfy condition (2), and (ii) is maximized, where is given by (1).
4 Column generation based approach
The presentation of our CG approach is organized as follows: in Section 4.1 we give the master ILP formulation of the CTCP. In Section 4.2 we derive the corresponding pricing subproblem and prove that it is NP-hard; we then formulate the pricing problem as an ILP and discuss the column management in our CG procedure.
As is usual in a CG approach, the master problem and the pricing problem are solved iteratively, where the former passes the dual variables to the latter in order to find paths with positive reduced costs; a subset of these paths is added to the master MILP to improve the objective. The optimality of the master model is proven by showing that no paths with positive reduced cost can be found. We refer to Desrosiers05 for more details on the CG technique.
4.1 Master formulation
The master model chooses a collection of decision paths that give a feasible decision tree, as defined by (2).
Table 1 lists the sets, parameters, and decision variables of the LP model to construct decision trees.
|set of rows in data file, indexed by .|
|set of features in data file, indexed by .|
|leaf and internal (non-leaf) nodes in the decision tree, indexed by .|
|set of decision paths ending in leaf , indexed by .|
|subset of rows directed through the nodes of to the leaf node .|
|set of decision checks for paths passing internal node .|
||the depth of the decision tree, levels are indexed by .|
|number of correct predictions/true positives of path :|
||indicates that path is chosen.|
The following lines present the master ILP model.
The objective function (3) maximizes accuracy (number of rows correctly predicted). Constraint (4) imposes that a path has to be selected for each leaf. Constraint (5) ensures that each row is directed to one single leaf. Constraint (6) is related to the consistency of the tree: all selected paths passing a certain internal node must share the same split at that node. This constraint is an essential feature of our model.
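The requirements expressed by the objective (3) and constraints (4)-(6) can be illustrated by checking a candidate integer solution directly. The sketch below is not the ILP itself and does not call a solver; the dictionary layout and the function name `check_tree` are assumptions.

```python
def check_tree(selected, leaves, n_rows):
    """Check a set of chosen decision paths against the master constraints.

    selected: dict leaf -> {"checks": {node: (feature, threshold)},
                            "rows": set of row indices, "correct": int}
    Returns the objective value (total correct predictions), or raises
    ValueError when a constraint is violated.
    """
    # One path has to be selected for each leaf.
    if set(selected) != set(leaves):
        raise ValueError("exactly one path per leaf is required")
    # Every row is directed to exactly one leaf.
    seen = [r for p in selected.values() for r in p["rows"]]
    if sorted(seen) != list(range(n_rows)):
        raise ValueError("rows must be partitioned over the leaves")
    # Paths passing the same internal node must share the split there.
    split_at = {}
    for p in selected.values():
        for node, check in p["checks"].items():
            if split_at.setdefault(node, check) != check:
                raise ValueError("inconsistent split at node %d" % node)
    return sum(p["correct"] for p in selected.values())
```

The third check is the consistency requirement: two selected paths that pass through the same internal node are rejected as soon as they disagree on the decision check used there.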
In our CG approach, we use an LP relaxation of the above ILP, in which constraints (7) are relaxed to . Clearly, by (4). Note that there is no need to impose any bounds on , as follows from (6) and the non-negativity of , while follows from the fact that the sum in the left hand side in (6) is bounded by 1, as a consequence of (4).
4.2 Pricing subproblem
We associate the dual variables , and with constraints (4)-(6) respectively. Given that the number of paths in the sets , are exponentially many, these sets are not enumerated at all. Instead, we only find the paths in that are promising for increasing the objective value. For a path, the degree of being promising is quantified by a positive reduced cost, where the reduced cost associated to a decision path is defined as:
We call a path with the highest positive reduced cost the most promising path. The objective of the pricing problem becomes
In the following we study the complexity of a special case of the pricing problem, in which all dual values ’s and ’s are set to 0, and all the variables are set to . We call this special case the “Decision path constructing pricing problem (DPP)”.
Problem: Decision Path Pricing problem (DPP)
Instance: A binary tree , a set of data rows , a set of features , a leaf node , the corresponding path and a set of splits for every in . A real number .
Question: Does there exist a decision path in such that , where is given by (9)?
The DPP problem is strongly NP-hard.
The proof uses a reduction from Exact Cover by 3-Sets (3XC) to DPP. 3XC is a well-known NP-complete problem in the strong sense [NPcomp].
Exact Cover by 3-Sets: Given a set and a collection of 3-element subsets of , does there exist a subset of where every element of occurs in exactly one member of ?
Given an instance of the problem, we now present a polynomial time transformation to an instance of the DPP problem. By the definition of a decision path, all decision checks have to be distinct at internal nodes.
Rows and compatibility: For every element in we create a distinct row, so . We say that two rows and are compatible if the corresponding elements in are disjoint, and it is denoted by .
Features and feature values: For every row in , we define a distinct feature , hence . For each row, the value of a feature is defined as
Leaf, depth, decision check alternatives: Consider a binary tree of depth . Let be the leaf that is reached after left turns and the path from the root to the parent of . Note that (recall ). The decision check alternatives at every node are given by .
Choose . The objective of the DPP instance turns out to maximize the number of rows reaching the leaf . Moreover, this number cannot be greater than , which is equal to the highest number of compatible rows. Hence, the question can be reformulated as “Does there exist a decision path that directs exactly rows to the leaf node ?”.
Now let denote the set of rows corresponding to the elements in . Note that since the elements of are disjoint, these rows are compatible. Next select at each internal node exactly one decision split for every in . Observe that each row in is exactly in decision checks , implying that either or . Moreover, is directed left at all internal nodes due to feature values and therefore reaches leaf . The decision path constructed in this way is a YES instance to the decision version of DPP. The other direction is trivial, since the subsets of corresponding to the rows that reach leaf give an exact cover for the instance .
Formulating the pricing problem as a MILP.
Next we present a MILP formulation of the pricing problem described in the previous section. This model will be solved to optimality in order to guarantee that the master model is solved optimally in the course of the CG procedure. Note that in order to find the optimal value, it suffices to solve the DPCP for every leaf separately. Furthermore, for each leaf , we decompose the DPCP into more optimization problems, each corresponding to a target class. Table 2 explains the necessary notation of the pricing MILP model.
||set of rows in data file with target , indexed by .|
|set of features in data file, indexed by .|
|set of decision check alternatives .|
|value of feature of row .|
|indicates that row reaches leaf .|
|indicates that is selected as decision check at , for all .|
In the MILP formulation of the pricing problem, every internal node in the sequence corresponds to a level. The case () implies that the path to leaf makes a left (right) turn at the level of internal node . The MILP formulation of the pricing problem is
Objective (11) aims to maximize the reduced cost associated to a feasible decision path. Constraint (12) ensures that exactly one decision split has to be performed at each level. Constraints (13), (14) and (15) take care that the rows directed through the nodes of the path are consistent with the decision splits. Finally, the constraints (16) enforce that the splits performed at internal nodes are distinct.
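For a given decision path, the reduced cost that the pricing problem maximizes can be evaluated as in the following sketch. It follows the generic rule for a maximization master problem (objective coefficient minus the dual values of the constraints in which the column appears). The dictionary-based indexing of the duals and the sign convention are assumptions of this sketch and depend on how constraints (4)-(6) are written.

```python
def reduced_cost(correct, leaf, rows, checks, alpha, beta, gamma):
    """Reduced cost of a decision path for a maximization master problem.

    correct: number of correct predictions of the path (objective coefficient).
    leaf:    ending leaf of the path.
    rows:    indices of the rows that reach the leaf.
    checks:  dict node -> (feature, threshold) used along the path.
    alpha[leaf], beta[row], gamma[(leaf, node, check)]: dual values; the
    indexing and the sign convention are assumptions of this sketch.
    """
    cost = correct - alpha.get(leaf, 0.0)
    cost -= sum(beta.get(r, 0.0) for r in rows)
    cost -= sum(gamma.get((leaf, node, check), 0.0)
                for node, check in checks.items())
    return cost
```

A path is worth adding to the master problem only when this value is positive.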
Post processing of the pricing MILP.
Let be the decision path found by solving the pricing MILP model for the pair . It may happen that the target output class of differs from , i.e. and . In such a case, the decision path can have a higher reduced cost value due to a higher value in the first summation in objective (11). Therefore a post processing step is executed by checking the correct predictions of all target classes in the row set . Since we solve the MILP for every pair, our post processing has no impact on the optimality proof.
Pricing heuristic and column management.
Instead of solving the pricing problem for all columns, we do so only for a selected column pool of a fixed size, say . In each step of the pricing heuristic, the pool is updated. The update procedure starts by selecting a subset of leaves , corresponding to the ones for which columns with high (positive) reduced costs are found in the previous iteration. Then, a leaf is chosen uniformly at random from the set and a decision path to the selected leaf is constructed by choosing uniformly at every internal node a decision check in . If the constructed decision path is not correct according to the definition given in Section 3 because the same decision check appears several times along the path, its reduced cost is artificially set to . The columns with highest positive reduced costs are then added to the master problem (if the number of columns with positive costs is lower than , add all columns with positive reduced costs). Finally, columns with low reduced costs are removed to obtain a pool of size .
The pricing heuristic is used as long as it delivers promising columns, that is, columns with positive reduced costs. If no promising column is found after running the pricing heuristic a given number of times, the algorithm switches to the MILP formulation of the pricing model. If the MILP model also fails to find a promising column, then the solution to the master problem is optimal. Otherwise, we empty the column pool and adjust the pricing heuristic such that leaves for which decision paths with high (positive) reduced cost are found have priority to be considered. Figure 2 contains the flow chart of our pricing solution procedure and column management.
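One round of the randomized pricing heuristic can be sketched as follows. This is a simplified illustration under several assumptions: nodes use heap numbering (root 1, children 2j and 2j+1), the callable `score` stands in for the reduced cost computation, invalid paths are discarded rather than given an artificial reduced cost, and the pool update between rounds is omitted. All function and parameter names are hypothetical.

```python
import random


def sample_path(leaf, candidates, rng):
    """Draw one random decision path ending at `leaf`.

    candidates: dict node -> list of (feature, threshold) checks allowed there.
    Returns dict node -> check, or None when a check repeats along the path.
    """
    path, node = {}, leaf // 2
    while node >= 1:
        path[node] = rng.choice(candidates[node])
        node //= 2
    if len(set(path.values())) < len(path):
        return None  # invalid: decision checks on a path must be distinct
    return path


def pricing_round(leaves, candidates, score, n_samples, n_add, rng):
    """Sample paths and return up to `n_add` with the highest positive score."""
    pool = []
    for _ in range(n_samples):
        leaf = rng.choice(leaves)
        path = sample_path(leaf, candidates, rng)
        if path is not None:
            cost = score(leaf, path)
            if cost > 0:
                pool.append((cost, leaf, path))
    pool.sort(key=lambda item: item[0], reverse=True)
    return pool[:n_add]
```

When `pricing_round` returns an empty list repeatedly, the procedure described above would fall back to the exact MILP pricing model.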
5 Selecting the restricted set of parameters
In the literature, it has been shown that Column Generation brings significant computational efficiency in solving the studied problems optimally. However, our preliminary experiments indicated that a standard CG approach based on the master and pricing problems described in the previous section has difficulty finding optimal solutions in reasonable time.
In order to understand the complexity of our problem, we compare the master ILP for our problem with the master ILPs used in the CG approach for two other classical problems: vehicle routing and worker scheduling. The master model of the CTCP is characterized by the following two important points: (i) every row arrives at exactly one leaf, and (ii) all chosen decision paths must have the same decision checks at common internal nodes. In the vehicle routing problem, the master model imposes that every customer is served by exactly one vehicle, which leads to a set partitioning of customers [Desrochers92, SplietGabor2014]. Similarly, in scheduling problems, workers are part of at most one team [Firat16]. While routes in vehicle routing and teams in scheduling should be distinct, decision paths in the CTCP are highly dependent on each other through the synchronized decision checks at internal nodes (see constraints (6)). Moreover, to formulate these constraints, the set of variables are needed. When all the decision checks are considered, the number of these variables is of magnitude . The high dependency between decision paths and the extra variables for decision check synchronization increase the complexity of the master model of the CTCP considerably. During our exploratory experiments, we observed that the master ILP model has a high number of decision variables that are not found by generating columns. This is not the case in the majority of CG based applications.
Therefore, to alleviate the high complexity of CTCP, we propose to use a restricted set of decision checks at each node , namely . The consequence is that we cannot guarantee that the resulting trees are optimal, even if we solve our ILP model to optimality. In this way, we use CG like a large neighbourhood search engine that includes the initially provided CART solution and all classification trees reachable by selecting the decision check alternatives at internal nodes, i.e. for all .
To find a good restricted set of decision checks at node , we make use of the CART algorithm. For simplicity, we will call this process threshold sampling, despite the fact that we sample from both sets of features and thresholds.
In the threshold sampling procedure, we run the CART algorithm [sklearn] on a randomly selected large portion of the data, i.e. (line 4, Algorithm 1) and collect the decision splits appearing in the obtained tree (lines 5 and 6, Algorithm 1). This procedure is repeated while a new decision split appears at the root node in fewer than iterations. We then retain the splits that are most frequently used at each internal node. While it is possible to keep all decision splits at the root node, as their number is small, we only keep a limited number of the decision splits appearing at the internal nodes of the constructed CART trees. More precisely, we keep at every internal node , the splits with the highest frequency (line 13, Algorithm 1). This stopping rule is based on the observation that the splits at the root and at nodes close to the root are the most decisive for the structure of the tree. For each node , the obtained decision splits form the set of restricted decision checks .
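The threshold sampling loop can be sketched as follows. This is an interpretation, not a transcription of Algorithm 1: `fit_tree` is a hypothetical callable that plays the role of CART and returns the splits of one fitted tree, the stopping rule is read as "stop once `patience` consecutive trees bring no new root split", and the root is assumed to be node 1. The parameter names `portion`, `patience` and `keep` are likewise assumptions.

```python
import random
from collections import Counter, defaultdict


def threshold_sampling(rows, fit_tree, portion, patience, keep, rng):
    """Collect frequent splits from trees fitted on random subsamples.

    fit_tree(sample) is a hypothetical callable returning a dict
    node -> (feature, threshold) for the tree fitted on `sample`.
    All root splits are kept; other nodes keep their `keep` most frequent.
    """
    counts = defaultdict(Counter)
    stale = 0
    while stale < patience:
        sample = rng.sample(rows, max(1, int(portion * len(rows))))
        tree = fit_tree(sample)
        root = tree.get(1)
        # Count how long it has been since a new root split appeared.
        if root is None or root in counts[1]:
            stale += 1
        else:
            stale = 0
        for node, split in tree.items():
            counts[node][split] += 1
    restricted = {}
    for node, counter in counts.items():
        top = counter.most_common(None if node == 1 else keep)
        restricted[node] = [split for split, _ in top]
    return restricted
```

In practice `fit_tree` could wrap a scikit-learn decision tree fitted with the desired maximum depth and read the chosen features and thresholds from the fitted model; that wrapper is left out here to keep the sketch self-contained.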
One of the advantages of using mathematical optimization models to learn decision trees is that one can easily incorporate different optimization objectives and constraints into the learning process, as initially demonstrated by VerZha17. In Appendix A, we explain how the CG approach described above can be adapted to handle other objectives, such as minimizing the false negatives or obtaining trees with a low number of leaves, and how it can incorporate constraints on the minimum number of rows that end in a leaf.
6 Computational experiments
This section presents the computational results obtained with our approach and compares them to the results of the recently proposed ILP based classification algorithms in the literature.
6.1 Benchmark datasets and algorithms
In the sequel we use CG to refer to our column generation based approach, using the master problem , pricing problem and column management procedure described in Section 4.1 and the threshold sampling described in Section 5. We compare CG to three algorithms. The first one is an optimized version of CART available in Scikit-learn, a machine learning library for Python [sklearn]. We ran CART with the default parameters except for the maximum depth, which was set to the corresponding depth of our problem. The second algorithm is a tuned version of CART, which we name CART*, where we tested different parameter values of CART and used the ones that gave the best results. As listed in Table 3, the parameter tuning includes 80 possible combinations of the following parameters: (i) the minimum sample requirement, from 0.02, 0.05, 0.1, to 0.2; (ii) the minimum segment size at leaves, from 0.01, 0.05, 0.1, 0.2, to 1; (iii) the performance metric used to determine the best splits, including gini and entropy; and (iv) the weights given to different classes. The “Balanced” option from Scikit-learn balances classes by assigning different weights to data samples based on the sizes of their corresponding classes. The “None” option does not assign any weights to data samples. All these options are explored by performing an exhaustive search with a 10-fold cross validation on training data. The third algorithm that we compare with is a MILP formulation proposed recently by BertDunn17, named OCT. The results of OCT are directly taken from BertDunn17.
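The size of the CART* search space can be verified by enumerating the grid. The value lists below are taken from the text; mapping them to the scikit-learn parameter names `min_samples_split`, `min_samples_leaf`, `criterion` and `class_weight` is an assumption of this sketch.

```python
from itertools import product

# Value lists are taken from the text; the mapping to scikit-learn
# parameter names is an assumption of this sketch.
grid = {
    "min_samples_split": [0.02, 0.05, 0.1, 0.2],
    "min_samples_leaf": [0.01, 0.05, 0.1, 0.2, 1],
    "criterion": ["gini", "entropy"],
    "class_weight": ["balanced", None],
}

# One dictionary per hyperparameter combination.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 4 * 5 * 2 * 2 = 80
```

A dictionary of this shape could, for instance, be handed to scikit-learn's `GridSearchCV` together with a decision tree classifier and `cv=10` to reproduce an exhaustive search with 10-fold cross validation, although the exact tuning code used for CART* is not given here.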
Similar to BertDunn17 and VerZha17, we use the tree generated by CART as a starting solution to our model. In the pricing heuristic, the size of the column pool is , the number of leaves in the update procedure is , and the number of the chosen columns to add to the master problem is . For the threshold sampling procedure (Algorithm 1), we use the following parameter values: the portion of the data is set to 90%, the number of CART trees is , and the number of , decision check values are selected for each internal node.
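Table 3: Tuned hyperparameters for CART*
|Parameter|Values|
|Minimum sample requirement|0.02, 0.05, 0.1, 0.2|
|Minimum segment size at leaves|0.01, 0.05, 0.1, 0.2, 1|
|Performance metric|gini, entropy|
|Class weights|Balanced, None|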
We tested the four algorithms using 20 datasets from the UCI repository [UCI], where 14 are “small” datasets containing fewer than 10000 data rows and 6 are “large” ones containing over 10000 data rows. The first 14 were selected such that almost no pre-processing was required to use the data, that is, there are no missing values and almost all features are numerical. The only pre-processing we performed was:
Transform classes to integers
Transform nominal string features into 0/1 features using one-hot-encoding
Transform meaningful ranked (ordinal) string features into numerical features (for instance
This last transformation only happened in the car evaluation and the seismic bumps datasets. We used the algorithms to construct classification trees of depths 2, 3, and 4 and compare their performance in terms of training and testing accuracy.
6.2 Experimentation setting
All experiments were conducted on a Windows 10 OS, with 16GB of RAM and an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz. The code is written in Python 2.7 and the solver used to solve the linear programs is CPLEX V.12.7.1 [CPLEX] with default parameters.
In order to compare the performance of CG to OCT, we used the same setup as in BertDunn17. Therefore, for a given dataset, 50% of the data are used for training, 25% for testing, and the remaining 25% are not used (this set was used in BertDunn17 to select the best parameters in their model). The splits in the data are made randomly. This procedure is repeated 5 times, and the reported performance on each dataset is averaged over the 5 experiments.
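The evaluation protocol can be sketched as follows. The helper names and the `evaluate` callable, which stands for fitting a tree on the training rows and returning its accuracy on the test rows, are placeholders; the seeds actually used in the experiments are not stated.

```python
import random


def split_indices(n_rows, rng):
    """Randomly split row indices into 50% train, 25% test, 25% unused."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_train, n_test = n_rows // 2, n_rows // 4
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]


def repeated_evaluation(n_rows, evaluate, repeats=5, seed=0):
    """Average `evaluate(train, test)` over several random splits.

    `evaluate` is a placeholder for fitting a tree on the training rows
    and returning its accuracy on the test rows.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test, _unused = split_indices(n_rows, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Using a single seeded generator for all repetitions keeps the five splits different from each other while making the whole experiment reproducible.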
We let CG run at most 10 minutes for solving each instance. In comparison, for OCT, the time limit was set to 30 minutes or to 2 hours depending on the difficulty of the problem (see BertDunn17).
In this section we give the results using charts and small tables. The exhaustive results can be found in the appendix.
6.3.1 Results overview
First we provide some overview results in Tables 4 and 5. Table 4 contains the average testing accuracy over the 14 small data sets and the 6 large data sets, respectively. Table 5 contains the number of wins, i.e., the number of times a method outperforms another in terms of accuracy on different datasets, for decision trees of different depths. Note that no results are available on the large datasets for OCT in BertDunn17. The best results are indicated in bold.
The results in Table 4 show that for the small data sets, CG obtains the highest average accuracy among the tested methods for classification trees of different depths. For large datasets, when the tree is small (i.e., depth 2), CG outperforms CART and CART*. When learning trees of depths 3 and 4, although CG is outperformed by CART* in terms of the average accuracy over all large datasets (Table 4), for most individual datasets CG gives a better tree than CART*, as indicated in Table 5.
Table 5 shows that CG has the highest number of wins, i.e., the highest accuracy score for a given dataset, not only for the small but also for the large datasets. CG is better than or equal to CART and CART* on 5 out of 6 datasets when constructing trees of depth 3, and its performance on one particular dataset (Letter Recognition, see Table 7) leads to a lower average accuracy than CART*, as shown in Table 5.
These overview results show that the proposed algorithm has the best overall performance against the other three algorithms on the tested datasets. Next, we provide more details on the results.
6.3.2 Training accuracy
We first investigate whether our proposed optimization method can maximize the prediction accuracy better than the greedy based heuristic CART. For this purpose, we use all data as training data and test on the 14 small datasets.
Table 6 shows the training accuracy obtained by CG and CART on decision trees of different depths. There are 14 datasets, with data size ranging from 122 (Monks-problems-3) to 4601 (Spambase) and number of features ranging from 4 to 57.
As expected, when the learning models become more complex (i.e., the trees become deeper), both CG and CART construct better decision trees that predict the classes more accurately. More interesting results are in Figure 3, which shows the absolute difference in accuracy between CG and CART. On average, CG improves the training accuracy by 2%, with a maximum improvement of 9%. The CG algorithm improves upon CART on almost all datasets, with two exceptions: Monks-problems-3 and Car-evaluation.
For datasets known to be easy, such as Iris, CG’s results are very similar to CART’s, since a simple heuristic like CART already performs very well on these datasets. On Iris, CART and CG produce trees of the same quality for depth 2, with accuracy 96%, which has been proven to be optimal [VerZha17]. CG is slightly better than CART when building the decision tree of depth 3. For depth 4, they both achieve a training accuracy of 99.3%, while the best accuracy that a decision tree of depth 4 can obtain is 100% (see [VerZha17]). This similar and sub-optimal result of CG and CART is due to the fact that we use a restricted feature and threshold set in CG, and this restricted set is derived from many randomized decision trees generated by CART.
Although the use of threshold sampling compromises the optimality of solutions, it demonstrates its advantage in terms of scalability. As seen from the existing MILP based approaches, e.g., BertDunn17 and VerZha17, performance degrades with larger trees. In contrast, CG consistently gives better solutions than CART, regardless of the size of the trees.
This set of experiments demonstrates that our MILP based approach is capable of constructing more accurate decision trees on the given datasets, compared to CART.
6.3.3 Detailed results on testing data
In this section, we show the generalization ability of the proposed algorithm by evaluating the resulting decision trees using testing data. We show the testing results on the small datasets first, and then the large datasets.
When learning classification trees of depth 2, all algorithms have very similar performance (see Figure 4(a)). CG outperforms the three other algorithms in most cases (9 out of 14 datasets), although the performance increases are rather small, within 5%. There is only a single instance, Seismic-bumps, where CG has the worst accuracy value. However, the difference between CG and the others on Seismic-bumps is less than 0.5%. The MILP based algorithm, OCT, performs best on two datasets, namely Monks-problems-2 and Wine. Interestingly, these two datasets are both very small, with Monks-problems-2 having 169 data points and 6 features, and Wine having 178 data points and 13 features.
On the two relatively bigger datasets with over 4000 data points (i.e., Spambase and Statlog-satellite), OCT starts to show its limitation on large datasets, i.e., it performs worst among all algorithms. In these cases, our MILP based algorithm, CG, can still outperform CART and CART* on Spambase. This shows the effectiveness of our algorithm in learning decision trees even on large datasets, in contrast to OCT. Compared to the greedy heuristics CART and CART*, CG outscores or ties with CART and CART* in 13 and 10 out of 14 cases, respectively.
When learning bigger trees of depths 3 and 4, the improvements that our algorithm makes over the other three algorithms, especially over OCT, are even more significant. For instance, CG improves by more than 18% over OCT on Monks-problems-1, and by more than 11% over CART* on Monks-problems-2. The good generalization ability of CG could be due to the randomness that we introduce in the threshold and feature sampling procedure.
This set of experiments, together with the training results in Table 6, shows that our algorithm has overall better learning and generalization abilities than the three other algorithms on the tested datasets.
Another important aspect is the time needed to construct trees of a given depth. In our experiments, CART took only approximately 0.1s to generate a tree, while CART* needed between 1s and 10s due to the grid search for the best parameters. For OCT, the time limit was set to 30 minutes or to 2 hours depending on the difficulty of the problem (see BertDunn17). The proposed CG algorithm terminates as soon as one of the following stopping criteria is met: (1) the optimal solution of the master problem has been reached; (2) the maximum number of iterations of the heuristic, which is used when the IP formulation of the pricing problem is too hard to solve, has been reached; or (3) the time limit of 10 minutes has been reached. On the 14 small datasets, CG never terminated due to the time limit of 10 minutes. Apart from three datasets at depth 4 (namely, Spambase, Qsar-biodegradation and Seismic-bumps), which have a large number of data rows or features and large target decision trees, all experiments terminated because an optimal solution of the master problem had been found. In other words, they carry a proof of optimality. This contrasts with the results of BertDunn17, where the authors stated “most of [their] results do not carry explicit certificates of optimality”. Note that “optimality” here refers only to the optimality of the master problem during the CG procedure, and not to finding an optimal solution to the decision tree learning problem.
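The three stopping criteria can be pictured as a simple control loop around the master and pricing steps. The following is only a schematic sketch under stated assumptions: `solve_master` and `solve_pricing` are hypothetical callbacks standing in for the actual master LP and pricing routines, the returned labels are invented for the example, and the time limit is assumed to be checked once per completed iteration.

```python
import time
from typing import Callable, Tuple


def column_generation_loop(
    solve_master: Callable[[], object],
    solve_pricing: Callable[[object], Tuple[int, bool]],
    max_heuristic_iterations: int,
    time_limit_seconds: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Iterate master/pricing and return which stopping criterion was met.

    solve_master  : hypothetical callback; solves the restricted master
                    problem and returns its dual values.
    solve_pricing : hypothetical callback; returns (columns_added,
                    solved_exactly), where solved_exactly is False when the
                    heuristic replaced the pricing IP.
    """
    start = clock()
    heuristic_iterations = 0
    while True:
        duals = solve_master()
        columns_added, solved_exactly = solve_pricing(duals)

        # (1) Exact pricing finds no improving column: the master problem
        #     is solved to optimality.
        if solved_exactly and columns_added == 0:
            return "master_optimal"

        # (2) Iterations that fell back on the heuristic are counted
        #     against a fixed budget.
        if not solved_exactly:
            heuristic_iterations += 1
            if heuristic_iterations >= max_heuristic_iterations:
                return "heuristic_iteration_limit"

        # (3) The clock is read only after a complete iteration, so a run
        #     may overshoot the nominal limit.
        if clock() - start >= time_limit_seconds:
            return "time_limit"
```

Only the first outcome corresponds to a proof of optimality of the master problem; the other two end the run with the best columns found so far.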
Figure 5 shows the required computational time for constructing each tree. As expected, the algorithm needs more time when the size of the problem (depth, rows, features) grows. Nevertheless, all instances are solved in less than 10 minutes. For the smallest instances, only a few seconds are needed, which makes our algorithm competitive against CART* not only in the quality of the results, but also in speed.
6.3.4 Detailed results on big datasets
The MILP-based formulations in the existing literature (e.g., BertDunn17, VerZha17) failed to handle datasets with more than 10000 rows. Therefore, for large datasets, we can only compare the results of our CG algorithm with CART and CART*.
Table 7: Testing accuracy (%) on big datasets for trees of depth 3 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||CG (time)|
|Magic4||19020||10||2||79.1||79.2 (24.32)||80.1 (665.31)|
|Default credit||30000||23||2||82.3||82.2 (42.05)||82.3 (517.35)|
|HTRU_2||17898||8||2||97.9||97.8 (20.48)||97.9 (627.85)|
|Letter recognition||20000||16||26||17.7||23.3 (9.75)||18.6 (114.03)|
|Statlog shuttle||43500||9||7||99.6||99.5 (13.37)||99.7 (260.52)|
|Hand-posture||78095||33||5||62.5||62.4 (321.90)||62.8 (660.21)|
Figure 6 shows the performance improvement of CG over CART and CART* on different datasets when learning trees of different depths. Table 7 contains the detailed results on the testing accuracy for learning classification trees of depth 3, where the computation time (bracketed) to construct the trees is provided for CART* and CG. For the results of depths 2 and 4, we refer to Tables 11 and 12 in the Appendix.
Despite the large size of the problem, CG always performs at least as well as CART, although the improvement, about 0.34% on average, is not as significant as on the small datasets. On two datasets, Default credit and HTRU_2, CG could not find improved solutions compared to CART. This may be caused by the structure of the data: these two cases are rather easy, as CART already gives very good classification results (more than 82% for Default credit and more than 97% for HTRU_2). Hence, the room for improvement might be small. Compared with CART*, we note small improvements (around 0.3%) in most of the instances. For Letter recognition, which has 26 classes, CART* appears to be much better than CG at predicting the correct classes.
Regarding the computational time, CART needed less than 1 second to generate a tree. CART* took between 10 seconds and 5 minutes depending on the size of the problem, which is between 2 and 30 times faster than CG. The stopping criteria of CG are the same as for the small datasets. The time limit was set to 10 minutes for CG; some running times in the tables exceed 10 minutes because the current iteration of CG has to finish before the algorithm terminates. Only on Magic4 with depth 2 does the result carry an optimality proof.
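As a quick check of the timing comparison, the depth-3 computation times from Table 7 can be turned into per-dataset ratios. The sketch below copies the bracketed times from that table; the variable names and the 600-second constant (10 minutes expressed in seconds) are introduced only for this illustration.

```python
# Depth-3 computation times in seconds, copied from Table 7: (CART*, CG).
TIMES_S = {
    "Magic4": (24.32, 665.31),
    "Default credit": (42.05, 517.35),
    "HTRU_2": (20.48, 627.85),
    "Letter recognition": (9.75, 114.03),
    "Statlog shuttle": (13.37, 260.52),
    "Hand-posture": (321.90, 660.21),
}
TIME_LIMIT_S = 600.0  # nominal 10-minute limit of CG

# How many times longer CG runs than CART* on each dataset.
slowdown = {name: cg / cart_star
            for name, (cart_star, cg) in TIMES_S.items()}

# Runs that finished their last iteration after the nominal limit.
over_limit = [name for name, (_, cg) in TIMES_S.items()
              if cg > TIME_LIMIT_S]

for name, ratio in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name}: CG takes {ratio:.1f}x the time of CART*")
print("runs exceeding the nominal limit:", over_limit)
```

The smallest ratio occurs on the largest dataset, where the grid search of CART* is itself expensive, and the largest ratios occur where CG runs up to its time limit.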
7 Discussion and conclusion
In this paper we propose a novel Column Generation (CG) based approach to construct decision trees for classification tasks in Machine Learning. To the best of our knowledge, our approach is the first to use a restricted parameter set of a problem in addition to the restricted set of decision variables used in traditional CG approaches. We also clearly indicate the limitation of the CG approach when the complete parameter set is used.
Our extensive computational experiments show that the proposed approach outperforms the state-of-the-art MILP based algorithms in the very recent literature in terms of training and testing accuracy. It also improves the solutions obtained by CART. Moreover, our approach can also solve big instances with more than 10000 data rows. For such instances the existing ILP formulations (e.g., BertDunn17 and VerZha17) have very high computation times.
Another important aspect of our approach is its high flexibility in the objective. This means that our models can use other types of objectives, common in the field of decision trees, that differ from accuracy.
In this work, we implemented a basic version of the threshold sampling. In a future study, this sampling procedure can be tested with other ideas of information collection, e.g., rhuggenaath2018learning. More advanced threshold sampling is expected to improve the results on big datasets.
It will also be interesting to see if our idea of working with a restricted parameter set can be used to develop solution methods for other problems. In addition, it will be interesting to see the performance of our approach with learning objectives other than accuracy in applications from different fields. For instance, our approach can be used to build a classification tree with minimized false negatives using medical data. Furthermore, we will investigate how to improve our approach so that the CG model iteratively updates the restricted parameters to achieve better objective values.
The second and third authors acknowledge the support of United Arab Emirates University through the Start up Grant G0002191.
Appendix A Flexibility in the objective
A.0.1 Focusing on false positives/false negatives
In many applications of classification trees, such as in health care, focusing on false positives or false negatives might be more desirable than maximizing accuracy. This can be easily incorporated in our model by changing the definition of the objective coefficients of the decision paths and the pricing problem accordingly.
Assume for example that our objective is to minimize the false positives (the row is predicted with target 1 whereas the real target is 0). Then the objective coefficients of decision paths in the master model become for , and for . The pricing objectives become
Note that the same ideas can be applied to deal with other objective functions focusing on different aspects (such as false positives, etc.), or a combination of them. Moreover, our method offers the flexibility of giving different weights to different aspects. The only changes that need to be made are in the definition of the objective coefficients of the decision paths and in the objective function of the pricing problem.
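To make the idea of weighting different aspects concrete, the sketch below scores a set of predictions by a weighted count of false positives and false negatives. It is an illustration of the objective only, evaluated on finished predictions; it is not the master or pricing formulation, and the function name and the weights `w_fp` and `w_fn` are assumptions made for the example.

```python
def weighted_misclassification_cost(y_true, y_pred, w_fp=1.0, w_fn=1.0):
    """Weighted count of false positives and false negatives.

    A false positive is a row predicted with target 1 whose real target
    is 0; a false negative is the reverse. w_fp and w_fn are illustrative,
    user-chosen weights.
    """
    false_positives = sum(1 for t, p in zip(y_true, y_pred)
                          if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred)
                          if t == 1 and p == 0)
    return w_fp * false_positives + w_fn * false_negatives


# Example targets and predictions (made up for the illustration).
y_true = [0, 0, 1, 1, 1]
y_pred = [1, 0, 0, 1, 0]

# Counting only false positives, as in the example objective above.
only_fp = weighted_misclassification_cost(y_true, y_pred, w_fp=1.0, w_fn=0.0)

# Penalizing false negatives five times more, e.g. for medical data.
fn_heavy = weighted_misclassification_cost(y_true, y_pred, w_fp=1.0, w_fn=5.0)
```

Setting one weight to zero recovers an objective that focuses on a single error type, while equal weights reduce to counting all misclassified rows.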
A.0.2 Penalizing the number of leaves
Another common objective in the field of decision trees is a trade-off between accuracy and the number of leaves. In order to have a more interpretable tree (i.e., a tree with fewer leaves), a high number of leaves is often penalized in the objective function. Note that our model allows empty paths. As a high number of empty paths corresponds to a low number of leaves, to restrict the latter, it suffices to reward the choice of an empty path in the objective. This can be done by defining an extra indicator variable for an empty path and changing the objective coefficient of a path, for example into:
where the reward parameter has to be defined by the user. Correspondingly, the objective function of the pricing problem becomes:
and the pricing MILP model includes the following extra constraint:
A.0.3 Minimum sample requirement
Another useful feature is not to create a leaf unless at least a user-defined minimum number of rows end in it. This can be done either by using a penalty function or by including the following constraint in the pricing problem:
Please note that this is compatible with other options such as penalizing a high number of leaves. Both aspects can be included by considering the following constraint in the pricing MILP:
Appendix B Detailed results
The following tables refer to the average accuracy on testing. For CART* and CG, the computational time is also provided (bracketed).
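Testing accuracy (%) on the small datasets for trees of depth 2 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||OCT||CG (time)|
|Iris||150||4||2||94.7||94.7 (1.61)||92.4||94.7 (2.67)|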
|Pima-Indians-diabetes||768||8||2||71.4||71.5 (2.02)||72.9||73.2 (11.38)|
|Banknote-authentification||1372||4||2||89.0||89.7 (2.05)||90.1||91.2 (9.27)|
|Balance-scale||625||4||3||63.7||63.7 (1.65)||67.1||68.5 (5.04)|
|Monks-problems-1||124||6||2||66.5||66.5 (1.61)||67.7||69 (2.29)|
|Monks-problems-2||169||6||2||53.0||53.0 (1.60)||60.0||54.4 (3.76)|
|Monks-problems-3||122||6||2||94.2||94.2 (1.60)||94.2||94.2 (2.35)|
|Ionosphere||351||34||2||84.1||88.2 (2.41)||87.8||85.2 (7.39)|
|Spambase||4601||57||2||85.2||85.6 (6.13)||84.3||86.5 (51.9)|
|Car-evaluation||1728||5||4||77.5||77.5 (2.10)||73.7||77.5 (6.59)|
|Qsar-biodegradation||1055||41||2||77.5||75.4 (3.08)||76.1||79.8 (36.00)|
|Seismic-bumps||2584||18||2||93.2||93.4 (2.62)||93.3||92.8 (31.89)|
|Statlog-satellite||4435||36||6||63.6||65 (5.83)||63.2||64 (25.52)|
|Wine||178||13||3||82.2||87.6 (1.96)||91.6||86.2 (6.03)|
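Testing accuracy (%) on the small datasets for trees of depth 3 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||OCT||CG (time)|
|Iris||150||4||2||96.3||96.3 (1.66)||93.5||96.3 (4.20)|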
|Pima-Indians-diabetes||768||8||2||73.8||69.6 (2.07)||71.1||72.9 (144.49)|
|Banknote-authentification||1372||4||2||92.1||94.2 (2.16)||89.6||94.8 (40.81)|
|Balance-scale||625||4||3||69.8||70.7 (1.68)||68.9||72.5 (56.10)|
|Monks-problems-1||124||6||2||79.4||78.1 (1.69)||70.3||88.4 (3.65)|
|Monks-problems-2||169||6||2||51.6||51.2 (1.64)||60.0||63.3 (25.95)|
|Monks-problems-3||122||6||2||92.3||93.5 (1.69)||94.2||92.9 (3.15)|
|Ionosphere||351||34||2||86.4||89.1 (2.61)||87.6||86.4 (55.58)|
|Spambase||4601||57||2||88.0||88.0 (7.82)||86.0||88.3 (416.76)|
|Car-evaluation||1728||5||4||79.0||79.9 (1.92)||77.4||78.9 (10.63)|
|Qsar-biodegradation||1055||41||2||82.0||80.9 (3.40)||78.6||82.9 (390.35)|
|Seismic-bumps||2584||18||2||92.8||93.4 (2.82)||93.3||92.4 (312.00)|
|Statlog-satellite||4435||36||6||78.6||80.3 (7.31)||77.9||78.4 (111.26)|
|Wine||178||13||3||88.9||91.6 (1.90)||94.2||91.6 (7.24)|
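Testing accuracy (%) on the small datasets for trees of depth 4 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||OCT||CG (time)|
|Iris||150||4||2||95.8||95.8 (1.68)||93.5||94.7 (7.54)|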
|Pima-Indians-diabetes||768||8||2||70.9||72.5 (2.23)||72.4||71.5 (319.14)|
|Banknote-authentification||1372||4||2||95.2||96.1 (2.25)||90.7||95.9 (107.01)|
|Balance-scale||625||4||3||74.6||73.8 (1.68)||71.6||79.9 (243.93)|
|Monks-problems-1||124||6||2||76.1||72.9 (1.61)||74.2||86.5 (8.90)|
|Monks-problems-2||169||6||2||52.6||49.8 (1.62)||54.0||52.6 (75.88)|
|Monks-problems-3||122||6||2||90.3||91.6 (1.64)||94.2||92.9 (5.30)|
|Ionosphere||351||34||2||87||87.3 (2.80)||87.6||84.5 (103.22)|
|Spambase||4601||57||2||90.2||90.0 (8.50)||86.1||90.1 (537.24)|
|Car-evaluation||1728||5||4||83.4||84.7 (1.84)||78.8||85.0 (23.02)|
|Qsar-biodegradation||1055||41||2||82.1||81.6 (3.92)||79.8||82.9 (555.67)|
|Seismic-bumps||2584||18||2||92.0||93.4 (3.21)||93.3||92.0 (555.86)|
|Statlog-satellite||4435||36||6||81.1||81.4 (8.75)||78.0||81.5 (355.55)|
|Wine||178||13||3||89.3||93.3 (1.88)||94.2||92.0 (10.20)|
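Table 11: Testing accuracy (%) on big datasets for trees of depth 2 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||CG (time)|
|Magic4||19020||10||2||78.4||78.4 (19.60)||79.1 (437.79)|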
|Default credit||30000||23||2||82.3||82.3 (32.83)||82.3 (150.12)|
|HTRU_2||17898||8||2||97.8||97.8 (15.58)||97.8 (114.68)|
|Letter recognition||20000||16||26||12.5||12.7 (8.32)||12.7 (92.11)|
|Statlog shuttle||43500||9||7||93.7||93.7 (11.36)||93.7 (211.24)|
|Hand-posture||78095||33||5||56.4||56.4 (254.63)||56.4 (612.39)|
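Table 12: Testing accuracy (%) on big datasets for trees of depth 4 (computation time in seconds in brackets).
|Dataset||Rows||Features||Classes||CART||CART* (time)||CG (time)|
|Magic4||19020||10||2||81.5||81.5 (29.21)||81.5 (688.24)|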
|Default credit||30000||23||2||82.3||82.2 (51.38)||82.3 (635.50)|
|HTRU_2||17898||8||2||98.0||97.7 (23.93)||98.0 (633.21)|
|Letter recognition||20000||16||26||24.8||35.4 (11.40)||27.0 (306.91)|
|Statlog shuttle||43500||9||7||99.8||99.6 (15.99)||99.8 (441.48)|
|Hand-posture||78095||33||5||69.0||69.0 (385.25)||69.1 (696.12)|
- [Bessiere09] Bessiere, C., Hebrard, E. and O’Sullivan, B., 2009. Minimising decision tree size as combinatorial optimisation. In International Conference on Principles and Practice of Constraint Programming (pp. 173-187). Springer, Berlin, Heidelberg.
- [BertDunn17] Bertsimas, D. and Dunn, J., 2017. Optimal classification trees. Machine Learning, 106(7), pp. 1039-1085.
- [Breiman84] Breiman, L., Friedman, J., Olshen, R. and Stone, C., 1984. Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
- [Chang2012] Chang, M.W., Ratinov, L. and Roth, D., 2012. Structured learning with constrained conditional models. Machine Learning, 88(3), pp. 399-431.
- [Dash18] Dash, S., Günlük, O. and Wei, D., 2018. Boolean Decision Rules via Column Generation. arXiv:1805.09901 [cs.AI].
- [Raedt2010] De Raedt, L., Guns, T. and Nijssen, S., 2010. Constraint programming for data mining and machine learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) (pp. 1671-1675).
- [Desrochers92] Desrochers, M., Desrosiers, J. and Solomon, M., 1992. A New Optimization Algorithm for the Vehicle Routing Problem with Time Windows. Operations Research, 40(2), pp. 342-354.
- [Desrosiers05] Desrosiers, J. and Lübbecke, M.E., 2005. Column Generation, edited by Desaulniers, G., Desrosiers, J. and Solomon, M.M. Springer US, pp. 1-32.
- [DuongVrain17] Duong, K.C. and Vrain, C., 2017. Constrained clustering by constraint programming. Artificial Intelligence, 244, pp. 70-94.
- [Flach12] Flach, P., 2012. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, Cambridge.
- [Firat16] Firat, M., Briskorn, D. and Laugier, A., 2016. A Branch-and-Price algorithm for stable multi-skill workforce assignments with hierarchical skills. European Journal of Operational Research, 251(2), pp. 676-685.
- [NPcomp] Garey, M.R. and Johnson, D.S., 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. ISBN 0-7167-1045-5.
- [Gunluk18] Günlük, O., Kalagnanam, J., Menickelly, M. and Scheinberg, K., 2018. Optimal Generalized Decision Trees via Integer Programming. arXiv:1612.03225v2 [cs.AI].
- [Guns11] Guns, T., Nijssen, S. and De Raedt, L., 2011. Itemset mining: A constraint programming perspective. Artificial Intelligence, 175(12-13), pp. 1951-1983.
- [HyaRi76] Hyafil, L. and Rivest, R.L., 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, pp. 15-17.
- [CPLEX] IBM ILOG CPLEX, 2016. V 12.7 User’s manual. https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer
- [UCI] Lichman, M., 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml
- [Menick16] Menickelly, M., Günlük, O., Kalagnanam, J. and Scheinberg, K., 2016. Optimal Decision Trees for Categorical Data via Integer Programming. COR@L Technical Report 13T-02-R1, Lehigh University.
- [Murthy95] Murthy, S. and Salzberg, S., 1995. Lookahead and pathology in decision tree induction. In IJCAI, Citeseer, pp. 1025-1033.
- [narodytska2018learning] Narodytska, N., Ignatiev, A., Pereira, F. and Marques-Silva, J., 2018. Learning Optimal Decision Trees with SAT. In IJCAI, pp. 1362-1368.
- [Norton89] Norton, S.W., 1989. Generating better decision trees. IJCAI 89, pp. 800-805.
- [Norouzi15] Norouzi, M., Collins, M., Johnson, M.A., Fleet, D.J. and Kohli, P., 2015. Efficient non-greedy optimization of decision trees. Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 1720-1728.
- [Payne77] Payne, H.J. and Meisel, W.S., 1977. An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, 100(9), pp. 905-916.
- [Quinlan86] Quinlan, J.R., 1986. Induction of decision trees. Machine Learning, 1(1), pp. 81-106.
- [rhuggenaath2018learning] Rhuggenaath, J., Zhang, Y., Akcay, A., Kaymak, U. and Verwer, S., 2018. Learning fuzzy decision trees using integer programming. In 2018 IEEE International Conference on Fuzzy Systems.
- [sklearn] Scikit-learn, 2018. V 0.19.2 User’s manual. http://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf
- [SplietGabor2014] Spliet, R. and Gabor, A.F., 2014. The time window assignment vehicle routing problem. Transportation Science, 49(4), pp. 721-731.
- [VerZha17] Verwer, S. and Zhang, Y., 2017. Learning Decision Trees with Flexible Constraints and Objectives Using Integer Optimization. Integration of AI and OR Techniques in Constraint Programming: 14th International Conference, CPAIOR 2017, Padua, Italy, June 5-8, 2017, Proceedings, pp. 94-103.
- [VerZhaYe17] Verwer, S., Zhang, Y. and Ye, Q.C., 2017. Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence, 244, pp. 368-395.