Decision trees are among the most widely used statistical models. Besides being good classifiers, they have distinctive properties that set them apart from other models. A path from the root to any leaf can be described as a sequence of decisions, each a threshold test on a single feature (axis-aligned trees) or on a linear combination of features (oblique trees). This makes decision trees not only very fast at inference but also interpretable: the sequence of decisions can be read as IF-THEN rules that explain the model's prediction for a given input.
However, decision trees have one major drawback: learning the tree from data is a very difficult optimization problem, involving a search over a large, complex set of tree structures and over the parameters at each node. Recently, Carreira-Perpiñán and Tavallali (2018) proposed the Tree Alternating Optimization (TAO) algorithm to address this problem; it directly optimizes the misclassification error, using alternating optimization over separable subsets of nodes. In this work, we compare TAO against some well-known decision tree algorithms over a wide range of datasets.
This paper is structured as follows. In section 2 we briefly describe the algorithms used in the comparison. In section 3 we describe the datasets, including the number of instances and the dimensionality of each. In sections 4 and 5, we describe our experimental setup and the results of the comparison.
2 The algorithms
Below we provide a short description of each algorithm. More details can be found in the corresponding cited papers.
CART: CART (Breiman et al., 1984) is one of the most widely used algorithms for training axis-aligned decision trees. It learns the tree by greedy recursive partitioning, optimizing an impurity measure at each node. At each growing stage, for a given node, it enumerates all the attributes to find the split that most reduces the Gini index at that node. It grows the tree up to the maximum depth and then prunes nodes one by one, as long as pruning does not increase the misclassification error by more than a certain threshold.
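The greedy split search described above can be illustrated with a minimal sketch. It is not the CART or rpart implementation: the function names `gini` and `best_axis_aligned_split` are hypothetical, candidate thresholds are assumed to be midpoints between consecutive distinct feature values, and the search is deliberately brute-force.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_axis_aligned_split(X, y):
    """Return (feature, threshold, weighted_gini) of the best split x[feature] <= threshold.

    Assumption: thresholds are midpoints between consecutive distinct values.
    Returns None if no feature has two distinct values.
    """
    n, best = len(y), None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            # Impurity of the two children, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best
```

Applied recursively to the instances on each side of the chosen split, this yields the greedy tree-growing phase; pruning is a separate step that is not shown here.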
C5.0: C5.0 (Quinlan, 1993) is an established software package for learning univariate decision trees. Like CART, it uses greedy recursive partitioning of the tree nodes. At each recursive split, the algorithm enumerates different feature-threshold combinations and picks the best one according to the information gain criterion. Pruning can be applied once the tree-growing phase is finished.
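As a hedged illustration of the information gain criterion named above, the sketch below scores one feature-threshold combination. It is not taken from the C5.0 software, which may apply further refinements; the function names are hypothetical and the split is assumed to be of the form value <= threshold.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the non-empty children.
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder
```

A greedy learner would evaluate this quantity for every candidate feature and threshold at a node and keep the combination with the largest gain.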
TAO: The TAO algorithm proposed in Carreira-Perpiñán and Tavallali (2018) optimizes a decision tree with a predetermined structure and can be trained to minimize a desired objective function such as the misclassification error. Each iteration of TAO is guaranteed to decrease the objective function or leave it unchanged. The algorithm can be applied to both axis-aligned and oblique decision trees. Moreover, it can handle various penalty terms on the objective function, such as $\ell_1$-regularization, which we briefly describe here (see Carreira-Perpiñán and Tavallali (2018) for details). TAO assumes a given tree structure with initial parameter values (possibly random), and minimizes the following objective function jointly over the parameters of all nodes of the tree:
$$E(\Theta) = \sum_{n=1}^{N} L\big(y_n, T(x_n; \Theta)\big) + \lambda \sum_{i} \|w_i\|_1 \qquad (1)$$
where $\{(x_n, y_n)\}_{n=1}^{N} \subset \mathbb{R}^D \times \{1, \dots, K\}$ is a training set of $D$-dimensional real-valued instances and their labels (in $K$ classes), $L(\cdot, \cdot)$ is the loss function (e.g. cross-entropy, 0/1 loss, etc.), $T(x; \Theta)$ is the predictive function of the tree, $\theta_i$ denotes the parameters at a node $i$, and $\lambda \ge 0$ weights the $\ell_1$ penalty on the decision nodes. For example, an oblique decision node has $\theta_i = \{w_i, b_i\}$, with weight vector $w_i \in \mathbb{R}^D$ and bias $b_i$, and thus sends an input instance $x$ down its right child if $w_i^T x + b_i \ge 0$ and down its left child otherwise.
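To make the predictive function of an oblique tree concrete, here is a minimal sketch of how an instance could be routed from the root to a leaf. The nested-dictionary representation and the keys `w`, `b`, `left`, `right` and `label` are assumptions made for illustration, as is the convention that an instance goes to the right child when the weighted sum plus bias is non-negative.

```python
def predict(node, x):
    """Route instance x through an oblique tree and return the leaf label.

    Assumed representation: a decision node is a dict with keys
    "w" (weights), "b" (bias), "left" and "right"; a leaf is a dict
    with the single key "label".
    """
    while "label" not in node:
        activation = sum(w * xi for w, xi in zip(node["w"], x)) + node["b"]
        # Right child if w.x + b >= 0, left child otherwise.
        node = node["right"] if activation >= 0 else node["left"]
    return node["label"]

# A small hand-built tree with two decision nodes and three leaves.
toy_tree = {
    "w": [1.0, -1.0], "b": 0.0,
    "left": {"label": "A"},
    "right": {
        "w": [0.0, 1.0], "b": -2.0,
        "left": {"label": "B"},
        "right": {"label": "C"},
    },
}
```

Inference costs one dot product per node on the root-to-leaf path, which is why a shallow tree is fast at prediction time.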
The basis of the TAO algorithm is the separability condition theorem. It states that for any two nodes $i$ and $j$ (internal or leaves) that are not descendants of each other (e.g. all nodes at the same depth), the error in eq. (1) separates over the parameters of $i$ and $j$. Since the loss function separates, the algorithm can optimize eq. (1) over each such node separately. This much simpler problem is referred to as a “reduced problem”. TAO applies alternating optimization over separable subsets of nodes:
Optimizing over an internal node $i$ is equivalent to optimizing a linear binary classifier over the training instances that currently reach node $i$. Each such instance $x_n$ is assigned a pseudo-label based on the child whose subtree gives the better prediction for $x_n$. Specifically, we send $x_n$ down both the left and the right subtree, with all parameters in those subtrees held fixed, and assign a pseudo-label according to which subtree produces the correct output. This pseudo-label indicates to which child (left or right) the instance should be sent.
Optimizing over a leaf $i$ amounts to fitting a $K$-class classifier on the training points that reach that leaf. In this paper, we focus on constant leaves, so the solution is the majority label of the training points that reach leaf $i$.
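The two reduced problems above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function names are hypothetical, the child subtrees' predictions are assumed to be precomputed, and instances for which both subtrees are correct or both are wrong are assumed to be left out of the node's binary problem.

```python
from collections import Counter

LEFT, RIGHT = 0, 1  # assumed encoding of the two pseudo-labels

def pseudo_labels(y_true, left_pred, right_pred):
    """Pseudo-labels for the instances reaching a decision node.

    left_pred[n] and right_pred[n] are the labels predicted for instance n
    by the (fixed) left and right subtrees. Returns (index, LEFT or RIGHT)
    pairs for instances where exactly one subtree is correct.
    """
    pairs = []
    for n, (y, y_left, y_right) in enumerate(zip(y_true, left_pred, right_pred)):
        if (y_left == y) != (y_right == y):
            pairs.append((n, LEFT if y_left == y else RIGHT))
    return pairs

def constant_leaf_label(labels_reaching_leaf):
    """Majority label of the training points that reach a constant leaf."""
    return Counter(labels_reaching_leaf).most_common(1)[0][0]
```

The pairs returned by `pseudo_labels` form the training set of the binary classifier at the node; for an oblique node, that classifier is linear.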
3 Datasets
Below we summarize the datasets used in this study and any changes made to them for the experiments. All datasets are available in the public domain.
Balance Scale: The dataset is available from UCI (Zhang et al., 2017). It was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. There are 625 instances, and each instance has attributes that Zhang et al. (2017) describe as categorical, but they are in fact numerical variables that have been discretized to discrete values.
Banknote authentication: This UCI dataset (Zhang et al., 2017) consists of attributes extracted from images taken of genuine and forged banknote-like specimens. It has two classes: genuine and forged. There are instances and each instance has real-valued attributes.
Blood Transfusion: The dataset is also available from UCI (Zhang et al., 2017). It also has two classes: whether or not a donor donated blood in March 2007. There are instances and each instance has real-valued attributes.
Breast Cancer Wisconsin (Diagnostic): This UCI dataset (Zhang et al., 2017) concerns breast cancer. The task is to classify whether the cancer is malignant or benign. There are 569 instances and each instance has real-valued attributes.
Spam: This UCI dataset (Zhang et al., 2017) consists of a collection of emails, and the task is to create a spam filter that can tell whether an email is spam or not. There are 4601 instances and each instance has real-valued attributes.
Sensit: This dataset was created by Duarte and Hu (2004). The task is to classify the types of moving vehicles in a distributed, wireless sensor network. There are three classes and training instances along with separate test instances. Each instance has attributes.
Letter: The objective of this UCI dataset (Zhang et al., 2017) is to classify the 26 capital letters of the English alphabet. Like the Sensit dataset, it has separate test instances along with training instances. Each instance has real-valued attributes.
MNIST Images: The dataset consists of grayscale images of handwritten digits, and the task is to classify them as 0 to 9. There are training images and test images. Each image is of size with gray scales in [0,1].
4 Experimental setup
For the UCI datasets, which do not have a separate test set (except for the Letter dataset), we shuffle the entire dataset and hold out 20% of the data as the test set. We repeat the training procedure times for each dataset, reshuffling the training data each time. For a fair comparison, every algorithm uses the same reshuffled training data. We also apply 10-fold cross-validation for each algorithm to find the best hyperparameters (the pruning parameter for CART and C5.0, and the sparsity parameter for TAO). Below we describe the algorithm-specific experimental setup:
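The data protocol above can be sketched as follows. This is a minimal outline under assumptions, not the scripts used in the study: the function names, the seeds and the interleaved fold assignment are all hypothetical. Fixing the seed per repeat is one simple way to give every algorithm the same reshuffled training data.

```python
import random

def train_test_split_indices(n_instances, test_fraction=0.2, seed=0):
    """Shuffle the instance indices once and hold out a fraction as the test set."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_test = round(test_fraction * n_instances)
    return indices[n_test:], indices[:n_test]  # (train, test)

def reshuffled(train_indices, repeat):
    """Training indices in a new order that depends only on the repeat number."""
    order = list(train_indices)
    random.Random(1000 + repeat).shuffle(order)
    return order

def k_folds(indices, k=10):
    """Partition the indices into k cross-validation folds of near-equal size."""
    return [indices[f::k] for f in range(k)]
```

For example, with the 625 instances of the Balance Scale dataset, `train_test_split_indices(625)` holds out 125 test indices and leaves 500 for training, and `k_folds(reshuffled(train, 0))` gives ten folds of 50 indices each.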
CART: We use the R implementation of CART called rpart (Therneau et al., 2019). For each dataset, during training we let the tree grow up to the maximum allowed depth of (the max-depth constraint imposed by rpart). To this end, we set the “minsplit” parameter to 1 and the complexity parameter (“cp”) to 0. Once the tree is fully grown, we use rpart's internal k-fold cross-validation (k=10) to obtain a list of pruning parameters and choose the best one based on the 1-SE rule (as suggested by the rpart documentation). We report the test and training accuracy of the pruned tree.
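The 1-SE rule mentioned above picks the simplest tree whose cross-validated error is within one standard error of the best one. The sketch below shows that selection on a small table; the row layout, the function name and the numbers are made up for illustration and are not output from rpart.

```python
def one_se_rule(cp_table):
    """Select a pruning level with the 1-SE rule.

    cp_table: list of (cp, n_leaves, cv_error, cv_std_error) rows, one per
    candidate pruning level. Returns the row of the smallest tree whose
    cross-validated error is at most (minimum error + its standard error).
    """
    best = min(cp_table, key=lambda row: row[2])
    threshold = best[2] + best[3]
    candidates = [row for row in cp_table if row[2] <= threshold]
    return min(candidates, key=lambda row: row[1])

# Illustrative values only (not measured results).
example_table = [
    (0.10, 2, 0.60, 0.04),
    (0.05, 4, 0.45, 0.03),
    (0.01, 9, 0.42, 0.03),
    (0.00, 30, 0.41, 0.03),
]
```

In this made-up table the lowest error belongs to the 30-leaf tree, but the 9-leaf tree falls within one standard error of it, so the rule prefers the smaller tree.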
C5.0: We use the single-threaded Linux version of C5.0, written in C and provided by the authors (see https://rulequest.com/download.html). For each dataset, we apply a grid search on the k-fold validation sets to find the best parameters. Specifically, we tune “-c CF”, which controls the pruning severity, and “-m cases”, which is the minimum number of points required to split a node. We use the default options for all other parameters. It is worth mentioning that, empirically, the tuned parameters are in many cases not far from the default settings.
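A grid search of this kind can be sketched generically. The sketch does not call C5.0: `cv_error` is a hypothetical callable that would train and validate a model for one parameter setting and return its mean k-fold validation error, and the grid values used in the example are illustrative only.

```python
from itertools import product

def grid_search(cv_error, grid):
    """Return (best_params, best_error) over all combinations in the grid.

    cv_error: callable mapping a dict of parameters to a validation error
              (assumed to wrap training plus k-fold cross-validation).
    grid:     dict mapping each parameter name to a list of candidate values.
    """
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        error = cv_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

def toy_cv_error(params):
    """Stand-in for a real cross-validation run (illustration only)."""
    return abs(params["CF"] - 25) / 100 + abs(params["m"] - 2) / 10
```

With `grid_search(toy_cv_error, {"CF": [10, 25, 50], "m": [1, 2, 5]})`, the search visits all nine combinations and keeps the one with the lowest error.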
TAO: For the TAO algorithm, we use oblique decision trees (i.e. with linear splits) and constant leaves. As the initial tree we take a sufficiently deep, complete binary tree with random parameters at each node. We use a fixed number of TAO iterations, and the algorithm proceeds until this maximum number of iterations is reached (i.e. there is no other stopping criterion). We also use a simple grid search on the k-fold validation sets to find the best hyperparameters. Specifically, we tune the sparsity parameter, which controls the sparsity of the tree, and the maximum depth of the initial tree. The TAO algorithm is implemented in Python (version 3.5) and runs without parallel processing on a single CPU. To solve the decision node optimization, TAO uses an $\ell_1$-regularized logistic regression (via LIBLINEAR (Fan et al., 2008)), in which the sparsity parameter serves as the regularization parameter.
We ran all experiments on a single Linux PC with the following specifications: OS - Ubuntu 18.04 LTS, CPU - 8 Intel Core i7-7700 3.60GHz, Memory - 16 GiB DDR4 3600 MHz.
5 Results
As mentioned in the previous section, we report the test and training errors of the pruned trees. Table 1 summarizes the error results for all datasets. CART and C5.0 perform similarly in terms of test accuracy, but TAO outperforms both on all datasets. The accuracy margin between TAO and the other two algorithms widens as dataset complexity grows, as in the last 4 datasets in the table. For instance, for MNIST images and LeNet5 features, not only is the dataset large ( training data points) but the number of attributes is also very high.
Since decision trees are considered interpretable models, it is also important to compare the size of the trained trees. To this end, table 2 compares the maximum depth and the number of leaves. A decision tree that is deep and has a large number of leaves is very difficult to interpret. Moreover, a larger tree has a longer inference time and needs more space. CART performs better than C5.0 in terms of tree size, while both perform similarly in terms of test accuracy. Again, TAO performs better than both algorithms. As with test and training accuracy, the tree size margin grows as the datasets become more difficult. For example, for MNIST images and LeNet5 features the difference is very large: the other trees are almost twice as deep as the TAO tree and have more than 4 times as many leaves.
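Table 2: Maximum depth and number of leaves of the trained trees, per dataset and algorithm.
| Dataset | Depth | # of leaves | Depth | # of leaves | Depth | # of leaves |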
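The two size measures compared in table 2 are simple to compute for any binary tree. The sketch below assumes a hypothetical nested-dictionary representation in which a leaf is a dict with the key `label` and a decision node has the keys `left` and `right`; depth is counted as the number of decision nodes on the longest root-to-leaf path.

```python
def max_depth(node):
    """Number of decision nodes on the longest path from this node to a leaf."""
    if "label" in node:
        return 0
    return 1 + max(max_depth(node["left"]), max_depth(node["right"]))

def count_leaves(node):
    """Number of leaves in the subtree rooted at this node."""
    if "label" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Hand-built example: two decision nodes, three leaves.
small_tree = {
    "left": {"label": "A"},
    "right": {"left": {"label": "B"}, "right": {"label": "C"}},
}
```

For `small_tree`, the depth is 2 and the number of leaves is 3. A complete binary tree of depth d has 2^d leaves, so depth and leaf count can diverge sharply for unbalanced trees, which is why both are reported.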
In this work, we compare the performance of some well-known decision tree algorithms with that of the recently proposed TAO algorithm. Our experiments show that TAO not only achieves better accuracy but also produces smaller and more interpretable decision trees. The reason for this better performance lies in how TAO trains trees, which differs from the other two algorithms. CART and C5.0 optimize the decision tree greedily: at each step, both algorithms split the data using the single attribute that optimizes the impurity of a single node. This approach carries no guarantee of reducing the global loss. As a result, the trained tree usually does not generalize well and is also large. TAO, on the other hand, does not optimize the impurity of a single node at each step; instead, it optimizes the misclassification error of the entire tree, with each iteration updating the weights of all the nodes. Updating the parameters of all nodes as a whole yields better-optimized decision trees that not only generalize well but are also smaller.
- Breiman et al. (1984) L. J. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, Calif., 1984.
- Carreira-Perpiñán and Tavallali (2018) M. Á. Carreira-Perpiñán and P. Tavallali. Alternating optimization of decision trees, with application to learning sparse oblique trees. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems (NEURIPS), volume 31, pages 1211–1221. MIT Press, Cambridge, MA, 2018.
- Duarte and Hu (2004) M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826–838, 2004.
- Fan et al. (2008) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, Aug. 2008.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
- Quinlan (1993) J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- Therneau et al. (2019) T. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning and regression trees. R package version 4.1-15, Apr. 12 2019. Available online at https://cran.r-project.org/package=rpart.
- Zhang et al. (2017) C. Zhang, C. Liu, X. Zhang, and G. Almpanidis. An up-to-date comparison of state-of-the-art classification algorithms. Expert Systems with Applications, 82:128–150, Oct. 1 2017.