1 Introduction
Decision trees are among the most widely used statistical models. Apart from being good classifiers, they have some unique properties that separate them from other models. A path from the root to any leaf can be described as a sequence of decisions: tests of the form $x_d \le t$ on a single attribute (axis-aligned trees) or $\mathbf{w}^T\mathbf{x} + b \ge 0$ on a linear combination of attributes (oblique trees). This not only makes decision trees very fast at inference but also makes them good interpretable models: the sequence of decisions can be read as IF-THEN rules that explain the model's prediction for a given input. However, decision trees have one major drawback: learning the tree from data is a very difficult optimization problem, involving a search over a complex, large set of tree structures, and over the parameters at each node. Recently, Carreira-Perpiñán and Tavallali (2018) proposed the Tree Alternating Optimization (TAO) algorithm to address this problem; it directly optimizes the misclassification error using alternating optimization over separable subsets of nodes. In this work, we compare TAO against some well-known decision tree algorithms over a wide range of datasets.
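To make the two node types concrete, here is a minimal sketch of how a single tree node routes an input; the function names, weights, and thresholds are our own illustrations, not taken from any of the compared implementations:

```python
def axis_aligned_test(x, feature, threshold):
    """Axis-aligned node: compare one attribute to a threshold."""
    return "right" if x[feature] > threshold else "left"

def oblique_test(x, w, b):
    """Oblique node: compare a linear combination of all attributes to zero."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "right" if score >= 0 else "left"

x = [0.2, 1.5, -0.3]
print(axis_aligned_test(x, feature=1, threshold=1.0))  # right (x[1] = 1.5 > 1.0)
print(oblique_test(x, w=[1.0, -1.0, 0.0], b=0.0))      # left (score = -1.3 < 0)
```

Reading such tests off a root-to-leaf path is exactly what yields the IF-THEN rule for an input.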
We structure the paper as follows: in section 2 we briefly describe the algorithms used in the comparison; in section 3 we describe the datasets, including the number of instances and the dimensionality of each; in sections 4 and 5 we describe our experimental setup and the results of the comparison.
2 The algorithms
Below we provide a short description of each algorithm. More details can be found in the corresponding cited papers.

CART: CART (Breiman et al., 1984) is one of the most widely used algorithms for training axis-aligned decision trees. It learns the tree by greedy recursive partitioning, optimizing an impurity measure at each node: at each growing step, for a given node, it enumerates all attributes to find the split that most reduces the Gini index. It grows the tree up to a maximum depth and then prunes nodes one by one as long as pruning does not increase the misclassification error beyond a certain threshold.
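The per-node split search described above can be sketched as follows; this is a minimal illustration of Gini-based enumeration over (feature, threshold) pairs, not rpart's actual implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_axis_aligned_split(X, y):
    """Enumerate all (feature, threshold) pairs; return the one with the
    lowest size-weighted child impurity, as (feature, threshold, score)."""
    n, d = len(X), len(X[0])
    best = (None, None, float("inf"))
    for j in range(d):
        for t in sorted({x[j] for x in X}):
            left  = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            if not left or not right:
                continue  # degenerate split: one empty child
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
print(best_axis_aligned_split(X, y))  # (0, 2.0, 0.0): x0 <= 2 separates the classes
```

Recursing this search on each child until a stopping condition is what "greedy recursive partitioning" means.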

C5.0: C5.0, the successor of C4.5 (Quinlan, 1993), is an established univariate decision tree learning software. Similarly to CART, it uses greedy recursive partitioning of the tree nodes. At each recursive split, the algorithm enumerates different feature-threshold combinations and picks the best one according to the information gain criterion. Pruning can be applied once the tree-growing phase is finished.

TAO: The TAO algorithm, proposed in Carreira-Perpiñán and Tavallali (2018), optimizes a decision tree with a predetermined structure and can be trained to minimize a desired objective function such as the misclassification error. Each iteration of TAO is guaranteed to decrease or leave unchanged the objective function. The algorithm can be applied to both axis-aligned and oblique decision trees. Moreover, it can handle various penalty terms on the objective function, such as sparsity regularization. We briefly describe it here (see Carreira-Perpiñán and Tavallali (2018) for details). TAO assumes a given tree structure with initial parameter values (possibly random), and minimizes the following objective function jointly over the parameters of all nodes of the tree:
$E(\Theta) = \sum_{n=1}^{N} L\big(y_n, T(\mathbf{x}_n; \Theta)\big)$   (1)

where $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ is a training set of $D$-dimensional real-valued instances and their labels (in $K$ classes), $L(\cdot,\cdot)$ is the loss function (e.g. cross-entropy, 0/1 loss, etc.), $T(\mathbf{x}; \Theta)$ is the predictive function of the tree, and $\theta_i$ denotes the parameters at a node $i$. For example, in the case of oblique decision nodes, $\theta_i = \{\mathbf{w}_i, b_i\}$ defines a hyperplane with weight vector $\mathbf{w}_i$ and bias $b_i$, which thus sends an input instance $\mathbf{x}$ down its right child if $\mathbf{w}_i^T \mathbf{x} + b_i \ge 0$ and down its left child otherwise. The basis of the TAO algorithm is the separability condition theorem. It states that for any nodes $i$ and $j$ (internal or leaves) that are not descendants of each other (e.g. all nodes at the same depth), the error in eq. (1) separates over $\theta_i$ and $\theta_j$. Since the loss function separates, the algorithm can optimize eq. (1) over each such node independently. This much simpler problem is referred to as a "reduced problem". The TAO algorithm applies alternating optimization over separable subsets of nodes:

Optimizing over an internal node $i$ is equivalent to training a binary linear classifier on the training instances that currently reach node $i$. Each such instance $\mathbf{x}$ is assigned a pseudolabel based on which child's subtree gives the better prediction for $\mathbf{x}$: with the parameters of both subtrees fixed, we send $\mathbf{x}$ down the left and the right subtree, and whichever gives the correct output determines the pseudolabel (left or right). The pseudolabel thus indicates the child down which node $i$ should send the instance.

Optimizing over a leaf reduces to fitting a $K$-class classifier on the training points that reach that particular leaf. In this paper, we focus on constant leaves; therefore, the solution in this case is simply the majority label of the training points that reach the leaf.
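The two alternating steps above can be sketched as follows. The subtrees here are stand-in callables returning a predicted class, and the names are our own; the actual TAO implementation fits a sparse linear classifier at each decision node against these pseudolabels:

```python
from collections import Counter

def pseudolabels(points, labels, left_subtree, right_subtree):
    """Pseudolabel each instance reaching a decision node by whichever
    child's (fixed) subtree predicts its true class; None = don't care."""
    out = []
    for x, y in zip(points, labels):
        left_ok, right_ok = left_subtree(x) == y, right_subtree(x) == y
        if left_ok and not right_ok:
            out.append("left")
        elif right_ok and not left_ok:
            out.append("right")
        else:
            out.append(None)  # both or neither correct: doesn't constrain the split
    return out

def fit_constant_leaf(labels_reaching_leaf):
    """Optimal constant leaf under 0/1 loss: the majority label."""
    return Counter(labels_reaching_leaf).most_common(1)[0][0]

# Toy subtrees: the left child always predicts class 0, the right always class 1.
print(pseudolabels([[0.1], [0.9]], [0, 1], lambda x: 0, lambda x: 1))  # ['left', 'right']
print(fit_constant_leaf([2, 0, 2, 2, 1]))                              # 2
```

Alternating these node updates over non-descendant subsets (e.g. depth by depth) never increases the objective of eq. (1), which is the monotonicity guarantee stated above.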

3 Datasets
Below we summarize the datasets used in this study and any changes made for the experiments. All datasets are available in the public domain.

Balance Scale: The dataset is available from UCI (Zhang et al., 2017). It was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. There are 625 instances, and each instance has 4 attributes that are described as categorical by Zhang et al. (2017) but are in fact numerical variables discretized to 5 values.

Banknote authentication: This UCI dataset (Zhang et al., 2017) consists of attributes extracted from images of genuine and forged banknote-like specimens. It has two classes: genuine and forged. There are 1,372 instances, and each instance has 4 real-valued attributes.

Blood Transfusion: This dataset is also available from UCI (Zhang et al., 2017). It has two classes: whether or not a donor donated blood in March 2007. There are 748 instances, and each instance has 4 real-valued attributes.

Breast Cancer Wisconsin (Diagnostic): This UCI dataset (Zhang et al., 2017) concerns breast cancer diagnosis. The task is to classify whether the cancer is malignant or benign. There are 569 instances, and each instance has 30 real-valued attributes.

Spam: This UCI dataset (Zhang et al., 2017) consists of a collection of emails, and the task is to create a spam filter that can tell whether an email is spam or not. There are 4,601 instances, and each instance has 57 real-valued attributes.

SensIT: This dataset was created by Duarte and Hu (2004). The task is to classify the types of moving vehicles in a distributed, wireless sensor network. There are three classes, with a separate test set in addition to the training set. Each instance has real-valued attributes.

Letter: The objective of this UCI dataset (Zhang et al., 2017) is to classify the 26 capital letters of the English alphabet. Like the SensIT dataset, it has a separate test set in addition to the training set. Each instance has 16 attributes.

MNIST Images: The dataset consists of grayscale images of handwritten digits, and the task is to classify them as digits 0 to 9. There are 60,000 training images and 10,000 test images. Each image is of size 28×28 with gray levels in [0, 1].

LeNet5 features:
This dataset consists of features extracted by the "conv2" layer of a pretrained LeNet5 neural network (LeCun et al., 1998) for the whole MNIST dataset. Like MNIST images, it has 60,000 training and 10,000 test instances. Each instance has nonnegative real-valued attributes.
4 Experimental setup
For UCI datasets that do not have a separate test set (all except the Letter dataset), we shuffle the entire dataset and keep 20% of the data as the test set. We repeat the training procedure several times for each dataset, reshuffling the training data each time. For each algorithm, we use the same reshuffled training data for a fair comparison. We also apply 10-fold cross-validation for each algorithm to find the best hyperparameters (the pruning parameter for CART and C5.0, the sparsity parameter λ for TAO). Below we describe the experimental setup specific to each algorithm:
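The holdout protocol above can be sketched with the standard library alone; the 20% test fraction mirrors the text, while the data and seed are toy placeholders:

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle indices and carve off a test set of the given fraction."""
    rng = random.Random(seed)  # seeded so each algorithm sees the same reshuffle
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

train, test = holdout_split(list(range(100)), test_fraction=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Fixing the seed per repetition is one simple way to guarantee all algorithms receive the identical reshuffled training data.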

CART: We use the R implementation of CART, rpart (Therneau et al., 2019). For each dataset, during training we let the tree grow up to the maximum depth allowed by rpart (its maxdepth constraint). For this, we set the "minsplit" parameter to 1 and the complexity parameter ("cp") to 0. Once the tree is fully grown, we use rpart's internal k-fold cross-validation (k = 10) to obtain a list of pruning parameters, and choose the best one according to the 1-SE rule (as suggested by the rpart documentation). We report the test and training accuracy of the pruned tree.

C5.0: We use the single-threaded Linux version of C5.0 (provided by the authors; see https://rulequest.com/download.html), written in C. For each dataset, we apply a grid search with k-fold validation to find the best parameters. Specifically, we tune "-c CF", which controls the pruning severity, and "-m cases", the minimum number of points required to split a node. We use the default options for all other parameters. It is worth mentioning that, empirically, we found the tuned parameters are often not far from the default settings.
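The grid search over the two pruning hyperparameters can be sketched as follows; the scoring function here is a toy stand-in for training the learner and measuring validation accuracy, and the grid values are illustrative, not the ones we tuned over:

```python
from itertools import product

def grid_search(cf_grid, m_grid, score_fn):
    """Return the (CF, m) pair with the highest validation score."""
    return max(product(cf_grid, m_grid), key=lambda p: score_fn(*p))

# Toy score: pretend moderate pruning (CF = 0.25) with small m validates best.
toy_score = lambda cf, m: -abs(cf - 0.25) - 0.01 * m
best = grid_search([0.1, 0.25, 0.5], [2, 5, 10], toy_score)
print(best)  # (0.25, 2)
```

In the real setup, `score_fn` would run C5.0 with those flags and return the k-fold validation accuracy.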

TAO: For the TAO algorithm, we use oblique (i.e. linear-split) decision trees with constant leaves. As the initial tree we take a sufficiently deep complete binary tree with random parameters at each node. We run TAO for a fixed number of iterations, and the algorithm proceeds until the maximum number of iterations is reached (i.e. there is no other stopping criterion). We also use a simple grid search on the k-fold validation set to find the best hyperparameters; specifically, we tune the λ parameter, which controls the sparsity of the tree, and the maximum depth of the initial tree. TAO is implemented in Python (version 3.5), without parallel processing, on a single CPU. TAO uses ℓ1-regularized logistic regression to solve the decision-node optimization (using LIBLINEAR, Fan et al., 2008), where the λ parameter above is the ℓ1 regularization strength.
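As a hedged sketch of the decision-node solve, one can use scikit-learn's LIBLINEAR-backed ℓ1 logistic regression as a stand-in (the paper's implementation calls LIBLINEAR directly); here `C` plays the role of the inverse regularization strength, so larger `C` means less sparsity, and the data and pseudolabels are toy:

```python
from sklearn.linear_model import LogisticRegression

# Points reaching a node, with pseudolabels: 0 = send left, 1 = send right.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
pseudo = [0, 0, 1, 1]

# l1 penalty drives weights of irrelevant attributes to exactly zero,
# which is what yields sparse oblique splits.
node = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
node.fit(X, pseudo)
print(node.coef_, node.intercept_)
```

The learned hyperplane then replaces the node's current parameters if it lowers the reduced-problem objective.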
We ran all experiments on a single Linux PC with the following specifications: OS: Ubuntu 18.04 LTS; CPU: Intel Core i7-7700 @ 3.60 GHz (8 logical cores); memory: 16 GiB DDR4 3600 MHz.
5 Results
As mentioned in the previous section, we report the test and training accuracy of the pruned trees. We summarize the results for all datasets in table 1. CART and C5.0 perform similarly in terms of test accuracy, but TAO outperforms both algorithms on all datasets. The accuracy margin between TAO and the other two widens as dataset complexity grows, as with the last 4 datasets in the table. For instance, for MNIST images and LeNet5 features, not only is the training set large but the number of attributes is also very high.
Since decision trees are considered interpretable models, it is important to also compare the sizes of the trained trees. For this, in table 2 we compare the maximum depth and the number of leaves. A tree that is deep and has a large number of leaves is very difficult to interpret; moreover, a larger tree has longer inference time and needs more space. CART performs better than C5.0 in terms of tree size, given that both perform similarly in test accuracy. Again, TAO outperforms both algorithms. As with the accuracy results, the tree-size margin grows as the datasets become more difficult. For example, for MNIST images and LeNet5 features the difference is very large: the other trees are roughly twice as deep as the TAO trees and have more than 4 times as many leaves.
Table 1: classification accuracy (%) for each algorithm, as mean ± standard deviation over the repeated runs (training, then test).

Dataset  TAO train  TAO test  C5.0 train  C5.0 test  CART train  CART test
Balance scale  91.68 ± 0.72  88.48 ± 2.56  88.38 ± 1.43  78.19 ± 1.43  85.94 ± 0.42  78.96 ± 0.34
Banknote auth  99.83 ± 0.33  99.18 ± 0.14  99.63 ± 0.12  98.70 ± 0.68  99.45 ± 0.02  97.93 ± 0.06
Blood Transf  81.74 ± 0.89  78.93 ± 3.12  79.69 ± 0.59  78.40 ± 2.45  76.45 ± 0.01  75.20 ± 0.02
Breast Cancer  98.21 ± 0.79  97.71 ± 1.04  97.36 ± 0.61  94.83 ± 0.90  96.10 ± 0.01  94.57 ± 0.02
Spambase  95.55 ± 0.47  93.31 ± 1.22  96.18 ± 0.38  92.85 ± 0.84  94.96 ± 0.01  91.92 ± 0.01
SensIT  85.68 ± 0.13  85.12 ± 0.20  86.66 ± 0.11  82.41 ± 0.04  84.38 ± 0.01  81.71 ± 0.01
Letter  95.39 ± 0.24  89.15 ± 0.88  97.97 ± 0.14  85.26 ± 0.33  94.30 ± 0.01  86.04 ± 0.04
MNIST Images  98.43 ± 0.07  94.74 ± 0.11  94.52 ± 0.23  88.71 ± 0.35  92.54 ± 0.03  88.03 ± 0.07
LeNet5 Features  99.98 ± 0.01  98.22 ± 0.18  97.89 ± 0.14  93.48 ± 0.21  95.71 ± 0.04  93.31 ± 0.05
Table 2: size of the trained trees (average maximum depth and number of leaves).

Dataset  TAO Depth  TAO # of leaves  C5.0 Depth  C5.0 # of leaves  CART Depth  CART # of leaves
Balance scale  3  5.6  7.1  27.8  6.7  22.6 
Banknote auth  3  7.4  5.8  14.3  5.8  14.0 
Blood Transf  5  10.8  2.5  4.6  0  1 
Breast Cancer  3  7.8  4.0  9.0  3.2  5.5 
Spambase  4  14.8  14.7  68.6  10.7  41.7 
SensIT  7  69.2  15.2  410.0  14.0  239.5 
Letter  11  1077.6  17.0  1343.0  26.0  920.8 
MNIST Images  8  177.8  19.0  941.6  18.3  805.4 
LeNet5 Features  8  166.8  15.2  582.0  18.6  363.2 
6 Discussion
In this work, we compared the performance of some well-known decision tree algorithms with the recently proposed TAO algorithm. Our experiments show that TAO not only achieves better accuracy but also produces smaller and more interpretable decision trees. The reason for this better performance is how differently TAO trains the tree. CART and C5.0 optimize the decision tree greedily: at each step, both algorithms split the data using a single attribute chosen to optimize the impurity of a single node. This approach has no guarantee of reducing the global loss; thus, in the end, the trained tree usually does not generalize well and is also large. TAO, instead of optimizing the impurity of one node at each step, optimizes the misclassification error of the entire tree, updating the parameters of all nodes at each iteration. Optimizing the parameters of all nodes jointly produces better-optimized decision trees that not only generalize well but also have a smaller size.
References
 Breiman et al. (1984) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, Calif., 1984.
 Carreira-Perpiñán and Tavallali (2018) M. Á. Carreira-Perpiñán and P. Tavallali. Alternating optimization of decision trees, with application to learning sparse oblique trees. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems (NEURIPS), volume 31, pages 1211–1221. MIT Press, Cambridge, MA, 2018.
 Duarte and Hu (2004) M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826–838, 2004.

 Fan et al. (2008) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, Aug. 2008.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
 Quinlan (1993) J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 Therneau et al. (2019) T. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning and regression trees. R package version 4.1-15, Apr. 12 2019. Available online at https://cran.r-project.org/package=rpart.
 Zhang et al. (2017) C. Zhang, C. Liu, X. Zhang, and G. Almpanidis. An uptodate comparison of stateoftheart classification algorithms. Expert Systems with Applications, 82:128–150, Oct. 1 2017.