Decision trees (DTs) are an increasingly popular method used for classifying data. In the typical tree building procedure, the space that the data occupies (feature space) is iteratively partitioned into disjoint sub-regions until each sub-region is homogeneous (or nearly so) with respect to a particular class. In a DT, each sub-region is represented by a node in the tree. The node can be either terminal or non-terminal. Non-terminal nodes are impure and can be split further using a series of tests based on the feature variables, a process called splitting. Each split is determined by considering a series of hyperplanes which separate the feature space into two sub-regions. The best hyperplane split is chosen as the one which maximises the change in an impurity function. To obtain a fully grown tree, this process is recursively applied to each non-terminal node until terminal nodes are reached. The terminal nodes correspond to homogeneous or near homogeneous sub-regions in the feature space. Each terminal node is assigned the class label that minimises the misclassification cost at the node.
DTs play an important role in statistical learning and have been a popular technique for data classification over several decades (see Breiman et al., 1984; Murthy et al., 1994; López-Chau et al., 2013). In the tree building process the aim is to produce accurate and small trees while minimising the computational time. Accuracy, size and time mainly depend on the way non-terminal nodes are split in a DT. Three types of splits are considered: axis parallel, oblique and non-linear splits. Axis parallel splits partition the space parallel to the feature axes. Therefore axis parallel trees are desirable when the decision boundaries are aligned with the feature axes. Oblique splits are hyperplane splits defined by a linear combination of the feature variables. These splits are more appealing when the decision boundaries are not aligned with the feature axes. Non-linear splits (Ittner and Schlosser, 1996; Li et al., 2005) are a more general class of splits. Decision boundaries generated by these splits can take arbitrary shapes and can easily be influenced by noisy data (Li et al., 2005).
Many algorithms have been proposed to induce DTs. In general, these algorithms differ in the way they search for the best split at each non-terminal node. Many studies show that trees which use oblique splits are generally smaller and more accurate than axis parallel trees (Li et al., 2003). Therefore they have become increasingly popular in the DT literature, and this motivated us to propose a new methodology to construct a DT which uses oblique splits at each non-terminal node. These DTs are called Oblique Decision Trees (Murthy et al., 1994). More specifically, let the feature vector $\mathbf{x} = (x_1, x_2, \dots, x_p)^T$ consist of $p$ attributes, where $x_i \in \mathbb{R}$. The oblique splits can be defined as linear combinations of features of the form
$$\sum_{i=1}^{p} a_i x_i \le c,$$
where $a_i, c \in \mathbb{R}$.
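As an illustration of this split form, the following minimal Python sketch evaluates a linear-combination split for a single example. The names `oblique_split`, `a` (coefficient vector) and `c` (threshold) are assumptions made for this example and are not taken from any particular implementation.

```python
from typing import Sequence


def oblique_split(x: Sequence[float], a: Sequence[float], c: float) -> bool:
    """Return True if example x lies on the 'left' side of the hyperplane a.x <= c."""
    if len(x) != len(a):
        raise ValueError("x and a must have the same length")
    return sum(ai * xi for ai, xi in zip(a, x)) <= c


# An axis parallel split is the special case with a single non-zero coefficient.
goes_left_axis = oblique_split([1.0, 2.0], a=[1.0, 0.0], c=1.5)     # tests x_1 <= 1.5
goes_left_oblique = oblique_split([1.0, 2.0], a=[1.0, 1.0], c=2.5)  # tests x_1 + x_2 <= 2.5
```

Treating the axis parallel split as a special case of the same test keeps a single code path for both kinds of node.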
One of the major issues when inducing an oblique DT is the time complexity of the induction algorithm. In a data structure with $p$ feature variables and $n$ examples at a non-terminal node, the number of splits to be evaluated to find the best axis parallel split is $n \cdot p$. Therefore, the globally optimal split (with respect to an impurity function) at a non-terminal node can be found by exhaustively searching all possible splits along the feature axes. However, the number of splits to be evaluated to find the best oblique split at a node by exhaustive search is at most $2^p \binom{n}{p}$ (Murthy et al., 1994). Hence, an exhaustive search for the best oblique split is impractical. Furthermore, the best split at a node does not necessarily lead to the optimal tree, so spending more time searching for the best split at a node may not, in general, be beneficial (Heath et al., 1993). Moreover, Hyafil and Rivest (1976) point out that finding an optimal binary DT is an NP-complete problem. This led us to search for efficient heuristics for constructing near optimal decision trees. In this work, we propose a simple and effective heuristic method to induce oblique decision trees.
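To give a feel for the gap between the two search spaces, the short sketch below evaluates the counts reported by Murthy et al. (1994): $n \cdot p$ candidate axis parallel splits versus at most $2^p \binom{n}{p}$ candidate oblique splits. The function names and the node size used are hypothetical and chosen only for illustration.

```python
from math import comb


def axis_parallel_split_count(n: int, p: int) -> int:
    """Number of distinct axis parallel splits for n examples and p features."""
    return n * p


def oblique_split_upper_bound(n: int, p: int) -> int:
    """Upper bound on the number of distinct oblique splits (Murthy et al., 1994)."""
    return 2 ** p * comb(n, p)


n, p = 100, 5  # hypothetical node: 100 examples, 5 features
print(axis_parallel_split_count(n, p))  # 500
print(oblique_split_upper_bound(n, p))  # 2409200640
```

Even for this small node the oblique bound is several million times larger than the axis parallel count, which is why heuristic searches are used in practice.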
The remaining sections of this paper are organized as follows: Section 2 highlights related work. Section 3 introduces our proposed method. Comparisons with some commonly used DT algorithms are presented in Section 4. Section 5 concludes the paper with discussions.
2 Related Work
Most oblique DT induction algorithms construct DTs in a top-down fashion (Rokach and Maimon, 2005). The induction algorithms differ in the way they search for the best split and can be grouped into three categories: those that use optimisation techniques, those that use standard statistical techniques, and those that use heuristic arguments.
2.1 Tree induction methods based on optimisation techniques
The first major oblique DT algorithm was Classification and Regression Trees - Linear Combination, which is commonly known as CART-LC (Breiman et al., 1984). CART-LC uses a deterministic hill climbing algorithm to search for the best oblique split at a non-terminal node. A backward feature elimination process is also carried out to delete irrelevant features from the split. CART-LC will not necessarily find the best split at each node because there is no built-in mechanism to avoid getting stuck in local maxima of the change in impurity. The best split found may therefore be only a local, rather than global, maximiser of this quantity.
Simulated Annealing Decision Tree (SADT) was introduced by Heath et al. (1993). This DT uses the simulated annealing optimisation algorithm, which uses randomisation, to search for the best split. The use of randomisation potentially avoids getting stuck in local maxima of the change in impurity and will often produce better trees than those of CART-LC. The main disadvantage of the algorithm is the time taken to find the best split. In some cases it may require the evaluation of tens of thousands of hyperplanes before finding an optimal split (Murthy et al., 1994).
The concepts of CART-LC and SADT are combined to produce a new oblique DT methodology called OC1 by Murthy et al. (1994). Their method uses a deterministic hill climbing algorithm to perturb the coefficients of an initial hyperplane until a local maximum of the change in impurity is found. Then the hyperplane is perturbed randomly in an attempt to find a hyperplane that improves it further. These two steps are repeated several times. Each time the algorithm starts with a different initial hyperplane, with one being the best axis parallel split and the others chosen randomly. After many hyperplanes have been evaluated, the one that maximises the change in impurity is taken as the splitting hyperplane. For $n$ examples and $p$ features, the worst-case time complexity at each non-terminal node for OC1 is shown to be $O(pn^2 \log n)$, provided that the Max Minority or Sum Minority impurity measures are used. However, the complexity increases for other impurity measures and for multi-class problems. One feature of both SADT and OC1 is that both algorithms can construct different decision trees on different runs using the same learning sample. Therefore, it is possible to run these algorithms multiple times and pick the best tree. However, this advantage is only realised on relatively small training example sets.
2.2 Tree induction methods based on standard statistical techniques
Various oblique DT induction algorithms have been developed using standard statistical techniques; examples can be found in Gama and Brazdil (1999), Kolakowska and Malina (2005), Li et al. (2003) and López-Chau et al. (2013). The advantage of this approach is that the time required to induce DTs is generally lower than for those based on optimisation algorithms.
Quick Unbiased Efficient Statistical Tree (QUEST) (Loh and Shih, 1997) uses Linear Discriminant Analysis (LDA) to find the best split at each node and hence there is no requirement for searching for the best split. QUEST’s axis parallel tree begins by performing an ANOVA test at each non-terminal node to select the best feature. LDA is then applied to the selected feature to find the best splitting point. QUEST’s oblique DT simply applies LDA to all features to find the best splitting hyperplane. Furthermore, QUEST is able to find oblique splits which are a linear combination of qualitative and quantitative features. For multi-class problems, QUEST groups the classes into two super-classes using the 2-means clustering algorithm, which increases the time complexity of the algorithm.
2.3 Tree induction methods based on heuristics
DTs based on heuristic arguments have gained popularity in the recent past. In this approach, the split logic is constructed by assuming a particular structure for the class boundaries. If the assumption holds, DTs based on heuristic arguments produce accurate and small trees. Examples can be found in Amasyah and Ersoy (2008) and Manwani and Sastry (2012).
The CARTopt algorithm, introduced by Robertson et al. (2013), uses a two class oblique tree to find a minimiser of a nonsmooth function $f$. Initially the examples in the training set are labelled as high and low depending on their value of $f$. An oblique DT is then used to form a partition of the search space which separates the low points from the high points. Rather than forming the oblique DT directly, the authors reflect the training examples using a Householder matrix. Axis parallel splits are then searched in the reflected training data. These splits are oblique in the original space.
CARTopt introduces a new heuristic to induce oblique decision trees. It uses the simplest form of splits, axis parallel splits, to find oblique splits. Hence the time complexity of searching for oblique splits using CARTopt’s approach is lower than that of methods based on optimisation algorithms. In this study we extend CARTopt’s idea in a number of ways to develop a complete oblique DT for statistical data classification.
We extend the oblique DT method used in the CARTopt optimisation algorithm of Robertson et al. (2013) in a number of ways to develop a complete oblique DT called HHCART. First, CARTopt is designed to classify two classes whereas HHCART can handle multi-class classification problems. Second, CARTopt reflects the training examples at the root node only whereas HHCART performs reflections at each non-terminal node during tree construction. Finally, CARTopt is only defined for quantitative features whereas HHCART is capable of finding oblique splits which can be linear combinations of both quantitative and qualitative features.
First, we explain the basic concept of our algorithm for a two class classification problem. The algorithm easily generalises to the multi-class problem. In our approach we find each separating hyperplane by considering the orientation of each class. We propose the dominant eigenvector of the covariance matrix of a class to represent the orientation of that class. If this orientation is parallel to one of the feature axes, the best separating hyperplane may be found by performing axis parallel splits. Otherwise, we reflect the set of examples to a new coordinate system such that the orientation of one of the classes becomes parallel to one of the axes in the reflected feature space. Axis parallel splits can then be searched in the reflected feature space to find the best split. This split will be oblique in the original feature space (Robertson et al., 2013).
Consider the two dimensional, two class classification problem shown in Figure 1 (a).
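First we define the estimated covariance matrix of a set of examples. Let $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ be $p$-dimensional feature vectors, where $\mathbf{x}_i \in \mathbb{R}^p$. Then the estimated covariance matrix is given by
$$S = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T,$$
where $\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$ is the mean vector.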
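We reflect the examples using a Householder matrix which can be defined as follows. Let $\mathbf{d}$ be the dominant eigenvector of the estimated covariance matrix of class 1 examples, scaled to unit length. Then there exists an orthogonal symmetric matrix $H$ of size $p \times p$ (where $p$ is the number of features) such that
$$H = I - 2\mathbf{u}\mathbf{u}^T, \qquad \mathbf{u} = \frac{\mathbf{e}_1 - \mathbf{d}}{\lVert \mathbf{e}_1 - \mathbf{d} \rVert_2},$$
where $\mathbf{e}_1 = (1, 0, \dots, 0)^T$.
Let $D$ be the training example set, with one example per row. The reflected example set $\hat{D}$ is obtained using $\hat{D} = DH$. Since $H$ is symmetric and orthogonal, a point in the transformed space can be mapped back to the original space at minimal cost ($D = \hat{D}H$, because $H^{-1} = H$). The mechanism of the Householder reflection is that it reflects the vector $\mathbf{d}$ onto $\mathbf{e}_1$ by a reflection through the plane perpendicular to the vector $\mathbf{u}$. The reflected example set is shown in Figure 1 (b).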
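The reflection step can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard Householder construction $H = I - 2\mathbf{u}\mathbf{u}^T$ with $\mathbf{u} = (\mathbf{e}_1 - \mathbf{d})/\lVert \mathbf{e}_1 - \mathbf{d} \rVert$ and a unit-length orientation vector; the function names `householder_matrix` and `reflect` and the numerical values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def householder_matrix(d: List[float]) -> Matrix:
    """Build H = I - 2 u u^T with u = (e1 - d) / ||e1 - d||, so that H d = e1.

    Assumes d has unit Euclidean norm.
    """
    p = len(d)
    diff = [(1.0 if i == 0 else 0.0) - d[i] for i in range(p)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm < 1e-12:
        # d already equals e1, so no reflection is needed.
        return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    u = [v / norm for v in diff]
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(p)]
            for i in range(p)]


def reflect(examples: Matrix, H: Matrix) -> Matrix:
    """Reflect each row x of the example set: x_hat = x H."""
    p = len(H)
    return [[sum(x[k] * H[k][j] for k in range(p)) for j in range(p)]
            for x in examples]


d = [0.6, 0.8]                           # hypothetical unit-norm orientation vector
H = householder_matrix(d)
reflected = reflect([d, [3.0, 4.0]], H)  # both rows lie along d
restored = reflect(reflected, H)         # applying H twice undoes the reflection
```

Both example rows lie along the orientation vector, so after reflection they lie along the first coordinate axis, and a second application of the same matrix recovers the original rows.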
Each column of the Householder matrix $H$ represents the direction of a coordinate axis in the reflected space. Axis parallel splits are searched along these axes. These splits are oblique in the original space. The best axis parallel split found in the reflected space, which is oblique in the original space, is shown in Figure 1 (c).
The axis parallel search space can be enlarged by using all possible eigenvectors for reflections. For a $p$-dimensional classification problem, each class contributes $p$ eigenvectors, so the number of eigenvectors to be considered for the Householder reflection is $p$ times the number of classes. This increases the time complexity of tree induction, but offers an opportunity to produce better trees.
Here we explain the complete HHCART algorithm. We propose two versions of HHCART: HHCART(A) is based on all possible eigenvectors of all classes and HHCART(D) is based on only the dominant eigenvector of each class. For any given non-terminal node $t$, let $D_t$ and $C_t$ be the set of examples and the set of classes available at that node respectively. At node $t$, HHCART(A) finds all eigenvectors of the estimated covariance matrix for each class in $C_t$, whereas HHCART(D) finds only the dominant eigenvector of each class. A Householder matrix is constructed for each eigenvector. Then $D_t$ is reflected using each Householder matrix, and axis parallel splits are performed along each coordinate axis in the reflected space. The best axis parallel split is chosen as the separating hyperplane at node $t$. However, if an eigenvector is already parallel to one of the feature axes, no reflection is done and axis parallel splits are searched in the original space. The hyperplane found divides node $t$ into two child nodes. The algorithm is recursively run on all child nodes until each child node satisfies either:
- the misclassification rate at the child node is either 0 or not greater than a user specified threshold (MisRate); or
- the number of examples in the node is less than or equal to a user specified threshold (MinParent).
An overview of the HHCART(A) algorithm at a node is given in Algorithm 1. The worst-case time complexity at a node for HHCART(A) is derived in Appendix A. If HHCART(D) is used, the time complexity is reduced.
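The node-level procedure described above can be sketched as follows for the dominant-eigenvector variant. This is a simplified illustration and not the authors' implementation: it assumes Gini impurity instead of the Twoing rule, uses plain power iteration for the dominant eigenvector, searches thresholds exhaustively, and all function names are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed impurity measure)."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def dominant_eigenvector(rows, iters=200):
    """Power iteration on the sample covariance matrix of rows.

    Returns None when the iteration collapses to zero, e.g. all rows identical.
    """
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    S = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
          for j in range(p)] for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return None
        v = [x / norm for x in w]
    return v


def householder(d):
    """H = I - 2 u u^T with u = (e1 - d)/||e1 - d||; identity if d equals e1."""
    p = len(d)
    diff = [(1.0 if i == 0 else 0.0) - d[i] for i in range(p)]
    norm = math.sqrt(sum(x * x for x in diff))
    u = [x / norm for x in diff] if norm > 1e-12 else [0.0] * p
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(p)]
            for i in range(p)]


def best_axis_parallel_split(X, y):
    """Exhaustive search; returns (impurity decrease, feature index, threshold)."""
    n, p = len(X), len(X[0])
    parent, best = gini(y), (0.0, None, None)
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        for k in range(1, n):
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue
            left = [y[i] for i in order[:k]]
            right = [y[i] for i in order[k:]]
            gain = parent - (k / n) * gini(left) - ((n - k) / n) * gini(right)
            if gain > best[0]:
                best = (gain, j, (lo + hi) / 2.0)
    return best


def hhcart_d_split(X, y):
    """Return (gain, a, c) describing the best split a.x <= c found at this node."""
    p = len(X[0])
    best = (0.0, None, None)
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        d = dominant_eigenvector(rows) if len(rows) > 1 else None
        if d is None:
            continue  # class has a single example or no spread: omit it
        H = householder(d)
        X_ref = [[sum(x[k] * H[k][j] for k in range(p)) for j in range(p)] for x in X]
        gain, j, c = best_axis_parallel_split(X_ref, y)
        if j is not None and gain > best[0]:
            # Column j of H holds the oblique coefficients in the original space.
            best = (gain, [H[k][j] for k in range(p)], c)
    if best[1] is None:
        # No usable class orientation: fall back to axis parallel splits.
        gain, j, c = best_axis_parallel_split(X, y)
        if j is not None:
            best = (gain, [1.0 if k == j else 0.0 for k in range(p)], c)
    return best


# Two classes lying on parallel diagonal lines: no single axis parallel split separates them.
X = [[0, 0], [1, 1], [2, 2], [3, 3], [0, 2], [1, 3], [2, 4], [3, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
gain, a, c = hhcart_d_split(X, y)
```

In this toy example the returned coefficient vector `a` and threshold `c` define a hyperplane that separates the two classes, even though the split was found by an axis parallel search in the reflected space. A full tree builder would call this function recursively and stop according to the MisRate and MinParent rules.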
3.1 Small Samples
As the tree grows, the number of examples at each node usually becomes small. This raises two questions to be answered. (a) Is it worthwhile searching for an oblique split or is an axis parallel split sufficient? (b) How are the eigenvectors calculated for small sample sizes? The first problem is common for any oblique DT. In the OC1 algorithm the authors suggest using oblique splits if the number of examples at a node is greater than twice the number of feature variables. The second question has two parts:
(1) Non-availability of some eigenvalues (the ones equal to zero) due to a singular covariance matrix.
(2) Performing an eigen analysis for classes having only one example or several examples with the same feature vector.
Part (1) can be solved without modifying the HHCART algorithm because the reflection is done using the eigenvectors whose eigenvalues are not zero. For part (2), we simply omit classes that have a single example or several examples with the same feature vector. However, if all the classes suffer from this problem, then axis parallel splits are performed.
3.2 Qualitative Variables
Data classification problems often contain a mixture of quantitative and qualitative feature variables. Since the class discriminatory information may be contained in both types of feature variables, an effective classifier should be able to handle both types of features in the classification process. For a qualitative feature variable $x_j$, the form of the split is given by $x_j \in A$, where $A$ is a non-empty subset of the values taken by $x_j$. If a qualitative feature has $m$ non-empty levels, $2^{m-1} - 1$ splits are possible. Axis parallel algorithms which consider qualitative splits can be found in (Quinlan, 1986).
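The number of candidate subset splits for a qualitative feature grows quickly with the number of levels. The sketch below enumerates each distinct binary partition exactly once for a feature with $m$ levels, giving $2^{m-1} - 1$ splits; the function name `qualitative_splits` and the level labels are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Iterable, List, Set


def qualitative_splits(levels: Iterable[Hashable]) -> List[Set[Hashable]]:
    """Enumerate the subsets A that define distinct splits 'x in A' vs 'x not in A'.

    The first level is always placed in A so that a subset and its complement
    are not counted as two different splits.
    """
    ordered = sorted(set(levels))
    if len(ordered) < 2:
        return []
    first, rest = ordered[0], ordered[1:]
    splits = []
    # Take fewer than len(rest) extra levels so that A never contains every level.
    for size in range(len(rest)):
        for combo in combinations(rest, size):
            splits.append({first, *combo})
    return splits


three_level_splits = qualitative_splits(["red", "green", "blue"])
four_level_count = len(qualitative_splits("abcd"))
```

For three levels this yields three candidate subsets and for four levels seven, matching $2^{m-1} - 1$.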
Incorporating qualitative features in oblique splits has not been explored much. The QUEST algorithm (Loh and Shih, 1997) is capable of finding oblique splits with both qualitative and quantitative features. QUEST transforms each unordered qualitative feature variable into a new ordered quantitative feature variable. Each level of an unordered qualitative feature is mapped to an ordered value called a CRIMCOORD. The exact CRIMCOORD algorithm can be found in (Loh and Shih, 1997). We implement the same CRIMCOORD algorithm in HHCART to induce oblique splits which contain both qualitative and quantitative features. At each node, a new quantitative feature is constructed for each qualitative feature by mapping its levels to CRIMCOORDs. Then these new quantitative features are amalgamated with the existing quantitative features in the example set. The HHCART algorithm can then be applied to find the best oblique split. At each node the CRIMCOORD corresponding to each level of each qualitative feature is stored. When predicting, the level of each qualitative feature of an unclassified observation is replaced by the corresponding CRIMCOORD attached to each node along its path.
Two sets of experiments are carried out to compare the performance of HHCART with other DT methods. The first experiment considers quantitative example sets and the second experiment considers example sets with both qualitative and quantitative features. Both HHCART(A) and HHCART(D) methods are considered in the experiments.
4.1 Comparison on datasets having quantitative features only
In this section, we compare the HHCART methods with OC1, OC1-LC (OC1 version of Breiman’s linear combination methods) and OC1-AP (OC1 version of axis parallel splits). All of these methods are available in the OC1 system which is freely available at (Murthy et al., 1993). However, the backward feature elimination process of Breiman’s CART-LC method is not included in OC1-LC and hence is somewhat different from the original method.
4.1.1 Experimental Setup
Experiments were performed on real data sets that were downloaded from (Bache and Lichman, 2013) and are given in Table 1. In our algorithm we set MinParent=2 and MisRate=0. For OC1, OC1-LC and OC1-AP, MinParent was set to 2. All the algorithms used the Twoing rule as the measure of impurity (Breiman et al., 1984) and cost complexity pruning (Breiman et al., 1984) with zero standard error. For OC1, the number of restarts and the number of jumps were set to 20 and 5 (default values) respectively. Five-fold cross validation was used to estimate the classification accuracy. For each fold, a portion of the training set was used exclusively for pruning. We then used ten five-fold cross validations to estimate the accuracy and the size of the tree; that is, the accuracy and tree size (number of terminal nodes) were averaged over ten runs. Results are reported in Table 2 along with the respective standard deviations. The Shuttle data set comes with its own training set containing 43500 examples and a test set with 14500 examples. Therefore, instead of performing a cross validation experiment, we induced several trees, each using part of the training examples for induction and the remainder for pruning. The accuracy of all the trees was estimated using the Shuttle data test set. Since approximately 80% of the examples belong to class 1, the aim is to achieve an accuracy between 99% and 99.9% (Bache and Lichman, 2013).
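For reference, the Twoing rule used as the impurity measure can be sketched as below. The sketch follows the form given by Breiman et al. (1984), $\frac{p_L p_R}{4}\bigl(\sum_j |p(j \mid t_L) - p(j \mid t_R)|\bigr)^2$; the function name `twoing` is an assumption, and the constant factor does not affect the ranking of candidate splits.

```python
from collections import Counter
from typing import Hashable, Sequence


def twoing(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Twoing value of a split given the class labels sent left and right.

    Larger values indicate a better separation of the classes.
    """
    n_left, n_right = len(left), len(right)
    n = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty carries no information
    p_left, p_right = n_left / n, n_right / n
    left_counts, right_counts = Counter(left), Counter(right)
    classes = set(left_counts) | set(right_counts)
    # Sum of absolute differences between the class proportions on each side.
    diff = sum(abs(left_counts[k] / n_left - right_counts[k] / n_right)
               for k in classes)
    return p_left * p_right / 4.0 * diff ** 2


perfect = twoing([0, 0], [1, 1])      # classes fully separated
useless = twoing([0, 1], [0, 1])      # identical class mix on both sides
```

A perfectly separating, balanced two class split attains the value 0.25, while a split that leaves the class proportions unchanged scores 0.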
|Data set||No. of features||No. of classes||No. of examples|
|Pima Indian (PIND)||8||2||768|
|Breast Cancer (BC)||9||2||638|
|Boston Housing (BH)||13||2||506|
|Balance Scale (BS)||4||3||625|
|Dataset||DT||Avg. Acc.||Avg. Size||Dataset||DT||Avg. Acc.||Avg. Size|
Table 2 shows the results for our first experiment. The average accuracies and the average tree sizes of ten five-fold cross validations are listed in the table. It is clear that oblique splits reduce the average tree size for all the data sets while increasing the accuracy for most of the data sets. The average accuracy of HHCART(A) is significantly (more than 2 standard deviations) higher than all the other methods tested for the BC dataset. Also the average accuracy of HHCART(A) is higher than the other methods for the BS, BH, WINE and SUR datasets.
The average tree sizes of HHCART(A) are consistently smaller than the average tree sizes of other methods except for the HRT dataset. Therefore the performance of HHCART(A) with respect to accuracy and tree size is better than the other methods for BS, BH, BC, WINE and SUR datasets.
Eight of the eleven datasets have at least 8 features. For six of these relatively high dimensional data sets, the performance of HHCART(A) is comparable with OC1 and OC1-LC. Therefore we can conclude that the proposed method works well in relatively high dimensional feature spaces.
For all the datasets except BS and WINE, HHCART(D) performs as well as HHCART(A) in terms of the average accuracy. Also, the tree sizes of HHCART(D) are comparable with those produced by HHCART(A) except for the BH, BUPA and HRT datasets. The performance of HHCART(D) is similar to that of OC1 with respect to both accuracy and tree size for all the datasets except the BS, HRT and BUPA datasets. The time complexity of HHCART(A) is higher than that of HHCART(D) by a factor of $p$, the number of features. Results show that HHCART(D) produces DTs with similar accuracies and sizes to HHCART(A) and OC1 for most of the datasets. Hence HHCART(D) would be a more efficient method to use for higher dimensional problems.
4.2 Comparison on datasets having qualitative and quantitative features
Experiments were performed to study the performance of the HHCART methods when the training examples contain both qualitative and quantitative features. Since OC1, OC1-AP, and OC1-LC are not designed to handle oblique splits containing both qualitative and quantitative features, QUEST (Loh and Shih, 1997) was used for comparison purposes.
4.2.1 Experimental setup
Experiments were performed on the datasets available in (Bache and Lichman, 2013), which are given in Table 3. Ten five-fold cross validations were used and the average accuracies and tree sizes (over ten cross validations) are reported in Table 4. The Income dataset comes with its own training and testing sets of 30162 and 15060 examples respectively. We induced several trees, each using part of the training examples for induction, and the remaining examples were used for pruning. The accuracy of all the trees was estimated using the same test set.
|Data Set||No. of features (No. of Qualitative)||No. of Classes||No. of Examples|
QUEST uses the following parameter setting: estimated priors, unit misclassification cost, zero standard error for pruning, linear splits, linear discriminant analysis for the split point, minimum node size for splitting =2. The HHCART methods were implemented as above.
|Dataset||Decision Tree||Avg. Acc.||Avg. Size|
For the Income dataset, HHCART(A)’s performance is significantly (more than 2 standard deviations) better than QUEST both in terms of the average accuracy and average tree size. For the other two datasets, HHCART(A) produces comparable accuracies with smaller trees. These results also suggest that the HHCART algorithms perform well in relatively high dimensions. Though HHCART(D) produces larger trees compared with HHCART(A), its classification accuracy is comparable with HHCART(A).
In this work we have presented a new way of inducing oblique DTs called HHCART. It uses the eigenvectors of the estimated covariance matrices of respective classes to define a Householder matrix which is used to reflect the examples so that reflected axis parallel splits can be found. Two versions of HHCART have been presented: HHCART(A) uses all possible eigenvectors of the estimated covariance matrices of respective classes whereas HHCART(D) uses only the dominant eigenvector of each class. Based on the empirical results obtained, it is clear that both HHCART methods perform well in terms of accuracy and tree size. Furthermore, HHCART is capable of classifying datasets with both qualitative and quantitative features.
Appendix A Time Complexity of HHCART
Here we derive the maximal time complexity at a node of HHCART(A) and HHCART(D). Assume there are examples with quantitative features and classes at the node.
HHCART(A) and HHCART(D) - Complexity for constructing estimated covariance matrix for one class of examples is . For classes the complexity is .
HHCART(A) - Complexity of the complete eigen analysis for one class of examples is . For classes the complexity is .
HHCART(D) - Complexity for finding the dominant eigenvector for one class of examples is . For classes the complexity is .
HHCART(A) - Complexity for the reflection of examples using one Householder matrix is . Since there are Householder matrices the Complexity is .
HHCART(D) - Complexity for the reflection of examples using one Householder matrix is . For Householder matrices the complexity is .
HHCART(A) - Complexity of finding the best axis parallel splits for one reflected space is . Since there are reflected spaces the Complexity is .
HHCART(D) - Complexity of finding the best axis parallel splits for one reflected space is . For classes the complexity is
HHCART(A) - The maximal time complexity at a node is +.
HHCART(D) - The maximal time complexity at a node is + .
- Amasyah and Ersoy (2008) Amasyah, M., Ersoy, O., 2008. Cline: A new decision-tree family. Neural Networks, IEEE Transactions on 19 (2), 356–363.
- Bache and Lichman (2013) Bache, K., Lichman, M., 2013. UCI machine learning repository.
- Breiman et al. (1984) Breiman, L., Friedman, J., Stone, C. J., Olshen, R. A., 1984. Classification and regression trees. CRC press.
- Gama and Brazdil (1999) Gama, J., Brazdil, P., 1999. Linear tree. Intelligent Data Analysis 3 (1), 1–22.
- Heath et al. (1993) Heath, D., Kasif, S., Salzberg, S., 1993. Induction of oblique decision trees. In: IJCAI. Citeseer, pp. 1002–1007.
- Hyafil and Rivest (1976) Hyafil, L., Rivest, R. L., 1976. Constructing optimal binary decision trees is np-complete. Information Processing Letters 5 (1), 15–17.
- Ittner and Schlosser (1996) Ittner, A., Schlosser, M., 1996. Non-linear decision trees-ndt. In: ICML. Citeseer, pp. 252–257.
- Kolakowska and Malina (2005) Kolakowska, A., Malina, W., 2005. Fisher sequential classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35 (5), 988–998.
- Li et al. (2003) Li, X.-B., Sweigart, J. R., Teng, J. T., Donohue, J. M., Thombs, L. A., Wang, S. M., 2003. Multivariate decision trees using linear discriminants and tabu search. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 33 (2), 194–205.
- Li et al. (2005) Li, Y., Dong, M., Kothari, R., 2005. Classifiability-based omnivariate decision trees. IEEE Transactions on Neural Networks 16 (6), 1547–1560.
- Loh and Shih (1997) Loh, W.-Y., Shih, Y.-S., 1997. Split selection methods for classification trees. Statistica sinica 7 (4), 815–840.
- López-Chau et al. (2013) López-Chau, A., Cervantes, J., López-García, L., Lamont, F. G., 2013. Fisher’s decision tree. Expert Systems with Applications 40 (16), 6283–6291.
- Manwani and Sastry (2012) Manwani, N., Sastry, P., 2012. Geometric decision tree. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (1), 181–192.
- Murthy et al. (1993) Murthy, S. K., Kasif, S., Salzberg, S., 1993. The oc1 decision tree software.
- Murthy et al. (1994) Murthy, S. K., Kasif, S., Salzberg, S., 1994. A system for induction of oblique decision trees. J Artif. Intell Res. 2 (1), 1–32.
- Quinlan (1986) Quinlan, J. R., 1986. Induction of decision trees. Machine learning 1 (1), 81–106.
- Robertson et al. (2013) Robertson, B., Price, C., Reale, M., 2013. Cartopt: a random search method for nonsmooth unconstrained optimization. Computational Optimization and Applications 56 (2), 291–315.
- Rokach and Maimon (2005) Rokach, L., Maimon, O., 2005. Top-down induction of decision trees classifiers-a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 35 (4), 476–487.