Unifying Decision Trees Split Criteria Using Tsallis Entropy

11/25/2015 ∙ by Yisen Wang, et al. ∙ Tsinghua University

The construction of efficient and effective decision trees remains a key topic in machine learning because of their simplicity and flexibility. Many heuristic algorithms have been proposed to construct near-optimal decision trees. ID3, C4.5 and CART are classical decision tree algorithms, and their split criteria are Shannon entropy, Gain Ratio and Gini index, respectively. These split criteria appear to be independent, but they can in fact be unified within a Tsallis entropy framework. Tsallis entropy is a generalization of Shannon entropy and, through an adjustable parameter q, provides a new way to enhance the performance of decision trees. In this paper, a Tsallis Entropy Criterion (TEC) algorithm is proposed that unifies Shannon entropy, Gain Ratio and Gini index and thereby generalizes the split criteria of decision trees. More importantly, we reveal the relations between Tsallis entropy with different values of q and the other split criteria. Experimental results on UCI data sets indicate that the TEC algorithm achieves a statistically significant improvement over the classical algorithms.


I Introduction

The decision tree is a non-parametric supervised learning method used for classification and regression. Although decision trees were among the first machine learning approaches, they remain an actively researched topic: they are not only simple to understand and interpret, but also offer relatively good accuracy, computational efficiency and flexibility. The general idea of decision trees is to predict the labels of unknown input instances by learning simple decision rules inferred from known training instances. Decision trees are most often induced in a top-down manner. A given data set is partitioned into a left and a right subset by a split criterion test on the attributes. The highest-scoring partition, i.e. the one that reduces the average uncertainty the most, is selected, and the data set is split accordingly into two child nodes, growing the tree by making the current node the parent of the two newly created children. This procedure is applied recursively until some stopping condition, e.g. maximum tree depth or minimum leaf size, is reached.

Generally speaking, the split criterion is a fundamental issue in decision tree induction, and a large number of decision tree induction algorithms with different split criteria have been proposed. For example, the Iterative Dichotomiser 3 (ID3) algorithm is based on Shannon entropy [1]; the C4.5 algorithm is based on Gain Ratio, which can be regarded as a normalized Shannon entropy [2]; and the Classification And Regression Tree (CART) algorithm is based on the Gini index [3]. These algorithms appear to be independent, and it is hard to judge which one consistently outperforms the others. This reflects one drawback of such split criteria: they lack adaptability to the data set at hand. Numerous alternatives based on adaptive entropy estimates have been proposed [4, 5], but their statistical entropy estimates are so complex that they sacrifice the simplicity and comprehensibility of decision trees. Above all, to the best of our knowledge, there is no unified framework that combines all the above criteria. In addition, a series of papers have analyzed the importance of the split criterion [6, 7], demonstrating that different split criteria have a substantial influence on the generalization error of the induced decision trees. This motivates our proposed split criterion, which unifies and generalizes the classical ones.

To address the above issue, we propose a Tsallis entropy framework in this paper. Tsallis entropy is a generalization of Shannon entropy with an adjustable parameter q, and it was first introduced into decision trees in prior work [8]. However, [8] only tested the performance of Tsallis entropy in C4.5 for a few fixed values of q; the relation between Tsallis entropy and other split criteria was not explored, and no unified framework was presented. In this paper, we propose a Tsallis entropy based decision tree induction algorithm, called the TEC algorithm, and analyze the correspondence between Tsallis entropy with different values of q and other split criteria. Shannon entropy and Gini index are two specific cases of Tsallis entropy, obtained for q → 1 and q = 2 respectively, while Gain Ratio can be regarded as a normalized Tsallis entropy with q → 1. Tsallis entropy thus provides a new approach to improving the performance of decision trees with a tunable q in a unified framework. Experimental results on UCI data sets indicate that the TEC algorithm achieves a statistically significant improvement over the classical algorithms without losing the strengths of decision trees.

The rest of this paper is organized as follows. Section II presents the background of Tsallis entropy. Section III describes the proposed TEC algorithm. Section IV reports experimental results. Section V summarizes the work.

II Tsallis entropy

Entropy is the measure of disorder in physical systems, or of the amount of information needed to specify the full microstate of a system [9]. In 1948, Shannon adopted entropy in information theory; the resulting Shannon entropy [10] is a measure of the uncertainty associated with a random variable:

H(X) = -\sum_{i=1}^{n} p_i \log p_i    (1)

where X is a random variable that can take the values x_1, x_2, \ldots, x_n and p_i is the corresponding probability of x_i. Shannon entropy is concave and attains its maximum when p_1 = p_2 = \cdots = p_n = 1/n.
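As a concrete illustration (ours, not part of the paper), a minimal Python sketch of Eq. (1); the function name and the example distributions are arbitrary:

import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum_i p_i log p_i (natural log), Eq. (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # the convention 0 log 0 = 0
    return float(-np.sum(p * np.log(p)))

# The uniform distribution attains the maximum, log(n)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 = log(4)
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # smaller, less uncertainty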

There are two typical distribution families observed in the macroscopic world: the exponential family and the power-law heavy-tailed family. However, we cannot characterize power-law heavy-tailed distributions by maximizing Shannon entropy subject to the usual mean and variance constraints. The reason is that Shannon entropy implicitly assumes a certain trade-off between the contributions from the tails and from the main mass of the distribution [8]. It is worthwhile to control this trade-off explicitly in order to characterize both distribution families. Entropy measures that depend on powers of the probabilities, p_i^q, provide such control, and several parameterized entropies have therefore been proposed. A well-known generalization of this kind is Tsallis entropy [11], which extends entropy to so-called non-extensive systems through an adjustable parameter q. Tsallis entropy can describe physical systems with complex behaviours such as long-range and long-memory interactions [12].

Tsallis entropy is defined by

S_q(X) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{n} p_i^q \right)    (2)

which converges to Shannon entropy in the limit q → 1,

\lim_{q \to 1} S_q(X) = -\sum_{i=1}^{n} p_i \log p_i = H(X)    (3)

The relation to Shannon entropy can be made clearer by rewriting the definition in the form

S_q(X) = -\sum_{i=1}^{n} p_i^q \ln_q p_i    (4)

where

\ln_q x = \frac{x^{1-q} - 1}{1 - q}    (5)

is called the q-logarithmic function, and \ln_q x → \ln x when q → 1.

Just as the exponential function is the inverse of the logarithmic function, there is a corresponding q-exponential function that inverts the q-logarithmic function:

e_q^x = \left[ 1 + (1-q)\, x \right]^{\frac{1}{1-q}}    (6)

For q < 0, Tsallis entropy is convex. For q = 0, Tsallis entropy is neither convex nor concave. For q > 0, Tsallis entropy is concave and satisfies properties similar to those of Shannon entropy [13]. For instance, for q > 0, S_q(X) ≥ 0, and S_q(X) is maximal at the uniform distribution.
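The definitions above can be illustrated with a short Python sketch (ours): it implements Eq. (2) and the q-logarithmic form of Eqs. (4)-(5), checks that the two forms agree, and shows that S_q approaches Shannon entropy as q → 1. The function names and the example distribution are our choices:

import numpy as np

def q_log(x, q):
    """q-logarithm ln_q(x) = (x^(1-q) - 1) / (1 - q); ordinary log when q -> 1, Eq. (5)."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if np.isclose(q, 1.0) else (x**(1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """S_q(X) = (1 - sum_i p_i^q) / (q - 1), Eq. (2); Shannon entropy in the limit q -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

p, q = [0.5, 0.3, 0.2], 1.5
# Eq. (2) and the q-log form of Eq. (4) agree: S_q = -sum_i p_i^q ln_q(p_i)
print(tsallis_entropy(p, q), float(-np.sum(np.asarray(p)**q * q_log(p, q))))
# As q -> 1, S_q approaches Shannon entropy, Eq. (3)
print(tsallis_entropy(p, 1.0001), tsallis_entropy(p, 1.0))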

Additivity is a crucial difference between the fundamental properties of Shannon entropy and Tsallis entropy. For two independent random variables X and Y, Shannon entropy has the additivity property

H(X, Y) = H(X) + H(Y)    (7)

whereas Tsallis entropy has the pseudo-additivity (also called q-additivity) property

S_q(X, Y) = S_q(X) + S_q(Y) + (1 - q)\, S_q(X)\, S_q(Y)    (8)
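The pseudo-additivity of Eq. (8) is easy to check numerically for a pair of independent variables; the following sketch (ours) does so on an arbitrary toy example:

import numpy as np

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

q = 1.7
px = np.array([0.6, 0.4])            # distribution of X
py = np.array([0.2, 0.3, 0.5])       # distribution of Y, independent of X
pxy = np.outer(px, py).ravel()       # joint distribution p(x, y) = p(x) p(y)

lhs = tsallis_entropy(pxy, q)
rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
       + (1.0 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
print(np.isclose(lhs, rhs))          # True: Eq. (8) holds for independent X and Y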

Besides, Tsallis conditional entropy, Tsallis joint entropy and Tsallis mutual information can be derived analogously to their Shannon counterparts. For the conditional probability p(y|x) and the joint probability p(x, y), Tsallis conditional entropy and Tsallis joint entropy [14] are defined by

S_q(Y|X) = -\sum_{x, y} p(x, y)^q \ln_q p(y|x)    (9)
S_q(X, Y) = -\sum_{x, y} p(x, y)^q \ln_q p(x, y)    (10)

It is remarkable that Eq.(9) can be easily rewritten as

S_q(Y|X) = \sum_{x} p(x)^q \, S_q(Y|X=x)    (11)

The relation between the conditional entropy and the joint entropy is given by

S_q(X, Y) = S_q(X) + S_q(Y|X)    (12)

Tsallis mutual information [15] is defined as the difference between Tsallis entropy and Tsallis conditional entropy:

I_q(X; Y) = S_q(Y) - S_q(Y|X)    (13)

and the symmetric chain rule of Tsallis mutual information for random variables X and Y holds:

I_q(X; Y) = S_q(Y) - S_q(Y|X) = S_q(X) - S_q(X|Y)    (14)

The relation among the conditional entropy, joint entropy and mutual information can be derived from Eq.(12) and Eq.(13):

I_q(X; Y) = S_q(X) + S_q(Y) - S_q(X, Y)    (15)
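As a sanity check on Eqs. (9)-(15), the following sketch (ours) builds a small joint distribution and verifies the decomposition of Eq. (12) and the relation of Eq. (15); the joint distribution is arbitrary:

import numpy as np

def S(p, q):
    """Tsallis entropy of a probability vector (or matrix, flattened)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

q = 2.5
pxy = np.array([[0.10, 0.25, 0.05],  # joint p(x, y), rows indexed by x
                [0.30, 0.10, 0.20]])
px = pxy.sum(axis=1)                 # marginal p(x)
py = pxy.sum(axis=0)                 # marginal p(y)

# Eq. (11): S_q(Y|X) = sum_x p(x)^q S_q(Y | X = x)
S_cond = sum(px[i]**q * S(pxy[i] / px[i], q) for i in range(len(px)))

print(np.isclose(S(pxy, q), S(px, q) + S_cond))          # Eq. (12)
I_q = S(py, q) - S_cond                                   # Eq. (13)
print(np.isclose(I_q, S(px, q) + S(py, q) - S(pxy, q)))   # Eq. (15)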

In summary, Tsallis entropy generalizes Shannon entropy with an adjustable parameter q and has a wider range of applications.

III Tsallis Entropy Criterion (TEC) algorithm

One key issue in decision tree induction is the split criterion. At every step, the decision tree chooses the pair of attribute and cutting point that yields the maximal impurity decrease to split the data and grow the tree. Therefore, the attribute chosen for splitting significantly affects the construction of the decision tree and, in turn, its classification performance.

III-A Tree construction

Given a data set D = {(x_i, y_i)} with attributes A = {A_1, A_2, \ldots, A_M} and class label Y, for each tree node we search over every possible pair of attribute and cutting point to choose the optimal one as follows: for an attribute A_j,

G(A_j, T) = I(D) - \frac{|D_L|}{|D|} I(D_L) - \frac{|D_R|}{|D|} I(D_R)    (16)

Here T is a candidate cutting point for attribute A_j, D is the data set belonging to the node to be partitioned, and D_L, D_R are the two child data sets that would be created if D were partitioned at T. The function I(·) is the impurity criterion, e.g. Tsallis entropy, computed over the labels of the data that fall into the node. The pair of attribute and cutting point (A*, T*) that maximizes G(A_j, T) is chosen to grow the tree.
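For illustration (not the authors' code), the following minimal Python sketch evaluates the impurity decrease of Eq. (16) for a single attribute and candidate cutting point, with Tsallis entropy as the impurity I(·); the toy attribute values and labels are made up:

import numpy as np

def tsallis_impurity(y, q):
    """Tsallis impurity of a non-empty label vector: S_q of the empirical class frequencies."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

def split_gain(x, y, t, q):
    """Eq. (16): I(D) - |D_L|/|D| I(D_L) - |D_R|/|D| I(D_R) for threshold t on attribute x."""
    left, right = y[x <= t], y[x > t]
    n = len(y)
    return (tsallis_impurity(y, q)
            - len(left) / n * tsallis_impurity(left, q)
            - len(right) / n * tsallis_impurity(right, q))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # one attribute (toy data)
y = np.array([0,   0,   0,   1,   1,   1])     # class labels
print(split_gain(x, y, t=3.5, q=1.5))          # the perfect split gives the largest gain
print(split_gain(x, y, t=1.5, q=1.5))          # a worse split gives a smaller gain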

The above procedure is applied recursively until some stopping condition is reached. The stopping conditions consist of three principles: (i) all instances in a subset belong to the same class; (ii) no attributes are left for selection; (iii) the cardinality of a subset is not greater than a predefined threshold.

III-B Prediction

Once the tree has been trained on the data D as a classifier, it can be used to predict the labels of new, unlabeled instances.

The decision tree makes predictions in a majority-vote manner. For each class c,

P(c \mid x) = \frac{1}{|\mathrm{leaf}(x)|} \sum_{(x_i, y_i) \in \mathrm{leaf}(x)} \mathbb{1}(y_i = c)    (17)

where \mathrm{leaf}(x) denotes the leaf containing x, and |\mathrm{leaf}(x)| denotes the number of training instances located in that leaf. The tree prediction is then the class that maximizes this value:

\hat{y}(x) = \arg\max_{c} P(c \mid x)    (18)
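A minimal sketch (ours) of the majority-vote prediction of Eqs. (17)-(18), assuming the leaf reached by an instance stores the labels of the training instances that fell into it:

from collections import Counter

def leaf_predict(leaf_labels):
    """Return the majority class among the training labels stored in the leaf, Eq. (18)."""
    counts = Counter(leaf_labels)        # class frequencies in the leaf, i.e. Eq. (17) up to 1/|leaf|
    return counts.most_common(1)[0][0]

print(leaf_predict([1, 1, 0, 1, 2]))     # -> 1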

III-C TEC algorithm

Here we summarize the proposed Tsallis Entropy Criterion (TEC) algorithm in pseudo-code in Algorithm 1. Compared with the classical decision tree induction algorithms, the only difference is the split criterion: Tsallis entropy replaces the classical split criteria, i.e. Shannon entropy, Gain Ratio and Gini index. In fact, as shown in the following subsection, Tsallis entropy unifies Shannon entropy, Gain Ratio and Gini index under different values of q.

1:  Input: Data D, Attributes A, Class Y
2:  Output: A decision tree
3:  while not satisfying stop condition do
4:      for each attribute A_j ∈ A do
5:          T_j ← the set of candidate cutting points of A_j on D
6:          // T_j is the candidate cutting point set of attribute A_j
7:          // T is one cutting point in the set T_j
8:          for each T ∈ T_j do
9:              D_L ← {(x, y) ∈ D : x_{A_j} ≤ T}
10:             D_R ← {(x, y) ∈ D : x_{A_j} > T}
11:             // (x, y) is one instance in Data D
12:             // D_L, D_R are the two child data sets
13:             Compute G(A_j, T) according to (16)
14:         end for
15:     end for
16:     A* ← argmax_{A_j ∈ A} max_{T ∈ T_j} G(A_j, T)
17:     T* ← argmax_{T ∈ T_{A*}} G(A*, T)
18:     // (A*, T*) is the best pair of split attribute and cutting point
19:     Grow the tree by splitting D into D_L and D_R at (A*, T*)
20:     Go to line 3 for D_L and D_R
21:     // Recursively repeat the procedure
22: end while
23: Return Decision tree
24: // Tree is built by Nodes from the root to the leaves
Algorithm 1 TEC algorithm
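For readers who prefer running code, the following compact Python sketch (our own illustration, not the authors' implementation) mirrors Algorithm 1 for numeric attributes: it greedily grows a binary tree using the Tsallis entropy gain of Eq. (16) and predicts by the majority vote of Eqs. (17)-(18). The class name, the min_leaf threshold and the toy data are our choices:

import numpy as np
from collections import Counter

def tsallis_impurity(y, q):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    if np.isclose(q, 1.0):                       # Shannon entropy in the limit q -> 1
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

class TECTree:
    """Decision tree with the Tsallis entropy split criterion, parameterized by q."""
    def __init__(self, q=1.5, min_leaf=5):
        self.q, self.min_leaf = q, min_leaf

    def fit(self, X, y):
        self.root_ = self._grow(np.asarray(X, float), np.asarray(y))
        return self

    def _grow(self, X, y):
        # Stopping conditions: pure node or too few instances -> leaf with majority class
        if len(np.unique(y)) == 1 or len(y) <= self.min_leaf:
            return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}
        parent = tsallis_impurity(y, self.q)
        best = None                                          # (gain, attribute index, threshold)
        for j in range(X.shape[1]):                          # for each attribute A_j
            values = np.unique(X[:, j])
            for t in (values[:-1] + values[1:]) / 2.0:       # candidate cutting points (midpoints)
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                gain = (parent
                        - len(left) / len(y) * tsallis_impurity(left, self.q)
                        - len(right) / len(y) * tsallis_impurity(right, self.q))
                if best is None or gain > best[0]:
                    best = (gain, j, t)
        if best is None or best[0] <= 0:                     # no useful split -> leaf
            return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}
        _, j, t = best
        mask = X[:, j] <= t
        return {"attr": j, "cut": t,
                "left": self._grow(X[mask], y[mask]),
                "right": self._grow(X[~mask], y[~mask])}

    def predict(self, X):
        return np.array([self._predict_one(x, self.root_) for x in np.asarray(X, float)])

    def _predict_one(self, x, node):
        while "leaf" not in node:
            node = node["left"] if x[node["attr"]] <= node["cut"] else node["right"]
        return node["leaf"]

# Toy usage: two Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
tree = TECTree(q=2.0, min_leaf=5).fit(X, y)
print((tree.predict(X) == y).mean())                 # training accuracy of the sketch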

III-D Relations to other criteria

As described above, Tsallis entropy unifies Shannon entropy, Gain Ratio and Gini index in one framework. In the following, we reveal the relations between Tsallis entropy and the other split criteria.

Tsallis entropy converges to Shannon entropy for q → 1:

\lim_{q \to 1} S_q(X) = \lim_{q \to 1} \frac{1 - \sum_{i=1}^{n} p_i^q}{q - 1} = -\sum_{i=1}^{n} p_i \log p_i = H(X)    (19)

Besides, the Gini index is exactly the specific case of Tsallis entropy with q = 2:

S_2(X) = 1 - \sum_{i=1}^{n} p_i^2 = \mathrm{Gini}(X)    (20)

As for Gain Ratio, which adds a normalization factor to Information Gain, it can be seen as the normalized Information Gain. According to Eq.(16), we obtain:

\mathrm{GainRatio}(A_j, T) = \frac{ H(D) - \frac{|D_L|}{|D|} H(D_L) - \frac{|D_R|}{|D|} H(D_R) }{ -\frac{|D_L|}{|D|} \log \frac{|D_L|}{|D|} - \frac{|D_R|}{|D|} \log \frac{|D_R|}{|D|} }    (21)

where H denotes Shannon entropy, the numerator is Eq.(16) with I(·) = H, and the denominator is the split information of the partition. If H is replaced by Tsallis entropy, Gain Ratio is generalized to Tsallis Gain Ratio. Thus, Gain Ratio is also covered by Tsallis entropy with a normalization factor (Tsallis Gain Ratio) and q → 1.

In summary, Tsallis entropy unifies three kinds of split criteria, namely Shannon entropy, Gain Ratio and Gini index, and thereby generalizes the split criterion of decision trees. To the best of our knowledge, this is the first time that common split criteria have been unified in a parametric framework, and the first time that the correspondence between Tsallis entropy with different values of q and other split criteria has been revealed. The optimal q for Tsallis entropy is obtained by cross-validation and is usually not equal to 1 or 2, which suggests better performance than the traditional split criteria. Although the optimal q may differ across data sets, it is associated with the properties of the data set; in other words, the parameter q gives the TEC algorithm adaptability and flexibility. Tsallis entropy thus provides a new approach to improving decision trees' performance with a tunable q in a unified framework. In the Experiments section, we will see that the TEC algorithm achieves higher accuracy than the classical algorithms with an appropriate q.
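These correspondences are easy to confirm numerically; below is a short check (ours) that Tsallis entropy recovers Shannon entropy as q → 1 (Eq. (19)) and the Gini index at q = 2 (Eq. (20)), on an arbitrary distribution:

import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])

def tsallis(p, q):
    return (1.0 - np.sum(p**q)) / (q - 1.0)

shannon = -np.sum(p * np.log(p))
gini = 1.0 - np.sum(p**2)

print(abs(tsallis(p, 1.0 + 1e-6) - shannon) < 1e-5)   # Eq. (19): q -> 1 recovers Shannon entropy
print(np.isclose(tsallis(p, 2.0), gini))              # Eq. (20): q = 2 recovers the Gini index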

IV Experiments

As illustrated in Section III, the TEC algorithm is based on Tsallis entropy with an adjustable parameter q and comprises two split criteria: Tsallis entropy and Tsallis Gain Ratio. The Tsallis entropy split criterion degenerates to Shannon entropy and the Gini index for q → 1 and q = 2, respectively. With respect to Gain Ratio, Tsallis Gain Ratio (the normalized Tsallis entropy) also degenerates to Gain Ratio for q → 1.

IV-A Evaluation Metric

In order to quantitatively compare the trees obtained by different methods, we use accuracy to evaluate the effectiveness of a tree and the total number of tree nodes to measure its complexity.

IV-B Data Set Description

As shown in Table I, the UCI data sets [16] are adopted to evaluate the proposed approaches. These data sets consist of three types, namely numeric, categorical and mixed data sets. Also, these data sets include two kinds of classification problems, binary and multi-class classification.

Data Set    Type          No. of instances   No. of features   No. of classes
Yeast       numeric       1484               8                 10
Glass       numeric       214                10                7
Vehicle     numeric       946                18                4
Wine        numeric       178                13                3
Haberman    numeric       306                3                 2
Car         categorical   1728               6                 4
Scale       categorical   625                4                 3
Hayes       categorical   160                5                 3
Monks       categorical   432                7                 2
Abalone     mixed         4139               8                 18
Cmc         mixed         1473               9                 3
TABLE I: Data sets from UCI

IV-C Experiment Setup

The decision trees with different split criteria, i.e. Gain Ratio, Shannon entropy, Gini index, Tsallis entropy and Tsallis Gain Ratio, are implemented in Python. We refer to the CART implementation in scikit-learn [17] and the C4.5 implementation (J48) in Weka [18]. For each data set, we first randomly partition the data into a training set and a held-out test set. Then, on the training set, we run a grid search with 10-fold cross-validation to determine the value of q for Tsallis entropy and Tsallis Gain Ratio. The optimal q for Tsallis entropy and for Tsallis Gain Ratio may differ, but for a fair comparison we use the same q, namely the optimal q for Tsallis entropy. Besides, the minimal leaf size is fixed to avoid overfitting. After parameter selection, the chosen parameters are kept fixed, a decision tree is trained on the training data without post-pruning, and the tree is evaluated on the test data. The whole procedure, from the training-test partition to the evaluation, is repeated 10 times to reduce the influence of randomness.
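The parameter-selection step can be sketched as follows (our illustration, not the authors' code). To keep the sketch self-contained and runnable, a one-level Tsallis-gain stump stands in for the full TEC tree, scikit-learn's copy of the UCI Wine data stands in for the data sets of Table I, and the candidate grid of q values is our own choice; in the actual experiments the full TEC tree of Algorithm 1 would be cross-validated in the same way:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold

def tsallis(y, q):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log(p)) if np.isclose(q, 1) else (1 - np.sum(p**q)) / (q - 1)

def fit_stump(X, y, q):
    """One-level tree: best (attribute, threshold) by the Tsallis gain of Eq. (16)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            l, r = y[X[:, j] <= t], y[X[:, j] > t]
            gain = tsallis(y, q) - len(l)/len(y)*tsallis(l, q) - len(r)/len(y)*tsallis(r, q)
            if best is None or gain > best[0]:
                best = (gain, j, t, np.bincount(l).argmax(), np.bincount(r).argmax())
    return best[1:]

def predict_stump(stump, X):
    j, t, cl, cr = stump
    return np.where(X[:, j] <= t, cl, cr)

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for q in [0.5, 1.0, 1.5, 2.0, 3.0]:               # candidate grid for q (our choice)
    scores = []
    for tr, va in cv.split(X, y):
        stump = fit_stump(X[tr], y[tr], q)
        scores.append((predict_stump(stump, X[va]) == y[va]).mean())
    print(q, round(float(np.mean(scores)), 3))    # pick the q with the best mean CV accuracy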

IV-D Results

Figure 1 gives an intuitive picture of the influence of different values of the parameter q in Tsallis entropy on the Glass data set. Figure 1(a) illustrates that the accuracy is sensitive to the change of q, with the highest accuracy obtained at an intermediate value of q. Figure 1(b) shows that the tree complexity responds to the change of q differently from the accuracy, and the lowest tree complexity is achieved at a value of q that need not coincide with the accuracy-optimal one. It should be noted that different strategies can be used to choose q depending on the purpose, e.g. highest accuracy, lowest complexity, or a trade-off between the two, which is another reflection of the TEC algorithm's adaptability to data sets. In this paper, we choose q by the highest-accuracy principle.

Table II reports the accuracy and complexity results of the different criteria on the different data sets. As expected, TEC outperforms ID3, CART and C4.5, which is consistent with Tsallis entropy being a generalization of Shannon entropy, Gini index and Gain Ratio. Between the two variants of the TEC algorithm, i.e. Tsallis entropy and Tsallis Gain Ratio, neither dominates the other: the results indicate that Tsallis entropy favours high accuracy while Tsallis Gain Ratio favours low complexity. The reason lies in the normalization factor, which influences the tree structure to some extent. In addition, compared with Shannon entropy and Gini index, Tsallis entropy achieves better accuracy and complexity, and Tsallis Gain Ratio also obtains better results than Gain Ratio. Three Wilcoxon signed-rank tests [19] on accuracy (Tsallis entropy vs. Shannon entropy, Tsallis entropy vs. Gini index, Tsallis Gain Ratio vs. Gain Ratio) all reject the null hypothesis of equal performance. The results show that the TEC algorithm with an appropriate q achieves, on average, a statistically significant improvement in accuracy while maintaining a lower complexity.

In terms of the optimal value of q, we observe a rough trend in Table II: the larger the number of classes, the smaller the selected q tends to be. For example, for the numeric data sets from Yeast to Haberman, q increases while the number of classes decreases (with Vehicle as an exception). In this paper we choose the optimal value of q by cross-validation, but we conjecture that the value of q is associated with the properties of the data set. For instance, on the Car data set all the algorithms give almost the same results, which suggests that this data set is not sensitive to the parameter q. The relation between q and the properties of data sets will be investigated in future work.

Fig. 1: Influence of the parameter q on (a) classification accuracy and (b) tree complexity for the Glass data set.
Data Set    Shannon entropy (ID3)   Gini index (CART)     Gain Ratio (C4.5)     Tsallis entropy (TEC)           Tsallis Gain Ratio (TEC)
            Acc (%)   No. of nodes  Acc (%)  No. of nodes Acc (%)  No. of nodes Acc (%)  No. of nodes  q        Acc (%)  No. of nodes
Yeast       52.8      199           51.8     196.6        52.1     326.2        56.9     195.8         1.4      51.2     197.1
Glass       51.2      52.4          52.6     53.8         44.2     52           60.6     52.6          2.6      53.1     51.5
Vehicle     71.7      103           70.2     100          72.3     147.2        73.8     111.0         0.6      73.4     135.7
Wine        92.9      12.0          90.0     12.0         92.4     9.4          95.9     9.6           3.1      92.9     9.2
Haberman    70.3      32.2          70.3     33.0         72.8     33.0         74.2     33.2          7.1      74.8     33.0
Car         98.2      106.4         98.1     106.8        98.5     106.5        98.3     106.2         0.8      98.4     106.6
Scale       75.9      97.6          76.1     97.2         74.5     77.0         78.2     93.1          3.1      78.5     77.0
Hayes       81.5      28.8          80.0     25.3         79.2     19.6         82.3     19.5          8.6      81.5     19.2
Monks       51.9      89.0          52.1     88.6         52.9     88.0         57.3     89.6          8.9      54.9     88.0
Abalone     25.4      89.2          25.0     85.8         20.3     84.3         26.8     86.2          0.8      25.7     85.1
Cmc         49.1      267.0         47.4     264.0        45.7     242.8        52.0     264.2         1.2      47.8     242.1
TABLE II: Comparison of different decision tree split criteria (the q column gives the value of the Tsallis parameter selected by cross-validation)

V Conclusions

In this paper, we present and evaluate Tsallis entropy for enhancing decision trees with respect to a fundamental issue, the split criterion. We unify the classical split criteria in a parametric framework and propose the TEC algorithm, whose Tsallis entropy split criterion generalizes Shannon entropy, Gain Ratio and Gini index through an adjustable parameter q. Most importantly, we reveal the relations between Tsallis entropy with different values of q and the other split criteria. Experimental results indicate that, with an appropriate q, the TEC algorithm achieves a statistically significant improvement in accuracy on average. Nevertheless, the approach has limitations to be addressed in the future, such as a principled method for estimating the parameter q in place of the current cross-validation procedure. Furthermore, Tsallis entropy also has potential applications beyond decision trees, for instance in Random Forests and Bayesian networks, which we leave for future work.

Acknowledgments

This research is supported in part by the 973 Program of China (No. 2012CB315803), the National Natural Science Foundation of China (No. 61371078, 61375054), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20130002110051).

References

  • [1] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
  • [2] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
  • [3] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees.   CRC press, 1984.
  • [4] S. Nowozin, “Improved information gain estimates for decision tree induction,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12).   ACM, 2012, pp. 297–304.
  • [5] M. Serrurier and H. Prade, “Entropy evaluation based on confidence intervals of frequency estimates: Application to the learning of decision trees,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15). ACM, 2015, pp. 1576–1584.
  • [6] W. Buntine and T. Niblett, “A further comparison of splitting rules for decision-tree induction,” Machine Learning, vol. 8, no. 1, pp. 75–85, 1992.
  • [7] W. Z. Liu and A. P. White, “The importance of attribute selection measures in decision tree induction,” Machine Learning, vol. 15, no. 1, pp. 25–41, 1994.
  • [8] T. Maszczyk and W. Duch, “Comparison of Shannon, Renyi and Tsallis entropy used in decision trees,” in Proceedings of the 17th International Conference on Artificial Intelligence and Soft Computing (ICAISC-08). Springer, 2008, pp. 643–651.
  • [9] R. Frigg and C. Werndl, “Entropy: a guide for the perplexed,” in Probabilities in Physics, 2011.
  • [10] C. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [11] C. Tsallis, “Possible generalization of Boltzmann-Gibbs statistics,” Journal of Statistical Physics, vol. 52, no. 1-2, pp. 479–487, 1988.
  • [12] C. Tsallis, Introduction to nonextensive statistical mechanics.   Springer, 2009.
  • [13] C. Tsallis, “Generalizing what we learnt: Nonextensive statistical mechanics,” in Introduction to Nonextensive Statistical Mechanics.   Springer, 2009, pp. 37–106.
  • [14] S. Abe and A. Rajagopal, “Nonadditive conditional entropy and its significance for local realism,” Physica A: Statistical Mechanics and its Applications, vol. 289, no. 1, pp. 157–164, 2001.
  • [15] T. Yamano, “Information theory based on nonadditive information content,” Physical Review E, vol. 63, no. 4, p. 046105, 2001.
  • [16] M. Lichman, “UCI machine learning repository,” http://archive.ics.uci.edu/ml, 2013.
  • [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
  • [19] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.