I Introduction
The decision tree is a nonparametric supervised learning method used for classification and regression. Although decision trees were among the first machine learning approaches, they remain an actively researched topic. They are not only simple to understand and interpret, but also offer relatively good results, computational efficiency and flexibility. The general idea of decision trees is to predict unknown input instances by learning simple decision rules inferred from known training instances. Decision trees are most often induced in the following top-down manner. A given data set is partitioned into a left and a right subset by a split criterion test on attributes. The highest-scoring partition, i.e. the one that most reduces the average uncertainty, is selected and the data set is partitioned accordingly into two child nodes, growing the tree by making the node the parent of the two newly created child nodes. This procedure is applied recursively until some stopping condition, e.g. maximum tree depth or minimum leaf size, is reached.
Generally speaking, the split criterion is a fundamental issue in decision tree induction. A large number of decision tree induction algorithms with different split criteria have been proposed. For example, the Iterative Dichotomiser 3 (ID3) algorithm is based on Shannon entropy [1]; the C4.5 algorithm is based on Gain Ratio, which is considered a normalized Shannon entropy [2]; while the Classification And Regression Tree (CART) algorithm is based on the Gini index [3]. These algorithms seem to be independent, and it is hard to judge which algorithm always outperforms the others. This reflects one drawback of such split criteria: they lack adaptability to data sets. Numerous alternatives have been proposed for adaptive entropy estimation [4, 5], but their statistical entropy estimates are so complex that they lose the simplicity and comprehensibility of decision trees. Most of all, to the best of our knowledge, there is no unified framework combining all the above criteria. In addition, a series of papers have analyzed the importance of the split criterion [6, 7]. They demonstrated that different split criteria have a substantial influence on the generalization error of the induced decision trees. This is the inspiration for our proposed split criterion, which unifies and generalizes the classical split criteria.
To address the above issue, we propose a Tsallis entropy framework in this paper. Tsallis entropy is a generalization of Shannon entropy with an adjustable parameter $q$ and was first introduced into decision trees in prior work [8]. However, [8] only tested the performance of Tsallis entropy in C4.5 with some given values of $q$; the relation between Tsallis entropy and other split criteria was not explored, and a unified framework was not presented. In this paper, we propose a Tsallis entropy based decision tree induction algorithm, called the TEC algorithm, and analyze the correspondence between Tsallis entropy with different $q$ and other split criteria. Shannon entropy and Gini index are just two specific cases of Tsallis entropy, with $q \to 1$ and $q = 2$ respectively, while Gain Ratio can be considered a normalized Tsallis entropy with $q \to 1$. Tsallis entropy thus provides a new approach to improving the performance of decision trees with a tunable $q$ in a unified framework. Experimental results on UCI data sets indicate that the TEC algorithm achieves statistically significant improvement over the classical algorithms without losing the strengths of decision trees.
The rest of this paper is organized as follows. Section II presents the background of Tsallis entropy. Section III outlines our proposed TEC algorithm. Section IV presents experimental results. Section V summarizes the work.
II Tsallis entropy
Entropy is a measure of disorder in physical systems, or of the amount of information needed to specify the full microstate of a system [9]. In 1948, Shannon brought entropy into information theory as Shannon entropy [10], a measure of the uncertainty associated with a random variable:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) \tag{1}$$
where $X$ is a random variable that can take values $\{x_1, \ldots, x_n\}$ and $p(x_i)$ is the probability of $x_i$. Shannon entropy is concave and attains its maximum when $p(x_1) = \cdots = p(x_n) = 1/n$.
There are two typical families of distributions observed in the macroscopic world: the exponential family and the power-law heavy-tailed family. However, a power-law heavy-tailed distribution cannot be characterized by maximizing Shannon entropy subject to the usual constraints on mean and variance. The reason is that Shannon entropy implicitly assumes a certain trade-off between the contributions from the tails and from the main mass of the distribution [8]. It is therefore worthwhile to control this trade-off explicitly in order to characterize both families. Entropy measures that depend on powers of probability, $\sum_{i} p(x_i)^q$, can provide such control, and several parameterized entropies have been proposed. A well-known generalization is Tsallis entropy [11], which extends the concept to so-called non-extensive systems using an adjustable parameter $q$. Tsallis entropy can explain physical systems with complex behaviours such as long-range and long-memory interactions [12].
Tsallis entropy is defined by:
$$S_q(X) = \frac{1}{1-q}\left(\sum_{i=1}^{n} p(x_i)^q - 1\right), \quad q \in \mathbb{R} \tag{2}$$
which converges to Shannon entropy in the limit $q \to 1$,
$$\lim_{q \to 1} S_q(X) = \lim_{q \to 1} \frac{1}{1-q}\left(\sum_{i=1}^{n} p(x_i)^q - 1\right) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) = H(X) \tag{3}$$
The relation to Shannon entropy can be made clearer by rewriting the definition in the form:
$$S_q(X) = -\sum_{i=1}^{n} p(x_i)^q \ln_q p(x_i) \tag{4}$$
where
$$\ln_q(x) = \frac{x^{1-q} - 1}{1-q} \tag{5}$$
is called the $q$-logarithmic function. When $q \to 1$, $\ln_q(x) \to \ln(x)$.
Just as the exponential function is the inverse of the logarithmic function, there is a $q$-exponential function corresponding to the $q$-logarithmic function:
$$e_q^{x} = \left[1 + (1-q)x\right]^{\frac{1}{1-q}} \tag{6}$$
For $q < 0$, Tsallis entropy is convex. For $q = 0$, Tsallis entropy is nonconvex and nonconcave. For $q > 0$, Tsallis entropy is concave and satisfies properties similar to those of Shannon entropy [13]. For instance, for $q > 0$, $S_q(X) \geq 0$, and $S_q(X)$ is maximal at the uniform distribution.
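The definitions above translate directly into code. The following is a minimal sketch, assuming natural logarithms and the convention that zero-probability outcomes are dropped; the function names `q_log`, `q_exp` and `tsallis_entropy` are illustrative and not taken from the paper.

```python
import math


def q_log(x, q):
    """q-logarithm ln_q(x) = (x^(1-q) - 1) / (1 - q); the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_exp(x, q):
    """q-exponential [1 + (1-q) x]^(1/(1-q)); the ordinary exponential at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (sum_i p_i^q - 1) / (1 - q).

    Zero probabilities are dropped (an assumed convention), and q = 1 is
    handled through its limit, the Shannon entropy in nats.
    """
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    print(tsallis_entropy(p, 2.0))         # equals 1 - sum(p_i^2)
    print(tsallis_entropy(p, 1.0 + 1e-6))  # close to the Shannon entropy
    print(tsallis_entropy(p, 1.0))
```

Treating $q = 1$ as a special case avoids the division by zero in the closed form; values of $q$ close to 1 approach the same number.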
Additivity is a fundamental property on which Shannon entropy and Tsallis entropy differ. For two independent random variables $X$ and $Y$, Shannon entropy has the additivity property:
$$H(X, Y) = H(X) + H(Y) \tag{7}$$
whereas Tsallis entropy has the pseudo-additivity (also called $q$-additivity) property:
$$S_q(X, Y) = S_q(X) + S_q(Y) + (1-q)\, S_q(X)\, S_q(Y) \tag{8}$$
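Pseudo-additivity is easy to check numerically. The sketch below builds the joint distribution of two independent variables as the product of their marginals and measures the gap between the two sides of the pseudo-additivity relation; the helper names are illustrative assumptions.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy; Shannon entropy (nats) in the limit q = 1."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def independent_joint(px, py):
    """Flattened joint distribution of two independent variables."""
    return [a * b for a in px for b in py]


def pseudo_additivity_gap(px, py, q):
    """S_q(X,Y) - [S_q(X) + S_q(Y) + (1-q) S_q(X) S_q(Y)]; zero up to rounding."""
    s_x = tsallis_entropy(px, q)
    s_y = tsallis_entropy(py, q)
    s_xy = tsallis_entropy(independent_joint(px, py), q)
    return s_xy - (s_x + s_y + (1.0 - q) * s_x * s_y)


if __name__ == "__main__":
    for q in (0.5, 1.0, 2.0):
        print(q, pseudo_additivity_gap([0.5, 0.5], [0.2, 0.8], q))
```

At $q = 1$ the cross term vanishes and the check reduces to ordinary Shannon additivity.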
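Tsallis conditional entropy, Tsallis joint entropy and Tsallis mutual information are derived similarly to their Shannon counterparts. For the conditional probability $p(x_i \mid y_j)$ and the joint probability $p(x_i, y_j)$, Tsallis conditional entropy and Tsallis joint entropy [14] are defined by:
$$S_q(X \mid Y) = -\sum_{i,j} p(x_i, y_j)^q \ln_q p(x_i \mid y_j) \tag{9}$$
$$S_q(X, Y) = -\sum_{i,j} p(x_i, y_j)^q \ln_q p(x_i, y_j) \tag{10}$$
Note that Eq. (9) can easily be rewritten as
$$S_q(X \mid Y) = \sum_{j} p(y_j)^q\, S_q(X \mid y_j) \tag{11}$$
The relation between the conditional entropy and the joint entropy is given by:
$$S_q(X, Y) = S_q(X) + S_q(Y \mid X) \tag{12}$$
Tsallis mutual information [15] is defined as the difference between Tsallis entropy and Tsallis conditional entropy:
$$I_q(X; Y) = S_q(X) - S_q(X \mid Y) \tag{13}$$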
and the chain rule of Tsallis mutual information for random variables $X$, $Y$ and $Z$ holds:
$$I_q(X; Y, Z) = I_q(X; Z) + I_q(X; Y \mid Z) \tag{14}$$
The relation among the conditional entropy, joint entropy and mutual information can be derived from Eq. (12) and Eq. (13):
$$I_q(X; Y) = S_q(X) + S_q(Y) - S_q(X, Y) \tag{15}$$
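The conditional, joint and mutual quantities can be computed from a joint probability table and checked against each other. This is a minimal sketch, assuming the joint distribution is given as a nested list `joint[i][j]` $= p(x_i, y_j)$; all function names are illustrative.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy; Shannon entropy (nats) in the limit q = 1."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def marginals(joint):
    """Return (p(x_i), p(y_j)) from joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def joint_entropy(joint, q):
    """S_q(X, Y) over all cells of the joint table."""
    return tsallis_entropy([p for row in joint for p in row], q)


def conditional_entropy_x_given_y(joint, q):
    """S_q(X|Y) = sum_j p(y_j)^q * S_q(X | Y = y_j)."""
    _, py = marginals(joint)
    total = 0.0
    for j, p_yj in enumerate(py):
        if p_yj > 0:
            conditional = [row[j] / p_yj for row in joint]
            total += p_yj ** q * tsallis_entropy(conditional, q)
    return total


def mutual_information(joint, q):
    """I_q(X;Y) = S_q(X) - S_q(X|Y)."""
    px, _ = marginals(joint)
    return tsallis_entropy(px, q) - conditional_entropy_x_given_y(joint, q)


if __name__ == "__main__":
    joint = [[0.1, 0.2], [0.3, 0.4]]
    px, py = marginals(joint)
    q = 2.0
    print(joint_entropy(joint, q))
    print(tsallis_entropy(py, q) + conditional_entropy_x_given_y(joint, q))
    print(mutual_information(joint, q))
```

The first two printed values agree, which is the chain relation between joint and conditional entropy with the roles of $X$ and $Y$ exchanged.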
In summary, Tsallis entropy generalizes Shannon entropy with an adjustable parameter $q$ and has a wider range of applications.
III Tsallis Entropy Criterion (TEC) algorithm
One key issue in decision tree induction is the split criterion. At every step, the decision tree chooses the pair of attribute and cutting point that yields the maximal impurity decrease, then splits the data and grows the tree. The attribute chosen for each split therefore significantly affects the construction of the tree and, in turn, the classification performance.
III-A Tree construction
Given a data set $D$ with attributes $X_1, \ldots, X_d$ and class label $Y$, for each tree node we search over every possible pair of attribute and cutting point to choose the optimal pair as follows: for an attribute $X_j$,
$$(X_j^{*}, c^{*}) = \arg\max_{X_j,\, c} G(X_j, c), \qquad G(X_j, c) = I(D) - \frac{|D_L|}{|D|} I(D_L) - \frac{|D_R|}{|D|} I(D_R) \tag{16}$$
Here $c$ is a candidate cutting point for attribute $X_j$, $D$ is the data set belonging to the node to be partitioned, and $D_L$, $D_R$ are the two child nodes that would be created if $D$ is partitioned at $c$. The function $I(\cdot)$ is the impurity criterion, e.g. Tsallis entropy, computed over the labels of the data that fall in the node. The pair of attribute and cutting point that maximizes $G(X_j, c)$ is chosen to construct the tree.
The above procedure is applied recursively until a stopping condition is reached. There are three stopping conditions: (i) a subset is fully classified, i.e. all its instances share one class; (ii) no attributes are left for selection; (iii) the cardinality of a subset is not greater than a predefined threshold.
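The split search for a single numeric attribute can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the impurity decrease is taken to be the parent impurity minus the size-weighted impurities of the two children, candidate cutting points are midpoints between consecutive distinct values, and the function names are hypothetical.

```python
import math
from collections import Counter


def tsallis_impurity(labels, q):
    """Tsallis entropy of the empirical class distribution of `labels`."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def impurity_decrease(values, labels, cut, q):
    """Parent impurity minus the size-weighted impurities of the two children."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    return (tsallis_impurity(labels, q)
            - len(left) / n * tsallis_impurity(left, q)
            - len(right) / n * tsallis_impurity(right, q))


def best_cut(values, labels, q):
    """Best midpoint cut for one attribute; needs at least two distinct values."""
    distinct = sorted(set(values))
    cuts = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return max(cuts, key=lambda c: impurity_decrease(values, labels, c, q))


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]
    labels = [0, 0, 1, 1]
    cut = best_cut(values, labels, q=2.0)
    print(cut, impurity_decrease(values, labels, cut, q=2.0))
```

Repeating `best_cut` over every attribute and keeping the pair with the largest decrease gives the split for one node.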
III-B Prediction
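Once the tree has been trained on the data as a classifier $h$, it can be used to predict the class of new unlabeled instances.
A decision tree makes predictions by majority vote. For each class $k$,
$$P(k \mid \mathbf{x}) = \frac{1}{|L(\mathbf{x})|} \sum_{(\mathbf{x}_i, y_i) \in L(\mathbf{x})} \mathbb{1}(y_i = k) \tag{17}$$
where $L(\mathbf{x})$ denotes the leaf containing $\mathbf{x}$, and $|L(\mathbf{x})|$ denotes the number of instances located in $L(\mathbf{x})$. The tree prediction is then the class that maximizes this value:
$$h(\mathbf{x}) = \arg\max_{k} P(k \mid \mathbf{x}) \tag{18}$$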
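The majority vote inside a leaf amounts to estimating class frequencies from the training labels that reached the leaf and returning the most frequent class. A minimal sketch, with hypothetical function names:

```python
from collections import Counter


def leaf_class_probabilities(leaf_labels):
    """Fraction of the training instances in a leaf that belong to each class."""
    n = len(leaf_labels)
    return {label: count / n for label, count in Counter(leaf_labels).items()}


def predict_from_leaf(leaf_labels):
    """Majority vote: the class with the largest estimated probability."""
    probs = leaf_class_probabilities(leaf_labels)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    leaf = ["a", "b", "a", "a"]
    print(leaf_class_probabilities(leaf))
    print(predict_from_leaf(leaf))
```

When two classes tie, `max` returns the one seen first in the leaf; any other tie-breaking rule would serve equally well for this sketch.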
III-C TEC algorithm
Here, we summarize our proposed Tsallis Entropy Criterion (TEC) algorithm in pseudocode in Algorithm 1. Compared with the classical decision tree induction algorithms, the only difference is the split criterion: we use Tsallis entropy to replace the classical split criteria, e.g. Shannon entropy, Gain Ratio and Gini index. In the following subsection, we will see that Tsallis entropy unifies Shannon entropy, Gain Ratio and Gini index through different values of $q$.
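The overall induction procedure can be sketched as a short recursive program. This is a minimal illustration, not the authors' implementation: it assumes numeric attributes, midpoint cutting points, a dictionary-based tree, and an extra stop when no split gives a positive impurity decrease; the names `build_tree`, `best_split`, `predict` and `min_leaf` are hypothetical.

```python
import math
from collections import Counter


def tsallis_impurity(labels, q):
    """Tsallis entropy of the empirical class distribution of `labels`."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def best_split(rows, labels, q):
    """Return (gain, attribute index, cut) of the best binary split, or None."""
    parent = tsallis_impurity(labels, q)
    n = len(rows)
    best = None
    for a in range(len(rows[0])):
        distinct = sorted(set(row[a] for row in rows))
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[a] <= cut]
            right = [y for row, y in zip(rows, labels) if row[a] > cut]
            gain = (parent
                    - len(left) / n * tsallis_impurity(left, q)
                    - len(right) / n * tsallis_impurity(right, q))
            if best is None or gain > best[0]:
                best = (gain, a, cut)
    return best


def build_tree(rows, labels, q, min_leaf=1):
    """Grow a tree recursively; every leaf stores its majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node or on a node at or below the size threshold.
    if len(set(labels)) == 1 or len(labels) <= min_leaf:
        return {"leaf": majority}
    split = best_split(rows, labels, q)
    # Stop when no attribute can separate the rows or nothing is gained.
    if split is None or split[0] <= 0.0:
        return {"leaf": majority}
    _, a, cut = split
    left = [i for i, row in enumerate(rows) if row[a] <= cut]
    right = [i for i, row in enumerate(rows) if row[a] > cut]
    return {
        "attr": a,
        "cut": cut,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], q, min_leaf),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], q, min_leaf),
    }


def predict(tree, row):
    """Route a row down to a leaf and return that leaf's majority class."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["attr"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]


if __name__ == "__main__":
    rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]]
    labels = ["a", "a", "b", "b"]
    tree = build_tree(rows, labels, q=1.4)
    print(tree)
    print(predict(tree, [1.5, 0.0]), predict(tree, [3.5, 9.0]))
```

Only `tsallis_impurity` depends on $q$; replacing it with another impurity function recovers a classical tree with the same control flow.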
III-D Relations to other criteria
As described above, Tsallis entropy unifies Shannon entropy, Gain Ratio and Gini index in one framework. In the following, we reveal the relations between Tsallis entropy and these split criteria.
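Tsallis entropy converges to Shannon entropy for $q \to 1$:
$$\lim_{q \to 1} S_q(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) = H(X) \tag{19}$$
Besides, the Gini index is exactly the specific case of Tsallis entropy with $q = 2$:
$$S_2(X) = \frac{1}{1-2}\left(\sum_{i=1}^{n} p(x_i)^2 - 1\right) = 1 - \sum_{i=1}^{n} p(x_i)^2 \tag{20}$$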
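Both special cases follow from the definition of Tsallis entropy in a few steps. The derivation below is a sketch that assumes the probabilities sum to one, so that numerator and denominator both vanish at $q = 1$ and L'Hôpital's rule in $q$ applies, using $\frac{d}{dq} p^q = p^q \ln p$.

```latex
\begin{align*}
\lim_{q \to 1} S_q(X)
  &= \lim_{q \to 1} \frac{\sum_{i=1}^{n} p(x_i)^q - 1}{1 - q}
   = \lim_{q \to 1} \frac{\sum_{i=1}^{n} p(x_i)^q \ln p(x_i)}{-1}
   = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) = H(X), \\
S_2(X)
  &= \frac{\sum_{i=1}^{n} p(x_i)^2 - 1}{1 - 2}
   = 1 - \sum_{i=1}^{n} p(x_i)^2 .
\end{align*}
```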
As for Gain Ratio, which adds a normalizing factor to Information Gain, it can be seen as a normalized Information Gain. According to Eq. (16), we obtain:
$$GR = \frac{H(D) - \frac{|D_L|}{|D|} H(D_L) - \frac{|D_R|}{|D|} H(D_R)}{H\left(\frac{|D_L|}{|D|}, \frac{|D_R|}{|D|}\right)} \tag{21}$$
where $H$ represents Shannon entropy and $D_L$, $D_R$ are the two child nodes of $D$. If $H$ is replaced by Tsallis entropy, Gain Ratio is generalized to Tsallis Gain Ratio. Thus, Gain Ratio is also covered by Tsallis entropy with an added normalizing factor (Tsallis Gain Ratio) with $q \to 1$.
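A Tsallis Gain Ratio can be sketched in a few lines. The sketch assumes that both the gain and the normalizing factor use Tsallis entropy with the same $q$, and that the normalizing factor is the entropy of the proportions of instances sent to each child; these choices, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def tsallis_entropy(probs, q):
    """Tsallis entropy; Shannon entropy (nats) in the limit q = 1."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def label_entropy(labels, q):
    """Tsallis entropy of the empirical class distribution of `labels`."""
    n = len(labels)
    return tsallis_entropy([count / n for count in Counter(labels).values()], q)


def tsallis_gain_ratio(left, right, q):
    """Impurity decrease of a binary split divided by the split's own entropy."""
    parent = left + right
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    gain = (label_entropy(parent, q)
            - w_left * label_entropy(left, q)
            - w_right * label_entropy(right, q))
    split_info = tsallis_entropy([w_left, w_right], q)
    # A split that sends every instance to one side carries no information.
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    print(tsallis_gain_ratio([0, 0], [1, 1], q=2.0))
    print(tsallis_gain_ratio([0, 0, 1], [1], q=1.0))
```

With $q = 1$ the function reduces to the classical Gain Ratio computed with natural logarithms.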
In summary, Tsallis entropy unifies three kinds of split criteria, i.e. Shannon entropy, Gain Ratio and Gini index, and generalizes the split criterion of decision trees. As far as we know, this is the first time that the common split criteria have been unified into a parametric framework, and the first time that the correspondence between Tsallis entropy with different $q$ and other split criteria has been revealed. The optimal $q$ for Tsallis entropy is obtained by cross-validation and is usually not equal to 1 or 2, which suggests that better performance than the traditional split criteria is attainable. Although the optimal $q$ may differ between data sets, it is associated with the properties of the data set. In other words, the parameter $q$ gives the TEC algorithm adaptability and flexibility. Tsallis entropy thus provides a new approach to improving the performance of decision trees with a tunable $q$ in a unified framework. In the Experiments section, we will see that the TEC algorithm achieves higher accuracy than the classical algorithms with an appropriate $q$.
IV Experiments
As illustrated in Section III, the TEC algorithm is based on Tsallis entropy with an adjustable parameter $q$ and comprises two split criteria, Tsallis entropy and Tsallis Gain Ratio. The Tsallis entropy split criterion degenerates to Shannon entropy and the Gini index with $q = 1$ and $q = 2$, respectively. Likewise, Tsallis Gain Ratio (the normalized Tsallis entropy) degenerates to Gain Ratio with $q = 1$.
IV-A Evaluation Metric
In order to quantitatively compare trees obtained by different methods, we choose accuracy to evaluate the effectiveness of the tree and the total number of the tree nodes to measure the tree complexity.
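Both metrics are simple to compute. The sketch below assumes a hypothetical tree representation in which a leaf is a dictionary with a `"leaf"` key and an internal node has `"left"` and `"right"` children; it is meant only to make the two quantities concrete.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def count_nodes(tree):
    """Total number of nodes, counting internal nodes and leaves."""
    if "leaf" in tree:
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])


if __name__ == "__main__":
    tree = {"attr": 0, "cut": 2.5, "left": {"leaf": "a"}, "right": {"leaf": "b"}}
    print(accuracy(["a", "b", "a"], ["a", "b", "b"]))
    print(count_nodes(tree))
```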
IV-B Data Set Description
As shown in Table I, the UCI data sets [16] are adopted to evaluate the proposed approaches. These data sets consist of three types, namely numeric, categorical and mixed data sets. Also, these data sets include two kinds of classification problems, binary and multiclass classification.
Data Set  Type  Instances  Attributes  Classes
Yeast  numeric  1484  8  10  
Glass  numeric  214  10  7  
Vehicle  numeric  946  18  4  
Wine  numeric  178  13  3  
Haberman  numeric  306  3  2  
Car  categorical  1728  6  4  
Scale  categorical  625  4  3  
Hayes  categorical  160  5  3  
Monks  categorical  432  7  2  
Abalone  mixed  4139  8  18  
Cmc  mixed  1473  9  3 
IV-C Experiment Setup
The decision trees with different split criteria, i.e. Gain Ratio, Shannon entropy, Gini index, Tsallis entropy and Tsallis Gain Ratio, are implemented in Python. We refer to the CART implementation in scikit-learn [17] and the J48 implementation of C4.5 in Weka [18]. For each data set, we first randomly partition the data into a training set and a held-out test set. On the training set, we then run a grid search with 10-fold cross-validation to determine the value of $q$ in Tsallis entropy and Tsallis Gain Ratio. The optimal $q$ for Tsallis entropy and for Tsallis Gain Ratio may differ, but for a fair comparison we use the same $q$ for both, namely the optimal $q$ for Tsallis entropy. In addition, a minimal leaf size is imposed to avoid overfitting. After parameter selection, the best parameters are fixed. A decision tree is then trained on the training data without post-pruning and evaluated on the test data. The whole procedure, from the training-test partition to the evaluation, is repeated 10 times to reduce the influence of randomness.
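The selection of $q$ by grid search with k-fold cross-validation can be sketched as below. To keep the example short, a depth-one tree (a stump) stands in for the full TEC tree, and the grid of $q$ values, the fold construction and all function names are assumptions for illustration rather than the settings used in the experiments.

```python
import math
import random
from collections import Counter


def tsallis_impurity(labels, q):
    """Tsallis entropy of the empirical class distribution of `labels`."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** q for p in probs) - 1.0) / (1.0 - q)


def fit_stump(rows, labels, q):
    """Depth-one tree: (attribute, cut, class on the left, class on the right)."""
    n, parent = len(rows), tsallis_impurity(labels, q)
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback when no split helps: everything goes left to the majority class.
    best = (0.0, 0, float("inf"), majority, majority)
    for a in range(len(rows[0])):
        distinct = sorted(set(row[a] for row in rows))
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[a] <= cut]
            right = [y for row, y in zip(rows, labels) if row[a] > cut]
            gain = (parent
                    - len(left) / n * tsallis_impurity(left, q)
                    - len(right) / n * tsallis_impurity(right, q))
            if gain > best[0]:
                best = (gain, a, cut,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]


def predict_stump(stump, row):
    """Return the class stored on the side of the cut where `row` falls."""
    a, cut, left_class, right_class = stump
    return left_class if row[a] <= cut else right_class


def cv_accuracy(rows, labels, q, k=10, seed=0):
    """Mean accuracy of the stump over k folds of a shuffled index list."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        stump = fit_stump([rows[i] for i in train], [labels[i] for i in train], q)
        correct += sum(predict_stump(stump, rows[i]) == labels[i] for i in fold)
    return correct / len(rows)


def select_q(rows, labels, q_grid, k=10, seed=0):
    """Return the q with the best cross-validated accuracy and all scores."""
    scores = {q: cv_accuracy(rows, labels, q, k, seed) for q in q_grid}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    rows = [[float(i)] for i in range(20)]
    labels = [0] * 10 + [1] * 10
    print(select_q(rows, labels, q_grid=[0.5, 1.0, 2.0], k=5))
```

The same folds are reused for every value of $q$ (fixed `seed`), so differences in the scores come from the split criterion and not from the partition.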
IV-D Results
Figure 1 illustrates the influence of the parameter $q$ in Tsallis entropy on the Glass data set. Figure 1(a) shows that the accuracy is sensitive to $q$ and peaks at a particular value of $q$. Figure 1(b) shows that the tree complexity responds to $q$ differently from the accuracy and reaches its minimum at its own value of $q$. Note that $q$ can be chosen with different goals in mind, e.g. highest accuracy, lowest complexity or a trade-off between the two, which also reflects the adaptability of the TEC algorithm to data sets. In this paper, we choose $q$ by the highest-accuracy principle.
Table II reports the accuracy and complexity of the different criteria on each data set. The highest accuracy and lowest complexity on each data set are in boldface. As expected, TEC outperforms ID3, CART and C4.5, since Tsallis entropy is a generalization of Shannon entropy, Gini index and Gain Ratio. With respect to the two variants of the TEC algorithm, i.e. Tsallis entropy and Tsallis Gain Ratio, neither dominates the other. The results indicate that Tsallis entropy favors high accuracy while Tsallis Gain Ratio favors low complexity. The reason lies in the normalizing factor, which influences the tree structure to some extent. In addition, compared with Shannon entropy and Gini index, Tsallis entropy achieves better accuracy and complexity. Tsallis Gain Ratio also obtains better results than Gain Ratio. Three Wilcoxon signed-rank tests [19] on accuracy (Tsallis entropy vs Shannon entropy, Tsallis entropy vs Gini index, Tsallis Gain Ratio vs Gain Ratio) all reject the null hypothesis of equal performance. The results show that the TEC algorithm with an appropriate $q$ achieves a statistically significant average improvement in accuracy while maintaining a lower complexity.
In terms of the optimal value of $q$, Table II suggests a loose trend: the larger the number of classes, the smaller the optimal $q$ tends to be. For example, for the numeric data sets from Yeast to Haberman, $q$ increases while the number of classes decreases (Vehicle being an exception). In this paper, we choose the optimal value of $q$ by cross-validation, but we conjecture that it is associated with the properties of the data set. For example, on the Car data set all the algorithms produce almost the same results, which indicates that this data set is not sensitive to $q$. The relation between $q$ and the properties of data sets will be discussed in future work.
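A Wilcoxon signed-rank test over the per-data-set accuracies can be reproduced in a few lines. The sketch below is a self-contained exact two-sided test written for illustration, not the procedure the authors ran. The accuracy values are copied from Table II under the assumption that its first accuracy column is Shannon entropy and its fourth is Tsallis entropy; that column order is an assumption, and the function name is hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where the statistic is the smaller of the
    positive and negative rank sums. Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0  # average rank of the tie group
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    total = sum(ranks)
    # Enumerate every sign pattern, each equally likely under the null.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic + 1e-12:
            extreme += 1
    return statistic, extreme / 2 ** len(ranks)


if __name__ == "__main__":
    # Accuracy (%) per data set from Table II; column labels are assumed.
    shannon = [52.8, 51.2, 71.7, 92.9, 70.3, 98.2, 75.9, 81.5, 51.9, 25.4, 49.1]
    tsallis = [56.9, 60.6, 73.8, 95.9, 74.2, 98.3, 78.2, 82.3, 57.3, 26.8, 52.0]
    print(wilcoxon_signed_rank(tsallis, shannon))
```

The enumeration visits $2^{n}$ sign patterns, which is fine for the 11 data sets here but would need a normal approximation or a library routine for much larger $n$.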
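Data Set  Shannon entropy Accuracy (%)  Shannon entropy Nodes  Gini index Accuracy (%)  Gini index Nodes  Gain Ratio Accuracy (%)  Gain Ratio Nodes  Tsallis entropy Accuracy (%)  Tsallis entropy Nodes  q  Tsallis Gain Ratio Accuracy (%)  Tsallis Gain Ratio Nodes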
Yeast  52.8  199  51.8  196.6  52.1  326.2  56.9  195.8  1.4  51.2  197.1  
Glass  51.2  52.4  52.6  53.8  44.2  52  60.6  52.6  2.6  53.1  51.5  
Vehicle  71.7  103  70.2  100  72.3  147.2  73.8  111.0  0.6  73.4  135.7  
Wine  92.9  12.0  90.0  12.0  92.4  9.4  95.9  9.6  3.1  92.9  9.2  
Haberman  70.3  32.2  70.3  33.0  72.8  33.0  74.2  33.2  7.1  74.8  33.0  
Car  98.2  106.4  98.1  106.8  98.5  106.5  98.3  106.2  0.8  98.4  106.6  
Scale  75.9  97.6  76.1  97.2  74.5  77.0  78.2  93.1  3.1  78.5  77.0  
Hayes  81.5  28.8  80.0  25.3  79.2  19.6  82.3  19.5  8.6  81.5  19.2  
Monks  51.9  89.0  52.1  88.6  52.9  88.0  57.3  89.6  8.9  54.9  88.0  
Abalone  25.4  89.2  25.0  85.8  20.3  84.3  26.8  86.2  0.8  25.7  85.1  
Cmc  49.1  267.0  47.4  264.0  45.7  242.8  52.0  264.2  1.2  47.8  242.1 
V Conclusions
In this paper, we present and evaluate Tsallis entropy for enhancing decision trees in a fundamental respect, i.e. the split criterion. We unify the classical split criteria into a parametric framework and propose the TEC algorithm, whose Tsallis entropy split criterion generalizes Shannon entropy, Gain Ratio and Gini index through an adjustable parameter $q$. Most of all, we reveal the relations between Tsallis entropy with different $q$ and other split criteria. Experimental results indicate that, with an appropriate $q$, the TEC algorithm achieves a statistically significant average improvement in accuracy. Nevertheless, the approach has limitations that need to be addressed in the future, such as finding an estimation method for the parameter $q$ to replace the current cross-validation. Furthermore, Tsallis entropy has potential applications beyond decision trees, for instance in Random Forests and Bayesian networks, to be investigated in future work.
Acknowledgments
This research is supported in part by the 973 Program of China (No. 2012CB315803), the National Natural Science Foundation of China (No. 61371078, 61375054), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20130002110051).
References
 [1] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
 [2] J. R. Quinlan, C4.5: programs for machine learning. Morgan Kaufmann Publishers, 1993.
 [3] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. CRC press, 1984.
 [4] S. Nowozin, “Improved information gain estimates for decision tree induction,” in Proceedings of the 29th International Conference on Machine Learning (ICML12). ACM, 2012, pp. 297–304.

 [5] M. Serrurier and H. Prade, “Entropy evaluation based on confidence intervals of frequency estimates: Application to the learning of decision trees,” in Proceedings of the 32nd International Conference on Machine Learning (ICML15). ACM, 2015, pp. 1576–1584.
 [6] W. Buntine and T. Niblett, “A further comparison of splitting rules for decision-tree induction,” Machine Learning, vol. 8, no. 1, pp. 75–85, 1992.
 [7] W. Z. Liu and A. P. White, “The importance of attribute selection measures in decision tree induction,” Machine Learning, vol. 15, no. 1, pp. 25–41, 1994.

 [8] T. Maszczyk and W. Duch, “Comparison of Shannon, Renyi and Tsallis entropy used in decision trees,” in Proceedings of the 17th International Conference on Artificial Intelligence and Soft Computing (ICAISC08). Springer, 2008, pp. 643–651.
 [9] R. Frigg and C. Werndl, “Entropy: a guide for the perplexed,” Probabilities in physics, 2011.
 [10] C. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
 [11] C. Tsallis, “Possible generalization of Boltzmann-Gibbs statistics,” Journal of Statistical Physics, vol. 52, no. 1-2, pp. 479–487, 1988.
 [12] C. Tsallis, Introduction to nonextensive statistical mechanics. Springer, 2009.
 [13] C. Tsallis, “Generalizing what we learnt: Nonextensive statistical mechanics,” in Introduction to Nonextensive Statistical Mechanics. Springer, 2009, pp. 37–106.
 [14] S. Abe and A. Rajagopal, “Nonadditive conditional entropy and its significance for local realism,” Physica A: Statistical Mechanics and its Applications, vol. 289, no. 1, pp. 157–164, 2001.
 [15] T. Yamano, “Information theory based on nonadditive information content,” Physical Review E, vol. 63, no. 4, p. 046105, 2001.
 [16] M. Lichman, “UCI machine learning repository,” http://archive.ics.uci.edu/ml, 2013.
 [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” ACM SIGKDD explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
 [19] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.