1 Introduction
Decision tree induction is a white-box machine learning technique that yields an easily interpretable model after training. For each prediction made by the model, an accompanying explanation can be given. Moreover, as opposed to rule extraction algorithms, the complete structure of the model is easy to analyze, as it is encoded in a decision tree.
In domains where the decisions to be made are critical, the emphasis of machine learning is on offering support and advice to the experts instead of making the decisions for them. As such, the interpretability and comprehensibility of the obtained models are of paramount importance for the experts who need to base their decisions on them, and a white-box approach is therefore preferred. Examples of critical domains include the medical domain (e.g. cardiology and oncology) and the financial domain (e.g. claim management and risk assessment).
One of the disadvantages of decision trees is that they are prone to overfitting [1]. To overcome this shortcoming, ensemble techniques have been proposed. These techniques combine the results of different classifiers [2], leading to an improvement in prediction performance for three reasons. First, when the amount of training data is small compared to the size of the hypothesis space, a learning algorithm can find many different hypotheses that correctly classify all the training data while performing poorly on unseen data. By averaging the results of the different hypotheses, the risk of choosing a wrong hypothesis is reduced. Second, many learning algorithms can get stuck in local optima. By constructing different models from different starting points, the chance of finding the global optimum increases. Third, because of the finite size of the training data set, the optimal hypothesis can lie outside the space searched by the learning algorithm. By combining classifiers, the search space is extended, again increasing the chance of finding the optimal classifier. Nevertheless, ensemble techniques also have disadvantages. First, they take considerably longer to train and to make a prediction. Second, their resulting models require more storage. The third and most important disadvantage is that the obtained model consists either of many decision trees or of a single decision tree containing uninterpretable nodes, making it infeasible or even impossible for experts to interpret and comprehend the model. To bridge the gap between decision tree induction algorithms and ensemble techniques, methods are required that can convert an ensemble into a single model. By first constructing an ensemble from the data and then applying such a post-processing method, a better predictive performance can possibly be achieved than by constructing a decision tree from the data directly.
This post-processing technique is not only useful for increasing predictive performance while maintaining excellent interpretability. It can also be used in a big data setting, where the training data set is too large to construct a predictive model on a single node in a feasible amount of time. In that case, the data set can be partitioned, a predictive model can be constructed for each partition in a distributed fashion, and the resulting models can then be combined.
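The partition-train-combine idea can be sketched as follows. This is illustrative only: a one-split decision stump stands in for a full decision tree induction algorithm, and the data, features, and thresholds are made-up examples.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows):
    # Learn a one-split decision stump: (feature, threshold, left_label, right_label).
    # A stand-in for a full decision tree induction algorithm.
    best_stump, best_correct = None, -1
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            ll = majority(left) if left else rows[0][1]
            rl = majority(right) if right else rows[0][1]
            correct = sum((y == ll) if xi[f] <= t else (y == rl)
                          for xi, y in rows)
            if correct > best_correct:
                best_stump, best_correct = (f, t, ll, rl), correct
    return best_stump

def predict(stump, x):
    f, t, ll, rl = stump
    return ll if x[f] <= t else rl

def fit_partitions_and_vote(partitions, x):
    # One model per data partition (trainable in parallel),
    # combined afterwards by majority vote.
    models = [train_stump(p) for p in partitions]
    return majority([predict(m, x) for m in models])

partitions = [
    [([1], 0), ([2], 0), ([8], 1), ([9], 1)],
    [([0], 0), ([3], 0), ([6], 1), ([7], 1)],
]
# fit_partitions_and_vote(partitions, [10]) -> 1
```

Each partition here learns the same underlying rule (class 1 for large feature values), so the vote agrees; in general the combined model smooths over partition-specific quirks.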
In this paper, we present a novel post-processing technique for ensembles, called GENESIM, which converts the different models of the ensemble into a single, interpretable model. Since each of the models being merged will have an impact on the predictive performance of the final, combined model, a genetic approach is applied that constructs a large ensemble and tries combining models from different subsets of this ensemble. The outline of the rest of this paper is as follows. First, Section 2 presents work related to our technique and its shortcomings. Then, Section 3 describes the different steps of GENESIM. In Section 4, a comparison regarding predictive performance and model complexity is made between the proposed algorithm and prevalent ensemble and decision tree induction techniques. Finally, Section 5 presents a conclusion and possible future work.
2 Related work
In Van Assche et al. [3], a technique called Interpretable Single Model (ISM) is proposed. This technique is very similar to an induction algorithm: it constructs a decision tree recursively top-down, first extracting a fixed set of candidate tests from the trees in the ensemble. For each candidate test, a split criterion is calculated by estimating its parameters from the ensemble instead of from the training data. The test with the optimal split criterion is then chosen, and the algorithm continues recursively until a pre-pruning condition is met. Two shortcomings of this approach can be identified. First, information from all models, including those that have a negative impact, is used to construct the final model. Second, because of its similarity to induction algorithms, it can get stuck in the same local optima as those algorithms.
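The candidate-test extraction step of ISM can be illustrated with a small sketch. The tuple-based tree encoding and the two example trees below are hypothetical stand-ins, not ISM's actual data structures.

```python
def collect_tests(tree, tests=None):
    # Walk a tree encoded as (feature, threshold, left, right);
    # anything that is not a tuple is treated as a leaf label.
    if tests is None:
        tests = set()
    if not isinstance(tree, tuple):
        return tests
    f, t, left, right = tree
    tests.add((f, t))          # every internal node contributes a candidate test
    collect_tests(left, tests)
    collect_tests(right, tests)
    return tests

# Hypothetical two-tree ensemble
t1 = (0, 5.0, 'A', (1, 2.5, 'A', 'B'))
t2 = (0, 4.0, 'A', 'B')
candidates = collect_tests(t1) | collect_tests(t2)
# candidates == {(0, 5.0), (1, 2.5), (0, 4.0)}
```

ISM would then score each of these candidate (feature, threshold) tests with a split criterion estimated from the ensemble, rather than from the training data.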
Deng [4] introduced STEL, which converts an ensemble into an ordered rule list using the following steps. First, for each tree in the ensemble, each path from the root to a leaf is converted into a classification rule. After all rules are extracted, they are pruned and ranked to create an ordered rule list. This ordered rule list can then be used for classification by iterating over the rules and returning the target of the first matching rule. While a good predictive performance is reported for this technique, an ordered rule list is much harder to grasp completely than a decision tree. Therefore, when interpretability is of paramount importance, the post-processing technique that converts the ensemble into a single model should produce a decision tree.
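The rule extraction and first-match classification steps described above can be sketched as follows. The tuple tree encoding and example values are illustrative assumptions, not STEL's implementation (which also prunes and ranks the rules).

```python
def tree_to_rules(tree, conditions=()):
    # Each root-to-leaf path becomes one rule: (conditions, predicted_class).
    if not isinstance(tree, tuple):                      # leaf node
        return [(conditions, tree)]
    f, t, left, right = tree
    return (tree_to_rules(left,  conditions + ((f, '<=', t),)) +
            tree_to_rules(right, conditions + ((f, '>',  t),)))

def matches(conditions, x):
    return all(x[f] <= t if op == '<=' else x[f] > t
               for f, op, t in conditions)

def classify(rule_list, x, default='A'):
    # Iterate over the ordered rules; the first matching rule wins.
    for conditions, label in rule_list:
        if matches(conditions, x):
            return label
    return default

tree = (0, 5.0, 'A', (1, 2.5, 'B', 'C'))
rules = tree_to_rules(tree)   # three rules, one per leaf
# classify(rules, [6.0, 3.0]) -> 'C'
```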
A thorough survey of evolutionary algorithms for decision tree induction can be found in [5]. These algorithms generate an initial population of decision trees and then cross trees over by replacing subtrees of one tree with subtrees of another. With a certain probability, an individual of the population can be mutated by applying operations such as replacing a subtree with a randomly generated tree, changing the test in a node, or swapping two subtrees within the same decision tree.
3 GENetic Extraction of a Single, Interpretable Model (GENESIM)
While Barros et al. [5] discuss genetic algorithms that construct decision trees from the data directly, in this paper a genetic algorithm is applied to an ensemble of decision trees created with well-known induction algorithms combined with techniques such as bagging and boosting. The genetic approach allows us to efficiently traverse the very large search space of possible model combinations. This results in an innovative approach for merging decision trees that exploits the positive properties of creating an ensemble. By exploiting multi-objective optimization, the resulting algorithm increases accuracy and decreases decision tree size at the same time, while most of the state of the art succeeds in only one of the two.
Below, the generic steps of a genetic algorithm [6], as applied in GENESIM (code available at https://github.com/IBCNServices/GENESIM), are elaborated:

Initialization: to create an initial population, decision trees are generated from a training set using different induction algorithms, combined with ensemble techniques such as bagging and boosting. It is important that this population provides enough diversity, which allows for an extensive search of the space and reduces the chance of getting stuck in local optima.
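A minimal sketch of such an initialization, assuming a stand-in "learner" that merely predicts the majority class of its sample; real runs would plug in C4.5-, CART-, QUEST- and GUIDE-style learners here.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    # Sample with replacement (bagging-style resampling).
    return [rng.choice(rows) for _ in rows]

def initial_population(rows, learners, n_bags, seed=0):
    # Train every available induction algorithm on several bootstrap
    # samples to obtain a diverse starting population of models.
    rng = random.Random(seed)
    return [learn(bootstrap(rows, rng))
            for learn in learners
            for _ in range(n_bags)]

# Stand-in "learner": predicts the majority class of its sample.
majority_learner = lambda rows: Counter(y for _, y in rows).most_common(1)[0][0]

data = [([0], 0), ([1], 0), ([2], 1)]
population = initial_population(data, [majority_learner], n_bags=5)
# population holds 5 (trivial) models trained on different bootstrap samples
```

Diversity comes from two sources: the different learners and the different bootstrap samples each learner sees.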

Evaluation: to determine how 'fit' an individual in our population is, its accuracy on a validation set is measured. In case of a tie, the model with the lowest model complexity is preferred.
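This fitness rule, accuracy first with model size as tie-breaker, can be expressed as a lexicographic tuple. In this sketch an individual is modeled as a (predict function, node count) pair, which is an assumption for illustration, not GENESIM's internal representation.

```python
def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fitness(individual, val_rows):
    # Lexicographic fitness: validation accuracy first,
    # smaller model (fewer nodes) breaks ties.
    predict, n_nodes = individual
    return (accuracy(predict, val_rows), -n_nodes)

val = [([0], 0), ([1], 1)]
big   = (lambda x: x[0], 7)   # perfect predictor, 7 nodes
small = (lambda x: x[0], 3)   # perfect predictor, 3 nodes
best = max([big, small], key=lambda ind: fitness(ind, val))
# both are equally accurate, so the smaller model wins the tie
```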

Selection: tournament selection [7] is applied to select the individuals that get combined in each iteration.
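Tournament selection itself is short enough to sketch; the population, fitness values, and tournament size k below are made-up examples.

```python
import random

def tournament_select(population, fitnesses, k, rng):
    # Draw k distinct contestants at random; the fittest one wins.
    contestants = rng.sample(range(len(population)), k)
    return population[max(contestants, key=lambda i: fitnesses[i])]

pop = ['tree_a', 'tree_b', 'tree_c', 'tree_d']
fits = [0.71, 0.89, 0.64, 0.80]
rng = random.Random(42)
winner = tournament_select(pop, fits, k=4, rng=rng)
# with k equal to the population size, the fittest individual always wins
```

Smaller values of k weaken the selection pressure, giving less-fit individuals a chance to be combined and preserving diversity.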

Recombination:
in order to merge two decision trees, they are first converted to sets of k-dimensional hyperplanes. When all nodes from all trees have been converted to their corresponding sets of hyperplanes, the decision spaces can be merged by calculating their intersection using the sweep-line approach discussed in [8]. In this approach, each hyperplane is projected onto a line segment in each dimension. These line segments are then sorted, making it easy to find the intersecting segments in a dimension. In the end, if the projected line segments of two hyperplanes intersect in every dimension, the hyperplanes intersect as well; their intersection can then be calculated and added to the resulting decision space. This method requires O(k * n log n) computational time, with k the dimensionality of the data and n the number of planes in the sets, as opposed to the quadratic complexity of a naive approach that calculates the intersection of every possible pair of planes. Finally, the merged decision space needs to be converted back into a decision tree. A heuristic approach is taken that identifies candidate splitting planes to create a node from and then picks one of these candidates. To select a candidate, a metric (such as information gain) could be used, but this would introduce a bias; therefore, a candidate is selected randomly. The candidate hyperplanes must be unbounded in all dimensions except one (or have bounds equal to the lower and upper bound of the range of each dimension).
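The per-dimension projection test at the heart of the merge can be sketched for axis-aligned regions. This is a simplification: only the pairwise overlap test is shown, not the sorted sweep over all segments, and regions are encoded as hypothetical lists of per-dimension (lo, hi) bounds.

```python
def segments_overlap(a, b):
    # Two 1-D segments (lo, hi) overlap iff the larger lo <= the smaller hi.
    return max(a[0], b[0]) <= min(a[1], b[1])

def region_intersection(region_a, region_b):
    # Axis-aligned regions intersect iff their projections overlap in
    # every dimension; the intersection is the per-dimension overlap.
    if all(segments_overlap(a, b) for a, b in zip(region_a, region_b)):
        return [(max(a[0], b[0]), min(a[1], b[1]))
                for a, b in zip(region_a, region_b)]
    return None  # disjoint regions

r1 = [(0, 5), (0, 5)]
r2 = [(3, 8), (2, 9)]
# region_intersection(r1, r2) == [(3, 5), (2, 5)]
# region_intersection(r1, [(6, 9), (0, 1)]) is None
```

Sorting the projected segments per dimension is what lets the full algorithm avoid testing every pair of planes explicitly.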

Mutation: two possible mutations are implemented: (i) choosing a random node in the tree and replacing its threshold value with a new random number, and (ii) swapping two random subtrees with each other.
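The first mutation operator can be sketched as follows, assuming a hypothetical (feature, threshold, left, right) tuple encoding of trees; the actual GENESIM implementation may represent trees differently.

```python
import random

def mutate_threshold(tree, rng, lo=0.0, hi=10.0):
    # Trees are (feature, threshold, left, right); anything else is a leaf.
    # Walk to a randomly chosen internal node and replace its threshold.
    if not isinstance(tree, tuple):
        return tree                                   # leaves carry no threshold
    f, t, left, right = tree
    branch = rng.randrange(3)
    if branch == 0:
        return (f, rng.uniform(lo, hi), left, right)  # mutate this node
    if branch == 1:
        return (f, t, mutate_threshold(left, rng, lo, hi), right)
    return (f, t, left, mutate_threshold(right, rng, lo, hi))

tree = (0, 5.0, 'A', (1, 2.5, 'B', 'C'))
mutant = mutate_threshold(tree, random.Random(0))
# the structure, feature indices and leaves are preserved;
# only one threshold may have changed
```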

Replacement: the population for the next iteration is created by sorting the individuals by their fitness and keeping only the fittest ones.
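This elitist replacement step can be sketched in a few lines; the individuals here are hypothetical (name, fitness) pairs.

```python
def next_generation(population, fitness_of, size):
    # Elitist truncation: sort by fitness, keep only the `size` fittest.
    return sorted(population, key=fitness_of, reverse=True)[:size]

pop = [('t1', 0.71), ('t2', 0.83), ('t3', 0.66), ('t4', 0.79)]
survivors = next_generation(pop, fitness_of=lambda ind: ind[1], size=2)
# survivors == [('t2', 0.83), ('t4', 0.79)]
```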
4 Results and evaluation
The proposed algorithm is compared, regarding predictive performance and model complexity, to two ensemble methods (Random Forests (RF) [9] and eXtreme Gradient Boosting (XGBoost) [10]) and four decision tree induction algorithms (C4.5 [11], CART [9], GUIDE [12] and QUEST [13]). For this, twelve data sets with very distinct properties from the UCI Machine Learning Repository [14] were used. An overview of the characteristics of each data set can be found in Table 1.

Table 1: Characteristics of the data sets: number of samples, number of continuous and discrete features, and class distribution (%).

name    | #samples | #cont | #disc | class_dist
iris    | 150      | 4     | 0     | 33.3/33.3/33.3
cars    | 1727     | 0     | 6     | 70.0/22.2/4.0/3.8
glass   | 213      | 9     | 0     | 32.4/35.7/8.0/6.1/4.2/13.6
led7    | 2563     | 0     | 7     | 13/13/12/11/13/13/13/12
pima    | 768      | 7     | 1     | 65.1/34.9
wine    | 177      | 13    | 0     | 32.8/40.1/27.1
austra  | 690      | 5     | 9     | 55.5/44.5
ecoli   | 326      | 5     | 2     | 43.6/23.6/16.0/10.7/6.1
heart   | 269      | 5     | 8     | 55.8/44.2
lymph   | 142      | 0     | 18    | 57.0/43.0
vehicle | 846      | 14    | 4     | 25.1/25.7/25.8/23.5
breast  | 698      | 0     | 9     | 65.5/34.5
The hyperparameters of each of the tree induction and ensemble techniques were tuned using grid search when the number of parameters was lower than four; otherwise, Bayesian optimization was used. Because of the rather high computational complexity of GENESIM, hyperparameter optimization could not be applied to it. The ensemble that was transformed into a single model by GENESIM was constructed using different induction algorithms (C4.5, CART, QUEST and GUIDE) combined with bagging and boosting. We applied 3-fold cross-validation 10 times on each of the data sets and stored the mean accuracy and model complexity over the 3 folds. The mean accuracy and mean model complexity (and their corresponding standard deviations) over these 10 measurements can be found in Table 2 and Table 3. Bootstrap statistical significance testing was applied to construct a Win-Tie-Loss matrix, which can be seen in Figure 1. Algorithm A wins over B for a certain data set when its mean accuracy on that data set is higher than that of B and the p-value of the statistical test is lower than 0.05. When an algorithm has more wins than losses compared to another algorithm, the cell is colored green (and hatched with stripes); otherwise, the cell is colored red (and hatched with dots). The darker the green, the more wins the algorithm has over the other; similarly, the darker the red, the more losses.

A few things can be deduced from these matrices. First, we can clearly see that the ensemble techniques RF and XGBoost have a superior accuracy compared to all other algorithms on these data sets, and that XGBoost performs better than RF. While the accuracy is indeed better, the increase can be rather moderate, while the resulting model is completely uninterpretable. Second, in terms of accuracy, the proposed GENESIM is better than all decision tree induction algorithms except C4.5. Still, GENESIM is very competitive with it (winning on two data sets while losing on three), and C4.5 may only come out ahead because no hyperparameter optimization was applied to GENESIM: for each data set, the same hyperparameters were used (such as a limited number of iterations and using 50% of the training data as validation data). Third, GENESIM produces very interpretable models with a very low model complexity (expressed here as the number of nodes in the tree). The average number of nodes in the resulting tree is lower than for CART and C4.5, but higher than for QUEST and GUIDE; however, the predictive performance of these two last-mentioned algorithms is much lower than that of GENESIM.

5 Conclusion
In this paper, a technique called GENESIM is proposed. While exploiting the positive properties of constructing ensembles, it results in a single, interpretable model, which is ideally suited to support experts in critical domains. Results show that in most cases an increased predictive performance can be achieved, while the model complexity remains similar to that of trees produced by induction algorithms. Results of GENESIM can still be improved by reducing the computational complexity of our algorithm, which would allow hyperparameter optimization and more iterations in a feasible amount of time. Moreover, in future work, similar techniques such as ISM can be implemented to allow a comparison with GENESIM.
Table 2: Mean accuracy (± standard deviation) per data set and algorithm.

        | XGB        | CART       | QUEST      | GENESIM    | RF         | ISM        | C4.5       | GUIDE
heart   | 0.8257±0.01 | 0.7441±0.02 | 0.7585±0.02 | 0.7982±0.02 | 0.8129±0.01 | 0.8024±0.02 | 0.7877±0.03 | 0.7829±0.02
led7    | 0.8018±0.0  | 0.7997±0.0  | 0.7986±0.0  | 0.7926±0.0  | 0.8027±0.0  | 0.7996±0.0  | 0.8012±0.0  | 0.761±0.01
iris    | 0.9505±0.01 | 0.9504±0.01 | 0.9562±0.0  | 0.9463±0.01 | 0.95±0.01   | 0.9519±0.01 | 0.9395±0.01 | 0.9467±0.01
cars    | 0.9842±0.0  | 0.9749±0.0  | 0.9411±0.01 | 0.9543±0.01 | 0.9701±0.01 | 0.9685±0.0  | 0.966±0.0   | 0.9426±0.01
ecoli   | 0.8651±0.01 | 0.8196±0.02 | 0.8195±0.01 | 0.8325±0.02 | 0.8486±0.01 | 0.7507±0.04 | 0.817±0.03  | 0.8319±0.01
glass   | 0.7494±0.02 | 0.6667±0.03 | 0.649±0.03  | 0.6696±0.03 | 0.7526±0.03 | 0.6489±0.03 | 0.6763±0.03 | 0.6557±0.02
austra  | 0.8686±0.01 | 0.8506±0.01 | 0.8547±0.01 | 0.8553±0.01 | 0.8663±0.01 | 0.8557±0.01 | 0.8528±0.01 | 0.8582±0.01
vehicle | 0.7606±0.01 | 0.6988±0.01 | 0.6986±0.01 | 0.6834±0.01 | 0.7383±0.01 | 0.6672±0.01 | 0.7115±0.01 | 0.6821±0.01
breast  | 0.9591±0.0  | 0.94±0.01   | 0.947±0.01  | 0.9496±0.01 | 0.958±0.01  | 0.9466±0.0  | 0.9443±0.0  | 0.937±0.01
lymph   | 0.8354±0.02 | 0.7686±0.02 | 0.7907±0.03 | 0.7866±0.02 | 0.817±0.02  | 0.7822±0.03 | 0.7839±0.03 | 0.7659±0.04
pima    | 0.7543±0.01 | 0.7174±0.02 | 0.7385±0.01 | 0.7266±0.01 | 0.7626±0.01 | 0.7346±0.01 | 0.7348±0.01 | 0.7285±0.02
wine    | 0.9709±0.01 | 0.9072±0.01 | 0.9055±0.03 | 0.9128±0.03 | 0.9603±0.01 | 0.8838±0.01 | 0.9217±0.01 | 0.8828±0.03
Table 3: Mean model complexity, expressed as the number of nodes (± standard deviation), per data set and algorithm.

        | XGB(*)          | CART           | QUEST        | GENESIM       | RF(*)           | ISM            | C4.5         | GUIDE
heart   | 408.4815±188.2  | 35.8148±12.54  | 9.1852±2.97  | 17.4444±4.84  | 448.6113±154.6  | 35.8889±10.71  | 23.5556±6.62 | 9.1481±2.28
led7    | 459.9792±152.2  | 201.9583±1.2   | 57.625±4.91  | 92.0417±17.08 | 516.25±155.4    | 111.2917±15.45 | 58.9583±2.09 | 32.9167±2.55
iris    | 544.5238±144.6  | 12.2857±1.34   | 5.8571±0.59  | 5.9048±0.65   | 453.2381±204.4  | 10.5714±1.91   | 7.3809±1.06  | 5.3333±0.55
cars    | 631.2821±123.7  | 140.1282±2.66  | 45.6667±4.7  | 103.1539±14.42| 438.4615±178.3  | 131.4102±9.62  | 98.4359±4.6  | 43.6154±5.07
ecoli   | 487.5625±202.9  | 35.6667±11.77  | 14.5833±3.48 | 19.0833±4.27  | 447.0623±147.7  | 60.125±16.06   | 19.25±2.84   | 10.0833±1.43
glass   | 530.7017±179.2  | 57.8421±11.27  | 22.4035±5.66 | 29.6667±5.75  | 486.9825±160    | 80.3684±24.1   | 36.2982±3.09 | 16.1579±2.47
austra  | 433.0392±72.7   | 7.7451±6.19    | 7.902±3.23   | 23.7843±7.37  | 396.3333±181.5  | 38.8824±15.73  | 26.7255±6.82 | 8.2941±3.12
vehicle | 465.6667±119.4  | 177.1111±22.26 | 81.7778±14.85| 83.2222±9.68  | 485.2778±146.8  | 345.5556±45.92 | 92.4444±12.43| 33.2222±8.71
breast  | 563.3333±170.6  | 30.619±7.89    | 12.619±3.73  | 18.5238±3.49  | 395.5714±161.4  | 43.7619±13.31  | 19.4762±2.38 | 10.4286±1.65
lymph   | 608.4375±140.5  | 32.0417±5.75   | 13.5417±3.14 | 14.8333±4.0   | 497.9375±162.3  | 30.9583±6.6    | 16.9583±2.44 | 8.875±2.81
pima    | 180.0556±85.5   | 52.4445±19.8   | 12.0±4.32    | 45.2222±8.53  | 434.8334±68.04  | 101.6667±18.5  | 26.0±5.12    | 8.1111±2.36
wine    | 487.0948±176.9  | 13.4762±1.58   | 9.1905±1.66  | 8.0476±0.93   | 409.2381±116.1  | 33.3809±3.04   | 9.381±0.33   | 6.8095±0.77
References
[1] Donna K. Slonim. From patterns to pathways: gene expression data analysis comes of age. Nature Genetics, 32:502–508, 2002.
[2] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop, pages 1–15. Springer Berlin Heidelberg, 2000.
[3] Anneleen Van Assche and Hendrik Blockeel. Seeing the forest through the trees. In International Conference on Inductive Logic Programming, pages 269–279. Springer, 2007.
[4] Houtao Deng. Interpreting tree ensembles with inTrees. arXiv preprint arXiv:1408.5456, 2014.
[5] Rodrigo Coelho Barros, Márcio Porto Basgalupp, André C. P. L. F. de Carvalho, and Alex A. Freitas. A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(3):291–312, 2012.
[6] Kumara Sastry, David Goldberg, and Graham Kendall. Genetic algorithms. In Search Methodologies, pages 97–125. Springer, 2005.
[7] David E. Goldberg, Bradley Korb, and Kalyanmoy Deb. Messy genetic algorithms: motivation, analysis, and first results. Complex Systems, 3:493–530, 1989.
[8] Artur Andrzejak, Felix Langner, and Silvestre Zabala. Interpretable models from distributed data via merging of decision trees. In Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 1–9, 2013.
[9] Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1984.
[10] Tianqi Chen and Carlos Guestrin. XGBoost: a scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[11] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
[12] Wei-Yin Loh. Improving the precision of classification trees. The Annals of Applied Statistics, pages 1710–1737, 2009.
[13] Wei-Yin Loh. Classification and regression tree methods. In Encyclopedia of Statistics in Quality and Reliability. Wiley, 2008.
[14] M. Lichman. UCI Machine Learning Repository, 2013.