. This is even the case for tasks for which it is commonly accepted that decision trees are inappropriate models, e.g., in domains with continuous input spaces, where the axis-aligned decision boundaries of regular decision trees impose a limitation. Two aspects of the induction of RFs seem to be crucial for that improved ability. First, due to random feature sub-sampling and bagging, which leads to slightly shifted class distributions, each decision tree selects slightly different feature thresholds, patterns and conditions. This results in a variety of alternative explanations, increasing the chances of including patterns that generalize well while avoiding overfitting by maintaining a certain degree of variability. Second, by using an ensemble of trees whose predictions are combined via voting, random forests achieve smooth decision boundaries, which enable them to address problems that a single tree induced by a common decision tree learner cannot solve.
Unfortunately, one of the key advantages of decision trees, namely their interpretability, is strongly limited. Even though each of the trees can be comprehended and analyzed individually, inspecting hundreds or thousands of them would easily overwhelm a human being. In this work, we are interested in the trade-off between the effectiveness and the interpretability of such ensembles. We propose to transform random forests into classification rules and investigate this trade-off by selecting subsets of these rules using different approaches and heuristics known from the field of rule induction. In our analysis, we consider different aspects such as coverage and completeness, redundancy avoidance, and the accuracy of the final models. We argue that a rule-based strategy has conceptual advantages over the selection of a subset of trees. First, it allows a more fine-grained analysis, since we can investigate the effect of adding individual rules rather than entire trees that potentially consist of a large number of rules. Second, basing the selection on trees rather than rules would not result in the best trade-off between effectiveness and interpretability, as each tree covers the full instance space and therefore contributes a large number of redundant rules to the model.
2 Simplifying Random Forests by Selection of Rules
A random forest consists of a predefined number of decision trees. Each tree is built from only a subset of the available training instances (bagging) and attributes (random feature sampling), which ensures a diverse set of trees.
The trees of a RF can be converted into an equivalent set of rules by considering each path from the root node to a leaf as a propositional rule, whose body is the conjunction of all conditions encountered on the path and whose head predicts the class in the leaf. For making a prediction, each covering rule provides a vote for the class in its head. Note that the way these rules are generated guarantees that each example is covered by exactly one rule per tree, i.e., by as many rules as the forest has trees.
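The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments: the `Node` and `Rule` classes, the function `tree_to_rules`, and the two toy trees are hypothetical, and internal nodes are assumed to test `x[feature] <= threshold`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A condition is (feature index, operator, threshold), operator "<=" or ">".
Condition = Tuple[int, str, float]


@dataclass
class Node:
    """Hypothetical binary tree node; leaves carry a label, inner nodes a test."""
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch for x[feature] <= threshold
    right: Optional["Node"] = None   # branch for x[feature] > threshold
    label: Optional[str] = None      # set only for leaves


@dataclass
class Rule:
    conditions: List[Condition]  # conjunction of all tests on the path
    head: str                    # class predicted in the leaf

    def covers(self, x) -> bool:
        return all((x[f] <= t) if op == "<=" else (x[f] > t)
                   for f, op, t in self.conditions)


def tree_to_rules(node: Node, path: Tuple[Condition, ...] = ()) -> List[Rule]:
    """Turn every root-to-leaf path of a tree into one propositional rule."""
    if node.label is not None:
        return [Rule(list(path), node.label)]
    left = tree_to_rules(node.left, path + ((node.feature, "<=", node.threshold),))
    right = tree_to_rules(node.right, path + ((node.feature, ">", node.threshold),))
    return left + right


# Toy forest of two hand-built trees (made up for illustration).
tree_a = Node(0, 0.5, Node(label="red"),
              Node(1, 0.3, Node(label="red"), Node(label="blue")))
tree_b = Node(1, 0.6, Node(label="red"), Node(label="blue"))
forest = [tree_a, tree_b]

rules_per_tree = [tree_to_rules(tree) for tree in forest]
x = [0.7, 0.4]
covering = [r for rules in rules_per_tree for r in rules if r.covers(x)]
```

Because the leaves of a tree partition the instance space, `covering` contains one rule per tree for any instance, here two rules.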
To assess the quality of individual rules, we rely on heuristics commonly used in rule learning. These heuristics map the numbers of true positives, false positives, true negatives, and false negatives of a rule to a single heuristic value. Among the heuristics we use are precision (also called confidence), which has frequently been used despite its tendency to overfit, as well as recall, which we expect to work well in a setting where voting is used. Additionally, we use a parametrization of the m-estimate, which has proved to be one of the most effective heuristics in rule learning. For selecting a subset of the rules that have been extracted from a RF, we use the following strategies. (1) Best rules: The highest-scoring rules according to a specified heuristic are included in the subset. (2) Weighted covering: Like (1), but after selecting a rule, the weights of the instances it covers are halved. Because the remaining rules are then re-evaluated on the new weights, rules that cover so far neglected instances are preferred, which leads to a more even coverage of the training data.
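The heuristics and the two selection strategies can be sketched as follows. This is a sketch under stated assumptions rather than the code used for the experiments: a rule is represented only by the set of training instances it covers and its head, the names `CoveredRule`, `select_best` and `select_weighted_covering` are hypothetical, and `m=2.0` is a placeholder, not the parametrization of the m-estimate used in this work. The m-estimate is written in its usual form (tp + m * P / (P + N)) / (tp + fp + m).

```python
from typing import Callable, FrozenSet, List, NamedTuple, Sequence


class CoveredRule(NamedTuple):
    name: str
    covered: FrozenSet[int]  # indices of the training instances the rule covers
    head: str                # class predicted by the rule


# Each heuristic takes (weighted) counts: true positives tp, false positives fp,
# and the totals P and N of positive and negative instances for the rule's head.
def precision(tp: float, fp: float, P: float, N: float) -> float:
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tp: float, fp: float, P: float, N: float) -> float:
    return tp / P if P > 0 else 0.0


def m_estimate(tp: float, fp: float, P: float, N: float, m: float = 2.0) -> float:
    # m = 2.0 is a placeholder value chosen for this sketch only.
    return (tp + m * P / (P + N)) / (tp + fp + m)


def score(rule: CoveredRule, labels: Sequence[str], weights: Sequence[float],
          heuristic: Callable[..., float]) -> float:
    tp = sum(weights[i] for i in rule.covered if labels[i] == rule.head)
    fp = sum(weights[i] for i in rule.covered if labels[i] != rule.head)
    P = sum(w for w, y in zip(weights, labels) if y == rule.head)
    N = sum(weights) - P
    return heuristic(tp, fp, P, N)


def select_best(rules, labels, heuristic, n: int) -> List[CoveredRule]:
    """Strategy (1): take the n highest-scoring rules, weights stay uniform."""
    weights = [1.0] * len(labels)
    ranked = sorted(rules, key=lambda r: score(r, labels, weights, heuristic),
                    reverse=True)
    return ranked[:n]


def select_weighted_covering(rules, labels, heuristic, n: int) -> List[CoveredRule]:
    """Strategy (2): after each pick, halve the weights of the covered instances."""
    weights = [1.0] * len(labels)
    remaining, selected = list(rules), []
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda r: score(r, labels, weights, heuristic))
        remaining.remove(best)
        selected.append(best)
        for i in best.covered:
            weights[i] *= 0.5
    return selected


# Toy data (made up): six red and two blue instances, three red rules.
labels = ["red"] * 6 + ["blue"] * 2
rules = [CoveredRule("A", frozenset({0, 1, 2, 3}), "red"),
         CoveredRule("B", frozenset({0, 1, 2}), "red"),
         CoveredRule("C", frozenset({4, 5}), "red")]

best = select_best(rules, labels, recall, n=2)
diverse = select_weighted_covering(rules, labels, recall, n=2)
```

On this toy data, the best-rules strategy picks the overlapping rules A and B, whereas weighted covering picks A and then C, because the instances covered by A count only half when B is re-evaluated.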
For our experiments, we trained a random forest with 100 trees on the training data set, selected rules from it using the strategies described above, and evaluated the accuracy of the resulting rule sets on the test set.
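The evaluation step can be illustrated with a small voting routine. The sketch below rests on assumptions that are not specified in the text: instances covered by no selected rule receive a `default` class, ties between classes are broken in favor of the class that received its first vote earliest, and the rules and test instances are made up.

```python
from collections import Counter


def covers(conditions, x) -> bool:
    """A rule body is a conjunction of (feature index, operator, threshold) tests."""
    return all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in conditions)


def predict(rules, x, default):
    """Each covering rule votes for its head; the class with most votes wins."""
    votes = Counter(head for conditions, head in rules if covers(conditions, x))
    if not votes:
        return default  # assumption: fall back to a default class if uncovered
    return votes.most_common(1)[0][0]


def accuracy(rules, X, y, default) -> float:
    correct = sum(predict(rules, x, default) == label for x, label in zip(X, y))
    return correct / len(y)


# Hypothetical subset of two selected rules and a toy test set.
selected_rules = [
    ([(0, "<=", 0.5)], "red"),
    ([(0, ">", 0.5), (1, ">", 0.3)], "blue"),
]
X_test = [[0.2, 0.9], [0.8, 0.6], [0.9, 0.1], [0.7, 0.2]]
y_test = ["red", "blue", "red", "blue"]

acc = accuracy(selected_rules, X_test, y_test, default="red")
```

The last two test instances are covered by neither rule and fall back to the default class, which shows why coverage matters once only a subset of the rules is kept.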
As a first step, we applied our method to a synthetic data set, where the task is to approximate an oblique decision boundary in two-dimensional space (Fig. 1). The results show that selecting the best rules according to a given heuristic (the m-estimate in this case) yields a set of redundant rules that cover a mostly homogeneous region. In contrast, weighted covering selects more diverse rules, which cover the instance space more evenly. This results in a more precise approximation of the original decision boundary.
Fig. 2 shows the behavior of the different rule selection strategies on the breast-cancer data set from the UCI repository (obtained via 10-fold cross-validation). In particular, weighted covering using the m-estimate or recall performs well, reaching the baseline performance of using all rules already after selecting around (recall) or (m-estimate) rules. When using around of all rules, a clearly visible improvement can be observed. Interestingly, the performance can also clearly drop again (m-estimate). Recall seems to be more robust, even when using the best rules strategy. It is also the approach with the steepest increase in coverage. Results on other data sets (not shown here) exhibit similar effects.
Fig. 1: Synthetic data with 800 red and 200 blue instances, normally distributed along the corresponding lines. The rectangles denote the rules; their colors refer to the difference between blue and red votes. The green line visualizes the decision boundary.
We have proposed a technique for simplifying random forests based on rule subset selection. As our preliminary results show, high predictive performance is already achievable with a few rules, often even clearly outperforming the baselines. Our results suggest that the greatest potential for improvements lies in the rule selection strategy.
- (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
- Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, pp. 3133–3181.
- (2012) Foundations of rule learning. Springer.
- (2010) On the quest for optimal rule learning heuristics. Machine Learning 78 (3).
- Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research 18 (48), pp. 1–33.