Decision trees and their variants have a rich and successful history in machine learning, and their performance has been demonstrated empirically in many competitions and even in automatic machine learning settings.
Various approaches have been used to enable decision tree representations within a neural network setting. This paper considers non-greedy tree algorithms that are built on oblique decision boundaries through probabilistic routing. In this way, the decision tree boundaries and the resulting classification are treated as a problem that can be learned through backpropagation in a neural network setting.
On the neural network side, it has been further demonstrated that highway networks can be viewed as an ensemble of shallow neural networks. As ensembles of classifiers are asymptotically related to Bayesian model averaging, creating a decision tree model within a neural network setting over a highway network can be used to determine the optimal neural network architecture and, by extension, the optimal hyperparameters for decision tree learning.
As our contribution, we aim to provide an automated way to induce decision trees whilst retaining existing weights, in order to progressively grow or prune decision trees in an online manner. This reduces the number of hyperparameters that must be chosen when selecting models, instead allowing our algorithm to search automatically for the ideal neural network architecture. To this end, we modify existing non-greedy decision tree algorithms by stacking our models through a modified routing algorithm. Thus, whereas previously a single induced decision tree may have had only one set of leaf nodes to train, in our approach a single decision tree can have different sets of leaf nodes stacked in order to determine the ideal neural network architecture.
This paper seeks to identify and bridge two commonly used machine learning techniques, tree models and neural networks, and to identify some avenues for future research.
Within decision tree algorithms, research has been done on growing and pruning trees in a post-hoc manner; however, greedy trees are limited in their ability to fine-tune the split function once a parent node has already been split. In this section, we briefly outline related work on non-greedy decision trees, approaches to extending non-greedy trees to ensemble models and, finally, mechanisms for performing model choice through Bayesian model determination.
2.1 Inducing Decision Trees in Neural Networks
Decision trees have been induced within neural networks through soft routing of the decision tree, wherein the contribution of each leaf node to the final probability is determined probabilistically. One of the contributions of Kontschieder et al., compared with other approaches, is the separation of the training of the probabilistic routing of the underlying binary decision tree from the training of the leaf classification nodes, which need not perform binary classification. The decision tree algorithm was also extended, in a shallow ensemble manner, to a decision forest by bagging the classifiers.
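The soft-routing idea can be illustrated with a minimal sketch. It assumes a complete binary tree stored in breadth-first order, a sigmoid over a linear function at each internal node, and a fixed class distribution at each leaf; the function names and the storage layout are illustrative assumptions, not the implementation used in the cited work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probabilities(x, weights, biases, depth):
    """Probability of reaching each leaf of a complete binary tree.

    Nodes are stored in breadth-first order (assumed layout): node i has
    children 2*i + 1 (left) and 2*i + 2 (right).
    """
    n_internal = 2 ** depth - 1
    reach = [0.0] * (2 ** (depth + 1) - 1)
    reach[0] = 1.0  # every instance reaches the root
    for i in range(n_internal):
        z = sum(w * xi for w, xi in zip(weights[i], x)) + biases[i]
        p_left = sigmoid(z)  # soft decision instead of a hard split
        reach[2 * i + 1] = reach[i] * p_left
        reach[2 * i + 2] = reach[i] * (1.0 - p_left)
    return reach[n_internal:]


def predict(x, weights, biases, leaf_dists, depth):
    """Mix the leaf class distributions by the probability of reaching each leaf."""
    mu = leaf_probabilities(x, weights, biases, depth)
    n_classes = len(leaf_dists[0])
    return [sum(m * dist[c] for m, dist in zip(mu, leaf_dists))
            for c in range(n_classes)]


# Hypothetical depth-2 tree on two features with two classes.
weights = [[0.5, 0.1], [-0.3, 0.8], [1.0, 0.0]]
biases = [0.0, 0.1, -0.2]
leaf_dists = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
probs = predict([1.0, -2.0], weights, biases, leaf_dists, depth=2)
```

Because every routing decision is differentiable, the routing parameters and the leaf distributions can in principle be trained by gradient descent, which is what makes the tree expressible inside a neural network.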
Early implementations of decision trees often used recursive partitioning methods, which perform partitions of the form $x_i > k$, where $x_i$ is one of the variables in the dataset and $k$ is a constant defining the split decision. These decision trees are also called axis-parallel, because each node produces an axis-parallel hyperplane in the attribute space. These trees are often considered greedy trees, as they grow a tree one node at a time with no ability to fine-tune the splits based on the results of training at lower levels of the tree.
In contrast, recent implementations of decision trees focus instead on the ability to update the tree in an online fashion, leading to non-greedy optimizations typically based on oblique decision trees. Oblique decision trees change the partition decisions to the form $\sum_{i=1}^{d} a_i x_i + a_{d+1} > 0$, where $x_1, \dots, x_d$ are the variables in the dataset and $a_1, \dots, a_{d+1}$ are real-valued coefficients. These tests are equivalent to hyperplanes at an oblique orientation relative to the axes, hence the name oblique decision trees. From this setting, one can recover the axis-parallel counterpart by simply setting $a_i = 0$ for all coefficients $a_1, \dots, a_d$ except one.
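The relationship between the two kinds of split can be checked with a short sketch. The function names are illustrative assumptions; the point is only that an axis-parallel test is an oblique test whose feature coefficients are all zero except one.

```python
def axis_parallel_test(x, index, threshold):
    """Axis-parallel split: compare a single feature against a constant."""
    return x[index] > threshold


def oblique_test(x, coeffs, intercept):
    """Oblique split: compare a linear combination of all features against zero."""
    return sum(a * xi for a, xi in zip(coeffs, x)) + intercept > 0


def axis_parallel_as_oblique(index, threshold, n_features):
    """Express an axis-parallel split as an oblique one with a single nonzero coefficient."""
    coeffs = [0.0] * n_features
    coeffs[index] = 1.0
    return coeffs, -threshold


# Both tests agree on every point once the coefficients are set this way.
points = [(0.2, 1.5), (3.0, -1.0), (1.0, 1.0)]
coeffs, intercept = axis_parallel_as_oblique(index=0, threshold=1.0, n_features=2)
for p in points:
    assert axis_parallel_test(p, 0, 1.0) == oblique_test(p, coeffs, intercept)
```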
2.2 Ensemble Modelling and the Model Selection Problem
Ensemble modelling within neural networks has also been covered by Veit et al., who demonstrated the relationship between residual networks (and, by extension, highway networks) and the shallow ensembling of neural networks. Furthermore, since we are interested in stacking models of the same class, the asymptotic properties of stacking and Bayesian model averaging demonstrated by Le and Clarke are relevant. Approaches such as sequential Monte Carlo methods can be used to change state and continually update the underlying model.
A simple way to address the problem is to take an ensemble approach. In this setting, we would simply treat the new data as independent of the old data and construct a separate learner, then combine the learners using a stacking approach. Here we aim to combine models as a linear combination together with a base model, which might represent any kind of underlying learner.
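A minimal sketch of combining learners as a linear combination follows. It assumes each learner outputs class probabilities and that the stacking weights are non-negative; the weights here are supplied by hand and are purely hypothetical, whereas in stacking they would be learned from held-out predictions.

```python
def stack_predictions(model_probs, weights):
    """Combine per-model class probabilities as a convex linear combination."""
    if len(model_probs) != len(weights):
        raise ValueError("one weight is required per model")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    total = sum(weights)
    n_classes = len(model_probs[0])
    return [sum((w / total) * probs[c] for w, probs in zip(weights, model_probs))
            for c in range(n_classes)]


# Hypothetical outputs of a learner trained on old data and one trained on new data.
old_model = [0.9, 0.1]
new_model = [0.5, 0.5]
combined = stack_predictions([old_model, new_model], weights=[0.75, 0.25])
```

Normalizing the weights keeps the combined output a valid probability distribution whenever each input is one.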
More recently, there have been attempts at building ensemble tree methods for online decision trees, including the use of bagging techniques in a random forest fashion. Furthermore, it has been demonstrated that boosting and ensemble models have connections with residual networks, giving rise to the possibility of constructing boosted decision tree algorithms using neural network frameworks.
These approaches to ensemble models have a Bayesian parallel in Bayesian model averaging algorithms. These models are related to stacking, where the marginal distribution over the data set is given by $p(X) = \sum_{h=1}^{H} p(X \mid h)\, p(h)$. The interpretation of this summation over $h$ is that just one model is responsible for generating the whole data set, and the probability distribution over $h$ reflects uncertainty as to which model that is. As the size of the data set increases, this uncertainty reduces and the posterior probabilities $p(h \mid X)$ become increasingly focused on just one of the models.
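The concentration of the model posterior as the data set grows can be seen in a small worked sketch. It assumes two hypothetical candidate models of a coin, with heads probability 0.5 and 0.7, a uniform prior over the models, and data in which 70% of tosses are heads; none of these values come from the experiments in this paper.

```python
import math


def model_posterior(log_likelihoods, priors):
    """Posterior over models from per-model log-likelihoods and prior probabilities."""
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    peak = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - peak) for v in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def bernoulli_log_likelihood(heads, tails, p):
    return heads * math.log(p) + tails * math.log(1.0 - p)


candidate_biases = [0.5, 0.7]  # hypothetical models
priors = [0.5, 0.5]            # uniform prior over the models
posteriors = {}
for n in (10, 100, 1000):
    heads = 7 * n // 10
    tails = n - heads
    lls = [bernoulli_log_likelihood(heads, tails, p) for p in candidate_biases]
    posteriors[n] = model_posterior(lls, priors)
```

With more data, the posterior mass on the second model approaches one, which is the behaviour that allows stacking weights to be read as a form of model selection.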
3 Our Approach
In this section we describe our approach to automatically growing and pruning decision trees. The section is divided into the following parts: decision routing, highway networks and stacking.
3.1 Decision Routing
In our decision routing, a single neural network can have multiple routing paths for the same decision tree. If we start from the base decision tree, we could have two additional variations: a tree with one node pruned and a tree with one additional node grafted. In all three scenarios, the decision tree would share the same set of weights; the only alteration is that the routing would be different in each case.
In this scenario, all trees share the same underlying tree structure and are connected in the same way. It is in this manner that weights can be shared among all the trees. The routing layer determines whether nodes are to be pruned or grafted; in the simplest case, we pick a leaf node uniformly at random to prune or to graft. Additional weighting could be given depending on the past history of the node and updated using SMC approaches with a uniform prior.
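One way to picture routing over shared weights is to keep a single pool of node weights and represent each tree variant as the set of internal nodes that are switched on. The sketch below is an illustration under several assumptions not stated in the text: nodes are indexed in breadth-first order, grafting turns a uniformly chosen leaf into a split, and pruning collapses a uniformly chosen split whose two children are both leaves while always keeping the root.

```python
import random


def leaves(active, max_nodes):
    """Leaves of the current tree: inactive nodes whose parent is active."""
    if 0 not in active:
        return [0]
    return [n for n in range(1, max_nodes)
            if n not in active and (n - 1) // 2 in active]


def graft(active, max_internal, rng):
    """Turn a uniformly chosen leaf into a split, if shared weights exist for it."""
    candidates = [n for n in leaves(active, 2 * max_internal + 1) if n < max_internal]
    if not candidates:
        return set(active)
    return set(active) | {rng.choice(candidates)}


def prune(active, rng):
    """Collapse a uniformly chosen split whose children are both leaves (root is kept)."""
    candidates = sorted(n for n in active
                        if n != 0
                        and 2 * n + 1 not in active
                        and 2 * n + 2 not in active)
    if not candidates:
        return set(active)
    return set(active) - {rng.choice(candidates)}


rng = random.Random(0)
max_internal = 7       # shared weights exist for a tree of depth 3 (hypothetical)
base = {0, 1, 2}       # base tree of depth 2
grafted = graft(base, max_internal, rng)
pruned = prune(base, rng)
```

All three variants index into the same weight pool, so only the routing mask differs between them.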
3.2 Highway Networks and Stacked Ensembles
In order to enforce stacking as a highway network, the gate function is restricted to a weight of one dimension; that is, it is a single scalar that takes the same value for all inputs.
In this manner, the different perturbed decision trees are combined as a stacked ensemble. Using this information, the corresponding weights can be interpreted in the sense of Bayesian model selection.
The construction of such a network differs from the usual highway network in that the underlying distribution of the data does not alter how it is routed; instead, all instances are routed in the same way, which is precisely how stacking ensembles operate, as opposed to other ensemble methods. The main advantage of using simple stacking in this instance is the interpretation of the weights as posterior probabilities in the Bayesian model selection problem.
3.3 Model Architecture Selection
Finally the model architecture selection problem can be constructed through combining the two elements above. In this setting, at every iteration we would randomly select nodes to prune and grow. At the end of each iteration, we would perform weighted random sampling based on the model posterior probabilities.
After several iterations, we would expect the search to converge eventually to a particular depth and structure of decision tree. To facilitate this, we apply the slope annealing trick, in which the modified sampling weights are computed from the output of the highway network and a temperature. This is applied to the highway network weights so that the temperature is progressively reduced and the base model selected for perturbation becomes more deterministic towards the end.
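The effect of reducing the temperature can be sketched as follows. The exact functional form is not recoverable from the text, so this sketch assumes a temperature-scaled softmax over the stacking weights, a multiplicative discount per iteration, and hypothetical values for the scores, initial temperature and discount rate.

```python
import math
import random


def tempered_weights(scores, temperature):
    """Softmax of scores / temperature (assumed form); low temperature sharpens it."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_base_model(scores, temperature, rng):
    """Weighted random choice of the tree variant to perturb next."""
    probs = tempered_weights(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


scores = [0.2, 0.5, 0.3]  # hypothetical weights for the pruned, base and grafted trees
temperature = 1.0         # hypothetical initial temperature
discount = 0.5            # hypothetical discount rate per iteration
rng = random.Random(0)
history = []
for iteration in range(8):
    chosen = sample_base_model(scores, temperature, rng)
    history.append((temperature, chosen))
    temperature *= discount
```

Early iterations still explore all three variants, while late iterations almost always pick the variant with the largest weight.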
Furthermore, this can be extended to an ensemble approach by constructing such trees in parallel, leading to decision forest algorithms. In this scenario, each tree in the forest has its own set of parameters and induces a different tree at random. As the trees are induced separately and randomly, we may obtain a more diverse set of classifiers, each of which may optimize a different portion of the underlying space, leading to stronger results.
4 Experiments and Results
In our experiments we use ten publicly available datasets to demonstrate the efficacy of our approach. We used the training and test datasets where provided to compare performance, as shown in the table below. Where they were not provided, we performed a random split into training and testing sets. The following table reports the average and median, over all the datasets, of the relative improvement in log-loss over the respective baseline non-greedy decision tree.
|Model||Avg Impr.||Avg Impr.||Median Impr.||Median Impr.|
In all instances, we began training our decision tree with a depth of 5; the temperature was given an initial value and discounted at a fixed rate per iteration. Our decision forest was set to a fixed number of decision trees, combined through average voting. For all datasets we used the standard data preparation approach from the recipes R library, whereby we center, scale and remove near-zero-variance predictors from our datasets. All models, both the baseline and our algorithm, were built and trained using Python 3.6 running Keras and TensorFlow. In all models we use a decision tree of depth 5 as a starting point, with benchmark models trained for 200 epochs. With our automatically induced decision trees, we train our models for 10 iterations of 20 epochs each. We then further train the final selected models to fine-tune the selected architecture, with the results as shown.
From the results, we notice that both approaches improve over the baseline, in which the tree depth is fixed to 5. With further fine-tuning, it becomes apparent that the decision forest algorithm outperforms the vanilla decision tree approach. Even without fine-tuning, the forest approach is more robust in its performance on the test dataset, demonstrating the efficacy of our approach.
From the results above, and compared with other benchmark algorithms, we have demonstrated an approach for non-greedy decision trees to learn an ideal architecture through the use of sequential model optimization and Bayesian model selection. Through the ability to transfer learned weights effectively and to control the routing, we have demonstrated how we can concurrently train strong decision tree and decision forest algorithms whilst inducing the ideal neural network architecture.
C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
J. Chung, S. Ahn, and Y. Bengio, “Hierarchical multiscale recurrent neural networks,” in ICLR, 2017.
Y. Freund and R. E. Schapire, “A short introduction to boosting,” in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999, pp. 1401–1406.
D. I. Hastie and P. J. Green, “Model choice using reversible jump Markov chain Monte Carlo,” Statistica Neerlandica, vol. 66, no. 3, pp. 309–338, 2012.
F. Huang, J. T. Ash, J. Langford, and R. E. Schapire, “Learning deep ResNet blocks sequentially using boosting theory,” in International Conference on Machine Learning, vol. abs/1706.04964, 2018.
T. Le and B. Clarke, “On the interpretation of ensemble classifiers in terms of Bayes classifiers,” Journal of Classification, vol. 35, no. 2, pp. 198–229, 2018.
C. M. Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson, “Particle learning and smoothing,” Statistical Science, vol. 25, no. 1, pp. 88–106, 2010.
S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” J. Artif. Int. Res., vol. 2, no. 1, pp. 1–32, Aug. 1994.
M. Norouzi, M. Collins, M. A. Johnson, D. J. Fleet, and P. Kohli, “Efficient non-greedy optimization of decision trees,” in Advances in Neural Information Processing Systems, 2015.
P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò, “Deep neural decision forests,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 4190–4194.
A. Veit, M. J. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Advances in Neural Information Processing Systems, 2016.
D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.