1 Introduction
Decision trees are popular machine learning techniques applied to both classification and regression tasks. They are characterized by the resulting model, which is encoded as a tree structure. All nodes in a tree can be observed and understood, which is why decision trees are considered white boxes. In addition, the tree structure can return the output with considerably fewer computations than other, more complex, machine learning techniques. Classical decision tree algorithms include CART [Brei84] and C4.5 [Qui93]. These algorithms were later improved with boosted trees [Frie00] and extended to ensembles of trees with Random Forests [Brei01].
The first probabilistic approaches, also known as Bayesian Decision Trees, were introduced in [Bun92], [Chip98], and [Den98]. The first article proposed a deterministic algorithm, while the other two rely on Markov Chain Monte Carlo convergence. The main challenge is that the space of all possible tree structures is large and, as noted in [Chic96, Chic97], the search for an optimal decision tree given a scoring function is NP-hard.
In this article, we propose an algorithm similar to [Bun92] where we explicitly model the entire tree generation process, as opposed to only providing a probabilistic score of the possible partitions. This allows us to treat pruning probabilistically instead of relying on a heuristic algorithm.
We start in Section 2 with an overview of this article's Bayesian trees. Section 3 provides our building block, the partition probability space, followed by the construction of the tree probability space in Section 4. We present some numerical results in Section 5 showing that the greedy-modal Bayesian Decision Tree works well for various publicly available data sets. Even though we focus on the classification task in this article's evaluation section, the algorithm can equally be applied to regression tasks. Finally, some concluding remarks follow in Section 6.
2 Bayesian Trees Overview
We define a data set of independent observations. Points in describe the features of each observation, whose outcome is randomly sampled from .^1 The distribution of the outcome determines the type of problem we are solving: a discrete random variable translates into a classification problem, whereas a continuous random variable translates into a regression problem. The beta function

B(α_1, …, α_K) = Γ(α_1) ⋯ Γ(α_K) / Γ(α_1 + ⋯ + α_K),   (1)

where K is the number of different classes and Γ is the gamma function, will be especially useful to compute the likelihood for the classification examples in this article.

^1 When our data set contains non-ordinal categorical features, we extend the feature space with as many features as the number of categories minus one, and assign these features the values 0 or 1, where 1 is assigned to the column the category belongs to – akin to a dummy-variable approach. For instance, a data set would be transformed into .
The data set is sampled from a data generation process. We divide this process into two steps: first, a point is sampled in ; second, the outcome is sampled given . In this article we do not consider any prior knowledge of the generation of the locations . Hence, we will focus on the conditional distribution of the outcomes. This conditional distribution is assumed to be encoded in a tree structure created by a set of simple recursive rules, namely the tree generation process: starting at the root, we determine whether to expand the current node with two leaves under a predefined probability, which may depend on the current depth. If there is no expansion, the process terminates for this node; otherwise, we choose the dimension, among the d features, to which the split applies. Once the dimension is chosen, we assume the specific location of the split is distributed across all distinct points within the available range. After the split location is determined, the process continues recursively by deciding whether each new leaf is to be split again. When no node remains to be split, or each leaf contains only one distinct observation, the tree is finalized.
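As a concrete illustration, the generation process above can be sampled with a short recursion. This is an illustrative sketch only: the expansion probability `p_split` and the choice of drawing the split location among observed values are placeholder assumptions, not the paper's exact specification.

```python
import random

def sample_tree(points, depth=0, p_split=lambda d: 0.5 ** (d + 1)):
    """Sample one tree from a generative process like the one described
    above. `points` is a list of d-dimensional tuples; the result is a
    nested dict: either a leaf holding points, or a sprout with a split
    rule and two children."""
    if len(set(points)) <= 1 or random.random() >= p_split(depth):
        return {"leaf": points}                      # no expansion: terminal node
    dim = random.randrange(len(points[0]))           # choose the split dimension
    values = sorted({p[dim] for p in points})
    # Draw the split location among the distinct values in range
    # (all but the largest, so both children are non-empty).
    split = random.choice(values[:-1])
    lower = [p for p in points if p[dim] <= split]
    upper = [p for p in points if p[dim] > split]
    return {"dim": dim, "split": split,
            "lower": sample_tree(lower, depth + 1, p_split),
            "upper": sample_tree(upper, depth + 1, p_split)}
```

By construction, the leaves of any sampled tree partition the original points.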
Given the generating process above, it is clear that the building block of Bayesian Decision Trees is to consider partitions of the feature space that best explain the outcomes under a probabilistic approach. All points in the same set of a partition share the same outcome distribution, i.e. it does not depend on . Hence, assuming a prior for the distribution parameters of , we can obtain the likelihood of each set in a partition. Since all observations are assumed independent, the total partition likelihood is obtained by multiplying the likelihoods of each set. Figure 1 shows some partition likelihoods for points and categorical outcomes . From the likelihoods in Figure 1, we see that the most likely partition is (0(d)), i.e. . We do not imply, however, that the other partitions cannot occur.
Figure 1: partition likelihoods for the example points, with outcomes following a Bernoulli distribution and assuming a Beta prior.

Finally, given an additional prior on the partition space, we can obtain the probability of each partition given the data, i.e. . While is the probability of a partition, the information contained in each partition set is the posterior of the parameters. In Figure 1, the posterior distributions are Beta if , and Beta otherwise. This partition space is described in detail in Section 3.
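For the Bernoulli–Beta setting of Figure 1, the likelihood of one partition set has a closed form as a ratio of beta functions, so no numerical integration is needed. A minimal sketch (the prior parameters `a`, `b` and the function names are placeholders, not the paper's chosen values):

```python
from math import lgamma

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def set_log_likelihood(outcomes, a=1.0, b=1.0):
    """Log marginal likelihood of 0/1 outcomes in one partition set:
    the Bernoulli likelihood integrated against a Beta(a, b) prior,
    which equals B(a + n1, b + n0) / B(a, b)."""
    n1 = sum(outcomes)
    n0 = len(outcomes) - n1
    return log_beta(a + n1, b + n0) - log_beta(a, b)

def partition_log_likelihood(sets, a=1.0, b=1.0):
    """Independent observations: the partition log-likelihood is the
    sum of the per-set log marginal likelihoods."""
    return sum(set_log_likelihood(s, a, b) for s in sets)
```

With a uniform Beta(1, 1) prior, a single observation has marginal likelihood 1/2, and a pure pair of identical outcomes has 1/3, which is why homogeneous sets score higher than mixed ones.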
Given a non-trivial partition, we can create additional partition subspaces for each partition set. We can iterate the subpartitioning until the trivial partition is chosen or the set has only one observation. Thus, a specific choice of partitions with subpartitions can be represented as a tree. We include the trivial partition in each probability space so that trees can be trained without the need for a pruning step. Section 4 constructs the probability space of binary trees, i.e. those for which each non-trivial partition splits the data into exactly two sets.
3 Partition Probability Space
In this section we define the partition space, which is the building block for the Bayesian Decision Trees. Let Y = {y_1, …, y_n} be a data set with n independent observations whose locations x_1, …, x_n lie in R^d. A partition P of R^d divides it into K disjoint subsets Ω_1, …, Ω_K such that their union is R^d. As a consequence, the data set will be split into Y_1, …, Y_K, where the observation y_i belongs to Y_k if and only if x_i ∈ Ω_k. All outcomes sampled within region Ω_k follow the same distribution with probability measure p(· | θ_k) and parameters θ_k in Θ.
We assume a prior distribution p(θ) for the parameters θ. Then, the likelihood of our data given the prior is,

p(Y_k) = ∫_Θ ∏_{i ∈ I_k} p(y_i | θ) p(θ) dθ,   (2)
where I_k is the set of all indices i such that x_i is in Ω_k. The likelihood of our data given the partition P is,

p(Y | P) = ∏_{k=1}^{K} p(Y_k),   (3)
In addition, if we provide a prior probability measure p(P) for partitions over the partition space, the updated probability of a partition given our data is,

p(P | Y) = p(Y | P) p(P) / Σ_{P′} p(Y | P′) p(P′).   (4)
For practical reasons we will work with the non-normalized probabilities,

p̃(P | Y) = p(Y | P) p(P).   (5)
Finally, the posterior distribution of θ in each region Ω_k is p(θ | Y_k) ∝ p(θ) ∏_{i ∈ I_k} p(y_i | θ).
There are uncountably many possible partitions of R^d. To construct binary trees, we reduce the space of partitions to the trivial partition, and partitions of the form {x_j ≤ t} and {x_j > t} for some real t and some dimension j. This choice of partitions makes the algorithm invariant to feature dimension scaling. Furthermore, we note that any partition splitting between the same two neighboring observations will result in the same posteriors and likelihood of the data. For instance, all partitions of the form , in Figure 1 are equivalent. We borrow the following idea from Support Vector Machines (SVM) [Cort95]: from all partitions splitting two distinct neighboring observations, we only consider the maximum-margin classifier, i.e. the partition that splits at their midpoint. Hence, the partition space is finite by definition. In addition, with this reduction we can aggregate all probabilities of partitions along dimension j, i.e. , to evaluate the importance of each dimension. Figure 1 shows all possible partitions for the example in Section 2.

As an example, Figure 2 provides the log-probabilities of the partition space for a data set in whose outcomes are drawn from Bernoulli random variables. The samples are generated from two equally likely distributions, i.e. we draw from each distribution with probability 0.5. The first distribution is a multivariate Gaussian with mean and covariance . Points sampled from this distribution have a probability of 0.25 of being green. The second distribution is another multivariate Gaussian with mean and covariance . In this case, the probability of a sample being green is 0.75. Because the means of these Gaussian distributions are further apart along the axis, the most probable partitions are found in this dimension.

One natural choice of partition is the mode. In the Appendices, we provide Algorithm 1, which returns the modal partition. For the special case of the classification problem, we also provide Algorithm 2. To improve stability and efficiency, both algorithms work with the log-probabilities and assume we know the sorted indices of our points, namely such that for all and . In addition, Algorithm 2 assumes follows a multivariate Bernoulli distribution with outcomes in , and that the prior belongs to the Dirichlet family. The prior will be characterized by its parameters .
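To make the modal-partition search concrete, the sketch below scores every midpoint split in every dimension with log-probabilities and compares against the trivial partition, in the binary-outcome Beta case. The flat `log(0.5)` priors and Beta parameters are placeholder assumptions, and the naive rescan of the data at each candidate omits the sorted-index bookkeeping that Algorithms 1 and 2 use for efficiency:

```python
from math import lgamma, log

def log_marg(y, a=1.0, b=1.0):
    """log B(a + n1, b + n0) - log B(a, b): marginal likelihood of 0/1
    outcomes under a Beta(a, b) prior."""
    n1 = sum(y); n0 = len(y) - n1
    lb = lambda p, q: lgamma(p) + lgamma(q) - lgamma(p + q)
    return lb(a + n1, b + n0) - lb(a, b)

def find_modal_partition(X, y, a=1.0, b=1.0, log_prior_split=log(0.5)):
    """Return (dim, threshold) of the highest-scoring split, or None if
    the trivial partition scores highest. Candidate thresholds are the
    midpoints between distinct neighbouring values, as in Section 3."""
    best = (log(0.5) + log_marg(y, a, b), None)      # trivial partition
    for j in range(len(X[0])):
        vals = sorted(set(x[j] for x in X))
        for lo, hi in zip(vals, vals[1:]):
            t = 0.5 * (lo + hi)                      # maximum-margin midpoint
            y_lo = [yi for x, yi in zip(X, y) if x[j] <= t]
            y_hi = [yi for x, yi in zip(X, y) if x[j] > t]
            score = log_prior_split + log_marg(y_lo, a, b) + log_marg(y_hi, a, b)
            if score > best[0]:
                best = (score, (j, t))
    return best[1]
```

On four points with outcomes 0, 0, 1, 1 along one axis, the midpoint separating the two pure halves is the modal split, while a homogeneous set keeps the trivial partition.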
4 Bayesian Decision Trees
Trees are directed acyclic graphs with a single root, in which every pair of connected nodes is joined by a unique path. We classify the nodes as either leaves or sprouts. While leaves are terminal nodes containing the model information, sprouts point to additional child nodes. If the number of child nodes is always two, we call the tree a binary tree. Each sprout contains a question or rule whose answer leads to one of its children. Starting from the root, which is the only node without a parent, we follow the tree path until we reach a leaf. Figure 3 shows an example of a binary tree.
The partition space from Section 3 is the preamble to constructing the Bayesian trees. Each partition from can be identified with a tree node. If is the trivial partition, the node becomes a leaf; otherwise, a sprout. By construction, non-trivial partitions will only have two subsets of : the lower subset , and the upper subset . If we choose a non-trivial partition, we can construct additional partition spaces on the lower subset and on the upper subset. We can repeat this process until we choose all trivial partitions, i.e. leaves, as per equation (6). The leaves will contain the posterior distributions, as shown in Figure 4, and the non-normalized probability of the tree will be,
p̃(T | Y) = ∏_{s ∈ T} p̃(P_s | Y_s),   (6)

where the product runs over the nodes s of the tree, P_s is the partition chosen at node s, and Y_s is the data reaching that node.
This probability will be normalized over the set of trees we construct. Note that when the data set contains fewer than two distinct observations, .
Figure 4: leaf posterior distributions, assuming that outcomes are generated from Bernoulli random variables and the conjugate prior is Beta.

The total number of partition spaces and trees is finite but large, and increases exponentially with the number of points. To illustrate this, assume that we have a data set with n observations, all with different coordinate values in each dimension. If R is the total number of partition spaces,
(7) 
The number of trees is larger than the number of partition spaces. Given the same assumptions as in (7), we do not explicitly compute the total number of possible trees as we did for R, but we can bound it from above by
(8) 
Clearly, exploring all partition spaces is not practical. For instance, if we have a data set of observations in , and assuming all observations have different values in each coordinate, the number of partition spaces is close to one trillion. Although the number of trees is even larger than the number of partition spaces, a considerable number of these trees will have a negligible probability. If we account for a subset of the total number of trees, we can renormalize their probabilities, which is equivalent to conditioning their probabilities on this subset of trees. The resulting normalized trees and probabilities will form our tree probability space.
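Renormalizing over an explored subset of trees is a standard log-sum-exp computation; a minimal sketch follows, with the usual max-subtraction since tree log-probabilities are typically large negative numbers:

```python
from math import exp

def normalize_log_probs(log_probs):
    """Condition on an explored subset of trees: turn their non-normalized
    log-probabilities into probabilities that sum to one. Subtracting the
    maximum first keeps exp() from underflowing to zero."""
    m = max(log_probs)
    weights = [exp(lp - m) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]
```

This conditioning is exact on the chosen subset: probabilities of trees outside it are simply treated as zero.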
We can construct a tree and obtain its non-normalized probability at the same time; see Algorithm 3 in the Appendices. Within this algorithm, the method choose_partition is not specified and should contain the search logic to choose the children partitions. The cost of Algorithm 3 is where is the cost of choose_partition. The main problem is to construct trees that have high probability and are structurally different. To start with, a particularly interesting Bayesian Decision Tree is the one obtained by choosing the modal partition at each step. We will call it the greedy-modal tree (GMT). This tree can be constructed by specifying choose_partition to be find_modal_partition from Algorithms 1 or 2. If we choose Algorithm 2 as the choose_partition, the average cost of Algorithm 3 becomes . If we want to add more trees that are highly likely but structurally different, we can construct them as we do for the greedy-modal tree but by choosing different roots.
Algorithms 1 and 2 only look one level ahead. As suggested in [Bun92], we could optimize the construction by looking several levels ahead at the expense of increasing the order of . However, Section 5 shows that the GMT constructed with Algorithm 2 performs well in practice.
When we query a point in the tree space, the answer is the posterior . Nonetheless, we can also return an expected value. In the classification problems from Section 5, we return the tree-weighted average of .
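Putting the pieces together, a compact sketch of the greedy-modal construction and of querying a point follows. It specializes to binary outcomes with a Beta prior, and the flat `log(0.5)` partition priors are placeholders; it illustrates the recursion rather than reproducing Algorithms 1–3 exactly:

```python
from math import lgamma, log

def log_marg(y, a=1.0, b=1.0):
    """Marginal likelihood of 0/1 outcomes under a Beta(a, b) prior."""
    n1 = sum(y); n0 = len(y) - n1
    lb = lambda p, q: lgamma(p) + lgamma(q) - lgamma(p + q)
    return lb(a + n1, b + n0) - lb(a, b)

def modal_split(X, y, a=1.0, b=1.0):
    """Best (score, dim, threshold) over midpoint splits, or
    (score, None, None) if the trivial partition wins."""
    best = (log(0.5) + log_marg(y, a, b), None, None)
    for j in range(len(X[0])):
        vals = sorted(set(x[j] for x in X))
        for lo, hi in zip(vals, vals[1:]):
            t = 0.5 * (lo + hi)
            yl = [yi for x, yi in zip(X, y) if x[j] <= t]
            yu = [yi for x, yi in zip(X, y) if x[j] > t]
            s = log(0.5) + log_marg(yl, a, b) + log_marg(yu, a, b)
            if s > best[0]:
                best = (s, j, t)
    return best

def build_gmt(X, y, a=1.0, b=1.0):
    """Greedy-modal tree: recursively take the modal partition until the
    trivial partition is chosen. Leaves store the Beta posterior."""
    _, j, t = modal_split(X, y, a, b)
    if j is None:
        return {"posterior": (a + sum(y), b + len(y) - sum(y))}
    lower = [(x, yi) for x, yi in zip(X, y) if x[j] <= t]
    upper = [(x, yi) for x, yi in zip(X, y) if x[j] > t]
    return {"dim": j, "split": t,
            "lower": build_gmt([x for x, _ in lower], [yi for _, yi in lower], a, b),
            "upper": build_gmt([x for x, _ in upper], [yi for _, yi in upper], a, b)}

def predict(tree, x):
    """Posterior mean P(y = 1) at the leaf reached by x."""
    while "posterior" not in tree:
        tree = tree["lower"] if x[tree["dim"]] <= tree["split"] else tree["upper"]
    a, b = tree["posterior"]
    return a / (a + b)
```

Note that `predict` returns the posterior mean at a single leaf of a single tree; averaging several trees weighted by their normalized probabilities yields the tree-weighted output described above.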
5 Numerical Examples
In this section we apply Algorithms 2 and 3 to construct the GMT. We assume that the outcomes, 0 or 1, are drawn from Bernoulli random variables. The prior distribution is set to Beta and each tree returns the expected probability of drawing the outcome , namely . The prior probabilities for each partition will be if is the trivial partition, and otherwise, where is the number of non-trivial partitions along the dimension, and is the depth at which the partition space lies. Note that dividing the non-trivial partition prior probability by is implicitly assuming a uniform prior distribution over the dimension space. One could also include in the analysis the posterior distributions of each dimension to visualize which features are most informative. As an alternative to this prior, we could use the partition distance weighted approach from [Bun92]. These are the default settings, which we apply to all of the data sets studied here.
The accuracy is measured as the percentage of correct predictions. Each prediction is simply the maximum-probability outcome; if there is a tie, we choose category 0 by default. We compare the results to Decision Trees (DT) [Brei84] and Random Forests (RF) [Brei01] computed with the DecisionTreeClassifier and RandomForestRegressor objects from the Python scikit-learn package [scikit-learn]. For reproducibility purposes, we set all random seeds to 0. In the case of RF, we enable bootstrapping to improve its performance and fix the number of trees to 5. The GMT results are generated with Java, although we also provide a Python module with integration into scikit-learn in [GMT19].
5.1 UCI Data Sets
We test the GMT on some data sets from the University of California, Irvine (UCI) database [Dua17]. We compute the accuracy of the DT, RF, and GMT. Except for the Ripley set, where a test set with 1 000 points is provided, we apply a shuffled 10-fold cross-validation. Accuracy results are shown in Table 1 and training times in Table 2.
Table 1: Accuracy.

Data set    d   n          DT     RF     GMT    GMT − RF
Heart      20      270     76.3%  78.5%  83.0%   +4.5%
Credit     23   30 000     72.6%  78.1%  82.0%   +3.9%
Haberman    3      306     65.0%  68.3%  71.9%   +3.6%
Seismic    18    2 584     87.7%  91.5%  93.2%   +1.7%
Ripley      2  250/1 000   83.8%  87.9%  87.6%   −0.3%
Gamma      10   19 020     81.4%  85.6%  85.2%   −0.4%
Diabetic   19    1 151     62.6%  64.8%  63.5%   −1.3%
EEG        14   14 980     84.0%  88.6%  81.2%   −7.4%
The results reveal some interesting properties of the GMT. Noticeably, the GMT seems to perform well in general, with a training time between the DT and RF times. In all cases, the DT accuracy is lower than the RF accuracy. The cases in which RF outperforms the GMT are the Ripley, Gamma, Diabetic, and EEG data sets. The accuracy difference between GMT and RF may indicate that the assumptions behind the two models differ. Furthermore, the fixed Beta prior might not be optimal for some of the data sets. The GMT performs worst on the EEG data set relative to both DT and RF. One reason may be that some information is hidden in lower levels, i.e. information that cannot be extracted by selecting the modal partition at each level.
Table 2: Training time (ms).

Data set     DT      RF      GMT
Heart         0.0     4.7     3.8
Credit      521.0  1388.4   823.2
Haberman      0.0     4.7     0.6
Seismic      10.9    28.1    18.9
Ripley        0.0     1.6     0.0
Gamma       254.3   634.9   252.1
Diabetic      7.8    31.2    10.6
EEG         115.4   338.5   271.4
5.2 Smoothing
In some cases, the nature of the problem warrants some smoothness in the solution, i.e. we do not desire abrupt changes in the posterior distributions between small neighboring regions. Intuitively, when we zoom in on a region, we expect the parameters of to be more certain. Hence, we can reduce the variance of the prior distribution at each split. In particular, for the examples in this section, we propose to modify the prior distribution as follows: let Beta be the prior distribution at one partition space. For a specific non-trivial partition in this space, define the total number of samples of each category in the lower set, and the total number of samples of each category in the upper set. The prior distributions that we further use in the lower and upper sets are Beta and Beta respectively, where is a proportion of the total number of samples. If we choose , we return to the original formulation with constant prior. Figure 5 displays the effects of the smoothing on the GMT.

6 Discussion and Future Work
The proposed GMT is a Bayesian Decision Tree that reduces training time by avoiding both Markov Chain Monte Carlo sampling and a pruning step. In our numerical examples, the GMT shows predictive power similar to RF. This approach may be most useful where the ability to explain the model is a requirement, as it has the same interpretability as the DT but performance similar to RF. Hence, the advantages of the GMT are that it can be easily understood and takes less time than RF to train. Furthermore, the ability to specify a prior probability may be particularly suitable for some problems. It still remains to find a more efficient way to explore meaningful trees and improve performance.
As an extension, we would like to assess the performance of the algorithm on regression problems and experiment with larger partition spaces such as the SVM hyperplanes. Another computational advantage not yet explored is parallelization, which would allow for a more exhaustive exploration of the tree probability space. Finally, the smoothing concept has only been briefly introduced: despite its intuitive appeal, its theoretical foundations remain to be defined.
References
[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
[3] W. Buntine. Learning classification trees. Statistics and Computing, 2(2):63–73, Jun 1992.
[4] D. M. Chickering. Learning Bayesian Networks is NP-Complete, pages 121–130. Springer New York, New York, NY, 1996.
[5] D. M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI'97, pages 80–89, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[6] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.
[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
[8] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363–377, 1998.
[9] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[10] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[12] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[13] K. Thommen, B. Goswami, and A. I. Cross. Bayesian decision tree. https://github.com/UBSIB/bayesian_tree, 2019.