
Efficient Bayesian Decision Tree Algorithm

01/10/2019
by   Giuseppe Nuti, et al.
UBS

Bayesian Decision Trees are known for their probabilistic interpretability. However, their construction can sometimes be costly. In this article we present a general Bayesian Decision Tree algorithm applicable to both regression and classification problems. The algorithm does not rely on Markov Chain Monte Carlo sampling and does not require a pruning step. While it is possible to construct a full, weighted tree probability space, we find that one particular tree, the greedy-modal tree (GMT), explains most of the information contained in the numerical examples. In these examples, the approach performs similarly to Random Forests.


1 Introduction

Decision trees are popular machine learning techniques applied to both classification and regression tasks. They are characterized by the resulting model, which is encoded as a tree structure. All nodes in a tree can be observed and understood, so decision trees are considered white boxes. In addition, the tree structure can return the output with considerably fewer computations than other, more complex machine learning techniques. Classical decision tree algorithms include CART Brei84 and C4.5 Qui93. These algorithms were later improved with boosted trees Frie00 and extended to ensembles of trees with Random Forests Brei01.

The first probabilistic approaches, also known as Bayesian Decision Trees, were introduced in Bun92 , Chip98 , and Den98 . The first article proposed a deterministic algorithm while the other two are based on Markov Chain Monte Carlo convergence. The main challenge is that the space of all possible tree structures is large, and as noted in Chic96 , Chic97 , the search for an optimal decision tree given a scoring function is NP-hard.

In this article, we propose an algorithm similar to Bun92 where we explicitly model the entire tree generation process, as opposed to only providing a probabilistic score of the possible partitions. This allows us to view the pruning aspect probabilistically instead of relying on a heuristic algorithm.

We start in Section 2 with an overview of the article's Bayesian trees. Section 3 provides our building block, the partition probability space, followed by the construction of the tree probability space in Section 4. We present numerical results in Section 5, showing that the greedy-modal Bayesian Decision Tree works well for various publicly available data sets. Even though we focus on the classification task in this article's evaluation section, the algorithm can equivalently be applied to regression tasks. Finally, some concluding remarks follow in Section 6.

2 Bayesian Trees Overview

We define a data set of independent observations. Each point $x \in \mathbb{R}^d$ describes the features¹ of an observation whose outcome $y$ is randomly sampled from the conditional distribution $Y \mid x$. The distribution of $Y$ determines the type of problem we are solving: a discrete random variable translates into a classification problem, whereas a continuous random variable translates into a regression problem. The beta function will be especially useful to compute the likelihood for the classification examples in this article,

B(\alpha_1, \dots, \alpha_K) = \frac{\prod_{c=1}^{K} \Gamma(\alpha_c)}{\Gamma\left(\sum_{c=1}^{K} \alpha_c\right)},    (1)

where $K$ is the number of different classes and $\Gamma$ is the gamma function.

¹ When our data set contains non-ordinal categorical features, we extend the feature space with as many features as the number of categories minus one, and assign these features the values 0 or 1, where 1 is assigned to the column the category belongs to, akin to a dummy-variable approach. For instance, a categorical feature taking three values would be encoded as two 0/1 columns.

The data set is sampled from a data generation process. We divide this process into two steps: first, a point $x$ is sampled in $\mathbb{R}^d$; second, the outcome $y$ is sampled given $x$. In this article we do not consider any prior knowledge of the generation of the locations $x$; hence, we focus on the distribution of $Y \mid x$. This conditional distribution is assumed to be encoded under a tree structure created following a set of simple recursive rules based on $x$, namely the tree generation process: starting at the root, we determine whether to expand the current node with two leaves under a predefined probability, which may depend on the current depth. If there is no expansion, the process terminates for this node; otherwise we choose which of the $d$ dimensions the split applies to. Once the dimension is chosen, we assume that the specific location of the split is distributed over the distinct points within the available range. After the split location is determined, the process continues recursively by again deciding whether each new leaf is to be split. When no node is left to split, or every leaf contains only one distinct observation, the tree is finalized. A minimal sketch of this process is given below.
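The following Python sketch simulates this generative process under assumed choices: the expansion probability 0.9^(depth+1), the uniform choice of dimension, and the uniform choice of split location among distinct points are illustrative settings, not the exact ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_tree(X, depth=0):
    """Sketch of the tree generation process. X is an (n, d) array of feature
    locations reaching this node; the constants below are illustrative only."""
    # Terminate if we decide not to expand, or if no further split is possible.
    if len(np.unique(X, axis=0)) < 2 or rng.random() > 0.9 ** (depth + 1):
        return {"kind": "leaf"}                      # outcome distribution lives here
    dim = int(rng.integers(X.shape[1]))              # choose the split dimension
    values = np.unique(X[:, dim])
    if len(values) < 2:
        return {"kind": "leaf"}
    split = float(rng.choice(values[:-1]))           # split among the distinct points
    lower, upper = X[X[:, dim] <= split], X[X[:, dim] > split]
    return {"kind": "sprout", "dim": dim, "split": split,
            "lower": generate_tree(lower, depth + 1),
            "upper": generate_tree(upper, depth + 1)}

example_tree = generate_tree(rng.normal(size=(50, 2)))
```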

Given the generating process above, it is clear that the building block of Bayesian Decision Trees is to consider partitions of the feature space that better explain the outcomes under a probabilistic approach. All points in the same set of a partition share the same outcome distribution, i.e. the distribution of $Y \mid x$ does not depend on the particular $x$ within the set. Hence, assuming a prior for the distribution parameters of $Y$, we can obtain the likelihood of each set in a partition. Since all observations are assumed independent, the total partition likelihood is obtained by multiplying the likelihoods of each set. Figure 1 shows some partition likelihoods for a small set of points with categorical outcomes. From the likelihoods in Figure 1, we see that the most likely partition is the one in panel (d). We do not imply, however, that the other partitions cannot occur.

Figure 1 (panels (a)–(e)): Likelihoods of our data set given some partitions, with outcomes following a Bernoulli distribution and assuming a beta prior.

Finally, given an additional prior on the partition space, we can obtain the probability of each partition given the data. While this is the probability of a partition, the information contained in each partition set is the posterior of the parameters. In Figure 1, the posterior distribution in each set is a beta distribution whose parameters are updated with the counts of each outcome observed in that set. This partition space is described in detail in Section 3.
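As a concrete illustration of the Bernoulli/beta setting of Figure 1, the sketch below computes the marginal likelihood of a set of 0/1 outcomes and the likelihood of a partition; the Beta(1, 1) prior and the function names are assumptions made for the example.

```python
import numpy as np
from scipy.special import betaln   # log of the beta function B(a, b)

def log_marginal_likelihood(outcomes, a=1.0, b=1.0):
    """Log marginal likelihood of 0/1 outcomes under a Beta(a, b) prior on the
    Bernoulli parameter: B(a + n1, b + n0) / B(a, b)."""
    n1 = int(np.sum(outcomes))
    n0 = len(outcomes) - n1
    return betaln(a + n1, b + n0) - betaln(a, b)

def log_partition_score(subsets, a=1.0, b=1.0):
    """Observations are independent, so the partition likelihood is the product
    of the per-set marginal likelihoods (a sum in log space)."""
    return sum(log_marginal_likelihood(s, a, b) for s in subsets)

# Trivial partition vs. a split that separates the two outcome groups.
y = np.array([0, 0, 0, 1, 1, 1])
print(log_partition_score([y]))             # one set containing everything
print(log_partition_score([y[:3], y[3:]]))  # lower and upper sets
# The posterior in each set is Beta(a + n1, b + n0), here Beta(1, 4) and Beta(4, 1).
```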

Given a non-trivial partition, we can create an additional partition sub-space for each partition set. We can iterate this sub-partitioning until the trivial partition is chosen or the set has only one observation. Thus, a specific choice of partitions and sub-partitions can be represented as a tree. We include the trivial partition in each probability space so that the trees can be trained without the need for a pruning step. Section 4 constructs the probability space of binary trees, i.e. those for which each non-trivial partition splits the data into exactly two sets.

3 Partition Probability Space

In this section we define the partition space, which is the building block for Bayesian Decision Trees. Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be a data set with $N$ independent observations, $x_i \in \mathbb{R}^d$. A partition $\mathcal{P} = \{\Omega_1, \dots, \Omega_K\}$ of $\mathbb{R}^d$ divides $\mathbb{R}^d$ into disjoint subsets such that $\bigcup_{k=1}^{K} \Omega_k = \mathbb{R}^d$. As a consequence, the data set is split into $D_1, \dots, D_K$, where observation $(x_i, y_i)$ belongs to $D_k$ if and only if $x_i \in \Omega_k$. All outcomes sampled within region $\Omega_k$ are assumed to follow the same distribution, with probability measure $p(\cdot \mid \theta_k)$ and parameters $\theta_k$ in a parameter space $\Theta$.

We assume a prior distribution $\pi(\theta)$ for the parameters $\theta$. Then, the likelihood of the data in region $\Omega_k$ given the prior is

p(D_k) = \int_{\Theta} \prod_{i \in I_k} p(y_i \mid \theta) \, \pi(\theta) \, d\theta,    (2)

where $I_k$ is the set of all indices $i$ such that $x_i$ is in $\Omega_k$. The likelihood of our data given the partition $\mathcal{P}$ is

p(D \mid \mathcal{P}) = \prod_{k=1}^{K} p(D_k).    (3)

In addition, if we provide a prior probability measure $p(\mathcal{P})$ over the partition space, the updated probability of a partition given our data is

p(\mathcal{P} \mid D) = \frac{p(D \mid \mathcal{P}) \, p(\mathcal{P})}{\sum_{\mathcal{P}'} p(D \mid \mathcal{P}') \, p(\mathcal{P}')}.    (4)

For practical reasons we will work with the non-normalized probabilities

\tilde{p}(\mathcal{P} \mid D) = p(D \mid \mathcal{P}) \, p(\mathcal{P}).    (5)

Finally, the posterior distribution of $\theta$ in each region $\Omega_k$ is proportional to $\pi(\theta) \prod_{i \in I_k} p(y_i \mid \theta)$.

There are uncountably many possible partitions of $\mathbb{R}^d$. To construct binary trees, we reduce the space of partitions to the trivial partition $\{\mathbb{R}^d\}$ and to partitions of the form $\{x : x_j \le s\} \cup \{x : x_j > s\}$ for some real $s$ and some dimension $j$. This choice of partitions makes the algorithm invariant to feature dimension scaling. Furthermore, we note that any two partitions splitting between the same pair of neighboring observations result in the same posteriors and the same likelihood of the data; for instance, in Figure 1, any split placed between the same two neighboring points yields an equivalent partition. We borrow the following idea from Support Vector Machines (SVM) Cort95: from all partitions splitting two distinct neighboring observations, we only consider the maximum margin classifier, i.e. the partition that splits at their mid-point. Hence, the partition space is finite by definition. In addition, with this reduction we can aggregate the probabilities of all partitions along dimension $j$ to evaluate the importance of each dimension. Figure 1 shows all possible partitions for the example in Section 2.
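A minimal sketch of this finite partition space, enumerating one candidate split per mid-point between distinct neighbouring values in each dimension (the function name and example data are illustrative):

```python
import numpy as np

def candidate_splits(X):
    """Enumerate the finite partition space described above: one candidate split
    at the mid-point between every pair of distinct neighbouring values,
    for every dimension."""
    splits = []
    for dim in range(X.shape[1]):
        values = np.unique(X[:, dim])                # sorted distinct values
        for s in 0.5 * (values[:-1] + values[1:]):   # maximum-margin mid-points
            splits.append((dim, float(s)))
    return splits

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
print(candidate_splits(X))   # [(0, 0.5), (0, 1.5), (1, 2.0)]
```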

As an example, Figure 2 provides the log-probabilities of the partition space for a two-dimensional data set whose outcomes are drawn from Bernoulli random variables. The samples are generated from two equally likely distributions, i.e. we draw from each distribution with probability 0.5. The first distribution is a multivariate Gaussian; points sampled from it have a probability of 0.25 of being green. The second distribution is another multivariate Gaussian, and in this case the probability of a sample being green is 0.75. Because the means of these Gaussian distributions are further apart along one of the axes, the most probable partitions are found along that dimension.

One natural choice of partition is the mode. In the Appendices, we provide Algorithm 1, which returns the modal partition. For the special case of the classification problem, we also provide Algorithm 2. To improve stability and efficiency, both algorithms work with log-probabilities and assume that the sorted indices of the points along each dimension are known. In addition, Algorithm 2 assumes that the outcomes follow a multivariate Bernoulli distribution and that the prior belongs to the Dirichlet family, characterized by its parameters. A naive version of this search is sketched below.
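The following sketch captures the spirit of Algorithms 1 and 2 for binary outcomes, without their efficiency tricks: it scores the trivial partition and every candidate split, reusing the log_marginal_likelihood and candidate_splits helpers sketched earlier. The prior p_trivial = 0.5 and the uniform prior over splits are assumptions of this sketch.

```python
import numpy as np

def find_modal_partition(X, y, a=1.0, b=1.0, p_trivial=0.5):
    """Return the best (log-probability, split) pair among the trivial partition
    and all mid-point splits, where split is None for the trivial partition."""
    splits = candidate_splits(X)
    if not splits:                   # fewer than two distinct points: must be a leaf
        return log_marginal_likelihood(y, a, b), None
    best_logp = np.log(p_trivial) + log_marginal_likelihood(y, a, b)
    best_split = None
    log_prior_split = np.log(1.0 - p_trivial) - np.log(len(splits))
    for dim, s in splits:
        lower = X[:, dim] <= s
        logp = (log_prior_split
                + log_marginal_likelihood(y[lower], a, b)
                + log_marginal_likelihood(y[~lower], a, b))
        if logp > best_logp:
            best_logp, best_split = logp, (dim, s)
    return best_logp, best_split
```

Algorithm 2 achieves a lower cost by sweeping the sorted indices once per dimension and updating the category counts incrementally, whereas this naive version recomputes the counts for every candidate split.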

Figure 2: Example of a data set whose outcomes are either green or red. The location of the points is sampled from a mixture of two Gaussian distributions with equal probability. One distribution draws outcomes from a Bernoulli distribution with probability 0.25, while the other uses probability 0.75. Figure 2(a): log-probabilities of all possible non-trivial partitions given the data set. Figure 2(b): actual probability of a data point being green, in addition to the modal split of each dimension.

4 Bayesian Decision Trees

Trees are directed acyclic graphs with a single root in which any two nodes are connected by a unique path. We classify the nodes as either leaves or sprouts. While leaves are terminal nodes containing the model information, sprouts point to additional child nodes. If the number of child nodes is always two, we call the tree a binary tree. Each sprout contains a question or rule whose answer leads to one of its children. Starting from the root, which is the only node without a parent, we follow the tree path until we reach a leaf. Figure 3 shows an example of a binary tree.

Figure 3: Example of a binary tree. Leaves are marked in grey and sprouts in yellow. The sprouts ask questions such as "Raining?" and "Cloudy?" or compare rainfall measurements against thresholds, and the leaves give the probability of rain in one hour (ranging from 10% to 90%).

The partition space from Section 3 is the preamble to constructing the Bayesian trees. Each partition from the space can be identified with a tree node. If the chosen partition is the trivial one, the node becomes a leaf, otherwise a sprout. By construction, a non-trivial partition has only two subsets of $\mathbb{R}^d$: the lower subset and the upper subset. If we choose a non-trivial partition, we can construct an additional partition space on each of the two subsets. We can repeat this process until we choose all trivial partitions, i.e. leaves. The leaves will contain the posterior distributions of the parameters, as shown in Figure 4, and the non-normalized probability of the tree will be

\tilde{p}(T) = \prod_{v \in \mathrm{nodes}(T)} p(\mathcal{P}_v) \prod_{\ell \in \mathrm{leaves}(T)} p(D_\ell),    (6)

where $\mathcal{P}_v$ is the partition chosen at node $v$ (the trivial partition for leaves) and $p(D_\ell)$ is the marginal likelihood (2) of the data reaching leaf $\ell$.

This probability will be normalized over the set of trees we construct. Note that when a data set contains fewer than two distinct observations, the trivial partition is the only available choice, so the corresponding node is necessarily a leaf.

Figure 4: Example of a tree for a data set with two categories. Figure 4(a): the data set is displayed three times; the first layer corresponds to the data set before any split, the second layer displays the lower and upper sets resulting from a split along one dimension, and the third layer is an additional split of the upper subset along another dimension. Figure 4(b): resulting tree with beta posterior distributions at the leaves, assuming that outcomes are generated from Bernoulli random variables and the conjugate prior is a beta distribution.

The total number of partition spaces and trees is finite but large, and it increases exponentially with the number of points. To illustrate this, assume that we have a data set in which all observations have different coordinate values in each dimension. The total number of partition spaces then satisfies the recursion in (7).

(7)

The number of trees is larger than the number of partition spaces. Given the same assumptions as in (7), we do not compute the total number of possible trees explicitly as we did for the partition spaces, but we can bound it from above as in (8).

(8)

Clearly, exploring all partition spaces is not practical: even for a small data set in which all observations have different values on each coordinate, the number of partition spaces can be close to one trillion. Although the number of trees is even larger than the number of partition spaces, a considerable number of these trees will have a negligible probability. If we account for only a subset of all trees, we can re-normalize their probabilities, which is equivalent to conditioning on this subset. The resulting trees and normalized probabilities will form our tree probability space.

We can construct a tree and obtain its non-normalized log-probability at the same time; see Algorithm 3 in the Appendices. Within this algorithm, the method choose_partition is not specified and should contain the search logic to choose the children partitions; the cost of Algorithm 3 is driven by the cost of choose_partition. The main problem is to construct the trees whose probability is highest and that are structurally different. To start with, a particularly interesting Bayesian Decision Tree is the one obtained by choosing the modal partition at each step. We will call it the greedy-modal tree (GMT). This tree can be constructed by specifying choose_partition to be find_modal_partition from Algorithm 1 or 2; choosing Algorithm 2 reduces the average cost of Algorithm 3 accordingly. If we want to add more trees that are highly likely but structurally different, we can construct them as we do for the greedy-modal tree but by choosing different roots.
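A compact sketch of the greedy-modal construction, reusing the find_modal_partition and marginal-likelihood helpers from the earlier sketches; the node dictionaries and the way the log-probability is accumulated (split priors at sprouts, marginal likelihoods at leaves) are assumptions of this sketch rather than the paper's exact Algorithm 3.

```python
import numpy as np

def fill_tree(X, y, a=1.0, b=1.0):
    """Greedy-modal tree: keep the modal partition at every node and return the
    tree together with its non-normalized log-probability."""
    logp, split = find_modal_partition(X, y, a, b)
    if split is None:                                   # trivial partition: a leaf
        n1 = int(np.sum(y)); n0 = len(y) - n1
        return {"kind": "leaf", "posterior": (a + n1, b + n0)}, logp
    dim, s = split
    lower = X[:, dim] <= s
    left, logp_left = fill_tree(X[lower], y[lower], a, b)
    right, logp_right = fill_tree(X[~lower], y[~lower], a, b)
    node = {"kind": "sprout", "dim": dim, "split": s, "lower": left, "upper": right}
    # Keep only the split prior at this node: the likelihood terms inside logp are
    # re-accounted for (and refined) by the recursive calls on the two children.
    log_split_prior = logp - (log_marginal_likelihood(y[lower], a, b)
                              + log_marginal_likelihood(y[~lower], a, b))
    return node, log_split_prior + logp_left + logp_right
```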

Algorithms 1 and 2 only look one level ahead. As suggested in Bun92, we could optimize the construction by looking several levels ahead at the expense of a higher computational cost. However, Section 5 shows that the GMT constructed with Algorithm 2 performs well in practice.

When we query a point in the tree space, the answer is the posterior distribution stored at the leaf reached by that point in each tree. Nonetheless, we can also return an expected value. In the classification problems from Section 5, we return the tree-probability-weighted average of the expected outcome probability.
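A sketch of the querying step, assuming the node dictionaries produced by the fill_tree sketch above; the weighting by normalized tree probabilities mirrors the description in the text.

```python
import numpy as np

def predict_proba(node, x):
    """Follow the tree path for a query point x and return the posterior mean of
    the probability of outcome 1 (the Beta posterior stored at the leaf)."""
    while node["kind"] == "sprout":
        node = node["lower"] if x[node["dim"]] <= node["split"] else node["upper"]
    a, b = node["posterior"]
    return a / (a + b)                       # expected value of Beta(a, b)

def ensemble_predict_proba(trees, log_probs, x):
    """Tree-probability-weighted average over several trees; the weights come
    from normalizing the non-normalized log-probabilities."""
    w = np.exp(np.array(log_probs) - np.max(log_probs))
    w /= w.sum()
    return float(np.dot(w, [predict_proba(t, x) for t in trees]))
```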

5 Numerical Examples

In this section we apply Algorithms 2 and 3 to construct the GMT. We assume that the outcomes, 0 or 1, are drawn from Bernoulli random variables. The prior distribution is set to a beta distribution, and each tree returns the expected probability of drawing outcome 0. The prior probability of each partition distinguishes the trivial partition from the non-trivial ones: the non-trivial partitions share the remaining prior mass, divided equally across dimensions and across the non-trivial partitions along each dimension, and the split probability depends on the depth at which the partition space lies. Note that dividing the non-trivial partition prior probability equally across dimensions implicitly assumes a uniform prior distribution over the dimension space. One could also include in the analysis the posterior distributions of each dimension to visualize which features are most informative. As an alternative to this prior, we could use the partition-distance-weighted approach from Bun92. These are the default settings, which we apply to all of the data sets studied here; an illustrative version of such a prior is sketched below.
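An illustrative depth-dependent prior of the kind described above; the constants p_split0 and decay, and the exact functional form, are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def log_partition_prior(is_trivial, depth, n_splits_in_dim, n_dims,
                        p_split0=0.9, decay=0.5):
    """Illustrative prior: the probability of a non-trivial partition shrinks with
    depth, and is shared uniformly across dimensions and across the candidate
    splits within the chosen dimension."""
    p_split = p_split0 * decay ** depth        # probability of splitting at this depth
    if is_trivial:
        return np.log(1.0 - p_split)
    return np.log(p_split) - np.log(n_dims) - np.log(n_splits_in_dim)
```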

The accuracy is measured as the percentage of correct predictions. Each prediction is simply the maximum-probability outcome; if there is a tie, we choose category 0 by default. We compare the results to Decision Trees (DT) Brei84 and Random Forests (RF) Brei01, computed with the DecisionTreeClassifier and RandomForestRegressor objects from the Python scikit-learn package scikit-learn. For reproducibility purposes, we set all random seeds to 0. In the case of RF, we enable bootstrapping to improve its performance and fix the number of trees to 5. The GMT results are generated with a Java implementation, although we also provide a Python module with scikit-learn integration in GMT19.
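For reference, a sketch of the cross-validation protocol and the scikit-learn baselines on synthetic stand-in data; a RandomForestClassifier is used here for simplicity, and the synthetic data generation is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def cv_accuracy(make_model, X, y, folds=10, seed=0):
    """Shuffled 10-fold cross-validation accuracy, mirroring the protocol above."""
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train, test in kf.split(X):
        model = make_model()
        model.fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(scores))

# Synthetic stand-in data; replace with a UCI data set in practice.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

dt_acc = cv_accuracy(lambda: DecisionTreeClassifier(random_state=0), X, y)
rf_acc = cv_accuracy(lambda: RandomForestClassifier(n_estimators=5, bootstrap=True,
                                                    random_state=0), X, y)
```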

5.1 UCI Data Sets

We test the GMT on several data sets from the University of California, Irvine (UCI) repository Dua17, and compute the accuracy of DT, RF, and GMT. Except for the Ripley set, where a test set with 1,000 points is provided, we apply a shuffled 10-fold cross validation. Results are shown in Table 1 and training times in Table 2.

Accuracy

Data set   Features     Samples      DT      RF     GMT   GMT - RF
Heart            20         270   76.3%   78.5%   83.0%      4.5%
Credit           23      30 000   72.6%   78.1%   82.0%      3.9%
Haberman          3         306   65.0%   68.3%   71.9%      3.6%
Seismic          18       2 584   87.7%   91.5%   93.2%      1.7%
Ripley            2   250/1 000   83.8%   87.9%   87.6%     -0.3%
Gamma            10      19 020   81.4%   85.6%   85.2%     -0.4%
Diabetic         19       1 151   62.6%   64.8%   63.5%     -1.3%
EEG              14      14 980   84.0%   88.6%   81.2%     -7.4%
Table 1: Accuracy of DT, RF, and GMT for several data sets. Except for the Ripley data set, we apply a 10-fold cross validation to each test. Results are sorted by relative performance, starting from the highest accuracy difference between GMT and RF.

The results reveal some interesting properties of the GMT. Noticeably, the GMT seems to perform well in general, with a training time between the DT and RF times. In all cases, the DT accuracy is lower than the RF accuracy. The cases in which RF outperforms the GMT are the Ripley, Gamma, Diabetic, and EEG data sets. The accuracy difference between GMT and RF may reflect the different assumptions underlying the two models. Furthermore, the fixed beta prior might not be optimal for some of the data sets. The EEG data set is the worst-performing set for GMT relative to both DT and RF. One reason may be that some information is hidden in lower levels, i.e. information that cannot be extracted by selecting the modal partition at each level.

Train time (ms)

Data set        DT       RF      GMT
Heart          0.0      4.7      3.8
Credit       521.0   1388.4    823.2
Haberman       0.0      4.7      0.6
Seismic       10.9     28.1     18.9
Ripley         0.0      1.6      0.0
Gamma        254.3    634.9    252.1
Diabetic       7.8     31.2     10.6
EEG          115.4    338.5    271.4
Table 2: Training time in milliseconds per fold.

5.2 Smoothing

In some cases, the nature of the problem warrants some smoothness in the solution, i.e. we do not want abrupt changes in the posterior distributions of small neighboring regions. Intuitively, when we zoom in on a region, we expect the parameters of the outcome distribution to be more certain. Hence, we can reduce the variance of the prior distribution at each split. In particular, for the examples in this section, we propose to modify the prior distribution as follows: let a beta distribution be the prior at one partition space. For a specific non-trivial partition in this space, count the number of samples of each category in the lower set and in the upper set. The prior distributions that we further use in the lower and upper sets are then beta distributions whose parameters are increased by a proportion of these counts. If we choose the proportion to be zero, we return to the original formulation with a constant prior. Figure 5 displays the effect of the smoothing on the GMT; a sketch of this prior update is given below.
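A minimal sketch of the smoothing rule, assuming the update simply adds a fraction rho of each child's category counts to the Beta prior parameters (the exact update and the value of rho are assumptions):

```python
import numpy as np

def child_priors(a, b, y_lower, y_upper, rho=0.1):
    """Tighten each child's Beta prior by a fraction rho of the outcome counts
    observed in that child; rho = 0 recovers the constant prior."""
    def tighten(y_child):
        n1 = int(np.sum(y_child)); n0 = len(y_child) - n1
        return a + rho * n1, b + rho * n0
    return tighten(y_lower), tighten(y_upper)
```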

Figure 5: Probabilities of outcome 0 for the Ripley test set with different smoothing proportions. To allow for a deeper exploration, we fix the prior probability of the non-trivial partitions. The left and right panels correspond to two different values of the smoothing proportion.

6 Discussion and Future Work

The proposed GMT is a Bayesian Decision Tree that reduces the training time by avoiding both Markov Chain Monte Carlo sampling and a pruning step. In our numerical examples, the GMT has predictive power similar to RF. This approach may be most useful where the ability to explain the model is a requirement, as it has the same interpretability as a DT but performance similar to RF. Hence, the advantages of the GMT are that it can be easily understood and takes less time than RF to train. Furthermore, the ability to specify a prior probability may be particularly suitable for some problems. It remains to find a more efficient way to explore meaningful trees and improve performance.

As an extension, we would like to assess the performance of the algorithm on regression problems and to experiment with larger partition spaces, such as SVM hyperplanes. Another computational advantage not explored here is parallelization, which would allow for a more exhaustive exploration of the tree probability space. Finally, the smoothing concept has only been briefly introduced: despite its intuitive appeal, its theoretical foundations remain to be defined.

References

  • [1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.
  • [2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
  • [3] W. Buntine. Learning classification trees. Statistics and Computing, 2(2):63–73, Jun 1992.
  • [4] D. M. Chickering. Learning Bayesian Networks is NP-Complete, pages 121–130. Springer New York, New York, NY, 1996.
  • [5] D. M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI'97, pages 80–89, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
  • [6] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.
  • [7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
  • [8] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363–377, 1998.
  • [9] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
  • [10] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
  • [11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [12] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  • [13] K. Thommen, B. Goswami, and A. I. Cross. Bayesian decision tree. https://github.com/UBS-IB/bayesian_tree, 2019.

Appendix

1:procedure find_modal_partition()
2:      Applies equation (2).
3:      Computes .
4:     
5:     for each dimension in  do
6:          Identifies each non-trivial partition.
7:         for each observation in  do
8:              if  then
9:                  
10:                   Applies equation (2).
11:                   Applies equation (2).
12:                   Applies eq (3).
13:                   Computes .
14:                  
15:                  if  then
16:                       
17:                  end if
18:              end if
19:         end for
20:     end for
21:     Return
22:end procedure
Algorithm 1 Find the modal partition in for the general problem.
1:procedure find_modal_partition()
2:     
3:     
4:     for each observation in  do Adds category outcomes.
5:         
6:     end for
7:      Obtains pseudo count.
8:     
9:     
10:     
11:     
12:     
13:     for each dimension in  do
14:         
15:         
16:         
17:         
18:         
19:         for each observation in  do
20:              
21:              
22:              
23:              
24:              if  then
25:                  
26:                  
27:                  
28:                  
29:                  if  then
30:                       
31:                  end if
32:              end if
33:         end for
34:     end for
35:     Return
36:end procedure
Algorithm 2 Find the modal partition in for the classification problem.
1:procedure tree_logprob()
2:     if  is a sprout then
3:         Return
4:     else
5:          Applies equation (2).
6:          Obtains .
7:         Return
8:     end if
9:end procedure
10:
11:procedure fill_tree()
12:     
13:      Contains the search logic.
14:      Creates the node based on .
15:     
16:      Contains the search logic.
17:      Creates the node based on .
18:     
19:     if  is a sprout then
20:         
21:     else
22:         
23:     end if
24:     if  is a sprout then
25:         
26:     else
27:         
28:     end if
29:     Return
30:end procedure
Algorithm 3 Fill the Bayesian greedy-modal tree and return the tree non-normalized log-probability.