
Top-down particle filtering for Bayesian decision trees

03/03/2013
by   Balaji Lakshminarayanan, et al.

Decision tree learning is a popular approach for classification and regression in machine learning and statistics, and Bayesian formulations---which introduce a prior distribution over decision trees, and formulate learning as posterior inference given data---have been shown to produce competitive performance. Unlike classic decision tree learning algorithms like ID3, C4.5 and CART, which work in a top-down manner, existing Bayesian algorithms produce an approximation to the posterior distribution by evolving a complete tree (or collection thereof) iteratively via local Monte Carlo modifications to the structure of the tree, e.g., using Markov chain Monte Carlo (MCMC). We present a sequential Monte Carlo (SMC) algorithm that instead works in a top-down manner, mimicking the behavior and speed of classic algorithms. We demonstrate empirically that our approach delivers accuracy comparable to the most popular MCMC method, but operates more than an order of magnitude faster, and thus represents a better computation-accuracy tradeoff.


1. Introduction

Decision tree learning algorithms are widely used across statistics and machine learning, and often deliver near state-of-the-art performance despite their simplicity. Decision trees represent predictive models from an input space, typically $\mathbb{R}^D$, to an output space of labels, and work by specifying a hierarchical partition of the input space into blocks. Within each block of the input space, a simple model predicts labels.

In classical decision tree learning, a decision tree (or collection thereof) is learned in a greedy, top-down manner from the examples. Examples of classical approaches that learn single trees include ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984), while methods that learn combinations of decision trees include boosted decision trees (Friedman, 2001), Random Forests (Breiman, 2001), and many others.

Bayesian decision tree methods, like those first proposed by Buntine (1992), Chipman et al. (1998), Denison et al. (1998), and Chipman and McCulloch (2000), and more recently revisited by Wu et al. (2007), Taddy et al. (2011) and Anagnostopoulos and Gramacy (2012), cast the problem of decision tree learning into the framework of Bayesian inference. In particular, Bayesian approaches start by placing a prior distribution on the decision tree itself. To complete the specification of the model, it is common to associate each leaf node with a parameter indexing a family of likelihoods, e.g., the means of Gaussians or Bernoullis. The labels are then assumed to be conditionally independent draws from their respective likelihoods. The Bayesian approach has a number of useful properties: e.g., the posterior distribution on the decision tree can be interpreted as reflecting residual uncertainty and can be used to produce point and interval estimates.

On the other hand, exact posterior computation is typically infeasible and so existing approaches use approximate methods such as Markov chain Monte Carlo (MCMC) in the batch setting. Roughly speaking, these algorithms iteratively improve a complete decision tree by making a long sequence of random, local modifications, each biased towards tree structures with higher posterior probability. These algorithms stand in marked contrast with classical decision tree learning algorithms like ID3 and C4.5, which rapidly build a decision tree for a data set in a top-down greedy fashion guided by heuristics. Given the success of these methods, one might ask whether they could be adapted to work in the Bayesian framework.

In this article, we present such an adaptation, proposing a sequential Monte Carlo (SMC) method for approximate inference in Bayesian decision trees that works by sampling a collection of trees in a top-down manner like ID3 and C4.5. Unlike classical methods, there is no pruning stage after the top-down learning stage to prevent over-fitting, as the prior combines with the likelihood to automatically cut short the growth of the trees, and resampling focuses attention on those trees that better fit the data. In the end, the algorithm produces a collection of sampled trees that approximate the posterior distribution. While both existing MCMC algorithms and our novel SMC algorithm produce approximations to the posterior that are exact in the limit, we show empirically that our algorithms run more than an order of magnitude faster than existing methods while delivering the same predictive performance.

The article is organized as follows: we begin by describing the Bayesian decision tree model precisely in Section 2, and then describe the SMC algorithm in detail in Section 3. Through a series of empirical tests, we demonstrate in Section 4 that this approach is fast and produces good approximations. We conclude in Section 5 with a discussion comparing this approach with existing ones in the Bayesian setting, and point towards future avenues.

2. Model and notation

In this section, we present the decision tree model for the distribution of the labels $Y = (y_1, \dots, y_N)$ corresponding to input vectors $X = (x_1, \dots, x_N)$, $x_n \in \mathbb{R}^D$. The assumption is that the probabilistic mapping from input vectors to their labels is mediated by a latent decision tree that serves to partition the input space into axis-aligned blocks. Each block is then associated with a parameter that determines the distribution of the labels of the input vectors falling in that block.

Figure 1. A decision tree $\mathcal{T} = (T, \kappa, \tau)$ represents a hierarchical partitioning of a space. Here, the space is the unit square and the tree $T$ contains the nodes $\{\epsilon, 0, 1, 10, 11\}$. The root node $\epsilon$ represents the whole space $B_\epsilon$, while its two children $0$ and $1$ represent the two halves of the cut $(\kappa_\epsilon, \tau_\epsilon)$, where $\kappa_\epsilon$ represents the dimension of the cut, and $\tau_\epsilon$ represents the location of the cut along that dimension. (The origin is at the bottom left of each figure, and the $x$-axis is dimension 1. The red stars and blue circles represent observed data points.) The second cut, $(\kappa_1, \tau_1)$, splits the block $B_1$ into the two halves $B_{10}$ and $B_{11}$. When defining the prior over decision trees given by Chipman et al. (1998), it will be necessary to refer to the “extent” of the data in a block. E.g., $I^x_{0,1}$ and $I^x_{0,2}$ are the extent of the data in dimensions 1 and 2, respectively, in block $B_0$. For each node $p$, the set $D^x_p$ contains those dimensions with non-trivial extent. Here, $D^x_0 = \{1, 2\}$, but $D^x_{10} = \{2\}$, because there is no variation in dimension 1.

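A rooted, strictly binary tree $T$ is a finite tree with a single root, denoted by the empty string $\epsilon$, where each internal node $p$ except the root has exactly two children, called the left child $p0$ and the right child $p1$. Denote the leaves of $T$ (those nodes without children) by $\partial T$. Each node $p \in T$ of the tree is associated with a block $B_p \subseteq \mathbb{R}^D$ of the input space as follows: At the root we have $B_\epsilon = \mathbb{R}^D$, while each internal node $p \in T \setminus \partial T$ “cuts” its block into two halves, with $\kappa_p \in \{1, \dots, D\}$ denoting the dimension of the cut, and $\tau_p$ denoting the location of the cut, so that

$$B_{p0} = \{x \in B_p : x_{\kappa_p} \le \tau_p\} \quad \text{and} \quad B_{p1} = \{x \in B_p : x_{\kappa_p} > \tau_p\}. \tag{1}$$

We call the tuple $\mathcal{T} = (T, \kappa, \tau)$ the decision tree. (See Figure 1 for more intuition on the representation and notation of decision trees.) Note that the blocks associated with the leaves of the tree partition $\mathbb{R}^D$. It will be convenient to write $N(p)$ for the set of data point indices $n \in \{1, \dots, N\}$ such that $x_n \in B_p$. For every subset $A \subseteq \{1, \dots, N\}$, let $Y_A = \{y_n : n \in A\}$ and similarly for $X_A$, so that $X_{N(p)}$ are the input vectors in block $B_p$ and $Y_{N(p)}$ are their labels. Note that both $N(p)$ and $B_p$ depend on $\mathcal{T}$, although we have chosen to elide this dependence for notational simplicity.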
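The hierarchical partition can be sketched in a few lines of Python. This is only an illustration: the string encoding of nodes, the names `leaf_of` and `points_per_leaf`, the 0-indexed dimensions, and the example cut locations are assumptions made here, and the left child is assumed to receive the points at or below the cut.

```python
from typing import Dict, List, Tuple

# A node is a string of '0'/'1' characters; '' is the root, p + '0' the left
# child of p and p + '1' its right child (illustrative encoding).
Cuts = Dict[str, Tuple[int, float]]  # internal node -> (dimension, location)


def leaf_of(cuts: Cuts, x: List[float]) -> str:
    """Follow the cuts from the root until reaching a node without a cut."""
    node = ""
    while node in cuts:
        dim, loc = cuts[node]
        # Points at or below the cut go left, the rest go right.
        node += "0" if x[dim] <= loc else "1"
    return node


def points_per_leaf(cuts: Cuts, X: List[List[float]]) -> Dict[str, List[int]]:
    """Group data point indices by the leaf block that contains them."""
    blocks: Dict[str, List[int]] = {}
    for n, x in enumerate(X):
        blocks.setdefault(leaf_of(cuts, x), []).append(n)
    return blocks


# Hypothetical two-cut tree on the unit square (dimensions are 0-indexed here).
cuts = {"": (0, 0.5), "1": (1, 0.35)}
X = [[0.2, 0.9], [0.7, 0.1], [0.8, 0.6]]
blocks = points_per_leaf(cuts, X)
```

Because every input vector reaches exactly one leaf, the leaf blocks partition the data; leaves that contain no data simply do not appear in `blocks`.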

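Conditioned on the examples $X$, we assume that the joint density of the labels $Y$ and the latent decision tree $\mathcal{T} = (T, \kappa, \tau)$ factorizes as follows:

$$p(Y, \mathcal{T} \mid X) = p(\mathcal{T} \mid X) \prod_{p \in \partial T} \ell(Y_{N(p)} \mid X_{N(p)}),$$

where the product runs over the leaves $\partial T$ of the tree and $\ell$ denotes a likelihood, defined below.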

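In this paper, we focus on the case of categorical labels taking values in the set $\{1, \dots, K\}$. It is natural to take $\ell$ to be the Dirichlet-Multinomial likelihood, corresponding to the data being conditionally i.i.d. draws from a multinomial distribution on $\{1, \dots, K\}$ with a Dirichlet prior. In particular,

$$\ell(Y_{N(p)} \mid X_{N(p)}) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \, \frac{\prod_{k=1}^{K} \Gamma(m_{pk} + \alpha/K)}{\Gamma\left(\sum_{k=1}^{K} m_{pk} + \alpha\right)},$$

where $m_{pk}$ denotes the number of labels $y_n = k$ among those $n \in N(p)$ and $\alpha$ is the concentration parameter of the symmetric Dirichlet prior. Generalisations to other likelihood functions based on conjugate pairs of exponential families are straightforward.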
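As a rough illustration, the Dirichlet-Multinomial likelihood of the labels in one block can be evaluated in log space as below. The function name is hypothetical, labels are assumed to be integers in `0..K-1`, and the symmetric Dirichlet prior is assumed to place pseudo-count `alpha / K` on each class.

```python
import math
from collections import Counter
from typing import Sequence


def log_dirichlet_multinomial(labels: Sequence[int], num_classes: int, alpha: float) -> float:
    """Log marginal likelihood of the labels under a symmetric Dirichlet prior.

    Assumes each class has pseudo-count alpha / num_classes.
    """
    counts = Counter(labels)
    pseudo = alpha / num_classes
    out = math.lgamma(alpha) - math.lgamma(len(labels) + alpha)
    for k in range(num_classes):
        out += math.lgamma(counts.get(k, 0) + pseudo) - math.lgamma(pseudo)
    return out


# A pure block is far more likely than an evenly mixed one of the same size.
pure = log_dirichlet_multinomial([0, 0, 0, 0], num_classes=2, alpha=2.0)
mixed = log_dirichlet_multinomial([0, 1, 0, 1], num_classes=2, alpha=2.0)
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow once a block holds more than a few hundred points.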

The final piece of the model is the prior density over decision trees. In order to make straightforward comparisons with existing algorithms, we adopt the model proposed by Chipman et al. (1998). In this model, the prior distribution of the latent tree is defined conditionally on the given input vectors $X$ (see Section 5 for a discussion of this dependence on $X$ and its effect on the exchangeability of the labels). Informally, the tree is grown starting at the root, and each new node either splits and grows two children (turning the node into an internal node) or stops (leaving it a leaf) stochastically.

We now describe the generative process more precisely in terms of a Markov chain capturing the construction of a decision tree in stages, beginning with the trivial tree $T_0 = \{\epsilon\}$ containing only the root node. At each stage $t$, $T_t$ is produced from $T_{t-1}$ by choosing one leaf in $T_{t-1}$ and either growing two children nodes or stopping the leaf. Once stopped, a leaf is ineligible for future growth. The identity of the chosen leaf is deterministic, while the choice to grow or stop is stochastic. The process proceeds until all leaves are stopped, and so each node is considered for expansion exactly once throughout the process. This will be seen to give rise to a finite sequence of decision trees $(T_t, \kappa_t, \tau_t)$ once we define the associated cut functions $\kappa_t$ and $\tau_t$. We will use this Markov chain in Section 3 as scaffolding for a sequential Monte Carlo algorithm. A similar approach was employed by Taddy et al. (2011) in the setting of online Bayesian decision trees. There are similarities also with the bottom-up SMC algorithms by Teh et al. (2008) and Bouchard-Côté et al. (2012).

We next describe the rule for stopping or growing nodes, and the distribution of cuts. Let $p$ be the node chosen at some stage of the generative process. If the input vectors $X_{N(p)}$ are all identical, then the node stops and becomes a leaf. (Chipman et al. chose this rule because no choice of cut to the block would result in both children containing at least one input vector.) Otherwise, let $D^x_p$ be the set of dimensions along which $X_{N(p)}$ varies, and let $I^x_{p,d} = [\min_{n \in N(p)} x_{nd}, \max_{n \in N(p)} x_{nd}]$ be the range of the input vectors along dimension $d \in D^x_p$. (See last subfigure of Figure 1.) Under the Chipman et al. model, the probability that node $p$ is split is

$$\alpha_s \, (1 + \mathrm{depth}(p))^{-\beta_s}, \qquad \alpha_s \in (0, 1), \quad \beta_s \in [0, \infty),$$

where $\mathrm{depth}(p)$ is the depth of the node, and $\alpha_s$ and $\beta_s$ are parameters governing the shape of the resulting tree. For larger $\alpha_s$ and smaller $\beta_s$ the typical trees are larger, while the deeper $p$ is in the tree the less likely it will be cut. If $p$ is cut, the dimension $\kappa_p$ and then location $\tau_p$ of the cut are sampled uniformly from $D^x_p$ and $I^x_{p,\kappa_p}$, respectively. Note that the choice of support for the distribution over cut dimensions and locations is such that both children of $p$ will, with probability one, contain at least one input vector. Finally, the choices of whether to grow or stop, as well as the cut dimensions and locations, are conditionally independent across different subtrees.

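To complete the generative model, we define $T = T_\eta$, $\kappa = \kappa_\eta$ and $\tau = \tau_\eta$, where $\eta$ is the first stage such that all nodes are stopped. We note that $\eta$ is finite with probability one because each cut of a node produces a non-trivial partition of the data in the block, and a node with one data point will be stopped instead of cut. The conditional density of the decision tree can now be expressed as

$$p(T, \kappa, \tau \mid X) = \prod_{p \in T \setminus \partial T} \frac{\alpha_s}{(1 + \mathrm{depth}(p))^{\beta_s}} \, \frac{1}{|D^x_p|} \, \frac{1}{|I^x_{p,\kappa_p}|} \; \times \prod_{p \in \partial T} \left( 1 - \mathbb{1}[D^x_p \neq \emptyset] \, \frac{\alpha_s}{(1 + \mathrm{depth}(p))^{\beta_s}} \right),$$

where $|D^x_p|$ is the number of dimensions with non-trivial extent and $|I^x_{p,\kappa_p}|$ is the length of the extent along the cut dimension.

Note that the prior distribution of the decision tree does not depend on the deterministic rule for choosing a leaf at each stage. However, this choice will have an effect on the bias/variance of the corresponding SMC algorithm.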
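The generative process of this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: leaves are visited breadth-first, nodes are encoded as strings of `'0'`/`'1'`, dimensions are 0-indexed, and the default values of `alpha_s` and `beta_s` are placeholders chosen here.

```python
import random
from typing import Dict, List, Optional, Tuple


def split_probability(depth: int, alpha_s: float, beta_s: float) -> float:
    """Probability that a node at the given depth is split."""
    return alpha_s * (1.0 + depth) ** (-beta_s)


def sample_tree_from_prior(
    X: List[List[float]],
    alpha_s: float = 0.95,  # illustrative default, not taken from the text
    beta_s: float = 0.5,    # illustrative default, not taken from the text
    rng: Optional[random.Random] = None,
) -> Dict[str, Tuple[int, float]]:
    """Return the cuts {node: (dimension, location)} of one sampled tree."""
    rng = rng or random.Random(0)
    num_dims = len(X[0]) if X else 0
    cuts: Dict[str, Tuple[int, float]] = {}
    queue = [("", list(range(len(X))))]  # breadth-first; '' is the root
    while queue:
        node, idx = queue.pop(0)
        # Extent of the data in this block along each dimension that varies.
        extents = {}
        for d in range(num_dims):
            lo = min(X[n][d] for n in idx)
            hi = max(X[n][d] for n in idx)
            if hi > lo:
                extents[d] = (lo, hi)
        if not extents:
            continue  # all inputs identical: the node must stop
        if rng.random() >= split_probability(len(node), alpha_s, beta_s):
            continue  # stochastic decision to stop
        dim = rng.choice(sorted(extents))        # uniform over varying dimensions
        lo, hi = extents[dim]
        loc = lo + (hi - lo) * rng.random()      # uniform over the extent
        if loc >= hi:                            # guard against rounding up to hi
            loc = lo
        cuts[node] = (dim, loc)
        queue.append((node + "0", [n for n in idx if X[n][dim] <= loc]))
        queue.append((node + "1", [n for n in idx if X[n][dim] > loc]))
    return cuts


tree = sample_tree_from_prior([[0.1, 0.4], [0.8, 0.3], [0.5, 0.9], [0.6, 0.2]])
```

Because the cut location is drawn from the extent of the data in the block, both children always receive at least one data point, so the loop terminates after finitely many stages.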

3. Sequential Monte Carlo (SMC) for Bayesian decision trees

In this section we describe an SMC algorithm for approximating the posterior distribution over the decision tree $(T, \kappa, \tau)$ given the labeled training data $(X, Y)$. (We refer the reader to (Cappé et al., 2007) for an excellent overview of SMC techniques.) The approach we will take is to perform particle filtering following the sequential description of the prior. In particular, at stage $t$, the particles approximate a modified posterior distribution where the prior on $(T, \kappa, \tau)$ is replaced by the distribution of $(T_t, \kappa_t, \tau_t)$, i.e., the process truncated at stage $t$.

Let $E_t$ denote the set of unstopped leaves at stage $t$, all of which are eligible for expansion. An important freedom we have in our SMC algorithm is the choice of which candidate leaf (or set of candidate leaves) $C_t \subseteq E_t$ to consider expanding. In order to avoid “multipath” issues (Del Moral et al., 2006, §3.5) which lead to high variance, we fix a deterministic rule for choosing $C_t \subseteq E_t$. (Multiple candidates are expanded or stopped in turn, independently.) This rule can be a function of $(X, Y)$ and the state of the current particle, as the correctness of the resulting approximation is unaffected. We evaluate two choices in experiments: first, the rule $C_t = E_t$ where we consider expanding all eligible nodes; and second, the rule where $C_t$ contains a single node chosen in a breadth-first (i.e., oldest first) manner from $E_t$.

We may now define the sequence of target distributions. Recall the sequential process defined in Section 2. If the generative process for the decision tree has not completed by stage $t$, the process has generated $(T_t, \kappa_t, \tau_t)$ along with $E_t$, capturing which leaves in $T_t$ have been considered for expansion in previous stages already and which have not. Let $\mathcal{T}_{(t)} = (T_t, \kappa_t, \tau_t, E_t)$ be the variables generated on stage $t$, and write $\Pr$ for the prior distribution on the sequence $(\mathcal{T}_{(t)})$. We construct the target distribution $\pi_t$ as follows: Given $\mathcal{T}_{(t)}$, we generate labels $Y'$ with likelihood $p(Y' \mid \mathcal{T}_{(t)}, X)$, i.e., as if $(T_t, \kappa_t, \tau_t)$ were the complete decision tree. We then define $\pi_t$ to be the conditional distribution of $\mathcal{T}_{(t)}$ given $Y' = Y$. That is, $\pi_t$ is the posterior with a truncated prior.

In order to complete the description of our SMC method, we must define proposal kernels $Q_t$ that sample approximations for the $t$th stage given values for the $(t-1)$th stage. As with our choice of $C_t$, we have quite a bit of freedom. In particular, the proposals can depend on the training data $(X, Y)$. An obvious choice is to take $Q_t$ to be the conditional distribution of $\mathcal{T}_{(t)}$ given $\mathcal{T}_{(t-1)}$ under the prior, i.e., setting $Q_t(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) = \Pr(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)})$. Informally, this choice would lead us to propose extensions to trees at each stage of the algorithm by sampling from the prior, so we will refer to this as the prior proposal kernel (aka the Bayesian bootstrap filter (Gordon et al., 1993)).

We consider two additional proposal kernels: The first,

$$Q_t(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) \propto \Pr(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) \, p(Y \mid \mathcal{T}_{(t)}, X),$$

is called the (one-step) optimal proposal kernel because it would be the optimal kernel assuming that the $t$th stage were the final stage. We return to discuss this kernel in Section 3.1. The second alternative, which we will refer to as the empirical proposal kernel, is a small modification to the prior proposal, differing only in the choice of the split point $\tau_p$. Recall that, in the prior, $\tau_p$ is chosen uniformly from the interval $I^x_{p,\kappa_p}$. This ignores the empirical distribution given by the input data in the partition. We can account for this by first choosing, uniformly at random, a pair of adjacent data points along feature dimension $\kappa_p$, and then sampling a cut uniformly from the interval between these two data points.
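The empirical choice of split point can be sketched as below. The function name is hypothetical, and one detail is an assumption made here: repeated values along the chosen dimension are collapsed, so that "adjacent data points" means adjacent distinct values.

```python
import random
from typing import List


def sample_empirical_cut(values: List[float], rng: random.Random) -> float:
    """Sample a cut location from the data values along one dimension.

    First pick a pair of adjacent distinct values uniformly at random, then
    sample the cut uniformly from the interval between them.
    """
    uniq = sorted(set(values))
    if len(uniq) < 2:
        raise ValueError("no variation along this dimension; the node cannot be cut")
    j = rng.randrange(len(uniq) - 1)
    lo, hi = uniq[j], uniq[j + 1]
    return lo + (hi - lo) * rng.random()


rng = random.Random(0)
# Values of the data points in one block along the chosen cut dimension.
cut = sample_empirical_cut([0.1, 0.1, 0.2, 0.9], rng)
```

Compared with sampling uniformly over the whole extent, this gives every gap between neighbouring values the same chance of being cut, regardless of how wide the gap is.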

The pseudocode for our proposed SMC algorithm is given in Algorithm 1 in Appendix A. Note that the SMC framework only requires us to compute the density of $\mathcal{T}_{(t)}$ under the target distribution $\pi_t$ up to a normalization constant. In fact, the SMC algorithm produces an estimate of the normalization constant, which, at the end of the algorithm, is equal to the marginal probability of the labels $Y$ given $X$, with the latent decision tree marginalized out. In general, the joint density of a Markov chain can be hard to compute, but because the set of nodes $C_t$ considered at each stage is a deterministic function of $\mathcal{T}_{(t-1)}$, the path $(\mathcal{T}_{(0)}, \dots, \mathcal{T}_{(t-1)})$ taken is a deterministic function of $\mathcal{T}_{(t)}$. As a result, the joint density is simply a product of probabilities for each stage. The same property holds for the proposal kernels defined above because they use the same candidate set $C_t$, and have the same support as $\Pr$. These properties justify the equations in Algorithm 1.

3.1. The one-step optimal proposal kernel

In this section we revisit the definition of the one-step optimal proposal kernel. While the prior and empirical proposal kernels are relatively straightforward, the one-step optimal proposal kernel is defined in terms of an additional conditioning on the labels $Y$, which we now study in greater detail.

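Recall that the one-step optimal proposal kernel is given by $Q_t(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) \propto \Pr(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) \, p(Y \mid \mathcal{T}_{(t)}, X)$. To begin, we note that, conditionally on $\mathcal{T}_{(t-1)}$ and $Y$, the subtrees rooted at each node $p \in C_t$ are independent. This follows from the fact that the likelihood of $Y$ given $\mathcal{T}_{(t)}$ factorizes over the leaves. Thus, the proposal’s probability density is

$$Q_t(\mathcal{T}_{(t)} \mid \mathcal{T}_{(t-1)}) = \prod_{p \in C_t} Q_t(\rho_p, \kappa_p, \tau_p),$$

where $Q_t(\rho_p, \kappa_p, \tau_p)$ is the probability density of the cuts at node $p$ under $Q_t$, and $\rho_p \in \{\text{stop}, \text{split}\}$ denotes whether the node was split or not. On the event we split a node $p \in C_t$, if we condition further on $\mathcal{T}_{(t-1)}$ and $\kappa_p$, we note that the conditional likelihood of $Y_{N(p)}$, when viewed as a function of the split $\tau_p$, is piecewise constant, and in particular, only changes when the split crosses an example. It follows that we can sample from this proposal by first considering the discrete choice of an interval, and then sampling uniformly at random from within the interval, as with the empirical proposal. Some algebra shows that, for a node whose input vectors are not all identical,

$$Q_t(\rho_p = \text{stop}) \propto \left( 1 - \frac{\alpha_s}{(1 + \mathrm{depth}(p))^{\beta_s}} \right) \ell(Y_{N(p)} \mid X_{N(p)})$$

and

$$Q_t(\rho_p = \text{split}, \kappa_p, \tau_p) \propto \frac{\alpha_s}{(1 + \mathrm{depth}(p))^{\beta_s}} \, \frac{1}{|D^x_p|} \, \frac{1}{|I^x_{p,\kappa_p}|} \, \ell(Y_{N(p0)} \mid X_{N(p0)}) \, \ell(Y_{N(p1)} \mid X_{N(p1)}),$$

with a common normalization constant, where $\alpha_s$, $\beta_s$, $D^x_p$ and $I^x_{p,\kappa_p}$ are the split-probability parameters, varying dimensions and data extent from the prior of Section 2.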
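Since the conditional likelihood is piecewise constant in the cut location, the one-step optimal proposal at a single node reduces to a discrete distribution over "stop" and over (dimension, interval) pairs. The sketch below illustrates this under several assumptions made here: labels are integers in `0..K-1`, the leaf likelihood is Dirichlet-Multinomial with pseudo-count `alpha / K` per class, the split probability is `alpha_s * (1 + depth) ** (-beta_s)`, and all function and variable names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Option = Optional[Tuple[int, float, float]]  # None = stop, else (dim, lo, hi)


def log_lik(labels: Sequence[int], K: int, alpha: float) -> float:
    """Dirichlet-Multinomial log marginal likelihood (pseudo-count alpha / K)."""
    counts = Counter(labels)
    out = math.lgamma(alpha) - math.lgamma(len(labels) + alpha)
    for k in range(K):
        out += math.lgamma(counts.get(k, 0) + alpha / K) - math.lgamma(alpha / K)
    return out


def optimal_proposal(X, Y, idx: List[int], depth: int, K: int,
                     alpha: float, alpha_s: float, beta_s: float):
    """Return (options, probabilities) for one node holding data indices idx."""
    dims = {}
    for d in range(len(X[0])):
        vals = sorted({X[n][d] for n in idx})
        if len(vals) > 1:
            dims[d] = vals
    if not dims:
        return [None], [1.0]  # identical inputs: the node must stop
    p_split = alpha_s * (1.0 + depth) ** (-beta_s)
    options: List[Option] = []
    log_w: List[float] = []
    if p_split < 1.0:
        options.append(None)
        log_w.append(math.log(1.0 - p_split) + log_lik([Y[n] for n in idx], K, alpha))
    if p_split > 0.0:
        for d, vals in dims.items():
            extent = vals[-1] - vals[0]
            for lo, hi in zip(vals, vals[1:]):
                # Any cut inside (lo, hi) induces the same left/right partition.
                left = [Y[n] for n in idx if X[n][d] <= lo]
                right = [Y[n] for n in idx if X[n][d] > lo]
                prior = p_split * (hi - lo) / (extent * len(dims))
                options.append((d, lo, hi))
                log_w.append(math.log(prior) + log_lik(left, K, alpha)
                             + log_lik(right, K, alpha))
    top = max(log_w)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return options, [wi / total for wi in w]


X = [[0.0], [0.1], [0.9], [1.0]]
Y = [0, 0, 1, 1]
options, probs = optimal_proposal(X, Y, [0, 1, 2, 3], depth=0, K=2,
                                  alpha=2.0, alpha_s=0.95, beta_s=0.5)
```

To draw from the proposal, one would sample an option with these probabilities and, for a split, sample the cut location uniformly inside the chosen interval. The nested loops over dimensions and intervals are what make this proposal more expensive per particle than sampling from the prior.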

3.2. Computational complexity

Let denote the number of unique values in dimension , denote the number of training data points at node and denote the number of nodes in particle . For all the SMC algorithms, the space complexity is . The time complexity is for prior and empirical proposals and for the optimal proposal. The optimal proposal typically incurs a higher computational cost per particle, but requires fewer particles than the prior and empirical proposals.

4. Experiments

In this section, we experimentally evaluate the design choices of the SMC algorithm (proposal, expansion strategy, number of particles and “islands”) on real world datasets. In addition, we compare the performance of SMC to the most popular MCMC method for Bayesian decision tree learning (Chipman et al., 1998), as well as CART, a popular (non-Bayesian) tree induction algorithm. We evaluate all the algorithms on the following datasets from the UCI ML repository (Asuncion and Newman, 2007):


  • MAGIC gamma telescope data 2004 (magic-04): $N = 19020$, $D = 10$, $K = 2$.

  • Pen-based recognition of handwritten digits (pen-digits): $N = 10992$, $D = 16$, $K = 10$.

Previous work has focused mainly on small datasets (e.g., the Wisconsin breast cancer database used by Chipman et al. (1998) has data points). We chose the above datasets to illustrate the scalability of our approach. For the pen-digits dataset, we used the predefined training/test splits, while for the other datasets, we split the datasets randomly into a training set and a test set containing approximately 70% and 30% of the data points respectively.

We implemented our scripts in Python and applied similar software optimization techniques to the SMC and MCMC scripts. The scripts can be downloaded from the authors’ webpages. Our experiments were run on a cluster with machines of similar processing power.

4.1. Design choices in the SMC algorithm

In this set of experiments, we fix the hyperparameters to

and compare the predictive performance of different configurations of the SMC algorithm for this fixed model. Under the prior, these values of produce trees whose mean depth and number of nodes are and , respectively. Given particles, we use an effective sample size (ESS) threshold of and set the maximum number of stages to 5000 (although the algorithms never reached this number).

4.1.1. Proposal choice and node expansion

We consider the SMC algorithm proposed in Section 3 under two proposals: optimal and prior. (The empirical proposal performed similarly to the prior proposal and hence we do not report those results here.) We consider two strategies for choosing $C_t$, i.e., the list of nodes considered for expansion at stage $t$: (i) node-wise expansion, where a single node is considered for expansion per stage (i.e., $C_t$ is a singleton chosen deterministically from the eligible nodes $E_t$), and (ii) layer-wise expansion, where all nodes at a particular depth are considered for expansion simultaneously (i.e., $C_t = E_t$). For node-wise expansion, we evaluate two strategies for selecting the node deterministically from $E_t$: (i) breadth-first priority, where the oldest node is picked first, and (ii) marginal-likelihood based priority, where we expand the node with the lowest marginal likelihood. Both of these priority schemes performed similarly; hence we report only the results for breadth-first priority. We use multinomial resampling in our experiments. We also evaluated systematic resampling (Douc et al., 2005) but found that the performance was not significantly different.
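The resampling step used by the particle filter can be illustrated as follows: compute the effective sample size (ESS) from the normalized weights and resample multinomially only when it falls below a threshold. This is a generic sketch with hypothetical function names; for brevity it resets the weights to uniform after resampling and does not track the running estimate of the normalization constant.

```python
import math
import random
from typing import List, Sequence, Tuple


def normalize_log_weights(log_w: Sequence[float]) -> List[float]:
    """Turn unnormalized log weights into probabilities (numerically stable)."""
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = 1 / sum of squared normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def maybe_resample(particles: list, log_w: List[float], ess_threshold: float,
                   rng: random.Random) -> Tuple[list, List[float]]:
    """Multinomial resampling, triggered only when the ESS is too low."""
    weights = normalize_log_weights(log_w)
    if effective_sample_size(weights) >= ess_threshold:
        return particles, log_w
    # Draw particles with replacement, with probability proportional to weight.
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    return resampled, [0.0] * len(particles)  # equal weights after resampling


rng = random.Random(0)
particles = ["tree_a", "tree_b", "tree_c", "tree_d"]  # stand-ins for trees
new_particles, new_log_w = maybe_resample(particles, [0.0, -1000.0, -1000.0, -1000.0],
                                          ess_threshold=2.0, rng=rng)
```

With node-wise expansion this check can run after every single node decision, which is what lets the filter discard particles that made a poor choice before they accumulate further poor choices.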

We report the log predictive probability on test data as a function of runtime and of the number of particles (similar trends are observed for test accuracy; see Appendix B). The times reported do not account for prediction time. We average the numbers over 10 random initializations and report standard deviations. The results are shown in Figure 2. In summary, we observe the following:

  1. Node-wise expansion outperforms layer-wise expansion for prior proposal. The prior proposal does not account for likelihood; one could think of the resampling steps as ‘correction steps’ for the sub-optimal decisions sampled from the prior proposal. Because node-wise expansion can potentially resample at every stage, it can correct individual bad decisions immediately, whereas layer-wise expansion cannot. In particular, we have observed that layer-wise expansion tends to produce shallower trees compared to node-wise expansion, leading to poorer performance. This phenomenon can be explained as follows: as the depth of the node increases, the prior probability of stopping increases whereas the posterior probability of stopping might be quite low. In node-wise expansion, the resampling step can potentially retain the particles where the node has not been stopped. However, in layer-wise expansion, too many nodes might have stopped prematurely and the resampling step cannot ‘correct’ all these bad decisions easily (i.e., it would require many more particles to sample trees where all the nodes in a layer have not been stopped). Another interesting observation is that layer-wise expansion exhibits higher variance: this can be explained by the fact that layer-wise expansion samples a greater number of random variables (on average) than node-wise before resampling, and so suffers for the same reason that importance sampling can suffer from high variance. Note that both expansion strategies perform similarly for the optimal proposal due to the fact that the proposal accounts for the likelihood and resampling does not affect the results significantly. Due to its superior performance, we consider only node-wise expansion in the rest of the paper.

  2. The plots on the right side of Figure 2 suggest that the optimal proposal requires fewer particles than the prior proposal (as expected). However, the per-stage cost of optimal proposal is much higher than the prior, leading to significant increase in the overall runtime (see Section 3.2 for a related discussion). Hence, the prior proposal offers a better predictive performance vs computation time tradeoff than the optimal proposal.

  3. The performance of optimal proposal saturates very quickly and is near-optimal even when the number of particles is small.

Figure 2. Results on pen-digits (top), and magic-04 (bottom). Left column plots test log predictive probability vs runtime, while right column plots test log predictive probability vs number of particles. The blue circles and red squares represent optimal and prior proposals respectively. The solid and dashed lines represent node-wise and layer-wise expansion respectively.

4.1.2. Effect of irrelevant features

In the next experiment, we test the effect of irrelevant features on the performance of the various proposals. We use the madelon dataset (http://archive.ics.uci.edu/ml/datasets/Madelon) for this experiment, in which the data points belong to one of 2 classes and lie in a 500-dimensional space, out of which only 20 dimensions are deemed relevant. The training dataset contains 2000 data points and the test dataset contains 600 data points. We use the validation dataset in the UCI ML repository as our test set because labels are not available for the test dataset.

The setup is identical to the previous section. The results are shown in Figure 3. Here, the optimal proposal outperforms the prior proposal in both columns, requiring fewer particles as well as outperforming the prior proposal for a given computational budget. While this dataset is atypical (only 20 of the 500 features are relevant), it illustrates a potential vulnerability of the prior proposal to irrelevant features.

Figure 3. Results on madelon dataset: The top and bottom rows display log predictive probability and accuracy on the test data, respectively, against runtime (left) and the number of particles (right). The blue circles and red squares represent optimal and prior proposals respectively.

4.1.3. Effect of the number of islands

Averaging the results of several independent particle filters (aka islands) is a way to reduce variance at the cost of bias, compared with running a single, larger filter. In the asymptotic regime, this would not make sense, but as we will see, performance is improved with multiple islands, suggesting we are not yet in the asymptotic regime. In this experiment, we evaluate the effect of the number of islands on the test performance of the prior proposal. We fix the total number of particles to and vary , the number of islands (and hence, the number of particles per island). Note that all the islands operate on the entire dataset unlike bagging. Here, we present results only on the pen-digits dataset (see Appendix C for results on the magic-04 dataset). The results are shown in Figure 4. We observe that (i) the test performance drops sharply if we use fewer than 100 particles per island and (ii) when , the choices of outperform . Since the islands are independent, the computation across islands is ‘embarrassingly parallelizable’.
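One plausible reading of "averaging the results" of independent islands is sketched below: each island forms a weighted average of its particles' predictive class probabilities, and the island-level predictions are then averaged with equal weight. The function names and the equal weighting across islands are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

Probs = List[float]  # predictive class probabilities for one test point


def island_prediction(particle_probs: Sequence[Probs], weights: Sequence[float]) -> Probs:
    """Weighted average of the particles' predictions within one island."""
    num_classes = len(particle_probs[0])
    return [sum(w * p[k] for w, p in zip(weights, particle_probs))
            for k in range(num_classes)]


def average_over_islands(islands: Sequence[Tuple[Sequence[Probs], Sequence[float]]]) -> Probs:
    """Equal-weight average of the per-island predictions."""
    preds = [island_prediction(probs, weights) for probs, weights in islands]
    num_classes = len(preds[0])
    return [sum(p[k] for p in preds) / len(preds) for k in range(num_classes)]


# Two hypothetical islands: (per-particle predictions, normalized particle weights).
islands = [
    ([[0.9, 0.1], [0.5, 0.5]], [0.5, 0.5]),
    ([[0.2, 0.8]], [1.0]),
]
combined = average_over_islands(islands)
```

Since each island only needs its own particles and the full dataset, the islands can run on separate processes or machines and be combined at prediction time.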

Figure 4. Results on pen-digits: Test (left) and accuracy (right) vs and for fixed .

4.2. SMC vs MCMC

In this experiment, we compare the SMC algorithms to the MCMC algorithm proposed by Chipman et al. (1998), which employs four types of Metropolis-Hastings proposals: grow (split a leaf node into child nodes), prune (prune a pair of leaf nodes belonging to the same parent), change (change the decision rule at a node) and swap (swap the decision rule of a parent with the decision rule of the child). In our experiments, we average the MCMC predictions over the trees from all previous iterations.

The experimental setup is identical to Section 4.1, except that we fix the number of islands, . We vary the number of particles for SMC and the number of iterations for MCMC, and plot the log predictive probability and accuracy on the test data as a function of runtime. (We fix so that the minimum value of () corresponds to particles per island. Further improvements could be obtained by ‘adapting’ the number of islands to the number of particles, as discussed in Section 4.1.3.) In Figure 5, we observe that SMC (prior, node-wise) is roughly two orders of magnitude faster than MCMC while achieving similar predictive performance on the pen-digits and magic-04 datasets. Although the exact speedup factor depends on the dataset in general, we have observed that SMC (prior, node-wise) is at least an order of magnitude faster than MCMC. The SMC runtimes in Figure 5 are recorded by running the islands in a serial fashion. As discussed in Section 4.1.3, one could parallelize the computation across islands, leading to an additional speedup by a factor of up to the number of islands. In the pen-digits dataset, the performance of the prior proposal seems to drop as we increase the number of particles beyond 2000. However, the marginal likelihood on the training data increases with the number of particles (see Appendix D). We believe that the deteriorating performance is due to model misspecification (axis-aligned decision trees are hardly the ‘right’ model for handwritten digits) rather than the inference algorithm itself: ‘better’ Bayesian inference in a misspecified model might lead to a poorer solution (see (Minka, 2000) for a related discussion).

To evaluate the sensitivity of the trends above to the hyperparameters, we systematically varied their values and repeated the experiment. The results are qualitatively similar. See Appendix E for additional information.

Figure 5. Results on pen-digits (top row), and magic-04 (bottom row). Left column plots test log predictive probability vs runtime, while right column plots test accuracy vs runtime. The blue circles, red squares and black diamonds represent the optimal proposal, the prior proposal and MCMC respectively.

4.3. SMC vs other existing approaches

The goal of these experiments was to verify that our SMC approximation performed as well as the “gold standard” MCMC algorithms most commonly used in the Bayesian decision tree learning setting. Indeed, our results suggest that, for a fraction of the computational budget, we can achieve a comparable level of accuracy. In this final experiment, we re-affirm that the Bayesian algorithms are competitive in accuracy with the classic CART algorithm. (There are many other comparisons that one could pursue and other authors have already performed such comparisons. E.g., Taddy et al. (2011) demonstrated that their tree structured models yield similar performance to Gaussian processes and random forests.) We used the CART implementation provided by scikit-learn (Pedregosa et al., 2011) with two criteria, gini purity and information gain, and set the minimum number of data points at a leaf node. (Lower values tend to yield slightly higher test accuracies, comparable to SMC and MCMC, but much lower predictive probabilities.) In addition, we performed Laplacian smoothing on the probability estimates from CART using the same concentration parameter $\alpha$ as for the Bayesian methods. Our Python implementation of SMC takes about 50-100x longer to achieve the same test accuracy as the highly-optimized implementation of CART. For this reason, we plot CART accuracy as a horizontal bar. The accuracy and log predictive probability on test data are shown in Figure 5. The Bayesian decision tree frameworks achieve similar (or better) test accuracy to CART, and outperform CART significantly in terms of the predictive likelihood. SMC delivers the benefits of having an approximation to the posterior, but in a fraction of the time required by existing MCMC methods.
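The effect of smoothing on predictive probabilities can be seen in a small sketch. It assumes that smoothing adds pseudo-count `alpha / K` to each of the `K` class counts at a leaf, which may differ in detail from the smoothing actually applied to CART; the function names and the example counts are hypothetical.

```python
import math
from typing import List, Sequence


def smoothed_class_probs(counts: Sequence[int], alpha: float) -> List[float]:
    """Class probabilities at a leaf with pseudo-count alpha / K per class."""
    num_classes = len(counts)
    total = sum(counts)
    return [(m + alpha / num_classes) / (total + alpha) for m in counts]


def mean_log_predictive(prob_rows: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average log probability assigned to the true label of each test point."""
    return sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)


# A pure leaf with three training points of class 0 and none of class 1.
probs = smoothed_class_probs([3, 0], alpha=2.0)
# A test point of class 1 landing in this leaf still gets non-zero probability.
score = mean_log_predictive([probs, probs], [0, 1])
```

Without smoothing, the pure leaf would assign probability zero to class 1, and a single misclassified test point would drive the log predictive probability to negative infinity. This is one reason accuracy and predictive probability can rank methods differently.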

5. Discussion and Future work

We have proposed a novel class of Bayesian inference algorithms for decision trees, based on the sequential Monte Carlo framework. The algorithms mimic classic top-down algorithms for learning decision trees, but use “local” likelihoods along with resampling steps to guide tree growth. We have shown good computational and statistical performances, especially compared with a state-of-the-art MCMC inference algorithm. Our algorithms are easier to implement than their MCMC counterparts, whose efficient implementations require sophisticated book-keeping.

We have also explored various design choices leading to different SMC algorithms. We have found that expanding too many nodes simultaneously degraded performance, and more sophisticated ways of choosing nodes surprisingly did not improve performance. Finally, while the one-step optimal proposal often required fewer particles to achieve a given accuracy, it was significantly more computationally intensive than the prior proposal, leading to a less efficient algorithm overall on datasets with few irrelevant input dimensions. As the number of irrelevant dimensions increased the balance tipped in favour of the optimal proposal. An interesting direction of exploration is to devise some way to interpolate between the prior and optimal proposals, getting the best of both worlds.

The model underlying this work assumes that the data is explained by a single tree. In contrast, many uses of decision trees, e.g., random forests, bagging, etc., can be interpreted as working within a model class where the data is explained by a collection of trees. Bayesian additive regression trees (BART) (Chipman et al., 2010) are such a model class, and prior work has considered MCMC techniques for posterior inference in it (Chipman et al., 2010). A substantial but important extension of this work would be to tackle additive combinations of trees, potentially in a way that continues to mimic classic algorithms.

Finally, in order to more closely match existing work in Bayesian decision trees, we have used a prior over decision trees that depends on the input data. This has the undesirable side-effect of breaking exchangeability in the model, making it incoherent with respect to changing dataset sizes and to working with online data streams. One solution is to use an alternative prior for decision trees, e.g., based on the Mondrian process (Roy and Teh, 2009), whose projectivity would re-establish exchangeability while allowing for efficient posterior computations that depend on data.

Acknowledgments

We would like to thank Charles Blundell, Arnaud Doucet, David Duvenaud, Jan Gasthaus, Hong Ge, Zoubin Ghahramani, and James Robert Lloyd for helpful discussions and feedback on drafts. DMR is supported by a Newton International Fellowship and Emmanuel College. BL and YWT gratefully acknowledge generous funding from the Gatsby Charitable Foundation.

References

  • Anagnostopoulos and Gramacy (2012) C. Anagnostopoulos and R. Gramacy. Dynamic trees for streaming and massive data contexts. arXiv preprint arXiv:1201.5568, 2012.
  • Asuncion and Newman (2007) A. Asuncion and D. J. Newman. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
  • Bouchard-Côté et al. (2012) A. Bouchard-Côté, S. Sankararaman, and M. I. Jordan. Phylogenetic inference via sequential Monte Carlo. Systematic Biology, 61(4):579–593, 2012.
  • Breiman (2001) L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
  • Breiman et al. (1984) L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. Chapman & Hall/CRC, 1984.
  • Buntine (1992) W. Buntine. Learning classification trees. Stat. Comput., 2:63–73, 1992.
  • Cappé et al. (2007) O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE, 95(5):899–924, 2007.
  • Chipman and McCulloch (2000) H. Chipman and R. E. McCulloch. Hierarchical priors for Bayesian CART shrinkage. Stat. Comput., 10(1):17–24, 2000.
  • Chipman et al. (1998) H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search. J. Am. Stat. Assoc., pages 935–948, 1998.
  • Chipman et al. (2010) H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298, 2010.
  • Del Moral et al. (2006) P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):411–436, 2006.
  • Denison et al. (1998) D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363–377, 1998.
  • Douc et al. (2005) R. Douc, O. Cappé, and E. Moulines. Comparison of resampling schemes for particle filtering. In Image Sig. Proc. Anal., pages 64–69, 2005.
  • Friedman (2001) J. H. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Statist., 29(5):1189–1232, 2001.
  • Gordon et al. (1993) N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar Sig. Proc., IEE Proc. F, 140(2):107–113, 1993.
  • Minka (2000) T. P. Minka. Bayesian model averaging is not model combination. MIT Media Lab note. http://research.microsoft.com/en-us/um/people/minka/papers/bma.html, 2000.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. J. Machine Learning Res., 12:2825–2830, 2011.
  • Quinlan (1986) J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
  • Quinlan (1993) J. R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
  • Roy and Teh (2009) D. M. Roy and Y. W. Teh. The Mondrian process. In Adv. Neural Information Proc. Systems, volume 21, pages 1377–1384, 2009.
  • Taddy et al. (2011) M. A. Taddy, R. B. Gramacy, and N. G. Polson. Dynamic trees for learning and design. J. Am. Stat. Assoc., 106(493):109–123, 2011.
  • Teh et al. (2008) Y. W. Teh, H. Daumé III, and D. M. Roy. Bayesian agglomerative clustering with coalescents. In Adv. Neural Information Proc. Systems, volume 20, 2008.
  • Wu et al. (2007) Y. Wu, H. Tjelmeland, and M. West. Bayesian CART: Prior specification and posterior simulation. J. Comput. Graph. Stat., 16(1):44–66, 2007.

Appendix A SMC algorithm

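  Inputs: training data; number of particles
  Initialize: the partial tree and weight of each particle
  for each stage do
     for each particle do
        Sample the particle's next partial tree from the proposal, given its current partial tree
        Update weights: reweight the particle (the distributions involved are denoted by their densities)
     end for
     Compute normalization: sum the particle weights
     Normalize weights: divide each weight by the normalization
     if the resampling criterion is met then
         Resample particle indices from the normalized weights
         Copy the resampled particles; reset their weights
     end if
     if no particle has nodes left to expand then
        exit for loop
     end if
  end for
  return  Estimated marginal probability and weighted samples.
Algorithm 1 SMC for Bayesian decision tree learning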
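The resampling step of Algorithm 1 can be sketched as follows. This is a minimal illustration, assuming that the resampling criterion is an effective sample size (ESS) threshold and that resampling is multinomial; both are common choices in SMC but are assumptions here. The function names, the threshold value and the toy particles are hypothetical.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2); ranges from 1 (degenerate) to M (uniform)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def resample_if_needed(particles, weights, ess_threshold, rng):
    """Multinomial resampling, triggered only when the ESS is below the threshold.

    After resampling, every particle receives weight W / M, where W is the
    sum of the weights, so the total weight is preserved. Particles are
    copied by reference; mutable trees would need a real copy in practice.
    """
    m = len(particles)
    if effective_sample_size(weights) >= ess_threshold:
        return particles, weights
    total = sum(weights)
    indices = rng.choices(range(m), weights=weights, k=m)
    return [particles[j] for j in indices], [total / m] * m


rng = random.Random(0)
particles = ["tree_a", "tree_b", "tree_c", "tree_d"]   # stand-ins for partial trees
weights = [0.97, 0.01, 0.01, 0.01]                     # one particle dominates

new_particles, new_weights = resample_if_needed(
    particles, weights, ess_threshold=2.0, rng=rng
)
```

Keeping the weights unnormalized and resetting them to W / M is one convenient convention: the average weight can then serve as a running estimate of the marginal probability without tracking a separate product of normalization constants.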

Appendix B Effect of SMC proposal and expansion strategy on test accuracy

The results are shown in Figure 6.

Figure 6. Results on pen-digits (top), and magic-04 (bottom). Left column plots test accuracy vs runtime, while right column plots test accuracy vs number of particles. The blue circles and red squares represent optimal and prior proposals respectively. The solid and dashed lines represent node-wise and layer-wise proposals respectively.

Appendix C Effect of the number of islands: magic-04 dataset

The results are shown in Figure 7.

Figure 7. Results on magic-04: test log predictive probability (left) and test accuracy (right) for varying numbers of islands.

Appendix D Marginal likelihood

The log marginal likelihood of the training data for different proposals is shown in Figure 8. As the number of particles increases, the log marginal likelihood estimates obtained with the prior and optimal proposals converge to the same value (as expected).

Figure 8. Results on pen-digits (left), and magic-04 (right). Mean log marginal likelihood of the training data (averaged across 10 runs) vs number of particles. The blue circles and red squares represent optimal and prior proposals respectively.
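A quantity like the one plotted in Figure 8 could be computed along the following lines. This is a sketch under stated assumptions: the estimate is taken to be the average of the unnormalized particle weights, it is evaluated in the log domain to avoid underflow, and the log-weights below are made-up numbers. The function names are hypothetical.

```python
import math


def log_marginal_estimate(log_weights):
    """Return log(W / M) from log-weights, using the log-sum-exp trick.

    W is the sum of the unnormalized weights and M the number of particles.
    Subtracting the largest log-weight before exponentiating avoids underflow.
    """
    m = len(log_weights)
    peak = max(log_weights)
    log_total = peak + math.log(sum(math.exp(lw - peak) for lw in log_weights))
    return log_total - math.log(m)


def mean_over_runs(values):
    """Plain average of per-run log marginal likelihood estimates."""
    return sum(values) / len(values)


# Toy log-weights from two hypothetical runs; exp(-1000) underflows to 0.0
# in double precision, so a naive log(sum(exp(...))) would fail here.
run_a = [-1000.0, -1001.0, -1002.0]
run_b = [-1000.5, -1000.5, -1000.5]

per_run = [log_marginal_estimate(run_a), log_marginal_estimate(run_b)]
mean_log_marginal = mean_over_runs(per_run)
```

Averaging the per-run log estimates, as in the figure caption, is different from taking the log of the averaged estimates; the sketch follows the former.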

Appendix E Sensitivity of results to choice of hyperparameters

In this experiment, we evaluate the sensitivity of the runtime vs predictive performance comparison between SMC (prior and optimal proposals), MCMC and CART to the choice of hyperparameters: the Dirichlet concentration parameter and the tree prior hyperparameters. We consider only node-wise expansion since it consistently outperformed layer-wise expansion in our previous experiments. In the first variant, we fix the Dirichlet concentration parameter (since we do not expect it to affect the timing results) and vary the tree prior hyperparameters, also considering intermediate configurations. In the second variant, we fix the tree prior hyperparameters and change the Dirichlet concentration parameter.

Figures 9, 10, 11 and 12 display the results on pen-digits (top row) and magic-04 (bottom row). The left column plots test log predictive probability vs runtime, while the right column plots test accuracy vs runtime. The blue circles and red squares represent optimal and prior proposals respectively. Comparing the results to Figure 5 (in main text), we observe that the trends are qualitatively similar to those observed in Section 4.2 (in main text): (i) SMC consistently offers a better runtime vs predictive performance tradeoff than MCMC, (ii) the prior proposal offers a better runtime vs predictive performance tradeoff than the optimal proposal, (iii) changing the Dirichlet concentration parameter leads to similar test accuracies (the predictive probabilities are obviously not comparable).

Figure 9. Hyperparameters:
Figure 10. Hyperparameters: (see main text for additional information).
Figure 11. Hyperparameters: (see main text for additional information).
Figure 12. Hyperparameters: (see main text for additional information).