Accelerated Bayesian Additive Regression Trees

by Jingyu He, et al.

Although less widely known than random forests or boosted regression trees, Bayesian additive regression trees (BART) (Chipman et al., 2010) is a powerful predictive model that often outperforms those better-known alternatives at out-of-sample prediction. BART is especially well-suited to settings with unstructured predictor variables and substantial sources of unmeasured variation, as is typical in the social, behavioral, and health sciences. This paper develops a modified version of BART that is amenable to fast posterior estimation. We present a stochastic hill climbing algorithm that matches the remarkable predictive accuracy of previous BART implementations, but is orders of magnitude faster and uses a fraction of the memory. Simulation studies show that the new method is comparable in computation time and more accurate at function estimation than both random forests and gradient boosting.





1 Introduction

Tree-based regression methods — CART (Breiman et al., 1984), random forests (Breiman, 2001), and gradient boosting (Breiman, 1997; Friedman, 2001, 2002) — are highly successful and widely used for supervised learning. Bayesian additive regression trees — or BART — is a closely related but less well-known method that often achieves superior prediction/estimation accuracy. The “Bayesian CART” (single-tree) model was introduced in Chipman et al. (1998) and the BART model first appeared in Chipman et al. (2010), although software was publicly available as early as 2006. Contrary to common perception, BART is not merely a version of random forests or boosted regression trees in which prior distributions have been placed over model parameters. Instead, the Bayesian perspective leads to a fundamentally new tree-growing criterion and algorithm, which yield a number of practical advantages: robustness to the choice of user-selected tuning parameters, more accurate predictions, and a natural Bayesian measure of uncertainty.

Despite these virtues, BART’s wider adoption has been slowed by its more severe computational demands relative to alternatives, owing to its reliance on a random walk Metropolis-Hastings Markov chain Monte Carlo approach. The current fastest implementation, the R package dbarts, takes orders of magnitude longer than the widely-used R package xgboost, for example. This paper develops a variant of BART that is amenable to fast posterior estimation, making it almost as fast as xgboost, while still retaining BART’s hyperparameter robustness and remarkable predictive accuracy.

First, we describe the BART model to motivate our computational innovations. We derive the BART model’s tree-growing criterion, which is notably different from the traditional sum-of-squares criterion used by other methods. We then describe the new algorithm, the accelerated Bayesian additive regression trees heuristic (ABARTH), and illustrate its impact on fast, accurate statistical prediction. Specifically, we compare the new method’s performance to random forests, boosted regression trees, and the standard MCMC implementations of BART.

2 BART in Detail

2.1 The Model: Likelihood and Prior

The BART model is an additive error mean regression model
$$y_i = f(x_i) + \epsilon_i,$$
where the $\epsilon_i$ are assumed to be independent mean-zero Gaussians, $\epsilon_i \sim \mathrm{N}(0, \sigma^2)$, and $f$ is an unknown function. The BART prior represents the unknown function as a sum of many piecewise constant binary regression trees:
$$f(x) = \sum_{l=1}^{L} g(x; T_l, \mu_l), \qquad (1)$$
where $T_l$ denotes a regression tree and $\mu_l$ denotes a vector of scalar means associated with the leaf nodes of $T_l$. Each tree $T_l$ consists of a set of internal decision nodes which define a partition of the covariate space (say $\mathcal{A}_1, \dots, \mathcal{A}_{B_l}$), as well as a set of terminal nodes or leaves corresponding to each element of the partition. Further, each element of the partition $\mathcal{A}_b$ is associated with a parameter value, $\mu_{lb}$. Taken together, the partition and the leaf parameters define a piecewise constant function: $g(x; T_l, \mu_l) = \mu_{lb}$ if $x \in \mathcal{A}_b$; see Figure 1.







Figure 1: (Left) An example binary tree, with internal nodes labelled by their splitting rules and terminal nodes labelled with the corresponding parameters $\mu_{lb}$. (Right) The corresponding partition of the sample space and the step function.

The tree prior $p(T_l)$ is specified by three components: (i) the probability of a node having children at depth $d$, $\alpha(1+d)^{-\beta}$, with $\alpha \in (0,1)$ and $\beta \in [0, \infty)$, (ii) the uniform distribution over available predictors for splitting rule assignment at each interior node, and (iii) the uniform distribution on the discrete set of available splitting values for the assigned predictor at each interior node. This last choice has the appeal of invariance under monotone transformations of the predictors.

Chipman et al. (2010) recommend $\alpha = 0.95$ and $\beta = 2$ to enforce small trees. Finally, the leaf mean parameters $\mu_{lb}$ are assigned independent mean-zero normal priors: $\mu_{lb} \sim \mathrm{N}(0, \tau)$. The parameter $\tau$ is a crucial regularization parameter; the pointwise prior variance of $f$ is $L\tau$.
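The depth-penalized split probability in component (i) is simple to compute; a one-line sketch, using the Chipman et al. (2010) defaults ($\alpha = 0.95$, $\beta = 2$):

```python
def split_prob(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is non-terminal,
    alpha * (1 + depth)^(-beta); decreasing in depth, so deep trees are rare."""
    return alpha * (1.0 + depth) ** (-beta)
```

For example, the root splits with prior probability 0.95, while a node at depth 3 splits with probability below 0.06, which is what keeps the individual trees small.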

2.2 The BART Splitting Criterion

By viewing the model as a data generating process, the Bayesian vantage point motivates modifications to the usual splitting criterion. Because the model stipulates that observations in the same leaf node share the same mean parameter, the prior predictive distribution — obtained by integrating out the unknown group specific mean — is simply a mean-zero multivariate normal distribution with covariance matrix
$$V = \tau J J^t + \sigma^2 I_n,$$
where $\tau$ is the prior variance of the leaf-specific mean parameter, $\sigma^2$ is the variance of the additive error, and $J$ is a column vector of all ones. Observe that the prior predictive density of $y \sim \mathrm{N}(0, V)$ is
$$p(y) = (2\pi)^{-n/2} \det(V)^{-1/2} \exp\left(-\tfrac{1}{2} y^t V^{-1} y\right),$$
which can be simplified by a direct application of the matrix inversion lemma to $V^{-1}$:
$$V^{-1} = \sigma^{-2}\left(I_n - \frac{\tau}{\sigma^2 + \tau n} J J^t\right).$$
Applying Sylvester’s determinant theorem to $\det(V)$ yields
$$\det(V) = \sigma^{2(n-1)}(\sigma^2 + \tau n).$$
Taking logarithms yields a marginal log-likelihood of
$$\log p(y) = -\frac{n}{2}\log(2\pi) - \frac{n-1}{2}\log \sigma^2 - \frac{1}{2}\log(\sigma^2 + \tau n) - \frac{y^t y}{2\sigma^2} + \frac{\tau s^2}{2\sigma^2(\sigma^2 + \tau n)},$$
where we write $s = \sum_i y_i$ so that $s^2 = y^t J J^t y$. This likelihood applies separately to the partitions of the data corresponding to the leaves of a single fixed regression tree. Because observations in different leaf nodes are independent (conditional on $\sigma^2$), the full marginal log-likelihood is given by
$$\log p(y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{y^t y}{2\sigma^2} + \sum_{b=1}^{B}\left\{\frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)}\right\},$$
where $b$ runs over all the leaf nodes, $n_b$ is the number of observations in leaf $b$, and $s_b = \sum_{i \in b} y_i$. Notice that the first three terms are not functions of the partition (the tree parameter), so they are constant, leaving
$$\sum_{b=1}^{B}\left\{\frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)}\right\} \qquad (2)$$
as the model-based split criterion, where $n_b$ and $s_b$ are functions of the data and the tree $T$.
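As a sketch, the split criterion of Section 2.2 depends on each leaf only through its size $n_b$ and residual sum $s_b$; the Python below assumes the standard conjugate form (leaf prior variance `tau`, error variance `sigma2`), and the function names are hypothetical.

```python
import math

def leaf_log_marginal(n, s, sigma2, tau):
    """Per-leaf contribution to the marginal log-likelihood, up to constants:
    0.5*log(sigma^2/(sigma^2 + tau*n)) + tau*s^2 / (2*sigma^2*(sigma^2 + tau*n))."""
    v = sigma2 + tau * n
    return 0.5 * math.log(sigma2 / v) + tau * s * s / (2.0 * sigma2 * v)

def split_log_criterion(n_left, s_left, n_right, s_right, sigma2, tau):
    """A candidate split scores the sum of its two children's contributions.
    Splits that concentrate residual mass (large |s|) into leaves score higher."""
    return (leaf_log_marginal(n_left, s_left, sigma2, tau)
            + leaf_log_marginal(n_right, s_right, sigma2, tau))
```

A split separating residuals of opposite sign beats one that leaves both children with zero-sum residuals, which is the behavior the criterion is designed to reward.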


2.3 The BART MCMC

The basic BART MCMC proceeds as a Metropolis-within-Gibbs algorithm, with the key update of the individual regression trees being conducted as a local random walk Metropolis-Hastings (MH) update, given all of the other trees as well as the residual variance parameter, $\sigma^2$. Let $T = \{T_l\}_{l=1}^{L}$ denote the set of trees and $M = \{\mu_l\}_{l=1}^{L}$ denote the set of leaf parameter vectors. Recall that $f(x) = \sum_{l=1}^{L} g(x; T_l, \mu_l)$, and each $\mu_l$ is length $B_l$.

The sequence of Gibbs updates is

  1. $T_l, \mu_l \mid T_{-l}, M_{-l}, \sigma^2, y$, for $l = 1, \dots, L$, which is done compositionally (for each $l$) as

    1. $T_l \mid T_{-l}, M_{-l}, \sigma^2, y$,

    2. $\mu_l \mid T, M_{-l}, \sigma^2, y$,

  2. $\sigma^2 \mid T, M, y$.

Taking advantage of the additive structure of the model, these updates can be written as

  1. $T_l, \mu_l \mid r_l, \sigma^2$, for $l = 1, \dots, L$, which is done compositionally (for each $l$) as

    1. $T_l \mid r_l, \sigma^2$,

    2. $\mu_l \mid T_l, r_l, \sigma^2$,

  2. $\sigma^2 \mid r$,

for “residuals” defined as
$$r_l^{(k)} \equiv y - \sum_{l' < l} g(x; T_{l'}^{(k)}, \mu_{l'}^{(k)}) - \sum_{l' > l} g(x; T_{l'}^{(k-1)}, \mu_{l'}^{(k-1)}),$$
where $k$ indexes the Monte Carlo iteration. Update 1(a) is a Metropolis-Hastings update based on the integrated likelihood given in (2). Update 1(b) is a conditionally conjugate Gaussian mean update done separately for each leaf node parameter $\mu_{lb}$, $b = 1, \dots, B_l$. Update 2 is a conditionally conjugate inverse-Gamma update.

Step 1(a) is handled with a random walk as follows. Given a current tree, $T_l$, modifications are proposed and either accepted or rejected according to a likelihood ratio based on (2). Chipman et al. (1998) describe proposals comprising a birth/death pair, in which a birth spawns two children from a given bottom node and a death kills a pair of sibling children; see Pratola (2016) for alternative choices. For example, in a birth move, a variable to split on, as well as a cut-point to split at, are selected uniformly at random from the available splitting rules. Via these simple MH updates, BART stochastically searches through regression models of varying complexity (in terms of tree depth). For “smaller” problems, with dozens of predictors and thousands of observations, this MCMC approach has proven to be remarkably effective; for larger problems, with hundreds of thousands of observations, it does not work well on standard desktops.

In the next section, we present our new stochastic hill climbing algorithm, the accelerated Bayesian additive regression trees heuristic (ABARTH); see Algorithm 2. It follows the Gibbs update framework but replaces the Metropolis-Hastings update of each single tree with a new grow-from-root backfitting strategy; see Algorithm 1.

3 Accelerated BART Model Fitting

3.1 Grow-from-root Backfitting

Rather than making small moves to a given tree $T_l$ at iteration $k$, here we ignore the current tree and grow an entirely new tree from scratch. We grow each tree recursively and stochastically, and the tree-growing process is also terminated stochastically, based on the “residual” data defined above. The pseudo-code is presented in Algorithm 1.

Specifically, at each level of the recursion we consider every available cut-point (decision rule threshold) for each variable (for simplicity, in this paper we consider only continuous predictor variables) and evaluate the integrated likelihood criterion, the exponential of expression (2). We also consider the no-split option, which corresponds to a cut-point outside of the range of the available data. How many such null cut-points to consider is a modeling decision; we default to one such null cut-point per variable. Accordingly, with $C$ available active cut-points and $V$ total variables we perform $V(C+1)$ likelihood evaluations. Each of the active cut-points is weighted by $\alpha(1+d)^{-\beta}$ and the null cut-points weighted by $1 - \alpha(1+d)^{-\beta}$, as per the prior. (Equivalently, the active cut-points are equally weighted and the no-split option is weighted $\frac{1-\alpha(1+d)^{-\beta}}{\alpha(1+d)^{-\beta}}$; an additional multiplier could be used here to encourage or discourage tree growth.) Selection of a variable to split on and a cut-point to split at are then chosen by Bayes rule after normalizing:



$$\pi(v, c) = \frac{\exp(\ell(v, c))}{\sum_{v', c'} \exp(\ell(v', c'))} \qquad (3)$$
with
$$\ell(v, c) = \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n(v,c)_{\le}}\right) + \frac{\tau s(v,c)_{\le}^2}{2\sigma^2(\sigma^2 + \tau n(v,c)_{\le})} + \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n(v,c)_{>}}\right) + \frac{\tau s(v,c)_{>}^2}{2\sigma^2(\sigma^2 + \tau n(v,c)_{>})}$$
for each variable $v$ and active cut-point $c$. Here $n(v,c)_{\le}$ is the number of observations in the current leaf node that have $x_v \le c$ and $s(v,c)_{\le}$ is the sum of the residuals of those same observations; $n(v,c)_{>}$ and $s(v,c)_{>}$ are defined analogously.

For the null cut-points, corresponding to the stop-splitting option, we have instead
$$\ell = \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)},$$
where $n_b$ denotes the number of observations in the current leaf node and $s_b$ denotes the sum over all the current leaf data.
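The grow-from-root split decision can be sketched as sampling one option with probability proportional to the exponentiated log-criterion (the Bayes-rule normalization above); the function name and interface below are hypothetical, and for numerical stability the maximum log-value is subtracted before exponentiating.

```python
import math
import random

def sample_split(log_crits, log_crit_no_split, rng=random):
    """Sample a cut-point index proportional to exp(log-criterion), with the
    no-split option as a final candidate; return None to stop splitting."""
    logs = log_crits + [log_crit_no_split]
    m = max(logs)                               # subtract max for stability
    w = [math.exp(v - m) for v in logs]
    u = rng.random() * sum(w)                   # inverse-CDF draw
    for i, wi in enumerate(w):
        u -= wi
        if u <= 0:
            break
    return i if i < len(log_crits) else None    # None signals a leaf
```

When one candidate dominates the criterion, it is chosen essentially always, but the sampling keeps the search stochastic rather than greedy.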

Using this new tree-growing strategy, we find that different default parameters are advisable. We recommend $L = (\log n)^{\log\log n}$, $\beta = 1.25$, and $\tau = 0.3\,\mathrm{var}(y)/L$. This choice of $L$ is a function that is faster growing than $\log n$, but slower than $\sqrt{n}$, while the lower value of $\beta$ permits deeper trees (than BART’s default $\beta = 2$). Allowing $L$ to grow as a function of the data permits smoother functions to be estimated more accurately as the sample size grows, whereas a sample size-independent choice would be limited in its smoothness by the number of trees. The suggested choice of $\tau$ dictates that a priori the function will account for 30% of the observed variance of the response variable. Finally, while BART must be run for many thousands of iterations with a substantial burn-in period, our default suggestion is just 40 sweeps through the data, discarding the first 15 as burn-in.

3.2 Pre-sorting Features for Efficiency

Observe that the BART criterion depends on the partition sums only. An important implication of this, for computation, is that with sorted predictor variables the various cut-point integrated likelihoods can be computed rapidly via a single sweep through the data (per variable), taking cumulative sums. Let $O$ denote the $V$-by-$n$ array such that $o_{vi}$ denotes the index, in the data, of the observation with the $i$th smallest value of the $v$th predictor variable $x_v$. Then, taking the cumulative sums gives
$$s(v, c_k)_{\le} = \sum_{i \le k} r_{l, o_{vi}}$$
and
$$s(v, c_k)_{>} = \sum_{i=1}^{n} r_{l,i} - s(v, c_k)_{\le},$$
where $c_k$ denotes the $k$th smallest value of the $v$th predictor among the current leaf data. The subscript $l$ on the residual $r_l$ indicates that these evaluations pertain to the update of the $l$th tree.
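The single-sweep computation can be sketched as follows; given one variable's sorted index array and the residual vector, one pass yields the left and right partition sums for every candidate cut-point at once (the function name is hypothetical).

```python
def partition_sums(sorted_idx, resid):
    """For each cut-point position k, return (s_left, s_right): the residual
    sum of the first k observations in sorted order, and of the rest."""
    total = sum(resid)
    sums = []
    acc = 0.0
    for i in sorted_idx:          # one sweep, cumulative sum
        acc += resid[i]
        sums.append((acc, total - acc))
    return sums
```

This is the reason the criterion's dependence on partition sums alone matters: evaluating all $C$ cut-points for a variable costs one pass, not $C$ passes.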

The above formulation is useful if the data can be presorted and, furthermore, the sorting can be maintained at all levels of the recursive tree-growing process. To achieve this, we must “sift” each of the variables before passing to the next level of the recursion. Specifically, we form two new index matrices $O^{\le}$ and $O^{>}$ that partition the data according to the selected split rule. For the selected split variable $v$ and selected split $c$, this is automatic: $O^{\le}_v$ and $O^{>}_v$ are simply the first $k$ and the remaining $n - k$ elements of $O_v$, where $k$ is the rank of the selected cut-point. For the other variables $v' \ne v$, we sift them by looping through all available observations, populating $O^{\le}_{v'}$ and $O^{>}_{v'}$ sequentially, with values $o_{v'i}$ according to whether $x_{o_{v'i}, v} \le c$ or $x_{o_{v'i}, v} > c$, for $i = 1, \dots, n$.

Because the data is processed in sorted order, the ordering will be preserved in each of the new matrices $O^{\le}$ and $O^{>}$. This strategy was first presented in Mehta et al. (1996) in the context of tree classification algorithms.
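The sift step can be sketched in a few lines; `order[k]` below plays the role of the index array $O$ for variable $k$, and because each array is scanned in sorted order, both output arrays stay sorted (names and layout are hypothetical).

```python
def sift(order, X, j, c):
    """Partition per-variable sorted index arrays by the rule x_j <= c.
    order[k] lists row indices of X sorted by variable k; the returned
    left/right arrays preserve that ordering on each side of the split."""
    left, right = [], []
    for idx in order:
        l, r = [], []
        for i in idx:             # sorted scan keeps each side sorted
            (l if X[i][j] <= c else r).append(i)
        left.append(l)
        right.append(r)
    return left, right
```

A single linear pass per variable replaces a fresh O(n log n) sort at every node of the recursion.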

3.3 Recursively Defined Cut-points

Evaluating the integrated likelihood criterion is straightforward, but the summation and normalization required to sample the cut-points contribute a substantial computational burden in their own right. Therefore, it is helpful to consider a restricted number of cut-points $C$. This can simply be achieved by taking every $j$th value (starting from the smallest) as an eligible split point, with the stride $j$ chosen so that at most $C$ cut-points remain. As the tree grows deeper, the amount of data that is skipped over diminishes. Eventually we get $n_b \le C$, and each data point defines a unique cut-point. In this way the data could, without regularization, be fit perfectly, even though the number of cut-points at any given level is given an upper limit. As a default, we set the number of cut-points to $\max(\sqrt{n}, 100)$, where $n$ is the sample size of the entire data set.

Our cut-point subsampling strategy is more naive than the cut-point subselection search heuristics used by XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017), which both consider the gradient evaluated at each cut-point when determining the next split. Our approach does not consider the response information at all, but rather defines a predictor-dependent prior on the response surface. That is, given a design matrix $X$, sample functions can be drawn from the prior distribution by sampling trees, splitting uniformly at random among the cut-points defined by the node-specific quantiles, in a sequential fashion. In further contrast, the proposed method stochastically samples cut-points proportional to the objective function, rather than deterministically maximizing the likelihood-prior. Then, multiple sweeps are made through the data. Rather than greedy (approximate) optimization, like XGBoost and LightGBM, the proposed algorithm performs a stochastic hill climb by coordinate ascent over multiple sweeps through the parameters.

3.4 Sparse Proposal Distribution

As a final modification, we strike an intermediate balance between the local BART updates, which randomly consider one variable at a time, and the all-variables Bayes rule described above. We do this by considering $m \le V$ variables at a time when sampling each splitting rule. Rather than drawing these $m$ variables uniformly at random, as done in random forests, we introduce a parameter vector $w$ which denotes the prior probability that a given variable is chosen to be split on, as suggested in Linero (2016). Before sampling each splitting rule, we randomly select $m$ variables to consider; these variables are sampled sequentially and without replacement, with selection probability proportional to $w$.

The variable weight parameter $w$ is given a Dirichlet prior with hyperparameter $\bar{w}$ set to all ones and subsequently incremented to count the total number of splits across all trees. The split counts are then updated in between each tree sampling/growth step:
$$\bar{w} \leftarrow \bar{w} - \hat{w}_l^{(k-1)} + \hat{w}_l^{(k)},$$
where $\hat{w}_l^{(k)}$ denotes the length-$V$ vector recording the number of splits on each variable in tree $l$ at iteration $k$. The weight parameter is then resampled as $w \sim \mathrm{Dirichlet}(\bar{w})$. Splits that improve the likelihood function will be chosen more often than those that don’t. The parameter $w$ is then updated to reflect that, making chosen variables more likely to be considered in subsequent sweeps. In practice, we find it is helpful to use all $V$ variables during an initialization phase, to more rapidly obtain an accurate initial estimate of $w$.
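The weight resampling step can be sketched with standard-library tools: a Dirichlet draw is obtained by normalizing independent Gamma draws with shape $1 + \text{count}$ per variable (the function name is hypothetical; `random.gammavariate` is used because the stdlib has no Dirichlet sampler).

```python
import random

def sample_variable_weights(split_counts, rng=random):
    """Draw w ~ Dirichlet(1 + split_counts) by normalizing Gamma draws.
    Variables split on more often receive larger expected weight."""
    draws = [rng.gammavariate(1.0 + c, 1.0) for c in split_counts]
    total = sum(draws)
    return [d / total for d in draws]
```

With concentration 1 + counts, a variable that has never been split on keeps a small but nonzero chance of being proposed, so the sparsity adaptation never locks a variable out entirely.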

3.5 The Final Estimator

Given $K$ iterations of the algorithm, the final $K - I$ samples are used to compute a point-wise average function evaluation, where $I$ denotes the length of the burn-in period. As mentioned above, we recommend $K = 40$ and $I = 15$ for routine use. The final estimator is therefore expressible as
$$\bar{f}(x) = \frac{1}{K - I}\sum_{k = I + 1}^{K} f^{(k)}(x),$$
where $f^{(k)}$ denotes a sample of the forest, as in expression (1), drawn by Algorithm 2. We note that this corresponds to the Bayes optimal estimator under mean squared error estimation loss, provided that we have samples from a legitimate posterior distribution. As the grow-from-root strategy is not a proper full conditional (without modification, which we cannot discuss here due to space constraints), this estimator must be considered a greedy stochastic approximation. Nonetheless, the results of the next section strongly suggest that the approximation is adequate.
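The final averaging step is a simple column-wise mean over the post-burn-in sweeps; a sketch with hypothetical names, where `sweep_predictions[k]` holds the forest's predictions at every evaluation point for sweep $k$:

```python
def posterior_mean(sweep_predictions, burn_in):
    """Average the per-sweep prediction vectors after discarding burn_in sweeps,
    yielding the point-wise mean function evaluation."""
    kept = sweep_predictions[burn_in:]
    m = len(kept)
    return [sum(col) / m for col in zip(*kept)]
```

With the recommended defaults this would average 25 of the 40 sweeps (burn-in of 15).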

procedure GROW_FROM_ROOT($r_l$, $X$, $\sigma^2$, $d$) ▷ Fit a tree using data $r_l$ and $X$ by recursion.
output A tree $T_l$ and a vector of split counts $\hat{w}_l$.
      $N \leftarrow$ number of rows of $X$
      Sample $m$ variables using weight $w$ as shown in Section 3.4.
      Select $C$ cut-points as shown in Section 3.3.
      Evaluate the candidate cut-points and the no-split option with equation (3).
      Sample one cut-point proportional to equation (3).
      if the no-split option is sampled then return
      else
          Add a count of the selected split variable to $\hat{w}_l$.
          Split the data into left and right nodes.
          GROW_FROM_ROOT($r_l^{\text{left}}$, $X^{\text{left}}$, $\sigma^2$, $d+1$)
          GROW_FROM_ROOT($r_l^{\text{right}}$, $X^{\text{right}}$, $\sigma^2$, $d+1$)
Algorithm 1 Grow-from-root backfitting
procedure ABARTH($y$, $X$, $K$, $L$, $I$, $\alpha$, $\beta$, $\tau$, $a$, $b$) ▷ ($a$, $b$ are prior parameters of $\sigma^2$)
output Samples of the forest
      $V \leftarrow$ number of columns of $X$
      $N \leftarrow$ number of rows of $X$
      Initialize $r_l^{(0)}$.
      for $k$ in 1 to $K$ do
          for $l$ in 1 to $L$ do
               Calculate the residual $r_l^{(k)}$ as shown in Section 2.3.
               if $k < I$ then
                   GROW_FROM_ROOT($r_l^{(k)}$, $X$, $\sigma^2$, 0) ▷ use all variables in burn-in iterations
               else
                   GROW_FROM_ROOT($r_l^{(k)}$, $X$, $\sigma^2$, 0)
               Update $\bar{w}$ with the split counts of the current tree.
Algorithm 2 Accelerated Bayesian Additive Regression Trees (ABARTH)

4 Simulation Studies

4.1 Data Generating Process

To demonstrate the performance of the new accelerated BART heuristic, which we call ABARTH, we estimate function evaluations with a hold-out set that is a quarter of the training sample size and judge accuracy according to root mean squared error (RMSE). We consider four different challenging functions, $f$, as defined in Table 1. In all cases, the covariates $x_{ij}$ are drawn independently for $j = 1, \dots, d$. The data is generated according to
$$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2)$$
for $i = 1, \dots, n$. We consider two noise levels, indexed by $\kappa = 1$ (low) and $\kappa = 10$ (high).

Name Function
Linear ; 
Single index ; ;  .
Trig + poly
Table 1: Four true functions

4.2 Methods

We compare to leading tree-based machine learning algorithms: random forests, gradient boosting machines, and BART MCMC. All implementations had an R interface and were the current fastest implementations to our knowledge: ranger (Wright and Ziegler, 2015), xgboost (Chen and Guestrin, 2016), and dbarts, respectively. For ranger and dbarts we use the software defaults. For xgboost we consider two specifications, one using the software defaults and another determined by 5-fold cross-validated grid optimization (see Table 2); a reduced grid of parameter values was used at the larger sample sizes.

Parameter name K K
subsample 0.8 0.8
gamma 0.1 0.1
Table 2: Hyperparameter Grid for XGBoost

4.3 Computation

The simulations were computed on Ubuntu 18.04.1 LTS with an Intel i7-8700K hexa-core 3.7GHz processor (12MB cache, 64-bit, 4.3GHz overclocking speed).

The software used was R version 3.4.4 with xgboost 0.71.2, dbarts version 0.9.1, and ranger 0.10.1. The default hyperparameters for XGBoost are eta = 0.3, colsample_bytree = 1, min_child_weight = 1, and max_depth = 6. Ranger was fit with the defaults num.trees = 500 and mtry = $\lfloor\sqrt{V}\rfloor$. BART, with the package dbarts, was fit with the defaults of ntrees = 200, alpha = 0.95, beta = 2, with a burn-in of 5,000 samples (nskip = 5000) and 2,000 retained posterior samples (ndpost = 2000). The default dbarts algorithm uses an evenly spaced grid of 100 cut-point candidates along the observed range of each variable (numcuts = 100, usequants = FALSE).

4.4 Results

The performance of the new ABARTH algorithm was even better than anticipated, showing superior speed and performance relative to all the considered alternatives on essentially every data generating process. The full results, averaged across five Monte Carlo replications, are reported in Table 3, where AB, cv, and xgb denote ABARTH, XGBoost with parameters tuned by cross-validation, and XGBoost with default parameters, respectively. Across all data generating processes and sample sizes, ABARTH was 31% more accurate than the cross-validated XGBoost method and nearly 10 times faster. The ABARTH method was 8 times slower than the untuned default XGBoost method, but was 3.5 times more accurate. This pattern points to one of the main benefits of the proposed method: it has excellent performance using the same hyperparameter settings across all data generating processes. Importantly, these default hyperparameter settings were decided on the basis of prior elicitation experiments using different true functions than were used in the reported simulations. While XGBoost is quite fast, the tuning process is left to the user and can increase the total computational burden by orders of magnitude.

Random forests and traditional MCMC BART were prohibitively slow at larger sample sizes. However, at the smaller sample sizes several notable patterns did emerge. First, BART and ABARTH typically gave very similar results, as would be expected. BART performed slightly better in the low noise setting and quite a bit worse in the high noise setting (likely due to an inadequate burn-in period). Similarly, random forests do well in higher noise settings, while XGBoost performs better in lower noise settings.

RMSE Seconds
Linear
n κ AB cv xgb AB cv xgb
10K 1 0.69 1.07 1.29 3 26
10K 10 2.14 3.26 8.36 1 24
50K 1 0.38 0.80 1.03 33 57 1
50K 10 1.52 2.21 6.49 9 56 1
250K 1 0.23 0.61 0.80 360 510 15
250K 10 0.94 1.28 4.62 87 495 15
Single index
n κ AB cv xgb AB cv xgb
10K 1 0.89 1.05 1.44 2 24
10K 10 2.97 4.06 11.32 1 24
50K 1 0.63 0.64 1.13 22 56 1
50K 10 1.75 2.80 8.56 9 56 1
250K 1 0.46 0.48 0.86 245 516 15
250K 10 1.23 1.63 5.99 83 532 26
Trig + poly
n κ AB cv xgb AB cv xgb
10K 1 0.53 0.88 1.01 2 25
10K 10 2.05 2.88 7.03 1 24
50K 1 0.33 0.52 0.70 17 56 1
50K 10 1.31 1.90 5.28 9 56 1
250K 1 0.20 0.33 0.48 146 479 15
250K 10 0.80 1.34 3.79 85 515 15
Max
n κ AB cv xgb AB cv xgb
10K 1 0.16 0.17 2.90 2 25
10K 10 0.74 1.11 2.90 1 24
50K 1 0.11 0.12 0.24 13 56 1
50K 10 0.53 0.75 2.23 9 56 1
250K 1 0.06 0.08 0.16 110 535 21
250K 10 0.28 0.42 1.54 86 531 21
Table 3: Simulation comparison with XGBoost.
Function Method RMSE Seconds
Linear ABARTH 2.28 1
Linear XGBoost Tuned 3.21 24
Linear XGBoost Untuned 8.61
Linear Random Forest 2.50 2
Linear BART 2.74 41
Trig + Poly ABARTH 2.05 1
Trig + Poly XGBoost Tuned 2.75 24
Trig + Poly XGBoost Untuned 7.03
Trig + Poly Random Forest 2.44 2
Trig + Poly BART 2.47 41
Single Index ABARTH 2.96 1
Single Index XGBoost Tuned 4.41 24
Single Index XGBoost Untuned 11.45
Single Index Random Forest 3.68 2
Single Index BART 3.99 42
Max ABARTH 0.76 1
Max XGBoost Tuned 1.08 24
Max XGBoost Untuned 2.88
Max Random Forest 0.95 2
Max BART 0.91 41
Table 4: High noise (κ = 10, n = 10K)
Function Method RMSE Seconds
Linear ABARTH 0.65 3
Linear XGBoost Tuned 1.08 25
Linear XGBoost Untuned 1.25
Linear Random Forest 1.40 2
Linear BART 0.58 44
Trig + Poly ABARTH 0.55 2
Trig + Poly XGBoost Tuned 0.87 24
Trig + Poly XGBoost Untuned 1.02
Trig + Poly Random Forest 1.18 2
Trig + Poly BART 0.52 44
Single Index ABARTH 0.93 2
Single Index XGBoost Tuned 1.07 24
Single Index XGBoost Untuned 1.47
Single Index Random Forest 1.41 2
Single Index BART 0.86 43
Max ABARTH 0.17 2
Max XGBoost Tuned 0.17 24
Max XGBoost Untuned 0.31
Max Random Forest 0.16 2
Max BART 0.18 42
Table 5: Low noise (κ = 1, n = 10K)

5 Discussion

The grow-from-root strategy proposed here opens the door for computational innovations to be married to the novel BART stochastic fitting algorithm. Further, the proposed adaptive cut-points and variable selection proposal together define a novel predictor-dependent prior, marking a distinct Bayesian model. The simulation studies clearly demonstrate the beneficial synergy realized by the proposed approach: ABARTH is a state-of-the-art nonlinear regression method with computational demands that are competitive with the current fastest alternatives. In particular, the excellent performance without the need to cross-validate recommends ABARTH as a suitable default method for function estimation and prediction tasks when little is known about the response surface.

The source of ABARTH’s superior performance is not entirely clear, but preliminary investigations point to two important factors. One, the BART splitting criterion involves (the current estimate of) the error standard deviation, $\sigma$, meaning that it is adaptively regularizing within the model fitting process. Two, we conjecture that the stochastic nature of the algorithm leads to better exploration of the parameter space than iterative optimizers. With fast model fitting software now in hand, this issue can be investigated more systematically in future work.


  • Breiman (1997) Breiman, L. (1997). Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Olshen, R., and Stone, C. J. (1984). Classification and regression trees. Chapman and Hall/CRC.
  • Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.
  • Chipman et al. (1998) Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.
  • Chipman et al. (2010) Chipman, H. A., George, E. I., McCulloch, R. E., et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
  • Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
  • Friedman (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.
  • Ke et al. (2017) Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154.
  • Linero (2016) Linero, A. R. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association, (just-accepted).
  • Mehta et al. (1996) Mehta, M., Agrawal, R., and Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In International Conference on Extending Database Technology, pages 18–32. Springer.
  • Pratola (2016) Pratola, M. (2016). Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3):885–911.
  • Wright and Ziegler (2015) Wright, M. N. and Ziegler, A. (2015). ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409.