Accelerated Bayesian Additive Regression Trees

by Jingyu He, et al.

Although less widely known than random forests or boosted regression trees, Bayesian additive regression trees (BART) (Chipman et al., 2010) is a powerful predictive model that often outperforms those better-known alternatives at out-of-sample prediction. BART is especially well-suited to settings with unstructured predictor variables and substantial sources of unmeasured variation, as is typical in the social, behavioral, and health sciences. This paper develops a modified version of BART that is amenable to fast posterior estimation. We present a stochastic hill climbing algorithm that matches the remarkable predictive accuracy of previous BART implementations, but is orders of magnitude faster and uses a fraction of the memory. Simulation studies show that the new method is comparable in computation time and more accurate at function estimation than both random forests and gradient boosting.





1 Introduction

Tree-based regression methods — CART (Breiman et al., 1984), random forests (Breiman, 2001), and gradient boosting (Breiman, 1997; Friedman, 2001, 2002) — are highly successful and widely used for supervised learning. Bayesian additive regression trees — or BART — is a closely related but less well-known method that often achieves superior prediction/estimation accuracy. The “Bayesian CART” (single-tree) model was introduced in Chipman et al. (1998) and the BART model first appeared in Chipman et al. (2010), although software was publicly available as early as 2006. Contrary to common perception, BART is not merely a version of random forests or boosted regression trees in which prior distributions have been placed over model parameters. Instead, the Bayesian perspective leads to a fundamentally new tree-growing criterion and algorithm, which yield a number of practical advantages: robustness to the choice of user-selected tuning parameters, more accurate predictions, and a natural Bayesian measure of uncertainty.

Despite these virtues, BART’s wider adoption has been slowed by its more severe computational demands relative to alternatives, owing to its reliance on a random walk Metropolis-Hastings Markov chain Monte Carlo approach. The current fastest implementation, the R package dbarts, takes orders of magnitude longer than the widely-used R package xgboost, for example. This paper develops a variant of BART that is amenable to fast posterior estimation, making it almost as fast as xgboost, while still retaining BART’s hyperparameter robustness and remarkable predictive accuracy.

First, we describe the BART model to motivate our computational innovations. We derive the BART model’s tree-growing criterion, which is notably different from the traditional sum-of-squares criterion used by other methods. We then describe the new algorithm, the accelerated Bayesian additive regression trees heuristic (ABARTH), and illustrate its impact on fast, accurate statistical prediction. Specifically, we compare the new method’s performance to random forests, boosted regression trees, and the standard MCMC implementations of BART.

2 BART in Detail

2.1 The Model: Likelihood and Prior

The BART model is an additive error mean regression model
$$y_i = f(x_i) + \epsilon_i,$$
where the $\epsilon_i$ are assumed to be independent mean-zero Gaussians, $\epsilon_i \sim \mathrm{N}(0, \sigma^2)$, and $f$ is an unknown function. The BART prior represents the unknown function as a sum of many piecewise constant binary regression trees:
$$f(x) = \sum_{l=1}^{L} g(x; T_l, \mu_l), \qquad (1)$$
where $T_l$ denotes a regression tree and $\mu_l$ denotes a vector of scalar means associated with the leaf nodes of $T_l$. Each tree $T_l$ consists of a set of internal decision nodes which define a partition of the covariate space (say $\mathcal{A}_1, \dots, \mathcal{A}_{B_l}$), as well as a set of terminal nodes or leaves corresponding to each element of the partition. Further, each element of the partition $\mathcal{A}_b$ is associated with a parameter value, $\mu_{lb}$. Taken together, the partition and the leaf parameters define a piecewise constant function: $g(x; T_l, \mu_l) = \mu_{lb}$ if $x \in \mathcal{A}_b$; see Figure 1.







Figure 1: (Left) An example binary tree, with internal nodes labelled by their splitting rules and terminal nodes labelled with the corresponding parameters $\mu_{lb}$. (Right) The corresponding partition of the sample space and the step function.

The tree prior $p(T_l)$ is specified by three components: (i) the probability of a node having children at depth $d$, $\alpha(1+d)^{-\beta}$, with $\alpha \in (0,1)$ and $\beta \in [0, \infty)$, (ii) the uniform distribution over available predictors for splitting rule assignment at each interior node, and (iii) the uniform distribution on the discrete set of available splitting values for the assigned predictor at each interior node. This last choice has the appeal of invariance under monotone transformations of the predictors.

Chipman et al. (2010) recommend $\alpha = 0.95$ and $\beta = 2$ to enforce small trees. Finally, the leaf mean parameters $\mu_{lb}$ are assigned independent mean-zero normal priors: $\mu_{lb} \sim \mathrm{N}(0, \tau)$. The parameter $\tau$ is a crucial regularization parameter; the pointwise prior variance of $f$ is $L\tau$.
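The depth-penalized split probability in component (i) is simple to compute; a one-line sketch, using the Chipman et al. (2010) defaults ($\alpha = 0.95$, $\beta = 2$):

```python
def split_prob(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is non-terminal,
    alpha * (1 + depth)^(-beta); decreasing in depth, so deep trees are rare."""
    return alpha * (1.0 + depth) ** (-beta)
```

For example, the root splits with prior probability 0.95, while a node at depth 3 splits with probability below 0.06, which is what keeps the individual trees small.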

2.2 The BART Splitting Criterion

By viewing the model as a data generating process, the Bayesian vantage point motivates modifications to the usual splitting criterion. Because the model stipulates that observations in the same leaf node share the same mean parameter, the prior predictive distribution — obtained by integrating out the unknown group specific mean — is simply a mean-zero multivariate normal distribution with covariance matrix
$$V = \tau J J^t + \sigma^2 I_n,$$
where $\tau$ is the prior variance of the leaf-specific mean parameter, $\sigma^2$ is the variance of the additive error, and $J$ is a column vector of all ones. Observe that the prior predictive density of $y \sim \mathrm{N}(0, V)$ is
$$p(y) = (2\pi)^{-n/2} \det(V)^{-1/2} \exp\left(-\tfrac{1}{2} y^t V^{-1} y\right),$$
which can be simplified by a direct application of the matrix inversion lemma to $V^{-1}$:
$$V^{-1} = \sigma^{-2}\left(I_n - \frac{\tau}{\sigma^2 + \tau n} J J^t\right).$$
Applying Sylvester’s determinant theorem to $\det(V)$ yields
$$\det(V) = \sigma^{2(n-1)}(\sigma^2 + \tau n).$$
Taking logarithms yields a marginal log-likelihood of
$$\log p(y) = -\frac{n}{2}\log(2\pi) - \frac{n-1}{2}\log \sigma^2 - \frac{1}{2}\log(\sigma^2 + \tau n) - \frac{y^t y}{2\sigma^2} + \frac{\tau s^2}{2\sigma^2(\sigma^2 + \tau n)},$$
where we write $s = \sum_i y_i$ so that $s^2 = y^t J J^t y$. This likelihood applies separately to the partitions of the data corresponding to the leaves of a single fixed regression tree. Because observations in different leaf nodes are independent (conditional on $\sigma^2$), the full marginal log-likelihood is given by
$$\log p(y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{y^t y}{2\sigma^2} + \sum_{b=1}^{B}\left\{\frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)}\right\},$$
where $b$ runs over all the leaf nodes, $n_b$ is the number of observations in leaf $b$, and $s_b = \sum_{i \in b} y_i$. Notice that the first three terms are not functions of the partition (the tree parameter), so they are constant, leaving
$$\sum_{b=1}^{B}\left\{\frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)}\right\} \qquad (2)$$
as the model-based split criterion, where $n_b$ and $s_b$ are functions of the data and the tree $T$.
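As a sketch, the split criterion of Section 2.2 depends on each leaf only through its size $n_b$ and residual sum $s_b$; the Python below assumes the standard conjugate form (leaf prior variance `tau`, error variance `sigma2`), and the function names are hypothetical.

```python
import math

def leaf_log_marginal(n, s, sigma2, tau):
    """Per-leaf contribution to the marginal log-likelihood, up to constants:
    0.5*log(sigma^2/(sigma^2 + tau*n)) + tau*s^2 / (2*sigma^2*(sigma^2 + tau*n))."""
    v = sigma2 + tau * n
    return 0.5 * math.log(sigma2 / v) + tau * s * s / (2.0 * sigma2 * v)

def split_log_criterion(n_left, s_left, n_right, s_right, sigma2, tau):
    """A candidate split scores the sum of its two children's contributions.
    Splits that concentrate residual mass (large |s|) into leaves score higher."""
    return (leaf_log_marginal(n_left, s_left, sigma2, tau)
            + leaf_log_marginal(n_right, s_right, sigma2, tau))
```

A split separating residuals of opposite sign beats one that leaves both children with zero-sum residuals, which is the behavior the criterion is designed to reward.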


2.3 The BART MCMC

The basic BART MCMC proceeds as a Metropolis-within-Gibbs algorithm, with the key update of the individual regression trees being conducted as a local random walk Metropolis-Hastings (MH) update, given all of the other trees as well as the residual variance parameter, $\sigma^2$. Let $T = \{T_l\}_{l=1}^{L}$ denote the set of trees and $M = \{\mu_l\}_{l=1}^{L}$ denote the set of leaf parameter vectors. Recall that $f(x) = \sum_{l=1}^{L} g(x; T_l, \mu_l)$, and each $\mu_l$ is length $B_l$.

The sequence of Gibbs updates is

  1. $T_l, \mu_l \mid T_{-l}, M_{-l}, \sigma^2, y$, for $l = 1, \dots, L$, which is done compositionally (for each $l$) as

    1. $T_l \mid T_{-l}, M_{-l}, \sigma^2, y$,

    2. $\mu_l \mid T, M_{-l}, \sigma^2, y$,

  2. $\sigma^2 \mid T, M, y$.

Taking advantage of the additive structure of the model, these updates can be written as

  1. $T_l, \mu_l \mid r_l, \sigma^2$, for $l = 1, \dots, L$, which is done compositionally (for each $l$) as

    1. $T_l \mid r_l, \sigma^2$,

    2. $\mu_l \mid T_l, r_l, \sigma^2$,

  2. $\sigma^2 \mid r$,

for “residuals” defined as
$$r_l^{(k)} \equiv y - \sum_{l' < l} g(x; T_{l'}^{(k)}, \mu_{l'}^{(k)}) - \sum_{l' > l} g(x; T_{l'}^{(k-1)}, \mu_{l'}^{(k-1)}),$$
where $k$ indexes the Monte Carlo iteration. Update 1(a) is a Metropolis-Hastings update based on the integrated likelihood given in (2). Update 1(b) is a conditionally conjugate Gaussian mean update done separately for each leaf node parameter $\mu_{lb}$, $b = 1, \dots, B_l$. Update 2 is a conditionally conjugate inverse-Gamma update.

Step 1(a) is handled with a random walk as follows. Given a current tree, $T_l$, modifications are proposed and either accepted or rejected according to a likelihood ratio based on (2). Chipman et al. (1998) describe proposals comprising a birth/death pair, in which a birth spawns two children from a given bottom node and a death kills a pair of sibling children; see Pratola (2016) for alternative choices. For example, in a birth move, a variable to split on, as well as a cut-point to split at, are selected uniformly at random from the available splitting rules. Via these simple MH updates, BART stochastically searches through regression models of varying complexity (in terms of tree depth). For “smaller” problems, with dozens of predictors and thousands of observations, this MCMC approach has proven to be remarkably effective; for larger problems, with hundreds of thousands of observations, it does not work well on standard desktops.

In the next section, we present our new stochastic hill climbing algorithm, the accelerated Bayesian additive regression trees heuristic (ABARTH); see Algorithm 2. It follows the Gibbs update framework but replaces the Metropolis-Hastings update of each single tree with a new grow-from-root backfitting strategy; see Algorithm 1.

3 Accelerated BART Model Fitting

3.1 Grow-from-root Backfitting

Rather than making small moves to a given tree $T_l$ at iteration $k$, here we ignore the current tree and grow an entirely new tree from scratch. We grow each tree recursively and stochastically, and the tree-growing process is also terminated stochastically, based on the “residual” data defined above. The pseudo-code is presented in Algorithm 1.

Specifically, at each level of the recursion we consider every available cut-point (decision rule threshold) for each variable (for simplicity, in this paper we consider only continuous predictor variables) and evaluate the integrated likelihood criterion, the exponential of expression (2). We also consider the no-split option, which corresponds to a cut-point outside of the range of the available data. How many such null cut-points to consider is a modeling decision; we default to one such null cut-point per variable. Accordingly, with $C$ available active cut-points and $V$ total variables we perform $V(C+1)$ likelihood evaluations. Each of the active cut-points is weighted by $\alpha(1+d)^{-\beta}$ and the null cut-points weighted by $1 - \alpha(1+d)^{-\beta}$, as per the prior. (Equivalently, the active cut-points are equally weighted and the no-split option is weighted $\frac{1-\alpha(1+d)^{-\beta}}{\alpha(1+d)^{-\beta}}$; an additional multiplier could be used here to encourage or discourage tree growth.) Selection of a variable to split on and a cut-point to split at are then chosen by Bayes rule after normalizing:



$$\pi(v, c) = \frac{\exp(\ell(v, c))}{\sum_{v', c'} \exp(\ell(v', c'))} \qquad (3)$$
with
$$\ell(v, c) = \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n(v,c)_{\le}}\right) + \frac{\tau s(v,c)_{\le}^2}{2\sigma^2(\sigma^2 + \tau n(v,c)_{\le})} + \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n(v,c)_{>}}\right) + \frac{\tau s(v,c)_{>}^2}{2\sigma^2(\sigma^2 + \tau n(v,c)_{>})}$$
for each variable $v$ and active cut-point $c$. Here $n(v,c)_{\le}$ is the number of observations in the current leaf node that have $x_v \le c$ and $s(v,c)_{\le}$ is the sum of the residuals of those same observations; $n(v,c)_{>}$ and $s(v,c)_{>}$ are defined analogously.

For the null cut-points, corresponding to the stop-splitting option, we have instead
$$\ell = \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma^2 + \tau n_b}\right) + \frac{\tau s_b^2}{2\sigma^2(\sigma^2 + \tau n_b)},$$
where $n_b$ denotes the number of observations in the current leaf node and $s_b$ denotes the sum over all the current leaf data.
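The grow-from-root split decision can be sketched as sampling one option with probability proportional to the exponentiated log-criterion (the Bayes-rule normalization above); the function name and interface below are hypothetical, and for numerical stability the maximum log-value is subtracted before exponentiating.

```python
import math
import random

def sample_split(log_crits, log_crit_no_split, rng=random):
    """Sample a cut-point index proportional to exp(log-criterion), with the
    no-split option as a final candidate; return None to stop splitting."""
    logs = log_crits + [log_crit_no_split]
    m = max(logs)                               # subtract max for stability
    w = [math.exp(v - m) for v in logs]
    u = rng.random() * sum(w)                   # inverse-CDF draw
    for i, wi in enumerate(w):
        u -= wi
        if u <= 0:
            break
    return i if i < len(log_crits) else None    # None signals a leaf
```

When one candidate dominates the criterion, it is chosen essentially always, but the sampling keeps the search stochastic rather than greedy.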

Using this new tree-growing strategy, we find that different default parameters are advisable. We recommend $L = (\log n)^{\log\log n}$, $\beta = 1.25$, and $\tau = 0.3\,\mathrm{var}(y)/L$. This choice of $L$ is a function that is faster growing than $\log n$, but slower than $\sqrt{n}$, while the lower value of $\beta$ permits deeper trees (than BART’s default $\beta = 2$). Allowing $L$ to grow as a function of the data permits smoother functions to be estimated more accurately as the sample size grows, whereas a sample size-independent choice would be limited in its smoothness by the number of trees. The suggested choice of $\tau$ dictates that a priori the function will account for 30% of the observed variance of the response variable. Finally, while BART must be run for many thousands of iterations with a substantial burn-in period, our default suggestion is just 40 sweeps through the data, discarding the first 15 as burn-in.

3.2 Pre-sorting Features for Efficiency

Observe that the BART criterion depends on the partition sums only. An important implication of this, for computation, is that with sorted predictor variables the various cut-point integrated likelihoods can be computed rapidly via a single sweep through the data (per variable), taking cumulative sums. Let $O$ denote the $V$-by-$n$ array such that $o_{vi}$ denotes the index, in the data, of the observation with the $i$th smallest value of the $v$th predictor variable $x_v$. Then, taking the cumulative sums gives
$$s(v, c_k)_{\le} = \sum_{i \le k} r_{l, o_{vi}}$$
and
$$s(v, c_k)_{>} = \sum_{i=1}^{n} r_{l,i} - s(v, c_k)_{\le},$$
where $c_k$ denotes the $k$th smallest value of the $v$th predictor among the current leaf data. The subscript $l$ on the residual $r_l$ indicates that these evaluations pertain to the update of the $l$th tree.
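The single-sweep computation can be sketched as follows; given one variable's sorted index array and the residual vector, one pass yields the left and right partition sums for every candidate cut-point at once (the function name is hypothetical).

```python
def partition_sums(sorted_idx, resid):
    """For each cut-point position k, return (s_left, s_right): the residual
    sum of the first k observations in sorted order, and of the rest."""
    total = sum(resid)
    sums = []
    acc = 0.0
    for i in sorted_idx:          # one sweep, cumulative sum
        acc += resid[i]
        sums.append((acc, total - acc))
    return sums
```

This is the reason the criterion's dependence on partition sums alone matters: evaluating all $C$ cut-points for a variable costs one pass, not $C$ passes.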

The above formulation is useful if the data can be presorted and, furthermore, the sorting can be maintained at all levels of the recursive tree-growing process. To achieve this, we must “sift” each of the variables before passing to the next level of the recursion. Specifically, we form two new index matrices $O^{\le}$ and $O^{>}$ that partition the data according to the selected split rule. For the selected split variable $v$ and selected split $c$, this is automatic: $O^{\le}_v$ and $O^{>}_v$ are simply the first $k$ and the remaining $n - k$ elements of $O_v$, where $k$ is the rank of the selected cut-point. For the other variables $v' \ne v$, we sift them by looping through all available observations, populating $O^{\le}_{v'}$ and $O^{>}_{v'}$ sequentially, with values $o_{v'i}$ according to whether $x_{o_{v'i}, v} \le c$ or $x_{o_{v'i}, v} > c$, for $i = 1, \dots, n$.

Because the data is processed in sorted order, the ordering will be preserved in each of the new matrices $O^{\le}$ and $O^{>}$. This strategy was first presented in Mehta et al. (1996) in the context of tree classification algorithms.
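The sift step can be sketched in a few lines; `order[k]` below plays the role of the index array $O$ for variable $k$, and because each array is scanned in sorted order, both output arrays stay sorted (names and layout are hypothetical).

```python
def sift(order, X, j, c):
    """Partition per-variable sorted index arrays by the rule x_j <= c.
    order[k] lists row indices of X sorted by variable k; the returned
    left/right arrays preserve that ordering on each side of the split."""
    left, right = [], []
    for idx in order:
        l, r = [], []
        for i in idx:             # sorted scan keeps each side sorted
            (l if X[i][j] <= c else r).append(i)
        left.append(l)
        right.append(r)
    return left, right
```

A single linear pass per variable replaces a fresh O(n log n) sort at every node of the recursion.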

3.3 Recursively Defined Cut-points

Evaluating the integrated likelihood criterion is straightforward, but the summation and normalization required to sample the cut-points contribute a substantial computational burden in their own right. Therefore, it is helpful to consider a restricted number of cut-points $C$. This can simply be achieved by taking every $j$th value (starting from the smallest) as an eligible split point, with the stride $j$ chosen so that at most $C$ cut-points remain. As the tree grows deeper, the amount of data that is skipped over diminishes. Eventually we get $n_b \le C$, and each data point defines a unique cut-point. In this way the data could, without regularization, be fit perfectly, even though the number of cut-points at any given level is given an upper limit. As a default, we set the number of cut-points to $\max(\sqrt{n}, 100)$, where $n$ is the sample size of the entire data set.

Our cut-point subsampling strategy is more naive than the cut-point subselection search heuristics used by XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017), which both consider the gradient evaluated at each cut-point when determining the next split. Our approach does not consider the response information at all, but rather defines a predictor-dependent prior on the response surface. That is, given a design matrix $X$, sample functions can be drawn from the prior distribution by sampling trees, splitting uniformly at random among the cut-points defined by the node-specific quantiles, in a sequential fashion. In further contrast, the proposed method stochastically samples cut-points proportional to the objective function, rather than deterministically maximizing the likelihood-prior. Then, multiple sweeps are made through the data. Rather than greedy (approximate) optimization, like XGBoost and LightGBM, the proposed algorithm performs a stochastic hill climb by coordinate ascent over multiple sweeps through the parameters.

3.4 Sparse Proposal Distribution

As a final modification, we strike an intermediate balance between the local BART updates, which randomly consider one variable at a time, and the all-variables Bayes rule described above. We do this by considering $m \le V$ variables at a time when sampling each splitting rule. Rather than drawing these $m$ variables uniformly at random, as done in random forests, we introduce a parameter vector $w$ which denotes the prior probability that a given variable is chosen to be split on, as suggested in Linero (2016). Before sampling each splitting rule, we randomly select $m$ variables to consider; these variables are sampled sequentially and without replacement, with selection probability proportional to $w$.

The variable weight parameter $w$ is given a Dirichlet prior with hyperparameter $\bar{w}$ set to all ones and subsequently incremented to count the total number of splits across all trees. The split counts are then updated in between each tree sampling/growth step:
$$\bar{w} \leftarrow \bar{w} - \hat{w}_l^{(k-1)} + \hat{w}_l^{(k)},$$
where $\hat{w}_l^{(k)}$ denotes the length-$V$ vector recording the number of splits on each variable in tree $l$ at iteration $k$. The weight parameter is then resampled as $w \sim \mathrm{Dirichlet}(\bar{w})$. Splits that improve the likelihood function will be chosen more often than those that don’t. The parameter $w$ is then updated to reflect that, making chosen variables more likely to be considered in subsequent sweeps. In practice, we find it is helpful to use all $V$ variables during an initialization phase, to more rapidly obtain an accurate initial estimate of $w$.
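The weight resampling step can be sketched with standard-library tools: a Dirichlet draw is obtained by normalizing independent Gamma draws with shape $1 + \text{count}$ per variable (the function name is hypothetical; `random.gammavariate` is used because the stdlib has no Dirichlet sampler).

```python
import random

def sample_variable_weights(split_counts, rng=random):
    """Draw w ~ Dirichlet(1 + split_counts) by normalizing Gamma draws.
    Variables split on more often receive larger expected weight."""
    draws = [rng.gammavariate(1.0 + c, 1.0) for c in split_counts]
    total = sum(draws)
    return [d / total for d in draws]
```

With concentration 1 + counts, a variable that has never been split on keeps a small but nonzero chance of being proposed, so the sparsity adaptation never locks a variable out entirely.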

3.5 The Final Estimator

Given $K$ iterations of the algorithm, the final $K - I$ samples are used to compute a point-wise average function evaluation, where $I$ denotes the length of the burn-in period. As mentioned above, we recommend $K = 40$ and $I = 15$ for routine use. The final estimator is therefore expressible as
$$\bar{f}(x) = \frac{1}{K - I}\sum_{k = I + 1}^{K} f^{(k)}(x),$$
where $f^{(k)}$ denotes a sample of the forest, as in expression (1), drawn by Algorithm 2. We note that this corresponds to the Bayes optimal estimator under mean squared error estimation loss, provided that we have samples from a legitimate posterior distribution. As the grow-from-root strategy is not a proper full conditional (without modification, which we cannot discuss here due to space constraints), this estimator must be considered a greedy stochastic approximation. Nonetheless, the results of the next section strongly suggest that the approximation is adequate.
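The final averaging step is a simple column-wise mean over the post-burn-in sweeps; a sketch with hypothetical names, where `sweep_predictions[k]` holds the forest's predictions at every evaluation point for sweep $k$:

```python
def posterior_mean(sweep_predictions, burn_in):
    """Average the per-sweep prediction vectors after discarding burn_in sweeps,
    yielding the point-wise mean function evaluation."""
    kept = sweep_predictions[burn_in:]
    m = len(kept)
    return [sum(col) / m for col in zip(*kept)]
```

With the recommended defaults this would average 25 of the 40 sweeps (burn-in of 15).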

procedure GROW_FROM_ROOT($r_l$, $X$, $\sigma^2$, $d$) ▷ Fit a tree using data $r_l$ and $X$ by recursion.
output A tree $T_l$ and a vector of split counts $\hat{w}_l$.
      $N \leftarrow$ number of rows of $X$
      Sample $m$ variables using weight $w$ as shown in Section 3.4.
      Select $C$ cut-points as shown in Section 3.3.
      Evaluate the candidate cut-points and the no-split option with equation (3).
      Sample one cut-point proportional to equation (3).
      if the no-split option is sampled then return
      else
          Add a count of the selected split variable to $\hat{w}_l$.
          Split the data into left and right nodes.
          GROW_FROM_ROOT($r_l^{\text{left}}$, $X^{\text{left}}$, $\sigma^2$, $d+1$)
          GROW_FROM_ROOT($r_l^{\text{right}}$, $X^{\text{right}}$, $\sigma^2$, $d+1$)
Algorithm 1 Grow-from-root backfitting
procedure ABARTH($y$, $X$, $K$, $L$, $I$, $\alpha$, $\beta$, $\tau$, $a$, $b$) ▷ ($a$, $b$ are prior parameters of $\sigma^2$)
output Samples of the forest
      $V \leftarrow$ number of columns of $X$
      $N \leftarrow$ number of rows of $X$
      Initialize $r_l^{(0)}$.
      for $k$ in 1 to $K$ do
          for $l$ in 1 to $L$ do
               Calculate the residual $r_l^{(k)}$ as shown in Section 2.3.
               if $k < I$ then
                   GROW_FROM_ROOT($r_l^{(k)}$, $X$, $\sigma^2$, 0) ▷ use all variables in burn-in iterations
               else
                   GROW_FROM_ROOT($r_l^{(k)}$, $X$, $\sigma^2$, 0)
               Update $\bar{w}$ with the split counts of the current tree.
Algorithm 2 Accelerated Bayesian Additive Regression Trees (ABARTH)

4 Simulation Studies

4.1 Data Generating Process

To demonstrate the performance of the new accelerated BART heuristic, which we call ABARTH, we estimate function evaluations with a hold-out set that is a quarter of the training sample size and judge accuracy according to root mean squared error (RMSE). We consider four different challenging functions, $f$, as defined in Table 1. In all cases, the covariates $x_{ij}$ are drawn independently for $j = 1, \dots, d$. The data is generated according to
$$y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2)$$
for $i = 1, \dots, n$. We consider two noise levels, indexed by $\kappa = 1$ (low) and $\kappa = 10$ (high).

Name Function
Linear ; 
Single index ; ;  .
Trig + poly
Table 1: Four true functions

4.2 Methods

We compare to leading tree-based machine learning algorithms: random forests, gradient boosting machines, and BART MCMC. All implementations had an R interface and were the current fastest implementations to our knowledge: ranger (Wright and Ziegler, 2015), xgboost (Chen and Guestrin, 2016), and dbarts, respectively. For ranger and dbarts we use the software defaults. For xgboost we consider two specifications, one using the software defaults and another determined by 5-fold cross-validated grid optimization (see Table 2); a reduced grid of parameter values was used at the larger sample sizes.

Parameter name K K
subsample 0.8 0.8
gamma 0.1 0.1
Table 2: Hyperparameter Grid for XGBoost

4.3 Computation

The simulations were computed on Ubuntu 18.04.1 LTS with an Intel i7-8700K hexa-core 3.7GHz processor (12MB cache, 64-bit, 4.3GHz overclocking speed).

The software used was R version 3.4.4 with xgboost 0.71.2, dbarts version 0.9.1, and ranger 0.10.1. The default hyperparameters for XGBoost are eta = 0.3, colsample_bytree = 1, min_child_weight = 1, and max_depth = 6. Ranger was fit with the defaults num.trees = 500 and mtry = $\lfloor\sqrt{V}\rfloor$. BART, with the package dbarts, was fit with the defaults of ntrees = 200, alpha = 0.95, beta = 2, with a burn-in of 5,000 samples (nskip = 5000) and 2,000 retained posterior samples (ndpost = 2000). The default dbarts algorithm uses an evenly spaced grid of 100 cut-point candidates along the observed range of each variable (numcuts = 100, usequants = FALSE).

4.4 Results

The performance of the new ABARTH algorithm was even better than anticipated, showing superior speed and performance relative to all the considered alternatives on essentially every data generating process. The full results, averaged across five Monte Carlo replications, are reported in Table 3, where AB, cv, and xgb denote ABARTH, XGBoost with parameters tuned by cross-validation, and XGBoost with default parameters, respectively. Across all data generating processes and sample sizes, ABARTH was 31% more accurate than the cross-validated XGBoost method and nearly 10 times faster. The ABARTH method was 8 times slower than the untuned default XGBoost method, but was 3.5 times more accurate. This pattern points to one of the main benefits of the proposed method: it has excellent performance using the same hyperparameter settings across all data generating processes. Importantly, these default hyperparameter settings were decided on the basis of prior elicitation experiments using different true functions than were used in the reported simulations. While XGBoost is quite fast, the tuning process is left to the user and can increase the total computational burden by orders of magnitude.

Random forests and traditional MCMC BART were prohibitively slow at larger sample sizes. However, at the smaller sample sizes several notable patterns did emerge. First, BART and ABARTH typically gave very similar results, as would be expected. BART performed slightly better in the low noise setting and quite a bit worse in the high noise setting (likely due to an inadequate burn-in period). Similarly, random forests do well in higher noise settings, while XGBoost performs better in lower noise settings.

RMSE Seconds
Linear
n κ AB cv xgb AB cv xgb
10K 1 0.69 1.07 1.29 3 26
10K 10 2.14 3.26 8.36 1 24
50K 1 0.38 0.80 1.03 33 57 1
50K 10 1.52 2.21 6.49 9 56 1
250K 1 0.23 0.61 0.80 360 510 15
250K 10 0.94 1.28 4.62 87 495 15
Single index
n κ AB cv xgb AB cv xgb
10K 1 0.89 1.05 1.44 2 24
10K 10 2.97 4.06 11.32 1 24
50K 1 0.63 0.64 1.13 22 56 1
50K 10 1.75 2.80 8.56 9 56 1
250K 1 0.46 0.48 0.86 245 516 15
250K 10 1.23 1.63 5.99 83 532 26
Trig + poly
n κ AB cv xgb AB cv xgb
10K 1 0.53 0.88 1.01 2 25
10K 10 2.05 2.88 7.03 1 24
50K 1 0.33 0.52 0.70 17 56 1
50K 10 1.31 1.90 5.28 9 56 1
250K 1 0.20 0.33 0.48 146 479 15
250K 10 0.80 1.34 3.79 85 515 15
Max
n κ AB cv xgb AB cv xgb
10K 1 0.16 0.17 2.90 2 25
10K 10 0.74 1.11 2.90 1 24
50K 1 0.11 0.12 0.24 13 56 1
50K 10 0.53 0.75 2.23 9 56 1
250K 1 0.06 0.08 0.16 110 535 21
250K 10 0.28 0.42 1.54 86 531 21
Table 3: Simulation comparison with XGBoost.
Function Method RMSE Seconds
Linear ABARTH 2.28 1
Linear XGBoost Tuned 3.21 24
Linear XGBoost Untuned 8.61
Linear Random Forest 2.50 2
Linear BART 2.74 41
Trig + Poly ABARTH 2.05 1
Trig + Poly XGBoost Tuned 2.75 24
Trig + Poly XGBoost Untuned 7.03
Trig + Poly Random Forest 2.44 2
Trig + Poly BART 2.47 41
Single Index ABARTH 2.96 1
Single Index XGBoost Tuned 4.41 24
Single Index XGBoost Untuned 11.45
Single Index Random Forest 3.68 2
Single Index BART 3.99 42
Max ABARTH 0.76 1
Max XGBoost Tuned 1.08 24
Max XGBoost Untuned 2.88
Max Random Forest 0.95 2
Max BART 0.91 41
Table 4: High noise (κ = 10, n = 10K)
Function Method RMSE Seconds
Linear ABARTH 0.65 3
Linear XGBoost Tuned 1.08 25
Linear XGBoost Untuned 1.25
Linear Random Forest 1.40 2
Linear BART 0.58 44
Trig + Poly ABARTH 0.55 2
Trig + Poly XGBoost Tuned 0.87 24
Trig + Poly XGBoost Untuned 1.02
Trig + Poly Random Forest 1.18 2
Trig + Poly BART 0.52 44
Single Index ABARTH 0.93 2
Single Index XGBoost Tuned 1.07 24
Single Index XGBoost Untuned 1.47
Single Index Random Forest 1.41 2
Single Index BART 0.86 43
Max ABARTH 0.17 2
Max XGBoost Tuned 0.17 24
Max XGBoost Untuned 0.31
Max Random Forest 0.16 2
Max BART 0.18 42
Table 5: Low noise (κ = 1, n = 10K)

5 Discussion

The grow-from-root strategy proposed here opens the door for computational innovations to be married to the novel BART stochastic fitting algorithm. Further, the proposed adaptive cut-points and variable selection proposal together define a novel predictor-dependent prior, marking a distinct Bayesian model. The simulation studies clearly demonstrate the beneficial synergy realized by the proposed approach: ABARTH is a state-of-the-art nonlinear regression method with computational demands that are competitive with the current fastest alternatives. In particular, the excellent performance without the need to cross-validate recommends ABARTH as a suitable default method for function estimation and prediction tasks when little is known about the response surface.

The source of ABARTH’s superior performance is not entirely clear, but preliminary investigations point to two important factors. One, the BART splitting criterion involves (the current estimate of) the error standard deviation, $\sigma$, meaning that it is adaptively regularizing within the model fitting process. Two, we conjecture that the stochastic nature of the algorithm leads to better exploration of the parameter space than iterative optimizers. With fast model fitting software now in hand, this issue can be investigated more systematically in future work.


  • Breiman (1997) Breiman, L. (1997). Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Olshen, R., and Stone, C. J. (1984). Classification and regression trees. Chapman and Hall/CRC.
  • Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.
  • Chipman et al. (1998) Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.
  • Chipman et al. (2010) Chipman, H. A., George, E. I., McCulloch, R. E., et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
  • Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
  • Friedman (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.
  • Ke et al. (2017) Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154.
  • Linero (2016) Linero, A. R. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association, (just-accepted).
  • Mehta et al. (1996) Mehta, M., Agrawal, R., and Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In International Conference on Extending Database Technology, pages 18–32. Springer.
  • Pratola (2016) Pratola, M. (2016). Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3):885–911.
  • Wright and Ziegler (2015) Wright, M. N. and Ziegler, A. (2015). ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409.