Hierarchical Embedded Bayesian Additive Regression Trees

by Bruna Wundervald, Andrew Parnell, and Katarina Domijan
Maynooth University

We propose a simple yet powerful extension of Bayesian Additive Regression Trees which we name Hierarchical Embedded BART (HE-BART). The model allows for random effects to be included at the terminal node level of a set of regression trees, making HE-BART a non-parametric alternative to mixed effects models which avoids the need for the user to specify the structure of the random effects in the model, whilst maintaining the prediction and uncertainty calibration properties of standard BART. Using simulated and real-world examples, we demonstrate that this new extension yields superior predictions for many of the standard mixed effects models' example data sets, and yet still provides consistent estimates of the random effect variances. In a future version of this paper, we outline its use in larger, more advanced data sets and structures.




1 Introduction

Bayesian Additive Regression Trees (BART) [8] is a commonly-used probabilistic machine learning approach that produces predictions based on sums of regression trees. Like most standard machine learning approaches, however, BART is not mathematically designed to deal with hierarchical structures in the data. For example, certain observations may share common grouping characteristics, e.g. repeated measures and grouped data [11], where there is an intra-group variance that needs to be accounted for when modelling such data. There are, of course, some algorithmic options that can fit statistical methods to grouped data [1, 11], but they are often not flexible enough to adapt to data generated by a complicated underlying structure. It might also happen that certain observations have their grouping information missing [36], or predictions might be required at different levels of the data hierarchy [11], which leads to even more complicated scenarios. Occasionally it is also of interest to estimate and/or remove the variability associated with the structured component of the data. The user is often left with a difficult model selection task, as they must choose whether the random effects go into an intercept, a slope, or an interaction. Due to the complexity of this task, the model space is not usually well explored; see e.g. Chapter 24 of [11] for examples of simple model checking tools for hierarchical models.

With this in mind, Hierarchical Embedded BART (HE-BART) introduces a hierarchical component in order to properly model grouped data, and estimates the manner in which the hierarchical component of the data enters into the predictions using Bayesian Additive Regression Trees. This is substantially different from the original means by which random effects were proposed as an inclusion to the BART model (see Section 7 of [8]). By contrast, in the simplest version of our model the hierarchical component is a single categorical predictor variable which adjusts all the predictions in the terminal nodes of each tree. Thus the effects are no longer confined to individual components of the model (e.g. intercept terms), giving much greater flexibility whilst still providing predictions at multiple layers of the hierarchy. Moreover, we do not require the grouping information to exist when calculating the predictions, which allows us to predict for observations where the group is missing, bringing even more flexibility to our model. We provide full technical details of our extension below.

Our paper is structured as follows. In Section 2 we introduce the BART model mathematically, and discuss the fitting algorithm and some extensions that have already been proposed. In Section 3 we outline our new HE-BART approach and discuss how it extends BART into a generalised model for hierarchical data structures, whilst retaining the attractive properties and algorithmic efficiency that BART exhibits. In Section 4 we demonstrate the performance of the method on simulated and real-world data. Finally, Section 5 discusses some potential future research areas and drawbacks of our approach. An appendix contains some of the more detailed mathematics behind the fitting algorithm.

2 Introduction to Bayesian Additive Regression Trees

2.1 The BART model

The initial Bayesian Regression Tree model was first proposed over 20 years ago [7], and consists of an algorithm that fits CART decision trees in a Bayesian fashion. The same authors extended this method to create the Bayesian Additive Regression Tree (BART) approach, which assumes that the generating system of a continuous random variable $y$ can be approximated by a sum of Bayesian trees. In a standard regression setting the BART model is usually written as:

$$y_i = \sum_{t=1}^{P} g(x_i; T_t, M_t) + \epsilon_i, \quad \epsilon_i \sim N(0, \tau^{-1}), \qquad (1)$$

for observations $i = 1, \dots, N$, where $P$ is the fixed number of trees, $x_i$ represents the set of covariates, $T_t$ is a tree structure, and $M_t = \{\mu_{t1}, \dots, \mu_{t b_t}\}$ represents a set of terminal node parameters. The function $g$ returns a terminal node prediction from $M_t$ by passing $x_i$ through the tree $T_t$; $M_t$ consists of a set of parameters, one for each of the $b_t$ terminal nodes in tree $T_t$. These values provide the tree-level predictions which are summed together to give an overall predicted value. The residual term $\epsilon_i$ is assumed normal with residual precision $\tau$. Figure 1 (left panel) shows a standard single tree that a BART model may use.
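To make the sum-of-trees structure concrete, here is a minimal Python sketch; the nested-dict tree representation and function names are illustrative only, not taken from any BART implementation:

```python
# A minimal sketch of the BART sum-of-trees prediction (Equation 1).
# Each tree is a nested dict: internal nodes split one covariate at a
# threshold; terminal nodes hold their parameter mu. This representation
# is purely illustrative -- real BART packages store trees differently.

def g(x, tree):
    """Pass covariates x through a tree and return the terminal node mu."""
    node = tree
    while "mu" not in node:  # descend until a terminal node is reached
        branch = "left" if x[node["var"]] < node["split"] else "right"
        node = node[branch]
    return node["mu"]

def bart_predict(x, trees):
    """Overall prediction: the sum of the P tree-level predictions."""
    return sum(g(x, tree) for tree in trees)

# Two toy trees, both splitting on covariate 0
t1 = {"var": 0, "split": 0.5, "left": {"mu": 1.0}, "right": {"mu": 2.0}}
t2 = {"var": 0, "split": 0.3, "left": {"mu": -0.5}, "right": {"mu": 0.5}}

print(bart_predict({0: 0.4}, [t1, t2]))  # 1.0 + 0.5 = 1.5
```

Each tree contributes only a small part of the fit; the sum over trees produces the overall prediction.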

The joint posterior distribution of the trees and all the parameters is given by:

$$p((T, M), \tau \mid y, x) \propto p(y \mid T, M, \tau) \left[ \prod_{t=1}^{P} p(M_t \mid T_t)\, p(T_t) \right] p(\tau),$$

where $p(y \mid T, M, \tau)$ is the normally distributed likelihood as defined in Equation 1. The term $p(M_t \mid T_t)$ is the prior distribution on the terminal node parameters across each terminal node in each tree $t$; $p(T_t)$ is the prior distribution on the tree structure, and $p(\tau)$ is the prior on the residual precision. We detail all these terms below.

The prior distribution on the trees proposed by Chipman et al. [8] involves applying a separate term for each node in the tree, considering both the probability of a split as well as the probability of a new splitting variable being chosen. For an entire tree $T_t$, we have:

$$p(T_t) = \prod_{\eta \in \mathcal{H}_I} \alpha (1 + d_\eta)^{-\beta}\, p_{\text{RULE}}(\eta) \prod_{\eta \in \mathcal{H}_T} \left[ 1 - \alpha (1 + d_\eta)^{-\beta} \right],$$

where $\mathcal{H}_T$ and $\mathcal{H}_I$ represent the sets of terminal and internal nodes, respectively. The probability of a node being non-terminal is given by $\alpha(1 + d_\eta)^{-\beta}$, where $d_\eta$ denotes the depth of the node $\eta$. The recommended values for the hyperparameters are $\alpha = 0.95$ and $\beta = 2$, which control the depth and bushiness of the trees. For the probability of the new splits, $p_{\text{RULE}}(\eta) = (1/p_\eta)(1/n_\eta)$, where $p_\eta$ represents how many predictors are still available to split on in node $\eta$, and $n_\eta$ how many values in a given predictor are still available.
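The depth penalty is easy to verify numerically; a quick check of the non-terminal probability under the recommended hyperparameters (plain Python, the values follow directly from the formula):

```python
# Probability that a node at depth d is non-terminal: alpha * (1 + d)^(-beta),
# with the recommended values alpha = 0.95 and beta = 2.
def p_nonterminal(d, alpha=0.95, beta=2.0):
    return alpha * (1 + d) ** (-beta)

for d in range(4):
    print(d, round(p_nonterminal(d), 4))
# depth 0: 0.95, depth 1: 0.2375, depth 2: ~0.1056, depth 3: ~0.0594,
# so the prior strongly favours shallow trees.
```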

The prior distribution for the terminal node parameters in the standard regression case is denoted by $p(M_t \mid T_t) = \prod_{\ell=1}^{b_t} p(\mu_{t\ell})$. Each $\mu_{t\ell}$ is given a prior with a standard conjugate form:

$$\mu_{t\ell} \sim N(\mu_\mu, \tau_\mu^{-1}),$$

where $\mu_\mu$ is commonly fixed at zero after the response has been scaled, and the precision $\tau_\mu$ shrinks the individual tree contributions so that no single tree dominates the fit.

The residual precision prior is set as $\tau \sim \text{Ga}(\alpha_\tau, \beta_\tau)$, where $\alpha_\tau$ and $\beta_\tau$ are fixed.

Since its creation, BART has been applied in a wide variety of different application areas. The model has been shown to be useful for credit risk modelling [38], survival data analysis [34, 33], ecology and evolution modelling [5], weather and avalanche forecasting [3], and genetics [37]. A popular approach is its use in causal inference [14], where BART produces accurate estimates of average treatment effects and is competitive even with the true data generating model.

In addition to this, many extensions to the standard BART model have been proposed. Some of the first include BART for categorical, count and multinomial regression [22, 16], and quantile regression [15]. This was followed by the proposal of BART models that adapt to smoothness and sparsity [20], models for high-dimensional data and variable selection [18], and BART for zero-inflated and semi-continuous responses [19]. More recently, we have seen work involving BART for heteroscedastic data [27], for the estimation of monotone and smooth surfaces [35], varying coefficient models [9], semiparametric BART [25], and a combination of BART with model trees [24]. As this is a fairly new class of machine learning algorithm, progress on the theoretical performance of BART has only just begun. Some of the mathematical properties of BART, including a thorough review of the BART methodology, can be found in Linero [17], and some more general theoretical results in Ročková and van der Pas [31] and Ročková and Saha [30].

2.2 Fitting algorithm

The BART model is fitted via a backfitting MCMC algorithm [13] that holds all other trees constant while the current one is being updated. This involves calculating the full conditional distribution $p(T_t \mid T_{-t}, M_{-t}, \tau, y)$, where $T_{-t}$ represents the set of all trees except for tree $t$ ($M_{-t}$ is analogous). This conditional depends on the other trees only via the current state of the partial residuals, defined as

$$R_{it} = y_i - \sum_{s \neq t} g(x_i; T_s, M_s),$$

meaning these partial residuals include the sum of the predictions for all trees except $t$. The choice of prior distributions on the trees and terminal nodes allows the term $p(T_t \mid R_t, \tau)$ to be calculated in closed form, which avoids the need for trans-dimensional MCMC methods and greatly simplifies the resulting algorithm.
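The partial-residual bookkeeping can be sketched as follows (illustrative Python; in practice implementations cache the per-tree fitted values rather than recomputing them):

```python
import numpy as np

# Sketch of the backfitting residual update: when tree t is refitted, it
# targets the residuals left over by all the other trees.
def partial_residuals(y, tree_preds, t):
    """R_t = y minus the summed predictions of every tree except tree t.

    tree_preds is a (P, N) array: one row of fitted values per tree.
    """
    others = np.delete(tree_preds, t, axis=0)
    return y - others.sum(axis=0)

y = np.array([3.0, 1.0])
tree_preds = np.array([[1.0, 0.5],   # fitted values of tree 0
                       [0.5, 0.2]])  # fitted values of tree 1
print(partial_residuals(y, tree_preds, t=0))  # y minus tree 1: [2.5, 0.8]
```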

The algorithm samples from the conditional distribution of each tree via two main steps:

  1. propose a new tree $T_t^*$ through one of 4 proposal moves and calculate the Metropolis-Hastings acceptance probability, and

  2. sample a new set of terminal node parameters via $p(M_t \mid T_t, R_t, \tau)$, for the new tree.

The new trees are proposed using a Metropolis-Hastings sampler [6], where the candidate tree is obtained via: GROW (a new node is added to the tree), PRUNE (a random node is removed from the tree), CHANGE (a random splitting rule is changed in the tree) or SWAP (a random internal node is swapped in the tree). The movement choice requires probabilities for each move, which can be equal or depend on prior beliefs about moves that should be prioritised. In [26], the authors have proposed more detailed moves for the generation of new candidate trees.

3 Hierarchical Embedded Bayesian Additive Regression Trees (HE-BART)

Our HE-BART approach merges ideas from traditional Bayesian hierarchical modelling [11] and linear mixed effects models [23] with BART. We allow each tree to have an extra split on each terminal node corresponding, in the simplest version, to a random intercept for each member of a categorical predictor variable. Thus we introduce intra-group node parameters which we write as $\mu_{j\ell}$, where $j$ is the group index and $\ell$ the terminal node index. This allows us to have a group-specific prediction for each node, as well as an overall terminal node prediction $\mu_\ell$. The flexibility of this structure means that there is no requirement for the user to specify where the random effect is included, for example as an intercept or as a regression slope. With HE-BART we can fit Bayesian Additive Regression Trees to any kind of grouped data where there is such a categorical predictor, such as longitudinal data, repeated measures data, multilevel data, and block designs. In addition, having the two levels of predictions is advantageous for scenarios where the group information is not available for all or a subset of the new data.

To define the HE-BART model, let $P$ be the number of trees, $J$ the total number of groups, and $\Theta_t$ the set of node parameters at both the terminal node and sub-terminal node level. More fully, we have one set $\{\mu_{t\ell}\}$ of terminal node predictions for each tree $t$, and one set $\{\mu_{tj\ell}\}$ for each group $j$ within each terminal node $\ell$ of tree $t$, $j = 1, \dots, J$. Assuming we have a continuous variable of interest, the fundamental HE-BART model appears similar to the standard BART approach:

$$y_{ij} = \sum_{t=1}^{P} G(x_i; T_t, \Theta_t) + \epsilon_{ij},$$

for observation $i = 1, \dots, N$, in group $j = 1, \dots, J$. In this specification, $G$ is the tree look-up function which allows for predictions based on covariates $x_i$ and provides both the terminal node and sub-terminal node values as desired. As before $T_t$ is the tree structure, but $\Theta_t$ contains both the terminal node parameters $\mu_{t\ell}$ and the sub-terminal node group parameters $\mu_{tj\ell}$ for tree $t$. As with standard BART, the noise is assumed to be distributed as $\epsilon_{ij} \sim N(0, \tau^{-1})$, where $\tau$ is the overall residual precision.

The $\{\mu_{t\ell}\}$ and $\{\mu_{tj\ell}\}$ sets contain, respectively, the overall terminal-node mean parameters and the intra-group sub-terminal-node mean parameters. In other words, for all trees we will have one parameter $\mu_{t\ell}$ for each terminal node, and for each group within each terminal node we will have a further parameter $\mu_{tj\ell}$, where not all of the groups need to be available in each terminal node. All parameters in $\Theta_t$ receive priors and have corresponding posteriors, which are sampled from in the fitting algorithm. Figure 1 illustrates what a BART tree and a HE-BART tree look like. In the Figure, the terminal circles represent the terminal nodes, which all have their own parameters. In the HE-BART tree, however, we have the addition of the intra-group parameters, represented by the hexagonal symbols.
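The key practical consequence, group-level predictions with a fall-back to the overall node prediction when the group is unknown, can be sketched as follows (illustrative Python; the node representation is our own, not the paper's code):

```python
# Sketch of a HE-BART terminal node lookup: each terminal node carries an
# overall mean mu and a dict of intra-group means mu_j. If an observation's
# group is missing, or was never seen in this node, the overall mu is used.
def node_predict(node, group=None):
    if group is not None and group in node["mu_j"]:
        return node["mu_j"][group]   # group-specific prediction
    return node["mu"]                # overall terminal-node prediction

terminal_node = {"mu": 1.2, "mu_j": {"A": 1.5, "B": 0.9}}

print(node_predict(terminal_node, "A"))   # 1.5 (group-level)
print(node_predict(terminal_node, "C"))   # 1.2 (group unseen -> overall)
print(node_predict(terminal_node))        # 1.2 (group missing -> overall)
```

This two-level lookup is what lets the model predict for observations whose group was not observed at training time.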

Figure 1: Left panel: a standard BART tree. Right panel: a HE-BART tree. In a BART tree, each terminal node has only one terminal node parameter, represented in the figure by $\mu_1$, $\mu_2$, and $\mu_3$. On the other hand, in a HE-BART tree, each terminal node has its own overall parameter, plus one for each of the groups available in the node (not all groups need to be present in each terminal node). For example, terminal node 2 has an overall parameter $\mu_2$, and intra-group parameters ($\mu_{21}$, $\mu_{22}$, $\mu_{23}$), one for each of the available groups.

3.1 Prior and posterior distributions

For HE-BART, a single terminal node $\ell$ in tree $t$ has partial residuals

$$R_{ij\ell t} = y_{ij} - \sum_{s \neq t} G(x_i; T_s, \Theta_s),$$

for observation $i$ in group $j$. Now, let $R_j$ represent the full set of residuals for group $j$, where we drop the dependence on terminal node and tree for simplicity of notation. We then obtain a distribution on the partial residuals $R_j \sim N(\mu_{j\ell} \mathbf{1}, \tau^{-1} I)$, with prior $\mu_{j\ell} \mid \mu_\ell \sim N(\mu_\ell, k_1 \tau^{-1})$. The hyperparameter $k_1$ is responsible for scaling $\tau$ to a value that captures the intra-group precision. We place a prior with fixed hyperparameters on $k_1$, and its values are sampled via a Metropolis-Hastings step. Optionally, other priors, including Penalised Complexity priors [32], could be used. We also set the prior $\mu_\ell \sim N(0, k_2 (P\tau)^{-1})$, where $k_2$ is fixed for now, but it could also be given a prior and sampled in future work. As for the overall residual precision $\tau$, its prior is set as $\tau \sim \text{Ga}(\alpha_\tau, \beta_\tau)$, where $\alpha_\tau$ and $\beta_\tau$ are fixed.
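The prior hierarchy can be simulated top-down for a single terminal node. The sketch below (Python) assumes illustrative variance forms (intra-group variance $k_1/\tau$, node-level variance $k_2/(P\tau)$) and arbitrary hyperparameter values; these are our assumptions for the example, not the paper's recommended settings:

```python
import numpy as np

# Top-down simulation of one HE-BART terminal node. The variance forms
# below (k2/(P*tau) for the node mean, k1/tau for the group means) and all
# hyperparameter values are illustrative assumptions only.
rng = np.random.default_rng(42)
tau, k1, k2, P = 4.0, 0.5, 2.0, 10
n_groups, n_per_group = 3, 2000

mu_l = rng.normal(0.0, np.sqrt(k2 / (P * tau)))        # overall node mean
mu_j = rng.normal(mu_l, np.sqrt(k1 / tau), n_groups)   # intra-group means
y = rng.normal(np.repeat(mu_j, n_per_group), np.sqrt(1.0 / tau))

# The within-group residual sd should recover 1/sqrt(tau) = 0.5
resid = y - np.repeat(mu_j, n_per_group)
print(round(float(resid.std()), 2))  # close to 0.5
```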

To find the posterior distribution of a new tree, we first create the group model matrix $M$ (a binary indicator matrix mapping each observation to its group), so that $R_\ell \mid \mu_\ell, \tau, k_1 \sim N(\mu_\ell \mathbf{1}, \tau^{-1}(I + k_1 M M^\top))$, where the intra-group means $\mu_{j\ell}$ have been integrated out. With further marginalisation over $\mu_\ell$ we have:

$$R_\ell \mid \tau, k_1, k_2 \sim N\!\left(\mathbf{0},\; \tau^{-1}\!\left(I + k_1 M M^\top + \frac{k_2}{P} \mathbf{1}\mathbf{1}^\top\right)\right).$$

We use this collapsed distribution to obtain the marginal posterior distribution of a new tree in the MH step. Using the priors defined above, the posterior updates for $\mu_\ell$ and $\mu_{j\ell}$ will be given by

$$\mu_\ell \mid R_\ell, \tau, k_1, k_2 \sim N\!\left( \frac{\mathbf{1}^\top \Psi^{-1} R_\ell}{\mathbf{1}^\top \Psi^{-1} \mathbf{1} + P/k_2},\; \left[\tau\!\left(\mathbf{1}^\top \Psi^{-1} \mathbf{1} + P/k_2\right)\right]^{-1} \right), \qquad \Psi = I + k_1 M M^\top,$$

which depends on the overall residuals for the tree, the actual number of trees, and on the $k_1$ and $k_2$ precision scaling parameters, and

$$\mu_{j\ell} \mid \mu_\ell, R_j, \tau, k_1 \sim N\!\left( \frac{n_j \bar{R}_{j\ell} + \mu_\ell / k_1}{n_j + 1/k_1},\; \left[\tau\!\left(n_j + 1/k_1\right)\right]^{-1} \right),$$

where $\bar{R}_{j\ell}$ is the residual average for group $j$ in the specified tree node, and $n_j$ is the number of observations in each of those groups. Note that, using the marginalisation trick previously mentioned, we have integrated $\mu_{j\ell}$ out of the posterior distribution of $\mu_\ell$ and thus removed the mutual dependence of the $\mu_\ell$ and $\mu_{j\ell}$ posteriors, as the sampling distribution for $\mu_\ell$ does not directly depend on $\mu_{j\ell}$, only on the current tree residuals. This facilitates the posterior sampling from both distributions, because the number of nodes varies from iteration to iteration for each tree (see Algorithm 1).
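The intra-group update is a standard conjugate-normal shrinkage: the group residual mean (weight $n_j$) is pulled towards the overall node mean (weight $1/k_1$). A sketch of a single Gibbs draw under that form (Python; the exact weights are our reconstruction of the conjugate update, and should be checked against the paper's appendix):

```python
import numpy as np

# Gibbs draw for an intra-group mean: a precision-weighted average of the
# group residual mean (weight n_j) and the overall node mean mu_l (weight
# 1/k1), with posterior precision tau * (n_j + 1/k1). This conjugate form
# is a reconstruction, not lifted from the paper's code.
def sample_mu_j(rbar_j, n_j, mu_l, tau, k1, rng):
    weight = n_j + 1.0 / k1
    mean = (n_j * rbar_j + mu_l / k1) / weight
    sd = 1.0 / np.sqrt(tau * weight)
    return rng.normal(mean, sd)

rng = np.random.default_rng(0)
# Many observations: the draws concentrate near the group mean (2.0) ...
big = [sample_mu_j(2.0, 500, 0.0, 1.0, 1.0, rng) for _ in range(200)]
# ... few observations: strong shrinkage towards mu_l = 0 (here halfway)
small = [sample_mu_j(2.0, 1, 0.0, 1.0, 1.0, rng) for _ in range(200)]
print(round(float(np.mean(big)), 1), round(float(np.mean(small)), 1))
```

The shrinkage weight $1/k_1$ is what ties sparsely observed groups back to the overall node prediction.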

For the posterior update of the residual precision $\tau$, first let $\hat{y}_{ij}$ be the overall prediction for observation $ij$ at the current iteration. Then the posterior distribution will be:

$$\tau \mid \cdots \sim \text{Ga}\!\left( \alpha_\tau + \frac{N + B + B_J}{2},\;\; \beta_\tau + \frac{1}{2}\sum_{i,j} (y_{ij} - \hat{y}_{ij})^2 + \frac{1}{2k_1}\sum_{t,\ell,j} (\mu_{tj\ell} - \mu_{t\ell})^2 + \frac{P}{2k_2}\sum_{t,\ell} \mu_{t\ell}^2 \right),$$

where $B$ is the total number of terminal nodes across all trees, $B_J$ is the total number of intra-group sub-terminal node parameters, and all the other parameters are as previously described. This distribution depends on the overall prediction residuals, the grouping hyperparameters, as well as on the $\mu_{j\ell}$ and $\mu_\ell$ residuals.

The fitting algorithm is described in Algorithm 1. We initialise all inputs, the starting values for all hyperparameters, and all other elements used. For each MCMC iteration, we either accept or reject a new tree for each tree index, created from one of the 4 tree moves previously described. At this step, the full set of node parameters is also sampled, and the sum of trees is calculated after all parameters are sampled. When each tree is updated, a new partial residual set is defined and used in the sampling of the next tree. After the new trees are sampled, their predictions are used to sample from the residual precision parameter $\tau$. Finally, we use an MH step to either accept or reject a new sample of $k_1$, which is the last parameter updated in each iteration.

Type: Metropolis within Gibbs for a hierarchical BART model

Input: y, X, groups
Output: posterior distributions of the trees $T_t$, the node parameters $\mu_\ell$ and $\mu_{j\ell}$, $\tau$, and $k_1$
Initialise: start values for $\tau$, $k_1$, $k_2$, $\alpha_\tau$, $\beta_\tau$; number of trees P; stump trees $T_t$; number of observations N; number of MCMC iterations M; initial residual set R
for m = 1 to M do
     for t = 1 to P do
          1. propose a new tree $T_t^*$ by either growing, pruning, changing or swapping a node
          2. calculate the Metropolis-Hastings acceptance probability $a$
          3. sample $u \sim \text{Uniform}(0, 1)$
          4. if $u < a$ then accept $T_t^*$, else keep $T_t$
          for each terminal node $\ell$ do: sample $\mu_\ell$
               for each group $j$ in node $\ell$ do: sample $\mu_{j\ell}$
               end for
          end for
          update the partial residual set R
     end for
     Sample $\tau$ from its Gamma full conditional
     Sample $k_1$ using another Metropolis-Hastings step:
          if $u < a_{k_1}$ then accept the new $k_1$
end for
Algorithm 1: The HE-BART algorithm

4 Applications & Results

This section describes the fitting of HE-BART to some example data sets. We compare across different models, including: a Linear Mixed-Effects (LME) model, fitted using the lme4 [1] R [28] package; a Bayesian LME model, fitted using the brms [4] R package, with the default values for the number of MCMC iterations (2000), MCMC chains (4), and warmup (1000), and a prior on the parameter coefficient(s); and standard BART, fitted using the dbarts [10] R package, with the default values for the number of trees (200), number of posterior samples (1000), and number of burn-in samples (100). The main performance results presented are calculated on out-of-sample values.

4.1 Simulated data

We first test HE-BART on a simulated dataset. The response variable is simulated as a sum of trees, for which the tree structures are created by splitting the single covariate at a random value, meaning the response is sampled from normal distributions whose means are the node means induced by the splitting value, with the grouping information sampled at random across the groups. The remaining parameters are fixed at known values, and $k_1$ is sampled from its prior. The dataset is split into 30 different train and test sets using a 30-fold cross-validation setting [29].

To this simulated dataset, we apply four different algorithms (LME, Bayesian LME, standard BART and HE-BART) to the 30 train sets and make predictions for the 30 test sets. In Figure 2, we can see that HE-BART produces the smallest root mean squared errors for both the train and test sets, as its averages are the closest to zero. The RMSE table shows this in numbers, with empirical confidence intervals, confirming that HE-BART produces the lowest values for the 30 test sets. In the same figure, we also have the comparison between the sampled values for the intra-group standard deviation and its corresponding true value, computed from the $k_1$ and $\tau$ used in the simulation. We observe that this empirical density is also centered on its true value, reinforcing the previous conclusion.

(a) Sampled and true values for the intra-group standard deviation
(b) Boxplots of total RMSE for the 30 test sets
Source HE-BART LME Bayesian LME BART
Test set 0.832 [0.776,0.889] 0.896 [0.833,0.96] 1.637 [1.604,1.67] 0.951 [0.888,1.013]
Train set 0.729 [0.724,0.735] 0.891 [0.889,0.893] 1.628 [1.624,1.633] 0.911 [0.909,0.914]
(c) Average test RMSE table, with the corresponding empirical 95% confidence intervals
Figure 2: (a) Empirical density of sampled values for the intra-group standard deviation, with the true value used in the simulation represented by the dotted line. The density is slightly heavy-tailed but centred around the true value; (b) Boxplots of total RMSE for the 30 test sets, using the 4 different algorithms: HE-BART and 3 of its direct competitors. Our model evidently produces the lowest RMSEs for the test set; (c) Average test RMSE values, with empirical 95% confidence intervals for the mean RMSE. The table confirms the lowest values for HE-BART, with neither the train nor test set confidence intervals intersecting the others, except for HE-BART and LME.

4.2 Sleep study data

This dataset consists of the first 10 days of a sleep study [2], for the sleep-deprived group. The response variable is the average reaction time per day (in milliseconds), and the covariate is the number of days of sleep deprivation, for 18 subjects. Linear mixed effects models [23] are often applied to this data [1], so we will directly compare our results.

For this dataset, we created two modelling scenarios: one with all 18 subjects present in the training set (80% train/20% test), and one where subject IDs 308, 309 and 351 are not present in the training set (119 observations for training, 61 for testing). With this, we want to assess the prediction capability of our model when new grouping categories are introduced, as we have both intra-group and outside-group predictions. The root mean squared error is also compared to standard BART [8], Bayesian LME [11], and LME [23] results.

In Figure 3, we show the HE-BART predictions for IDs 308, 309 and 351, which were selected at random to be used as prediction examples in the two modelling scenarios. The plot also shows the true observations as blue dots, and the LME fit with its corresponding confidence interval. Our model predicts better for the 3 groups: ID 308 has a highly variable pattern, which is not well captured by the LME line; ID 309 is a bit easier for LME, but HE-BART still has predictions more correlated with the true observations, as this is a more interesting case where response times seem almost to decrease with increasing days of sleep deprivation; and ID 351 forms a more wiggly curve, also favouring HE-BART. The RMSE for these three subjects is much better for HE-BART (almost half the value for LME).

In Figure 4 we have the same information, except that the predictions are made for IDs that did not exist in the training set for either HE-BART or LME. As expected, the two fits produce wide confidence (LME)/credible (HE-BART) intervals, and they are the same for the three IDs. However, the predictions are noticeably better for HE-BART, as they are more flexible than the LME straight line. For ID 351, both models seem to predict well, but HE-BART does a little better at learning the overall behaviour of the data. As for ID 309, the two predictions are fairly poor, and we cannot really say one is better than the other. And for ID 308, the two fits predict roughly half of the observations well, while the other half is too far from what it should be. The root mean squared error (RMSE) has, understandably, increased considerably, but it is still much better for HE-BART.

Table 1 presents the root mean squared error for the full training and test sets of the two modelling scenarios. We compare our results to LME, Bayesian LME and standard BART, and in all 4 cases HE-BART produces the smallest RMSE values. In the two scenarios, the training HE-BART RMSE is almost half that of the LME, which should be a strong competitor when modelling this dataset. The test RMSE for the missing-IDs scenario increased considerably for all algorithms, but that is due to the reduction of the training set size, so there are fewer observations to inform the models about the underlying generating structure of the data. Even so, HE-BART also does better in the test sets of the two scenarios.

Figure 3: Predictions for IDs 308, 309 and 351 of the sleep study data. HE-BART predictions with credible intervals are shown in orange, and the true data points are shown in blue. The linear mixed model fit line is also shown in blue.

Figure 4: Predictions for IDs 308, 309 and 351 of the sleep study data using a model fit in data that did not have these IDs. HE-BART predictions with credible intervals are shown in orange, and the true data points are shown in blue. The linear mixed model fit line is also shown in blue.
Scenario Source HE-BART LME Bayesian LME BART
All IDs in training data Test (20%) 0.50 0.52 0.52 0.78
All IDs in training data Train (80%) 0.41 0.53 0.53 0.85
Missing IDs in training data Test (25%) 0.94 1.02 0.97 1.02
Missing IDs in training data Train (75%) 0.36 0.50 0.49 0.77
Table 1: Root mean squared errors for the training and test sets of the two modelling scenarios: having all IDs in the training set, and removing IDs 308, 309 and 351 from the training set. HE-BART produces smaller values in all 4 cases.

4.3 Code and data availability

The functions created to run the HE-BART algorithm have been written entirely in R [28], and we hope shortly to produce an R package. All functions are available at https://github.com/brunaw/HE-BART, along with the code for replicating all the examples shown here and the corresponding auxiliary files (simulation functions, plotting, tables, etc.).

5 Conclusions

We have provided a new extension of Bayesian Additive Regression Trees that allows for structured data to be used appropriately at the model fitting and prediction stages. Where a grouping variable is present, we are able to provide predictions at both the group level and the level above it. The flexible tree structure thus allows us to produce excellent predictions without the need to specify how the grouping variable is included in the model structure. However, we retain the ability to report and remove the group-level variability by having a parameter which represents the group-level standard deviation.

In simulation studies we have shown that the model performs better than other common linear and mixed effects approaches. It is not directly comparable to standard Bayesian machine learning methods, due to their inability to handle the grouping information. In the real data example of Section 4.2 we have demonstrated that HE-BART competes well with standard linear mixed modelling strategies, despite this being the archetypal example for linear mixed effects models.

We see many potential extensions of our approach. These may include:

  1. Extending the hierarchical data structure to multiple or nested grouping variables

  2. Explicitly modelling joint random effects using the multivariate normal distribution, and so estimating covariances between grouping variables

  3. Including recent BART extensions, such as SOFT-BART [18] and MOTR-BART [24], which can substantially enhance the predictive capabilities of the model

In a future version of this paper we plan to include a more detailed discussion of the hyperparameter settings of the model that are robust to different circumstances, and include a more detailed worked example of the model on a more advanced data set.


Acknowledgements

This work was supported by the Science Foundation Ireland Career Development Award grant number: 17/CDA/4695. Andrew Parnell’s work was also supported by: a Science Foundation Ireland investigator award (16/IA/4520); a Marine Research Programme funded by the Irish Government, co-financed by the European Regional Development Fund (Grant-Aid Agreement No. PBA/CC/18/01); European Union’s Horizon 2020 research and innovation programme InnoVar under grant agreement No 818144; SFI Centre for Research Training in Foundations of Data Science 18/CRT/6049, and SFI Research Centre awards I-Form 16/RC/3872 and Insight 12/RC/2289_P2. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Author contributions

Bruna Wundervald has contributed to the project ideation, mathematical development, code implementation, production and evaluation of results. Andrew Parnell has contributed to the project ideation, mathematical development, code implementation, and evaluation of results. Katarina Domijan has contributed to the project ideation, mathematical development, and evaluation of results.

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

References


  • Bates . 2011 bates2011packageBates, D., Maechler, M., Bolker, B., Walker, S., Christensen, RHB., Singmann, H.Grothendieck, G.  2011. Package ‘lme4’ Package ‘lme4’. Linear mixed-effects models using S4 classes. R package version16.
  • Belenky . 2003 belenky2003patternsBelenky, G., Wesensten, NJ., Thorne, DR., Thomas, ML., Sing, HC., Redmond, DP.Balkin, TJ.  2003. Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study. Journal of sleep research1211–12.
  • Blattenberger  Fowles 2014 blattenberger2014avalancheBlattenberger, G.  Fowles, R.  2014. Avalanche forecasting: Using bayesian additive regression trees (BART) Avalanche forecasting: Using bayesian additive regression trees (bart). Demand for Communications Services–Insights and Perspectives Demand for communications services–insights and perspectives ( 211–227). Springer.
  • Bürkner 2017 burkner2017brmsBürkner, PC.  2017. brms: An R package for Bayesian multilevel models using Stan brms: An r package for bayesian multilevel models using stan. Journal of statistical software801–28.
  • Carlson, CJ. 2020. embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11(7), 850–858.
  • Chib, S. & Greenberg, E. 1995. Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4), 327–335.
  • Chipman, HA., George, EI. & McCulloch, RE. 1998. Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.
  • Chipman, HA., George, EI. & McCulloch, RE. 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
  • Deshpande, SK., Bai, R., Balocchi, C., Starling, JE. & Weiss, J. 2020. VCBART: Bayesian trees for varying coefficients. arXiv preprint arXiv:2003.06416.
  • Dorie, V. 2021. dbarts: Discrete Bayesian Additive Regression Trees Sampler. R package version 0.9-20. https://CRAN.R-project.org/package=dbarts
  • Gelman, A. & Hill, J. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
  • Guo, S. 2005. Analyzing grouped data with hierarchical linear modeling. Children and Youth Services Review, 27(6), 637–652.
  • Hastie, T. & Tibshirani, R. 2000. Bayesian backfitting (with comments and a rejoinder by the authors). Statistical Science, 15(3), 196–223.
  • Hill, JL. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240.
  • Kindo, BP., Wang, H., Hanson, T. & Pena, EA. 2016. Bayesian quantile additive regression trees. arXiv preprint arXiv:1607.02676.
  • Kindo, BP., Wang, H. & Peña, EA. 2016. Multinomial probit Bayesian additive regression trees. Stat, 5(1), 119–131.
  • Linero, AR. 2017. A review of tree-based Bayesian methods. Communications for Statistical Applications and Methods, 24(6), 543–559.
  • Linero, AR. 2018. Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522), 626–636.
  • Linero, AR., Sinha, D. & Lipsitz, SR. 2020. Semiparametric mixed-scale models using shared Bayesian forests. Biometrics, 76(1), 131–144.
  • Linero, AR. & Yang, Y. 2018. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087–1110.
  • Murray, JS. 2017. Log-linear Bayesian additive regression trees for categorical and count responses. arXiv preprint arXiv:1701.01503.
  • Pinheiro, JC. & Bates, DM. 2000. Linear mixed-effects models: basic concepts and examples. In Mixed-Effects Models in S and S-PLUS, 3–56. Springer.
  • Prado, EB., Moral, RA. & Parnell, AC. 2021. Bayesian additive regression trees with model trees. Statistics and Computing, 31(3), 1–13.
  • Prado, EB., Parnell, AC., McJames, N., O’Shea, A. & Moral, RA. 2021. Semi-parametric Bayesian additive regression trees. arXiv preprint arXiv:2108.07636.
  • Pratola, MT. 2016. Efficient Metropolis–Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3), 885–911.
  • Pratola, MT., Chipman, HA., George, EI. & McCulloch, RE. 2020. Heteroscedastic BART via multiplicative regression trees. Journal of Computational and Graphical Statistics, 29(2), 405–417.
  • R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria. https://www.R-project.org/
  • Refaeilzadeh, P., Tang, L. & Liu, H. 2009. Cross-validation. Encyclopedia of Database Systems, 532–538.
  • Ročková, V. & Saha, E. 2019. On theory for BART. The 22nd International Conference on Artificial Intelligence and Statistics, 2839–2848.
  • Ročková, V. & van der Pas, S. 2020. Posterior concentration for Bayesian regression trees and forests. The Annals of Statistics, 48(4), 2108–2131.
  • Simpson, D., Rue, H., Riebler, A., Martins, TG. & Sørbye, SH. 2017. Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28.
  • Sparapani, R., Logan, BR., McCulloch, RE. & Laud, PW. 2020. Nonparametric competing risks analysis using Bayesian additive regression trees. Statistical Methods in Medical Research, 29(1), 57–77.
  • Sparapani, RA., Logan, BR., McCulloch, RE. & Laud, PW. 2016. Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in Medicine, 35(16), 2741–2753.
  • Starling, JE., Murray, JS., Carvalho, CM., Bukowski, RK. & Scott, JG. 2020. BART with targeted smoothing: An analysis of patient-specific stillbirth risk. The Annals of Applied Statistics, 14(1), 28–50.
  • Vallejo, G., Fernández, M., Livacic-Rojas, P. & Tuero-Herrero, E. 2011. Comparison of modern methods for analyzing repeated measures data with missing values. Multivariate Behavioral Research, 46(6), 900–937.
  • Waldmann, P. 2016. Genome-wide prediction using Bayesian additive regression trees. Genetics Selection Evolution, 48(1), 1–12.
  • Zhang, JL. & Härdle, WK. 2010. The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis, 54(5), 1197–1205.

Author Biography

Bruna Wundervald received her First-Class Honours bachelor’s degree in Statistics from the Federal University of Paraná, Brazil, in 2018. She subsequently began her Ph.D., also in Statistics, at Maynooth University, Ireland. Her research interests span a wide variety of machine learning methods, including dimensionality reduction, Bayesian Additive Regression Trees (BART), variational inference, and probabilistic graphical models.

Andrew Parnell received his Ph.D. degree in Statistics from the University of Sheffield, United Kingdom. He is currently a Hamilton Professor at the Hamilton Institute at Maynooth University, Ireland. His research is in statistics and machine learning for large structured data sets in a variety of application areas, including tree-based machine learning methods, Bayesian Additive Regression Trees (BART), extreme value theory, spatial statistics, zero-inflation modeling, and missing data analysis. He has been heavily involved in the commercialization of research and is currently a principal investigator in the SFI I-Form Advanced Manufacturing Centre, a funded investigator in the SFI Insight Centre for Data Analytics, and an expert reviewer for the Intergovernmental Panel on Climate Change.

Katarina Domijan received her Ph.D. degree in Statistics from Trinity College Dublin, Ireland. She is currently a lecturer in Statistics at the Department of Mathematics and Statistics of Maynooth University. Her main research interests lie in Bayesian modeling of classification problems in data with large feature spaces, including feature selection, model visualization, and interpretability.