1 Introduction
Bayesian Additive Regression Trees (BART) ^{8} is a commonly used probabilistic machine learning approach that produces predictions based on sums of regression trees. Like most standard machine learning approaches, however, BART is not mathematically designed to deal with hierarchical structures in the data. For example, certain observations may share common grouping characteristics (e.g. repeated measures and grouped data ^{11}), where there is an intra-group variance that needs to be accounted for when modelling such data. There are, of course, some algorithms that can fit statistical methods to grouped data ^{1}, ^{11}, but they are often not flexible enough to adapt to data that has been generated by a complicated underlying structure. It might also happen that certain observations have their grouping information missing ^{36}, or predictions might be required at different levels of the data hierarchy ^{11}, which leads to even more complicated scenarios. Occasionally it is also of interest to estimate and/or remove the variability associated with the structured component of the data. The user is often left with a difficult model selection task, as they must choose whether the random effects enter through an intercept, a slope, or an interaction. Due to the complexity of this task, the model space is not usually well explored; see e.g. Chapter 24 of ^{11} for examples of simple model checking tools for hierarchical models. With this in mind, Hierarchical Embedded BART (HEBART) introduces a hierarchical component in order to properly model grouped data and to estimate the manner in which the hierarchical component of the data enters the predictions made by Bayesian Additive Regression Trees. This is substantially different from the original suggestion for including random effects in the BART model; see Section 7 of ^{8}. By contrast, in the simplest version of our model the hierarchical component is a single categorical predictor variable which adjusts all the predictions in the terminal nodes of each tree. Thus the effects are no longer confined to individual components of the model (e.g. intercept terms), which gives much greater flexibility whilst still providing predictions at multiple layers of the hierarchy. Moreover, we do not require the grouping information to exist when calculating the predictions, which allows us to predict for observations where the group is missing, bringing even more flexibility to our model. We provide full technical details of our extension below.
Our paper is structured as follows. In Section 2 we introduce the BART model mathematically, and discuss the fitting algorithm and some extensions that have already been proposed. In Section 3 we outline our new HEBART approach and discuss how it extends BART into a generalised model for hierarchical data structures, whilst retaining the attractive properties and algorithmic efficiency that BART exhibits. In Section 4 we demonstrate the performance of the method on simulated and real-world data. Finally, Section 5 discusses some potential future research areas and drawbacks of our approach. An appendix contains some of the more detailed mathematics behind the fitting algorithm.
2 Introduction to Bayesian Additive Regression Trees
2.1 The BART model
The initial Bayesian regression tree model was first proposed over 20 years ago ^{7}, and consists of an algorithm that fits CART decision trees in a Bayesian fashion. The same authors extended this method to create the Bayesian Additive Regression Trees ^{8} approach, which assumes that the generating system of a continuous random variable can be approximated by a sum of Bayesian trees. In a standard regression setting the BART model is usually written as:

(1) $y_i = \sum_{j=1}^{m} g(x_i; T_j, M_j) + \epsilon_i, \quad \epsilon_i \sim N(0, \tau^{-1}),$

for observations $i = 1, \ldots, n$, where $m$ is the fixed number of trees, $x_i$ represents the set of covariates; $T_j$ is a tree structure, and $M_j$ represents a set of terminal node parameters. The function $g$ returns a terminal node prediction from $M_j$ by passing $x_i$ through the tree $T_j$. $M_j$ consists of a set of parameters $\mu$, one for each of the terminal nodes in tree $j$. These values provide the tree-level predictions which are summed together to give an overall predicted value. The residual term $\epsilon_i$ is assumed normal with residual precision $\tau$. Figure 1 (left panel) shows a standard single tree that a BART model may use.
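As a concrete illustration of Equation 1, the short Python sketch below evaluates a sum of two single-split trees; the stump encoding (`(split, mu_left, mu_right)` tuples) is invented purely for exposition and is not the paper's implementation.

```python
# Minimal illustration of a sum-of-trees prediction (Equation 1).
# Each "tree" here is an invented stump on a single covariate x:
# a split value and one terminal-node parameter per side.

def g(x, tree):
    """Return the terminal-node prediction for x under one stump tree."""
    split, mu_left, mu_right = tree
    return mu_left if x < split else mu_right

def predict(x, trees):
    """Sum the per-tree predictions, as in the BART model."""
    return sum(g(x, t) for t in trees)

trees = [(0.5, -1.0, 1.0),   # tree 1: split at x = 0.5
         (0.3, 0.25, 0.5)]   # tree 2: split at x = 0.3

print(predict(0.1, trees))  # falls left in both trees: -1.0 + 0.25 = -0.75
print(predict(0.9, trees))  # falls right in both trees: 1.0 + 0.5 = 1.5
```

Each tree contributes only a small piece of the fit; it is the sum over many such weak learners that produces a flexible regression surface.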
The joint posterior distribution of the trees and all the parameters is given by:

(2) $p(\mathcal{T}, \mathcal{M}, \tau \mid y, x) \propto p(y \mid \mathcal{T}, \mathcal{M}, \tau, x) \left[ \prod_{j=1}^{m} p(M_j \mid T_j)\, p(T_j) \right] p(\tau),$

where $p(y \mid \mathcal{T}, \mathcal{M}, \tau, x)$ is the normally distributed likelihood defined in Equation 1. The term $p(M_j \mid T_j)$ is the prior distribution on the terminal node parameters across each terminal node in each tree $j$. $p(T_j)$ is the prior distribution on the tree structure, and $p(\tau)$ is the prior on the residual precision. We detail all these terms below. The prior distribution on the trees proposed by Chipman et al. ^{8}
involves applying a separate term for each node in the tree, considering both the probability of a split and the probability of a new splitting variable being chosen. For an entire tree $T_j$, we have:

$p(T_j) = \prod_{\eta \in L_I} \alpha (1 + d_\eta)^{-\beta} \prod_{\eta \in L_T} \left[ 1 - \alpha (1 + d_\eta)^{-\beta} \right],$

where $L_T$ and $L_I$ represent the sets of terminal and internal nodes, respectively. The probability of a node $\eta$ being non-terminal is given by $\alpha (1 + d_\eta)^{-\beta}$, where $d_\eta$ denotes the depth of the node. The recommended values for the hyperparameters are $\alpha = 0.95$ and $\beta = 2$, which control the depth and bushiness of the trees. For the probability of the new splits, the splitting variable and value are chosen uniformly: a predictor is selected with probability $1/p_{\mathrm{adj}}(\eta)$, where $p_{\mathrm{adj}}(\eta)$ represents how many predictors are still available to split on in node $\eta$, and a split value with probability $1/n_{\mathrm{adj}}(\eta)$, where $n_{\mathrm{adj}}(\eta)$ is how many values of the chosen predictor are still available. The prior distribution for the terminal node parameters in the standard regression case is a standard conjugate form, $\mu \mid T_j \sim N(\mu_\mu, \sigma_\mu^2)$, with the hyperparameters chosen so that the sum-of-trees prior assigns high probability to the observed range of $y$.
The residual precision prior is set as $\tau \sim \mathrm{Ga}(\alpha_\tau, \beta_\tau)$, where $\alpha_\tau$ and $\beta_\tau$ are fixed.
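The tree-depth prior above is easy to examine numerically; the sketch below evaluates the non-terminal probability at the recommended hyperparameter values, showing how quickly deeper splits are penalised.

```python
# Probability that a node at depth d is non-terminal under the BART
# tree prior: alpha * (1 + d) ** (-beta), using the recommended
# hyperparameter values alpha = 0.95, beta = 2.

def p_split(depth, alpha=0.95, beta=2.0):
    return alpha * (1 + depth) ** (-beta)

for d in range(4):
    print(d, round(p_split(d), 4))
# depth 0: 0.95, depth 1: 0.2375, depth 2: ~0.1056, depth 3: ~0.0594
```

The sharp decay with depth is what keeps each individual tree shallow, so that no single tree dominates the sum.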
Since its creation, BART has been applied in a wide variety of application areas. The model has been shown to be useful for credit risk modelling ^{38}, survival data analysis ^{34}, ^{33}, ecology and evolution modelling ^{5}, weather and avalanche forecasting ^{3}, and genetics ^{37}. A popular use is in causal inference ^{14}, where BART produces accurate estimates of average treatment effects and is competitive even with the true data generating model.
In addition, many extensions to the standard BART model have been proposed. Some of the first include BART for categorical, count and multinomial regression ^{22}, ^{16}, and quantile regression ^{15}. This was followed by BART ensembles that adapt to smoothness and sparsity ^{20}, models for high-dimensional data and variable selection ^{18}, and BART for zero-inflated and semi-continuous responses ^{19}. More recently, we have seen work on BART for heteroscedastic data ^{27}, the estimation of monotone and smooth surfaces ^{35}, varying coefficient models ^{9}, semi-parametric BART ^{25}, and a combination of BART with model trees ^{24}. As this is a fairly new class of machine learning algorithm, work on the theoretical performance of BART has only just begun. Some of the mathematical properties of BART, including a deep review of the BART methodology, can be found in Linero ^{17}, and some more general theoretical results in Ročková and van der Pas ^{31} and Ročková and Saha ^{30}.
2.2 Fitting algorithm
The BART model is fitted via a backfitting MCMC algorithm ^{13} that holds all other trees constant while the current one is being updated. This involves calculating the full conditional distribution $p(T_j \mid T_{-j}, y, \tau)$, where $T_{-j}$ represents the set of all trees except for tree $j$ ($M_{-j}$ is analogous). This conditional depends on the other trees only via the current state of the partial residuals, defined as

$R_{ij} = y_i - \sum_{l \neq j} g(x_i; T_l, M_l),$

meaning these partial residuals subtract from the response the sum of the predictions for all trees except $j$. The choice of prior distributions on the trees and terminal nodes allows the term $p(T_j \mid T_{-j}, y, \tau)$ to be calculated in closed form, which avoids the need for trans-dimensional MCMC methods and greatly simplifies the resulting algorithm.
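The partial-residual computation is a one-liner with array arithmetic; the sketch below (with an assumed per-tree prediction array layout, not the paper's code) makes the bookkeeping explicit.

```python
import numpy as np

# Partial residuals for tree j: subtract the predictions of all OTHER
# trees from y. tree_preds is an (m, n) array of per-tree predictions;
# this layout is an expository assumption.

def partial_residuals(y, tree_preds, j):
    others = np.delete(tree_preds, j, axis=0)  # drop tree j's row
    return y - others.sum(axis=0)

y = np.array([3.0, 1.0])
tree_preds = np.array([[1.0, 0.5],
                       [2.0, 0.25]])
print(partial_residuals(y, tree_preds, 0))  # y minus tree 2's fit
```

Updating tree $j$ against these residuals is what makes the backfitting scheme work: each tree only has to explain what the others have not.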
The algorithm samples from $p(T_j \mid T_{-j}, y, \tau)$ via two main steps:

1. propose a new tree $T_j^*$ through one of 4 proposal moves and calculate the Metropolis-Hastings acceptance probability, and

2. sample a new set of terminal node parameters via $p(M_j \mid T_j, y, \tau)$, for the new tree.
The new trees are proposed using a Metropolis-Hastings sampler ^{6}, where the candidate tree is obtained via one of four moves: GROW (a new node is added to the tree), PRUNE (a random node is removed from the tree), CHANGE (a random splitting rule is changed in the tree) or SWAP (a random internal node is swapped in the tree). The choice of move requires probabilities for each move, which can be equal or depend on prior beliefs about moves that should be prioritised. In ^{26}, the authors propose more detailed moves for the generation of new candidate trees.
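For completeness, the acceptance probability in this Metropolis-Hastings step takes the usual form (written here in standard notation, with $q$ the proposal distribution induced by the four moves, and $p(y \mid T_j, \tau)$ the marginal likelihood with the terminal node parameters integrated out):

```latex
\alpha(T_j, T_j^*) = \min\left\{ 1,\;
  \frac{p(y \mid T_j^*, \tau)\, p(T_j^*)\, q(T_j \mid T_j^*)}
       {p(y \mid T_j, \tau)\, p(T_j)\, q(T_j^* \mid T_j)} \right\}.
```

The availability of the marginal likelihood in closed form is exactly what makes this ratio cheap to evaluate at every proposal.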
3 Hierarchical Embedded Bayesian Additive Regression Trees (HEBART)
Our HEBART approach merges ideas from traditional Bayesian hierarchical modelling ^{11} and linear mixed effects models ^{23} with BART. We allow each tree to have an extra split on each terminal node corresponding, in the simplest version, to a random intercept for each level of a categorical predictor variable. Thus we introduce intra-group node parameters $\mu_k$, where $k$ is the group index. This allows us to have a group-specific prediction for each node, as well as an overall terminal node prediction $\mu$. The flexibility of this structure means that there is no requirement for the user to specify where the random effect is included, for example as an intercept or as a regression slope. With HEBART we can fit Bayesian Additive Regression Trees to any kind of grouped data where there is such a categorical predictor, such as longitudinal data, repeated measures data, multilevel data, and block designs. In addition, having the two levels of predictions is advantageous in scenarios where the group information is not available for all or a subset of the new data.
To define the HEBART model, let $m$ be the number of trees, $J$ the total number of groups, and $\mathcal{M}_j$ the set of node parameters at both the terminal node and sub-terminal node level. More fully, for each tree $j = 1, \ldots, m$ we have one set of terminal node predictions $\mu$, and one set of group parameters $\mu_k$ for each group $k = 1, \ldots, J$ within each terminal node. Assuming we have a continuous variable of interest, the fundamental HEBART model appears similar to the standard BART approach:

(3) $y_{ik} = \sum_{j=1}^{m} g(x_i, k; T_j, \mathcal{M}_j) + \epsilon_{ik},$

for observation $i$ in group $k$. In this specification, $g$ is the tree lookup function, which allows for predictions based on the covariates $x_i$ and provides both the terminal node and sub-terminal node values as desired. As before, $T_j$ is the tree structure, but $\mathcal{M}_j$ contains both the terminal node parameters $\mu$ and the sub-terminal node group parameters $\mu_k$ for tree $j$. As with standard BART, the noise is assumed to be distributed as $\epsilon_{ik} \sim N(0, \tau^{-1})$, where $\tau$ is the overall residual precision.
The $\mu$ and $\mu_k$ sets contain, respectively, the overall terminal node mean parameters and the intra-group sub-terminal node mean parameters. In other words, for all trees we will have one set $\mu$ for each of their terminal nodes, and for each group within each terminal node we will have yet another set $\mu_k$, where not all of the groups need to be available in each terminal node. All parameters in $\mathcal{M}_j$ receive priors and have corresponding posteriors, which are sampled from in the fitting algorithm. Figure 1 illustrates what a BART tree and a HEBART tree look like. In the Figure, the terminal circles represent the terminal nodes, which all have their own parameters. In the HEBART tree, however, we have the addition of the intra-group parameters, represented by the hexagonal symbols.
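The two levels of prediction can be pictured with a toy lookup: use the group's sub-terminal node parameter when it is available, and fall back to the overall terminal node mean otherwise. The node representation below is invented purely for illustration.

```python
# Toy illustration of HEBART's two-level terminal-node prediction.
# Each terminal node stores an overall mean "mu" and a dict "mu_g" of
# group-specific means; this structure is an expository assumption.

def node_predict(node, group=None):
    """Group-level prediction when available, overall mean otherwise."""
    if group is not None and group in node["mu_g"]:
        return node["mu_g"][group]
    return node["mu"]

node = {"mu": 2.0, "mu_g": {"A": 1.5, "B": 2.5}}
print(node_predict(node, "A"))   # group-specific prediction: 1.5
print(node_predict(node, "C"))   # unseen group: overall mean 2.0
print(node_predict(node))        # group unknown: overall mean 2.0
```

The fallback branch is what allows HEBART to predict for observations whose group was never seen, or is missing, at prediction time.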
3.1 Prior and posterior distributions
For HEBART a single terminal node in tree $j$ has partial residuals

(4) $R_{ikj} = y_{ik} - \sum_{l \neq j} g(x_i, k; T_l, \mathcal{M}_l),$

for observation $i$ in group $k$. Now let $R_k$ represent the full set of partial residuals for group $k$, where we drop the dependence on terminal node and tree for simplicity of notation. We then obtain a normal distribution on the partial residuals $R_k$ centred on the group parameter $\mu_k$, with a conjugate normal prior on $\mu_k$ centred on the overall node parameter $\mu$. A precision-scaling hyperparameter is responsible for scaling the residual precision $\tau$ up to a value that captures the intra-group precision. We use a prior with fixed hyperparameters for this scaling parameter, and its values are sampled via a Metropolis-Hastings step. Optionally, other prior families, or even Penalised Complexity priors ^{32}, could be used. We also set a normal prior on the overall node parameter $\mu$, whose scale is fixed for now, but it could also be given a prior and sampled in future work. As for the overall residual precision $\tau$, its prior is set as $\tau \sim \mathrm{Ga}(\alpha_\tau, \beta_\tau)$, where $\alpha_\tau$ and $\beta_\tau$ are fixed.
To find the posterior distribution of a tree, we first create the group model matrix $Z$, which maps each observation to its group, so that the group parameters $\mu_k$ can be integrated out of the model. With further marginalisation we obtain a collapsed distribution for the partial residuals that depends only on the tree structure and the precision parameters.
We use this collapsed distribution to obtain the marginal posterior distribution of a new tree in the MH step. Using the priors defined above, the posterior update for the overall node parameter $\mu$ is given by

(5) 

which depends on the overall residuals for tree $j$, the number of trees, and the precision and precision-scaling parameters, while the update for the group parameters $\mu_k$ is

(6) 

where $\bar{R}_k$ is the residual average for group $k$ in the specified tree node, and $n_k$ is the number of observations in each of those groups. Note that, using the marginalisation trick mentioned previously, we have integrated $\mu_k$ out of the posterior distribution of $\mu$ and thus removed the mutual dependence of the $\mu$ and $\mu_k$ posteriors, as the sampling distribution for $\mu$ does not directly depend on $\mu_k$, only on the current tree residuals. This facilitates posterior sampling from both distributions, because the number of nodes varies from iteration to iteration for each tree (see Algorithm 1).
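The group model matrix $Z$ used in this marginalisation is simply a one-hot indicator of group membership; a minimal construction (layout assumed for illustration, not the paper's code) is:

```python
import numpy as np

# Build the group model matrix Z: Z[i, g] = 1 iff observation i
# belongs to group g. Groups are indexed by order of first appearance.

def group_matrix(groups):
    labels = list(dict.fromkeys(groups))        # unique, order-preserving
    Z = np.zeros((len(groups), len(labels)))
    for i, g in enumerate(groups):
        Z[i, labels.index(g)] = 1.0
    return Z

Z = group_matrix(["A", "B", "A", "C"])
print(Z)   # each row has a single 1 marking the observation's group
```

Because $Z Z^\top$ encodes which pairs of observations share a group, matrices of this form are the standard device for writing grouped models in a way that lets the group-level parameters be integrated out analytically.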
For the posterior update of the residual precision $\tau$, first let $\hat{y}_i$ be the overall prediction for observation $i$ at the current iteration. Then the posterior distribution will be:

(7) 

where $b$ is the total number of terminal nodes, and all the other parameters are as previously described. This distribution depends on the overall prediction residuals and the grouping hyperparameters, as well as on the node-level residuals.
The fitting algorithm is described in Algorithm 1. We initialise all inputs, all hyperparameter starting values, and all other elements used. In each MCMC iteration, we either accept or reject a new tree for each tree $j$, created from one of the 4 tree moves previously described. At this step, the full set $\mathcal{M}_j$ is also sampled, and the sum of trees is calculated after all parameters are sampled. When each tree is updated, a new set of partial residuals is defined and used in the sampling of the next tree. After the new trees are sampled, their predictions are used to sample the residual precision parameter $\tau$. Finally, we use an MH step to either accept or reject a new sample of the precision-scaling parameter, which is the last parameter updated in each iteration.
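The control flow of Algorithm 1 can be sketched as a runnable skeleton; the update functions below are hypothetical stubs standing in for the conditional updates derived in Section 3.1, and the names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)

def update_tree(y, preds, j, rng):
    # stub: MH accept/reject of a proposed tree, then sample mu, mu_k
    return preds[j] + 0.0

def update_tau(y, total_pred, rng):
    # stub: draw from the residual-precision full conditional
    return 1.0 / max(np.var(y - total_pred), 1e-8)

def hebart_mcmc(y, m=3, n_iter=5, rng=rng):
    n = len(y)
    preds = np.full((m, n), y.mean() / m)  # initialise each tree flat
    taus = []
    for _ in range(n_iter):
        for j in range(m):                          # one tree at a time,
            preds[j] = update_tree(y, preds, j, rng)  # others held fixed
        taus.append(update_tau(y, preds.sum(axis=0), rng))
        # (the MH step for the precision-scaling parameter would go here)
    return preds, taus

y = rng.normal(size=20)
preds, taus = hebart_mcmc(y)
```

The nested loop (trees inside iterations) and the ordering (trees, then precision, then the scaling parameter) mirror the description above; everything inside the stubs is where the closed-form conditionals of Section 3.1 do the real work.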
4 Applications & Results
This section describes the fitting of HEBART to some example datasets. We compare across different models, including: a linear mixed-effects (LME) model, fitted using the lme4 ^{1} R ^{28} package; a Bayesian LME model, fitted using the brms ^{4} R package, with the default values for the number of MCMC iterations (2000), MCMC chains (4), and warm-up (1000), and a default prior for the parameter coefficient(s); and standard BART, fitted using the dbarts ^{10} R package, with the default values for the number of trees (200), number of posterior samples (1000), and number of burn-in samples (100). The main performance results presented are calculated on out-of-sample values.
4.1 Simulated data
We first test HEBART on a simulated dataset. The response variable is simulated as a sum of trees, for which the tree structures are created by splitting the covariate at a random value, meaning we sample values as

(8) 

where the grouping information is sampled at random for each observation, and each mean parameter is created by the splitting value [TO REVIEW]. The remaining parameters are set to fixed values, and the group-level parameters are sampled from their prior. The single covariate is also randomly sampled. The dataset is split into 30 different train and test sets using a 30-fold cross-validation setting ^{29}.
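A generic analogue of this grouped, single-split generating process (not the paper's exact simulation code; parameter values and structure here are illustrative only) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Generic analogue of a grouped single-split data generator: each tree
# splits x at a random value c, and each (side, group) pair gets its
# own mean. All settings below are illustrative, not the paper's.

def simulate(n=200, n_groups=5, n_trees=3, tau=1.0, rng=rng):
    x = rng.uniform(size=n)
    groups = rng.integers(n_groups, size=n)
    y = np.zeros(n)
    for _ in range(n_trees):
        c = rng.uniform(0.2, 0.8)               # random split value
        mu = rng.normal(size=(2, n_groups))     # (side, group) means
        side = (x >= c).astype(int)
        y += mu[side, groups]                   # group-specific node mean
    y += rng.normal(scale=tau ** -0.5, size=n)  # residual noise
    return x, groups, y

x, groups, y = simulate()
```

Data of this form contains exactly the intra-group structure HEBART targets: two observations with similar $x$ can still differ systematically when they belong to different groups.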
To this simulated dataset, we apply four different algorithms (LME, Bayesian LME, standard BART and HEBART) to the 30 train sets and make predictions for the 30 test sets. In Figure 2, we can see that HEBART produces the smallest root mean squared errors for both the train and test sets, as its averages are the closest to zero. The RMSE table shows this in numbers, with empirical confidence intervals, confirming that HEBART produces the lowest values for the 30 test sets. In the same figure, we also have the comparison between the sampled values for the intra-group standard deviation and its corresponding true value from the simulation. We observe that this empirical density is also centred on its true value, reinforcing the previous conclusion.
(a) Sampled and true values for the intra-group standard deviation
(b) Boxplots of total RMSE for the 30 test sets 
Source  HEBART  LME  Bayesian LME  BART 

Test set  0.832 [0.776,0.889]  0.896 [0.833,0.96]  1.637 [1.604,1.67]  0.951 [0.888,1.013] 
Train set  0.729 [0.724,0.735]  0.891 [0.889,0.893]  1.628 [1.624,1.633]  0.911 [0.909,0.914] 
(c) Average test RMSE table, with the corresponding empirical 95% confidence intervals 
4.2 Sleep study data
This dataset consists of the first 10 days of a sleep study ^{2} for the sleep-deprived group. The response variable is the average reaction time per day (in milliseconds), and the covariate is the number of days of sleep deprivation, for 18 subjects. Linear mixed effects models ^{23} are often applied to this data ^{1}, so we compare our results directly.
For this dataset, we created two modelling scenarios: one with all 18 subjects present in the training set (80% train/20% test), and one where subject IDs 308, 309 and 351 are not present in the training set (119 observations for training, 61 for testing, or 75% train/25% test). With this, we want to assess the predictive capability of our model when new grouping categories are introduced, as we have both intra-group and out-of-group predictions. The root mean squared error is also compared to standard BART ^{8}, Bayesian LME ^{11}, and LME ^{23} results.
In Figure 3, we show the HEBART predictions for IDs 308, 309 and 351, which were selected at random to be used as prediction examples in the two modelling scenarios. The plot also shows the true observations as blue dots, and the LME fit with its corresponding confidence interval. Our model predicts better for all 3 groups: ID 308 has a highly variable pattern, which is not well captured by the LME line; ID 309 is a more interesting case, where response time seems to almost decrease as the days of sleep deprivation increase, and although it is a bit easier for LME, HEBART still produces predictions more correlated with the true observations; and ID 351 forms a more wiggly curve, also favouring HEBART. The RMSE for these three subjects is much better for HEBART (almost half the value for LME).
In Figure 4 we have the same information, except that the predictions are made for IDs that did not exist in the training set for both HEBART and LME. As expected, the two fits produce wide confidence (LME)/credible (HEBART) intervals, and these are the same for the three IDs. However, the predictions are noticeably better for HEBART, as they are more flexible than the LME straight line. For ID 351, both models seem to predict well, but HEBART does a little better at learning the overall behaviour of the data. As for ID 309, the two predictions are fairly poor, and we cannot really say one is better than the other. And for ID 308, the two fits predict roughly half of the observations well, with the other half too far from what it should be. The root mean squared error (RMSE) has, understandably, increased considerably, but it is still much better for HEBART.
Table 1 presents the root mean squared error for the full training and test sets of the two modelling scenarios. We compare our results to LME, Bayesian LME and standard BART, and in all 4 cases HEBART produces the smallest RMSE values. In the two scenarios, the training HEBART RMSE is almost half that of the LME, which should be a strong competitor when modelling this dataset. The test RMSE for the missing-IDs scenario increased considerably for all algorithms, but that is due to the reduction in training set size, meaning there are fewer observations to inform the models about the underlying generating structure of the data. Even so, HEBART does better in the test sets of both scenarios.
Scenario  Source  HEBART  LME  Bayesian LME  BART 

All IDs in training data  Test (20%)  0.50  0.52  0.52  0.78 
All IDs in training data  Train (80%)  0.41  0.53  0.53  0.85 
Missing IDs in training data  Test (25%)  0.94  1.02  0.97  1.02 
Missing IDs in training data  Train (75%)  0.36  0.50  0.49  0.77 
4.3 Code and data availability
The functions created to run the HEBART algorithm have been written entirely in R ^{28}, and we hope to shortly produce an R package. All functions are available at https://github.com/brunaw/HEBART, as well as the code for replicating all the examples shown here, with their corresponding auxiliary files (simulation functions, plotting, tables, etc.).
5 Conclusions
We have provided a new extension of Bayesian Additive Regression Trees that allows for structured data to be used appropriately at the model fitting and prediction stage. Where a grouping variable is present we are able to provide predictions at both the group level and the level above it. The flexible tree structure thus allows us to produce excellent predictions without the need to specify how the grouping variable is included in the model structure. However we retain the ability to report and remove the group level variability by having a parameter which represents the group level standard deviation.
In simulation studies we have shown that the model performs better than other common linear and mixed effects approaches. It is not directly comparable to standard Bayesian machine learning methods due to their inability to handle the grouping information. In the real data example of Section 4.2 we have demonstrated that HEBART competes well with standard linear mixed modelling strategies, despite this being the archetypal example for linear mixed effects models.
We see many potential extensions of our approach. These may include:

Extending the hierarchical data structure to multiple or nested grouping variables

Explicitly modelling joint random effects using the multivariate normal distribution, and so estimating covariances between grouping variables

Including recent BART extensions, such as SOFT BART ^{18} and MOTR-BART ^{24}, which can substantially enhance the predictive capabilities of the model
In a future version of this paper we plan to include a more detailed discussion of the hyperparameter settings of the model that are robust to different circumstances, and include a more detailed worked example of the model on a more advanced data set.
Acknowledgments
This work was supported by the Science Foundation Ireland Career Development Award grant number: 17/CDA/4695. Andrew Parnell’s work was also supported by: a Science Foundation Ireland investigator award (16/IA/4520); a Marine Research Programme funded by the Irish Government, cofinanced by the European Regional Development Fund (GrantAid Agreement No. PBA/CC/18/01); European Union’s Horizon 2020 research and innovation programme InnoVar under grant agreement No 818144; SFI Centre for Research Training in Foundations of Data Science 18/CRT/6049, and SFI Research Centre awards IForm 16/RC/3872 and Insight 12/RC/2289_P2. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Author contributions
Bruna Wundervald has contributed to the project ideation, mathematical development, code implementation, production and evaluation of results. Andrew Parnell has contributed to the project ideation, mathematical development, code implementation, and evaluation of results. Katarina Domijan has contributed to the project ideation, mathematical development, and evaluation of results.
Financial disclosure
None reported.
Conflict of interest
The authors declare no potential conflict of interests.
Appendix A Detailed mathematics behind the fitting algorithm
References
 Bates, D., Maechler, M., Bolker, B., Walker, S., Christensen, R. H. B., Singmann, H. and Grothendieck, G. (2011). Package 'lme4': Linear mixed-effects models using S4 classes. R package.
 Belenky, G., Wesensten, N. J., Thorne, D. R., Thomas, M. L., Sing, H. C., Redmond, D. P. and Balkin, T. J. (2003). Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study. Journal of Sleep Research, 12(1), 1–12.
 Blattenberger, G. and Fowles, R. (2014). Avalanche forecasting: Using Bayesian additive regression trees (BART). In Demand for Communications Services – Insights and Perspectives (pp. 211–227). Springer.
 Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28.
 Carlson, C. J. (2020). embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11(7), 850–858.
 Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4), 327–335.
 Chipman, H. A., George, E. I. and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.
 Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
 Deshpande, S. K., Bai, R., Balocchi, C., Starling, J. E. and Weiss, J. (2020). VCBART: Bayesian trees for varying coefficients. arXiv preprint arXiv:2003.06416.
 Dorie, V. (2021). dbarts: Discrete Bayesian Additive Regression Trees Sampler. R package version 0.9-20. https://CRAN.R-project.org/package=dbarts
 Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
 Guo, S. (2005). Analyzing grouped data with hierarchical linear modeling. Children and Youth Services Review, 27(6), 637–652.
 Hastie, T. and Tibshirani, R. (2000). Bayesian backfitting (with comments and a rejoinder by the authors). Statistical Science, 15(3), 196–223.
 Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240.
 Kindo, B. P., Wang, H., Hanson, T. and Pena, E. A. (2016). Bayesian quantile additive regression trees. arXiv preprint arXiv:1607.02676.
 Kindo, B. P., Wang, H. and Peña, E. A. (2016). Multinomial probit Bayesian additive regression trees. Stat, 5(1), 119–131.
 Linero, A. R. (2017). A review of tree-based Bayesian methods. Communications for Statistical Applications and Methods, 24(6), 543–559.
 Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522), 626–636.
 Linero, A. R., Sinha, D. and Lipsitz, S. R. (2020). Semiparametric mixed-scale models using shared Bayesian forests. Biometrics, 76(1), 131–144.
 Linero, A. R. and Yang, Y. (2018a). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087–1110.
 Linero, A. R. and Yang, Y. (2018b). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087–1110.
 Murray, J. S. (2017). Log-linear Bayesian additive regression trees for categorical and count responses. arXiv preprint arXiv:1701.01503.
 Pinheiro, J. C. and Bates, D. M. (2000). Linear mixed-effects models: basic concepts and examples. In Mixed-Effects Models in S and S-PLUS (pp. 3–56). Springer.
 Prado, E. B., Moral, R. A. and Parnell, A. C. (2021). Bayesian additive regression trees with model trees. Statistics and Computing, 31(3), 1–13.
 Prado, E. B., Parnell, A. C., McJames, N., O'Shea, A. and Moral, R. A. (2021). Semi-parametric Bayesian additive regression trees. arXiv preprint arXiv:2108.07636.
 Pratola, M. T. (2016). Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3), 885–911.
 Pratola, M. T., Chipman, H. A., George, E. I. and McCulloch, R. E. (2020). Heteroscedastic BART via multiplicative regression trees. Journal of Computational and Graphical Statistics, 29(2), 405–417.
 R Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria. https://www.R-project.org/
 Refaeilzadeh, P., Tang, L. and Liu, H. (2009). Cross-validation. In Encyclopedia of Database Systems (pp. 532–538). Springer.

 Ročková, V. and Saha, E. (2019). On theory for BART. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 2839–2848).
 Ročková, V. and van der Pas, S. (2020). Posterior concentration for Bayesian regression trees and forests. The Annals of Statistics, 48(4), 2108–2131.
 Simpson, D., Rue, H., Riebler, A., Martins, T. G. and Sørbye, S. H. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28.
 Sparapani, R., Logan, B. R., McCulloch, R. E. and Laud, P. W. (2020). Nonparametric competing risks analysis using Bayesian additive regression trees. Statistical Methods in Medical Research, 29(1), 57–77.
 Sparapani, R. A., Logan, B. R., McCulloch, R. E. and Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in Medicine, 35(16), 2741–2753.
 Starling, J. E., Murray, J. S., Carvalho, C. M., Bukowski, R. K. and Scott, J. G. (2020). BART with targeted smoothing: An analysis of patient-specific stillbirth risk. The Annals of Applied Statistics, 14(1), 28–50.
 Vallejo, G., Fernández, M., Livacic-Rojas, P. and Tuero-Herrero, E. (2011). Comparison of modern methods for analyzing repeated measures data with missing values. Multivariate Behavioral Research, 46(6), 900–937.
 Waldmann, P. (2016). Genome-wide prediction using Bayesian additive regression trees. Genetics Selection Evolution, 48(1), 1–12.
 Zhang, J. L. and Härdle, W. K. (2010). The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis, 54(5), 1197–1205.
Author Biography
Bruna Wundervald received her First-Class Honours bachelor's degree in Statistics from the Federal University of Paraná, Brazil, in 2018. She is now pursuing her Ph.D., also in Statistics, at Maynooth University, Ireland. Her research interests include a wide variety of machine learning methods, including dimensionality reduction, Bayesian Additive Regression Trees (BART), variational inference, and probabilistic graphical models.
Andrew Parnell received his Ph.D. degree in Statistics from the University of Sheffield, United Kingdom. He is currently a Hamilton Professor at the Hamilton Institute at Maynooth University, Ireland. His research is in statistics and machine learning for large structured data sets in a variety of application areas, including treebased machine learning methods, Bayesian Additive Regression Trees (BART), extreme value theory, spatial statistics, zeroinflation modeling, and missing data analysis. He has been heavily involved in the commercialization of research and is currently a principal investigator in the SFI IForm Advanced Manufacturing Centre, a funded investigator in the SFI Insight Centre for Data Analytics, and an expert reviewer for the Intergovernmental Panel on Climate Change.
Katarina Domijan
received her Ph.D. degree in Statistics from Trinity College Dublin, Ireland. She is currently a lecturer in Statistics at the Department of Mathematics and Statistics of Maynooth University. Her main research interests lie in Bayesian modeling of classification problems in data with large feature spaces, including feature selection, model visualization, and interpretability.