Semi-parametric Bayesian Additive Regression Trees

08/17/2021
by Estevão B. Prado, et al.
Maynooth University

We propose a new semi-parametric model based on Bayesian Additive Regression Trees (BART). In our approach, the response variable is approximated by a linear predictor and a BART model, where the first component is responsible for estimating the main effects and BART accounts for the non-specified interactions and non-linearities. The novelty in our approach lies in the way we change tree generation moves in BART to deal with confounding between the parametric and non-parametric components when they have covariates in common. Through synthetic and real-world examples, we demonstrate that the performance of the new semi-parametric BART is competitive when compared to regression models and other tree-based methods. The implementation of the proposed method is available at https://github.com/ebprado/SP-BART.


1 Introduction

Generalised Linear Models (GLMs) [30, 26] are frequently used in different applications to predict a univariate response due to the ease of interpretation of the parameter estimates as well as the wide availability of software that facilitates simple analyses. A common assumption in GLMs is that the specified covariates (including potential interaction terms) have a linear relationship with the mean of the response after transformation through the link function. Extensions such as Generalised Additive Models (GAMs) [16, 42] require the specification of the main and interaction effects via a sum of (potentially non-linear) predictors. In GAMs, the non-linear relationship is usually captured via basis expansions of the covariates and constrained by a smoothing parameter. However, in problems where the numbers of covariates and/or observations are large, the linearity assumption may not hold and, more importantly, it may not be simple to specify the covariates and interactions that most affect the response. Semi-parametric models [14] have been proposed for situations where a mixture of both linear and non-linear trends and interactions is required for the data at hand.

Bayesian Additive Regression Trees (BART) [4] is a statistical model based on an ensemble of trees that can be used for classification and regression problems. Through a backfitting algorithm, BART sequentially generates a set of trees that, when summed together, return the predicted values. To control how deep or shallow a tree can be, a prior is placed on the tree structure. In addition, the covariates and split-points that are used to define the tree structure (i.e. the splitting rules) are randomly selected without the optimisation of a loss function, unlike in Random Forests [3] and Gradient Boosting [11]. Compared to regression models, BART is more flexible in the sense that it does not assume linearity between covariates and the response, and it does not require the specification of a linear predictor. In contrast, however, BART does not provide an easy way to quantify the effect of a covariate on the response as in a regression model, which in some applications may be the main goal.

In this work, we introduce a BART extension to a semi-parametric regression setting. In this new method, which we name SP-BART, we combine a linear predictor, where the effects of covariates of primary interest can be quantified, with a BART component that deals with covariates that are not of primary interest but may be related to the response. Our approach aims to tackle two issues found when analysing datasets with a large number of covariates using black-box type algorithms. Firstly, it is difficult to quantify the effects of individual covariates on the response. Algorithms such as BART and Random Forests are flexible and can produce good predictions, but in many situations (e.g. studies in medicine [43, 13, 19]) prediction is not the most important aspect. Instead, knowing how the covariates impact the response is crucial, but this quantification is not possible with vanilla BART or Random Forests. The second issue is that regression models are usually utilised to measure the effects that some covariates may have on a response, but these models require the specification of the interaction terms, which is complicated in large or high-dimensional datasets, though it is a requirement for most standard GLM settings. Lastly, they assume linearity between the response and the covariates, which is an overly rigid assumption in some cases.

In our novel semi-parametric BART, we have made structural changes in the way the trees are grown to avoid confounding between the estimates from the linear and non-linear components in the model. In particular, in the linear predictor we allow the specification of fixed and random effects, as in a Linear Mixed Model, in which the intercept and parameter estimates can vary by a grouping factor. In contrast, the interactions and non-linearities are dealt with by the BART component. Through two simulation experiments, we compare SP-BART with its main competitors and show that it is able to recover the true effects in either the presence or absence of interactions. In the real-world applications, we demonstrate that SP-BART presents excellent performance when compared to other semi-parametric models.

This paper is organised as follows. In Section 1.1, we present some related works. In Section 2, we introduce the BART model. In Section 3, we state the proposed semi-parametric BART. In Sections 4 and 5, we compare the performance of SP-BART with other algorithms on synthetic and real-world datasets. To conclude, in Section 6 we present a discussion.

1.1 Related work

BART has been used and extended in many different applications, and its theoretical properties have also gained attention more recently. For instance, BART has been applied to credit risk modelling [44], survival and competing risks analysis [39, 38], biomarker discovery [17], plant-based genetics [37], and causal inference [19, 12, 13]. Furthermore, it has also been extended to high-dimensional data [23, 18], polychotomous responses [21, 29], zero-inflated and semi-continuous data [29, 24], heteroscedastic data [32], and to the estimation of linear, smooth and monotone functions [31, 40, 5]. Regarding theoretical developments, we highlight the works of [35], [34], and [25], who provide results related to the convergence of the posterior distribution generated by the BART model.

A work related to our novel model is the semi-parametric BART proposed by [43]. In their model, the design matrix is split into two disjoint subsets $X_1$ and $X_2$, which contain covariates of primary and non-primary interest, respectively. The covariates of primary interest in $X_1$ are specified in a linear predictor and the others are exclusively used by BART. Our work differs in that i) we do not assume that $X_1$ and $X_2$ are disjoint, i.e. we allow that $X_1 \cap X_2 \neq \emptyset$ or even that $X_1 \subseteq X_2$. Moreover, ii) we change the way the trees in BART are grown by introducing 'double grow' and 'double prune' moves to account for confounding, and iii) we place a prior on the variance-covariance matrix of the main effects so that we are able to analyse the covariance/correlation among them.

[6] introduce Varying Coefficient BART (VCBART), which combines the idea of varying coefficient models [15] with BART and extends the work of [13] to a framework with multiple covariates. In VCBART, the response is modelled via a linear predictor where the effect of each covariate is approximated by a BART model based on a set of modifiers (i.e. covariates that are not of primary interest). In addition, they provide theoretical results related to the asymptotic concentration of the VCBART posterior distribution around the true posterior considering non-i.i.d. errors. Although VCBART and SP-BART have aspects in common, such as the use of a linear predictor along with BART, our work is structurally different as we do not estimate the parameters in the linear predictor via BART. Instead, they are obtained in the same fashion as in a Bayesian linear regression approach.

2 BART

Introduced by [4], BART is a Bayesian tree-based model that was initially proposed in the context of regression and classification problems. Unlike some regression models that assume linearity between the covariates and response, and require the specification of the main effects and interaction terms via a linear predictor, BART deals with non-linearities and interactions without any pre-specification. In BART, a univariate response is approximated by a sum of $m$ trees with

$y_i = \sum_{t=1}^{m} g(\mathbf{x}_i; \mathcal{T}_t, \mathcal{M}_t) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$

where $g(\cdot)$ is a function that assigns the predicted value $\mu_{t\ell}$ to all observations that fall into the terminal node $\ell$ of the tree $t$, $\mathbf{x}_i$ is the $i$-th row of the design matrix X, $\mathcal{T}_t$ represents the topology of the tree $t$, and $\mathcal{M}_t = (\mu_{t1}, \dots, \mu_{tb_t})$ denotes the vector with all predicted values of the tree $t$. In relation to the number of trees $m$, [4] recommend $m = 200$ as a default, although they point out that $m$ can also be selected via cross-validation depending on the application.

Unlike other tree-based algorithms, where a loss function is minimised to define the splitting rules in the growing process, in BART the splitting rules are uniformly defined (i.e. the covariates and their split points are selected at random based on a uniform distribution). In addition, in the BART model the structure of the trees can be modified by four moves: grow, prune, change, and swap (see Figure 1). For instance, in the grow move a terminal node is randomly selected and two new terminal nodes are created below it. During a prune move, a parent of two terminal nodes is picked at random and its children are removed. In the change move, an internal node is randomly selected and its splitting rule is changed. Finally, in the swap move the splitting rules associated with two parents of terminal nodes are exchanged, with the parent nodes being selected at random.

Figure 1: An example of a tree generated from BART at 4 different instances. In principle, BART does not generate only one tree but rather a set of trees that summed together are responsible for the final prediction. The tree is represented as $\mathcal{T}^{(k)}$, where $k$ denotes the iteration at which the tree is updated. The splitting rules (covariates and their split points) are presented in the internal nodes (squares). The predicted values are shown inside the terminal nodes (circles). $\mathcal{T}^{(1)}$ illustrates the tree at iteration one with three terminal nodes and two internal nodes. From $\mathcal{T}^{(1)}$ to $\mathcal{T}^{(2)}$, the grow move is illustrated, as a terminal node of $\mathcal{T}^{(1)}$ is split into two new terminal nodes in $\mathcal{T}^{(2)}$. In addition, the pruning process can be seen when $\mathcal{T}^{(2)}$ reverts to $\mathcal{T}^{(1)}$. The change move is shown when comparing $\mathcal{T}^{(2)}$ and $\mathcal{T}^{(3)}$, as a splitting rule is modified. Finally, the swap move is illustrated in the comparison of $\mathcal{T}^{(3)}$ and $\mathcal{T}^{(4)}$.

At the beginning of the BART model, it is common to set all trees as stumps. Hence, within an MCMC structure, each tree is updated in turn by a grow, prune, change, or swap move and then compared to its previous version considering the partial residuals and the structure of both trees via a marginal likelihood calculation. This comparison is carried out via a Metropolis-Hastings step, and is needed to select only splitting rules that improve the final prediction, since they are chosen based on a uniform distribution. After this, all node-level parameters ($\mu_{t\ell}$) are generated and the residual variance ($\sigma^2$) is updated; see Appendix A for details. As a Bayesian model, BART places priors on the parameters of interest assuming that $\mu_{t\ell} \sim N(0, \sigma_{\mu}^2)$ and $\sigma^2 \sim IG(\nu/2, \nu\lambda/2)$, where $IG$ represents the Inverse Gamma distribution and $\sigma_{\mu} = 0.5/(k\sqrt{m})$ with $k = 2$. In addition, a branching process prior is considered to control the depth of the trees in the form of $\alpha(1 + d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \geq 0$. With this prior, each internal node is observed at depth $d$ with probability $\alpha(1 + d)^{-\beta}$; [4] recommend $\alpha = 0.95$ and $\beta = 2$, which tends to favour shallow trees. In practice, BART is available through the R packages bartMachine [20], BART [27], and dbarts [7].
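As a point of reference, the snippet below is a minimal sketch of fitting a standard BART model in R with the BART package; the function wbart and its arguments follow that package's interface, while the simulated data and hyperparameter values are purely illustrative.

# Fit a standard BART model to simulated data (illustrative sizes).
library(BART)
set.seed(42)
n <- 250; p <- 10
X <- matrix(runif(n * p), n, p)
y <- 10 * sin(pi * X[, 1] * X[, 2]) + rnorm(n)
# ntree = number of trees m; nskip/ndpost = burn-in and retained MCMC draws.
fit <- wbart(x.train = X, y.train = y, ntree = 200, nskip = 2000, ndpost = 2000)
y_hat <- fit$yhat.train.mean  # posterior mean predictions for the training data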

3 New semi-parametric BART

In our novel semi-parametric approach, we separate the design matrix X into two subsets: $X_1$ and $X_2$. The first matrix contains the covariates we wish to include in a linear component to quantify the main effects, and $X_2$ is a matrix that contains covariates that might contribute to predicting the response, but are not of primary interest. Unlike [43], who assume that $X_1 \cap X_2 = \emptyset$, we consider that $X_1$ and $X_2$ may have covariates in common. In this sense, we consider a linear predictor inside the BART framework with

$y_i = \mathbf{x}_{1i}^{\top} \boldsymbol{\beta} + \sum_{t=1}^{m} g(\mathbf{x}_{2i}; \mathcal{T}_t, \mathcal{M}_t) + \epsilon_i,$

where $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$ denote the $i$-th rows of $X_1$ and $X_2$, $\boldsymbol{\beta}$ is the vector of main effects, and $\epsilon_i \sim N(0, \sigma^2)$. In [4] and [43], the set of trees used in BART can be modified by four moves (grow, prune, change and swap). However, to allow $X_1$ and $X_2$ to share covariates, we propose a change in the moves in order to minimise confounding between the linear and BART components. In this sense, if $X_1 \cap X_2 \neq \emptyset$, we propose a 'double grow' step for any covariate that is common to both matrices. For instance, if a tree is a stump and a shared covariate $x_j$ is selected to define a splitting rule, then another covariate (e.g., $x_k$) will also be picked and the proposed tree will have at least $x_j$ and $x_k$ in its structure. The idea of this modification is to induce interaction for those covariates that are in both components and let the linear component capture their main effects. If no double growing is done when $X_1$ and $X_2$ have at least one covariate in common, both components (linear predictor and BART) try to estimate the effect of the covariates that are in the linear predictor and inevitably there is some degree of confounding. However, when we introduce the double grow move, we allow the linear component to estimate the main effects and force BART to work specifically on the interactions and non-linearities. A small sketch of this proposal mechanism is given below.
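To make the 'double grow' step concrete, the following R sketch shows how a proposal could force a second covariate into the tree whenever the first chosen covariate is shared with the linear predictor. The helper grow() and the representation of a tree are hypothetical and only stand in for the corresponding internals of our implementation.

# Sketch of a 'double grow' proposal. grow(tree, var) is a hypothetical
# helper that splits a randomly chosen terminal node on covariate 'var'.
propose_double_grow <- function(tree, all_vars, shared_vars, grow) {
  split_var <- sample(all_vars, 1)  # splitting covariate chosen uniformly
  new_tree <- grow(tree, split_var)
  if (split_var %in% shared_vars) {
    # Covariate also appears in the linear predictor: grow again on a second
    # covariate so the tree encodes an interaction rather than a main effect.
    second_var <- sample(setdiff(all_vars, split_var), 1)
    new_tree <- grow(new_tree, second_var)
  }
  new_tree
}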

Equations (1) and (2) present the full conditionals for $\boldsymbol{\beta}$ and $\boldsymbol{\Omega}$. These expressions are needed due to the addition of the linear predictor into the BART model; see Appendix B for full details. An outline algorithm for the process is given as:

  1. Update the linear predictor with the partial residuals $R_{\beta} = y - \sum_{t=1}^{m} g(X_2; \mathcal{T}_t, \mathcal{M}_t)$ as

    (1) $\boldsymbol{\beta} \mid \boldsymbol{\Omega}, \sigma^2, R_{\beta} \sim MVN\left( \left( X_1^{\top} X_1 + \sigma^2 \boldsymbol{\Omega}^{-1} \right)^{-1} X_1^{\top} R_{\beta},\ \sigma^2 \left( X_1^{\top} X_1 + \sigma^2 \boldsymbol{\Omega}^{-1} \right)^{-1} \right),$
    (2) $\boldsymbol{\Omega} \mid \boldsymbol{\beta} \sim IW\left( \nu_{\Omega} + 1,\ S_{\Omega} + \boldsymbol{\beta}\boldsymbol{\beta}^{\top} \right),$

    where $\nu_{\Omega}$ and $S_{\Omega}$ are the hyperparameters of the Inverse-Wishart prior on $\boldsymbol{\Omega}$.

  2. Then, sequentially update all trees, one at a time, via a Metropolis-Hastings step based on the partial residuals $R_t = y - X_1 \boldsymbol{\beta} - \sum_{j \neq t} g(X_2; \mathcal{T}_j, \mathcal{M}_j)$.

  3. Finally, update

    (3) $\sigma^2 \mid y, \hat{y} \sim IG\left( \frac{n + \nu}{2},\ \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \nu\lambda}{2} \right),$

    where $\hat{y} = X_1 \boldsymbol{\beta} + \sum_{t=1}^{m} g(X_2; \mathcal{T}_t, \mathcal{M}_t)$ and $n$ is the number of observations.

In step 1, the parameter estimates and the variance-covariance matrix are updated taking into account the difference between the response and the prediction from all trees. In steps 2 and 3, each tree is modified considering the updated parameter estimates and the error variance is updated. The main benefits of our approach are i) lower RMSEs when compared to GLMs and GAMs (as demonstrated below) and ii) the ability to obtain interactions and non-linearities without requiring any pre-specification. Regarding the computational cost, we have found that the proposed method adds a negligible time overhead to traditional implementations of BART, especially for small to medium datasets.
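For concreteness, the following R sketch implements the conjugate draw in step 1 under the Normal/Inverse-Wishart forms written above; the argument names (including nu_Omega and S_Omega) are illustrative, not the interface of our package.

# One Gibbs update of the linear component (equations (1) and (2)).
# R_beta: partial residuals y minus the BART fit; X1: primary covariates.
update_linear <- function(R_beta, X1, Omega, sigma2, nu_Omega, S_Omega) {
  V <- solve(crossprod(X1) / sigma2 + solve(Omega))     # posterior covariance
  mu <- as.vector(V %*% crossprod(X1, R_beta)) / sigma2  # posterior mean
  beta <- MASS::mvrnorm(1, mu, V)
  # Conjugate Inverse-Wishart update for the covariance of the main effects.
  Omega <- MCMCpack::riwish(nu_Omega + 1, S_Omega + tcrossprod(beta))
  list(beta = beta, Omega = Omega)
}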

4 Simulation experiments

In this section, we compare our new SP-BART with GAM, the semi-parametric BART of [43] (hereafter referred to as separated semi-BART), and VCBART in terms of bias using two sets of synthetic data. The results were generated using R [33] version 3.6.3 and the R packages mgcv [42], semibart [43], and VCBART [6]. For both our SP-BART and separated semi-BART, we use 50 trees, 2,000 MCMC iterations as burn-in, and 2,000 as post-burn-in. As the GAMs require the specification of the terms that should be included in the linear predictor, for both scenarios we specify the true structure used to simulate the data. This gives the GAMs an unfair advantage over the other approaches, but does provide a baseline that the BART-based approaches can aim for.

4.1 Friedman dataset

In this first scenario, we compare SP-BART with GAM, separated semi-BART, and VCBART considering the Friedman equation [10]:

(4) $y_i = 10\sin(\pi x_{i1} x_{i2}) + 20(x_{i3} - 0.5)^2 + 10 x_{i4} + 5 x_{i5} + \epsilon_i,$

where $x_{ij} \sim U(0, 1)$ and $\epsilon_i \sim N(0, \sigma^2)$. This equation is commonly used for benchmarking tree-based methods on synthetic data, and has been used in many other papers [4, 6, 23]. In this experiment, we vary $n$ (the number of observations), $p$ (the number of covariates), and $\sigma^2$ (the error variance), totalling four scenarios. To evaluate the models' performances, we use the bias of the parameter estimates as the accuracy measure considering 10 Monte Carlo repetitions. As the Friedman equation considers only 5 covariates to generate the response, the additional covariates are noise and have no impact on $y$. In this simulation, we aim to estimate the linear effects associated with $x_4$ and $x_5$ (denoted by $\beta_4$ and $\beta_5$, respectively) using the linear predictor, i.e. we set up $X_1$ so that it contains only the covariates $x_4$ and $x_5$. In contrast, we let BART take care of the non-linear and interaction terms by setting $X_2$ to contain all covariates (including $x_4$ and $x_5$). Data from (4) can be simulated as in the sketch below.
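The following R snippet simulates data from the Friedman equation (4); the sizes passed at the bottom are illustrative, not the exact scenario settings used in our experiments.

# Simulate from the Friedman equation (4); columns beyond the fifth are noise.
sim_friedman <- function(n, p, sigma) {
  stopifnot(p >= 5)
  X <- matrix(runif(n * p), n, p)
  mu <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
    10 * X[, 4] + 5 * X[, 5]
  list(X = X, y = mu + rnorm(n, sd = sigma))
}
dat <- sim_friedman(n = 500, p = 10, sigma = 1)  # illustrative sizes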

Figure 2: Simulation results for the Friedman equation considering $n$ (number of observations), $p$ (number of covariates), and $\sigma^2$ (error variance). The y-axis exhibits the bias of the parameter estimates $\hat{\beta}_4$ and $\hat{\beta}_5$ for GAM, separated semi-BART, VCBART, and the new SP-BART. The boxplots summarise the bias obtained from 10 Monte Carlo repetitions. Recall that the GAM has been given the true model structure, so its superior performance is expected.

Figure 2 shows the bias obtained from GAM, separated semi-BART, VCBART and the new SP-BART for each combination of $n$, $p$ and $\sigma^2$. We can see that the bias of the parameter estimates is low and that SP-BART is able to recover the true effects. As GAM requires the specification of all terms that are estimated by the model, we set up the true structure of the Friedman equation so that it can be used as a reference in the comparison. Furthermore, the estimates from SP-BART and separated semi-BART are very similar. This similarity is expected and can be attributed to the fact that $x_4$ and $x_5$ have no interactions with other covariates. In this sense, the trees in SP-BART tend not to contain $x_4$ and $x_5$, as both effects are captured solely by the linear predictor.

4.2 Estimating main effects in the presence of interactions

In the scenario above, we have shown that the new SP-BART correctly estimates the main effects when they do not have any interaction with other effects. However, in practice the covariates of primary interest may interact with other effects, and this should be taken into account. In this scenario, we compare the methods using the regression function:

(5) $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + g(\mathbf{x}_i; \mathcal{T}) + \epsilon_i,$

where $\epsilon_i \sim N(0, \sigma^2)$ and $g(\mathbf{x}_i; \mathcal{T})$ represents the tree structure that can be seen in Figure 3. As in the Friedman equation, we consider $x_{ij} \sim U(0, 1)$ and vary $n$, $p$ and $\sigma^2$, where the additional covariates have no impact on the response. We are now interested in estimating the effects associated with $x_1$ and $x_2$ (denoted by $\beta_1$ and $\beta_2$, respectively), which can be carried out by specifying $X_1$ to contain only $x_1$ and $x_2$. In turn, $X_2$ contains all covariates (including $x_1$ and $x_2$).

Figure 3: The tree structure $g(\mathbf{x}; \mathcal{T})$ used in equation (5). In if-else format it can be written as a sum of terms of the form $c \times \mathbb{1}(\text{splitting rules})$, where $\mathbb{1}(\cdot)$ denotes the indicator function.

Figure 4 shows the bias of the parameter estimates $\hat{\beta}_1$ and $\hat{\beta}_2$. Again, we set up the true structure for the GAM, as it is unable to capture interactions that are not specified. We can see that separated semi-BART gives extremely large bias for both parameters in some scenarios. This bias occurs because $x_1$ and $x_2$ are not used as splits by the BART model in the separated semi-BART, which affects the estimation of their associated effects. In contrast, our new SP-BART estimates $\beta_1$ and $\beta_2$ with very low bias, regardless of the number of covariates ($p$) and the amount of variability ($\sigma^2$). Furthermore, it can be seen that VCBART and SP-BART provide similar bias for both parameters, and match well with the baseline GAM model. However, it is worth mentioning that VCBART uses a BART model to estimate each parameter in the linear predictor. In this example, VCBART used $2 \times 50 = 100$ trees in total to estimate $\beta_1$ and $\beta_2$, as we set up 50 trees for each parameter. In this sense, the greater the number of parameters to be estimated in the linear predictor, the more computationally intensive VCBART becomes, since the total number of trees used to estimate all covariate effects is a function of the number of covariates in the linear predictor and of the number of trees used to approximate each effect.

Figure 4: Simulation results for equation (5) considering $n$ (number of observations), $p$ (number of covariates), and $\sigma^2$ (error variance). The y-axis exhibits the bias of the parameter estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ for GAM, separated semi-BART, VCBART, and SP-BART. The boxplots summarise the bias obtained from 10 Monte Carlo repetitions.

5 Results on real datasets

5.1 TIMSS 2019

In this section, we illustrate SP-BART on the Trends in International Mathematics and Science Study (TIMSS) dataset. TIMSS is an international series of assessments which takes place every four years. The TIMSS 2019 data provide students' achievement in mathematics and science at the fourth and eighth grade levels in 64 countries [28, 8]. The dataset also contains information from surveys of students, teachers and school principals. Here, we are interested in quantifying the impact of some covariates on students' mathematics scores (variable BSMMAT01). In our analysis, we selected data from Ireland (4118 observations), students from the eighth grade level, and pre-selected 20 covariates; see Appendix C for full details.

Initially, we compare SP-BART with the Bayesian Causal Forest (BCF) proposed by [13]. To do so, we consider the covariate 'school discipline problems' in the linear predictor of SP-BART and as the treatment variable for BCF. Originally, this covariate is a factor with 3 levels ('hardly any problems', 'minor problems', and 'moderate to severe problems'), but we binarise it by grouping the first two categories into one, as BCF deals specifically with a binary covariate as the treatment effect. The goal is to quantify the impact of discipline problems on students' mathematics scores along with another 19 covariates; see Appendix C. In Table 1, we summarise the posterior distributions of the parameter estimates for BCF and SP-BART. We can see that the effect of discipline problems based on both methods is negative, which means that students who study in schools with moderate to severe discipline issues tend to have lower mathematics scores than students who study in schools with hardly any or minor discipline problems. In addition, we observe that the credible interval for BCF is wider than that of SP-BART.

Method Mean 5th percentile 95th percentile
BCF -34.53 -63.81 -6.15
SP-BART -48.20 -60.18 -38.74
Table 1: Descriptive measures of the posterior distribution for the effect of the covariate 'discipline problems' on the students' mathematics score. The parameter estimate corresponds to the factor level 'moderate to severe problems', as the reference level contains the categories 'hardly any problems' and 'minor problems'.

One of BCF's limitations is that it allows only one binary covariate as the variable of interest, which is not sensible for some problems. However, we can use VCBART, as it is an extension of BCF that permits us to evaluate the effect of more covariates of interest, regardless of whether they are continuous, factors, or binary. In this sense, we can consider more covariates in the linear predictor: here, parents' education level (factor with 6 levels), minutes spent on homework (factor with 6 levels), and school discipline problems (factor with 3 levels). For the BART component, we selected 17 other covariates (see Appendix C) plus the three present in the linear predictor. As there are common variables in both components, we are able to capture interactions and non-linearities involving the covariates of primary and non-primary interest without any pre-specification. For instance, if we were to fit a linear (mixed) model to these data, it would be necessary to specify all main effects terms as well as the interactions between them.

Table 2 presents the parameter estimates and the corresponding credible intervals associated with the covariates parents' education, minutes spent on homework, and school discipline problems. We can see that, in general, the higher the parents' education level, the greater the mathematics scores. For instance, parents who studied at a 'university or higher' or did a 'post-secondary' qualification have children with higher scores, on average, than children with parents who had formal education up to secondary level. In addition, we can see that students with parents who have 'primary, secondary or no formal education' tend to have lower scores on average. Furthermore, schools with 'moderate to severe' discipline problems have students with lower scores when compared with schools with 'hardly any' or 'minor' discipline issues. As we consider that the linear and BART components may have covariates in common, we verified that covariates that are in the linear predictor interact with some covariates of primary and non-primary interest. To detect the interaction effects, we adopted the approach of [20], in which variables are considered to interact with each other if they are adjacent in the same tree; a sketch of this counting heuristic is given after Table 2. For instance, we have observed in the results of SP-BART that parents' education shows interactions with minutes spent on homework. Similarly, school discipline problems appears to interact with absenteeism and minutes spent on homework. To analyse which interactions are important in VCBART, one would need to examine all trees for all covariates in the linear predictor. For instance, if there are 3 covariates in the linear predictor, it is necessary to evaluate 3 × 50 = 150 trees (along the MCMC iterations), as the effect associated with each covariate is approximated by 50 trees (by default).

SP-BART VCBART
Covariate Category Estimate 95% CI Estimate 95% CI
Parents’ education University or higher 17.59 (-0.37;35.34) (10.06;34.12)
Post-secondary but not University 17.31 (-0.55;34.17) (-11.78;23.84)
Upper Secondary -2.68 (-21.13;15.63) (-15.67;34.02)
Lower Secondary -11.02 (-29.94;6.99) (-20.83;8.81)
Primary, secondary or no school -16.69 (-36.45;4.24) (-61.01;39.74)
Not informed -4.50 (-22.31;12.74) (-27.62;8.64)
Minutes spent on homework no homework -19.21 (-65.77;21.53) (-129.09;18.44)
1 to 15 minutes -6.33 (-37.67;18.4) (-35.52;36.87)
16 to 30 minutes 10.49 (-8.67;30.15) (-5.67;16.44)
31 to 60 minutes 10.14 (-10.55;33.82) (4.14;17.76)
61 to 90 minutes 7.68 (-13.91;29.55) (-1.79;27.22)
More than 90 minutes -2.76 (-24.87;20.15) (-86.68;56.45)
School Discipline Problems Hardly any problems 14.86 (-8.36;39.95) (-7.84;27.42)
Minor problems 3.05 (-21.6;28.41) (5.46;11)
Moderate to severe problems -17.92 (-43.62;9.64) (-30.24;30.19)
Table 2: The mean and the credible interval, based on a training set (80%) of the TIMSS 2019 dataset, for the effect of parents' education, minutes spent on homework, and school discipline problems on the mathematics score. The out-of-sample Root Mean Squared Error (RMSE) for SP-BART and VCBART was 51.3 and 50.5, respectively. A post-processing step was applied to the parameter estimates so that they sum to zero for each covariate.
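The adjacency heuristic of [20] can be sketched in a few lines of R. The representation below, where each tree is a data frame of internal nodes with columns var and parent_var, is hypothetical and only stands in for the tree structures stored during the MCMC.

# Count pairwise interactions: two covariates are taken to interact when one
# splits directly below the other in the same tree (adjacency heuristic).
count_interactions <- function(trees) {
  labels <- as.character(unlist(lapply(trees, function(tr) {
    keep <- !is.na(tr$parent_var) & tr$parent_var != tr$var
    apply(cbind(tr$parent_var[keep], tr$var[keep]), 1,
          function(pair) paste(sort(pair), collapse = " x "))
  })))
  sort(table(labels), decreasing = TRUE)  # most frequent pairs first
}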

5.2 Pima Indians Diabetes

As another example to illustrate the applicability of SP-BART, we analyse the well-known Pima Indians Diabetes dataset from the UCI repository [2], which is available in R through the mlbench package [22]. The goal is to predict whether or not a patient has diabetes based on blood pressure, glucose concentration, age, body mass index, and 4 other covariates. The dataset used is the corrected version of the original data, which considers NA rather than zero for the values of the covariates glucose, blood pressure, triceps, insulin and mass [22]. As the response variable is binary, we used a probit link function following the data augmentation approach of [1]. Here, we are interested in measuring the effects of glucose and age through the linear predictor. For the BART component, we set $X_2$ to contain all 8 available covariates, including glucose and age.
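A minimal R sketch of this set-up is shown below; loading the corrected data via mlbench is standard, while the matrices X1 and X2 simply illustrate the covariate split described above (they are not arguments of a published SP-BART function).

# Load the corrected Pima data (NAs instead of zeros for impossible values).
library(mlbench)
data(PimaIndiansDiabetes2)
pima <- na.omit(PimaIndiansDiabetes2)
y <- as.integer(pima$diabetes == "pos")          # binary response (probit model)
X1 <- pima[, c("glucose", "age")]                # covariates of primary interest
X2 <- pima[, setdiff(names(pima), "diabetes")]   # all 8 covariates for BART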

In Table 3, we present the parameter estimates and the 95% credible intervals for glucose and age. For these data, we compare SP-BART with separated semi-BART, as the implementation of VCBART cannot deal with binary response variables. The estimates for both covariates indicate that, as they increase, the probability of observing a positive diagnosis of diabetes also increases, regardless of the method being used. However, the two methods differ when we look at the misclassification rate. We recall that the (separated) semi-BART of [43] assumes that the covariates that are of primary interest (i.e., glucose and age) are not in the BART component. However, in SP-BART we assume that the linear predictor and the BART components may have common variables, so we also ran the separated semi-BART with glucose and age in both the linear and BART components, which returned a slightly better misclassification rate. Here, we observed that the inclusion of glucose and age into the design matrix used by BART generates trees that occasionally use only glucose or age. In this case, both the linear predictor and the BART model try to estimate the effects of these covariates, which is not sensible as it brings confounding/bias to the parameter estimates and does not contribute to the predictive power. Once again, we can see the benefits that arise from sharing covariates among the components along with the double grow move.

SP-BART separated Semi-BART
Covariate Estimate 95% CI Estimate 95% CI
Intercept 0.0623 (-1.117; 1.152) (-4.6225; -2.0874)
Glucose 0.0352 (0.027; 0.043) (0.0164; 0.0295)
Age 0.0569 (0.020; 0.093) (0.0006; 0.0399)
Table 3: The mean and the credible interval, based on the full dataset, for the effect of the covariates glucose (mg/dL) and age (years) on the diagnosis of diabetes. SP-BART is compared with separated semi-BART [43]. The misclassification rates are based on the full dataset, as it was not possible to predict on an out-of-sample dataset using the R package of the separated semi-BART.

6 Conclusions

In this work, we have extended BART to a semi-parametric framework that does not contain many of the restrictions found in other so-called 'semi-parametric' versions of BART. In our novel SP-BART, the main effects are estimated via a linear predictor and the interactions and non-linearities are dealt with by BART. The main novelty of SP-BART is the modification of the grow and prune moves to force the BART component to work on the interactions that are not covered by the linear predictor. The double growing/pruning moves aim to induce interaction between covariates that are both in the linear predictor and in the subset available to BART. In addition, via simulation studies and real data applications, we have demonstrated that SP-BART is able to estimate the main effects with low bias and, at the same time, does not require the specification of interactions. We have implemented SP-BART as an R package, which is currently available at https://github.com/ebprado/SP-BART.

In future work, we believe that other BART-based models, such as SBART [23], BART for Gamma and log-normal hurdle data [24], and log-linear BART [29], could be extended to a semi-parametric approach. In addition, theoretical results underlying SP-BART could be developed in order to explore the properties of the proposed method in relation to the convergence of the posterior distribution.

Acknowledgements

Estevão Prado’s work was supported by a Science Foundation Ireland Career Development Award grant number 17/CDA/4695 and SFI research centre 12/RC/2289P2. Andrew Parnell’s work was supported by: a Science Foundation Ireland Career Development Award (17/CDA/4695); an investigator award (16/IA/4520); a Marine Research Programme funded by the Irish Government, co-financed by the European Regional Development Fund (Grant-Aid Agreement No. PBA/CC/18/01); European Union’s Horizon 2020 research and innovation programme under grant agreement No 818144; SFI Centre for Research Training 18CRT/6049, and SFI Research Centre awards 16/RC/3872 and 12/RC/2289P2.

Conflict of interest

The authors declare no potential conflict of interest.

Appendix A BART implementation

In this section, we provide the mathematical details of the BART model following [31] and [41]. First, we define the likelihood function associated to the response variable as

$y \mid X, \mathcal{T}, \mathcal{M}, \sigma^2 \sim N\left( \sum_{t=1}^{m} g(X; \mathcal{T}_t, \mathcal{M}_t),\ \sigma^2 I \right),$

where $g(X; \mathcal{T}_t, \mathcal{M}_t)$ is the function that returns the predicted values given the design matrix X and the tree structure $\mathcal{T}_t$, and $\mathcal{M}_t$ represents the vector that stores the predicted values of $\mathcal{T}_t$. In order to obtain the full conditionals for $\mu_{t\ell}$ and $\sigma^2$, [4] assume the following priors:

$\mu_{t\ell} \sim N(0, \sigma_{\mu}^2)$ and $\sigma^2 \sim IG(\nu/2, \nu\lambda/2),$

where $\sigma_{\mu} = 0.5/(k\sqrt{m})$ with $k = 2$. To control the tree structure/depth, [4] place the following prior on $\mathcal{T}_t$:

$p(\mathcal{T}_t) \propto \prod_{\ell \in L_I} \alpha (1 + d_{\ell})^{-\beta} \prod_{\ell' \in L_T} \left[ 1 - \alpha (1 + d_{\ell'})^{-\beta} \right],$

where $d_{\ell}$ is the depth of the node $\ell$, $\alpha \in (0, 1)$ and $\beta \geq 0$ are hyperparameters that control the shape of the tree, and $L_I$ and $L_T$ denote the sets of internal and terminal nodes of tree $\mathcal{T}_t$, respectively. Hence, the joint posterior distribution can be written as

(6) $p(\mathcal{T}, \mathcal{M}, \sigma^2 \mid y, X) \propto p(y \mid X, \mathcal{T}, \mathcal{M}, \sigma^2)\ p(\mathcal{M} \mid \mathcal{T})\ p(\mathcal{T})\ p(\sigma^2),$

where each $\mathcal{T}_t$ is defined by a set of splitting rules over its nodes. [4] noted that it is possible to sample from (6) in two steps, substituting the response variable y by the partial residuals $R_t = y - \sum_{j \neq t} g(X; \mathcal{T}_j, \mathcal{M}_j)$.

  1. A new tree is proposed by a grow, prune, change, or swap step and then compared to its previous version via the marginal likelihood

    $p(R_t \mid \mathcal{T}_t, \sigma^2) \propto \prod_{\ell=1}^{b_t} \left( \frac{\sigma^2}{\sigma^2 + n_{t\ell}\sigma_{\mu}^2} \right)^{1/2} \exp\left( \frac{\sigma_{\mu}^2 \left( \sum_{i \in \ell} R_{it} \right)^2}{2\sigma^2 (\sigma^2 + n_{t\ell}\sigma_{\mu}^2)} \right),$

    where $b_t$ is the number of terminal nodes and $n_{t\ell}$ is the number of observations that belong to node $\ell$ of the tree $\mathcal{T}_t$. This sampling is carried out through a Metropolis-Hastings step.

  2. Since the $\mu_{t\ell}$ are i.i.d., their posterior is given by

    (7) $\mu_{t\ell} \mid \mathcal{T}_t, R_t, \sigma^2 \sim N\left( \frac{\sigma_{\mu}^2 \sum_{i \in \ell} R_{it}}{\sigma^2 + n_{t\ell}\sigma_{\mu}^2},\ \frac{\sigma^2 \sigma_{\mu}^2}{\sigma^2 + n_{t\ell}\sigma_{\mu}^2} \right).$

Finally, the full conditional of $\sigma^2$ is given as

(8) $\sigma^2 \mid y, \hat{y} \sim IG\left( \frac{n + \nu}{2},\ \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \nu\lambda}{2} \right),$

where $\hat{y} = \sum_{t=1}^{m} g(X; \mathcal{T}_t, \mathcal{M}_t)$ is the predicted response. In Algorithm 1, the full structure of the BART model is presented.


Algorithm 1 BART model
1: Inputs: y, X, number of trees ($m$), and the number of MCMC iterations (iMCMC).
2: Initialise: all trees as stumps and set all hyperparameters of the prior distributions.
3: for k = 1 to iMCMC do
4:     for t = 1 to m do
5:         Compute $R_t = y - \sum_{j \neq t} g(X; \mathcal{T}_j, \mathcal{M}_j)$.
6:         Propose a new tree $\mathcal{T}_t^{*}$ by a grow, prune, change or swap move.
7:         Compare the current ($\mathcal{T}_t$) and proposed ($\mathcal{T}_t^{*}$) trees via a Metropolis-Hastings step with acceptance probability $A = \min\left(1, \frac{p(R_t \mid \mathcal{T}_t^{*}, \sigma^2)\, p(\mathcal{T}_t^{*})}{p(R_t \mid \mathcal{T}_t, \sigma^2)\, p(\mathcal{T}_t)}\right)$.
8:         Sample $u \sim U(0, 1)$. If $u < A$, then set $\mathcal{T}_t = \mathcal{T}_t^{*}$; otherwise, keep $\mathcal{T}_t$.
9:         Update all node-level parameters $\mu_{t\ell}$ via (7).
10:     end for
11:     Update $\sigma^2$ using (8).
12:     Update the predicted response $\hat{y} = \sum_{t=1}^{m} g(X; \mathcal{T}_t, \mathcal{M}_t)$.
13: end for
14: Result: samples of the posterior distribution $p(\mathcal{T}, \mathcal{M}, \sigma^2 \mid y, X)$.
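As an illustration of the conjugate draws in (7) and (8), the R sketch below implements them directly; the argument names are illustrative and the residuals are assumed to be already grouped by terminal node.

# Draw a node-level mean via (7): R_node holds the partial residuals in node l.
draw_mu <- function(R_node, sigma2, sigma_mu2) {
  n_l <- length(R_node)
  v <- sigma2 * sigma_mu2 / (sigma2 + n_l * sigma_mu2)        # posterior variance
  m <- sigma_mu2 * sum(R_node) / (sigma2 + n_l * sigma_mu2)   # posterior mean
  rnorm(1, m, sqrt(v))
}
# Draw the residual variance via (8), using 1/rgamma for the Inverse Gamma.
draw_sigma2 <- function(y, y_hat, nu, lambda) {
  n <- length(y)
  1 / rgamma(1, shape = (n + nu) / 2, rate = (sum((y - y_hat)^2) + nu * lambda) / 2)
}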

Appendix B Semi-parametric BART implementation

In this section, we provide details for the implementation of SP-BART. Here, we consider that y denotes a univariate response and that the likelihood associated to it is given as

$y \mid X_1, X_2, \boldsymbol{\beta}, \mathcal{T}, \mathcal{M}, \sigma^2 \sim N\left( X_1 \boldsymbol{\beta} + \sum_{t=1}^{m} g(X_2; \mathcal{T}_t, \mathcal{M}_t),\ \sigma^2 I \right),$

where $\boldsymbol{\beta} \sim MVN(\mathbf{0}, \boldsymbol{\Omega})$, $\boldsymbol{\Omega} \sim IW(\nu_{\Omega}, S_{\Omega})$, and $\sigma^2 \sim IG(\nu/2, \nu\lambda/2)$. In addition, we define the partial residuals as $R_t = y - X_1 \boldsymbol{\beta} - \sum_{j \neq t} g(X_2; \mathcal{T}_j, \mathcal{M}_j)$. In Algorithm 2, the structure of SP-BART is presented. First, the response and the design matrices $X_1$ and $X_2$ are set up along with the numbers of trees (e.g., $m = 50$) and MCMC iterations. Initially, all trees are initialised as stumps, and the hyperparameters of the prior distributions associated to $\boldsymbol{\beta}$, $\boldsymbol{\Omega}$, $\sigma^2$, and $\mathcal{T}$ are set. Second, the parameter vector $\boldsymbol{\beta}$ and the variance-covariance matrix $\boldsymbol{\Omega}$ are updated. Then, candidate trees ($\mathcal{T}_t^{*}$) are sequentially proposed, one at a time, and compared with their previous versions ($\mathcal{T}_t$) via a Metropolis-Hastings step. Later, the node-level parameters are generated and then the variance $\sigma^2$ is updated. For a large enough number of iterations, samples from the posterior distribution of the trees are obtained after convergence.


Algorithm 2 SP-BART
1: Inputs: y, $X_1$, $X_2$, number of trees ($m$), and the number of MCMC iterations (iMCMC).
2: Initialise: all trees as stumps and set all hyperparameters of the prior distributions.
3: for k = 1 to iMCMC do
4:     Update the parameter vector $\boldsymbol{\beta}$ using (1).
5:     Update the covariance matrix $\boldsymbol{\Omega}$ via (2).
6:     for t = 1 to m do
7:         Compute $R_t = y - X_1 \boldsymbol{\beta} - \sum_{j \neq t} g(X_2; \mathcal{T}_j, \mathcal{M}_j)$.
8:         Propose a new tree $\mathcal{T}_t^{*}$ by a (double) grow, (double) prune, change or swap move.
9:         Compare the current ($\mathcal{T}_t$) and proposed ($\mathcal{T}_t^{*}$) trees via a Metropolis-Hastings step with acceptance probability $A = \min\left(1, \frac{p(R_t \mid \mathcal{T}_t^{*}, \sigma^2)\, p(\mathcal{T}_t^{*})}{p(R_t \mid \mathcal{T}_t, \sigma^2)\, p(\mathcal{T}_t)}\right)$.
10:         Sample $u \sim U(0, 1)$. If $u < A$, then set $\mathcal{T}_t = \mathcal{T}_t^{*}$; otherwise, keep $\mathcal{T}_t$.
11:         Update all node-level parameters $\mu_{t\ell}$ via (7).
12:     end for
13:     Update $\sigma^2$ using (3).
14:     Update the predicted response $\hat{y} = X_1 \boldsymbol{\beta} + \sum_{t=1}^{m} g(X_2; \mathcal{T}_t, \mathcal{M}_t)$.
15: end for
16: Result: samples of the posterior distribution $p(\boldsymbol{\beta}, \boldsymbol{\Omega}, \mathcal{T}, \mathcal{M}, \sigma^2 \mid y, X_1, X_2)$.

Appendix C TIMSS dataset

In Section 5.1, we selected students at the eighth grade level and data from Ireland. For simplicity, and to demonstrate the new methodology, we did not consider the multiple plausible values of student achievement in the sciences and ignored sampling weights. However, to rigorously evaluate the effects and generalise the results, it is necessary to consider the five different plausible values and the sampling weights; see [36] and [9] for details. Below, we present the 20 covariates that we pre-selected to illustrate SP-BART. To select these covariates, we ran the vanilla BART on the TIMSS dataset and then chose the 20 most used variables. In the comparisons with BCF and VCBART, all 20 covariates were included in $X_2$, which is the matrix used by the BART component in SP-BART. In the first comparison with BCF, $X_1$ contained the binarised version of the covariate 'Does the school have discipline problems?' along with a column of ones representing the intercept. When comparing SP-BART with VCBART, $X_1$ contained a column of ones representing an intercept and the covariates 'Parents' education level', 'Minutes spent on homework', and 'School discipline problems'.

Covariate Label Source
BSBG10 How often student is absent from school Student
BSBM26AA How often teacher gives homework Student
BSBM17A Does the student know what the teacher expects them to do? Student
BSBG11B How often student feels hungry Student
BSBM20I Does the student think it is important to do well in mathematics? Student
BSBG05A Does the student have a tablet or a computer at home? Student
BCBG13BC Does the school have library resources? Principal
BSBG13D Does the student think that teachers are fair in their school? Student
BCBG13AD Does the school have heating systems? Principal
BCBG06C How many instructional days in one calendar week does the school have? Principal
BSBG05I Country-specific indicator of wealth Student
BSBM19A Does the student do usually well in mathematics? Student
BSBM42BA Minutes spent on homework Student
BSBM43BA For how many of the last 12 months have the student attended extra lessons or tutoring in mathematics? Student
BCDGDAS Does the school have discipline problems? Principal
BSBG11A How often does the student feel tired? Student
BCBG13AB Does the school have shortage of supplies? Principal
BCDGMRS Are the instructions affected by the material resources shortage? Principal
BSDGEDUP Parents’ highest education Level Student
ITSEX Sex of students Student
Table 4: Covariates pre-selected for the analysis of the TIMSS 2019 dataset. The column "Source" indicates whether the covariate is from the student questionnaire or from the questionnaire completed by the school principal.

References

  • [1] Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.
  • [2] Blake, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  • [3] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  • [4] Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
  • [5] Chipman, H. A., George, E. I., McCulloch, R. E. and Shively, T. S. (2021). mBART: Multidimensional monotone BART. Bayesian Analysis, 1–30.
  • [6] Deshpande, S. K., Bai, R., Balocchi, C., Starling, J. E. and Weiss, J. (2020). VCBART: Bayesian trees for varying coefficients. arXiv preprint arXiv:2003.06416.
  • [7] Dorie, V. (2020). dbarts: Discrete Bayesian Additive Regression Trees Sampler. https://CRAN.R-project.org/package=dbarts. R package version 0.9-19.
  • [8] Fishbein, B., Foy, P. and Yin, L. (2021). TIMSS 2019 User Guide for the International Database. Boston, USA: TIMSS & PIRLS International Study Center. Retrieved from https://timssandpirls.bc.edu/timss2019/international-database/.
  • [9] Foy, P. (2017). TIMSS 2015 User Guide for the International Database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement (IEA).
  • [10] Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
  • [11] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
  • [12] Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly, 76(3), 491–511.
  • [13] Hahn, P. R., Murray, J. S. and Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis.
  • [14] Harezlak, J., Ruppert, D. and Wand, M. P. (2018). Semiparametric Regression with R. Springer.
  • [15] Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4), 757–779.
  • [16] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models (Vol. 43). CRC Press.
  • [17] Hernández, B., Pennington, S. R. and Parnell, A. C. (2015). Bayesian methods for proteomic biomarker development. EuPA Open Proteomics, 9, 54–64.
  • [18] Hernández, B., Raftery, A. E., Pennington, S. R. and Parnell, A. C. (2018). Bayesian additive regression trees using Bayesian model averaging. Statistics and Computing, 28(4), 869–890.
  • [19] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240.
  • [20] Kapelner, A. and Bleich, J. (2016). bartMachine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software, 70(4), 1–40.
  • [21] Kindo, B. P., Wang, H. and Peña, E. A. (2016). Multinomial probit Bayesian additive regression trees. Stat, 5(1), 119–131.
  • [22] Leisch, F. and Dimitriadou, E. (2009). Package 'mlbench'. CRAN.
  • [23] Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522), 626–636.
  • [24] Linero, A. R., Sinha, D. and Lipsitz, S. R. (2020). Semiparametric mixed-scale models using shared Bayesian forests. Biometrics, 76(1), 131–144.
  • [25] Linero, A. R. and Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087–1110.
  • [26] McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. Chapman & Hall.
  • [27] McCulloch, R., Sparapani, R., Spanbauer, C., Gramacy, R. and Pratola, M. (2020). BART: Bayesian Additive Regression Trees. https://CRAN.R-project.org/package=BART. R package version 2.8.
  • [28] Mullis, I., Martin, M., Foy, P., Kelly, D. and Fishbein, B. (2020). TIMSS 2019 International Results in Mathematics and Science. Boston, USA: TIMSS & PIRLS International Study Center. Retrieved from https://timssandpirls.bc.edu/timss2019/international-results/.
  • [29] Murray, J. S. (2021). Log-linear Bayesian additive regression trees for multinomial logistic and count regression models. Journal of the American Statistical Association, 116(534), 756–769.
  • [30] Nelder, J. A. and Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370–384.
  • [31] Prado, E. B., Moral, R. A. and Parnell, A. C. (2021). Bayesian additive regression trees with model trees. Statistics and Computing, 31(3), 1–13.
  • [32] Pratola, M. T., Chipman, H. A., George, E. I. and McCulloch, R. E. (2020). Heteroscedastic BART via multiplicative regression trees. Journal of Computational and Graphical Statistics, 29(2), 405–417.
  • [33] R Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria. https://www.R-project.org/.
  • [34] Ročková, V. and Saha, E. (2019). On theory for BART. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 2839–2848).
  • [35] Ročková, V. and van der Pas, S. (2020). Posterior concentration for Bayesian regression trees and forests. The Annals of Statistics, 48(4), 2108–2131.
  • [36] Rutkowski, L., Gonzalez, E., Joncas, M. and von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142–151.
  • [37] Sarti, D. A., Prado, E. B., Inglis, A., dos Santos, A. A. L., Hurley, C., de Andrade Moral, R. and Parnell, A. (2021). Bayesian additive regression trees for genotype by environment interaction models. bioRxiv.
  • [38] Sparapani, R., Logan, B. R., McCulloch, R. E. and Laud, P. W. (2019). Nonparametric competing risks analysis using Bayesian additive regression trees. Statistical Methods in Medical Research.
  • [39] Sparapani, R. A., Logan, B. R., McCulloch, R. E. and Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in Medicine, 35(16), 2741–2753.
  • [40] Starling, J. E., Murray, J. S., Carvalho, C. M., Bukowski, R. K. and Scott, J. G. (2020). BART with targeted smoothing: An analysis of patient-specific stillbirth risk. The Annals of Applied Statistics, 14(1), 28–50.
  • [41] Tan, Y. V. and Roy, J. (2019). Bayesian additive regression trees and the General BART model. Statistics in Medicine, 38(25), 5048–5069.
  • [42] Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. CRC Press.
  • [43] Zeldow, B., Re III, V. L. and Roy, J. (2019). A semiparametric modeling approach using Bayesian additive regression trees with an application to evaluate heterogeneous treatment effects. The Annals of Applied Statistics, 13(3), 1989.
  • [44] Zhang, J. L. and Härdle, W. K. (2010). The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis, 54(5), 1197–1205.