Bayesian Additive Regression Trees (BART) (Chipman et al., 2010) is a nonparametric Bayesian model that uses a sum of trees to flexibly fit relationships between an outcome and a set of covariates. BART automatically accounts for nonlinear relationships and incorporates interactions. The excellent predictive performance of BART and the easy accessibility of BART software (Chipman et al., 2010; McCulloch et al., 2017; Linero, 2018; Sparapani et al., 2016; Logan et al., 2017; Kapelner and Bleich, 2016; Chipman and McCulloch, 2016) have led to its popular use. However, the resulting regression function can be very cumbersome, making the task of extracting and interpreting individual covariate effects more complicated.
For example, in this article we consider data from Type 2 diabetes patients obtained from the Clinical Research Data Warehouse of Froedtert Health and the Medical College of Wisconsin. Hemoglobin A1C (HbA1C) measures the severity of diabetes, and lowering it to a healthier level, usually 7%, is the primary goal of physicians and patients (Koenig, 1980; Association, 2014). HbA1C measurements, 6 months apart, were taken on 2673 patients. Many factors contribute to the lowering of HbA1C; these include health variables such as baseline HbA1C, depression, hypertension, BMI, HDL, LDL, systolic blood pressure (BP), and diastolic BP, and demographic variables such as age, gender, race, insurance, marital status, and religion. Treatments such as metformin and insulin are commonly used in controlling HbA1C levels in Type 2 diabetes patients. The treatment variable used in this article is an indicator of whether a patient was taking any treatment for diabetes, regardless of which treatment was taken. In addition to being able to predict HbA1C accurately given these variables, researchers are also interested in understanding the effect of different variables on HbA1C. As fitting a BART model with all of the above variables would result in a complicated function, breaking it down into additive parts, if appropriate, could simplify the interpretation and improve the practical utility of the model. This amounts to fitting a model with additive nonparametric functions, each fit with its own set of covariates. A special case of this additive model fits a parametric model to one of the additive components with one (treatment, for example) or more variables, while allowing the remaining variables to have a flexible form. If additive, this results in a more interpretable treatment effect; if not additive, this indicates a need for individualized treatment considerations.
Some approaches for assessing additivity of two or more univariate predictors have been proposed (Zhang et al., 2016; Eubank et al., 1995; Barry, 1993). However, formal methods for assessing additivity of nonparametric functions of disjoint sets of multiple variables are still lacking; such a method would be needed, for example, to assess the additivity of health and demographic factors in the Type 2 diabetes example. Out-of-sample prediction error (OSPE) has been used as a criterion to compare the prediction performance of one model versus another. However, it is computationally intensive to calculate. Log Pseudo Marginal Likelihood (LPML) (Geisser and Eddy, 1979; Branscum et al., 2015; Christensen et al., 2011) is one of the criteria used in the Bayesian framework to compare two models with vague, or even improper, priors based on their prediction performance. LPML, based on the predictive distribution, is easy to compute and has also been widely adopted for model selection (Abbot and Marohasy, 2017; Ansari and Jedidi, 2000; Barcella et al., 2017; Booth et al., 2011; Cardoso and Tempelman, 2004; Chang et al., 2006; Chen et al., 2014; Hanson and Yang, 2007; Heydari et al., 2016; Hu et al., 2015). We propose using the LPML criterion to test nonadditivity in BART models.
This paper is outlined as follows: In Section 2, we review the BART models for continuous and binary outcomes. We also review the Log Pseudo Marginal Likelihood (LPML) and Out-of-Sample Prediction Error (OSPE) criteria and how they are used for model comparison. Section 3 contains the proposed additive BARTs models and additive treatment BART models for both binary and continuous outcomes. This section also describes how LPML and OSPE are used to assess additivity in BARTs models. In Section 4, we present a systematic approach for designing simulations used in nonadditivity testing; this is based on formulating the strength of the interaction term relative to the other components in the model. Section 5 contains our simulation settings and results. In Section 6, we apply our approach to data from a Type 2 diabetes study to investigate additivity between health and demographic variables, and between a treatment and all other variables. We also apply the proposed approach to data from patients undergoing a hematopoietic cell transplant and assess whether the effect of their characteristics on their one-year survival is additive with the treatment effect. We end the paper with a conclusion and a discussion of possible extensions of this approach.
Here we review BART models for continuous and binary outcomes as well as two model prediction performance criteria used in this article.
2.1 Bayesian Additive Regression Trees (BART); continuous response
Consider the problem of making inference about a function $f$ that predicts a continuous outcome $Y$ using a vector of predictors $x = (x_1, \dots, x_p)$, given that
$$Y = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma^2).$$
One way to solve this inference problem is to approximate $f$ by a nonparametric function. With BART, $f(x)$ is approximated by a sum-of-trees model;
$$f(x) \approx \sum_{j=1}^{m} g(x; T_j, M_j), \quad (1)$$
where $T_j$ is a binary tree, $M_j = \{\mu_{1j}, \dots, \mu_{b_j j}\}$ the tree's terminal node parameters, and $g(x; T_j, M_j)$ a tree function that assigns a $\mu_{ij} \in M_j$ to $x$ as illustrated in Figure 1. Under equation (1), $f(x)$ is then calculated as the sum of all terminal node parameters, $\mu_{ij}$, assigned to $x$ by each tree function $g(x; T_j, M_j)$, where $j = 1, \dots, m$. Figure 1 depicts such a sum. Chipman et al. (Chipman et al., 2010) showed the great predictive ability of BART compared to other ensemble methods via simulations and applied examples and developed a BART R package to easily fit this model (McCulloch et al., 2017).
The independent prior for the error variance is $\sigma^2 \sim \nu\lambda/\chi^2_\nu$. Notationally, we specify the prior for the unknown function, $f$, as $f \sim \text{BART}$. This prior is made up of two components: a prior for the tree structures, $T_j$, and a prior for the terminal node parameters, $M_j$, so that
$$p(T_j, M_j) = p(M_j \mid T_j)\, p(T_j).$$
The prior for $T_j$ is specified by describing a probabilistic process for growing a tree, where $P(\text{node at depth } d \text{ is non-terminal}) = \alpha(1+d)^{-\beta}$, with $\alpha \in (0, 1)$ and $\beta \geq 0$. Using default values of $\alpha = 0.95$ and $\beta = 2$, trees with 2 or 3 terminal nodes are favored, as shown in Table 1. At each internal node, a splitting variable and splitting value are chosen uniformly among all the available variables and cutoff values, respectively. The terminal node parameters are given independent normal priors $\mu_{ij} \sim N(0, \sigma^2_\mu)$, with $\sigma_\mu = 0.5/(k\sqrt{m})$, where $k$ is a tuning parameter with a default value of 2.
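As a concrete illustration of this tree-depth prior, the following short sketch (our own, not taken from any BART package) evaluates the split probability $\alpha(1+d)^{-\beta}$ at several depths under the default settings:

```python
def p_split(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is non-terminal
    (i.e., splits), following the BART tree prior alpha * (1 + d)^(-beta)."""
    return alpha * (1.0 + depth) ** (-beta)

# The root splits with high probability, but deeper nodes rarely do,
# so small trees with 2 or 3 terminal nodes are favored a priori.
# e.g. p_split(0) = 0.95 and p_split(1) = 0.2375 under the defaults.
probs = [p_split(d) for d in range(4)]
```

Because the split probability decays quadratically in depth under the defaults, most prior mass sits on shallow trees, which is what makes each tree a "weak learner" in the ensemble.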
2.2 Bayesian Additive Regression Trees (BART); binary response
In clinical research, studies with binary endpoints arise often. Chipman et al. (Chipman et al., 2010) provide a detailed extension of the above BART model to handle binary endpoints using a probit link and the data augmentation idea of Albert and Chib (Albert and Chib, 1993). However, this model does not directly provide the covariate effects in terms of odds ratios. The current BART R package (McCulloch et al., 2017) already includes an implementation of the BART model for binary outcomes using a logit link and the data augmentation approach of Holmes and Held (Holmes et al., 2006). Consider a binary outcome $Y \in \{0, 1\}$; then the logit model set up for classification is
$$P(Y = 1 \mid x) = F(f(x)),$$
where $f$ is a sum of trees as in (1) and $F(\cdot)$ represents the standard logistic cumulative distribution function. The Holmes and Held augmentation idea introduces latent variables $Z_1, \dots, Z_n$ such that:
$$Z_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0, 1), \quad Y_i = I(Z_i > 0).$$
The prior hyperparameter $\sigma_\mu$ for the terminal node parameters is modified to reflect the scale of the latent logit, but the prior for $f$ is otherwise unchanged. Inferences for $f$ are available using a Gibbs sampler that obtains draws of the latent variables $Z_i$ using a truncated normal distribution as described in (Holmes et al., 2006), draws of the sum-of-trees parameters $(T_j, M_j)$ using the approach detailed in (Chipman et al., 2010) with $Z$ as the outcome, and draws of the logistic scale-mixture variances following the rejection sampling algorithm detailed in (Holmes et al., 2006).
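For intuition about the latent-variable step, here is a minimal sketch of drawing $Z_i$ given the current fit $f(x_i)$ and the observed $y_i$. It samples the truncated logistic directly by inverse-CDF inversion; Holmes and Held instead work through a normal scale-mixture representation, so this is an illustration of the target distribution rather than their algorithm, and the function name and signature are our own:

```python
import math
import random

def draw_latent(f_i, y_i, rng=random):
    """Draw Z_i ~ Logistic(location=f_i, scale=1) truncated to (0, inf)
    when y_i = 1 and to (-inf, 0) when y_i = 0, via inverse-CDF sampling."""
    # Logistic CDF evaluated at the truncation point 0: P(Z_i <= 0).
    F0 = 1.0 / (1.0 + math.exp(f_i))
    u = rng.random()
    # Map u into the admissible region of the CDF, then invert.
    p = F0 + u * (1.0 - F0) if y_i == 1 else u * F0
    return f_i + math.log(p / (1.0 - p))  # logistic quantile function
```

Drawing from the correct half-line is what ties the continuous latent scale back to the observed binary outcome at each Gibbs iteration.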
2.3 Log Pseudo Marginal Likelihood (LPML)
In Bayesian model comparison, Bayes factors (Kass and Raftery, 1995; Han and Carlin, 2001; Gilks et al., 1995; Christensen et al., 2011) have been used as a measure of evidence in favor of one model as opposed to another, and clear threshold values that indicate the strength of this evidence have been established (Jeffreys, 1961; Kass and Raftery, 1995). Assume that the data $y = (y_1, \dots, y_n)$ are conditionally independent given a model $M$ and model parameters $\theta$. Then the marginal likelihood is given by:
$$m(y \mid M) = \int f(y \mid \theta, M)\, \pi(\theta \mid M)\, d\theta.$$
Bayes factors are calculated as the ratio of the marginal likelihoods from two competing models. They can be interpreted as the ratio of posterior odds to prior odds for the model in the numerator of the Bayes factor. However, $m(y \mid M)$ is not available analytically for improper or even vague priors.
Geisser and Eddy (Geisser and Eddy, 1979) suggested replacing $m(y \mid M)$ by the pseudo marginal likelihood (PML)
$$\widehat{m}(y \mid M) = \prod_{i=1}^{n} f(y_i \mid y_{(-i)}, M),$$
where $\text{CPO}_i = f(y_i \mid y_{(-i)}, M)$ is the conditional predictive ordinate (CPO). This CPO is the predictive density calculated at the observed $y_i$ given all data except the $i$th observation. Thus, the log pseudo marginal likelihood (LPML) is computed as
$$\text{LPML} = \sum_{i=1}^{n} \log(\text{CPO}_i).$$
Given two competing models, the model that maximizes the LPML is chosen. Moreover, a pseudo Bayes factor (PsBF) is calculated by exponentiating the difference between the LPML statistics of the two competing models; i.e., the PsBF is a ratio of pseudo marginal likelihoods. Threshold values such as those in (Jeffreys, 1961) are used to choose the data-supported model.
LPML and PsBF gained widespread use (Dey et al., 1997; Chen et al., 2002; O’Malley et al., 2003; Chang et al., 2004; Li et al., 2006; Chang et al., 2006; Hanson and Yang, 2007; Bhattacharya and Sengupta, 2009; Hanson et al., 2011; Heydari et al., 2016; Cheng et al., 2017) after Gelfand and Dey (Gelfand and Dey, 1994) established asymptotic properties of the PML and showed that $\text{CPO}_i$ can easily be estimated using posterior samples $\theta^{(1)}, \dots, \theta^{(S)}$ by
$$\widehat{\text{CPO}}_i = \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{1}{f(y_i \mid \theta^{(s)})} \right]^{-1}.$$
Here $f(y_i \mid \theta)$ represents the probability density function (pdf) if the outcome is continuous and the probability mass function for a binary outcome. Note that this is readily available for BART models for both continuous and binary responses given the sum-of-trees function (and $\sigma$ in the former case), simply as the normal density and the Bernoulli mass function.
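The harmonic-mean CPO estimator above is straightforward to implement from a matrix of per-observation log-likelihoods. A small sketch (our own helper, with names of our choosing), computed on the log scale for numerical stability:

```python
import math

def lpml(log_lik):
    """LPML from an S x n list-of-lists of per-observation log-likelihoods
    log f(y_i | theta^(s)) over S posterior draws.

    CPO_i is the harmonic mean of the likelihoods f(y_i | theta^(s));
    on the log scale, log CPO_i = log S - logsumexp_s(-log f(y_i | theta^(s)))."""
    S = len(log_lik)
    n = len(log_lik[0])
    total = 0.0
    for i in range(n):
        neg = [-log_lik[s][i] for s in range(S)]
        m = max(neg)                                        # logsumexp shift
        lse = m + math.log(sum(math.exp(v - m) for v in neg))
        total += math.log(S) - lse                          # log CPO_i
    return total
```

Working with `logsumexp` avoids underflow when individual likelihood values are tiny, which is common with large $n$ or diffuse posteriors.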
2.4 Out-of-sample prediction error (OSPE)
The out-of-sample prediction error (OSPE), based on 5-fold cross-validation, is another criterion commonly used in assessing a model’s predictive performance. The dataset is randomly split into 5 equal (or nearly equal) sized folds. For each $k = 1, \dots, 5$, a training dataset is constructed using all data except those in the $k$th fold, and a test dataset using the data points from the $k$th fold. The model is fit to the training data and used to evaluate the prediction error on the test dataset. Finally, one averages these fold-based errors to obtain the OSPE. Thus, the OSPE is calculated as
$$\text{OSPE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $y_i$ is the $i$th observation from the test dataset and $\hat{y}_i$ the fitted value. For competing models, the model with the lowest OSPE is preferred. A ratio of OSPEs from two competing models is sometimes used for quantifying model comparison.
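The 5-fold procedure can be sketched generically as below; `fit` is a hypothetical placeholder for any model-fitting routine (a BART fit in the paper) that returns a prediction function:

```python
import random

def ospe(x, y, fit, k=5, seed=0):
    """Out-of-sample prediction error via k-fold cross-validation.

    fit(train_x, train_y) is any model-fitting routine returning a
    callable predict(x_i).  Returns the mean squared prediction error
    accumulated over the held-out folds."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random fold assignment
    folds = [idx[j::k] for j in range(k)]     # k nearly equal folds
    sse = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        sse += sum((y[i] - predict(x[i])) ** 2 for i in fold)
    return sse / n
```

Each observation is held out exactly once, so the accumulated squared errors divided by $n$ give the OSPE defined above.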
3 Additive nonparametric models
The BART models described in Sections 2.1 and 2.2 are known to work well when prediction is the primary goal. In addition to predictions, researchers are usually interested in the interpretation of at least some aspects of the fitted model, e.g., individual covariate effects. The predictive function from BART is not available analytically and requires post-processing to recover aspects such as individual covariate effects. While this is possible and often undertaken, to further assist in interpretation we introduce a sum of two simpler BART models, which we refer to as the additive BARTs model.
3.1 Additive BARTs model; continuous outcome
Consider the additive BARTs model written as:
$$Y = f_1(x_1) + f_2(x_2) + \epsilon, \quad \epsilon \sim N(0, \sigma^2),$$
where $x_1$, $x_2$ are two disjoint sets of the covariates and $f_1$, $f_2$ are the flexible functions fit with $x_1$, $x_2$, respectively. Each function can be written as a sum of trees as in the original BART model. The parameters of this additive BARTs model are $(f_1, f_2, \sigma^2)$. We place BART priors on $f_1$ and $f_2$; i.e., $f_1 \sim \text{BART}$ and $f_2 \sim \text{BART}$. We use the same prior specification as for a single BART model, except that the terminal nodes are given a normal prior $\mu_{ij} \sim N(0, \sigma^2_\mu)$, with
$$\sigma_\mu = \frac{0.5}{k\sqrt{2m}}.$$
We add to the specification in (Chipman et al., 2010) this factor of 2 in front of $m$ so that the additive effect of all the trees in the two additive BARTs model remains stochastically equivalent to a corresponding single ensemble model of $2m$ trees.
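The variance-matching argument can be checked numerically. In the sketch below, the single-model formula $0.5/(k\sqrt{m})$ is the Chipman et al. (2010) default, while the adjusted formula for the additive model reflects our reading of the factor of 2 and is not taken from any package code:

```python
import math

def sigma_mu_single(m, k=2.0):
    """Terminal-node prior sd for a single BART with m trees
    (Chipman et al., 2010 default): 0.5 / (k * sqrt(m))."""
    return 0.5 / (k * math.sqrt(m))

def sigma_mu_additive(m, k=2.0):
    """Adjusted sd for each of two additive BARTs with m trees apiece
    (our reading of the factor-of-2 adjustment): 0.5 / (k * sqrt(2 * m))."""
    return 0.5 / (k * math.sqrt(2 * m))

# Prior variance of the sum of 2m tree outputs in the additive model:
#   2 * m * sigma_mu_additive(m)**2
# equals that of a single ensemble of 2m trees:
#   (2 * m) * sigma_mu_single(2 * m)**2
```

With this scaling, splitting one ensemble of $2m$ trees into two BARTs of $m$ trees leaves the implied prior variance of the total fit unchanged.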
To make inference based on this additive BARTs model, we use the Blocked Gibbs sampling technique (Gilks et al., 1995) to sample from the posterior distribution $p(f_1, f_2, \sigma^2 \mid y)$. Define $T^{(1)} = (T^{(1)}_1, \dots, T^{(1)}_m)$ and $M^{(1)} = (M^{(1)}_1, \dots, M^{(1)}_m)$ as the trees and terminal node parameters of $f_1$, and define $T^{(2)}$ and $M^{(2)}$ analogously for $f_2$. This algorithm first draws samples for $(T^{(1)}, M^{(1)})$, then samples for $(T^{(2)}, M^{(2)})$, and finally samples of $\sigma^2$ from their posterior distributions. To accomplish this, consider the residuals:
$$R_{1i} = y_i - f_2(x_{2i}), \quad R_{2i} = y_i - f_1(x_{1i}).$$
Notice that $(T^{(1)}, M^{(1)})$ depends on $(T^{(2)}, M^{(2)})$ only through the residuals $R_1$, and vice versa. Hence, the Blocked Gibbs sampling algorithm first draws samples for $(T^{(1)}, M^{(1)})$ as described in (Chipman et al., 2010) for a single BART model applied to $(x_1, R_1)$. In a similar manner, we make draws for $(T^{(2)}, M^{(2)})$ using $(x_2, R_2)$. Given all the trees and terminal node parameters, we then obtain draws for $\sigma^2$ from an inverse gamma distribution.
To get draws in the first step of this algorithm, we use the property described in (Chipman et al., 2010) that the conditional distributions of $(T^{(1)}_j, M^{(1)}_j)$ depend on the data only through the residuals
$$R_{1i} = y_i - f_2(x_{2i}).$$
Similarly, for step 2, we use the residuals
$$R_{2i} = y_i - f_1(x_{1i}).$$
Thus we have the steps:
Step 1: draw $(T^{(1)}, M^{(1)}) \mid R_1, \sigma^2$.
Step 2: draw $(T^{(2)}, M^{(2)}) \mid R_2, \sigma^2$.
Step 3: draw $\sigma^2 \mid f_1, f_2, y$.
In step 1, samples for $T^{(1)}$ are drawn using the Metropolis-Hastings algorithm as described in (Chipman et al., 2010), and $M^{(1)}$ is made of independent draws of the terminal node parameters from a normal distribution. In a similar manner, we draw the parameters in step 2. Finally, $\sigma^2$ is drawn from an inverse gamma distribution. The above algorithm produces a sequence of draws $(f^{(s)}_1, f^{(s)}_2, \sigma^{(s)})$ which converges in distribution to the posterior distribution $p(f_1, f_2, \sigma \mid y)$. The resulting sequence of sum-of-trees functions
$$f^{(s)} = f^{(s)}_1 + f^{(s)}_2$$
converges to the true posterior distribution of $f$. To obtain the predicted value of $Y$ at a given value of the covariate vector $x = (x_1, x_2)$, we use the average of the after burn-in samples,
$$\hat{f}(x) = \frac{1}{S} \sum_{s=1}^{S} f^{(s)}(x),$$
which is an approximation of the posterior mean, with $S$ being the number of Gibbs samples after burn-in.
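The residual bookkeeping of the three Gibbs steps can be sketched as follows. Each BART draw is replaced by a stub so that the structure is runnable; `draw_f1`, `draw_f2`, and `draw_sigma2` are hypothetical stand-ins for the one-iteration tree updates and the inverse-gamma draw, not implementations of them:

```python
def blocked_gibbs(y, x1, x2, draw_f1, draw_f2, draw_sigma2, n_iter=1000):
    """Residual structure of the blocked Gibbs sampler for two additive fits.

    draw_f1 / draw_f2 take (covariates, residuals, sigma2) and return
    fitted values; draw_sigma2 takes the full residuals and returns a
    variance draw.  Returns the list of per-iteration states."""
    n = len(y)
    fit1 = [0.0] * n
    fit2 = [0.0] * n
    sigma2 = 1.0
    draws = []
    for _ in range(n_iter):
        r1 = [y[i] - fit2[i] for i in range(n)]    # Step 1 residuals R_1
        fit1 = draw_f1(x1, r1, sigma2)
        r2 = [y[i] - fit1[i] for i in range(n)]    # Step 2 residuals R_2
        fit2 = draw_f2(x2, r2, sigma2)
        resid = [y[i] - fit1[i] - fit2[i] for i in range(n)]
        sigma2 = draw_sigma2(resid)                # Step 3
        draws.append((list(fit1), list(fit2), sigma2))
    return draws
```

The point of the sketch is that each component only ever sees the data through its own residuals, exactly as in the backfitting-style argument above.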
3.2 Additive treatment BART model; continuous outcome
The additive BARTs model introduced in Section 3.1 is general in its specification and does not directly provide estimates and interpretation of individual covariate effects. If we are interested in the estimation and interpretation of the effect of a treatment, however, an alternative additive model is readily available almost as a special case of the two BARTs model. To elaborate, we propose fitting a BART model to the set of covariates and a parametric model for the treatment. This model allows for detection of interaction between the treatment and the set of covariates. A model without interaction between the treatment and other variables is less cumbersome and has a directly interpretable treatment effect. To specify this model, let $z$ be the treatment variable and $x$ a vector of other covariates. Then, define the additive treatment BART model as:
$$Y = \theta z + f(x) + \epsilon,$$
with $f \sim \text{BART}$ and $\epsilon \sim N(0, \sigma^2)$. Then the likelihood for this model is given by:
$$L(\theta, f, \sigma^2 \mid y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - \theta z_i - f(x_i))^2}{2\sigma^2}\right\}.$$
We assume that the prior distribution for the treatment effect is $\theta \sim N(0, \tau^2)$. By using the standard conjugacy argument, the posterior distribution for $\theta$ is $N(\hat{\theta}, v)$ with
$$v = \left(\frac{\sum_i z_i^2}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}, \quad \hat{\theta} = v \cdot \frac{\sum_i z_i \left(y_i - f(x_i)\right)}{\sigma^2}.$$
To get posterior samples for the parameters in this model, we follow a similar Blocked Gibbs sampling algorithm to that outlined in Section 3.1. We first get samples for $(T, M)$ using the residuals $y_i - \theta z_i$, followed by sampling $\theta$ from its normal posterior distribution using the residuals $y_i - f(x_i)$. Finally, we sample $\sigma^2$ from an inverse gamma distribution.
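Assuming a $N(0, \tau^2)$ prior on the treatment effect (the particular hyperparameter value below is a placeholder, not the paper's), the conjugate update is a few lines:

```python
def theta_posterior(z, resid, sigma2, tau2=100.0):
    """Conjugate normal posterior for theta in y_i = theta*z_i + f(x_i) + eps_i,
    given resid_i = y_i - f(x_i), the error variance sigma2, and a
    hypothetical N(0, tau2) prior.  Returns (posterior mean, posterior variance)."""
    prec = sum(zi * zi for zi in z) / sigma2 + 1.0 / tau2   # posterior precision
    v = 1.0 / prec
    m = v * sum(zi * ri for zi, ri in zip(z, resid)) / sigma2
    return m, v
```

Within the Gibbs sampler one would draw $\theta$ from $N(m, v)$ at each iteration, after refreshing the residuals with the current BART fit.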
3.3 Additive BARTs model; binary outcome
Akin to Section 3.1, we introduce an additive BART model for a binary response to assist in interpretation. Consider a binary outcome $Y \in \{0, 1\}$; then the additive logit model set up is
$$P(Y = 1 \mid x_1, x_2) = F(f_1(x_1) + f_2(x_2)).$$
Here, $F(\cdot)$ represents the standard logistic cumulative distribution function. Again taking the Holmes and Held augmentation idea (Holmes et al., 2006), we use latent variables $Z_i$ as:
$$Z_i = f_1(x_{1i}) + f_2(x_{2i}) + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0, 1), \quad Y_i = I(Z_i > 0).$$
In order to make inference about this additive model, we need samples of $(T^{(1)}, M^{(1)})$; $(T^{(2)}, M^{(2)})$; and $Z$. Consider the residuals $R_{1i} = Z_i - f_2(x_{2i})$ and $R_{2i} = Z_i - f_1(x_{1i})$. Notice that $(T^{(1)}, M^{(1)})$ depends on $(T^{(2)}, M^{(2)})$ only through the residuals $R_1$. The rest of the steps of the Blocked Gibbs sampling algorithm follow a similar approach to that described in Section 3.1 to get samples of $(T^{(1)}, M^{(1)})$; $(T^{(2)}, M^{(2)})$; and $Z$. Holmes and Held (Holmes et al., 2006) provide an algorithm to get samples of the logistic scale-mixture variances. Samples of $Z$ are obtained using a truncated normal sampler.
3.4 Additive treatment BART model; binary outcome
Here we modify the model introduced in Section 3.3 by replacing the specification for $f_1(x_1)$ by a simple parametric form $\theta z$ for the treatment effect. Consequently, notationally replacing $f_2$ by $f$ as there is only one nonparametric function, the changes to the two BARTs binary model are:
$$Z_i = \theta z_i + f(x_i) + \epsilon_i, \quad Y_i = I(Z_i > 0),$$
where the $\epsilon_i$'s are latents as in Section 3.3. Posterior sampling also proceeds as there, except that $\theta$ is drawn from its normal full conditional based on the residuals $Z_i - f(x_i)$, as in Section 3.2.
3.5 Using LPML and OSPE to assess nonadditivity
Considering the ease with which LPML is calculated given the posterior samples, we propose using the LPML criterion described in Section 2.3 in conjunction with all the BART models described above to assess whether a model is additive or not. To accomplish this, we first fit a single (nonadditive) BART model to the data and obtain $\text{LPML}_{NA}$. Next, we fit an additive BARTs model from Sections 3.1 or 3.2 (Sections 3.3 or 3.4 if the outcome is binary) with the covariate space divided into two disjoint sets of covariates, and obtain $\text{LPML}_{A}$. Given $\text{LPML}_{NA}$ and $\text{LPML}_{A}$, we calculate the pseudo Bayes factor as $\text{PsBF} = \exp(\text{LPML}_{NA} - \text{LPML}_{A})$. We interpret the PsBF following the guidelines proposed by Jeffreys (Jeffreys, 1961) shown in Table 2; i.e., if the PsBF is above 3, we conclude that there is substantial evidence in favor of the nonadditive model, and so on. Given $\text{OSPE}_{NA}$ from the nonadditive and $\text{OSPE}_{A}$ from the additive BARTs model, we calculate the ratio $R = \text{OSPE}_{NA}/\text{OSPE}_{A}$. If $R$ is above 1, we conclude that the model is additive. The existence of these threshold values for PsBF provides an important advantage over other criteria such as AIC or OSPE, which do not provide clear graded thresholds for model selection. Moreover, given the cross-validation required in computing OSPE, it is computationally expensive compared to PsBF, which is very fast to calculate.
|BF (PsBF)|Strength of evidence|
|---|---|
|1 to 3|Barely worth mentioning|
|3 to 10|Substantial|
|10 to 30|Strong|
|30 to 100|Very strong|
When these additivity assessments compare one model with a single BART and another with two additive BARTs, we recommend that in the latter model each BART use half the number of trees used in the single BART model. This maintains comparability of the priors for the two models. We followed this recommendation in all calculations in this article.
4 Designing simulations to assess additivity
In the next section, we use simulations to demonstrate the performance of the LPML and OSPE criteria in assessing additivity in BARTs models. However, the variability of each term in the model relative to the others affects whether additivity can be detected. Moreover, a large random error term can cause the effects of the other terms in the model to be undetectable. Hence, we propose a systematic approach to design simulations in which the variability of each term is standardized relative to the others.
4.1 Simulation design for a continuous outcome
Consider generating data from a nonadditive model with continuous outcome $Y$ and disjoint sets of covariates $x_1$ and $x_2$, where we express
$$Y = \beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \beta_3 f_3(x_1, x_2) + \epsilon, \quad \epsilon \sim N(0, \sigma^2),$$
with $x_1$, $x_2$, and $\epsilon$ jointly independent. For a corresponding additive model, we define $\beta_3 = 0$; i.e., we assume $Y = \beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \epsilon$.
A major factor in determining the degree of nonadditivity is the variability in each of the terms of this model relative to the total variance function. We propose an approach that standardizes the variance of each term relative to the total variance function. To simplify the notation, we use $V_1 = \text{Var}(f_1(x_1))$, $V_2 = \text{Var}(f_2(x_2))$, and $V_3 = \text{Var}(f_3(x_1, x_2))$. Then the variance function is given by:
$$V = \beta_1^2 V_1 + \beta_2^2 V_2 + \beta_3^2 V_3 + \sigma^2.$$
We propose the standardization ratios $k_e$, $k_2$, $k_3$, where $k_e$ is the variability attributed to random noise, $k_2$ the variability due to both main and interaction effects of the $x_2$ variables, and $k_3$ the variability attributed to the interaction between the $x_1$ and $x_2$ variables. For additive models with no interaction between the two sets of variables, $k_2$ represents the variability due to the main effects of the $x_2$ variables. However, for models with interactions between $x_1$ and $x_2$, $k_2$ represents the variability due to both the main and interaction effects of the $x_2$ variables. Hence, our standardization ratios are defined as:
$$k_e = \frac{\sigma^2}{V}, \quad k_2 = \frac{\beta_2^2 V_2 + \beta_3^2 V_3}{V}, \quad k_3 = \frac{\beta_3^2 V_3}{V}.$$
Setting fixed values of $(k_e, k_2, k_3)$, with $\beta_1 = 1$ fixing the scale, we obtain a system of quadratic equations in $\beta_2$, $\beta_3$, and $\sigma$ and solve them analytically to get appropriate solutions for $\beta_2$, $\beta_3$, and $\sigma$.
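Under ratio definitions of this kind (the symbols below follow our notation with $\beta_1$ fixed at 1; the paper's own symbols are not reproduced here), the quadratic system has a closed-form solution, sketched as:

```python
import math

def solve_coefficients(V1, V2, V3, k_e, k2, k3):
    """Solve for (beta2, beta3, sigma) with beta1 fixed at 1, given the
    component variances V1, V2, V3 of f1, f2, f3 and target ratios
      k_e = sigma^2 / V,  k2 = (beta2^2 V2 + beta3^2 V3) / V,
      k3  = beta3^2 V3 / V,
    where V is the total variance.  Assumes k3 <= k2 and k2 + k_e < 1."""
    V = V1 / (1.0 - k2 - k_e)        # total variance implied by beta1 = 1
    beta3 = math.sqrt(k3 * V / V3)
    beta2 = math.sqrt((k2 - k3) * V / V2)
    sigma = math.sqrt(k_e * V)
    return beta2, beta3, sigma
```

Setting the interaction ratio to zero forces the interaction coefficient to zero, recovering the additive data-generating model as a special case.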
4.2 Simulation design for a binary outcome
We generate $f_1(x_1)$, $f_2(x_2)$, and $f_3(x_1, x_2)$ with $x_1$ and $x_2$ generated independently, and center the above functions by subtracting their means: $f_1 \leftarrow f_1 - \bar{f}_1$, $f_2 \leftarrow f_2 - \bar{f}_2$, and $f_3 \leftarrow f_3 - \bar{f}_3$. This centering is used to ensure the function values are located in a reasonable range on the logit scale.
Consider data generated from the following model with an interaction between $x_1$ and $x_2$:
$$P(Y = 1 \mid x_1, x_2) = H(\beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \beta_3 f_3(x_1, x_2)),$$
with $x_1$ and $x_2$ jointly independent, and $H$ being a link function such as the logit or probit function.
Let $V_1$, $V_2$, and $V_3$ be the variances of $f_1(x_1)$, $f_2(x_2)$, and $f_3(x_1, x_2)$, as in Section 4.1, and let $W = \beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \beta_3 f_3(x_1, x_2)$ denote the latent linear predictor. Then
$$\text{Var}(W) = \beta_1^2 V_1 + \beta_2^2 V_2 + \beta_3^2 V_3.$$
Motivated by the prior specifications outlined in (Chipman et al., 2010) concerning BART probit for classification, which concentrate the latent function in a moderate interval around zero, we fix the scale of $W$ accordingly. Following a similar approach to that described in Section 4.1 to standardize the variance of each term relative to the total variance, we define the following normalization ratio that represents the variability due to the interaction between $x_1$ and $x_2$:
$$k_3 = \frac{\beta_3^2 V_3}{\text{Var}(W)}.$$
5 Simulation Settings and Results
We use the systematic approach of Section 4 to conduct simulations tailored for additivity testing, and compare the performance of LPML and OSPE in detecting additivity. For continuous outcomes, we generate data using
$$y_i = \beta_1 f_1(x_{1i}) + \beta_2 f_2(x_{2i}) + \beta_3 f_3(x_{1i}, x_{2i}) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2).$$
For binary data, we use
$$P(Y_i = 1) = F(\beta_1 f_1(x_{1i}) + \beta_2 f_2(x_{2i}) + \beta_3 f_3(x_{1i}, x_{2i})),$$
where $F$ is the standard logistic cumulative distribution function. Choices of $f_1$, $f_2$, and $f_3$ are described in the next subsection along with settings of the standardization ratios. The second subsection describes simulation results.
5.1 Simulation Settings
We consider four different scenarios for our simulations, as shown by the different choices of $f_1$, $f_2$, and $f_3$ given in Table 3. The proposed scenarios vary by degree of complexity, from simple linear functions in Scenario 1 to nonlinear functions in Scenario 4. For the choice of the covariates, we use a common specification across scenarios in assessing additivity. In the additive treatment case, we use a binary $x_2 \in \{0, 1\}$ representing a binary treatment.
We set fixed values for the variance standardization ratios $(k_e, k_2, k_3)$, and used the approach described in Section 4 to solve for (i) $(\beta_2, \beta_3)$, with $\sigma$, for a continuous response and (ii) $(\beta_2, \beta_3)$ for a binary response. Given these solutions and the functions $f_1$, $f_2$, $f_3$, we generate the outcome or latent values as described in the introduction of Section 5, depending on whether the response is continuous or binary. Note that increasing values of $k_3$ represent increasing strength of the interaction between the two sets of covariates, and $k_3$ is bounded above by $k_2$. Hence, $k_3$ is zero for additive models and nonzero for nonadditive models. The setting $k_3 = 0$ results in $\beta_3 = 0$, indicating that data are generated from an additive model.
We considered a range of sample sizes for continuous outcomes and for binary responses. For each scenario, we fit an additive and a nonadditive model to the data and calculated their respective LPMLs. Then, for data generated from a nonadditive model, we computed the pseudo Bayes factor as $\text{PsBF} = \exp(\text{LPML}_{NA} - \text{LPML}_{A})$, and for data generated from an additive model, we computed it as $\text{PsBF} = \exp(\text{LPML}_{A} - \text{LPML}_{NA})$, so that $\text{PsBF} > 1$ reflects a correct decision. To show the performance of LPML in detecting nonadditivity, we repeated the calculations 200 times for each scenario. In a similar manner, we calculated the OSPEs for the two competing models and computed their ratio as $R = \text{OSPE}_{NA}/\text{OSPE}_{A}$, with $R > 1$ reflecting support for additivity.
5.2 Simulation Results
In this section, we discuss simulation results. Section 5.2.1 demonstrates the performance of LPML in detecting additivity in BART models using boxplots. In section 5.2.2, we compare the performance of LPML to that of OSPE with regard to their probability of correctly assessing whether data came from an additive or nonadditive model.