Bayesian additive regression trees (BART; chipman2010bart) is a probabilistic machine learning model that has proved successful in both regression and classification settings (zhao2018bayesian; zhang2020application; janizadeh2021novel). BART uses a non-parametric approach which learns through sums of trees (chipman1998bayesian), where each terminal node contribution is constrained by a regularising prior distribution. The target function is obtained by aggregating the small contributions of each tree, which is similar in flavour to the small-step updates of gradient boosting algorithms (friedman2001greedy).
Considering a univariate response $y_i$ and training observations denoted as $\mathbf{x}_i$, for $i = 1, \ldots, n$, the standard BART model is given by
$$y_i = \sum_{t=1}^{T} g(\mathbf{x}_i; \mathcal{T}_t, \mathcal{M}_t) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \tau^{-1}),$$
where the function $g(\mathbf{x}_i; \mathcal{T}_t, \mathcal{M}_t)$ assigns a sampled value $\mu_{t\ell}$ to $\mathbf{x}_i$ within terminal node $\ell$ of the tree $\mathcal{T}_t$ across all $T$ trees, and the vector $\mathcal{M}_t = (\mu_{t1}, \ldots, \mu_{t b_t})$ collects the sampled mean parameters from the $b_t$ terminal nodes in tree $\mathcal{T}_t$. Here, $\mathcal{N}$ denotes the normal distribution and $\tau$ is a residual precision term. In standard BART, terminal node parameters are assigned a $\mathcal{N}(0, \tau_{\mu}^{-1})$ prior, where the hyperparameters are selected to shrink the influence of each tree.
Our novel GP-BART method modifies the standard BART by using the function $h(\mathbf{x}_i; \mathcal{T}_t, \mathcal{M}_t)$ (replacing $g$), which assigns sampled values to the observations in node $\ell$ of tree $\mathcal{T}_t$, rather than the single value $\mu_{t\ell}$ used by BART. This is achieved by assuming a Gaussian process (GP) prior over each terminal node with constant mean and a covariance function whose parameters are defined at the tree level. The proportion of the GP’s contribution is governed by a penalty parameter $\nu \in [0, 1]$, such that GP-BART reduces to the standard BART model if $\nu = 0$ and reduces to a sum of GPs if $\nu = 1$.
In recent years, several extensions and modifications to the original BART model have been proposed to cover different types of data and assumptions (hill2020bayesian). To deal with the lack of smoothness, linero2018bayesian presented a soft version of the BART model by advocating probabilistic split rules at the tree-building stage. starling2020bart also presented a BART extension that guarantees smoothness over a single target covariate, applying Gaussian process priors over the targeted variable in each terminal node. prado2021bayesian proposed a model trees extension of BART which considers piecewise sums of linear functions at the terminal node level instead of piecewise sums of constants, adding flexibility. Our GP-BART considers GP models at the terminal node level, and can be seen as a sum of piecewise GPs, which are inherently smooth. GP-BART can also be seen as an ensemble of the previously developed treed Gaussian process models (tGP; gramacy2008bayesian).
We envisage our novel GP-BART framework being particularly appropriate for spatial data where smoothness in space is expected for certain covariate combinations, and thus useful in situations where Gaussian processes are commonly used (e.g., banerjee2008gaussian; gelfand2016spatial; andugula2017gaussian; xie2018integrated). For situations where the trees contain split rules over spatial variables, we introduce a further novelty to allow for rotated splits. Traditional tree-based models partition the feature space into hyper-rectangles, since each split is made parallel to a coordinate axis. This behaviour leads to a staircase decision boundary which can inhibit the model’s ability to approximate true boundaries. garcia2007nonlinear propose non-linear projections of the tree models used in ensemble approaches to overcome this limitation, while menze2011oblique describe an oblique forest model which selects optimal oblique directions using linear discriminant analysis. More recently, blaser2016random proposed random rotation ensembles, where the direction of rotation is selected randomly, yielding a more general decision boundary. In our GP-BART approach, the randomness from the rotated splitting rules permits the tree search to explore a wider sample space of the tree distribution and avoid poor mixing (wu2007bayesian). Figure 1 summarises the main idea of our proposed statistical model, highlighting the modified terminal node priors and the rotated splitting rules.
The remainder of this paper is organised as follows. Section 2 introduces the GP-BART model and its prior structure, while Section 3 contains the sampling algorithm and describes prediction settings and uncertainty estimation. Sections 4 and 5 provide comparisons between GP-BART and other methods in simulated and real-data benchmarking scenarios, respectively. Finally, Section 6 presents conclusions regarding the proposed algorithm, some limitations, and potential future work.
2 Gaussian processes Bayesian additive regression trees
For simplicity, we begin with the notation for a single tree model. Let $\mathcal{T}$ be a binary splitting tree with $b$ terminal nodes and let $\mathcal{M} = \{\mathcal{M}_1, \ldots, \mathcal{M}_b\}$ denote the sets of parameters associated with each terminal node’s GP. Each GP, denoted by $\mathcal{GP}_{\ell}$, is characterised by a constant mean vector $\mu_{\ell}\mathbf{1}$ and a covariance function $k(\cdot, \cdot)$, where $\phi$ and $\tau_{\mu}$ are the respective length and rate parameters of the chosen stationary kernel.
Since the tree follows a binary structure, each new node will be determined by splitting rules of the form $x_{j} < c$ vs. $x_{j} \geq c$ for continuous predictors (where $c$ is a scalar uniformly randomly selected from the observed values of $x_{j}$), and $x_{j} \in \mathcal{C}$ vs. $x_{j} \notin \mathcal{C}$ for categorical ones (where $\mathcal{C}$ represents one of the possible outcome levels for that variable). For a single tree with $b$ terminal nodes, the model is written as $y_i = h(\mathbf{x}_i; \mathcal{T}, \mathcal{M}) + \epsilon_i$, where the function $h(\cdot)$ assigns the predicted values from $\mathcal{GP}_{\ell}$ to the observations belonging to terminal node $\ell$.
Expanding such a model into a sum-of-trees structure is achieved via
$$y_i = \sum_{t=1}^{T} h(\mathbf{x}_i; \mathcal{T}_t, \mathcal{M}_t) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \tau^{-1}),$$
where the parameters $\mathcal{M}_t$ now characterise the terminal node GPs of each tree $\mathcal{T}_t$, now denoted by $\mathcal{GP}_{t\ell}$, where $\mu_{t\ell}\mathbf{1}$ is again a constant vector, the length parameter $\phi_t$ is now specific to each tree, and the function $h(\cdot)$ now assigns the predicted values from $\mathcal{GP}_{t\ell}$. The GP-BART model can be interpreted as a piecewise sum of non-linear GPs whereby each of the $T$ trees makes a small contribution to the overall fit, whereas BART can be interpreted as a less flexible piecewise sum of constants. Consequently, GP-BART typically requires fewer trees than the standard BART model.
2.1 Prior model
As in standard BART, we require prior distributions for the tree structure, the terminal node parameters, and the residual precision; i.e., $p(\mathcal{T}_t)$, $p(\mathcal{M}_t \mid \mathcal{T}_t)$, and $p(\tau)$. We assume $T$ is fixed and select the following shrinkage priors assuming independence between trees and terminal nodes:
We follow chipman2010bart in our selection of priors for $\mathcal{T}_t$ and $\tau$, and adopt data-driven priors for the node-level GP parameters in such a way that considerable probability is assigned around the range of the observed $y$ given the induced prior from the sum of GPs. Associated hyperparameters are omitted from Equations (1) and (2) for brevity, but we now fully define each prior in turn.
2.1.1 The tree structure
The prior is specified using the standard setting given by chipman1998bayesian. Thus, the tree prior distribution is implicitly defined by a generating stochastic process. The tree generation is initialised with a root node, and a splitting variable is uniformly randomly selected among the available covariates. In the standard BART algorithm, the structure is learned via grow, prune, change, and swap moves. Thus, new trees are proposed by growing a new terminal node, removing a pair of terminal nodes, changing the split rule for an internal node, or swapping the split rules for a pair of internal nodes, where the type of move is chosen at random. Each proposed tree is then accepted or rejected via Metropolis-Hastings (MH); see chipman1998bayesian for further details.
We introduce two additional moves which we name ‘grow-project’ and ‘change-project’. Notably, these moves could also be incorporated to improve prediction in the standard BART. For a given pair of randomly selected continuous covariates, they provide random projections of the feature space when growing or changing nodes. Such rules in the projection space yield rotated rules in the original space. The rotated components of the coordinate system are obtained using the rotation matrix
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},$$
where the angle $\theta$ is uniformly randomly selected among a grid of values within a fixed interval at each split. A uniform prior over the range of the given (rotated) covariate is defined and the splitting rule for that node is sampled.
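As a concrete illustration of how an axis-parallel rule in the projected space becomes an oblique rule in the original space, the sketch below rotates a randomly chosen pair of continuous covariates and splits on one rotated coordinate; the function name and set-up are hypothetical, not the authors' implementation.

```python
import numpy as np

def rotate_pair(X, j1, j2, theta):
    """Rotate covariate columns j1 and j2 of X by angle theta
    (illustrative sketch of a projected/rotated split, not the
    authors' implementation)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Xr = X.copy()
    Xr[:, [j1, j2]] = X[:, [j1, j2]] @ R.T
    return Xr

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 2))
theta = rng.uniform(0, np.pi)        # in practice, drawn from a grid of angles
Xr = rotate_pair(X, 0, 1, theta)
c = rng.choice(Xr[:, 0])             # split value from the observed rotated values
left = Xr[:, 0] < c                  # axis-parallel here, oblique in the original space
```

Splitting on the first rotated coordinate at value `c` partitions the original feature space along a line at angle `theta` to the axes.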
Finally, the probability of an individual node at depth $d$ being non-terminal is controlled by the hyperparameters $\alpha$ and $\beta$ through
$$p(d) = \alpha(1 + d)^{-\beta}, \qquad \alpha \in (0, 1),\ \beta \geq 0,$$
resulting in a tree prior which is a product of the probabilities of each node, since Equation (4) assumes independence between nodes. We fix the default values $\alpha = 0.95$ and $\beta = 2$ in order to prioritise shallower trees (chipman2010bart).
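The depth prior can be sketched numerically; the defaults $\alpha = 0.95$ and $\beta = 2$ follow chipman2010bart:

```python
# Probability that a node at depth d is non-terminal under the tree prior,
# with the default hyperparameters alpha = 0.95, beta = 2 (chipman2010bart).
def p_nonterminal(d, alpha=0.95, beta=2.0):
    return alpha * (1.0 + d) ** (-beta)

probs = [p_nonterminal(d) for d in range(4)]
# The splitting probability decays quickly with depth, favouring shallow trees.
```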
2.1.2 The prior on the Gaussian processes
The main contribution of the GP-BART model is to define
$$\mathcal{M}_{t\ell} \sim \mathcal{GP}\left(\mu_{t\ell}\mathbf{1}_{n_{t\ell}},\ \mathbf{\Omega}_{t\ell}\right)$$
as a GP prior over the set of observations belonging to terminal node $\ell$ of tree $\mathcal{T}_t$, where $\mathbf{1}_{n_{t\ell}}$ is a vector of ones of length $n_{t\ell}$, such that the mean vector is constant. Here, $\mathbf{\Omega}_{t\ell}$ is specified as a node-specific stationary matrix of exponentiated-quadratic covariance terms, whose elements, for each pair of observations $\mathbf{x}_i$ and $\mathbf{x}_j$ in the node, are given by
$$k(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{\tau_{\mu}} \exp\left(-\frac{\left\lVert\mathbf{x}_i - \mathbf{x}_j\right\rVert^2}{2\phi_t^2}\right),$$
where $\lVert\cdot\rVert^2$ denotes the squared Euclidean norm taken over the set of predictors belonging to the corresponding node. We normalise all predictors to improve the numerical stability of the kernel. Notably, the trees themselves are unaffected by this choice, as the splitting rules governing their structure are invariant to monotone transformations.
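A minimal sketch of such a covariance matrix, assuming the kernel is scaled by the reciprocal of a rate parameter `tau_mu` (matching the length/rate description above; parameter values are illustrative):

```python
import numpy as np

def eq_kernel_matrix(X, phi, tau_mu):
    """Node-level exponentiated-quadratic covariance matrix: entry (i, j)
    is exp(-||x_i - x_j||^2 / (2 phi^2)) / tau_mu. A sketch following the
    length (phi) / rate (tau_mu) notation in the text."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    return np.exp(-sq / (2.0 * phi ** 2)) / tau_mu

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalise predictors, as in the text
Omega = eq_kernel_matrix(X, phi=1.0, tau_mu=2.0)
```

The resulting matrix is symmetric and positive semi-definite, with constant diagonal `1 / tau_mu`.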
We set the rate parameter of the kernel to $\tau_{\mu}$ to exploit conjugacy and enable all mean parameters to be marginalised out. The penalty parameter $\nu$ plays a key role in this model. It can be interpreted in terms of a mixture of the GP covariance structure and the standard BART terminal node prior with respective weights $\nu$ and $1 - \nu$, such that $\nu$ governs the contribution of the GPs to the fit of the model. When $\nu = 1$, the model reduces to a sum of GPs without the partitioned behaviour from the tree structure, while $\nu = 0$ recovers the standard BART.
chipman2010bart showed that the induced prior distribution on $\mathbb{E}(y_i \mid \mathbf{x}_i)$ over all trees in a BART model allows for some expert knowledge to be incorporated about the contribution of each tree, which can help to guide the choices of hyperparameter values. However, the presence of the GP priors on the terminal nodes in GP-BART yields a different induced prior distribution.
Following the chipman2010bart approach, the key idea is to select the hyperparameters such that $\mathbb{E}(y_i)$ lies between $y_{\min}$ and $y_{\max}$ with high probability, by constructing an interval of $k$ prior standard deviations around the prior mean, for a chosen $k$. We adopt $k = 2$, which represents an approximate 95% confidence interval. Following chipman2010bart, we re-scale $y$ such that $y_{\min} = -0.5$ and $y_{\max} = 0.5$. Therefore, it is reasonable to centre the induced prior at zero, and we find the remaining hyperparameter values as solutions to the system of Equations (6).
Thus, $\tau_{\mu}$ is assumed to be fixed. However, the length parameter $\phi_t$ would vanish from the model in the special case where $\nu = 0$ (BART), while the model reduces to a sum of GPs with constant mean in the special case where $\nu = 1$.
As we increase the number of trees $T$, the scale of each GP decreases, regularising the model by keeping each tree's contribution small and thereby reducing the chance of a single GP dominating the model. Likewise, the prior precision of the terminal node mean parameters is proportional to the number of trees, shrinking the mean of each terminal node as more tree components are added to the model. The parameter $\nu$ is not affected by the number of trees and is usually set at a default value of $0.5$, which implies that both prior components have the same weight in the model formulation.
2.2 The prior on the length parameter
The prior for the length parameter $\phi_t$ is built using a matrix of distances between the covariates in the training set. As shown in Equation (5), $\phi_t$ controls the rate of decay with respect to the distances between pairs of predictors. Therefore, it is reasonable to constrain the prior for $\phi_t$ to the range of the minimum and maximum (non-squared) distances observed within our training sample. This motivates our choice of the uniform prior
$$\phi_t \sim \mathcal{U}\left(\min_{i \neq j} D_{ij},\ \max_{i, j} D_{ij}\right),$$
where $\mathbf{D}$ denotes the matrix of pairwise Euclidean distances. The aforementioned normalisation of each predictor in $\mathbf{X}$ also aids the elicitation of this prior, by minimising the range of $\mathbf{D}$ and ensuring it is not unduly dominated by one covariate.
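The bounds of this uniform prior can be computed directly from the pairwise distance matrix; a sketch, with the zero self-distances on the diagonal excluded:

```python
import numpy as np

def phi_prior_bounds(X):
    """Minimum and maximum pairwise Euclidean distances among the rows of X,
    giving the bounds of the uniform prior on the length parameter
    (sketch; the zero diagonal of the distance matrix is excluded)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d = np.sqrt(sq)
    off = d[~np.eye(len(X), dtype=bool)]   # off-diagonal distances only
    return float(off.min()), float(off.max())

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalised predictors
lo, hi = phi_prior_bounds(X)
```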
2.3 The prior on the residual precision
A conjugate gamma distribution $\text{Ga}(a_{\tau}, d_{\tau})$, with expectation $a_{\tau} / d_{\tau}$, is assumed for the residual precision parameter $\tau$. To select the hyperparameters, we follow chipman2010bart in setting the shape and rate such that $P(\tau > \tau_{\text{LS}}) = p$, where $p$ is a fixed high-probability value and $\tau_{\text{LS}}$ is the precision calculated from an ordinary linear regression of $y$ against the same set of predictors $\mathbf{X}$. The intuition behind this estimation strategy comes from the idea that, given the non-linearity of the GP and the piecewise additive component from BART, we can be optimistic that the precision of the model is greater than that of a linear model.
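A sketch of this data-driven calibration: the residual precision of an ordinary least-squares fit gives the reference value around which the gamma prior is placed (variable names are illustrative):

```python
import numpy as np

def ols_precision(X, y):
    """Residual precision from an ordinary least-squares fit of y on X
    (with intercept); a sketch of the reference value used to calibrate
    the gamma prior on the residual precision."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])   # unbiased residual variance
    return float(1.0 / sigma2)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)
tau_ls = ols_precision(X, y)   # true residual precision here is 1 / 0.5^2 = 4
```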
3 Computational algorithms for inference and prediction
Given the observed data $\mathbf{y}$, the posterior distribution for the trees and their parameters is given by Equation (8). We define the notation of a generic set $S_{(-t)}$ as the set of all elements of $S$ except the $t$-th, such that $\mathcal{T}_{(-t)}$ corresponds to the set of trees except $\mathcal{T}_t$, with respective terminal node parameters $\mathcal{M}_{(-t)}$. The key feature necessary to sample from Equation (8) is the “Bayesian backfitting” algorithm (hastie2000bayesian), which enables iterative sampling of the $t$-th tree and its parameters. hastie2000bayesian showed that the conditional distribution of $(\mathcal{T}_t, \mathcal{M}_t)$ can be rewritten in terms of the partial residuals
$$\mathbf{R}_t = \mathbf{y} - \sum_{j \neq t} h(\mathbf{X}; \mathcal{T}_j, \mathcal{M}_j).$$
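The backfitting idea can be sketched as follows; the per-tree "fit" here is a stand-in vector of fitted values, with a crude shrinkage update in place of an actual tree proposal:

```python
import numpy as np

# Bayesian backfitting sketch: tree t is updated against the partial
# residuals R_t = y - sum of the other trees' predictions.
rng = np.random.default_rng(4)
n, T = 100, 5
y = rng.normal(size=n)
tree_fits = [np.zeros(n) for _ in range(T)]   # initial stump fits

for t in range(T):
    R_t = y - sum(tree_fits[j] for j in range(T) if j != t)
    # ...propose/accept a new tree for R_t, then refit; as a stand-in,
    # shrink the current fit towards the partial residual:
    tree_fits[t] = 0.1 * R_t

total_fit = sum(tree_fits)   # ensemble prediction after one sweep
```

Each sweep reduces the residual variance, mirroring how every tree absorbs a small part of the signal left over by the others.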
The general structure of the sampler is thus given by:
The algorithm is initialised with stumps (i.e., trees with a single root node), along with initial values for all mean parameters, all length parameters, and the residual precision parameter. Initially, only grow and grow-project moves are proposed. Thereafter, once trees have reached a sufficient depth, new trees are sequentially proposed by randomly selecting one of the six available moves: grow, grow-project, prune, change, change-project, and swap, and then accepted or rejected according to an MH step.
The changes in the number of parameters in $\mathcal{M}_t$ as these moves modify the tree depth do not affect the sampling of $\mathcal{T}_t$, since $\tau_{\mu}$ is fixed, $\phi_t$ is specified at the tree level, and all mean parameters are marginalised out, yielding a tractable tree posterior proportional to $p(\mathcal{T}_t)\,p(\mathbf{R}_t \mid \mathcal{T}_t, \phi_t, \tau)$, which does not depend on any varying-dimensional parameters at the terminal node level. The terminal node parameters are updated by a Gibbs sampling scheme, with the associated full conditional distributions given in Equations (10) and (11), where $\mathbf{I}$ is an identity matrix of appropriate dimension.
Lastly, we sample the length parameters $\phi_t$ using MH steps with a uniform proposal distribution as per Equation (7). Once all trees are updated, the precision parameter $\tau$ is sampled using a Gibbs step, with the full conditional given by
$$\tau \mid \cdot \sim \text{Ga}\left(a_{\tau} + \frac{n}{2},\ d_{\tau} + \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2\right),$$
where $\hat{y}_i$ represents the sum of the predictions across all terminal nodes from all sampled trees.
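Assuming a $\text{Ga}(a_{\tau}, d_{\tau})$ prior in shape/rate form, the Gibbs draw for the precision can be sketched as:

```python
import numpy as np

def sample_tau(y, y_hat, a_tau, d_tau, rng):
    """Gibbs draw for the residual precision from its gamma full
    conditional (sketch; shape/rate parameterisation assumed):
    tau | . ~ Ga(a_tau + n/2, d_tau + 0.5 * sum((y - y_hat)^2))."""
    n = len(y)
    shape = a_tau + n / 2.0
    rate = d_tau + 0.5 * np.sum((y - y_hat) ** 2)
    return rng.gamma(shape, scale=1.0 / rate)   # numpy uses shape/scale

rng = np.random.default_rng(5)
y = rng.normal(scale=0.5, size=500)            # residual sd 0.5, so tau is near 4
taus = [sample_tau(y, np.zeros(500), 0.1, 0.1, rng) for _ in range(200)]
```

With a weak prior and a well-fitting model, the draws concentrate around the true residual precision.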
3.1 Algorithm specifications and initialisation
The penalty parameter $\nu$ and the number of trees $T$ have default values which presented reasonable predictive performance in the various scenarios demonstrated in Sections 4 and 5. Alternatively, these parameters could also be selected via cross-validation, though the computational cost of doing so may be prohibitive. The inclusion, or not, of the rotated splitting rules is also a setting of the model that can be calibrated by the user. By default, GP-BART includes the rotated splits, since they generally improve the model’s predictions, especially for spatial data.
For the MCMC sampling steps, a standard number of iterations, of which an initial burnin portion is discarded, was found to yield a sufficient number of samples to reliably characterise the posterior in all applications herein. Since GP-BART reduces to the standard BART when $\nu = 0$, we set $\nu = 0$ for an initial portion of the burnin phase and restore it to its chosen value thereafter. As BART tends to produce deeper trees, this discourages the GP-BART phase of the sampler from starting with trees with large numbers of observations in terminal nodes and reduces the computational burden associated with the matrix inversions required for each terminal node, since such operations are bypassed while $\nu = 0$ and their complexity thereafter is cubic in the number of node-level observations.
We present the full structure of the GP-BART sampler in Algorithm 1, where the matrix of covariates $\mathbf{X}$ and response vector $\mathbf{y}$ from the training set enter as inputs. Trees, partial residuals, and hyperparameters are then initialised. Sequentially, for each MCMC sample, a proposed tree is accepted, if valid, with the appropriate MH acceptance probability. Here, we consider trees with terminal nodes containing fewer than a minimum number of observations to be invalid. Lastly, the remaining parameters are sampled using Equations (10)–(12).
3.2 Prediction in GP-BART
The trees in GP-BART models can provide out-of-sample predictions for a set of new observations $\mathbf{X}^{\star}$. For a given terminal node $\ell$ in tree $\mathcal{T}_t$ for a particular MCMC sample, the joint posterior distribution of the training observations (expressed as partial residuals) and the predictions is multivariate normal, with covariance blocks $\mathbf{\Omega}$, $\mathbf{\Omega}^{\star}$, and $\mathbf{\Omega}^{\star\star}$ corresponding to the training points, the cross-covariances, and the new points, respectively. Here, $n_{t\ell}$ and $n^{\star}_{t\ell}$ denote the number of observations assigned to terminal node $\ell$ of tree $\mathcal{T}_t$ for the training samples and new data, respectively. This posterior predictive distribution can be marginalised via standard multivariate normal conditioning.
Ultimately, the function $h(\cdot)$ assigns the sampled predictive vector to the associated new observations on a per-iteration basis, such that the estimates from GP-BART are given by the sum of predictions across trees, where $s = 1, \ldots, S$ indexes the draws from the posterior distribution after the burnin iterations. The overall prediction for a new observation is then given by the average of the estimates across the $S$ retained draws. The posterior samples from Equation (13) can also be used to quantify the uncertainty in the predictions. For instance, the endpoints of a prediction interval (PI) for a predicted value can be obtained from the upper and lower quantiles of its sampled values.
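A sketch of node-level GP prediction and the resulting quantile-based PI, using standard GP conditioning with a zero prior mean and a small jitter term (all settings here are illustrative, not the GP-BART defaults):

```python
import numpy as np

def eq_kernel(A, B, phi=0.2, tau_mu=1.0):
    """Exponentiated-quadratic kernel between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * phi ** 2)) / tau_mu

def gp_predict(R, X, X_star, phi=0.2, tau_mu=1.0, jitter=1e-6):
    """Standard GP conditioning for a terminal node: predictive mean
    K_* K^{-1} R and covariance K_** - K_* K^{-1} K_*^T (zero prior
    mean assumed; jitter stabilises the inversion)."""
    K = eq_kernel(X, X, phi, tau_mu) + jitter * np.eye(len(X))
    K_star = eq_kernel(X_star, X, phi, tau_mu)
    K_ss = eq_kernel(X_star, X_star, phi, tau_mu)
    mu = K_star @ np.linalg.solve(K, R)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mu, cov

# Toy node: partial residuals from a smooth function of one covariate.
X = np.linspace(0, 1, 10)[:, None]
R = np.sin(2 * np.pi * X[:, 0])
X_star = np.array([[0.25], [0.75]])
mu, cov = gp_predict(R, X, X_star)

# Per-iteration predictive draws give prediction intervals via quantiles.
rng = np.random.default_rng(6)
sd = np.sqrt(np.clip(np.diag(cov), 0, None))
draws = mu + sd * rng.standard_normal(size=(2000, 2))
pi_lower = np.quantile(draws, 0.025, axis=0)
pi_upper = np.quantile(draws, 0.975, axis=0)
```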
4 Simulation study

In this section, we present a simulation study to evaluate the performance of GP-BART. The simulated data are composed of a summation of shallow trees built from simulated covariates, each drawn independently from a fixed distribution. The values associated with each terminal node follow a multivariate normal distribution with specific mean and covariance parameters. We generate the response variable via a sum of trees, each with two terminal nodes, where the mean parameters are all constant vectors and, within each terminal node, multivariate normal spatial terms and additional noise terms are also added. Figure 2 shows the simulated data surface for datasets of three increasing sizes, highlighting the different partitioning behaviour and smoothness within each dataset.
We compare the performance of our GP-BART model to other tree-based methods, namely BART (chipman2010bart), SoftBART (linero2018bayesian), and tGP (gramacy2008bayesian), as well as the universal kriging model (cressie2015statistics) and latent Gaussian models using the integrated nested Laplace approximation (INLA; lindgren2015bayesian). We evaluate the results using $k$-fold cross-validation; each fold is treated as a test set and prediction accuracy is quantified using the root-mean-square error (RMSE) on each test set.
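The evaluation protocol can be sketched generically; the "model" below is a placeholder mean-only predictor, not one of the compared methods:

```python
import numpy as np

# k-fold cross-validation with per-fold RMSE (generic sketch).
rng = np.random.default_rng(8)
n, k = 120, 10
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
idx = rng.permutation(n)
folds = np.array_split(idx, k)             # k disjoint test folds

rmses = []
for test_idx in folds:
    train_idx = np.setdiff1d(idx, test_idx)
    y_hat = np.full(len(test_idx), y[train_idx].mean())  # placeholder predictor
    rmses.append(float(np.sqrt(np.mean((y[test_idx] - y_hat) ** 2))))
```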
The models are fitted using the R packages BART (bart2021package), tgp (gramacy2010categorical), fields (fields2021package), and INLA (lindgren2015bayesian), with their default settings, along with the implementation of the SoftBART model provided in the GitHub repository of linero2018bayesian. All hyperparameters for the GP-BART model were specified using the default values previously described throughout Sections 2 and 3. To qualitatively compare the methods, we analyse the prediction surfaces generated by each algorithm for the three dataset sizes shown in Figure 2, using the predictions over the test sets in the $k$-fold setting. The corresponding plots are provided in Figures 3, 4, and 5, respectively.
Though the provided plots indicate clear differences between the model types, each model’s behaviour is similar across the sample sizes. GP-BART’s prediction surfaces appear most similar to the original data shown in Figure 2 in each case. Indeed, GP-BART successfully identifies diagonal partitions due to its rotated splits, while BART, SoftBART, and tGP only produce splits parallel to the axes. Though BART and SoftBART nonetheless uncover differences among the terminal node regions, their assumption of spatial independence renders such predictions unreliable. The tGP, kriging, and INLA predictions capture the spatial features well, but their failure to identify the partitions results in blurred prediction surfaces in areas where the data split.
A quantitative comparison is shown in the boxplots in Figure 6, which reflect the previous qualitative interpretations. Here, GP-BART presents substantially lower RMSE than its competitors, particularly for larger sample sizes. To assess uncertainty calibration, we calculate the coverage of the PIs for each method; i.e., the proportion of observations that fall within the PI for each prediction, across all folds. From the results shown in Figure 7, the GP-BART model is the only one whose boxplots overlap the nominal coverage line in all cases. Thus, its performance in terms of uncertainty quantification is superior to that of the other models considered.
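Empirical PI coverage is computed as the proportion of held-out observations falling inside their intervals; a toy sketch with synthetic posterior draws:

```python
import numpy as np

# Empirical coverage of 95% prediction intervals (toy sketch): the draws
# here are synthetic stand-ins for per-observation posterior samples.
rng = np.random.default_rng(9)
y_true = rng.normal(size=1000)
draws = rng.normal(size=(500, 1000))          # posterior draws x observations
lower = np.quantile(draws, 0.025, axis=0)
upper = np.quantile(draws, 0.975, axis=0)
coverage = float(np.mean((y_true >= lower) & (y_true <= upper)))
```

A well-calibrated method should yield coverage close to the nominal 95% level.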
The MH acceptance rates of the moves used for learning the tree structures are another relevant aspect of GP-BART, as they allow us to study the efficacy of the novel grow-project and change-project moves. Table 1 shows the proportion of proposed trees that were accepted after the burnin phase for each of the six available move types under GP-BART, for each simulated dataset, over all folds. Though trees are rarely grown after the burnin phase, we can still see from these results that the grow-project and change-project moves present acceptance rates which are comparable to or higher than their respective standard counterparts. This is most likely attributable to the diagonal splitting rules used in the generation of each dataset.
Table 1 also shows that the swap move behaves differently from the other moves. This happens because a given tree must contain at least two internal nodes in order for new trees to be proposed according to a swap move. Here, however, most trees have a small number of internal nodes, given that the trees used to generate the data each contain only two terminal nodes. Thus, the acceptance rates of approximately zero can be explained by the sampled trees rarely reaching a sufficient depth for swap moves to be proposed.
To highlight the effect of the proposed moves and the use of GPs over the terminal nodes, four different, restricted versions of GP-BART are compared:
(A) without any projection moves or GPs (i.e., the standard BART model);
(B) without GPs, but with the addition of the two new moves;
(C) without the new moves, but with GPs;
(D) the standard GP-BART with both rotated splitting rules and GPs.
For brevity, we consider only the most computationally demanding setting here and defer the results for the other sample sizes, which lead to similar conclusions, to Appendix A. This comparison is summarised in Figures 8 and 9, in which the letters above are used to distinguish the model versions.
The prediction surface (A) in Figure 8 suggests that the standard BART cannot adequately capture the different behaviours in the terminal node regions, due to its lack of smoothness and non-linearity compared with GP-BART. Panel (B) compares reasonably well with (D), which highlights the benefits of the rotated splitting rules. However, Figure 9 provides further insights, indicating that only versions (C) and (D) perform satisfactorily on both presented metrics. In particular, these versions exhibit vastly superior coverage properties, which highlights the benefits of using GPs at the terminal node level. Overall, the standard GP-BART still presents the lowest median RMSE and the best coverage ratio.
5 Real-data applications

In this section, we appraise the predictive performance of GP-BART relative to BART, SoftBART, tGP, kriging, and INLA on diverse real datasets, as a larger and more challenging test of GP-BART’s capabilities. For illustration, we use four public datasets containing spatial features; i.e., with inherent dependence over the observations. These datasets are:
The Auckland dataset consists of rows describing infant mortality in Auckland, with two spatial covariates and the target variable (bivand2018spdep).
The Baltimore dataset comprises observations of house sales prices, two spatial features, and covariates characterising each property (bivand2018spdep).
The Boston dataset contains observations of the median values of owner-occupied suburban homes, two spatial features, and related covariates. We model a corrected version (boston_housing_correct) of the original data (boston_housing).
Swmud is a dataset of seabed mud content in the southwest Australia Exclusive Economic Zone with observations of two sets of spatial coordinates and mud content as the target variable (li2010predicting).
Our implementation of each algorithm follows its respective default settings, including those previously described throughout Sections 2 and 3 for GP-BART. As before, $k$-fold cross-validation is used to evaluate performance on each dataset. For the Baltimore and Boston datasets, we restrict the GPs to the strictly spatial features (i.e., the coordinates of the instances), while all covariates are used to build the trees. The results are summarised in Figures 10 and 11, which show the RMSE values and PI coverage ratios over all datasets, respectively.
According to Figure 10, GP-BART presents the lowest median RMSE for all four datasets. The differences are most pronounced for the Swmud data, in the comparison between GP-BART and BART, and the Boston data, for which kriging and INLA perform notably worse than all tree-based methods.
Figure 11 shows that the PIs produced by GP-BART are the most consistent, since its boxplots overlap with the horizontal line indicating nominal coverage in most cases. Apart from the Auckland data, for which tGP and kriging exhibit better coverage, the medians are close to the nominal value. Indeed, tGP is the only other method which yields reasonable intervals for all four datasets. Though GP-BART achieves lower RMSE values, this highlights the benefits of using GPs to produce observation-specific predictions at the terminal node level in tree-based methods, either using a single tree (tGP) or, especially, an additive ensemble of trees (GP-BART).
Another aspect of performance evaluation for each model is illustrated in Figure 12, which presents the average RMSE rank over the test partitions from the cross-validation. Ranks are defined here such that the algorithm yielding the lowest mean RMSE is given a rank of $1$, while the algorithm with the worst prediction performance is given the highest possible rank of $6$, for each test partition. From Figure 12, we can see that GP-BART has by far the lowest average RMSE rank for the Boston dataset. For the Auckland and Baltimore datasets, GP-BART’s performance in this regard is highly competitive, being essentially equivalent to INLA and BART, respectively. For the Swmud dataset, kriging and INLA outperform all other methods, but GP-BART achieves the next-lowest mean rank.
6 Conclusions

In this paper, we proposed GP-BART as an extension of the standard BART model. Our model uses Gaussian processes (GPs) to make observation-specific predictions at the terminal node level, and is thus able to capture non-linear relations and spatial dependence through the covariance structure of the GPs. In addition, our novel model also allows the use of projected splitting rules to build rotated partitions, which enable more flexibility in the tree representations.
The performance of GP-BART was evaluated over a number of simulated scenarios, where the model outperformed BART, restricted versions of GP-BART itself without the use of GPs and/or the novel projected splitting rules, and another unrelated BART extension. Our simulation studies also highlighted GP-BART’s superior performance relative to some spatial models, namely basic kriging and INLA. When tested on real applications, using out-of-sample data via $k$-fold cross-validation, GP-BART displayed competitive predictive capabilities, beating many of the established methods. We also compared the calibration properties of our method in terms of coverage; again, GP-BART performed as well as, or better than, competing methodologies.
There are several potential issues remaining with the model, which may also provide opportunities for future research:
Careful choices have been made regarding the specification of prior distributions for the model parameters because the trees and the GPs can compete to explain the variability in the data. We have set up the model so that both GPs and trees have an equal opportunity to contribute to the fit. However, a more substantial study might suggest general rules as to how these parameters might be elicited in light of certain dataset properties.
The model can be computationally challenging to fit for larger datasets, since the calculation of each terminal node’s contribution to the overall likelihood involves inverting each associated covariance matrix. Though our strategy of initialising GP-BART by a warm-up step that uses BART on the first phase of MCMC iterations substantially reduces this cost, approximations for the GP — using for example the Nyström method (williams2002observations) or the methods of quinonero2007approximation — might prove to be advantageous in future work.
In the applications herein, we have focused on the use of GP-BART for spatial datasets, but there is nothing to prohibit the model being used in generic machine learning tasks. However, we have restricted the GPs to be covariance-stationary through our use of exponentiated-quadratic kernels, which are governed only by scalar rate and (tree-specific) length parameters. A superior approach may introduce non-stationarity to the autocovariance and hence produce more flexible GP surfaces.
We hope to report on these developments as part of our future research plans.
Acknowledgements

Mateus Maia’s work was supported by Science Foundation Ireland Career Development Award grant number 17/CDA/4695 and SFI research centre award 12/RC/2289P2. Andrew Parnell’s work was supported by: a Science Foundation Ireland Career Development Award (17/CDA/4695); an investigator award (16/IA/4520); a Marine Research Programme funded by the Irish Government, co-financed by the European Regional Development Fund (Grant-Aid Agreement No. PBA/CC/18/01); European Union’s Horizon 2020 research and innovation programme under grant agreement No. 818144; SFI Centre for Research Training 18CRT/6049; and SFI Research Centre awards 16/RC/3872 and 12/RC/2289P2.
A Performance evaluation for restricted versions of GP-BART
The results of a comparison between different versions of GP-BART for the largest simulated dataset are illustrated in Figure 8, showing predicted surfaces, and Figure 9, showing boxplots of the RMSE values and PI coverage ratios. For completeness, we provide here the analogous plots for the other two sample sizes considered in the simulation study, with predicted surfaces and boxplots for the smallest sample size in Figures A.1 and A.2, respectively, and equivalent plots for the intermediate sample size in Figures A.3 and A.4. Recall that the restricted versions of GP-BART under consideration are: (A) without any projection moves or GPs (equivalent to the standard BART model); (B) without GPs, but with the addition of the two new moves; (C) without the new moves, but with GPs; and (D) the standard GP-BART with both rotated splitting rules and GPs. With the sole exception of (B) presenting a lower median RMSE for one setting in Figure A.2, the results for both metrics otherwise show superior performance for the standard version of GP-BART for both sample sizes. This clearly demonstrates the efficacy of the novel grow-project and change-project moves and the use of GP priors over terminal nodes, and reinforces the conclusions drawn from the corresponding figures for the simulation study in Section 4.