GP-BART: a novel Bayesian additive regression trees approach using Gaussian processes

04/05/2022 · by Mateus Maia, et al.

The Bayesian additive regression trees (BART) model is an ensemble method extensively and successfully used in regression tasks due to its consistently strong predictive performance and its ability to quantify uncertainty. BART combines "weak" tree models through a set of shrinkage priors, whereby each tree explains a small portion of the variability in the data. However, the lack of smoothness and the absence of a covariance structure over the observations in standard BART can yield poor performance in cases where such assumptions would be necessary. We propose Gaussian processes Bayesian additive regression trees (GP-BART) as an extension of BART which assumes Gaussian process (GP) priors for the predictions of each terminal node among all trees. We illustrate our model on simulated and real data and compare its performance to traditional modelling approaches, outperforming them in many scenarios. An implementation of our method is available in the R package rGPBART at: https://github.com/MateusMaiaDS/gpbart


1 Introduction

Bayesian additive regression trees (BART; chipman2010bart) is a probabilistic machine learning model that has proved successful in both regression and classification settings (zhao2018bayesian; zhang2020application; janizadeh2021novel). BART uses a non-parametric approach which learns through sums of trees (chipman1998bayesian), where each terminal node contribution is constrained by a regularising prior distribution. The target function f is obtained by aggregating the small contributions of each tree, which is similar in flavour to the small step updates of gradient boosting algorithms (friedman2001greedy).

Considering a univariate response y_i and training observations denoted as (x_i, y_i), i = 1, ..., n, the standard BART model is given by

y_i = \sum_{t=1}^{T} g(x_i; \mathcal{T}_t, \mathcal{M}_t) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \tau^{-1}),

where the function g(x_i; \mathcal{T}_t, \mathcal{M}_t) assigns a sampled value \mu_{t\ell} to x_i within terminal node \ell of the tree \mathcal{T}_t across all T trees, and the vector \mathcal{M}_t = (\mu_{t1}, ..., \mu_{tb_t}) collects the sampled mean parameters from the b_t terminal nodes in tree \mathcal{T}_t. Here, \mathcal{N} denotes the normal distribution and \tau is a residual precision term. In standard BART, terminal node parameters are assigned a \mathcal{N}(0, \tau_\mu^{-1}) prior, where the hyperparameters are selected to shrink the influence of each tree.

Our novel GP-BART method modifies the standard BART by using the function h(X; \mathcal{T}_t, \mathcal{M}_t) (replacing g), which assigns sampled values g_{t\ell} = (g_{t\ell 1}, ..., g_{t\ell n_{t\ell}}) to the n_{t\ell} observations in node \ell of tree \mathcal{T}_t, rather than the single value \mu_{t\ell} used by BART. This is achieved by assuming a Gaussian process (GP) prior over each terminal node with constant mean and a covariance function whose parameters are defined at the tree level. The proportion of the GP's contribution is governed by a penalty parameter \nu \in [0, 1], such that GP-BART reduces to the standard BART model if \nu = 0 and reduces to a sum of GPs if \nu = 1.

In recent years, several extensions and modifications to the original BART model have been proposed to cover different types of data and assumptions (hill2020bayesian). To deal with the lack of smoothness, linero2018bayesian presented a soft version of the BART model by advocating probabilistic split rules at the tree-building stage. starling2020bart presented a BART extension that guarantees smoothness over a single target covariate, applying Gaussian process priors over the targeted variable in each terminal node. prado2021bayesian proposed model trees BART, which considers piecewise sums of linear functions at the terminal node level instead of piecewise sums of constants, adding flexibility. Our GP-BART considers GP models at the terminal node level, and can thus be seen as a sum of piecewise GPs, which are inherently smooth. GP-BART can also be viewed as an ensemble of the treed Gaussian process models (tGP) previously developed by gramacy2008bayesian.

We envisage our novel GP-BART framework being particularly appropriate for spatial data, where smoothness in space is expected for certain covariate combinations, and thus useful in situations where Gaussian processes are commonly used (e.g., banerjee2008gaussian; gelfand2016spatial; andugula2017gaussian; xie2018integrated). For situations where the trees contain split rules over spatial variables, we introduce a further novelty to allow for rotated splits. Traditional tree-based models partition the feature space into hyper-rectangles, since each split is made in an axis-parallel direction. This behaviour leads to a staircase decision boundary which can inhibit the model's ability to approximate true boundaries. garcia2007nonlinear propose non-linear projections of the tree models used in ensemble approaches to overcome this limitation, while menze2011oblique describe an oblique forest model which selects optimal oblique directions using linear discriminant analysis. More recently, blaser2016random proposed random rotation ensembles where the direction of rotation is selected randomly, yielding a more general decision boundary. In our GP-BART approach, the randomness from the rotated splitting rules permits the tree search to explore a wider sample space of the tree distribution and avoid poor mixing (wu2007bayesian). Figure 1 summarises the main idea of our proposed statistical model, highlighting the modified terminal node priors and the rotated splitting rules.

Figure 1: Graphical representation of four example trees from a GP-BART model, where x_j denotes a specific covariate in X, the matrix of predictors in the training set. The splitting rules in each tree can take the form of a univariate cut-off for continuous covariates, a subset of factor levels for categorical covariates, or rotated split rules obtained by random projections of a pair of continuous covariates. Gaussian process priors are assumed for the predicted values for each terminal node in each tree, such that g_{t\ell} \sim \mathcal{N}(\mu_{t\ell} 1, \Omega_{t\ell}) a priori.

The remainder of this paper is structured as follows. Section 2 describes the GP-BART model, with mathematical formulations and key specifications. Section 3 contains the sampling algorithm and describes prediction settings and uncertainty estimation. Sections 4 and 5 provide comparisons between GP-BART and other methods in simulated and real-data benchmarking scenarios, respectively. Finally, Section 6 presents conclusions regarding the proposed algorithm, some limitations, and potential future work.

2 Gaussian processes Bayesian additive regression trees

For simplicity, we begin with the notation for a single tree model. Let \mathcal{T} be a binary splitting tree with b terminal nodes and let \mathcal{M} = \{\mathcal{G}_1, ..., \mathcal{G}_b\} denote the set of parameters associated with each terminal node's GP. Each GP, denoted by \mathcal{G}_\ell, is characterised by a constant mean vector \mu_\ell 1 and a covariance function k_{\phi,\nu}(\cdot, \cdot), where \phi and \nu are the respective length and rate parameters of the chosen stationary kernel.

Since the tree follows a binary structure, each new node will be determined by splitting rules of the form x_j \le c vs. x_j > c for continuous predictors (where c is a scalar uniformly randomly selected from the observed values of x_j), and x_j = c vs. x_j \ne c for categorical ones (where c represents one of the possible outcome levels for that variable). For a single tree with b terminal nodes, the model is written as y = h(X; \mathcal{T}, \mathcal{M}) + \epsilon, where the function h assigns the predicted values from \mathcal{G}_\ell to the observations belonging to terminal node \ell.

Expanding such a model into a sum-of-trees structure is achieved via

y = \sum_{t=1}^{T} h(X; \mathcal{T}_t, \mathcal{M}_t) + \epsilon, \qquad \epsilon \sim \mathcal{N}_n(0, \tau^{-1} I_n),

where the parameters \mathcal{M}_t = \{\mathcal{G}_{t1}, ..., \mathcal{G}_{tb_t}\} now characterise the terminal node GPs of each tree \mathcal{T}_t, now denoted by \mathcal{G}_{t\ell}, where \mu_{t\ell} 1 is again a constant vector, the length parameter \phi_t is now specific to each tree, and the function h now assigns the predicted values from \mathcal{G}_{t\ell}. The GP-BART model can be interpreted as a piecewise sum of non-linear GPs, whereby each of the T trees makes a small contribution to the overall fit, whereas BART can be interpreted as a less flexible piecewise sum of constants. Consequently, GP-BART typically requires fewer trees than the standard BART model.

2.1 Prior model

As in standard BART, we require prior distributions for the tree structures, the terminal node parameters, and the residual precision; i.e., p(\mathcal{T}_t), p(\mathcal{M}_t | \mathcal{T}_t), and p(\tau). We assume the penalty parameter \nu is fixed and select the following shrinkage priors, assuming independence between trees and terminal nodes:

p(\mathcal{T}_1, ..., \mathcal{T}_T, \mathcal{M}_1, ..., \mathcal{M}_T, \tau) = p(\tau) \prod_{t=1}^{T} p(\mathcal{M}_t | \mathcal{T}_t) p(\mathcal{T}_t),   (1)

where

p(\mathcal{M}_t | \mathcal{T}_t) = \prod_{\ell=1}^{b_t} p(\mathcal{G}_{t\ell} | \mathcal{T}_t).   (2)

We follow chipman2010bart in our selection of priors for \mathcal{T}_t and \tau, and adopt data-driven priors for the node-level GP parameters in such a way that considerable probability is assigned around the range of the observed y given the induced prior from the sum of GPs. Associated hyperparameters are omitted from Equations (1) and (2) for brevity, but we now fully define each prior in turn.

2.1.1 The tree structure

The prior p(\mathcal{T}_t) is specified using the standard setting of chipman1998bayesian. Thus, the tree prior distribution is implicitly defined by a generating stochastic process. The tree generation is initialised with a root node, and a splitting variable is uniformly randomly selected among the possible covariates in X. In the standard BART algorithm, the structure is learned via grow, prune, change, and swap moves. Thus, new trees are proposed by growing a new terminal node, removing a pair of terminal nodes, changing the split rule for an internal node, or swapping the split rules for a pair of internal nodes, where the type of move is chosen at random. Each proposed tree is then accepted or rejected via Metropolis-Hastings (MH); see chipman1998bayesian for further details.

We introduce two additional moves, which we name 'grow-project' and 'change-project'. Notably, these moves could also be incorporated to improve prediction in the standard BART. For a given pair of randomly selected continuous covariates, they provide random projections of the feature space when growing or changing nodes. Such rules in the projection space yield rotated rules in the original space. It is possible to add the rotated components of the coordinate system using the rotation matrix

R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},   (3)

where the angle \theta is uniformly randomly selected among a grid of values within the interval [0, \pi) at each split. A uniform prior over the range of the given (projected) covariate is defined and the splitting rule for that node is sampled.
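To make the projection concrete, the following R sketch shows how a rotated splitting rule might be generated (a minimal illustration under the notation above; the function name, grid size, and uniform cut-point rule are our own illustrative choices, not the rGPBART internals):

```r
# Sketch of a rotated ("grow-project") split over two continuous covariates.
rotated_split <- function(x1, x2, n_grid = 18) {
  theta <- sample(seq(0, pi, length.out = n_grid), 1)   # random angle on a grid
  R <- matrix(c(cos(theta), sin(theta),
                -sin(theta), cos(theta)), nrow = 2)     # rotation matrix (3)
  proj <- cbind(x1, x2) %*% R                           # rotate the covariate pair
  cutpoint <- runif(1, min(proj[, 1]), max(proj[, 1]))  # uniform rule in projection space
  list(theta = theta, cutpoint = cutpoint, left = proj[, 1] <= cutpoint)
}
```

A split of the form proj[, 1] <= cutpoint is axis-parallel in the rotated coordinates but diagonal in the original (x1, x2) space.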

Finally, the probability of an individual node at depth d being non-terminal is controlled by the hyperparameters \alpha \in (0, 1) and \beta \ge 0 through

p(\text{node at depth } d \text{ is non-terminal}) = \alpha (1 + d)^{-\beta},   (4)

resulting in a tree prior which is a product of the probabilities of each node, since Equation (4) assumes independence between nodes. We fix the default values \alpha = 0.95 and \beta = 2 in order to prioritise shallower trees (chipman2010bart).
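As a quick illustration of Equation (4), the split probability decays rapidly with depth under the defaults \alpha = 0.95 and \beta = 2; the one-line R function below evaluates it:

```r
# Probability that a node at depth d is non-terminal, Equation (4).
p_split <- function(d, alpha = 0.95, beta = 2) alpha * (1 + d)^(-beta)
p_split(0:3)  # approx. 0.950 0.238 0.106 0.059 -- deeper nodes rarely split
```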

2.1.2 The prior on the Gaussian processes

The main contribution of the GP-BART model is to define

g_{t\ell} \sim \mathcal{N}_{n_{t\ell}}(\mu_{t\ell} 1_{n_{t\ell}}, \Omega_{t\ell})

as a GP prior over the set of n_{t\ell} observations belonging to terminal node \ell of tree \mathcal{T}_t, where 1_{n_{t\ell}} is a vector of ones of length n_{t\ell}, such that the mean vector is constant. Here, \Omega_{t\ell} is specified as a node-specific stationary matrix of exponentiated-quadratic covariance terms, whose elements are given by

\Omega_{t\ell}(x_i, x_k) = \nu \tau_\mu^{-1} \exp\left( -\sum_{j=1}^{p} \frac{(x_{ij} - x_{kj})^2}{2\phi_t^2} \right) = \nu \tau_\mu^{-1} \exp\left( -\frac{\lVert x_i - x_k \rVert^2}{2\phi_t^2} \right),   (5)

where x_{\cdot j} denotes a generic column among the set of predictors belonging to the corresponding node and \lVert \cdot \rVert^2 denotes the squared Euclidean norm. We normalise all predictors to improve the numerical stability of the kernel. Notably, the trees themselves are unaffected by this choice, as the splitting rules governing their structure are invariant to monotone transformations.
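A direct R translation of this covariance is sketched below (assuming the \nu \tau_\mu^{-1} scaling of the kernel given above; the function name is illustrative):

```r
# Exponentiated-quadratic covariance matrix for the predictors X of one
# terminal node, as in Equation (5); nu, tau_mu, and phi as defined in the text.
node_cov <- function(X, phi, nu, tau_mu) {
  D2 <- as.matrix(dist(X))^2                 # squared Euclidean distances between rows
  (nu / tau_mu) * exp(-D2 / (2 * phi^2))     # nu * tau_mu^{-1} * kernel
}
```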

We set \mu_{t\ell} \sim \mathcal{N}(0, \tau_\mu^{-1}) to exploit conjugacy and enable all mean parameters to be marginalised out. The penalty parameter \nu \in [0, 1] plays a key role in this model. It can be interpreted in terms of a mixture of the GP covariance component and the constant node-mean component, with respective weights \nu and (1 - \nu), such that \nu governs the contribution of the GPs to the fit of the model. When \nu = 1, the model reduces to a sum of GPs without the partitioned behaviour from the tree structure, while \nu = 0 recovers the standard BART.

chipman2010bart showed that the induced prior distribution on f(x), the sum of the terminal node contributions over all trees in a BART model, allows for some expert knowledge to be incorporated about the contribution of each tree, which can help to guide the choices of hyperparameter values. However, the presence of the GP priors on the terminal node predictions in GP-BART yields a different induced prior, which we write as f(x) \sim \mathcal{N}(0, \sigma_f^2), where \sigma_f^2 depends on T, \tau_\mu, and \nu. Following the chipman2010bart approach, the key idea is to select the hyperparameters such that f(x) lies between y_{\min} and y_{\max} with high probability. The corresponding confidence interval for f(x) has boundaries

E[f(x)] \pm k \sigma_f,   (6)

for a chosen k. We adopt k = 2, which represents an approximate 95% confidence interval. Following chipman2010bart, we re-scale y such that y_{\min} = -0.5 and y_{\max} = 0.5. Therefore, it is reasonable to assume E[f(x)] = 0 and find the value of \tau_\mu as the solution to the system of Equations (6), where k \sigma_f = 0.5.

Thus, \nu is assumed to be fixed. However, the length parameter \phi_t would vanish from the model in the special case where \nu = 0 (BART), while the model reduces to a sum of GPs with constant mean in the special case where \nu = 1.

As we increase the number of trees T, the scale of each GP decreases, regularising the model by constraining each tree's contribution to be small, thereby reducing the chance of a single GP dominating the model. Likewise, the precision \tau_\mu of the node mean parameters is proportional to the number of trees, shrinking the mean of each terminal node as more tree components are added into the model. The parameter \nu is not affected by the number of trees and is usually set at a default value of 0.5, which implies that both prior components have the same weight in the model formulation.

2.2 The prior on the length parameter

The prior for the length parameter \phi_t is built using a matrix of distances between the observed covariate vectors in the training set. As shown in Equation (5), \phi_t controls the rate of decay of the covariance with respect to the distances between pairs of predictor vectors. Therefore, it is reasonable to constrain the prior for \phi_t to the range of the minimum and maximum (non-squared) distances observed within our training sample. This motivates our choice of the uniform prior

\phi_t \sim U(\min(D), \max(D)),   (7)

where D denotes the matrix of pairwise Euclidean distances between the rows of X (excluding the zero diagonal). The aforementioned normalisation of each predictor in X also aids the elicitation of this prior, by minimising the range of D and ensuring it is not unduly dominated by one covariate.
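The bounds of this uniform prior can be computed directly from the (normalised) training predictors, as in the following R sketch (function name illustrative):

```r
# Bounds for the uniform prior on phi, Equation (7): the smallest and largest
# positive pairwise Euclidean distances among the normalised predictors.
phi_bounds <- function(X) {
  d <- dist(scale(X))        # pairwise distances; the zero diagonal is excluded
  c(min(d[d > 0]), max(d))   # guard against duplicated rows on the lower bound
}
```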

2.3 The prior on the residual precision

A conjugate gamma distribution Ga(a_\tau, d_\tau), with expectation a_\tau / d_\tau, is assumed for the residual precision parameter \tau. To select the hyperparameters, we follow chipman2010bart in setting the shape a_\tau and rate d_\tau such that P(\tau > \tau_{LM}) = q, where q is a high-probability value (e.g., q = 0.9, the default of chipman2010bart) and \tau_{LM} is the precision calculated from an ordinary linear regression of y against the same set of predictors X. The intuition behind this estimation strategy comes from the idea that, given the non-linearity of the GPs and the piecewise additive component from BART, we can be optimistic that the precision of the model is greater than that of a linear model.
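This calibration can be sketched in R as follows, where the default shape value and the root-finding strategy for matching the quantile condition are our own illustrative choices:

```r
# Calibrate the Ga(a, d) prior on tau so that P(tau > tau_lm) = q, where
# tau_lm is the precision from an ordinary linear regression of y on X.
calibrate_tau_prior <- function(y, X, a = 3, q = 0.9) {
  tau_lm <- 1 / summary(lm(y ~ ., data = as.data.frame(X)))$sigma^2
  # For a fixed shape a, pick the rate d satisfying the quantile condition:
  d <- uniroot(function(d) pgamma(tau_lm, shape = a, rate = d) - (1 - q),
               interval = c(1e-8, 1e8))$root
  list(a = a, d = d, tau_lm = tau_lm)
}
```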

3 Computational algorithms for inference and prediction

Given the observed y, the posterior distribution for the trees and their parameters is given by

p(\mathcal{T}_1, ..., \mathcal{T}_T, \mathcal{M}_1, ..., \mathcal{M}_T, \tau | y) \propto p(y | \cdot) \, p(\tau) \prod_{t=1}^{T} p(\mathcal{M}_t | \mathcal{T}_t) p(\mathcal{T}_t).   (8)

We define the notation of a generic set S_{(-t)} as the set of all elements of S except the t-th, such that \mathcal{T}_{(-t)} corresponds to the set of trees except \mathcal{T}_t, with respective terminal node parameters \mathcal{M}_{(-t)}. The key feature necessary to sample from Equation (8) is the "Bayesian backfitting" algorithm (hastie2000bayesian), which enables iterative sampling of the t-th tree and its parameters conditional on all others. hastie2000bayesian showed that the conditional distribution p(\mathcal{T}_t, \mathcal{M}_t | \mathcal{T}_{(-t)}, \mathcal{M}_{(-t)}, \tau, y) can be rewritten in terms of the partial residuals

R_t = y - \sum_{t' \neq t} h(X; \mathcal{T}_{t'}, \mathcal{M}_{t'}).   (9)
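The backfitting update is simple to express: each tree is fitted to the residuals left over by all other trees. A minimal R sketch, assuming the current per-tree predictions are held in an n x T matrix (an illustrative data structure):

```r
# Partial residuals for tree t, Equation (9): subtract the predictions of all
# other trees from the response. Column t' of tree_preds holds the current
# MCMC state of h(X; T_t', M_t').
partial_residuals <- function(y, tree_preds, t) {
  y - rowSums(tree_preds[, -t, drop = FALSE])
}
```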

The general structure of the sampler, summarised in Algorithm 1 below, is as follows. The algorithm is initialised with T stumps (i.e., trees with a single root node), with all mean parameters set to zero and initial values chosen for the length parameters and the residual precision. Initially, only grow and grow-project moves are proposed. Thereafter, once trees have reached sufficient depth, new trees are sequentially proposed by randomly selecting one of the six available moves: grow, grow-project, prune, change, change-project, and swap, and then accepted or rejected according to an MH step.

The changes in the number of parameters in \mathcal{M}_t, as these moves modify the tree structure, do not affect the sampling of \mathcal{T}_t, since \nu is fixed, \phi_t is specified at the tree level, and all mean parameters are marginalised out, yielding a tractable tree posterior proportional to p(\mathcal{T}_t) p(R_t | \mathcal{T}_t, \phi_t, \tau), which does not depend on any varying-dimensional parameters at the terminal node level. The terminal node parameters are updated by a Gibbs sampling scheme, with the associated full conditional distributions

\mu_{t\ell} | g_{t\ell}, \phi_t, \cdot \sim \mathcal{N}\left( \frac{1^{\top} \Omega_{t\ell}^{-1} g_{t\ell}}{1^{\top} \Omega_{t\ell}^{-1} 1 + \tau_\mu}, \; \left( 1^{\top} \Omega_{t\ell}^{-1} 1 + \tau_\mu \right)^{-1} \right),   (10)

g_{t\ell} | \mu_{t\ell}, \phi_t, \tau, R_{t\ell} \sim \mathcal{N}\left( \Psi_{t\ell} \left( \tau R_{t\ell} + \Omega_{t\ell}^{-1} \mu_{t\ell} 1 \right), \; \Psi_{t\ell} \right),   (11)

where \Psi_{t\ell} = ( \Omega_{t\ell}^{-1} + \tau I )^{-1}, with I being an identity matrix of appropriate dimension.

Lastly, we sample the length parameters \phi_t using MH steps with a uniform proposal distribution as per Equation (7). Once all trees are updated, the precision parameter \tau is sampled using a Gibbs step, with the full conditional given by

\tau | y, \cdot \sim \text{Ga}\left( a_\tau + \frac{n}{2}, \; d_\tau + \frac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right),   (12)

where \hat{y}_i represents the sum of the predictions across all terminal nodes from all sampled trees.
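The corresponding Gibbs draw is a one-liner in R (a sketch under the Ga(a_\tau, d_\tau) prior of Section 2.3):

```r
# Gibbs update for the residual precision tau, Equation (12): the full
# conditional is Ga(a + n/2, d + 0.5 * sum((y - y_hat)^2)).
sample_tau <- function(y, y_hat, a, d) {
  n <- length(y)
  rgamma(1, shape = a + n / 2, rate = d + 0.5 * sum((y - y_hat)^2))
}
```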

3.1 Algorithm specifications and initialisation

The penalty parameter \nu has a default value of 0.5, and the number of trees T defaults to a relatively small ensemble (recall from Section 2 that GP-BART typically requires fewer trees than standard BART), since these settings presented reasonable predictive performance in the various scenarios demonstrated in Sections 4 and 5. Alternatively, these parameters could also be selected via cross-validation, though the computational cost of doing so may be prohibitive. The inclusion, or not, of the rotated splitting rules is also a setting of the model that can be calibrated by the user. By default, GP-BART includes the rotated splits, since they improve the model's predictions in general, especially for spatial data.

For the MCMC sampling steps, a standard number of iterations, of which an initial burn-in portion is discarded, was found to yield a sufficient number of samples to reliably characterise the posterior in all applications herein. Since GP-BART reduces to the standard BART when \nu = 0, we set \nu = 0 for the first iterations of the burn-in phase and restore it to its default value thereafter. As BART tends to produce deeper trees, this discourages the GP-BART phase of the sampler from starting with trees containing large numbers of observations in their terminal nodes, and reduces the computational burden associated with the matrix inversions required for each terminal node, since such operations are bypassed while \nu = 0 and their complexity thereafter is \mathcal{O}(n_{t\ell}^3).

We present the full structure of the GP-BART sampler in Algorithm 1, where the matrix of covariates X and response vector y from the training set enter as inputs. Trees, partial residuals, and hyperparameters are then initialised. Sequentially, for each MCMC sample, a proposed tree is accepted, if valid, with the usual MH acceptance probability. Here, we consider trees with terminal nodes containing fewer than a minimum number of observations to be invalid. Lastly, the remaining parameters are sampled using Equations (10)-(12).

Input: y, X, T, \nu, and all hyperparameters of the priors.
Initialise: T tree stumps such that h(X; \mathcal{T}_t, \mathcal{M}_t) = 0, along with initial values for each \phi_t and \tau.
for MCMC iterations s from 1 to S do
       for trees t from 1 to T do
             Calculate the partial residuals R_t via Equation (9);
             Propose a new tree \mathcal{T}_t^{\star} by a grow, grow-project, prune, change, change-project, or swap move;
             Accept \mathcal{T}_t^{\star} and update \mathcal{T}_t with probability \min\{1, \alpha(\mathcal{T}_t, \mathcal{T}_t^{\star})\};
             for terminal nodes \ell from 1 to b_t do
                   Update \mu_{t\ell} via Equation (10);
                   Update g_{t\ell} via Equation (11).
             end for
            Update \phi_t using MH.
       end for
       Update \tau via Equation (12).
end for
Output: Approx. samples from p(\mathcal{T}_1, ..., \mathcal{T}_T, \mathcal{M}_1, ..., \mathcal{M}_T, \tau | y).
Algorithm 1: GP-BART sampling algorithm

3.2 Prediction in GP-BART

The trees in GP-BART models can provide out-of-sample predictions for a set of new observations X^{\star}. For a given terminal node \ell in tree \mathcal{T}_t for a particular MCMC sample, the joint posterior distribution of the training observations (expressed as partial residuals) and the predictions g^{\star}_{t\ell} is given by

\begin{pmatrix} R_{t\ell} \\ g^{\star}_{t\ell} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_{t\ell} 1_{n_{t\ell}} \\ \mu_{t\ell} 1_{n^{\star}_{t\ell}} \end{pmatrix}, \begin{pmatrix} \Omega_{t\ell} + \tau^{-1} I & \Omega^{\star}_{t\ell} \\ \Omega^{\star\top}_{t\ell} & \Omega^{\star\star}_{t\ell} \end{pmatrix} \right),

with \Omega^{\star}_{t\ell} denoting the cross-covariance between the training and new inputs, and \Omega^{\star\star}_{t\ell} the covariance among the new inputs, both built from Equation (5). Here, n_{t\ell} and n^{\star}_{t\ell} denote the number of observations assigned to terminal node \ell of tree \mathcal{T}_t for the training samples and new data, respectively. This posterior predictive distribution can be marginalised and given in terms of

g^{\star}_{t\ell} | R_{t\ell} \sim \mathcal{N}(m^{\star}_{t\ell}, V^{\star}_{t\ell}),

where

m^{\star}_{t\ell} = \mu_{t\ell} 1 + \Omega^{\star\top}_{t\ell} (\Omega_{t\ell} + \tau^{-1} I)^{-1} (R_{t\ell} - \mu_{t\ell} 1) \quad \text{and} \quad V^{\star}_{t\ell} = \Omega^{\star\star}_{t\ell} - \Omega^{\star\top}_{t\ell} (\Omega_{t\ell} + \tau^{-1} I)^{-1} \Omega^{\star}_{t\ell}.

Ultimately, the function h assigns the sampled vector g^{\star}_{t\ell} to the associated new observations on a per-iteration basis, such that the estimates from GP-BART are given by

\hat{y}^{\star(s)} = \sum_{t=1}^{T} h(X^{\star}; \mathcal{T}^{(s)}_t, \mathcal{M}^{(s)}_t),   (13)

where s = 1, ..., S indexes the draws from the posterior distribution after the burn-in iterations. The overall prediction for a new observation is then given by the average of the S estimates; i.e., \hat{y}^{\star} = S^{-1} \sum_{s=1}^{S} \hat{y}^{\star(s)}.

The posterior samples from Equation (13) can also be used to quantify the uncertainty in the predictions. For instance, the endpoints of a (1 - \alpha) prediction interval (PI) for a predicted value can be obtained from the corresponding upper and lower quantiles of the sampled \hat{y}^{\star(s)}.
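The node-level predictive draw is the usual GP conditioning; it might be sketched in R as follows, reusing the illustrative node_cov helper defined in Section 2.1.2 (all names are our own):

```r
# Conditional mean and covariance of the predictions g* for new inputs Xstar
# in one terminal node, given the partial residuals R there (noise precision tau).
node_predict <- function(R, X, Xstar, mu, phi, nu, tau, tau_mu) {
  n <- nrow(X)
  K_all <- node_cov(rbind(X, Xstar), phi, nu, tau_mu)  # joint covariance, Equation (5)
  K_rr <- K_all[1:n, 1:n] + diag(1 / tau, n)           # training block plus noise
  K_sr <- K_all[-(1:n), 1:n, drop = FALSE]             # cross-covariance block
  K_ss <- K_all[-(1:n), -(1:n), drop = FALSE]          # new-data block
  A <- K_sr %*% solve(K_rr)
  list(mean = as.vector(mu + A %*% (R - mu)),          # m* above, with constant mean mu
       cov  = K_ss - A %*% t(K_sr))                    # V* above
}
```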

4 Simulations

In this section, we present a simulation study to evaluate the performance of GP-BART. The simulated data are composed of a summation of trees of depth one, built using the variables (x_1, x_2). The covariates are also simulated, with each being drawn from a uniform distribution. The values associated with each terminal node follow a multivariate normal distribution with specific mean and covariance parameters. We generate the response variable via

y = \sum_{t=1}^{3} h(X; \mathcal{T}_t, \mathcal{M}_t) + \epsilon,

with T = 3 trees, each with two terminal nodes. Here, the mean parameters are all constant vectors of the form \mu_{t\ell} 1, with a distinct fixed value for each of the six terminal nodes. Within each terminal node, multivariate normal spatial terms and additional noise terms are also added, with fixed values set for the associated covariance and noise parameters. Figure 2 shows the simulated data surface for three datasets of increasing size, highlighting the different partitioning behaviour and smoothness within each dataset.

Figure 2: Simulated data for the three sample sizes considered.

We compare the performance of our GP-BART model to other tree-based methods, namely BART (chipman2010bart), SoftBART (linero2018bayesian), and tGP (gramacy2008bayesian), as well as the universal kriging model (cressie2015statistics) and latent Gaussian models fitted using the integrated nested Laplace approximation (INLA; lindgren2015bayesian). We evaluate the results using k-fold cross-validation; each fold is treated as a test set in turn, and prediction accuracy is quantified using the root-mean-square error (RMSE) on each test set.
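For reference, this evaluation loop can be expressed compactly in R (a generic sketch; fit_fun and predict_fun stand in for a wrapper around any of the models compared here, and the number of folds is a placeholder):

```r
# Generic k-fold cross-validated RMSE for any model wrapper.
cv_rmse <- function(X, y, fit_fun, predict_fun, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = length(y)))  # random fold labels
  sapply(seq_len(k), function(f) {
    fit  <- fit_fun(X[folds != f, , drop = FALSE], y[folds != f])
    pred <- predict_fun(fit, X[folds == f, , drop = FALSE])
    sqrt(mean((y[folds == f] - pred)^2))                    # test-set RMSE
  })
}
```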

The models are fitted using the R packages BART (bart2021package), tgp (gramacy2010categorical), fields (fields2021package), and INLA (lindgren2015bayesian), with their default settings, along with the implementation of the SoftBART model provided on the GitHub repository of linero2018bayesian. All hyperparameters for the GP-BART model were specified using the default values previously described throughout Sections 2 and 3. To qualitatively compare the methods, we analyse the prediction surface generated by each algorithm for the three datasets shown in Figure 2, using the predictions over the test sets in the k-fold setting. The corresponding plots are provided in Figures 3, 4, and 5, respectively.

Figure 3: Predicted surfaces for the smallest simulated scenario (first panel of Figure 2) using the different methods over the test datasets.
Figure 4: Predicted surfaces for the intermediate simulated scenario (second panel of Figure 2) using the different methods over the test datasets.
Figure 5: Predicted surfaces for the largest simulated scenario (third panel of Figure 2) using the different methods over the test datasets.

Though the plots indicate clear differences between the model types, each model's behaviour is similar across the sample sizes. GP-BART's prediction surfaces appear most similar to the original data shown in Figure 2 in each case. Indeed, GP-BART successfully identifies diagonal partitions due to its rotated splits, while BART, SoftBART, and tGP only produce splits parallel to the axes. Though BART and SoftBART nonetheless uncover differences among the terminal node regions, their assumption of spatial independence renders such predictions unreliable. The tGP, kriging, and INLA predictions capture the spatial features well, but their failure to identify the partitions results in blurred prediction surfaces in areas where the data split.

A quantitative comparison is shown in the boxplots in Figure 6, which reflect the previous qualitative interpretations. Here, GP-BART presents substantially lower RMSE than its competitors, particularly for the larger sample sizes. To assess uncertainty calibration, we calculate the coverage of the PIs for each method; i.e., the proportion of observations that fall within the PI for each prediction, across all folds. From the results shown in Figure 7, the GP-BART model is the only one whose boxplots overlap the nominal coverage line in all cases. Thus, its performance in terms of uncertainty quantification is superior to the other models considered.
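The coverage ratio itself is computed from the posterior predictive draws, e.g. (the interval level below is a placeholder):

```r
# Empirical coverage of (1 - alpha) prediction intervals: draws is an
# S x n_test matrix of posterior predictive samples for the test set.
pi_coverage <- function(draws, y_test, alpha = 0.05) {
  lower <- apply(draws, 2, quantile, probs = alpha / 2)
  upper <- apply(draws, 2, quantile, probs = 1 - alpha / 2)
  mean(y_test >= lower & y_test <= upper)   # proportion of covered observations
}
```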

Figure 6: Comparison between the RMSE values obtained by the competing models for the simulated data, using k-fold cross-validation over the different sample sizes.
Figure 7: Comparison between the coverage ratios obtained by the competing models for the simulated data, using k-fold cross-validation over the different sample sizes.

The MH acceptance rates of the newly proposed moves used for learning the tree structures are another relevant aspect of GP-BART, allowing us to study the efficacy of the novel grow-project and change-project moves. Table 1 shows the proportion of new trees that were accepted after the burn-in phase for each of the six available move types under GP-BART, for each simulated dataset, over all folds. Though trees are rarely grown after the burn-in phase, we can still see from these results that the grow-project and change-project moves present acceptance rates which are comparable to or higher than their respective standard counterparts. This is most likely attributable to the diagonal splitting rules used in the generation of each dataset.

Table 1 also shows that the swap move behaves differently from the other moves. This happens because a given tree must contain at least two internal nodes in order for a new tree to be proposed via a swap move. Here, however, most trees have a small number of internal nodes, given that the trees used to generate the data each contain only two terminal nodes. Thus, the acceptance rates of approximately zero can be explained by the sampled trees rarely reaching a sufficient depth for swap moves to be proposed.

Moves: Grow | Grow-project | Prune | Change | Change-project | Swap

Table 1: Acceptance rates for the six tree-proposal moves available under GP-BART for the three simulated datasets, obtained by dividing the number of times the given move was accepted by the total number of trees across all folds in all retained posterior samples.

To highlight the effect of the proposed moves and the use of GPs over the terminal nodes, four different, restricted versions of GP-BART are compared:

  (A) without any projection moves or GPs (i.e., the standard BART model);

  (B) without GPs, but with the addition of the two new moves;

  (C) without the new moves, but with GPs;

  (D) the standard GP-BART with both rotated splitting rules and GPs.

We defer the results for the other sample sizes, which lead to similar conclusions, to Appendix A, and consider only the most computationally demanding setting (the largest sample size) here, for brevity. This comparison is summarised in Figures 8 and 9, in which the letters above are used to distinguish the model versions.

The prediction surface (A) in Figure 8 suggests that the standard BART cannot adequately capture the different behaviours in the terminal node regions, due to its lack of smoothness and non-linearity compared with GP-BART. Panel (B) compares reasonably well with (D), which highlights the benefits of the rotated splitting rules. However, further insights are gleaned from Figure 9, which indicates that only versions (C) and (D) have satisfactory performance for both presented metrics. In particular, these versions exhibit vastly superior coverage properties, which highlights the benefits of using GPs at the terminal node level. Nonetheless, the standard GP-BART still presents the lowest median RMSE and the best coverage ratio.

Figure 8: Comparison between the predicted surfaces under the different versions of GP-BART for the largest simulated dataset. The surface for (D), the standard version of GP-BART, is qualitatively close to the observed data in the third panel of Figure 2.
Figure 9: Boxplots of the RMSE values (left) and coverage ratios of the PIs (right) across the different versions of the GP-BART model for the largest simulated dataset. The standard GP-BART, (D), has the best performance in terms of both RMSE and coverage.

5 Applications

In this section, we appraise the predictive performance of GP-BART compared to BART, SoftBART, tGP, kriging, and INLA on diverse real datasets, as a larger and more challenging test of GP-BART's capabilities. For illustration, we use four public datasets containing spatial features; i.e., with inherent dependence over the observations. These datasets are:

  1. The Auckland dataset consists of observations describing infant mortality in Auckland, with two spatial covariates and the target variable (bivand2018spdep).

  2. The Baltimore dataset comprises observations of house sale prices, two spatial features, and covariates characterising each property (bivand2018spdep).

  3. The Boston dataset contains observations of the median values of owner-occupied suburban homes, two spatial features, and related covariates. We model a corrected version (boston_housing_correct) of the original data (boston_housing).

  4. The Swmud dataset describes seabed mud content in the southwest Australia Exclusive Economic Zone, with observations of two sets of spatial coordinates and mud content as the target variable (li2010predicting).

Our implementation of each algorithm follows their respective default settings, including those previously described throughout Sections 2 and 3 for GP-BART. As before, k-fold cross-validation is used to evaluate performance on each dataset. For the Baltimore and Boston datasets, we restrict the GPs to the strictly spatial features (i.e., the exact coordinates of the instances), while all covariates are used to build the trees. The results are summarised in Figures 10 and 11, which show the RMSE values and PI coverage ratios over all datasets, respectively.

According to Figure 10, GP-BART presents the lowest median RMSE for all four datasets. The differences are most pronounced for the Swmud data, in the comparison between GP-BART and BART, and for the Boston data, for which kriging and INLA perform notably worse than all tree-based methods.

Figure 11 shows that the PIs produced by GP-BART are the most consistent, since its boxplots overlap with the horizontal line indicating nominal coverage in most cases. Apart from the Auckland data, for which tGP and kriging exhibit better coverage, the medians are close to the expected nominal value. Indeed, tGP is the only other method which yields reasonable intervals for all four datasets. Given that GP-BART also achieves lower RMSE values, this highlights the benefits of using GPs to produce observation-specific predictions at the terminal node level in tree-based methods, whether using a single tree (tGP) or, especially, an additive ensemble of trees (GP-BART).

Figure 10: Comparison between the RMSE values for the benchmarking datasets across the six competing methods, using k-fold cross-validation.
Figure 11: Comparison between the coverage ratios of the PIs for the benchmarking datasets across the six competing methods, using k-fold cross-validation.

Another aspect of performance evaluation for each model is illustrated in Figure 12, which presents the average RMSE rank over the test partitions from the cross-validation. Ranks are defined here such that the algorithm yielding the lowest mean RMSE on a given test partition is given a rank of 1, while the algorithm with the worst prediction performance is given the highest possible rank of 6. From Figure 12, we can see that GP-BART has by far the lowest average RMSE rank for the Boston dataset. For the Auckland and Baltimore datasets, GP-BART's performance in this regard is highly competitive, being broadly equivalent to INLA and BART, respectively. For the Swmud dataset, kriging and INLA outperform all other methods, but GP-BART achieves the next-lowest mean rank.

Figure 12: RMSE ranks for all six competing models over the four benchmark datasets, averaged over all test partitions of the k-fold cross-validation. The ranks range from 1 to 6, with lower ranks associated with lower mean RMSE values.

6 Discussion

In this paper, we proposed GP-BART as an extension of the standard BART model. Our model uses Gaussian processes (GPs) to make observation-specific predictions at the terminal node level, and is thus able to capture non-linear relationships and spatial dependence through the covariance structure of the GPs. In addition, our novel model allows the use of projected splitting rules to build rotated partitions, which enable more flexibility in the tree representations.

The performance of GP-BART was evaluated over a number of simulated scenarios, where the model outperformed BART, restricted versions of GP-BART itself (without the GPs and/or the novel projected splitting rules), and related tree-based extensions. Our simulation studies also highlighted GP-BART's superior performance relative to established spatial models, namely kriging and INLA. When tested on real applications, using out-of-sample data via k-fold cross-validation, GP-BART displayed competitive predictive capabilities, beating many of the established methods. We also compared the calibration properties of our method in terms of coverage; again, GP-BART performed as well as or better than the competing methodologies.

There are several potential issues remaining with the model, which may also provide opportunities for future research:

  • Careful choices have been made regarding the specification of prior distributions for the model parameters, because the trees and the GPs can compete to explain the variability in the data. We have set up the model so that both the GPs and the trees have an equal opportunity to contribute to the fit. However, a more substantial study might suggest general rules as to how these parameters might be elicited in light of certain dataset properties.

  • The model can be computationally challenging to fit for larger datasets, since the calculation of each terminal node's contribution to the overall likelihood involves inverting the associated covariance matrix. Though our strategy of initialising GP-BART with a warm-up phase that uses BART for the first portion of MCMC iterations substantially reduces this cost, approximations for the GPs (using, for example, the Nyström method of williams2002observations or the methods of quinonero2007approximation) might prove advantageous in future work.

  • In the applications herein, we have focused on the use of GP-BART for spatial datasets, but there is nothing to prohibit the model being used in generic machine learning tasks. However, we have restricted the GPs to be covariance-stationary through our use of exponentiated-quadratic kernels, which are governed only by a scalar rate parameter and a tree-specific length parameter. A superior approach may introduce non-stationarity into the autocovariance and hence produce more flexible GP surfaces.

We hope to report on these developments as part of our future research plans.

Acknowledgements

Mateus Maia’s work was supported by Science Foundation Ireland Career Development Award grant number 17/CDA/4695 and SFI research centre award 12/RC/2289P2. Andrew Parnell’s work was supported by: a Science Foundation Ireland Career Development Award (17/CDA/4695); an investigator award (16/IA/4520); a Marine Research Programme funded by the Irish Government, co-financed by the European Regional Development Fund (Grant-Aid Agreement No. PBA/CC/18/01); European Union’s Horizon 2020 research and innovation programme under grant agreement No. 818144; SFI Centre for Research Training 18CRT/6049, and SFI Research Centre awards 16/RC/3872 and 12/RC/2289P2.


Appendix

A Performance evaluation for restricted versions of GP-BART

The results of a comparison between different versions of GP-BART for the largest simulated dataset are illustrated in Figure 8, showing predicted surfaces, and Figure 9, showing boxplots of the RMSE values and PI coverage ratios. For completeness, we provide here the analogous plots for the other sample sizes considered in the simulation study, with predicted surfaces and boxplots for the smallest dataset in Figures A.1 and A.2, respectively, and equivalent plots for the intermediate dataset in Figures A.3 and A.4. Recall that the restricted versions of GP-BART under consideration are: (A) without any projection moves or GPs (equivalent to the standard BART model); (B) without GPs, but with the addition of the two new moves; (C) without the new moves, but with GPs; and (D) the standard GP-BART with both rotated splitting rules and GPs. With the sole exception of (B) presenting a lower median RMSE for the smallest dataset in Figure A.2, the results for both metrics otherwise show superior performance for the standard version of GP-BART across both sample sizes. This clearly demonstrates the efficacy of the novel grow-project and change-project moves and the use of GP priors over terminal nodes, and reinforces the conclusions drawn from the corresponding figures for the simulation study in Section 4.

Figure A.1: Comparison between the predicted surfaces under the different versions of GP-BART for the smallest simulated dataset. The surface for (D), the standard version of GP-BART, is qualitatively close to the observed data in the first panel of Figure 2.
Figure A.2: Boxplots of the RMSE values (left) and coverage ratios of the PIs (right) across the different versions of the GP-BART model for the smallest simulated dataset. The performance of (D), the standard GP-BART, is competitive in terms of RMSE and superior in terms of coverage.
Figure A.3: Comparison between the predicted surfaces under the different versions of GP-BART for the intermediate simulated dataset. The surface for (D), the standard version of GP-BART, is qualitatively close to the observed data in the second panel of Figure 2.
Figure A.4: Boxplots of the RMSE values (left) and coverage ratios of the PIs (right) across the different versions of the GP-BART model for the intermediate simulated dataset. The standard GP-BART, (D), has the best performance in terms of both RMSE and coverage.