Influential Observations in Bayesian Regression Tree Models

03/26/2022
by Matthew T. Pratola, et al.
The Ohio State University

BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular Bayesian regression models widely applicable in modern regression problems. Their popularity is intimately tied to the ability to flexibly model complex responses depending on high-dimensional inputs while simultaneously being able to quantify uncertainties. This ability to quantify uncertainties is key, as it allows researchers to perform appropriate inferential analyses in settings that have generally been too difficult to handle using the Bayesian approach. However, surprisingly little work has been done to evaluate the sensitivity of these modern regression models to violations of modeling assumptions. In particular, we will consider influential observations, which one reasonably would imagine to be common – or at least a concern – in the big-data setting. In this paper, we consider both the problem of detecting influential observations and adjusting predictions to not be unduly affected by such potentially problematic data. We consider two detection diagnostics for Bayesian tree models, one an analogue of Cook's distance and the other taking the form of a divergence measure, and then propose an importance sampling algorithm to re-weight previously sampled posterior draws so as to remove the effects of influential data in a computationally efficient manner. Finally, our methods are demonstrated on real-world data where blind application of the models can lead to poor predictions and inference.



1 Introduction

In the contemporary approach to data-driven problem solving, statistical models have received increasing attention and popularity as a means for arriving at answers to complex research, science and business questions. As datasets have increased in size with the transition to the “big-data” era, the complexity and scalability of statistical models have seen rapid advances in order to address the needs of these modern problems. Popular examples of such models include neural networks (ghugare:etal:2014), random forests (leo:2001) and localized Gaussian Processes (gramacy:2015). In problems where uncertainty quantification is deemed necessary, Bayesian methods have come to the fore, such as the Bayesian variants of neural networks (mackay1995probable), Bayesian localized GPs (liu:etal:2021) and Bayesian regression tree models (chipman:etal:1998; chipman:etal:2010; Pratola:2016; horiguchi:etal:2021; horiguchi:etal:2022).

Despite the increasing popularity and capability of these modern statistical tools, there has been a conspicuous disconnect in terms of tools that support the application of such complex models when compared to their humble, small-dataset, low-dimensional ancestors. For example, in linear regression, students are taught an extensive array of tools for validating modeling assumptions in the classical setting, such as residual diagnostics, outlier detection and influence metrics (weisberg2013applied; cook:weisberg:1982). The Bayesian linear model has also received attention earlier in the literature (chaloner:brant:1988; zellner:moulton:1985; johnson:geisser:1983; zellner:1975). Yet surprisingly, such supporting tools have not received the same attention in the development of modern variants of statistical models. The assumption, it seems, is that in the big-data setting such issues are of lesser concern. We have found this assumption to be incorrect.

Our focus in this paper is on the single-tree Bayesian classification and regression tree (BCART) model (chipman:etal:1998), and in particular the ensemble-of-trees Bayesian Additive Regression Tree (BART) model of chipman:etal:2010. This class of models is currently receiving much attention in the research community, and has been used in a wide variety of problems including medical studies (tan2019bayesian), causal analysis (hahn2020bayesian), computer experiments (Pratola:etal:2014) and applied optimization (horiguchi:etal:2022). BCART and BART owe much of this popularity to their ability to scale to moderately sized big-data applications while retaining the ability to fully quantify statistical uncertainties, thanks to the elegant exploitation of conjugacies in the MCMC sampler. Our work arose out of a simple curiosity: can BCART or BART models be negatively affected by a potentially problematic observation, i.e. an observation that is influential or an outlier (or both)? On the one hand, since such Bayesian tree models fit simple localized models, it may appear that any problematic behavior due to a bad observation would be localized, and perhaps not of serious concern when working with large datasets. On the other hand, big data is usually also high-dimensional, and in high dimensions our notions of what constitutes a large sample size may not match our intuition.

Figure 1: Effect of two problematic observations on posterior predictions of BCART (left panel) and BART (right panel) models. Observation ‘a’ is a large outlier but has less influence due to its location in the center of the regression domain, while observation ‘b’ is both a large outlier and has higher influence due to its location at the edge of the regression domain. The resulting fits demonstrate the effect of removing these observations on the resulting posterior predictions (solid line) versus leaving them in (dotted line). The grey solid line denotes the true mean function.

Figure 1 demonstrates a simple example of this scenario. In fact, this scenario is rather favorable, as it is low-dimensional and there is plenty of data – both of these choices make it easy to visualize the behavior of the fitted BCART and BART models. Yet despite this seemingly favorable situation (i.e. one where we might not expect any serious effect due to influence or outliers), there is a surprisingly strong effect of two problematic observations, denoted in the figure by the ‘a’ and ‘b’ symbols. Certainly, the effect is localized as expected, so the overall fit may be reasonable for much of the regression domain of interest. Yet, in the local region containing the problematic observation, the predictions are severely affected, and it is reasonable to expect this type of issue to become worse in more realistic, higher-dimensional applications.

In the classical linear regression model, the approach to handling such problematic observations is to detect and remove them before proceeding to the final model fitting and inference stages. The popular classical tools include calculating the leverage of observations based on the diagonal entries of the hat-matrix, and calculating Cook's distance, defined for observation $i$ as

$$D_i = \frac{(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})^\top (\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})}{p \hat{\sigma}^2},$$

where $\hat{\mathbf{y}}_{(i)}$ represents the model's predictions when observation $i$ is held out from the training data, $p$ is the number of regression parameters, and $\hat{\sigma}^2$ is the usual least-squares estimate of error variance, $\hat{\sigma}^2 = \sum_{i=1}^n r_i^2 / (n - p)$. Cook's distance can itself be factored into a term that detects observations problematic due to a large potential for influence (leverage) and a term detecting influence due to a large residual. These terms are combined product-wise to arrive at

$$D_i = \frac{r_i^2}{p \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$

implying that good observations are those with low residual and low leverage, while problematic observations could be problematic for either or both of these issues.
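As a concrete reference point, the classical quantities above can be computed directly. Below is a minimal sketch on simulated data (the data-generating choices are illustrative, not from the paper):

```python
import numpy as np

def cooks_distance(X, y):
    """Classical Cook's distance for each observation in a linear model."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    h = np.diag(H)                                 # leverages h_ii
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta                               # residuals
    s2 = r @ r / (n - p)                           # least-squares error variance
    # product form: (residual term) x (leverage term)
    return (r**2 / (p * s2)) * (h / (1 - h) ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.1, 50)
y[0] += 3.0                                        # inject a large outlier
D = cooks_distance(X, y)
print(D.argmax())                                  # the injected outlier dominates
```

The product form makes the two failure modes visible: a clean observation at a high-leverage input can still flag, as can a low-leverage observation with a very large residual.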

In the modern Bayesian context, it is less clear how to handle such problematic data. For instance, do we care about point predictions or do we care about the posterior distribution? In the former setting, an analogue of the classical Cook’s distance may be quite reasonable. In the latter setting, some theoretical work suggests that if the problematic observation is known, one may not need to completely remove it from the analysis; instead the posterior can be adjusted to correct the effect of the problematic data on the resulting posterior. This implies that there are in fact two procedures needed to appropriately handle problematic observations in our modern Bayesian regression setting:

  1. identification of problematic observations;

  2. model (posterior) adjustment given identified problematic observations.

In this work, we propose two approaches for the identification problem (i). First, a direct extension of Cook’s distance to the regression tree model setting is outlined, and has the benefit of providing an easy and sensible interpretation. Second, an alternative divergence-based metric is also proposed. The divergence approach has the benefit of identifying observations that affect the posterior distribution. For the adjustment problem (ii), we explore two alternatives: the simple (but wasteful) dropped-observation approach, and an importance-sampling approach that reweights posterior expectations to account for the problematic observation without going so far as to completely remove it.

The paper proceeds as follows. In Section 2 we review the BCART and BART models. In Section 3, we outline our proposed Cook's distance metric for trees as well as the divergence-based metric. In Section 4 we derive importance samplers for reweighting BCART and BART posteriors to account for problematic observations. We then apply these tools to a variety of simulated datasets and to real-world data involving biomass fuels in Section 5. Finally, we conclude with a discussion in Section 6.

2 Bayesian Regression Trees and BART

For high-dimensional regression, most statistical and machine learning techniques focus on the estimation of the unknown mean function $f(\mathbf{x})$. It is typically assumed that the data are generated according to the homoscedastic process

(1) $$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2),$$

where $\mathbf{x}_i$ is a $p$-dimensional vector of predictor variables.

BART models the unknown mean function with an ensemble of Bayesian regression trees. Such regression trees provide a simple yet powerful non-parametric specification of multidimensional regression bases, where the form of the basis elements are themselves learned from the observed data. Each Bayesian regression tree is a recursive binary tree partition $T$ that is made up of interior nodes and a set of parameter values $M$ associated with the terminal nodes. Each interior tree node $\eta$ has a left and right child, denoted $l(\eta)$ and $r(\eta)$. In addition, all nodes also have one parent node, except for the tree root. One may also refer to a node by a unique integer identifier $k$, counting from the root, where the left child of node $k$ is node $2k$ and the right child is node $2k+1$. For example, the root node is node $1$, with node $2$ and node $3$ being its children. Figure 2 summarizes our notation.

Figure 2: Labeling for a single regression tree $T$. Nodes are denoted by circles and labeled using the symbol $\eta$. Lines denote branches connecting the nodes. Nodes can also be identified as left and right children (e.g. $l(\eta)$, $r(\eta)$) or as parent (e.g. $p(\eta)$). Terminal nodes have no branches below them and contain associated parameter values $\mu$. Note that later in the paper the same symbol will also index one member of an ensemble of trees; its use will be clear from context.

Internal nodes of regression trees have split rules depending on the predictors and “cutpoints” that are the particular predictor values at which the internal nodes split. This modeling structure is encoded in $T$, which accounts for the split rules at each internal node of a tree and the topological arrangement of nodes and edges forming the tree. Given the design matrix of predictors $X$ having dimension $n \times p$, each column represents a predictor variable and each row corresponds to the observed settings of these predictors. At a given internal node, the split rule is then of the form $x_v < c$, where $v$ is the chosen split variable and $c$ is the chosen cutpoint for split variable $v$.

The Bayesian formulation proceeds by specifying discrete probability distributions on the split variables $v$, taking a value in $\{1, \ldots, p\}$, and specifying discrete probability distributions on the set of distinct possible cutpoint values $\{c_{v,1}, \ldots, c_{v,K_v}\}$, where $K_v$ is the total number of discrete cutpoints available for variable $v$. For a discrete predictor, $K_v$ will equal the number of levels the predictor has, while for a continuous predictor a moderately large number of uniformly spaced cutpoints is common (chipman:etal:2010). The internal modeling structure of a tree can then be expressed as the collection of split rules over the internal nodes, $T = \{(v_\eta, c_\eta)\}$.

The Bayesian formulation is completed by specifying prior distributions on the parameters at the terminal nodes. For the $B$ terminal nodes in a given tree, the corresponding parameters are $M = \{\mu_1, \ldots, \mu_B\}$. Taken all together, the Bayesian regression tree defines a function $g(\mathbf{x}; T, M)$ which maps input $\mathbf{x}$ to a particular $\mu_b$.
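To make the mapping $g(\mathbf{x}; T, M)$ concrete, here is a minimal sketch of a single regression tree as a recursive partition, with internal nodes holding a split variable and cutpoint and terminal nodes holding a parameter (the data structures are illustrative, not the paper's implementation):

```python
import numpy as np

class Node:
    def __init__(self, v=None, c=None, mu=None, left=None, right=None):
        self.v, self.c, self.mu = v, c, mu       # split variable, cutpoint, terminal parameter
        self.left, self.right = left, right

def g(x, node):
    """Evaluate the tree function: route x down the split rules to a terminal mu."""
    while node.mu is None:
        node = node.left if x[node.v] < node.c else node.right
    return node.mu

# A depth-2 tree: split on x[0] < 0.5, then the left branch splits on x[1] < 0.3.
tree = Node(v=0, c=0.5,
            left=Node(v=1, c=0.3, left=Node(mu=-1.0), right=Node(mu=0.2)),
            right=Node(mu=1.5))
print(g(np.array([0.2, 0.1]), tree))   # -1.0
print(g(np.array([0.9, 0.9]), tree))   # 1.5
```

Each terminal node thus corresponds to a hyperrectangle in the input space, and the tree function is piecewise constant over those hyperrectangles.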

The original BART model is then obtained as the ensemble sum of $m$ such Bayesian regression trees plus an error component,

$$y_i = \sum_{j=1}^{m} g(\mathbf{x}_i; T_j, M_j) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$$

where $y_i$ is the observation collected at predictor setting $\mathbf{x}_i$, and $\sigma^2$ is the variance of the homoscedastic process. Combining all the parameters together as $\theta = \{(T_1, M_1), \ldots, (T_m, M_m), \sigma^2\}$, the BART prior is factored as $\pi(\theta) = \pi(\sigma^2) \prod_{j=1}^{m} \pi(M_j \mid T_j)\pi(T_j)$. For the terminal node parameters, normal priors are specified as $\mu_{jb} \sim N(0, \tau^2)$, where $\mu_{jb}$ is the $b$th terminal node component for tree $j$, and a scaled inverse chi-squared prior is specified for the variance, $\sigma^2 \sim \nu\lambda/\chi^2_\nu$. For a prior on the tree structure, we specify a stochastic process that describes how a tree is drawn: a node at depth $d$ spawns children with probability $\alpha(1+d)^{-\beta}$ for $\alpha \in (0,1)$ and $\beta \geq 0$. As the tree grows, $d$ gets bigger so that a node is less likely to spawn children and more likely to remain a terminal node, thus penalizing tree complexity. Details on specifying the parameters of the prior distributions are discussed in detail in chipman:etal:2010, while typically the choice of $m = 200$ trees appears to be reasonable in many situations. Meanwhile, choosing $m = 1$ results in the BCART model.
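The depth-penalizing behavior of the tree prior is easy to see by sampling from it. A short sketch, using the split probability $\alpha(1+d)^{-\beta}$ with the default values $\alpha = 0.95$, $\beta = 2$ of chipman:etal:2010 (the sampling code itself is ours, for illustration):

```python
import random

def split_prob(d, alpha=0.95, beta=2.0):
    """Probability that a node at depth d spawns children under the tree prior."""
    return alpha * (1.0 + d) ** (-beta)

def terminal_depths(rng, d=0):
    """Recursively grow a tree from the prior; return the depths of its terminal nodes."""
    if rng.random() < split_prob(d):
        return terminal_depths(rng, d + 1) + terminal_depths(rng, d + 1)
    return [d]

rng = random.Random(1)
max_depths = [max(terminal_depths(rng)) for _ in range(2000)]
print(sum(max_depths) / len(max_depths))   # shallow trees dominate under the prior
```

The root splits with probability 0.95, but a depth-1 node splits with probability only 0.95/4, so most sampled trees have just a handful of terminal nodes.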

The use of normal priors on the terminal node $\mu$'s, and an inverse chi-squared prior on the variance, greatly facilitates the posterior simulation via an MCMC algorithm, as they are conditionally conjugate. Selecting the split variables and cutpoints of internal tree nodes is performed using a Metropolis-Hastings step by growing and pruning each regression tree. The growing/pruning are performed using so-called birth and death proposals, which either split a current terminal node on some variable $v$ at some cutpoint $c$, or collapse two terminal nodes to remove a split. For complete details of the MCMC algorithm, the reader is referred to chipman:etal:1998; denison:etal:1998; chipman:etal:2010; Pratola:2016.

3 Influence Diagnostics for Trees

We now outline two diagnostic tests for the detection of problematic observations. The first is a direct application of Cook’s distance to Bayesian regression trees, while the second is a divergence-based approach.

3.1 Conditional Cook’s Distance for Regression Trees

Conditioning on the tree $T$, a single regression tree can be expressed in the usual linear form as

$$f(\mathbf{x}_i) = \sum_{b=1}^{B} \mu_b \, \mathbb{1}[\mathbf{x}_i \in \eta_b],$$

where $B$ is the total number of terminal nodes in the tree and $\mathbb{1}[\mathbf{x}_i \in \eta_b]$ is the indicator function taking the value $1$ when $\mathbf{x}_i$ maps to the hyperrectangle defined by terminal node $\eta_b$, and $0$ otherwise. The analogous formula for Cook's distance in the single-tree case, obtained by regressing $\mathbf{y}$ on the indicator basis (see Supplement), becomes

(2) $$D_i(T, M, \sigma^2) = \frac{r_i^2}{B \sigma^2} \cdot \frac{n_i}{(n_i - 1)^2},$$

where $r_i$ is the regression residual for observation $i$ and $n_i$ is the number of observations in the terminal node to which observation $i$ maps. Note here that in comparison to the classical Cook's distance, we have replaced $\hat{\sigma}^2$ with the parameter $\sigma^2$ itself, for which we have posterior samples. This form of $D_i$ provides helpful interpretations. For instance, it is a decreasing function of the number of terminal nodes, but on the other hand it increases as node purity increases (i.e. as $n_i$ becomes small) and in particular will blow up when $n_i = 1$. Also, we see that $D_i$ increases if the residual of observation $i$ is large relative to the standard error, and this effect increases like the square for every unit increase in standard deviation of the residual for observation $i$. To arrive at an overall estimate, we take the posterior sample mean over our $N$ MCMC draws,

(3) $$\bar{D}_i = \frac{1}{N} \sum_{s=1}^{N} D_i(T^{(s)}, M^{(s)}, \sigma^{2(s)}),$$

where each $D_i(T^{(s)}, M^{(s)}, \sigma^{2(s)})$ is the conditional Cook's distance as defined in equation (2).
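The posterior averaging is mechanical once per-draw quantities are in hand. A sketch, assuming a Cook's-type per-draw statistic of the form $r_i^2/(B\sigma^2) \cdot n_i/(n_i-1)^2$ (the form suggested by regressing on the terminal-node indicator basis; treat both it and the toy draws below as our illustration, not the paper's stored MCMC output):

```python
import numpy as np

def conditional_cooks(r_i, sigma2, B, n_i):
    """Cook's-type statistic for one posterior draw (assumed form; see text)."""
    if n_i <= 1:
        return np.inf                  # degenerate: observation alone in its node
    return (r_i**2 / (B * sigma2)) * n_i / (n_i - 1) ** 2

# Hypothetical MCMC draws for one observation i:
# (residual r_i, sigma^2, number of terminal nodes B, node count n_i)
draws = [(0.5, 1.0, 8, 20), (2.3, 0.9, 7, 3), (1.9, 1.1, 8, 2)]
D_bar = np.mean([conditional_cooks(*d) for d in draws])   # posterior mean, eq. (3)
print(D_bar)
```

Note how the draws with small $n_i$ dominate the average: node purity, not just the residual, drives the diagnostic.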

For the sum-of-trees BART model, we can extend this idea in a few ways. One simple approach is to report the average across all of the trees in BART's sum. That is, if $D_{ij}$ is the Cook's distance calculated as in equation (2) above for tree $j$, then one could report

$$\bar{D}_i^{avg} = \frac{1}{m} \sum_{j=1}^{m} D_{ij}.$$

Another practical alternative would be to report the posterior average of the maximum over the trees,

$$\bar{D}_i^{max} = \max_{j = 1, \ldots, m} D_{ij}.$$

The exact solution can be found by converting each sum-of-trees function into a single-tree representation as in horiguchi:etal:2021 to calculate the conditional Cook's distances for the BART sum-of-trees ensemble. In this case, the linear form is expressed with terminal node parameters $\mu_b^s$, where the superscript $s$ denotes the supertree representation of the ensemble. Note that each $\mu_b^s$ here is itself the sum of $m$ parameters from the original BART representation. If $B_j$ is the number of terminal nodes in tree $j$ then the number of terminal nodes in the supertree satisfies $B^s \geq \max_j B_j$, where typically this inequality is '$\gg$'. Let $n_{ij}$ be the number of observations in the terminal node of tree $j$ to which observation $i$ maps, and similarly let $n_i^s$ be the number of observations in the supertree's terminal node to which observation $i$ maps. Typically $n_i^s \leq \min_j n_{ij}$, since the hyperrectangle defined by the supertree terminal node to which observation $i$ maps is the intersection of the corresponding hyperrectangles from the $m$ trees in the additive form. We can then calculate the conditional Cook's distance as in equation (2) using the supertree quantities, and then report the posterior average of these values. Note that in this exact calculation, it is likely for the influential or outlying observation to have a much smaller $n_i^s$ than in any single tree $j$, which serves to inflate the Cook's distance more aggressively than in the above approximations. However, the approximations $\bar{D}_i^{avg}$ and $\bar{D}_i^{max}$ may be preferable for their computational simplicity.

3.2 Kullback–Leibler Divergence Diagnostic

Recall that the Kullback–Leibler divergence from distribution $p$ to $q$ is defined as

$$KL(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)} \, d\theta,$$

where $KL(q \,\|\, p) \geq 0$ with equality iff $q = p$ almost everywhere. In our context, we propose to take the reference distribution $p$ to be the posterior involving all the data, $\pi(\theta \mid \mathbf{y})$, and the distribution $q$ is taken to be the posterior when the potentially problematic data is held out. If we consider the simplest case of holding out a single observation $y_i$, then $q = \pi(\theta \mid \mathbf{y}_{(i)})$.

The KL divergence diagnostic has a simple Bayesian interpretation when evaluating the potential for observations to be problematic: if $KL(q \,\|\, p) \approx 0$ then observation $i$ is not very influential on the posterior distribution, whereas if $KL(q \,\|\, p) \gg 0$ then observation $i$ is unduly influential on the posterior distribution.
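For intuition about the scale of this diagnostic, when both posteriors are approximately Gaussian the divergence is available in closed form; a small illustration (the Gaussian approximation is ours, used only to build intuition):

```python
import math

def kl_normal(mu_q, s2_q, mu_p, s2_p):
    """KL( N(mu_q, s2_q) || N(mu_p, s2_p) ) in closed form."""
    return 0.5 * (math.log(s2_p / s2_q) + (s2_q + (mu_q - mu_p) ** 2) / s2_p - 1.0)

# q = held-out posterior, p = full-data posterior: a shifted mean signals influence.
print(kl_normal(0.0, 1.0, 0.0, 1.0))   # 0.0: observation i not influential
print(kl_normal(1.5, 1.0, 0.0, 1.0))   # 1.125: observation i shifts the posterior
```

A divergence near zero means deleting the observation barely moves the posterior; a divergence well above zero flags undue influence.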

In practice, we can estimate this metric quite simply using posterior samples from our full-data fit. For the theoretical divergence, we have

(4) $$KL(q \,\|\, p) = -E_q\left[\log p(y_i \mid \theta)\right] + \log E_q\left[p(y_i \mid \theta)\right],$$

where the first term is due to the i.i.d. form of the likelihood. Since $KL(q \,\|\, p) \geq 0$, we know that the divergence is minimized when both the first and second terms are zero. The first term captures the contribution of $y_i$ to the posterior distribution of $\theta$, and can be approximated using the full-data posterior samples as $-\frac{1}{N}\sum_{s=1}^{N} \log p(y_i \mid \theta^{(s)})$. The second term is a bit more involved, but can be easily estimated by recognizing it as the logarithm of an expectation and appealing to an importance sampling trick as summarized in Proposition 0.

Proposition 0: Suppose $f$ and $g$ are probability density functions such that $f(\theta) > 0$ whenever $g(\theta) > 0$. Consider $\kappa = \log E\left[g(\theta)/f(\theta)\right]$, where the expectation is with respect to $f$. Let $\theta_1, \ldots, \theta_N \sim f$ be independent. Then, as $N \to \infty$,

$$\log \frac{1}{N} \sum_{s=1}^{N} \frac{g(\theta_s)}{f(\theta_s)} \to \kappa \quad \text{almost surely.}$$

Substituting our estimators for each term of the theoretical KL divergence, we arrive at our overall KL-divergence based criterion,

(5)

where the first quantity is the terminal node parameter, in the tree from posterior sample $s$, to which observation $i$ maps, and the second is the number of observations mapping to the terminal node to which $y_i$ belongs in that tree and posterior sample.

Alternatively, since $KL(q \,\|\, p) \geq 0$, we know that the divergence is minimized when both terms in (4) are minimized. Note that the second term in (4) is constant for a given dataset and free of the held-out observation. Since we do not directly compare values of the diagnostic for different held-out observations (see Section 3.3), we need not concern ourselves with the second term and instead can concern ourselves only with small values of the first term. That is, for a reference residual level selected by the modeler, interest lies in how much larger the fitted model divergence is from this reference divergence level, where the terms involving the normalization constants cancel. This suggests a diagnostic based on the first term alone, which can be approximated using the full-data posterior samples as

(6)

and comparing to (6) evaluated using the reference residual level. The advantage of (6) over (5) is the reduced computation and, in particular, more stable computation, since we need not take the sum of exponential terms seen in the second term of (5). In practice, we also find that (6) more easily lends itself to a simple decision rule for detecting problematic observations based on the choice of reference level as described.

Overall, the advantage of these KL-divergence based diagnostics is that they tell us something about the sensitivity of the entire posterior distribution, whereas the tree-based Cook's distance diagnostic (2) only tells us about the sensitivity of the mean function. Nonetheless, we again see that (5) and (6) also exhibit inflationary behavior when we get into the degenerate situation of small $n_i$, as determined by the minimum number of observations per terminal node parameter, which usually has a small default value in most BART implementations. The interpretation of these diagnostics is then clear: (5) and (6) are large in the non-degenerate case when observation $i$ is far away from the other observations in its terminal node (since the density of $y_i$ will be small), or infinite in the degenerate case.

3.3 Detecting Influential Observations

In order to apply the above diagnostics, one requires a rule that will flag observations as potentially problematic. Since we are operating in the Bayesian realm, a simple approach would be to take high-quantile values of the posterior samples of the diagnostics, such as the 97.5% and 99% quantiles, and use these as decision boundaries for detecting influentials. However, this approach is less than ideal, since even when there are no influential observations in a dataset, it will nonetheless flag 2.5% or 1% of observations as being problematic.

As motivated by the discussion for the KL criterion (6), we prefer the following alternative based on the notion of how large a residual would have to be in order to be considered problematic. For Gaussian data, a residual that is some large number $k$ of standard deviations away would likely be the most conservative level most analysts would use to flag influentials, with a smaller multiple being a more typical choice. This implies substituting the reference residual $k\sigma$ in the respective diagnostics. For the Cook's-type diagnostic, one additionally needs to impute values for the tree complexity and node purity terms. One approach would be to substitute posterior averages from the fitted model. Alternatively, we can impute values based on the priors for the tree complexity term, and, since the minimum number of observations per terminal node has a small default value, a corresponding default for the node purity term. Calibrating the decision rules in this way is intuitive and interpretable for the practitioner, and in our applications appears to work quite well. See, for instance, Figures 3-6, which apply this rule using the two cutoff levels in Section 5. An exception is one of the statistics, which generally shows little difference between the two cutoffs, lending a preference for the more conservative choice in practice.
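The calibration idea can be sketched generically: evaluate the diagnostic at a reference residual of $k$ standard deviations and flag any observation whose diagnostic exceeds that reference value. The diagnostic values, the scale factor and the choice $k = 3$ below are all placeholders, not the paper's settings:

```python
import numpy as np

def flag_influential(diag_values, diag_at_reference):
    """Indices of observations whose diagnostic exceeds the k-sigma reference level."""
    return np.flatnonzero(diag_values > diag_at_reference)

rng = np.random.default_rng(2)
diag = rng.chisquare(1, size=100) * 0.1   # hypothetical per-observation diagnostics
diag[17] = 5.0                            # one clearly problematic observation
reference = 0.1 * 3.0**2                  # diagnostic evaluated at a 3-sigma residual
print(flag_influential(diag, reference))
```

Unlike a fixed-quantile rule, this cutoff flags nothing when no diagnostic exceeds the reference level, rather than always flagging a fixed fraction of the data.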

4 Adjusting Predictions via Importance Sampling

While one could use the proposed diagnostics to detect problematic observations and then refit the model with such observations removed from the dataset, for Bayesian models implemented using MCMC sampling algorithms (such as BART) this is a computationally wasteful approach. Instead, bradlow:zaslavsky:1997 propose to estimate functions of interest $h(\theta)$ under the case-deleted posterior using importance sampling.

Let $w(\theta) \propto \pi(\theta \mid \mathbf{y}_{(i)}) / \pi(\theta \mid \mathbf{y})$ be the importance sampling weights of interest when observation $i$ is to be dropped, and note that $w(\theta) \propto p(y_i \mid \theta)^{-1}$. Then,

$$E\left[h(\theta) \mid \mathbf{y}_{(i)}\right] \approx \frac{\sum_{s=1}^{N} w(\theta^{(s)}) h(\theta^{(s)})}{\sum_{s=1}^{N} w(\theta^{(s)})},$$

where the renormalization in the denominator removes the dependence on the proportionality constant. Intuitively, this importance sampling approach adjusts our posterior samples used in predicting as if we had instead sampled from $\pi(\theta \mid \mathbf{y}_{(i)})$. The weights also have a clear connection to the KL-divergence diagnostic proposed in Section 3, with the difference being that the diagnostic is based on the log density ratio whereas the weights are calculated on the density ratio scale. However, it turns out that direct application of these weights to posterior quantities of interest does not behave well due to the high-dimensional parameter space of treed models, particularly the richer models such as BART. This is because in a high-dimensional parameter space, the localized parameters affected by the problematic observation tend to be uncorrelated with poor draws for the rest of the high-dimensional parameter, and so downweighting entire posterior realizations from such high-dimensional parameter spaces tends to remove good samples for the rest of the parameter space. This problem with the “global reweighting” scheme can only be overcome by collecting extremely large numbers of posterior samples, which is computationally prohibitive. Fortunately, careful investigation of the situation in the prediction setting yields an effective reweighting scheme.
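The case-deletion reweighting can be sketched for a Gaussian likelihood: the weight of draw $s$ is proportional to the inverse likelihood of the held-out observation, self-normalized over draws. The toy "posterior draws" below are simulated for illustration, not output from a fitted tree model:

```python
import numpy as np

def deletion_weights(y_i, mu_draws, sigma2_draws):
    """Self-normalized case-deletion weights: w_s proportional to 1 / p(y_i | theta_s)."""
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2_draws)
                      + (y_i - mu_draws) ** 2 / sigma2_draws)
    log_w = -log_lik
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                   # renormalize: proportionality constant cancels

# Toy draws of a node mean and variance; y_i = 3 lies far from the node mean.
rng = np.random.default_rng(3)
mu = rng.normal(0.0, 0.2, size=1000)
s2 = np.full(1000, 1.0)
w = deletion_weights(3.0, mu, s2)
adj_mean = np.sum(w * mu)                # reweighted (case-deleted) posterior mean
print(adj_mean, mu.mean())
```

Draws whose parameters fit the held-out point well are downweighted, so the reweighted mean shifts away from the problematic observation, exactly as refitting without it would.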

4.1 Re-weighting Bayesian Tree Predictions

As we are typically interested in prediction, we will focus on a reweighting scheme for posterior quantities involving the terminal node parameters. The conditional independence structure of Bayesian trees allows us to simplify the calculation of the importance sampling weights while increasing their effectiveness in practice. Suppose the observation to be removed, $y_i$, belongs to terminal node $\eta$ with mean parameter $\mu_\eta$ in the current tree. Furthermore, let $P(\eta)$ be the set of internal nodes with associated split rules that define the path from $\eta$ back to the root node, and let the remaining tree parameters be collected separately. Note that $P(\eta)$ implicitly defines a hyperrectangle in the input space that maps to terminal node $\eta$ with associated prediction $\mu_\eta$. Then we have the following.

Proposition 1: Consider functions, such as predictions, involving only terminal node $\eta$. Then the weights are given by

$$w \propto p(y_i \mid \mu_\eta, \sigma^2)^{-1} \cdot \mathbb{1}\left[n_\eta - 1 \geq c\right],$$

where $n_\eta$ is the number of observations from the full dataset that map to terminal node $\eta$ and $c$ is the minimum number of observations allowed per terminal node. Similarly, consider functions, such as predictions, not involving terminal node $\eta$. Then the weights are constant, $w \propto 1$.

Note that this result still holds in the degenerate case. In words, Proposition 1 says that when we hold out $y_i$, the weights for predictions involving the subregion of the input space defined by $\eta$ involve re-weighting the predictions by the inverse density of $y_i$ if the node would have been valid with the case deleted; otherwise the prediction receives zero weight. Meanwhile, weights for predictions involving terminal nodes other than $\eta$ effectively receive a weight of 1, indicating no ill effect of the case deletion and lending the interpretation that influence in Bayesian tree models has a local effect in terms of prediction.
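The node-validity guard can be sketched as follows; the Gaussian density form and the default minimum node size c=5 are assumptions for illustration:

```python
import math

def node_weight(y_i, mu_eta, sigma2, n_eta, c=5):
    """Unnormalized weight for predictions in the held-out observation's node:
    inverse Gaussian density of y_i if the node stays valid after deletion, else 0."""
    if n_eta - 1 < c:
        return 0.0                       # node invalid once the case is deleted
    dens = math.exp(-0.5 * (y_i - mu_eta) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)
    return 1.0 / dens

print(node_weight(3.0, 0.0, 1.0, n_eta=10))   # large: y_i = 3 fits the node poorly
print(node_weight(3.0, 0.0, 1.0, n_eta=5))    # 0.0: deletion violates the minimum
```

Predictions in all other terminal nodes simply keep weight 1, reflecting the locality of influence.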

The conditional independence structure of Bayesian trees allows this idea to be extended to functionals of other tree parameters, or to more than single-case deletion, although the practical calculation may become unwieldy as the factorization of the tree becomes more complex.

4.2 Re-weighting BART Predictions

We can extend the idea of reweighting to draws from the additive tree model of BART. Consider using BART's $m$-tree ensemble to predict at some new input setting $\mathbf{x}^*$. The added complexity in this situation arises from the possibility that the terminal nodes from the $m$ trees that will be used to predict the response at $\mathbf{x}^*$ may have full, partial or no dependence on the problematic observation $y_i$. That is, all terminal nodes may have contained $y_i$, or some subset of the terminal nodes may have contained $y_i$, or perhaps none. Proposition 2 describes the weighting scheme in this case.

Proposition 2: Consider functions, such as predictions, involving only the terminal nodes to which input $\mathbf{x}^*$ maps. Suppose $\mathbf{x}^*$ maps to the terminal nodes $\eta_1, \ldots, \eta_m$ across the $m$ trees. Then, if at least one of these terminal nodes contained the held-out observation, the weight takes the inverse-density form of Proposition 1 with the corresponding validity indicator; analogously, for predictions at inputs which do not map to any terminal node containing the held-out observation, the corresponding weight is constant.

Proposition 2 essentially says that predictions involving any subset of the terminal nodes to which $\mathbf{x}_i$ maps will be reweighted, and the weights are essentially the same except for the indicator function verifying the constraint. That is, the union of the rectangular regions defined by the terminal nodes to which $\mathbf{x}_i$ maps will be reweighted when predicting at a new $\mathbf{x}^*$ that lies somewhere in this union. This means calculating the weights is relatively more complex than the single-tree case described earlier, and it also suggests that the weighting will often be inefficient, much as the original method of bradlow:zaslavsky:1997 was when applied to the single-tree case. This is motivated by the fact that BART prefers shallow trees, and so the union of regions involving $\mathbf{x}_i$ across all trees may in fact be quite large.

It is tempting, then, to consider a more localized variant – a weighting scheme that only involves predictions that fall in the intersection of regions defined by the terminal nodes to which $\mathbf{x}_i$ maps. In fact, such an approach can again be supported by recalling that the BART likelihood involving a sum-of-trees mean function can, conditionally, be equivalently described by a single “super-tree” mean function (horiguchi:etal:2021), that is

$$\sum_{j=1}^{m} g(\mathbf{x}; T_j, M_j) = g(\mathbf{x}; T^s, M^s),$$

where $(T^s, M^s)$ represents the analogous super-tree representation. Note that the prior remains the same, even though we reinterpret the likelihood's sum-of-trees as a new, equivalent, single super-tree. Suppose again that $y_i$ is the problematic observation, observed at input $\mathbf{x}_i$, and let $\eta^s$ be the terminal node in $T^s$ to which $\mathbf{x}_i$ maps, noting that there is a single unique such terminal node in $T^s$. Let $A$ represent the hyperrectangle defined by $\eta^s$. Then we have the following.

Proposition 3: Let $\mathbf{x}^*$ be a prediction input of interest. Let $A_1, \ldots, A_m$ be the hyperrectangles in each tree of the BART ensemble such that $\mathbf{x}_i \in A_j$ for each $j$. Let $A$ be the hyperrectangle defined as the intersection of all the $A_j$'s, which corresponds to the supertree terminal node to which $\mathbf{x}_i$ belongs. Suppose also that the input $\mathbf{x}_i$ for influential observation $y_i$ also maps to $A$. Then to predict the response for all $\mathbf{x}^* \in A$, the weights take the inverse-density form

$$w \propto p\Big(y_i \,\Big|\, \textstyle\sum_{j=1}^{m} \mu_{\eta_j}, \sigma^2\Big)^{-1},$$

where $\mu_{\eta_j}$ is the terminal node parameter in tree $j$ to which $\mathbf{x}_i$ maps in the original sum-of-trees representation.

Note that in this version of the weighting scheme, the observation $y_i$ only maps to a single terminal node in the supertree representation, and this node corresponds to the intersection of the rectangular regions defined by all of the terminal nodes involving $\mathbf{x}_i$ in the original sum-of-trees representation. As such, Proposition 3 defines a more localized weighting scheme, and is also easier to manage from an implementation perspective.

4.2.1 A union of intersections

The practical implementation of Proposition 3 results in a different localized region for each of the posterior realizations. This makes predictions more computationally expensive. A practical alternative is to take some sort of “average” localized region as the single region to reweight, simplifying posterior prediction calculations. A natural choice is the union of the individual regions. Not only does this simplify the calculation of posterior predictions, it also means only a single region need be saved from model training, reducing the amount of memory required to store the model. Since each individual region is itself defined by the intersection resulting from the supertree, we refer to this method as union-int.
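The geometry behind union-int reduces to two elementary operations on axis-aligned hyperrectangles: a per-draw intersection, and a membership test against the union of those intersections. A minimal sketch, with boxes represented as (lower, upper) coordinate lists (the representation is an assumption for illustration):

```python
def intersect_boxes(boxes):
    """Intersection of axis-aligned boxes, each given as (lower, upper)
    coordinate lists; returns None when the intersection is empty.
    Per Proposition 3, this yields the supertree terminal-node region."""
    dim = len(boxes[0][0])
    lo = [max(b[0][d] for b in boxes) for d in range(dim)]
    hi = [min(b[1][d] for b in boxes) for d in range(dim)]
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None

def in_union(x, boxes):
    """Membership test for a union of boxes, e.g. the union-int region
    collapsed across posterior draws."""
    return any(b is not None and
               all(lo <= xi <= hi for xi, lo, hi in zip(x, b[0], b[1]))
               for b in boxes)
```

Note the union of hyperrectangles is generally not itself a hyperrectangle, so membership is tested box-by-box rather than by forming a single bounding region.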

4.2.2 An L1 distance alternative

Since the reweighting region of Proposition 3 is simply a hyperrectangle, it is tempting to consider a less involved approach. One possibility is, upon identifying a problematic observation, to take an L1 region around this point as the region to be reweighted, for some well-chosen scalar constant governing its size. We refer to this method as L1 and briefly consider this empirical alternative in Section 5.2. However, it turns out not to be practically useful, as choosing a good constant is itself an expensive optimization.
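The membership test for this alternative region is a one-liner; the sketch below assumes the region is the L1 ball of radius c around the flagged observation (function name hypothetical):

```python
def in_l1_ball(x, center, c):
    """True when x lies within L1 distance c of a flagged observation;
    this ball is the reweighting region of the L1 variant."""
    return sum(abs(a - b) for a, b in zip(x, center)) <= c
```

The difficulty discussed above is not the test itself but choosing c well, which requires its own tuning loop.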

5 Examples

5.1 Motivating Examples

To motivate our influence metrics, we start with simple 1-dimensional and 2-dimensional test functions. The 1D function is cubic, with noisy observations generated from the function values over its input domain. The 2D function is taken to be the popular Branin function, a smoothly varying response surface computed over a 2-dimensional domain as

where the constants follow picheny2013benchmark. The function exhibits steep slopes in some regions of the input space, particularly along the edges of the domain. The observations were again generated by adding noise to the function values. For both of these simple functions we fit the BART model using the default options, in particular the default number of trees, prior settings, and minimum number of observations per terminal node.
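For reference, a sketch of the Branin function in its standard parameterization is given below; note that picheny2013benchmark works with a rescaled version on the unit square, so this form is illustrative rather than the exact surface used in the experiments.

```python
import math

def branin(x1, x2):
    """Standard Branin function (x1 in [-5, 10], x2 in [0, 15]).  The paper
    follows the rescaled [0, 1]^2 version of picheny2013benchmark, so this
    parameterization is illustrative only."""
    a, b, c = 1.0, 5.1 / (4 * math.pi ** 2), 5.0 / math.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * math.pi)
    return (a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2
            + s * (1 - t) * math.cos(x1) + s)
```

The steep quadratic term dominates near the domain edges, which is where the diagnostics below tend to flag observations.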

5.1.1 Diagnostics

First, we consider the influence diagnostics and investigate two scenarios: no influential observations, and influential observations with a residual of 3 s.d. We refer to Cook’s distance (equation 2) as cooks, the first KL-divergence metric (equation 6) as KL-1 and the second KL-divergence metric (equation 7) as KL-2. As a reference, we compute cooks and KL-2 (Section 3.2) by plugging in reference s.d. residuals along with posterior mean estimates of the other relevant quantities, to serve as a gauge of the severity of each observation's calculated diagnostic. For KL-1 we use the estimated posterior quantiles. For the scenarios where influential observations were constructed, two such observations were formed: one at the center of the regression domain, and one at an edge of the domain.

1D Cubic
With no influential observations, the results of computing the discussed diagnostics are shown in Figure 3. This figure shows that the mean cooks and KL-2 diagnostics do a good job when there is nothing to detect. The maximum cooks and KL-1 diagnostics appear overly sensitive, as some observations are suggested as problematic. For KL-1 this is not surprising, as there will always be some diagnostic values falling above the empirical quantiles. Note that some observations also evaluate to infinity for the KL-divergence diagnostics. These observations violate the constraint of the model, and their locations tend to occur at the edges of the prediction domain, or where there are gaps in the data (not shown).

The results once we add in the influential observations are shown in Figure 4. We can see that both the mean cooks and KL-2 diagnostics are able to pick up the influential observations accurately (the KL-2 for the edge observation in fact evaluates to infinity, so is not shown in panel (iv)). We also note that the second constructed observation exerts greater influence than the first, as one would expect. The maximum cooks diagnostic also easily detects the two influentials, but additionally suggests a few other observations might be problematic when they are not. Similarly, KL-1 easily detects the influentials but again shows its propensity to flag observations as problematic where there are none. Note again that the observations for which the KL-divergence diagnostics evaluate to infinity tend to occur at the edges of the domain, where the constraint is most likely to be violated.

Figure 3: Influence diagnostics for the 1D cubic test function with no constructed influentials. Panel (i) displays the mean cooks diagnostic; (ii) displays the maximum cooks diagnostic; (iii) displays the KL-1 divergence diagnostic (excluding infinities); and (iv) shows the KL-2 divergence diagnostic (excluding infinities).
Figure 4: Influence diagnostics for the 1D cubic test function with influentials at and . Panel (i) displays the mean cooks diagnostic; (ii) displays the maximum cooks diagnostic; (iii) displays the KL-1 diagnostic (excluding infinities); and (iv) shows the KL-2 diagnostic (excluding infinities). The true influential observations are denoted by

Branin
The results for no influential observations for the Branin function are shown in Figure 5. Here we see that, as expected, the mean cooks and KL-2 diagnostics do not suggest any problematic observations. The maximum cooks and KL-1 diagnostics again suggest potentially problematic observations, which might indicate that these diagnostics are overly sensitive. As before, the KL-divergence diagnostics do not plot any observations whose criterion evaluated to infinity. In fact, 8 observations in this example did evaluate to infinity, indicating that the limit was violated once those observations were held out from their respective terminal nodes. These observations generally occurred at the edges of the domain and/or in regions where the response is changing rapidly. These are scenarios where it is known that the quality of BART’s fit can suffer, and it is interesting that the KL-divergence diagnostics (and possibly the maximum cooks diagnostic) are able to detect these issues.

Figure 5: Influence diagnostics for Branin test function with no constructed influentials. Panel (i) displays the mean cooks diagnostic; (ii) displays the maximum cooks diagnostic; (iii) displays the KL-1 diagnostic (excluding infinities); and (iv) displays the KL-2 diagnostic (excluding infinities).
Figure 6: Influence diagnostics for Branin test function with influentials at and . Panel (i) displays the mean cooks diagnostic; (ii) displays the maximum cooks diagnostic; (iii) displays the KL-1 diagnostic (excluding infinities); and (iv) displays the KL-2 diagnostic (excluding infinities). The true influential observations are denoted by

Adding in influential observations results in the diagnostic outputs shown in Figure 6. Here we see that the mean cooks diagnostic is able to pick up the influential at (0,1) easily and also suggests a potential problem with the influential at (0.5,0.5), although some observations around index 200 and index 400 give similar diagnostic values. The maximum cooks diagnostic easily detects the influential at (0,1), but also flags a few observations around index 200 and index 400 while missing the influential at (0.5,0.5). These plots suggest that while the cooks diagnostics can be useful, they may also suffer from higher than desired type-I and type-II errors. The KL-1 diagnostic has fairly good performance but also flags a few observations as problematic when they are not. The KL-2 diagnostic appears to be the most powerful of the four: it easily and strongly detects the two influential observations while successfully ignoring the observations that were not influential. As an added feature, this diagnostic again detected observations along the perimeter of the domain that evaluate to infinity (not shown).

5.1.2 Reweighting Predictions

Given the successful identification of problematically influential observations, an existing fit to our respective test functions can be reweighted to alleviate the impact of the influentials. We apply the methods described in Section 4: the method of bradlow:zaslavsky:1997, which we refer to as global, and the proposed methods of Propositions 1, 2 and 3, which we refer to as union, int and union-int respectively.

Figure 7: Reweighting posterior predictive distribution draws of the fitted cubic test function with influential observations. The original BART fit (uncorrected) is shown as the light gray lines along with +/-2 s.d. credible intervals, while corrected predictions and intervals are shown in black. Panel (i) shows the global correction, (ii) shows the union correction and (iii) shows the int correction. In panel (iv), the hyperrectangular regions to which the int correction is applied are shown for all 10K posterior draws. In comparison, the union correction is applied to the entire [0,1] domain, resulting in the same performance as global in this example.

The results for the cubic test function are shown in Figure 7. The original BART fit (light grey) demonstrates the local effect of the two influential observations. The three reweighting methods are summarized in Figure 7(i)-(iii). From this example we observe that global is the worst of the reweighting methods, noticeably affecting the quality of fit away from the influential observations. The union method, in this case, matches global’s performance. This somewhat counter-intuitive behavior arises from the fact that the union of hyperrectangles in this method ends up being the entire [0,1] input domain. The int method demonstrates much better performance, having nearly identical model fit quality to the original BART posterior away from the influential observations while correcting for the influential observations in their respective localities. These localities, defined by the intersection of hyperrectangles, are shown over all 10K posterior draws in Figure 7(iv). Finally, the union-int method provides the best performance by ‘collapsing’ the posterior draws in Figure 7(iv) while also being computationally cheaper to perform.

A similar behavior is seen for the Branin test function. Table 1 summarizes the performance by looking at in-sample and out-of-sample RMSE for the various BART predictors. Similar to the cubic test function, we see that the global and union methods have equal performance, since union again results in the union of hyperrectangles being the entire input domain. Both methods introduce variance in the predictor that inflates the prediction error relative to the baseline BART fit, denoted as oracle. Meanwhile, the int method again exhibits performance on par with the baseline BART fit while removing the influence of the outliers. The posterior intersection hyperrectangles detected by int are shown in Figure 8 (left panel), confirming that the reweighting procedure is being applied in appropriate regions of the input space. Finally, the union-int method provides slightly better performance than int by taking the regions shown in Figure 8 (right panel).

- oracle global union int union-int
in-sample 0.0430 0.0786 0.0786 0.0434 0.0430
out-sample 0.1085 0.1220 0.1220 0.1059 0.1052
Table 1: RMSE performance of BART predictors for the Branin test function.
Figure 8: Posterior hyperrectangle regions of influence for the Branin test function using the int method (left panel) and union-int method (right panel). The true influential observations are located at input settings and

5.2 Simulation Study

For a broader perspective on the performance of our detection and reweighting methods, we considered a simulation study using the 5-dimensional Friedman test function over the unit hypercube. We consider a single influential observation at the centroid, generated using an offset of 5. We also explore settings reflecting single-tree and default BART models respectively, and vary the sample size. All other settings were left at the BART defaults. At each of these experimental settings, 100 replicate runs were performed by generating a new dataset, fitting BART, and then calculating the usual BART posterior prediction. Performance was measured in terms of local and global prediction error. Global prediction error was estimated at out-of-sample inputs drawn over the entire domain, while local prediction error considered out-of-sample inputs drawn near the influential observation.
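A sketch of the simulated response surface is below; we assume the standard 5-dimensional Friedman (1991) form on the unit hypercube, which is the common parameterization of this benchmark.

```python
import math

def friedman5(x):
    """Standard 5-dimensional Friedman (1991) test function on [0, 1]^5,
    assumed here to be the form used in the simulation study."""
    x1, x2, x3, x4, x5 = x
    return (10.0 * math.sin(math.pi * x1 * x2)
            + 20.0 * (x3 - 0.5) ** 2 + 10.0 * x4 + 5.0 * x5)

# Influential observation at the centroid: the true value plus an offset of 5.
centroid = [0.5] * 5
y_influential = friedman5(centroid) + 5.0
```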

To generate the data with the outlier being influential enough to be detected and corrected, one can use the interpretation of (2) to motivate the offset added to the influential observation. Equation (2) allows one to ask how many standard deviations away an influential observation would need to be in order to be as influential as an observation 2 standard deviations away in a different-sized terminal node. The solution is given by an inequality involving the typical number of observations in a terminal node. Under the default tree prior we expect no more than 8 terminal nodes, so for a simulation study of this size a reasonable range of terminal node sizes can be determined. This yields a range of candidate offsets, from which we take an offset of 5 residual standard deviations for our simulated influential observation.
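The flavor of this calculation can be sketched as follows, under the assumption (ours, for illustration) that the diagnostic behaves like a Cook's distance proportional to the squared residual times the leverage factor h/(1−h)² with leverage h = 1/n_l; the function name and interface are hypothetical and this is not the paper's exact inequality.

```python
def matched_offset(n_ref, n_new, k_ref=2.0):
    """Offset (in residual s.d.) in a terminal node of size n_new that gives
    the same Cook's-distance-style influence as a k_ref-s.d. residual in a
    node of size n_ref, assuming D proportional to k^2 * h / (1 - h)^2 with
    leverage h = 1/n_l.  A back-of-the-envelope sketch only."""
    g = lambda n: (1.0 / n) / (1.0 - 1.0 / n) ** 2
    return k_ref * (g(n_ref) / g(n_new)) ** 0.5
```

As expected, matching against a larger (lower-leverage) node requires a larger residual offset.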

Criterion Weighting
oracle default 2.53 3.03 3.08 1.77 0.71
oracle oracle 2.20 2.43 2.95 1.00 0.53
oracle global 2.26 2.68 2.99 1.29 0.63
oracle union 2.29 2.71 2.95 1.29 0.63
oracle int 2.29 2.71 2.95 1.76 0.72
oracle union-int 2.26 2.68 2.99 1.30 0.64
oracle L1 2.37 2.82 3.03 1.45 0.65
cooks global 2.79 3.20 3.11 1.92 0.76
cooks union 2.26 2.57 2.02 1.92 0.76
cooks int 2.26 2.57 3.02 1.77 0.71
cooks union-int 2.81 3.17 3.08 1.71 0.71
cooks L1 2.48 2.98 3.08 1.73 0.71
KL-1 global 2.71 3.25 3.24 1.84 0.81
KL-1 union 2.34 2.66 3.17 1.84 0.81
KL-1 int 2.34 2.66 3.17 1.76 0.72
KL-1 union-int 2.74 3.18 3.16 1.45 0.65
KL-1 L1 2.48 2.95 3.10 1.55 0.66
KL-2 global 2.76 3.19 3.18 1.80 0.74
KL-2 union 2.36 2.74 3.24 1.80 0.74
KL-2 int 2.36 2.74 3.24 1.76 0.72
KL-2 union-int 2.78 3.13 3.15 1.49 0.64
KL-2 L1 2.49 2.97 3.08 1.58 0.65
Table 2: Local RMSE performance in region around influential observation for Friedman simulation study.
Criterion Weighting
oracle default 3.77 3.34 2.76 1.54 0.66
oracle oracle 3.76 3.35 2.78 1.48 0.66
oracle global 3.80 3.36 2.77 1.70 0.77
oracle union 4.04 3.51 2.79 1.70 0.77
oracle int 4.04 3.51 2.79 1.54 0.66
oracle union-int 3.80 3.36 2.77 1.54 0.66
oracle L1 3.77 3.34 2.77 1.54 0.66
cooks global 4.32 3.67 2.85 1.98 0.74
cooks union 4.57 3.99 3.00 1.98 0.74
cooks int 4.57 3.99 3.00 1.54 0.66
cooks union-int 4.32 3.66 2.83 1.55 0.66
cooks L1 3.77 3.35 2.77 1.54 0.66
KL-1 global 4.29 3.70 2.90 2.02 0.89
KL-1 union 4.60 4.01 3.07 2.02 0.89
KL-1 int 4.60 4.01 3.07 1.54 0.66
KL-1 union-int 4.29 3.68 2.89 1.55 0.66
KL-1 L1 3.77 3.35 2.77 1.54 0.66
KL-2 global 4.28 3.68 2.90 1.80 0.84
KL-2 union 4.58 4.03 3.06 1.80 0.84
KL-2 int 4.58 4.03 3.06 1.76 0.66
KL-2 union-int 4.28 3.67 2.87 1.49 0.66
KL-2 L1 3.77 3.35 2.77 1.58 0.66
Table 3: Global RMSE performance over entire prediction domain for Friedman simulation study.

The results are summarized in Tables 2 and 3. The results labeled default are regular BART without reweighting, while the oracle results are the best-case performance achieved by explicitly training BART with the influential observation removed. The reweighting schemes considered are labeled global (bradlow:zaslavsky:1997), union (Proposition 1), int (Proposition 2), union-int (Proposition 3) and L1, where the L1 method used a distance of 0.09. A criterion setting of oracle denotes when the true influential observations are taken as known, whereas cooks, KL-1 and KL-2 detect the influentials using our proposed diagnostics.

There are a few takeaways from the above study. First, as expected, the global method often provides the worst performance, particularly over the global prediction domain. That is, possible improvements in local prediction near the influential observation often come at the cost of a decrease in global performance. The union method also displays this unfavorable tradeoff, as it is most similar to the global method, even though its local prediction was often good. The int method appears to suffer from over-localization, making its performance more dependent on the behavior of the response surface and/or the settings of BART’s prior. The union-int method appears to be the approach that is broadly robust, providing best or near-best performance in both local and global metrics. The L1 method can also provide good performance, but its dependence on the tuning of a distance parameter would render it computationally problematic in most cases. Finally, while the BART oracle local performance remains out of reach for all methods, there is nonetheless a significant reduction in error offered by the best methods, which approach oracle-level performance in many cases, particularly for KL-2 with union-int.

Finally, we note that the influentials detected by cooks, KL-1 and KL-2 generally have a large degree of overlap, with perhaps some slight differences. This suggests combining the detected influentials across metrics to possibly increase performance. We suggest combining cooks with KL-2, since both can choose the detection threshold in the same principled manner.

5.3 Real World Example

Our motivating dataset comes from a study of biomass fuels and the application of artificial intelligence models to predicting the Higher Heating Value (HHV) of such fuels based on their molecular makeup (ghugare:etal:2014). Biomass fuels are the fourth largest source of energy, with the most common sources being solid products such as wood and biomass pellets. However, determining the HHV potential of a biomass fuel involves expensive and time-consuming calorimetric experiments. Instead, a popular alternative is to use mathematical models to approximate the HHV potential of a fuel source based on its makeup of key components. ghugare:etal:2014 consider a dataset in which the biomass covariates recorded include the amount of carbon, hydrogen, oxygen, nitrogen and sulfur present in the fuel (as a percentage of mass), with the response being the HHV value measured in MJ/kg. The dataset is available in the modeldata package on CRAN, and consists of 536 samples, of which 80 are test-set observations and 456 are training-set observations.

Figure 9: Influence diagnostics for the HHV training data when fit using BART with and trees. Panel (i) displays the mean cooks diagnostics, (ii) displays the maximum cooks diagnostic and (iii) displays the KL-2 diagnostic (excluding infinities).
Figure 10: Location of infinities (black triangles) as evaluated by the KL-2 diagnostic for the HHV training data (grey dots) when fit using BART.

We applied BART to the training data using the default settings, and explored our influence metrics to determine if there are any worrisome observations in the data. Figure 9 shows the resulting mean and maximum cooks diagnostics as well as the KL-2 diagnostic. All three metrics provide evidence of influential observations, though to varying degrees. The mean cooks diagnostic seems the least sensitive in this example, while the max cooks diagnostic is the most sensitive. The KL-2 diagnostic is somewhere in between these extremes, although there are additionally infinities for this metric that correspond to observations whose deletion would result in that observation's terminal node failing the minimum-size requirement. The covariate values of observations whose KL-2 metric evaluates to infinity are shown as black triangles in Figure 10. As expected, these observations are located in regions of relative data sparsity and/or towards the boundaries of the range of observed covariate values.

We also note there was generally agreement about which observations were potentially problematic amongst the three influence metrics. Based on this, we marked all 18 observations falling above the 2 s.d. (grey dashed) line for the KL metric as influentials.

default oracle global union int union-int GP MLP
training set 0.63 0.63 0.76 0.76 0.68 0.64 1.086 0.867
test set 1.22 0.98 1.21 1.21 1.11 1.06 0.942 0.987
Table 4: RMSE performance on the HHV dataset for BART model fits as well as GP and MLP fits from ghugare:etal:2014.

The RMSE performance of BART is summarized in Table 4, where again default is the regular BART fit, oracle is the fit obtained by dropping the influentials detected using KL-2, global is the reweighting method of bradlow:zaslavsky:1997, and the remaining methods are as proposed in this paper. In addition, the RMSE performance of ghugare:etal:2014's Genetic Programming (GP) and Multilayer Perceptron (MLP) models is also noted. The performance of BART's fit on the training dataset is very strong, decreasing somewhat for most of the reweighting methods. As in the simulation study, we again see the union-int method demonstrating the best performance, nearly matching the in-sample performance of the regular BART fit. In comparison, BART's performance on the test data is significantly worse than on the training data, and trails the GP and MLP models. Again, the union-int method provides the largest reduction in error for BART, bringing it close to the performance of GP and MLP on the test data. The remaining gap here could likely be explained by the smooth, continuous fits of the GP and MLP models, which would be a favourable characteristic for this dataset. Of course, BART also provides full uncertainty quantification, which the GP and MLP methods lack.

Of particular interest in ghugare:etal:2014 is the performance of the models at different regimes of HHV. In particular, they note difficulty in predicting high-HHV performance, and break down their performance summary into three ranges of HHV values: 0-16MJ/kg, 16-25MJ/kg and 25-36MJ/kg. The performance in these ranges is summarized in Table 5 for the default BART model as well as the oracle method and union-int reweighting method. The pattern obtained confirms ghugare:etal:2014's description of high HHV being particularly hard to predict. Nonetheless, the union-int method improves on the default BART fit in all three regimes, and in fact beats the oracle performance in the 16-25MJ/kg range where most of the observations lie. Still, it is hard to match the performance of GP and MLP in the 0-16MJ/kg and 25-36MJ/kg regimes, although in the high-HHV regime the oracle method dominates.

Range default oracle union-int GP MLP
0-16MJ/kg 1.88 1.50 1.70 1.16 0.90
16-25MJ/kg 0.95 0.93 0.89 0.84 0.81
25-36MJ/kg 2.80 1.01 2.07 2.55 1.55
Table 5: Range-wise RMSE performance on the HHV test dataset for BART model fits as well as GP and MLP fits from ghugare:etal:2014.

6 Conclusion

In this paper we proposed BART diagnostics for detecting influential observations, and devised reweighting procedures that allow posterior BART samples to be reweighted once influential observations are identified. The influence diagnostics include a (conditional) Cook’s distance metric, whose form is amenable to simple interpretation but only considers the effect of influentials on the mean function, and two KL-divergence metrics which measure the influence of an observation on the posterior distribution. Meanwhile, the reweighting procedures make use of importance sampling so that model training need only be done once, and the posterior samples obtained can be corrected by easily calculated weights to improve prediction performance.

Our methods were demonstrated on both simulated data and a real-world example involving biomass fuel HHV prediction. The consistently best method was the KL-2 diagnostic combined with the union-int reweighting procedure, which captures the empirical notion that highly flexible statistical learning models such as BART are affected locally by influential observations and so diagnostic and correction procedures need to capture this property in order to be practically effective. Generally our reweighting procedure provided 10-20% improvements in test-set prediction error as measured by RMSE while having negligible impact on training-set performance. In contrast, directly applying global methods such as the reweighting approach of bradlow:zaslavsky:1997 significantly deteriorated both test-set and training-set performance.

Our approach has focused on prediction performance as this is perhaps the most prominent use case for BART. Nonetheless, it would be interesting to explore extensions to alternative settings such as variable importance (horiguchi:etal:2021). However, in such settings factorizing the BART posterior in a way that allows weights to be efficiently computed is likely to be problematic and a more empirical approach perhaps motivated by the method in this paper may be more practical.

Overall, we have found that a surprising amount of improvement can be gained by addressing influential observations, even though conventional wisdom suggests that highly flexible statistical learning models like BART are not affected by such problematic observations due to their localized fits. In reality, when faced with large datasets and high-dimensional covariate spaces, the notion of ‘local’ is very much a misnomer. Even in 1 dimension, we can easily demonstrate the effect of influential observations on BART. Therefore, careful application of BART should at minimum include a diagnostic step to detect possibly problematic observations, upon which investigation, removal or the reweighting procedures proposed here can be performed.

Acknowledgements

The work of MTP was supported in part by the National Science Foundation (NSF) under Agreement DMS-1916231 and in part by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2018-CRG7-3800.3. The work of EIG was supported by NSF DMS-1916245. The work of REM was supported by NSF DMS-1916233.

References

Supplement

Proof of Equation (2)

Recall for linear models the Cook’s distance metric for observation can be expressed as (weisberg2013applied):

For our single-tree model, we have and we replace with a posterior sample of the variance, say . Then, we need only simplify the expression involving Let represent a vector of indicators where

By construction, note that for all column vectors such that , and , the number of observations mapping to terminal node Now suppose observation maps to terminal node , resulting in being a vector of zeros except in position (which is a 1). Then,

Substituting, the form of Cook’s distance given in Equation (2) results.
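The key step of the proof, that the hat matrix of a leaf-indicator design has diagonal entries equal to the reciprocal leaf counts, is easy to verify numerically. A minimal sketch (function name hypothetical):

```python
from collections import Counter

def hat_diagonal(leaf_assignments):
    """Diagonal of the hat matrix for a one-hot leaf-indicator design:
    X^T X is diagonal with the leaf counts n_l, so h_ii = 1 / n_{l(i)},
    mirroring the e_l^T (X^T X)^{-1} e_l = 1/n_l step in the proof."""
    counts = Counter(leaf_assignments)
    return [1.0 / counts[l] for l in leaf_assignments]
```

A quick sanity check on the result is that the trace of the hat matrix equals the number of terminal nodes, i.e. the model's degrees of freedom.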

Proof of Proposition 0.

First, we interpret the ratio as Now,

where since the data are independent conditional on and Then by the self-normalized importance sampling theorem (e.g. mcbook, Ch.9), we know that the estimator

as

by the strong law of large numbers, where

Therefore, by the continuous mapping theorem, it follows that
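The consistency of the self-normalized estimator is easy to illustrate numerically. The following sketch estimates the mean of a N(1,1) target using draws from a N(0,1) proposal, with log-densities known only up to a constant, exactly the situation the argument above covers; the setup is a toy example, not the paper's posterior.

```python
import math
import random

def snis_mean(target_logpdf, proposal_logpdf, draws):
    """Self-normalized importance sampling estimate of the target mean
    from proposal draws, as in the consistency argument above."""
    logw = [target_logpdf(x) - proposal_logpdf(x) for x in draws]
    mx = max(logw)                        # stabilize before exponentiating
    w = [math.exp(v - mx) for v in logw]
    return sum(wi * xi for wi, xi in zip(w, draws)) / sum(w)

random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
lp = lambda x, m: -0.5 * (x - m) ** 2     # N(m, 1) log-density up to a constant
est = snis_mean(lambda x: lp(x, 1.0), lambda x: lp(x, 0.0), draws)
```

Because the weights are normalized, any multiplicative constant in either density cancels, which is what makes the approach practical for reweighting posterior draws.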