We introduce a Bayesian composite Gaussian process as a model for generating and predicting non-stationary functions defined over an input space . Our model is motivated by and extends the work of Ba and Joseph (2012), who introduced a composite Gaussian process (CGP) as a flexible model for . They used evaluations at training data locations , , to predict at one or more new locations and to quantify uncertainty about these predictions.
The problem of predicting functions that are possibly non-stationary is particularly relevant, as many physics-based and other simulator models have been developed as alternatives to physical experimental platforms. Termed “computer experiments”, simulator-based studies have been used, for example, to determine the engineering design of aircraft, automobiles, and prosthetic devices, to optimize the manufacturing settings of precision products by injection molding, and to evaluate public policy options (Ong et al., 2008; Villarreal-Marroquín et al., 2017; Lempert et al., 2000).
A common approach to prediction and uncertainty quantification when analyzing data from a computer experiment is to represent the unknown function as a realization of a Gaussian process (GP). As there are many possible functions that are consistent with the observed values sampled at training locations , a GP is used as a prior distribution over an infinite-dimensional space of functions. When combined with the observed data, the resulting posterior distribution over functions can be used for prediction and uncertainty quantification. The use of a GP as a prior over functions was introduced by O’Hagan (1978) in a Bayesian regression context. This approach has subsequently been extended and used extensively in various settings related to both physical and computer experiments (e.g., Sacks et al., 1989; Neal, 1998; Kennedy and O’Hagan, 2001; Oakley, 2002; Banerjee et al., 2004; Oakley and O’Hagan, 2004; Santner et al., 2018).
Our interest lies in prediction and uncertainty quantification for functions that, when viewed as a draw from a GP, exhibit features inconsistent with stationarity, i.e. where the behavior of the function can be substantially different in different regions of the input space. Several existing methodologies exist for working with data generated by such functions. Perhaps the most widely-used is universal kriging (Cressie, 1993), which assumes the function can be viewed as a draw from a GP of the form
is a vector of known regression functions,is a vector of unknown regression coefficients, and is a stationary Gaussian process with mean zero, process variance , and (positive definite) correlation function so that has covariance
Throughout this paper the notation will be used to describe this stationary process assumption.
The intuition of the model is that describes large-scale trends while describes small-scale deviations from the large-scale behavior. A special case of universal kriging is ordinary kriging which assumes has constant mean. Cressie (1993) and Santner et al. (2018) provide details about the model (1), including parametric options for , methods for estimating model parameters, prediction methodology for test data inputs, and uncertainty quantification of the predictions.
While universal kriging has proved useful in many applications, several limitations have been identified. The requirement that the regression functions be known or adaptively selected from a pre-defined collection of regression functions sometimes proves difficult. In addition to bias due to potential misspecification of the regression functions, standard prediction intervals under universal kriging do not account for uncertainty in the selection of the regression functions. From a computational perspective, entertaining a large class of potential regression functions may result in a large selection problem, necessitating a combinatorial search over a large space. Finally, the kriging methods described above are based on trend-stationary Gaussian processes. In many applications, even if the mean function is appropriate, the unknown function being emulated may exhibit non-stationary behavior due to the variance function. Ignoring these aspects of the data may result in both poor prediction and inaccurate uncertainty quantification.
As a motivating example, consider the (non-stationary) function
which was originally considered by Xiong et al. (2007) and also by Ba and Joseph (2012) (we henceforth refer to (2) as the BJX function). Figure 1 plots the BJX function as a black line. The points in the figure indicate the value of the function at the training data locations used by Ba and Joseph (2012). If viewed as a realization of a stochastic process, one might describe the BJX function as having three behavior paradigms. For small , can be described as having a relatively flat global trend with rapidly-changing local adjustments. For intermediate , increases rapidly and smoothly, with few local departures. For large , has a relatively flat global trend with minor local adjustments.
Two aspects of universal kriging (UK) prediction of the BJX function are of interest: the accuracy of the point predictions and the narrowness of the associated uncertainty band. Figure 1 shows point predictions of for the constant- and cubic-mean UK predictors computed at a 0.01 grid of prediction locations; a nugget was not included and so the predictors interpolate at the 17 training data locations. While the constant- and cubic-mean predictors and uncertainty bands are similar for , differences can be seen when . Reversion to the global mean is evident for the constant-mean predictor, while the cubic-mean predictor exhibits a “bump” near that is driven by reversion to the estimated cubic mean function. The 95% prediction intervals based on the cubic mean are shorter than those based on the constant mean, however both sets of intervals are unreasonably wide when . Intuitively, the intervals should be short where y(x) is essentially flat.
To address these shortcomings, alternatives to universal kriging have been proposed. The treed Gaussian processes (TGPs) of Gramacy and Lee (2008) are one such alternative. The TGP model assumes that the input space can be partitioned into rectangular subregions so that a GP with a linear trend and stationary covariance structure is appropriate to describe in each region. Following Breiman et al. (1984), TGP methodology partitions the input space by making binary splits on a sequence of the input variables over their ranges, where splits can be made on previously-split inputs by using a subregion of the previous range. After the input space is partitioned, the data in each region are used to fit a prediction model independently of the fits for other regions. In earlier proposals for fitting data to each region, Breiman et al. (1984) fit a constant mean model to the data in each region, and Chipman et al. (1998) fit a Bayesian hierarchical linear model in each region. The TGP model extends Chipman et al. (1998) by fitting a GP with a linear trend and stationary covariance structure in each region. While TGP prediction can have computational advantages over kriging, one disadvantage is that the method can be numerically challenged when the number of training data locations in one or more regions is small, a situation often encountered in computer experiments.
Ba and Joseph (2012) provide another alternative to universal kriging for emulating functions exhibiting non-stationary behavior. Their composite Gaussian process (CGP) avoids specification and/or selection of regression functions that might be required to generate the unknown by specifying the generating GP as the sum
where, conditionally on model parameters , and are independent GPs such that
Under this specification, represents a smooth process that captures any global trend in , and represents a less-smooth process that introduces local adjustments to capture the function . By replacing the regression term in (1) with the more flexible process, the CGP model is able to adapt to large-scale global features of .
Ba and Joseph (2012) employ Gaussian correlation functions and for the global and local processes, respectively, where and are corresponding correlation parameters. To ensure that the global process is smoother than the local process—and hence is interpretable as a global trend—a vector of positive bounds is specified so that , . Even though the conditional process mean, , is constant across the input space, the examples in Ba and Joseph (2012) and that of the BJX function below show that CGP often has greater prediction accuracy than ordinary kriging or even universal kriging (when the global trend is difficult to capture with pre-specified regression functions).
The variance of the CGP is . The term is a positive function that allows the range of the local process , and hence the range of , to vary over the input space. Ba and Joseph (2012) describe an algorithm for estimating that is implemented in their R package CGP (Ba and Joseph, 2018).
Figure 2 plots CGP predictions of the BJX test function (2) based on the same run training data as above. For this example, the CGP predictions are clearly more accurate across the input space than predictions under both kriging approaches shown in Figure 1. The global predictor is smooth and captures the overall trend of the function well. When the data are less volatile, as over the range , the global predictor essentially interpolates the data, and the local predictions are approximately zero. Comparing uncertainty quantification between the methods, for small , CGP produces intervals that appear slightly wider than the intervals under both kriging approaches. The CGP interval widths for large appear to fall in between the interval widths for kriging with constant and cubic mean functions, and indicate a large amount of uncertainty about the function in a region where it is essentially flat.
This paper introduces a Bayesian composite Gaussian process (BCGP) model that modifies and extends the CGP model in several ways. The BCGP model extends the covariance structure used by Ba and Joseph (2012) to allow the global and local correlation functions to be differentially weighted. This provides the covariance function with greater flexibility to handle data sets where more or less local adaptation is required. An additional feature of the BCGP model is that it introduces a new, flexible approach for handling the variance function . Direct modeling of the latent variance process is straightforward in the Bayesian context as it simply requires a new level in a hierarchical model and an additional step in a Markov chain Monte Carlo algorithm. We believe this direct approach to modeling will result in more accurate representations of uncertainty and will provide the model with additional flexibility for adapting to situations where the range of varies significantly across the input space. More generally, by formulating the model in a Bayesian context we are able to quantify uncertainty in the unknown model parameters. By fully integrating over the unknown parameters to predict , the methodology allows one to fully quantify uncertainty in the predicted values.
After the BCGP model is introduced in Section 2, Section 3 describes the computational algorithm we have developed for prediction and uncertainty quantification. Section 4 performs prediction and uncertainty quantification for three examples. The first example is the BJX example, the second example is a setting that, visually, appears stationary, and the third example performs prediction for a analytic example of the wing weight of a light aircraft.
2 The Bayesian Composite Gaussian Process Model
This section describes a Bayesian composite Gaussian process (BCGP) model that can be used to predict functions , , that, when viewed as a draw from a stochastic process, exhibit behavior consistent with non-stationarity. We assume that (perhaps after a suitable transformation) the input space is a -dimensional, finite hyper-rectangle denoted by , with for . As part of the model specification below, we extend the GP notation for stationary processes, , for use with nonstationary GPs by letting indicate that follows a Gaussian process with and covariance function . Throughout, we assume that the training data have been centered to have mean zero and scaled to have unit variance.
2.1 Conditional Model
The conditional (likelihood) component of the BCGP model assumes that can be viewed as a realization from a random process that can be decomposed as
where , and are mutually independent Gaussian processes. As in the CGP of Ba and Joseph (2012), the decomposition includes a global component, , and a local deviation component, . However, as seen below, our model specification differs in significant ways. First, the model allows for the possible inclusion of a measurement error or nugget process (see Gramacy and Lee, 2012, for a detailed discussion of the use of a nugget term in GP models for computer simulator output). Ba and Joseph (2012) argue that, due to the formulation of their CGP, the local process may mimic a nugget term in some situations and hence do not include such a term explicitly. We recognize that different practitioners will have different views on inclusion of a nugget component and note that, while we have formulated the BCGP model to include the for completeness, the nugget component can be easily removed if desired.
Conditional on model parameters , we assume , where
is a positive function, is a global correlation function, and is a weight. The local process is specified as , where
and is a local correlation function. The process
is a mean zero Gaussian white noise process with variance.
The functions and are taken to be the Gaussian correlation functions
with unknown parameters and . The quantities and are positive constants selected to enhance the numerical stability of operations on the correlation matrices of the data; the values are often appropriate when the data have been scaled to have unit variance. As with the CGP model, we take to be a smooth process that captures any global trend of , while adapts to local deviations. The relative smoothness of draws from and is controlled by the global and local correlation parameters and . We force to be smoother than by embedding constraints in the joint prior distribution for and .
Conditional on , the process (3) can be equivalently specified as
where is the overall mean, and
The specification in (3)-(5) emphasizes the decomposition of the process into global, local and error components, while the specification in (7)-(8) emphasizes the roles of the parameters in the overall covariance function.
As noted in Section 2.1, the parameters and control the smoothnesses of the component processes in . The parameter determines the extent that the model can make local adaptations to the global process; no local adaption is allowed when . The final parameter is . From (8), . In the applications we consider, is typically small relative to the overall range of , and hence plays the critical role in prediction and uncertainty quantification with respect to the model variance. The conditional BCGP model relies on knowing the form of , which is typically not available in practice. Rather than defining an algorithm for estimating as in Ba and Joseph (2012), we propose to model directly this function as an unknown random function by assuming
where is the Gaussian correlation function in (6) with parameters .
Modeling the variance function as a latent process provides a model-based approach for flexibly estimating the volatility of the unknown function across the input space. Specification of the model in this way introduces new, low-level parameters , and that drive the unobserved process. Our model for the variance process is easily handled in our inferential and predictive framework for two reasons. First, because we use MCMC methods for inference and prediction, the fact that the variance process is simply a level in a Bayesian hierarchical model means that values of and of the hyper-parameters of this latent process can be updated by additional sampling steps. Second, due to the initial scaling of the data, it is possible to use a prior distribution to center the parameters of the log Gaussian process around reasonable values. This allows us to anchor the function along a plausible trajectory while giving it the freedom to adapt to information contained in the training data.
2.2 Prior Model
We complete the specification of the BCGP model with a prior distribution on the unknown model parameters that factorizes as follows:
As is common in the literature, we assume a flat, location-invariant prior, , for the overall process mean. When the error process is included in the model, we assign a gamma prior distribution to its variance, , parameterized so that
. For data from a computer simulator we typically select the hyperparameters so thatis, a priori
, close to zero with high probability(see Section4 for examples).
The global correlation parameters are assumed to be independent of each other with , for . While in principle one could chose different hyperparameters for each input dimension, reflecting different a priori beliefs about the function along each input, in absence of such knowledge we typically set each and equal to common values and
. To enforce greater smoothness in the global process than in the local process in each dimension, we specify the prior for the local correlation parameter conditionally on the corresponding global parameter as a beta distribution truncated to the interval:
refers to a beta random variable truncated to the intervalwhich has density
with mean and variance . Lacking substantive prior information about the parameters and of we typically use common values and across the inputs.
The prior for the parameter that weights the global and local correlation functions is taken to be , where . Often the prior for is truncated with and to put more weight on the global process.
Finally, we assign a prior to the parameters of the latent variable process in (9) to have mutually independent components with marginals
represents the inverse gamma distribution with meanwhen . To specify values for the six hyper-parameters above recall that, ignoring the error variance , is the variance of the process. Assuming that the output has been scaled to have zero sample mean and unit sample variance, we “center” our prior so that on average. Setting , , , and encourages the process to stay near unity on average while allowing the data to suggest regions of the input space where should be larger or smaller. Lastly, the hyperparameters and can be chosen to control the smoothness of the latent variance process. In general, we expect this process to be fairly smooth, which suggests picking values that encourage high correlation. If there is a strong prior belief that the unknown may be best modeled as a stationary process, then setting the and the to ensure that the are close to 1, will encourage a (nearly) constant variance function. Setting gives a non-informative distribution.
3 Computational Algorithms for Inference and Prediction
This section describes the computational algorithms we have developed for inference and prediction under the BCGP model. Assume that the unknown function has been sampled at training data sites in the input space, denoted , , and is the associated vector of computed values of . To simplify notation, let be the random vector of unknown values of the variance process at the training data locations. We augment the collection of parameters introduced in Section 2.1 to include all unknown quantities so that now , , , , , , , , . The posterior distribution of all unknown quantities has density function , where is the prior specified in Section 2.2. The likelihood is derived from the conditional model specified in (7) and (8), which implies that where the covariance matrix has element, ,
and is the Kronecker delta function.
The posterior density is difficult to compute in closed form. For inferential and predictive purposes, we obtain samples from the posterior via a Markov chain Monte Carlo (MCMC) algorithm. We update parameters and the values based on their full conditional distributions either by Gibbs updates—sampling directly from full conditional distributions—or by Metropolis–Hastings updates—sampling from proposal distributions and accepting or rejecting the proposed draws based on full conditional distributions. Some updates are relatively straightforward, while others—in particular, the update of the latent variance process —require special attention in order to ensure good mixing of the chain.
The MCMC algorithm starts at an initial value and, at each iteration , the elements of are updated according to the 9 steps below. The chain can be initialized with any satisfying . For simplicity the following notation is used. At any step in the sampler during iteration , the notation represents a vector containing the newly-sampled values for any parameters that have already been updated in the (partial) sweep through the steps at iteration . Similarly, for steps with Metropolis–Hastings updates, the notation should be understood to mean where the parameters currently being updated are replaced with proposed values. For a generic parameter , the notation should be understood to be the up-to-date version of the parameter vector without component . Unless otherwise specified, for all proposal distributions used in Metropolis–Hastings updates, if the proposed value is , the updated value is taken to be
In our MCMC algorithm, a Metropolis–Hastings update for a parameter relies on a calibrated proposal width to help ensure reasonable mixing of the chain. Section 3.2 provides details of the calibration scheme.
At iteration in the MCMC algorithm, the parameters are updated according to the following steps.
- Step 1:
Update by sampling directly from its full conditional distribution,
where is a vector of ones and is the covariance matrix with elements (11) evaluated at the training data points using the parameters .
- Step 2:
Update by proposing from a distribution and using (12) to determine the value of .
- Step 3:
Update the global correlation parameters one-at-a-time (conditioning on the others) by proposing from a distribution and using (12) to determine the value of each .
- Step 4:
Update the local correlation parameters one-at-a-time (conditioning on the others) by proposing from a distribution and using (12) to determine the value of each .
- Step 5:
Update by proposing from a distribution and using (12) to determine the value of .
- Step 6:
Update by sampling directly from its full conditional distribution,
where and , , and is the correlation matrix for the process evaluated at the training data locations with elements
- Step 7:
Update by sampling directly from its full conditional distribution,
- Step 8:
Update one-at-a-time (conditioning on the others) by proposing from a distribution and using (12) to determine the value of each .
- Step 9:
Update as described in Section 3.1.
In practice, the MCMC algorithm is run for three sets of iterations. The first set are calibration iterations in which a fixed number of iterations are made in which the proposal widths are determined for the subsequent runs (see Section 3.2). After calibration, the chain is run for an additional burn-in period. The final set of iterations are production iterations that produce samples from the posterior distribution . The samples can be used for predictive inference as described in Section 3.3.
3.1 Updating the latent variance process
Updating the latent variance process at the training data locations requires special attention. The full conditional posterior distribution of neither nor its logarithm, , are standard distributions and so sampling updated values directly is difficult. When the number of training data locations, , is not “too large”, we can use a Metropolis–Hastings update for the full vector by sampling from a proposal process at the training data locations and accepting the proposed move with the appropriate probability. While straightforward in principle, the proposal process must be constructed carefully in order to ensure acceptance rates that result in appropriate mixing of the chain. When is large it is difficult to accept the entire vector of proposed values unless the proposal vector is very close to the current vector, which inhibits mixing.
With this in mind, we describe two different methods for updating the latent variance process. The first method is designed to work well when is “small”, while the second method is constructed to produce reasonable mixing when is large. When the number of training data locations is small, say , we recommend updating by sampling
where , , and is a predefined value that controls the variance of the proposal distribution. The proposal distribution (13) is centered around the current value at each training data location. The variance parameter should be chosen so that the proposed values, , are (1) similar enough to the current values, , to have a useful acceptance rate while (2) still allowing the to be sufficiently different from the values so the support of the posterior distribution can be fully explored. After appropriately accounting for the transformation, the acceptance probability for the proposed value is , where
When the number of training data locations is large, say , we recommend an alternate approach to updating . Rather than updating all elements of together, we instead randomly select a focal point from the design space and then update the variance process at a cluster of training locations closest to the chosen focal point conditionally on the current value of the variance process at all other training data locations. After updating the process at that cluster of points, we randomly select another focal point in the design space and repeat the process. At each iteration in the overall MCMC algorithm, the process is repeated so that total focal points are sampled, and the final vector after the cycles is retained as new state in the Markov chain. While yields a valid MCMC algorithm, we expect setting should improve mixing of the chain. The following steps describe the details of this process.
- Step 9a:
Select a focal point uniformly at random from the -dimensional, hyper-rectangular input space .
- Step 9b:
Select the training data locations closest to the randomly selected focal point. In practice, we have found that choosing works well. Denote these points by and the remaining training data points by .
- Step 9c:
Propose new values by sampling from the distribution obtained by conditioning (13) on the current values :
where is the correlation matrix for the log GP between the proposal locations, is the correlation matrix for the log GP between the locations in , and is an matrix with each column containing the correlation for the log GP between a proposal point and each of the locations in .
- Step 9d
: Update the elements of corresponding to the locations with the values with probability , where is as in (14); otherwise, do not change .
- Step 9e:
After repeating Steps (9a)-(9d) times, set and
In our examples, we typically set so that , which has resulted in satisfactory mixing of the chain.
3.2 Calibrating the proposal widths
The Metropolis–Hastings updates described above rely on proposal widths , , , and . Appropriate values must be chosen in order to ensure good mixing of the chain. We use an automated method to calibrate these proposal widths with the goal of selecting widths that result in acceptance rates of between approximately and . It has been shown theoretically that in specific contexts acceptance rates in this range lead to chains with good convergence and mixing properties (e.g., Gelman et al., 1996; Roberts et al., 1997; Roberts and Rosenthal, 2001); empirical evidence in many different model and data settings suggests these rates are generally desirable.
To adaptively calibrate the proposal widths, we initially run the MCMC algorithm with user-specified widths . The proposal widths can be different for each parameter , , etc. After iterations, we compute the empirical acceptance rates separately for each parameter with a proposal width and compare these acceptance rates to a range of target rates (our implementations uses the range ). If any individual empirical rate is outside of this range the proposal width for that parameter is updated to be , where is a specific target rate. Under this scheme, a proposal width will be increased when the empirical acceptance rate is too high and decreased when too low relative to the target. After updating the , the MCMC algorithm is continued for another iterations. The adaptation scheme is terminated after a total of adaptation periods. Section 4 provides examples of how we have implemented this approach in practice.
Because the transition kernel is potentially changing throughout the adaptation period, we discard all samples at the end of the adaptation periods and start a new MCMC run using the final state of the chain as the starting values and fixing the calibration widths at their final values. As we do not assume that we start the chain from stationarity, we typically allow for an additional burn-in period before collecting production samples from the posterior.
3.3 Prediction and Uncertainty Quantification
A primary objective is to use the methodology to predict the output of a computer simulator (or other source) at new input values. Quantification of uncertainty about these predictions is also desired. Focusing on a particular (single) input location
, predictive inference under the BCGP model is obtained via the posterior predictive distribution
where the unknown parameters are integrated over their “likely” values as specified by the posterior distribution. The point prediction is taken to be the posterior predictive mean . Uncertainty about the unknown value of is quantified by a posterior predictive interval computed as lower and upper percentiles of the posterior predictive distribution.
To compute the predictions, note that the conditional distribution of given and is
where is the covariance matrix at the training data locations with elements calculated as in (11) and is the vector of covariances between the process at the prediction input and the process at the training input locations; these elements are
for . The conditional distribution (15) can be used to construct the Rao–Blackwellized Monte Carlo estimate
of using the posterior samples obtained with the MCMC algorithm, where quantities superscripted by are computed using the -th draw of the parameters, .
Computing requires the term , the square root of the a posteriori sample of the latent variance function at . While the method for updating the latent variance process described in Section 3.1 produces samples of the latent variance process at the training data locations, it does not automatically produce samples at the prediction location. If the prediction location is known in advance, the approach described in Section 3.1 can be modified to include the prediction location together with the training data locations. The resulting can be saved for prediction.
If the prediction location is not known before the MCMC algorithm is run, samples can be obtained after the end of the MCMC run by simulating from the appropriate conditional distribution. In more detail, letting , we have
where is the correlation matrix of the latent log GP evaluated at the training data locations and has -th element . The term is the vector of covariances between the latent log GP at and each of the training data inputs , i.e., , for . Each posterior sample , , is used to generate a draw from (17); transforming yields which is required to evaluate the vector in (16).
Uncertainty about the unknown value of the function at the input is quantified via a posterior predictive interval computed by finding the upper and lower percentiles of the posterior predictive distribution. As discussed in Davis (2015), Rao–Blackwellized Monte Carlo estimates of these quantities are more difficult to compute than the Rao–Blackwellized point predictions described above. A computationally simpler approach is to obtain the percentiles required to construct the predictive intervals by first obtaining samples , , from the conditional predictive distribution (15) using the and samples. Then, for example, the and percentiles of this set of samples are Monte Carlo estimates of the endpoints of the posterior predictive interval. Note that averaging the samples together would also produce a valid estimate of ; however, when computationally feasible, we prefer the less-noisy, Rao–Blackwellized approach.
The methods of prediction and uncertainty quantification described above are specific to a single new input location . Point-wise prediction and uncertainty quantification at several new input locations , , is easily achieved by implementing the methods separately at each location.
3.3.1 Global and Local Components of Prediction
Ba and Joseph (2012) emphasized that predictions under the CGP model can be decomposed into global and local components. Decomposing predictions in this way allows one to visually assess how the behavior of the unknown function changes over the input space, e.g. by finding regions where large local adaptations are necessary.
Posterior predictions under the BCGP model can be similarly decomposed by rewriting the conditional posterior predictive mean (15) as