DeepAI
Log In Sign Up

Scaled Vecchia approximation for fast computer-model emulation

05/01/2020
by   Matthias Katzfuss, et al.
0

Many scientific phenomena are studied using computer experiments consisting of multiple runs of a computer model while varying the input settings. Gaussian processes (GPs) are a popular tool for the analysis of computer experiments, enabling interpolation between input settings, but direct GP inference is computationally infeasible for large datasets. We adapt and extend a powerful class of GP methods from spatial statistics to enable the scalable analysis and emulation of large computer experiments. Specifically, we apply Vecchia's ordered conditional approximation in a transformed input space, with each input scaled according to how strongly it relates to the computer-model response. The scaling is learned from the data, by estimating parameters in the GP covariance function using Fisher scoring. Our methods are highly scalable, enabling estimation, joint prediction and simulation in near-linear time in the number of model runs. In several numerical examples, our approach substantially outperformed existing methods.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

06/18/2019

Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

Generalized Gaussian processes (GGPs) are highly flexible models that co...
11/03/2021

Scalable mixed-domain Gaussian processes

Gaussian process (GP) models that combine both categorical and continuou...
05/08/2018

Vecchia approximations of Gaussian-process predictions

Gaussian processes (GPs) are highly flexible function estimators used fo...
12/21/2020

Statistical Modelling and Analysis of the Computer-Simulated Datasets

Over the last two decades, the science has come a long way from relying ...
08/21/2017

A general framework for Vecchia approximations of Gaussian processes

Gaussian processes (GPs) are commonly used as models for functions, time...
02/01/2023

Short-term Prediction and Filtering of Solar Power Using State-Space Gaussian Processes

Short-term forecasting of solar photovoltaic energy (PV) production is i...
06/03/2021

Uncertainty Quantification of a Computer Model for Binary Black Hole Formation

In this paper, a fast and parallelizable method based on Gaussian Proces...

1 Introduction

At the cutting edge of science, computationally intensive simulations are used to make predictions of complex phenomena, such as the distribution of matter in the Universe (Lawrence et al., 2017), the behavior of materials under high pressure (Walters et al., 2018), or the composition of rocks on Mars (Bhat et al., 2020). These simulations are simply too slow for use in data analysis (Higdon et al., 2004) or real-time applications (Mehta et al., 2014), so the statistics discipline known as computer experiments has grown to address this computational challenge. The key ingredient in much of this work is an emulator, a statistical approximation to the computer simulation. Emulators can predict the output of a simulation many orders of magnitude faster than the simulation itself, at the cost of additional error. Emulation is achieved by building a regression model from the inputs to the outputs.

Gaussian processes (GPs) are popular emulators and have emerged as indispensable tools for design, analysis, and calibration of computer experiments (e.g., Sacks et al., 1989; Kennedy and O’Hagan, 2001). GPs are accurate, flexible, interpretable, and probabilistic, thus providing natural quantification of uncertainty. For the analysis of computer-model runs, GP inference typically requires working with a dense covariance matrix. Thus, direct GP inference is infeasible for many present and future computer experiments, as new supercomputers enable increasingly large numbers of increasingly detailed simulations to be carried out. Scalability improvements for computer-experiment methods are vital to handle the expected increase in simulation output.

Many approaches have been proposed to enable scalable GP inference. Heaton et al. (2019) review and compare approaches from spatial statistics, and Liu et al. (2020)

review approaches in machine learning. In the context of large computer experiments, scalable GP approaches include compactly supported covariances

(Kaufman et al., 2011); sparse grid-based GPs (Plumlee, 2014); and the local approximate GP (laGP) of Gramacy and Apley (2015), which makes independent predictions at different input values using only nearby observations in the input space. In spatial statistics, the Vecchia approximation (Vecchia, 1988) and its extensions (e.g., Stein et al., 2004; Datta et al., 2016; Guinness, 2018; Katzfuss and Guinness, 2019; Katzfuss et al., 2020a)

are very popular GP approximations. Similar to the laGP, the Vecchia approximation considers nearest neighbors, but it does so from an ordered conditional perspective; as a result, Vecchia approximations imply a valid joint distribution for the data, resulting in straightforward global likelihood-based parameter inference and joint predictions at a set of input values, which allows proper uncertainty quantification in down-stream applications.

Here, we adapt and extend the powerful class of Vecchia GP approximations from spatial statistics to enable the scalable analysis and emulation of large computer experiments. Specifically, we apply Vecchia’s ordered conditional approximation in a transformed input space, for which each input is scaled according to how strongly it is related to the computer-model response. This greatly improves the accuracy in the high-dimensional input spaces common in computer experiments (as opposed to the usual two-dimensional space in spatial statistics). The scaling of the input space is learned from the data, by estimating parameters in the GP covariance function using Fisher scoring (Guinness, 2019). Our scaled Vecchia methods are highly scalable, enabling ordering, neighbor-search, estimation, joint prediction and simulation in near-linear time in the number of model runs. Thus, our methods can handle large numbers of model runs, joint sampling of paths through the input space, and relatively high input dimensions, assuming that only some of the inputs have a strong effect on the output, while others are less important. Recently, Shi et al. (2017) also applied a Vecchia-type approximation in a GP-emulation setting, but their focus was on infinitely smooth covariances (i.e., squared exponential) and change-of-support problems.

The remainder of this document is organized as follows. In Section 2, we describe GP emulation of computer models, and we review existing Vecchia approximations from spatial statistics. In Section 3, we introduce our new scaled Vecchia methods for fast emulation of large computer experiments. In Section 4, we provide numerical comparisons to existing approaches. Section 5 concludes and discusses future work. R code to run our method and reproduce all results is available at https://github.com/katzfuss-group/scaledVecchia.

2 Review

2.1 Computer-model emulation using Gaussian processes

Let be the response of a computer model at a

-dimensional input vector

on the input domain . In Gaussian-process emulation, is assumed to be a Gaussian process (GP) with mean function and a positive-definite covariance or kernel function . Then, the vector of responses at input values follows an

-variate Gaussian distribution with covariance matrix

whose th entry describes the covariance between the responses of simulations and as a function of the corresponding input values and .

For simplicity, we henceforth make some additional assumptions, although most of our methodology is also applicable in more general situations. Specifically, we assume that the mean function is linear in a number of covariate parameters, ; typical assumptions are or .

We also assume an anisotropic covariance function with a separate range parameter for each input dimension , also referred to as automatic relevance determination: , where

(1)

While GPs are indispensable tools for computer-model emulation due to their flexibility and natural uncertainty quantification, direct GP inference requires an factorization of the covariance matrix, which is not feasible for large computer experiments. Thus, we propose an approximation that reduces computational complexity and hence improves scalability.

2.2 Vecchia approximations in spatial statistics

Vecchia’s approximation (Vecchia, 1988) is a powerful GP approximation that is popular in spatial statistics. Motivated by the exact decomposition of the joint density as a product of univariate conditional densities, Vecchia (1988) proposed the approximation

(2)

where is a conditioning index set of size for all (and ). Even with relatively small conditioning-set size , the approximation (2) can often be very accurate due to the screening effect (e.g., Stein, 2011). The in (2) are all Gaussian distributions that can be computed in parallel using standard formulas, each using operations based on data.

The Vecchia approximation has many useful properties. For example, the implied joint distribution is also multivariate Gaussian, and the Cholesky factor of is highly sparse with fewer than off-diagonal nonzero entries (e.g., Datta et al., 2016; Katzfuss and Guinness, 2019). Further, under the sparsity constraint implied by the choice of the , the Vecchia approximation results in the optimal inverse Cholesky factor , as measured by the Kullback-Leibler (KL) divergence, (Schäfer et al., 2020). Enlarging the conditioning sets never increases the KL divergence (Guinness, 2018); for , the approximation is exact, . In contrast to some other GP approximations, the Vecchia approximation to the underlying model is global; thus, for example, model parameters can be estimated (see Section 3.2) from a subsample of the data, and then the estimated parameters can be used to make predictions (Section 3.3) using all of the data.

The approximation accuracy of the Vecchia approach depends on the choice of ordering of the variables and on the choice of the conditioning sets . A general Vecchia framework (Katzfuss and Guinness, 2019) obtained by varying these choices unifies many popular GP approximations (e.g., Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2007; Banerjee et al., 2008; Katzfuss, 2017; Katzfuss and Gong, 2019). In practice, high accuracy can be achieved using a maximum-minimum distance (maximin) ordering and nearest-neighbor (NN) conditioning, which are illustrated in Figure 0(a). Maximin ordering picks the first variable arbitrarily, and then chooses each subsequent variable in the ordering as the one that maximizes the minimum distance to previous variables in the ordering. For NN conditioning, each then consists of the indices corresponding to the nearest previously ordered variables. For both ordering and conditioning, distance between two variables and is typically defined as the Euclidean distance between their corresponding inputs. In addition, we employ a grouping strategy (Guinness, 2018) that combines conditioning sets and when doing so is computationally advantageous. When using maximin ordering and NN conditioning, recent results (Schäfer et al., 2020) imply that, for increasing , a specific accuracy for certain isotropic Matérn kernels can be guaranteed using conditioning sets of size , under regularity conditions and ignoring edge effects. The resulting near-linear time complexity is the best known complexity for problems of this type.

3 Methodology

In principle, Vecchia approximations of spatial GPs can be directly applied to emulation of computer experiments (Section 2.1). However, while physical distance between spatial locations is usually meaningful, Euclidean distance between inputs to a computer experiments depends heavily on the arbitrary scaling of each input dimension. In addition, while spatial fields are typically two- or three-dimensional, computer experiments often consider inputs; as the asymptotics discussed at the end of Section 2.2 imply that is required to achieve a certain accuracy, a very large might be required for large , resulting in a prohibitive computational cost (which scales cubically in ).

3.1 Scaled Vecchia approximation for computer experiments

(a) MN of , shown in
(b) MN of , shown in
(c) MN of , shown in
Figure 1: Maximin ordering and nearest-neighbor conditioning (MN) for inputs (small grey points) generated using Latin hypercube sampling on in dimensions, assuming an anisotropic covariance (1) with range parameters . MN is carried out on the original inputs (top row, red) or the scaled inputs (bottom row, black). The first ordered inputs are numbered, with emphasis on the th input () and its nearest previously ordered neighbors with indices (). (a) MN of original inputs viewed on original space : First inputs are spread out over input space, are nearby. (b) Same MN on scaled space : First inputs are irregularly spaced, missed nearby and . (c) MN of scaled inputs on scaled space: First inputs are spread out over input space, are nearby, as desired.

Hence, we propose a scaled Vecchia approximation that exploits that the input variables can vary widely in the magnitude of their effect on the response; this is sometimes referred to as factor sparsity. Specifically, for known range parameters , the anisotropic distance in (1) can be viewed as a Euclidean distance of scaled inputs,

(3)

where are the scaled inputs, and we call the relevance of the th input dimension or variable (assuming standardized input space ). Similar scaling ideas have been considered for other GP approximations of computer experiments (e.g., Gramacy, 2016).

Our scaled Vecchia approximation is defined as in (2), except based on a maximin ordering and NN conditioning of the scaled inputs , assuming known for now; more precisely, we define the distance between variables and as the Euclidean distance between the corresponding scaled inputs (see Figure 0(c)), instead of in the standard Vecchia approximation. Note that this scaled Vecchia approximation can be viewed as a special case of correlation-based Vecchia (Kang and Katzfuss, in prep.). The ordering and conditioning can be computed in quasilinear time in (Schäfer et al., 2017, 2020).

The resulting scaled Vecchia approximation of the GP with anisotropic kernel , can be viewed as a standard Vecchia approximation of a GP with isotropic kernel with scaled inputs in the scaled input space . Importantly, Euclidean distance is only meaningful in , not in . Figure 0(b) shows that maximin ordering of can be highly irregular in , and nearest-neighbor conditioning of may miss important and nearby inputs in . In contrast, scaled Vecchia (Figure 0(c)) is directly carried out in ; the resulting ordering is more regular, and the conditioning set contains the nearest previously ordered neighbors, as desired to achieve good screening properties in the conditional distributions in (2).

Many computer codes contain input variables that only weakly affect the response ; this can be captured in our model by a large , so that changes in only result in small changes in , and thus only minor changes in position in . In the extreme case of , the input variable is effectively eliminated from the model and the dimension of is smaller than the dimension of the original input space, and thus a smaller is required to achieve a certain approximation accuracy. But even for large but finite range parameters, Figure 1(a) shows that scaled Vecchia can achieve a certain accuracy with much smaller than standard Vecchia (see Section 4.2 for more details).

3.2 Estimation of parameters

In practice, the parameters in the mean function and parameters in the covariance function may be unknown, including the range or scaling parameters . We estimate these parameters by maximizing the logarithm of the Vecchia likelihood in (2). This is challenging due to the potentially large number of parameters. Hence, we use a Fisher scoring algorithm (Guinness, 2019), which exploits first- and second-derivative information for fast convergence but preserves the scaling of the Vecchia approximation. We briefly review this algorithm here, but refer to Guinness (2019) for details.

Let , where is the Vecchia approximation from (2) with , except that we have now made explicit the dependence of the density on the parameters. Taking derivatives of the conditional densities in (2) is challenging; replacing them by joint distributions,

(4)

enables the use of well-known formulas for the gradient and Fisher information of the Gaussian distributions and . Because appears linearly in the mean of the Gaussian distributions, we can profile out using a closed-form expression for . Then, starting with an initial value , Fisher scoring for proceeds for as

(5)

where and can be computed based on (4) as the sum of log-densities that are at most of dimensions . The algorithm is terminated when the dot product between the step and the gradient is less than , obtaining the estimates and

. In practice, a mild penalization term (e.g., to discourage variance parameters that are much larger than the sample variance of the training data) is added to (

4) to improve convergence. Also, when the Fisher-scoring step fails to increase the loglikelihood, the step is replaced by a line search along the gradient. This concludes the review of Guinness (2019).

In our scaled Vecchia approach, over the course of the Fisher-scoring iterations, the estimate of will change, and along with it, the scaled inputs , the resulting maximin ordering and NN conditioning, and the implied approximate density . For the purpose of computing the derivatives and the resulting parameter update in (5), we ignore the dependence of the ordering and conditioning on ; instead, we update the ordering and conditioning separately given the current estimate , but only at certain iterations, say , to avoid slowing the algorithm unnecessarily. Hence, as we refine the estimates of the parameters, we refine our Vecchia approximation of the implied covariance. If it is of interest, we can carry out crude variable selection and eliminate inactive input dimensions by setting if is over a certain threshold (e.g., ).

Figure 1(b) shows that our scaled Fisher-scoring approach can be much more accurate than using standard Vecchia (see Section 4.2 for more details). The estimation algorithm converged quickly, typically requiring only around ten iterations to estimate eleven parameters.

(a) Known parameters
(b) Estimated parameters
Figure 2: For data simulated from a Matérn GP in input dimensions, comparison of our proposed scaled Vecchia (SVecchia) approach to two existing GP approximations, in terms of average difference in log score (dLS), which approximates KL divergence (see Section 4.2 for details)

3.3 Prediction

Given the estimated parameters and , we would like to predict the response at unobserved inputs, . This is equivalent to obtaining the posterior distribution of , where . To be able to compute this distribution even for large or , we apply a Vecchia approximation to the joint density , where . To do so, we employ a maximin ordering of the scaled inputs corresponding to , under the restriction that the entries of are ordered before those in (Guinness, 2018; Katzfuss et al., 2020a). As a result, we can write , where is as before in (2

), and the desired posterior predictive distribution is

(6)

and contains the indices of the variables that are closest to (in terms of scaled distance) among those that are previously ordered in . It is possible for few or even none of the indices in a particular set to correspond to observations, so that only conditions on other unobserved prediction variables; however, because these prediction variables may, in turn, condition on observations, the resulting predictions can be as good or better marginally (and much more accurate jointly) than predictions using only observations (e.g., compare the methods LF-full and LF-ind in the first two rows of Fig. 5 in Katzfuss et al., 2020a).

As in (2), all the univariate conditionals, , are Gaussian and can be computed in time. It is straightforward to, for example, compute the mean of or draw joint samples from using the expression (6). In addition, is jointly Gaussian with a sparse inverse Cholesky factor, from which any distributional summary of interest can be computed (Katzfuss et al., 2020a). These properties enable joint simulation and uncertainty quantification for a set of unobserved input values, such as a path through the input space.

3.4 Implementation

We implemented our methods in R, building on top of the R package GpGp (Guinness and Katzfuss, 2018). We provide the anisotropic covariance function matern_scaleDim as in (1), where is the isotropic Matérn covariance (e.g., Stein, 1999). We also provide its special cases for half-integer smoothness values 0.5, 1.5, 2.5, 3.5, 4.5, which avoid expensive Bessel functions. Parameter estimation is based on the Fisher-scoring procedure in GpGp; at iterations , we update the ordering and conditioning of the current scaled inputs , using the exact maximin ordering algorithm implemented in GPvecchia (Katzfuss et al., 2020b). Each ordering and conditioning can be computed in quasilinear time in (Schäfer et al., 2020), and in practice the added time is negligible relative to a standard Vecchia approximation that keeps the ordering and conditioning fixed. For prediction, we again modify and extend functions in GpGp to allow predictions in scaled input space.

Due to the global nature of the Vecchia approximation (see Section 2.2), it is possible to separate training of our emulator (i.e., parameter estimation) from prediction. As parameter estimation requires multiple Fisher-scoring iterations, we recommend using relatively small conditioning sets of size for the Vecchia density (2) used for the parameter estimation described in Section 3.2, and of larger size for the Vecchia approximation of the predictive distribution in (6). In addition, our numerical experiments below showed that a random subsample of the training data of size in the low thousands was enough to estimate the small number of unknown parameters for the considered here. For this SVecchia procedure, the computational cost is independent of the full training size , aside from negligible pre-processing costs. For parameter estimation, each Fisher-scoring iteration scales roughly as ; given the trained emulator, prediction scales as . The computations can be carried out in parallel across the and terms.

The code is available at https://github.com/katzfuss-group/scaledVecchia. Using default settings, scaled-Vecchia estimation and prediction is as simple as:

    fit   <- fit_scaled( y.train, inputs.train )
    preds <- predictions_scaled( fit, inputs.test )

3.5 Design

Our methods can also be extended straightforwardly for the design of computer experiments. For example, consider the following two-stage design of total size . In the first stage, we obtain a small number of runs, say , with input values chosen by a space-filling design, such as a Latin hypercube (LH). Then, we apply our estimation method from Section 3.2 to the responses to obtain an estimate of , including the estimated ranges . In the second stage, we “oversample”, say inputs values using a LH design, and then choose the first inputs in a maximin ordering of the scaled space determined by the range estimates from the first stage. Finally, based on the resulting full dataset of size , we can re-estimate the parameters, and make predictions at unobserved input values as described in Section 3.3. Note that such a “sensitivity-weighted distance” has previously been considered for small sequential designs in Williams et al. (2011).

In addition, our methods can be used for designs based on optimization criteria (e.g., Mockus, 1989; Jones et al., 1998), sometimes referred to as Bayesian or model-based optimization. These sequential designs at each stage require re-estimation of parameters and predictions at large numbers of inputs (e.g., to compute the expected improvement), which can be carried out rapidly using our methods.

4 Numerical comparisons

4.1 General information

We carried out numerical studies comparing the following methods:

[itemsep=1pt,topsep=2pt,parsep=1pt]

SVecchia:

Our proposed scaled Vecchia approximation, as described in Section 3

Vecchia:

Existing standard Vecchia approximation, with maximin ordering and nearest-neighbor conditioning based on Euclidean distance between inputs

LowRank:

Modified predictive process (Finley et al., 2009), equivalent to Vecchia, except that all variables simply condition on the first variables in the (Euclidean) maximin ordering: for

laGP:

Local approximate GP (Gramacy and Apley, 2015; Gramacy, 2016)

H-laGP:

Hybrid global-local extension of laGP (Sun et al., 2019, Sec. 3) with pre-scaling based on a random subsample of size 1,000

For SVecchia, Vecchia, and LowRank, we assumed zero mean , was assumed to be a Matérn covariance with smoothness 3.5 and zero nugget. For each comparison, training input values were generated using Latin Hypercube sampling using the R package lhs (Carnell, 2019), and test inputs were sampled uniformly at random on .

4.2 Matérn simulations

We considered responses simulated from a GP with mean zero and Matérn covariance function with smoothness 3.5 in dimensions. We assumed two “relevant” input dimensions with range parameters , and eight less relevant inputs with range parameters . Only squared-exponential covariances are implemented in laGP, and so laGP was not included in this comparison. For the other three methods, we considered the average difference in log scores (dLS; Gneiting and Katzfuss, 2014) or loglikelihoods, , over ten datasets simulated from the true model. This score approximates the KL divergence between the true and approximated model.

Figure 1(a) shows the dLS when assuming that the covariance function (including its parameters) was known. Vecchia was more accurate than LowRank, but SVecchia resulted in additional, substantial improvement. For example, SVecchia with was more accurate than Vecchia (or LowRank) with ; due to the cubic scaling in , this implies a 1,000–fold decrease in computational cost for a given accuracy.

For Figure 1(b), the parameters were assumed unknown and estimated from the data, but the resulting dLS using were very similar to the known-parameter case.

4.3 Borehole function

We carried out a simulation study comparing prediction accuracy for the Vecchia-based methods (i.e., SVecchia, Vecchia, and LowRank) using the popular borehole-function example (Morris et al., 1993), which models the water-flow rate through a borehole as a function of input variables. For various training-data sizes , we estimated parameters based on the training data, and made predictions at unobserved test inputs. We computed mean square error (MSE) between the true test responses and the corresponding predictive means obtained by each of the approximation methods with different values of . For each and , we repeated the experiment ten times, averaged the results, and computed the root MSE (RMSE). We used for parameter estimation and for prediction; for SVecchia, a training subsample of size was used for estimation if .

(a) Increasing
(b) Increasing
Figure 3: Root mean square error (RMSE, on a log scale) for prediction at unobserved inputs using different GP approximations for the borehole example (see Section 4.3 for more details)

The results are shown in Figure 3. For scale, the trivial predictor given by the average of the training data had an RMSE around 45. SVecchia outperformed the other methods for every combination of and shown in the plots. Note that RMSE is plotted on a log-scale. Thus, for example for and , the seemingly small improvement of SVecchia over Vecchia actually corresponds roughly to a 50% reduction in RMSE. Figure 2(a) shows that LowRank’s accuracy did not improve much with , and so this method was not considered for the large- comparisons below. There is a trade-off with the tuning parameter , which determines the size of the conditioning or neighbor sets: For all methods, increasing resulted in higher accuracy (Figure 2(b)), but the computational cost also increases roughly cubically with (see Section 3.4 for a discussion of SVecchia’s cost).

4.4 Test functions

We then considered larger datasets generated using three physical models from the Virtual Library of Simulation Experiments (Surjanovic and Bingham, 2013), including the borehole function from Section 4.3. We generated training inputs and test inputs, and averaged the results over five datasets for each test function. For SVecchia and Vecchia, we used and a subsample of the training data of size for parameter estimation, and and all training data for prediction. For laGP and H-laGP, we used or neighbors for both, and we manually specified a much smaller nugget () than the default value to obtain more accurate predictions. Timing results were obtained on a basic desktop computer (3.4GHz Intel Quad Core i5-3570), using one core (single-threaded) for SVecchia and Vecchia and using all four cores for laGP and H-laGP.

(a) borehole ()
(b) robot arm ()
(c) piston ()
Figure 4: Comparison of root mean square error (RMSE) versus computing time (both on a log scale) for test functions in Section 4.4, with two different tuning-parameter settings for each function and method. We provide training and total (i.e., training plus prediction) times, on a single core for (S)Vecchia and on four cores for (H-)laGP.

The results are summarized in Figure 4, and the detailed numerical results are given in Table 1 in Appendix A. For all three test functions, SVecchia was the most accurate, despite having the lowest computational cost.

In general, it is difficult to set a comparison in which all methods are placed on perfectly equal footing. The Vecchia approaches used a subsample for parameter estimation, and the full training set with a larger conditioning-set size for prediction. H-laGP also uses data for global pre-estimation, but in a different manner. Both laGP methods do some estimation on the fly. For all methods, increasing the size of the conditioning or neighbor sets improves the accuracy but also increases the computational cost. Timing results will also depend heavily on a number of other factors, including , , implementation, parallelization, and the computing environment. Due to the good parallelization properties of the laGP implementation, laGP prediction times could potentially be pushed below those of single-core SVecchia by using enough cores for laGP.

4.5 Computer model for satellite drag

Finally, we carried out comparisons using a computer simulator for atmospheric drag coefficients of satellites in low Earth orbit under varying input conditions. A detailed description of the computer model and a previous analysis using state-of-the-art GP emulators can be found in Sun et al. (2019), with data and results available at https://bitbucket.org/gramacylab/tpm/src/master/. In short, we considered simulations of drag coefficients for the Hubble space telescope with inputs. The simulation runs consist of responses for each of six pure chemical species, which can be combined into actual drag coefficients by computing a weighted average of the species.

As in Sun et al. (2019, Sect. 6.1), we first carried out a 10-fold cross-validation (CV) exercise, separately for each of the six species. For the Vecchia-based methods, we used for prediction, and we used and a randomly selected subset of size for parameter estimation. We also tried estimation using the full dataset (i.e., ) and a larger , but the increase in predictive accuracy was small relative to the increase in computational cost. The parameter estimates were quite stable between different CV folds. One example of the estimated relevance is shown in Figure 4(b) for each input variable and each species; the highest and lowest relevance differed by two-to-three orders of magnitude, indicating that scaling the input dimensions accordingly should be useful for emulating this simulator.

(a) CV root mean square percentage error (RMSPE)
(b) SVecchia estimates of relevance of input variables
Figure 5: Results for the six chemical species in the satellite-drag simulator. H-laGP: hybrid global-local extensions of laGP. Relevance: (see details below (3))

Figure 4(a) shows a comparison of CV prediction accuracy in terms of root mean square percentage error (RMSPE). We compared the Vecchia-based methods to the 19 laGP variants considered and described in Sun et al. (2019), seven of which are versions of the basic, local-only laGP, and twelve of which are hybrid global-local laGP (H-laGP) extensions. Vecchia was more accurate than the basic laGP methods, but none of these approaches was able to achieve the standard benchmark of a 1% relative error, indicated by the horizontal line. In contrast, SVecchia met the benchmark and was the most accurate method for all six chemical species. While the accuracy improvement might look small on the log scale of Figure 4(a), note that the RMSPE of the best-performing laGP method (“alcsep2.sb”) was considerably higher than the SVecchia RMSPE for several species, ranging from roughly 2% higher for H, to around 14% for O and N, up to 40% for He. This is especially remarkable when considering that the total time for estimation and prediction for SVecchia was only around 13 to 14 minutes (4–5min for estimation and roughly 9min for prediction) per species and fold, on a single core on a basic desktop computer; the best-performing laGP method took up to 45 core hours according to Sun et al. (2019), which is around 200 times as long.

We also examined predictions along likely trajectories in low Earth orbit, which corresponds to paths in input space. Sun et al. (2019, Sect. 6.2) consider two trajectories, for a quiet and active regime, each for ten-second intervals (i.e., about one day). Predictions are made for each of the six pure chemical species, which are then averaged according to weights corresponding to the actual chemical compositions for each of the two regimes. The resulting RMSPE for the best laGP method was about 39% and 8% higher than for SVecchia for the quiet and active regimes, respectively. However, the trajectories traverse only such a small fraction of the input space, that comparing prediction scores for only two such trajectories is not statistically meaningful. Vecchia even happened to have a smaller RMSPE than SVecchia for the active regime. More importantly, given estimated parameters, joint prediction using SVecchia scales linearly in , the number of test inputs. Thus, SVecchia can produce joint predictions (e.g., samples from the joint predictive distribution) for the day-long trajectory with in less than one minute on a single core; this is less time than it takes the most accurate laGP method (“ALC-ex”) to compute predictions for small subsets of size .

5 Conclusions and future work

We have introduced a fast and accurate scaled-Vecchia approximation for Gaussian-process emulation of large computer experiments. The Vecchia approach relies on an ordered conditional approximation, which results in a joint global likelihood and natural joint prediction and uncertainty quantification. Maximin ordering ensures that high accuracy can be achieved by simply conditioning on (previously ordered) nearest neighbors. For the high input dimensions prevalent in computer experiments, our approach applies the Vecchia approximation in a scaled input space, for which the scaling parameters are automatically determined from the data using a fast parameter-estimation procedure. For fixed conditioning-set sizes, this estimation procedures requires linear time in the number of estimation data, while joint prediction scales linearly in the number of prediction points.

In several numerical comparisons, our proposed method strongly outperformed existing approximations, in that it was able to produce more accurate results in less computational time. For example, for the satellite-drag computer simulator, even a basic version of scaled Vecchia was more accurate and several orders of magnitude faster than the state-of-the-art laGP approaches. As it can produce highly accurate joint predictions with a few lines of code in minutes on modest computers even for big datasets, we consider scaled Vecchia to be a good candidate for a default approach for emulating large computer experiments.

Additional improvements in prediction accuracy may be possible for our method by considering nonstationary covariance functions, such as a Matérn covariance whose parameters vary over input space (Paciorek and Schervish, 2006); ordering and conditioning should then be correlation-based (Kang & Katzfuss, in prep.). Such a correlation-based approach would also be possible for joint emulation for multivariate or functional computer-model output.

More sophisticated frequentist variable (i.e., input-dimension) selection could be achieved by adding a lasso-type L1 penalty for the inverse range parameters to (4

). MCMC-based Bayesian inference can also be accurately approximated using Vecchia approaches

(Finley et al., 2019; Katzfuss and Guinness, 2019, App. E); straightforward extensions include scaling the input space at certain MCMC iterations, and variable selection (Linkletter et al., 2006).

Non-Gaussian computer-model responses could be analyzed by combining scaled Vecchia with the Vecchia-Laplace approximation of generalized GPs (Zilber and Katzfuss, 2019). Finally, it would be interesting to investigate the use and extension of our methods in the context of computer-model calibration (Kennedy and O’Hagan, 2001).

Acknowledgments

Katzfuss’s research was partially supported by National Science Foundation (NSF) Grants DMS–1654083 and DMS–1953005, and by a Texas A&M University System National Laboratories Office grant on “Scalable Gaussian-Process Methods for the Analysis of Computer Experiments.” Lawrence’s research was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20200065DR. Guinness’s research was supported by the NSF under grant No. 1916208 and the National Institutes of Health under grant No. R01ES027892. We would like to thank Bobby Gramacy and Furong Sun for providing help, data, and laGP results for the application in Section 4.5.

Appendix A Detailed test-function comparison results

Table 1 provides the specific accuracy and timing results underlying Figure 4, as discussed in Section 4.4.

borehole () robot arm () piston ()
method E time (min) E time (min) E time (min)
SVecchia (30/140) 3.2 1.60.7 2.3 2.6 0.80.8 1.6 1.9 1.10.7 1.8
Vecchia (30/140) 3.4 1.61.1 2.7 3.5 0.71.0 1.7 9.5 1.10.9 2.0
laGP (30) 19.0 04.4 4.4 11.3 04.0 4.0 135.4 04.3 4.3
H-laGP (30) 4.0 3.24.3 7.5 3.3 3.94.0 7.9 14.1 3.14 7.1
SVecchia (50/140) 1.6 4.30.8 5.1 2.5 1.70.7 2.4 1.7 20.7 2.7
Vecchia (50/140) 3.3 6.21 7.2 3.5 1.91.0 2.9 9.1 2.50.9 3.4
laGP (50) 10.9 013.4 13.4 10.8 011.5 11.5 103.2 012.0 12.0
H-laGP (50) 2.4 3.111.7 14.8 3.1 3.711.3 15.0 9.3 2.311.5 13.8
Table 1: Comparison for test functions in Section 4.4. E: root mean square error. Computing times for training prediction total, on one core for (S)Vecchia and on four cores for (H-)laGP. Numbers after method names are for SVecchia and Vecchia, and for laGP and H-laGP. Smallest errors and computing times are highlighted in red for each test function and for the first and last four rows, respectively.

References

  • Banerjee et al. (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(4):825–848.
  • Bhat et al. (2020) Bhat, K. S., Myers, K., Lawrence, E., Colgan, J., and Judge, E. (2020). Estimating scale discrepancy in Bayesian model calibration for ChemCam on the Mars Curiosity Rover. arXiv:2004.04301.
  • Carnell (2019) Carnell, R. (2019). lhs: Latin Hypercube Samples. R package version 1.0.1.
  • Datta et al. (2016) Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.
  • Finley et al. (2019) Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414.
  • Finley et al. (2009) Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53(8):2873–2884.
  • Gneiting and Katzfuss (2014) Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151.
  • Gramacy (2016) Gramacy, R. B. (2016). LaGP: Large-scale spatial modeling via local approximate Gaussian processes in R. Journal of Statistical Software, 72(1):1–46.
  • Gramacy and Apley (2015) Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
  • Guinness (2018) Guinness, J. (2018). Permutation methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.
  • Guinness (2019) Guinness, J. (2019). Gaussian process learning via Fisher scoring of Vecchia’s approximation. arXiv:1905.08374.
  • Guinness and Katzfuss (2018) Guinness, J. and Katzfuss, M. (2018). GpGp: Fast Gaussian Process Computation Using Vecchia’s Approximation. R package version 0.2.2.
  • Heaton et al. (2019) Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 24(3):398–425.
  • Higdon et al. (2004) Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2004). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26(2):448–466.
  • Jones et al. (1998) Jones, D. R., Schonlau, M., and W. J. Welch (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.
  • Katzfuss (2017) Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.
  • Katzfuss and Gong (2019) Katzfuss, M. and Gong, W. (2019). A class of multi-resolution approximations for large spatial datasets. Statistica Sinica, accepted.
  • Katzfuss and Guinness (2019) Katzfuss, M. and Guinness, J. (2019). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, accepted.
  • Katzfuss et al. (2020a) Katzfuss, M., Guinness, J., Gong, W., and Zilber, D. (2020a). Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological, and Environmental Statistics, accepted.
  • Katzfuss et al. (2020b) Katzfuss, M., Jurek, M., Zilber, D., Gong, W., Guinness, J., Zhang, J., and Schäfer, F. (2020b). GPvecchia: Fast Gaussian-process inference using Vecchia approximations. R package version 0.1.3.
  • Kaufman et al. (2011) Kaufman, C. G., Bingham, D., Habib, S., Heitmann, K., and Frieman, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. The Annals of Applied Statistics, 5(4):2470–2492.
  • Kennedy and O’Hagan (2001) Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B, 63(3):425–464.
  • Lawrence et al. (2017) Lawrence, E., Heitmann, K., Kwan, J., Upadhye, A., Bingham, D., Habib, S., Higdon, D., Pope, A., Finkel, H., and Frontiere, N. (2017). The Mira-Titan universe. II. Matter power spectrum emulation. The Astrophysical Journal, 847(1):50.
  • Linkletter et al. (2006) Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006). Variable selection for Gaussian process models in computer experiments. Technometrics, 48(4):478–490.
  • Liu et al. (2020) Liu, H., Ong, Y.-S., Shen, X., and Cai, J. (2020). When Gaussian process meets big data: A review of scalable GPs.

    IEEE Transactions on Neural Networks and Learning Systems

    .
  • Mehta et al. (2014) Mehta, P. M., Walker, A., Lawrence, E., Linares, R., Higdon, D., and Koller, J. (2014). Modeling satellite drag coefficients with response surfaces. Advances in Space Research, 54(8):1590–1607.
  • Mockus (1989) Mockus, J. (1989). Bayesian Approach to Global Optimization. Kluwer, Dordrecht, NL.
  • Morris et al. (1993) Morris, M., Mitchell, T., and Ylvisaker, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics, 35(3):243–255.
  • Paciorek and Schervish (2006) Paciorek, C. and Schervish, M. (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483–506.
  • Plumlee (2014) Plumlee, M. (2014). Fast prediction of deterministic functions using sparse grid experimental designs. Journal of the American Statistical Association, 109(508):1581–1591.
  • Quiñonero-Candela and Rasmussen (2005) Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.
  • Sacks et al. (1989) Sacks, J., Welch, W., Mitchell, T., and Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4(4):409–435.
  • Schäfer et al. (2020) Schäfer, F., Katzfuss, M., and Owhadi, H. (2020). Sparse Cholesky factorization by Kullback-Leibler minimization. arXiv:2004.14455.
  • Schäfer et al. (2017) Schäfer, F., Sullivan, T. J., and Owhadi, H. (2017). Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205.
  • Shi et al. (2017) Shi, H., Kang, E. L., Konomi, B. A., Vemaganti, K., and Madireddy, S. (2017). Uncertainty quantification using the nearest neighbor Gaussian process. In Chen, D.-G., Jin, Z., Li, G., Li, Y., Liu, A., and Zhao, Y., editors, New Advances in Statistics and Data Science, pages 89–107. Springer.
  • Snelson and Ghahramani (2007) Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics 11 (AISTATS).
  • Stein (1999) Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, NY.
  • Stein (2011) Stein, M. L. (2011). 2010 Rietz lecture: When does the screening effect hold? Annals of Statistics, 39(6):2795–2819.
  • Stein et al. (2004) Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B, 66(2):275–296.
  • Sun et al. (2019) Sun, F., Gramacy, R. B., Haaland, B., Lawrence, E., and Walker, A. (2019). Emulating satellite drag from large simulation experiments. SIAM-ASA Journal on Uncertainty Quantification, 7(2):720–759.
  • Surjanovic and Bingham (2013) Surjanovic, S. and Bingham, D. (2013). Virtual Library of Simulation Experiments: Test Functions and Datasets. http://www.sfu.ca/~ssurjano.
  • Vecchia (1988) Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50(2):297–312.
  • Walters et al. (2018) Walters, D. J., Biswas, A., Lawrence, E. C., Francom, D. C., Luscher, D. J., Fredenburg, D. A., Moran, K. R., Sweeney, C. M., Sandberg, R. L., Ahrens, J. P., et al. (2018). Bayesian calibration of strength parameters using hydrocode simulations of symmetric impact shock experiments of Al-5083. Journal of Applied Physics, 124(20):205105.
  • Williams et al. (2011) Williams, B. J., Loeppky, J. L., Moore, L. M., and MacKlem, M. S. (2011). Batch sequential design to achieve predictive maturity with calibrated computer models. Reliability Engineering and System Safety, 96(9):1208–1219.
  • Zilber and Katzfuss (2019) Zilber, D. and Katzfuss, M. (2019). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv:1906.07828.