Large-scale local surrogate modeling of stochastic simulation experiments

Gaussian process (GP) regression in large-data contexts, which often arise in surrogate modeling of stochastic simulation experiments, is challenged by cubic runtimes. Coping with input-dependent noise in that setting is doubly challenging. Recent advances target reduced computational complexity through local approximation (e.g., LAGP) or otherwise induced sparsity. Yet these do not economically accommodate a common design feature when attempting to separate signal from noise. Replication can offer both statistical and computational efficiencies, motivating several extensions to the local surrogate modeling toolkit. Introducing a nugget into a local kernel structure is just the first step. We argue that a new inducing point formulation (LIGP), already preferred over LAGP on the speed-vs-accuracy frontier, conveys additional advantages when replicates are involved. Woodbury identities allow local kernel structure to be expressed in terms of unique design locations only, increasing the amount of data (i.e., the neighborhood size) that may be leveraged without additional flops. We demonstrate that this upgraded LIGP provides more accurate prediction and uncertainty quantification compared to several modern alternatives. Illustrations are provided on benchmark data, real-world simulation experiments on epidemic management and ocean oxygen concentration, and in an options pricing control framework.


1 Introduction

Continued advances in high-performance computing (HPC) and techniques like particle transport and agent-based modeling yield enormous corpora of stochastic simulation data. For a nice review of state-of-the-art simulation and modeling in such contexts, and challenges going forward, see baker2020analyzing. Examples include simulators for disease/epidemics (johnson2018phenomenological; hu2017sequential; fadikar2018calibrating), tumor spread (ozik2019learning), inventory/supply chain management (hong:nelson:2006; xie2017heteroscedastic), ocean circulation (herbei2014estimating), radiation/nuclear safety (werner2018mcnp6), and more. In many cases, in particular those cited above, the simulator can exhibit input-dependent noise, or so-called heteroskedasticity. When the data are noisy, in addition to the usual nonlinear dynamics in the mean structure, a large and carefully designed simulation campaign is essential for isolating signal.

And so is modeling. Gaussian process (GP) regression is the canonical choice of surrogate model for simulation experiments because it provides accurate predictions and uncertainty quantification (UQ), which facilitate many downstream tasks such as calibration, input sensitivity analysis, Bayesian optimization, and more. See Santer2018 or gramacy2020surrogates for a review of GPs and for additional context. However, standard GP inference and prediction scale poorly to large data sets. GP modeling involves a multivariate normal (MVN) distribution whose dimension matches the size of the training data, i.e., $N$ for a regression on $N$ runs. The quadratic cost of storing covariance matrices, and the cubic cost of decomposing them for the determinants and inverses involved in likelihoods, say, limits $N$ to the small thousands. For many stochastic computer experiments in modest input dimension, such small simulation campaigns are insufficient for learning. Or, conversely, larger experiments cannot be modeled by GPs.

Numerous strategies abound in diverse literatures (machine learning, spatial/geostatistics, computer experiments) to scale up GP capabilities to larger $N$. Most rely on approximation. Some take a divide-and-conquer approach, constructing multiple GPs in segments of a partition of the design space (Kim2005; Gramacy2008; park2018patchwork). A local approximate GP (LAGP; Gramacy2015) offers a kind of infinite partition, separately constructing $n$-sized local neighborhoods around each testing location in a transductive (vapnik2013nature) fashion. Small $n \ll N$ means thrifty processing despite an otherwise cubic complexity. Handling multiple testing locations is parallelizable (gramacy2014massively). Predictions are accurate, but a downside to local/partition modeling is that it furnishes a pathologically discontinuous predictive surface.

Other approaches target matrix decomposition directly by imposing sparsity on the covariance (titsias2009; aune2014parameter; Wilson2015; Gardner2018; Pleiss2018; solin2020hilbert), the precision matrix (Datta2016; katzfuss2021), or via reduced rank. One rank-reduction strategy deploys a set of latent knots, or inducing points, through which a low-rank decomposition is derived (e.g., Williams2001; Snelson2006; Banerjee2008; hoffman2013stochastic). Recently, the inducing points approach has been hybridized with LAGP (LIGP; cole2021locally) for an extra speed boost and/or larger neighborhoods.

GP modeling technology for computer experiments emphasizes the no-noise (i.e., deterministic simulation) or constant (iid) noise case, and this remains true when scaling up via approximation. Partition-based schemes are an important exception, as region-based independent modeling imparts a degree of non-stationarity, potentially in both mean and variance. This is not by design in all cases. There are few mechanisms in place to “turn it off”, as computational independence (from which speed derives) is tightly coupled to statistical independence (non-stationary flexibility). LAGP is an exception, as it allows the user to control compute time directly through the neighborhood size, $n$. Although the software (laGP) supports local inference of so-called nugget hyperparameters, results with noisy data disappoint because small $n$ means it is easy to be fooled into misinterpreting noise as signal, exacerbating predictive discontinuity.

Expressly accommodating input-dependent noise has been explored in the (global/ordinary) GP literature (e.g., goldberg1997regression; kersting2007most), although this is in the opposite direction on the computational efficiency frontier: it introduces latent nugget variables which must be learned alongside the usual tunable kernel quantities. Stochastic kriging (ankenman2010) offers thriftier computation for GP modeling under heteroskedasticity, as long as each unique input includes a minimum number of runs/outputs; i.e., as long as the per-location degree of replication is high. This allows latent variables to be replaced with independent moment-based variance estimates. Heteroskedastic GPs (HetGP; hetGP1) combine these two ideas: using latent variables and leveraging replication in design, but not requiring a minimum degree. The so-called “Woodbury identities” (Harville2011) facilitate exact, not approximate, GP inference and prediction in time cubic in the number of unique inputs only, even when the total number of runs $N$ is huge. hetGP2 then went on to show how replication, especially in high noise regions of the input space, can be essential for both statistical and computational efficiency. Yet even under heavy replication, a large number of unique inputs may be needed to map out the input space, which can be limiting.

In this paper we propose that, by porting the Woodbury approach from HetGP over to the LIGP framework, one achieves the best of both worlds. Relatively few local inducing points can support a much larger number of local unique inputs, representing a still larger neighborhood of total observations under replication. Casting a wide net is crucial for separating signal from noise, but would be prohibitive under the original LAGP regime. Bolting a nugget onto LIGP, as is already available in laGP, is the tip of the iceberg. Local inducing points, whose number is far smaller, together with Woodbury structure for representing the full neighborhood, make for a powerful cascade of local information. We detail the quantities involved for inference, including the gradient for numerical optimization, and prediction. Illustrations are provided along the way, followed by full benchmarking in both constant noise and heteroskedastic settings. We show that this new LIGP capability is both faster and more accurate than modern alternatives.

The remaining flow of the paper is as follows. An overview of GP regression and LIGP is provided in Section 2. Our contribution, focused on estimating local noise and incorporating replication into LIGP via Woodbury identities, is detailed in Section 3. Section 4 summarizes empirical work on synthetic and benchmark examples from epidemiology, biology, and computational finance. Section 5 explores this new capability in the context of pricing Bermudan options. Section 6 wraps up with a discussion and directions for future work.

2 Review

Here we review standard GP surrogate modeling followed by locally induced GPs.

2.1 Gaussian process regression

Consider an unknown function $f$ mapping a set of $d$-dimensional design locations $X_N = (x_1, \dots, x_N)^\top$ to corresponding noisy observations $Y_N$ of $f(X_N)$. A common surrogate for such data is a Gaussian process (GP), which places an MVN prior on $f$. The prior is uniquely defined by a mean vector and a covariance (kernel) function $k_\theta(\cdot, \cdot)$. The observation model is $Y_N = f(X_N) + \varepsilon$ where, under constant noise, the errors are iid Gaussian. Often the iid noise is hyperparameterized by coupling a nugget $g$ with scale $\tau^2$, as in $\varepsilon \sim \mathcal{N}_N(0, \tau^2 g \mathbb{I}_N)$, so that the full model for responses is $Y_N \sim \mathcal{N}_N(0, \tau^2 (K_N + g \mathbb{I}_N))$, where $\mathbb{I}_N$ denotes an $N \times N$ identity matrix and $K_N$ collects kernel evaluations between rows of $X_N$. A zero mean simplifies the exposition without loss of generality. (Any, even nontrivial, mean structure can be subsumed into a covariance kernel if so desired.) We prefer zero-centered observations and coded/pre-scaled inputs (wycoff2021sensitivity) based on inverse distance between rows of $X_N$. After pre-scaling, kernels filling out $K_N$ based only on (scaled) distances are reasonable, e.g., the so-called isotropic Gaussian kernel:

$$k_\theta(x, x') = \exp\left\{ -\frac{\lVert x - x' \rVert^2}{\theta} \right\} \qquad\qquad (1)$$

A scalar lengthscale hyperparameter $\theta$ governs the rate of radial decay of covariance. Other kernel families, such as the Matérn (Stein2012; gramacy2020surrogates, Section 5.3.3), and so-called separable forms with lengthscales for each input coordinate are also common. Our work here is largely agnostic to such choices, as long as the kernel is differentiable in the coordinates of its inputs. Many of the limiting characteristics of this overly simplistic choice (e.g., infinite smoothness and isotropy) are relaxed by a local application.
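As a concrete illustration, the following R sketch builds a kernel matrix under Eq. (1); the function name and its single-lengthscale interface are ours, not part of the liGP package.

```r
## Isotropic Gaussian kernel of Eq. (1): k(x, x') = exp(-||x - x'||^2 / theta)
## X1, X2: matrices with one input per row; theta: scalar lengthscale
gauss_kernel <- function(X1, X2, theta) {
  D2 <- outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * tcrossprod(X1, X2)
  exp(-pmax(D2, 0) / theta)   # pmax guards against tiny negative round-off
}
```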

Working with MVNs involving dense covariance matrices, as would be furnished by Eq. (1), involves storage that grows quadratically in $N$. Prediction and likelihood evaluation for hyperparameter inference, such as for $(\tau^2, \theta, g)$, involve inverses and determinants requiring decomposition (usually via Cholesky) that is cubic in $N$. To economize on space, the relevant formulas are not provided here as they are not needed later. We shall provide related expressions momentarily in a more ambitious context. These reduce to ones that may be found in textbooks alongside illustrations of many desirable GP properties, such as excellent nonlinear predictive accuracy and UQ.

2.2 Local approximate GPs

Cubic computational costs are crippling in the modern context of large-$N$ simulation experiments. As a thrifty workaround, Gramacy2015 introduced the local approximate GP (LAGP), which imposes an implicitly sparse structure by fitting a separate, limited-data GP for each predictive location of interest $x$. Each fit/prediction is based on an $n$-sized subset of the data nearby $x$, with $n \ll N$. Multiple criteria have been suggested to build this local neighborhood, with Euclidean nearest neighbor (NN) being the simplest. Each prediction thus requires flops in $O(n^3)$, which can offer dramatic savings and is embarrassingly parallel across predictive locations (gramacy2014massively). LAGP was developed for interpolating deterministic simulations; however, the laGP software on CRAN (laGP) includes a nugget-estimating capability for local smoothing, sketched below.
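A minimal sketch of that capability via laGP follows; the particular arguments shown (neighborhood size, prior defaults) are illustrative rather than the settings used in our experiments.

```r
library(laGP)

## X (N x d inputs), Y (length-N responses), XX (testing locations) assumed in hand
d_prior <- darg(list(mle = TRUE), X)   # lengthscale prior/defaults
g_prior <- garg(list(mle = TRUE), Y)   # nugget prior/defaults (enables local smoothing)

## nearest-neighbor LAGP with jointly estimated lengthscale and nugget
fit <- aGP(X, Y, XX, end = 50, d = d_prior, g = g_prior, method = "nn")
## fit$mean and fit$var hold local predictive means and variances at XX
```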

Figure 1: Mean (top) and variance (bottom) estimates (left) and errors (right) for the SIR model along a 1d slice.

The “local nugget” feature underwhelms when applied globally for many predictive locations comprising a dense testing set. This can be seen first-hand by trying the examples in the laGP documentation. Figure 1 provides a demo in a stochastic simulation context with the susceptible–infected–recovered (SIR) disease model (hu2017sequential), implemented as sirEval in hetGP (hetGP; binois2021hetgp). We used a training set of unique inputs, each paired with a random degree of replication, so that the full experiment comprised a much larger number of runs. The views in the figure arise from a grid of 200 testing locations along a 1d slice. Estimates of the “true” mean and variance (for comparison) are based on smoothed averages of 1000 replicates at each location. Comparators include LAGP.NN, LIGP.NN with 10 inducing points, and HetGP on a random training subset. Observe that LAGP’s estimates (blue dashed lines) of the mean and variance are highly discontinuous. The error plots in the right column show how LAGP especially struggles to produce accurate estimates over much of the slice. Details on our preferred LIGP method (green dashed line), alongside further analysis of this figure, are the subject of Section 3.

The problem with LAGP and noise is both fundamental and technical: fundamental in that larger neighborhoods are needed with noisy data, lest one be fooled by the noise, but this severely cuts into any time advantage; technical in that many stochastic simulation experiments involve replication, yet the implementation does not adapt its notion of neighborhood accordingly. In the most extreme case, suppose that the nearest input site to $x$ has $n$ or more replicates. Under that configuration LAGP will degenerate without a diversity of unique training inputs. Less pathologically, a multitude of nearby replicates effectively results in a narrower neighborhood even though the number of sufficient statistics involved is fewer than $n$. This is inefficient both statistically and computationally. Addressing these two issues is the primary contribution of this paper.

2.3 Locally induced GPs

LIGP (cole2021locally) is a recent extension of the LAGP idea that relaxes the coupling between the neighborhood and the local covariance structure. Rather than imposing a local covariance structure directly via pairwise distances between neighborhood elements, LIGP uses an intermediary set of knots, or so-called inducing points or pseudo-inputs $\bar{X}_m$, whose coordinates are not restricted to the design. This approach is mainstream in the ordinary/global GP approximation context, yielding speedups by reducing the rank of the covariance structure (several citations are provided in Section 1), but its local application is new. When $m \ll n$, the LIGP setup offers a double speedup, first through neighborhood structure ($n \ll N$) and then through low-rank local approximation (cubic in $m$ rather than $n$). Conversely, it allows a much larger neighborhood without a cubic-in-$n$ increase in flops.

Let $K_m$ denote a kernel matrix built from pairs of inducing points $\bar{X}_m$, e.g., via Eq. (1). A discussion of how one may choose $\bar{X}_m$ is deferred to Section 3.2. Similarly, write $K_{nm}$ for cross evaluations of the kernel between the neighborhood $X_n$ and $\bar{X}_m$. Adopting the GP approximation with inducing points introduced by Snelson2006 yields

$$\Sigma_n \equiv K_{nm} K_m^{-1} K_{mn} + \Omega_n + g \mathbb{I}_n, \quad \text{where} \quad \Omega_n = \mathrm{Diag}\{ K_n - K_{nm} K_m^{-1} K_{mn} \}, \qquad (2)$$

as the MVN covariance deployed for inference/prediction at $x$. Here, $\Omega_n$ is a diagonal correction matrix for the covariance and $g \mathbb{I}_n$ expresses the pure noise. Our notation suppresses some dependence on $x$ to keep the expressions tidy. Going forward we shall use $\Delta_n = \Omega_n + g \mathbb{I}_n$ to express the sum of diagonal matrices in Eq. (2), electing not to embolden it like we do for other matrices since it can be stored as an $n$-vector. When $\bar{X}_m = X_n$, Eq. (2) reduces to the standard LAGP covariance and offers no computational benefit. The cost savings come through the decomposition of (2) via Woodbury matrix identities. Evaluations of $\Sigma_n^{-1}$ and $|\Sigma_n|$ may be obtained with $O(n m^2)$ flops rather than $O(n^3)$. Likewise, taking $n = N$ with $\bar{X}_m = X_N$ yields a global GP without approximation.

Hyperparameter inference may be achieved by maximizing the logarithm of the MVN likelihood, which may be expressed up to an additive constant as

(3)

Above, $\mathbb{1}_n$ is a vector of ones. Differentiating Eq. (3) with respect to $\tau^2$ and solving yields a closed-form MLE (deferred until Section 3), while a numerical solver like BFGS can work with the negative concentrated log-likelihood (i.e., plugging that MLE into Eq. (3)) to estimate $\theta$ and $g$. It is important to note that cole2021locally did not include/estimate a local nugget $g$. We see this as part of our novel contribution, but find it expedient to notate such quantities here as part of our review.

For fixed hyperparameters, a predictive distribution for $Y(x)$ arises via standard MVN conditioning on an $(n+1)$-dimensional joint MVN for $(Y(x), Y_n)$. Using $k_m(x)$ to denote the vector of kernel evaluations between $x$ and the inducing points $\bar{X}_m$, the moments of that Gaussian distribution are

(4)

Both the log-likelihood (3) and the predictive equations (4) conditional on those values reduce to LAGP and ordinary GP counterparts with $\bar{X}_m = X_n$ and with $n = N$, respectively. Choosing the locations of inducing points $\bar{X}_m$, globally or locally, may also be based on the likelihood (Snelson2006); however, this is fraught with issues (titsias2009). cole2021locally observed that situating $\bar{X}_m$ via classical design criteria like integrated variance, or so-called I-optimal design, led to superior out-of-sample predictive performance while also avoiding additional cubic calculation. Specifically for local analysis in LIGP, they introduced a weighted form of integrated variance centered around the predictive location $x$, and proposed an active learning scheme minimizing weighted integrated mean-squared error (wIMSE) for greedy selection of the next inducing point augmenting $\bar{X}_m$. Upgraded details will be provided in Section 3. Saving on computation by stylizing the features of the local designs, cole2021locally developed several template schemes that allow bespoke optimization of wIMSE for each $x$ to be bypassed.

3 Local smoothing under replication

LIGP provides an opportunity to reduce computing time while also increasing neighborhood size, because complexity grows cubically in $m$, not $n$ or $N$. This is doubly important when there are replicates among the neighborhood elements. Without inducing points, LAGP neighborhoods of size $n$ could have very few unique inputs. However, LIGP's Woodbury representation may be extended from inducing points to replicates, so that neighborhoods containing any number of unique locations $\bar{n}$ may be built, regardless of how much bigger the total number of local runs $n$ is, without additional cubic cost.

3.1 Fast GP inference under replication

Given a local neighborhood $X_n(x)$, let $X_{\bar{n}} = (\bar{x}_1, \dots, \bar{x}_{\bar{n}})^\top$ represent the unique input locations. Notate each unique $\bar{x}_i$ as having $a_i$ replicates, where $\sum_{i=1}^{\bar{n}} a_i = n$. Let $y_i^{(j)}$ be the response of the $j^{\mathrm{th}}$ replicate at $\bar{x}_i$, for $j = 1, \dots, a_i$. Without loss of generality, assume the $n$-neighborhood inputs are ordered so that each unique input appears $a_i$ times in succession, and suppose $Y_n$ is stacked with responses from the replicates in the same order. Collect the replicate counts in $A_{\bar{n}} = \mathrm{Diag}(a_1, \dots, a_{\bar{n}})$ and the averaged responses in $\bar{Y}_{\bar{n}}$ with entries $\bar{y}_i = a_i^{-1} \sum_{j=1}^{a_i} y_i^{(j)}$. Under this ordering, the full-neighborhood kernel matrices may be expressed in terms of their unique-location counterparts through a block matrix of ones mapping unique inputs to their replicates; for example, the cross-kernel $K_{nm}$ simply repeats the rows of its $\bar{n} \times m$ analog $K_{\bar{n}m}$. A code sketch of this bookkeeping is given below.
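In code, the mapping from a replicate-heavy neighborhood to unique locations, counts, and averages can be obtained with hetGP's find_reps utility; a sketch (variable names ours):

```r
library(hetGP)

## Xn (n x d neighborhood inputs, possibly with repeated rows), Yn (length-n responses)
reps <- find_reps(Xn, Yn)
Xbar <- reps$X0     # unique design locations (one row per unique input)
Ybar <- reps$Z0     # averaged responses, one per unique location
a    <- reps$mult   # replicate counts a_1, ..., a_nbar
```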

Using these definitions we present a new expression for $\Sigma_n$, extending Eq. (2), based on the unique input locations in the local neighborhood:

(5)

While $\Sigma_n$ in (5) is $n \times n$, we do not build it in practice. It is an implicit, intermediate quantity which we notate here as a step toward efficient likelihood and predictive evaluation.

When $m \ll n$, decomposing $\Sigma_n$ to obtain $\Sigma_n^{-1}$ and $|\Sigma_n|$ is much cheaper than $O(n^3)$ under the Woodbury identities. Generically, these are as follows and may be found in most matrix computation resources (e.g., Harville2011):

$$(A + U B V)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}, \qquad |A + U B V| = |A|\, |B|\, |B^{-1} + V A^{-1} U|, \qquad (6)$$

where $A$ and $B$ are invertible matrices of size $n \times n$ and $m \times m$ respectively, and $U$ and $V^\top$ are of size $n \times m$. Efficient decomposition in our local approximation context involves combining LIGP's use of Eq. (6) with inducing points, and hetGP1's separate use leveraging replicate structure. Applying both to (5), the result is as follows:

(7)

where the component matrices are sized in $\bar{n}$ and $m$ only. Eq. (7) is key to reducing the computational cost of Eqs. (3)–(4), which is cubic in $n$, to one involving decompositions of matrices sized only in $\bar{n}$ and $m$.
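A quick numeric check of the generic identity (6) in R, with arbitrary small dimensions, may help fix ideas:

```r
set.seed(1)
n <- 8; m <- 3
A <- diag(runif(n, 1, 2))               # n x n (diagonal, cheap to invert)
B <- crossprod(matrix(rnorm(m^2), m))   # m x m positive definite
U <- matrix(rnorm(n * m), n, m); V <- t(U)

lhs <- solve(A + U %*% B %*% V)
rhs <- solve(A) - solve(A) %*% U %*% solve(solve(B) + V %*% solve(A) %*% U) %*% V %*% solve(A)
max(abs(lhs - rhs))   # agreement to machine precision

## matching determinant identity
det(A + U %*% B %*% V) - det(A) * det(B) * det(solve(B) + V %*% solve(A) %*% U)
```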

The representations in (7) allow the log-likelihood (3) to be expressed as follows

(8)

up to an additive constant. Above, $\bar{Y}_{\bar{n}}$ stores the averaged responses for each unique row of $X_{\bar{n}}$. Notice that when Eq. (7) is applied to the log-likelihood, the $n$-sized matrix terms vanish. In fact, we do not need those quantities except as notational devices. Instead, all matrices in Eq. (8) have dimension in $\bar{n}$ and/or $m$. Although a quadratic form in the full $Y_n$ must still be calculated, this is a product of vectors whose entries may be stored and accessed efficiently alongside the unique-location quantities. Their storage and manipulation are thus linear in $n$.

Differentiating (8) with respect to $\tau^2$ and solving gives the MLE:

(9)

The equation for $\hat{\tau}^2$ in (9) still relies on all replicates through the full $Y_n$, and is equivalent to the estimator provided by cole2021locally (Eq. 6 therein). This part of the calculation is linear in $n$. However, when replicates are present, evaluation of (9) may be much faster owing to the second term being sized in $\bar{n}$ and $m$. Plugging $\hat{\tau}^2$ into Eq. (8) yields the following concentrated log-likelihood:

(10)

The expression above is negated for library-based minimization in search of MLEs $\hat{\theta}$ and/or $\hat{g}$. Numerical methods such as BFGS (Byrd1995) work well. Convergence and computing time of such optimizers are aided by derivatives, which are tedious to derive but available in closed form:

(11)

Again, observe that none of these log-likelihood-derived quantities (8)–(11) involve matrices with dimensions larger than $\bar{n}$ or $m$. A generic sketch of the concentrated-likelihood optimization follows.
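The numerical setup is the same as for any concentrated GP log-likelihood. For orientation only, here is a generic (full-rank, no-replicate) version of profiling out the scale and optimizing lengthscale and nugget with L-BFGS-B; the replicate- and inducing-point-aware objective of Eqs. (8)–(11) follows the same pattern with cheaper linear algebra.

```r
## negative concentrated log-likelihood for a simple GP with isotropic Gaussian kernel
neg_cloglik <- function(par, X, y) {
  theta <- exp(par[1]); g <- exp(par[2])        # optimize on log scale for positivity
  n <- length(y)
  K <- exp(-as.matrix(dist(X))^2 / theta) + diag(g, n)
  Ki <- solve(K)
  tau2 <- drop(crossprod(y, Ki %*% y)) / n      # closed-form scale MLE, cf. Eq. (9)
  0.5 * (n * log(tau2) + determinant(K, logarithm = TRUE)$modulus)
}

out <- optim(c(log(0.1), log(0.01)), neg_cloglik, X = X, y = y, method = "L-BFGS-B")
theta_hat <- exp(out$par[1]); g_hat <- exp(out$par[2])
```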

Conditioning on the estimated hyperparameters leaves us with the following re-expressions of the LIGP predictive equations (4):

(12)

Although these look superficially similar to (4), the following remarks are noteworthy. Perhaps the biggest difference is that the averaged responses $\bar{Y}_{\bar{n}}$ are used for the mean. The Woodbury identity ensures that the result is the same as (4), both in mean and variance, as would be obtained if the replicate structure were overlooked. In both cases, many of the details are buried in matrices whose scripts mask a substantial difference in the subroutines required to build the requisite quantities for (12) as compared to (4).

Kriging with pre-averaged responses $\bar{Y}_{\bar{n}}$, whether locally or globally, is a common technique for efficiently managing replicates. However, one must take care to (a) adjust the covariance structure appropriately, and (b) utilize all replicates in estimating the scale $\tau^2$. hetGP2 show that the averages are not, alone, sufficient statistics. Using only pre-averaged responses risks leaving substantial uncertainty unaccounted for in prediction. Our Woodbury identities provide a full covariance structure (5), and the adjustments in unique-location quantities for an appropriate scale estimate (9). Alternative approaches, such as ankenman2010's stochastic kriging (SK), also use unique inputs and pre-averaged outputs. Covering (a) asymptotically, it may be shown that SK yields the best linear unbiased predictor (BLUP) when conditioning on estimated hyperparameters. However, the uncertainty in that predictive surface is not fully quantified. For example, SK may only furnish confidence intervals on predictions, not full predictive intervals, because it does not utilize the full set of sufficient statistics (b). Since our Woodbury application links Eq. (12) with (4) identically, a BLUP property holds (locally) for LIGP without asymptotic arguments, and with full predictive UQ.

More specifically, consider the following comparison, which upgrades an argument from hetGP1 to our local inducing points setting. We show how the MLE of the scale parameter is not the same (conditioned on the other hyperparameters) when calculated on the original set of $n$ observations versus the $\bar{n}$ unique design locations with averaged responses. Using unique-location calculations based on $\bar{Y}_{\bar{n}}$ only would result in a scale estimate in which the quantity

(13)

weights the nugget (noise) based on the number of replicates $a_i$ at each unique location $\bar{x}_i$. Our full-neighborhood calculation for the local MLE of $\tau^2$ with inducing points in Eq. (9), by contrast, can be rewritten as follows:

(14)

See the Appendix for details. Observe that the quantity in Eq. (13) is but a small part of (14). The additional term in (14) serves as a correction for the variance estimate.

3.2 Local neighborhood geography

To dig a little deeper into the differences between LAGP and LIGP, especially as regards handling replicates and LIGP's upgraded capability, consider again the SIR example introduced in Section 2.2. Whereas Figure 1 involved a continuum of predictions along a 1d slice, Figure 2 shows relevant quantities involved in one of the locations along that slice, indicated in green. Recall that the input space is 2d, with unique inputs paired with uniformly distributed degrees of replication. With such a high degree of replication, LAGP's default neighborhood is small in terms of unique locations, comprising only five of them in this instance. As the green dot moves along the slice of Figure 1, eventually one of those five will be replaced, resulting in a change of upwards of 1/5 of the local conditioning set. It is perhaps not surprising that the LAGP surfaces in that figure were so “jumpy”, and often inaccurate.

Figure 2: LAGP and LIGP local neighborhoods at a predictive location (green), with numbers denoting the number of replicates at each unique location. Gray dots represent the full design.

LIGP's implementation of Eqs. (8)–(12) affords a much larger amount of data in the local model for commensurate computing effort, through a template of inducing points. These are indicated as blue dots in Figure 2. The neighborhood of unique design locations is indicated with the number of replicates at each (black numbers). In this instance, LIGP conditions upon 20 times more training data than LAGP does. This yields more accurate and stable mean and variance estimates, shown visually along a slice in Figure 1 and summarized in a more expansive Monte Carlo (MC) exercise in Section 4.3. Those latter results additionally show that such accurate predictions can be achieved at a computational cost substantially lower than LAGP's.

The location and spacing of the inducing points in Figure 2 (blue dots) come from a local space-filling template scheme based on wIMSE. The idea is that a local inducing point set of a desired size can be built up greedily, using integrated variance under a Gaussian measure centered at the predictive location $x$. An expression using unique neighborhood elements (together with all replicates) given (5) is

(15)

where erf is the Gaussian error function and the integration bounds define a hyperrectangle study region. Again, the resemblance to a similar expression from cole2021locally masks a substantial enhancement in its component parts due to a double Woodbury application, as notated by scripts; some of those components are defined in Eq. (11) of cole2021locally. A closed-form gradient with respect to a candidate inducing point may also be derived. As this is again similar in form to cole2021locally, we do not duplicate it here. Given the unique design locations, the calculation in (15) and its derivative are equivalent to Eqs. (10–12) in that paper. Consequently, a wIMSE-optimized design of inducing points is not affected by the local distribution of replicates. Under replication the criterion can, of course, be calculated much faster, so the optimization is speedier.

Regularity in local inducing point design also means that simple space-filling designs are equally valid. This is what was used for Figures 1–2. Once a set of inducing points is calculated for one predictive location $x$, cole2021locally show that it can serve as a template for predictions at other members of a testing set through displacement and warping, so long as the characteristics of the original design are uniform. We shall contrast several options in our empirical work in Section 4.

4 Implementation and benchmarking

Now we provide practical details and report on experimental results showcasing LIGP’s potential for superior accuracy and UQ at lower computational costs against LAGP and HetGP. Examples include data from a toy function and real stochastic simulators. All analysis was performed on an eight-core hyperthreaded Intel i9-9900K CPU at 3.60 GHz. Every effort was made to fully tap that distributed resource to minimize compute times.

4.1 Implementation details

An open-source implementation of LIGP can be found in the liGP package on CRAN (liGP). R code (R) supporting all examples reported here, and throughout the paper, may be found in our Git repository: https://bitbucket.org/gramacylab/ligp/src/master/noise

Our main competitors are laGP (laGP) and hetGP (hetGP). While laGP is coded in C with OpenMP for symmetric multiprocessing (SMP) parallelization (R serving only as a wrapper), and hetGP leverages substantial Rcpp (Rcpp), our LIGP implementation is pure R, using foreach for SMP distribution. With laGP we use the default neighborhood size, built by NN and Active Learning Cohn (ALC; cohn1994adavances). For hetGP, we reduce the training data to a random subset of unique inputs (retaining all replicates) to keep decompositions tractable. Despite compiled C/C++ libraries and thrifty (local/global) data subset choices, our (more accurate and larger-neighborhood) LIGP models are competitive, time-wise, and sometimes notably faster.

Our LIGP implementation uses an isotropic Gaussian kernel with scalar lengthscale $\theta$. To improve the numerical conditioning of the kernel matrices for stable inversion, we augment their diagonals with a small jitter (Neal1998). Our implementation in liGP automatically increases these values for a particular local model if needed. LAGP and HetGP results also utilize an isotropic Gaussian kernel. Using separable local formulations does not significantly improve predictive performance, especially after globally pre-scaling the inputs (sun2019emulating). Pre-scaling or warping of inputs (wycoff2021sensitivity) has become a popular means of boosting the predictive performance of sparse GPs (e.g., katzfuss2020scaled). In our exercises, we divide each input coordinate by the square root of the corresponding separable global lengthscale, obtained from a GP fit to a random data subset with averaged responses. See gramacy2020surrogates, Section 9.3.4, for details. The time required for this, which is negligible relative to other timings, is not included in our summaries. A sketch of the pre-scaling step follows.
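A minimal sketch of that pre-scaling, assuming hetGP's mleHomGP for the global separable fit (the subset size, seed, and the variable names Xbar_all/Ybar_all for unique inputs with averaged responses are ours):

```r
library(hetGP)

set.seed(123)
sub  <- sample(nrow(Xbar_all), min(1000, nrow(Xbar_all)))   # random subset of unique inputs
gfit <- mleHomGP(Xbar_all[sub, , drop = FALSE], Ybar_all[sub], covtype = "Gaussian")

## divide each input coordinate by the square root of its estimated global lengthscale
X_scaled  <- sweep(X,  2, sqrt(gfit$theta), "/")
XX_scaled <- sweep(XX, 2, sqrt(gfit$theta), "/")
```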

For inducing point designs, we use wIMSE templates based on Eq. (15) built at the center of the design space, determined as the median of the design in each dimension. For the initial local lengthscale we adopt a strategy from laGP via the 10% quantile of squared pairwise distances between members of the local neighborhood (in laGP, the function providing this is darg). Each inducing point augmenting the template is found by optimizing wIMSE via a 20-point multi-start derivative-based L-BFGS-B (Byrd1995) scheme (using optim in R), with starts peppered within the bounding box surrounding the neighborhood, to a tolerance of 0.01. In addition to the wIMSE design for the inducing point template, we consider so-called “qNorm” templates derived from space-filling designs. These originate from Latin hypercube samples (LHS; Mckay:1979) on the hyperrectangle enclosing the neighborhood, augmented with the predictive location as an additional point. These LHSs are warped with the inverse Gaussian CDF to closely mimic the wIMSE design; a sketch of one such construction follows.
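One plausible rendering of the qNorm construction in R is below; the exact mean and standard deviation of the inverse-Gaussian-CDF warping are our guesses, not values taken from the liGP package.

```r
library(lhs)

## m inducing points for a neighborhood with bounding box [lo, hi] per dimension, centered at x
qnorm_template <- function(m, lo, hi, x) {
  d <- length(x)
  U <- randomLHS(m - 1, d)              # space-filling LHS on the unit cube
  W <- qnorm(U, mean = 0.5, sd = 0.25)  # warp density toward the center (assumed parameters)
  W <- pmin(pmax(W, 0), 1)              # clip back onto the unit cube
  Xm <- sweep(W, 2, hi - lo, "*") + matrix(lo, m - 1, d, byrow = TRUE)
  rbind(x, Xm)                          # augment with the predictive location itself
}
```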

Individual local neighborhoods and predictions are made for a set of testing locations given the training data, a neighborhood size $n$, and a number of inducing points $m$. We fix $n$ and $m$ within each experiment, with different settings for the 2d and 4d problems. Each testing location proceeds in parallel via 16 foreach threads (two per hyperthreaded core). To estimate the scale, lengthscale, and nugget we use Eqs. (9)–(11) via optim over the local neighborhoods. To aid in discerning signal from noise during hyperparameter optimization, we incorporate default priors for $\theta$ and $g$ from laGP, described in Appendix A of gramacy2016lagp. Predictions follow Eq. (12). The parallel prediction loop is sketched below.
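The parallel loop itself is straightforward with foreach and doParallel; here predict_ligp_at stands in for the per-location neighborhood construction, hyperparameter estimation, and prediction described above (it is not a function exported by liGP).

```r
library(foreach)
library(doParallel)
registerDoParallel(cores = 16)   # two threads per hyperthreaded core

preds <- foreach(i = seq_len(nrow(XX)), .combine = rbind) %dopar% {
  ## hypothetical wrapper: builds the neighborhood of XX[i, ], places the inducing
  ## point template, maximizes Eq. (10), and returns the moments of Eq. (12)
  predict_ligp_at(XX[i, ], X = X, Y = Y, n = n_loc, m = m_ind)
}
```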

4.2 Benchmark non-stationary data

The toy 2d function known as Herbie's tooth (herbtooth) is attractive as a benchmark problem due to its non-stationary mean surface with multiple local minima. The mean function is the negated product of a one-dimensional bump-and-wiggle term applied to each input coordinate; a sketch of its standard definition is given after the next paragraph.

For this experiment we introduce constant additive Gaussian noise, creating a noisy response. Each of the 30 MC repetitions comprises a fresh training set of LHS locations, each with a random degree of replication. Although our later experiments use a more common, fixed-degree setup for replicates, we chose random degrees here to underscore that LIGP does not require a minimum amount of replication. A separate LHS of out-of-sample testing locations is created for each MC run. Our LAGP fits include a larger NN neighborhood for a slightly fairer comparison to LIGP, in addition to the other LAGP options described earlier.
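For completeness, a sketch of Herbie's tooth as it is usually defined in the literature; the sine-phase constant and the noise level shown here are placeholders to be checked against herbtooth and the experiment description above.

```r
## Herbie's tooth on [-2, 2]^2: negated product of a 1d bump-and-wiggle function
herbtooth <- function(X) {
  w <- function(z) exp(-(z - 1)^2) + exp(-0.8 * (z + 1)^2) - 0.05 * sin(8 * (z + 0.1))
  -apply(apply(X, 2, w), 1, prod)
}

## noisy training responses with constant Gaussian noise (sd is a placeholder)
Y <- herbtooth(X) + rnorm(nrow(X), sd = 0.1)
```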

Figure 3: RMSE (left) and score (right) vs. log compute time over 30 MC repetitions for Herbie’s tooth experiment. Median statistics are denoted with bold markers.

Figure 3 shows out-of-sample accuracy statistics versus the logarithm of computation time across the MC repetitions. A particular method's median statistic, taken marginally across both axes, is represented with a bold symbol. The left panel shows the root mean-squared prediction error (RMSE; lower is better) against the (de-noised) truth along the vertical axis. In the right panel, the vertical axis shows proper scores (higher is better), combining mean and variance (UQ) accuracy (gneiting2007strictly, Eq. (27)) against the (noisy) test data; a sketch of the score calculation is given after this paragraph. LIGP methods yield the fastest predictions with the lowest RMSEs and highest scores. LIGP.qNorm (blue) has the quickest compute time, faster than LIGP.wIMSE (purple) due to its thriftier inducing point template construction. LAGP.NN with the default neighborhood (green inverted triangles) is the fastest among the LAGP models, but least accurate. HetGP (black circles) is fit with substantial sub-setting, which explains its high variance on both time and accuracy. It is noteworthy that LAGP.ALC (red diamonds) and LAGP.NN with the larger neighborhood (navy triangles) perform similarly for RMSE, but the latter has a higher score. This supports our claim that a wider net of local training data is needed to better estimate noise.
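For reference, the Gaussian proper score of gneiting2007strictly (Eq. (27)), as typically computed in this literature, is sketched below; higher is better.

```r
## proper score for Gaussian predictions: y (held-out data), mu (means), s2 (variances)
score <- function(y, mu, s2) mean(-(y - mu)^2 / s2 - log(s2))
```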

4.3 Susceptible-infected-recovered (SIR) epidemic model

We return to the SIR model (hu2017sequential) first mentioned in Section 2.2. SIR models are commonly used for cost-benefit analysis of public health measures to combat the spread of communicable diseases such as influenza or Ebola. We look at a simple model with two inputs: the initial number of susceptible individuals and the initial number of infected individuals. In hetGP (hetGP), the function sirEval accepts these two inputs on the unit scale. The response of the model is the expected aggregate number of infected-days across Markov chain trajectories until the epidemic ends. The signal-to-noise ratio varies drastically throughout the input space, making this model an ideal test problem for LIGP. A sketch of generating replicated training data from this simulator is given below.
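A sketch of generating replicated SIR training data, assuming sirEval accepts a length-two input on the unit scale and returns a scalar response as described above (design sizes and replication range are arbitrary for illustration):

```r
library(hetGP)
library(lhs)

set.seed(42)
Xu <- randomLHS(500, 2)                          # unique inputs on the unit square
a  <- sample(1:20, nrow(Xu), replace = TRUE)     # random degrees of replication (arbitrary range)
X  <- Xu[rep(seq_len(nrow(Xu)), times = a), ]    # full design with replicates
Y  <- apply(X, 1, function(x) sirEval(x))        # one stochastic simulation per row
```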

In this experiment we keep the same training set size as Section 4.2, but now use a fixed degree of 10 replicates at each location, a more common default design setup in the absence of external information about regions of high/low variability. We also keep a large testing set, which helps manage MC error in our out-of-sample metrics in this heteroskedastic setting. The experiment's results are displayed in Figure 4. The left panel's vertical axis shows the RMSE values between each method's mean predictions and the noisy testing observations. Variability across repetitions in the data-generating mechanism looms large compared to the differences in RMSE values among the methods. Although RMSEs/scores may look similar marginally, within each MC repetition there is a clearer ordering. To expose that, the left section of Figure 4's table ranks the methods based on their RMSE (lowest/best first). The reported p-values are for one-sided paired Wilcoxon tests comparing RMSEs against the next-best fit. Observe, for example, that LIGP.qNorm (blue) produces a significantly lower distribution of RMSEs compared to the other models, with LIGP.wIMSE (purple) and HetGP (black circles) not far behind.

Figure 4: Top: RMSE (left) and score (right) vs. log compute time over 30 MC repetitions for the SIR experiment. Median statistics are denoted with bold markers. Bottom: Median RMSEs (left) and scores (right) in ascending rank. The p-values are for one-sided paired Wilcoxon tests between the model and the model directly below. Median computation time (in seconds) for each model is listed in the last column.

Score (right panel) offers clearer distinctions between the methods, with LIGP besting LAGP and HetGP via paired Wilcoxon tests. LIGP.qNorm and LIGP.wIMSE are significantly different from each other, and both score significantly higher than LAGP.NN with the larger neighborhood, the next in the list. There is more variation in the computation times of the LIGP models compared to LAGP, which is likely due to LIGP's tendency to need more function evaluations for hyperparameter optimization. Yet LIGP still provides significant time savings, despite an R implementation.

4.4 Ocean oxygen concentration model

Here we consider a stochastic simulator modeling oxygen concentration deep in the ocean (mckeague2005statistical). This highly heteroskedastic simulator uses MC to approximate the solution of an advection-diffusion equation from computational fluid dynamics. There are four inputs: latitude and longitude coordinates and two diffusion coefficients. The code required to run the simulation can be found in our repository. Inputs are scaled to the unit hypercube; the output is the ocean oxygen concentration.

Figure 5: Top: RMSE (left) and score (right) vs. log compute time over 30 MC repetitions for Ocean Oxygen experiment. See Figure 4 caption.

We keep the same training/testing set sizes from the SIR experiment, with 10 replicates at each unique training location. Figure 5 shows out-of-sample metrics mirroring that experiment. At first glance, HetGP (black circles) is clearly worse (higher RMSEs, lower scores) than LIGP/LAGP. In 4d, using a subset of only 1000 unique locations does not provide enough information to model this surface. The variation in computation time for HetGP, due to challenges in modeling the latent noise, supports this claim.

Focusing on LIGP and LAGP results, we find LIGP again among the best in RMSE, score, and computation time. LAGP.NN with the larger neighborhood (navy triangles) does well at modeling the mean and noise surfaces (its RMSE and score medians are comparable to the LIGP models), but takes nearly 6 times longer than LIGP.qNorm. LAGP.ALC (red diamonds) models the mean surface well, producing only slightly worse RMSEs, but struggles with UQ as measured by proper score.

5 Bermudan Option Pricing

An application of high-throughput stochastic simulation arises in computational methods for optimal stopping, which are in particular motivated by quantitative finance contexts. Several types of financial contracts, the so-called American-type claims, offer their buyer the option to exercise, i.e., collect, her contract payoff at any time until its maturity. Pricing and risk managing such a contract requires analysis of the respective optimal exercise strategy, i.e., the decision when to exercise the option.

Monte-Carlo-driven valuation of American-type claims consists of applying the dynamic programming formulation to divide the global dynamic control problem into local single-step optimization sub-problems. The latter revolve around partitioning the input space (interpreted as the underlying prices at the current time-step) into the exercise (stopping) and continuation regions, to be learned by constructing a statistical surrogate. More precisely, one wishes to compare the q-value, namely the expected future payoff conditional on not exercising now, with the immediate payoff. While the latter is a known function, the q-value is specified abstractly as a recursive conditional expectation and must be learned.

In the literature (hetGP1; ludkovski2018kriging; ludkovski2020mlosp), it has been demonstrated that GP surrogates are well suited for this emulation task, in particular in view of their flexibility, few tuning parameters, and synergy with sequential design heuristics. Nevertheless, their cubic scaling in design size $N$ is a serious limitation. The nature of the financial context implies that typically the input space is 2-5 dimensional (one-dimensional problems are also common but are computationally straightforward) and that there is a very low signal-to-noise ratio. Consequently, Monte Carlo methods for accurate estimation of the q-value call for many simulations. In ludkovski2018kriging, replication was already identified as one potential solution to this scaling challenge; lyu2020adaptive explored adaptive replication. LIGP offers an additional boost by leveraging replication and local approximation, simultaneously addressing speed, accuracy, and localization.

Relative to existing surrogates, LIGP brings several improvements. First, it overcomes limitations on the number of training simulations $N$. In previous experiments, working with large dense covariance matrices was prohibitively slow. This limited the usability of GPs to low input dimension or required excessive replication, lowering accuracy. Second, LIGP organically handles the replicated designs that are needed to separate signal from noise, which is heteroskedastic and non-Gaussian in this application. Third, the localization intrinsic in LIGP is beneficial for capturing non-stationarity. Typically the q-value is flat at the edges and has strong curvature in the middle of the input space. Finally, the control paradigm implies that the surrogate is intrinsically prediction-driven, the main use-case being to generate exercise rules along a set of price scenarios. Thus, the surrogate is to be evaluated at a large number of predictive locations, and LIGP offers trivial parallelization that vastly reduces compute time relative to plain GP variants.

5.1 Illustrating the Exercise Strategy

We have added LIGP regressors as a new choice to the mlOSP library (ludkovski2020mlosp) for R, which contains a suite of test cases that focus on valuation of American-type options on multiple assets. This allows benchmarking the LIGP module against alternatives, in particular plain GP and hetGP solvers that are already part of mlOSP.

To fix ideas, we consider the $d$-dimensional Bermudan Max-Call with payoff function

$$h(t, x) = e^{-r t} \left( \max_{i = 1, \dots, d} x_i - K \right)_+ .$$

The parameter $K$ is known as the strike, $r$ is the continuously compounded interest rate, and $\Delta t$ is the exercise frequency measured in calendar time (e.g., daily). The input space represents the prices of $d$ stocks. Thus, the buyer of the Max-Call is able to collect the difference between the largest stock price and the strike, in dollars, at any one of the exercise opportunities $k \Delta t$, $k = 1, 2, \dots$, up to maturity. The choice of when to do so leads to decisions regarding whether to exercise the option at time step $k$ (assuming it was not exercised yet) or not. Thus, we need a surrogate for the q-value at each time step $k$; a small code sketch of the payoff and the induced exercise rule follows.
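In R, the payoff and the induced exercise decision might be sketched as follows; q_hat stands in for whichever fitted surrogate supplies the continuation (q-) value and is not part of any package interface.

```r
## discounted Max-Call payoff for a matrix x of asset prices (one scenario per row)
payoff <- function(t, x, K, r) exp(-r * t) * pmax(apply(x, 1, max) - K, 0)

## exercise now if the immediate payoff is at least the predicted continuation value
exercise_now <- function(t, x, K, r, q_hat) payoff(t, x, K, r) >= q_hat(t, x)
```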

The underlying stochastic simulator is based on generating trajectories of the asset prices. We stick to independent log-normal (Geometric Brownian motion) dynamics for the price of each asset,

$$dS^{(i)}_t = (r - \delta_i)\, S^{(i)}_t\, dt + \sigma_i\, S^{(i)}_t\, dW^{(i)}_t, \qquad i = 1, \dots, d, \qquad\qquad (16)$$

where the $W^{(i)}$ are independent standard Brownian motions, $\sigma_i$ is the volatility, and $\delta_i$ the continuous dividend yield of the $i$-th asset. This implies that, given $S_t$, the price $S_{t + \Delta t}$ has a log-normal distribution for any $\Delta t > 0$. Denote by $S_{t:T}$ a realized trajectory of the asset prices on the interval $[t, T]$. A one-step simulator for these dynamics is sketched below.
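A minimal sketch of the exact log-normal update implied by (16), with parameter vectors r, delta, sigma as defined above and a hypothetical step count n_steps:

```r
## advance d independent GBM asset prices S over a step of length dt (exact update)
step_gbm <- function(S, r, delta, sigma, dt) {
  S * exp((r - delta - 0.5 * sigma^2) * dt + sigma * sqrt(dt) * rnorm(length(S)))
}

## example: one full trajectory of n_steps exercise periods starting from S0
path <- Reduce(function(S, k) step_gbm(S, r, delta, sigma, dt), seq_len(n_steps),
               init = S0, accumulate = TRUE)
```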

The stochastic simulator yields empirical samples of the pathwise payoff, where the aggregate function representing that payoff is a selector that records the first pathwise exercise time along the given trajectory. The objective of learning the q-value at step $k$ is equivalent to regressing the pathwise payoff against the current asset prices (exploiting the Markov property of the state dynamics) to obtain the so-called timing value and identify regions where its conditional mean is positive/negative. Note that this task must be done for each time step, and by construction the outputs at step $k$ depend on the surrogates fitted previously. This iterative estimation leads to non-trivial error back-propagation along the financial time dimension.

Returning to the LIGP implementation, we have a sequence of surrogates to be constructed, one per time step. For each surrogate, we are given a data set with unique inputs and a constant number of replicates at each. For the experimental design we take an LHS on a user-specified sub-domain. After fitting the surrogate at a given step, we need to predict at a large number of unique locations, which lie in the training sub-domain but are otherwise distinct from the training inputs.

To illustrate how that looks, Figure 6 compares global GP and LIGP fits for the timing value of the 2d Bermudan Max-Call option, with model parameters chosen to give multiple exercise periods. We display the fitted surrogates at one intermediate time step. In this problem, the option is out-of-the-money when both asset prices are below the strike $K$; hence the exercise strategy is trivially to continue in that region, and the latter (the white square in the bottom-left of each panel) is excluded from the surrogate prediction. For both fits, we have a randomized training set of unique inputs, each with a constant number of replicates, created by an LHS design. The LIGP surrogate uses an estimated nugget and a fixed lengthscale.

Figure 6: Fitted timing value and the corresponding exercise boundary (solid red curves) for the global GP (left) and LIGP (right) for the 2d Max-Call problem at an intermediate time step. The colors indicate the fitted timing value and the points the training inputs. The bottom-left corner is excluded from the input space since it has zero payoff.

The estimated strategy is to continue when the timing value (color coded in both panels) is positive and to stop when it is negative, with the zero-contour (in red) demarcating the boundary of the stopping region. The stopping region consists of two disconnected components, in the bottom-right and upper-left; due to the symmetric choice of parameters, those regions (and the entire response) are symmetric along the diagonal. Thus, the response surface features a diagonal ridge with a global maximum and two basins on either side. The height of the ridge and the depth of the basins change with the time step. The respective simulation variance is low where the payoff is small and high where the asset prices, and hence the payoffs, are large.

We emphasize that Figure 6 is purely diagnostic; actual performance of the surrogate is based on the resulting estimated option price. The latter is obtained by generating a test set of scenarios and using the collection of surrogates across time steps to evaluate the resulting exercise times and ultimately the average payoff. Fixing a test set, the surrogate that yields a higher expected payoff is deemed better. Traditionally, a test set is obtained by fixing an initial price vector and sampling i.i.d. paths forward from it. For example, using the surrogates displayed in Figure 6 we obtain an estimated Max-Call option value of 8.02 via LIGP and 8.00 via the global GP, so that LIGP is slightly better. Note that this financial assessment of surrogate accuracy is not based on standard IMSE-like metrics, but is driven by the accuracy of correctly learning the exercise strategy, namely the zero-contour of the timing value. Hence, the goal is to correctly specify the sign of the timing value, which determines whether the option is exercised in the given state or not. Consequently, errors in its magnitude are tolerated, while errors in its sign will lead to degraded performance.

5.2 Results for Bermudan Option Pricing

The 2d example was primarily for illustrative purposes; for a more challenging case we take a 5d asymmetric Max-Call. We keep the same strike and interest rate but adjust the parameters of the log-normal asset dynamics (16) so that the assets have different volatilities. Thus, the setting is no longer symmetric and, due to their larger variances, the third through fifth coordinates are progressively more important. The initial starting point is fixed across all test scenarios.

We benchmark the results on a fixed test set of price paths, comparing the resulting option price estimates. Our comparators include a global GP model (homGP) and a hetGP surrogate. Note that the above are not pre-scaled and take inputs as-is. We train the surrogates with an experimental design sampled from the log-normal density of the asset prices, with unique inputs and a fixed number of replicates each. In this case study we found that a relatively small number of inducing points