With remarkable advances in computing power, complex physical systems today can be simulated comparatively cheaply and to high accuracy by using mature libraries. The ability to simulate has dramatically driven down the cost of scientific inquiry in engineering settings, at least at initial proof-of-concept stages. Even so, computer models often idealize the system—they are biased—or require the setting of tuning parameters: inputs unknown or uncontrollable in actual physical processes in the field.
An excellent example is the simulation of a free-falling object, which is a potentially involved if well-understood enterprise from a modeling perspective. The acceleration due to gravity might be known, but possibly not precisely. The coefficient of drag may be completely unknown. A model incorporating both factors (in the ideal case where drag is known) but not other factors such as ambient air disturbance or rotational velocity could be biased in consistent but potentially unpredictable ways.
Researchers are interested in calibrating such models to experimental data. With a flexible yet sturdy apparatus, a limited number of field observations from physical experiments can provide valuable information to fine tune, improve fidelity, understand uncertainty, and correct bias between computer simulations and the physical phenomena they model. When done right, tuned and bias-corrected simulations are more realistic, forecasts more reliable, and the entire apparatus can serve as a guide to subsequent simulation redevelopment, if necessary. Such is the enterprise of statistical computer model calibration.
Here we are motivated by a calibration and uncertainty quantification goal in the development of a so-called honeycomb seal, a component in high-pressure centrifugal compressors, with collaborators at Baker Hughes, a General Electric company (BHGE). Several studies in the literature treat similar components from a mechanical engineering perspective (e.g., D’Souza and Childs, 2002). To our knowledge, however, no one has yet coupled mathematical models and field experimentation in this setting. With access to a commercial simulator and having performed a limited field experiment, our BHGE colleagues performed a nonlinear least-squares (NLS) calibration as a proof of concept. The results left much to be desired.
Although we were initially optimistic that we could readily improve on this methodology, a careful exploratory analysis on computer model and field data revealed challenges hidden just below the surface. These included data size (simultaneously too little and too much), dimensionality, computer model reliability, and the nonstationary nature of the dynamics under study. Taken separately, each stretches the limits of the canonical computer model calibration setup, especially in our favored Bayesian setting. Taken all at once, these challenges demanded a fresh perspective.
Contributions by Kennedy and O’Hagan (2001, KOH) and Higdon et al. (2004) lay the foundation for flexible Bayesian calibration of computer experiments, tailored to situations where the simulations are computationally expensive and cheap, respectively. Our situation is somewhere in between, as we describe in more detail in Section 2. To set the stage and establish some notation, we offer the following brief introduction. Denote by the controllable inputs in a physical experiment and by any additional (tuning) parameters to the computer model that are unobservable or uncontrollable (or even meaningless, such as mesh size) in the field. In the KOH framework, the physical field observations are connected with computer model simulations through a discrepancy term, or bias correction , between simulation and field as follows:
Here, is the unknown “true” or “best” setting for the calibration input parameters, and represents random noise in the field measurements.
The main distinguishing feature between KOH and the work of Higdon et al. is the treatment of . If simulation is fast, then Higdon et al. describe how evaluations may be collected on-demand within the inferential procedure, for each choice of entertained, with bias trained directly on residuals observed at a small number of field data input sites, . When simulations are slow or if the computer model is not readily available for on-demand evaluation, then KOH prescribe surrogate modeling to obtain a fitted from training evaluations , with inference being joint for the bias and tuning parameter settings, , via a Bayesian posterior. If Gaussian processes (GPs) are used both for the surrogate model and bias, a canonical choice in the computer experiments literature (Sacks et al., 1989; Santner et al., 2003)
, then that posterior enjoys a large degree of analytical tractability. Numerical methods such as Markov chain Monte Carlo (MCMC) facilitate learning in
-space, potentially averaging over any GP hyperparameters, such as characteristic lengthscale or nugget. For GP details, seeRasmussen and Williams (2006).
The KOH framework has been successfully implemented in many applications and has demonstrated empirically superior predictive power for new untried physical observations, that is, out-of-sample, compared with many other techniques. The method is at the same time highly flexible and well regularized. Its main ingredients, coupled GPs ( and ) and a latent input space (), have separately been proposed as tactics for adding fidelity to fitted GP surfaces, in particular as a thrifty means of relaxing stringent stationarity assumptions (Ba and Joseph, 2012; Bornn et al., 2012). However, KOH is not without its drawbacks. One is identifiability, which is not a primary focus of this paper (see, e.g., Plumlee, 2017; Tuo and Wu, 2015). Of more pressing concern for us are computational demands, especially in the face of the rapidly growing size of modern computer experiments, both in the number of runs and in the input dimension or, to a lesser extent, . GPs require calculations cubic in to decompose large covariance matrices, limiting experiment sizes to the small thousands in practice. KOH exacerbates this situation with matrices. Moreover, a Bayesian analysis in input high dimension (, , or ), coupled with the large that could be required to adequately cover such a big computer simulation space, is all but impossible without modification.
Inroads have recently been made in order to effectively and tractably calibrate in settings where the computer experiment is orders of magnitude larger than typical. For example, Gramacy et al. (2015) simplified KOH with three modern ideas: modularization (Liu et al., 2009) to simplify joint inference, local GP approximation (Gramacy and Apley, 2015) for fast nonstationary modeling, and derivative-free optimization (Abramson et al., 2013; Le Digabel, 2011)
for point estimation. While effective, Bayesian posterior uncertainty quantification (“the baby”) was all but thrown out (“with the bath water”).
Here we propose a setup that borrows some of these themes, while at the same time backing off on others. We develop a flavor of local GP approximation that we call an on-site surrogate, or OSS that does not require modularization in order to fit within the KOH framework. As a result, we are able to stay within a Bayesian joint inferential setup, although we find it effective to perform a preanalysis via optimization, in part to prime the MCMC. We show how our on-site surrogate accommodates a degree of nonstationarity while imposing a convenient sparsity structure on otherwise huge coupled-GP covariance matrices, leading to fast decomposition under partitioned inverse identities. The result is a tractable framework that leads to calibrated fits that are both more accurate out-of-sample and more descriptive about uncertainties than is the NLS approach previously developed at BHGE for the honeycomb seal.
The reminder of this paper is outlined as follows. Section 2 describes the honeycomb seal application, the challenges stemming from its simulation, and subsequent attempts to calibrate via a limited field data. Section 3 introduces our novel on-site surrogate strategy for emulation within a calibration framework and elucidates its merits including its blazingly fast application within an optimization/point-estimate calibration setting. Section 4 expands this setup in Bayesian KOH-style. Returning to our motivating example, Section 5 demonstrates calibration results from both optimization and fully Bayesian approaches, including comparison with the simpler NLS strategy at BHGE. Section 6 concludes this paper with a brief discussion.
2 Honeycomb seal
The honeycomb seal is an important component widely used in BHGE’s high-pressure centrifugal compressors to enhance rotor stability in oil and gas applications or to control leakage in aircraft gas turbines. The seal(s) and applications at BHGE are described by design variables characterizing geometry and flow dynamics: rotational speed, cell depth, seal diameter and length, inlet swirl, gas viscosity, gas temperature, compressibility factor, specific heat, inlet/outlet pressure, and clearance. The field experiment, from BHGE’s component-level honeycomb seal test campaign, comprises runs varying a subset of those conditions, , believed to have greatest variability during turbomachinery operation: clearance, swirl, cell depth, seal length, and seal diameter. Measured outputs include direct/cross stiffness and damping, at multiple frequencies. Here our focus is on the direct stiffness output at 28 Hz.
A few hundred runs in thirteen input dimensions is hardly sufficient to understand honeycomb seal dynamics to any reasonable degree in this highly nonlinear setting. Fortunately, the rotordynamics of seals like the honeycomb are relatively well understood, at least from a mathematical and computational modeling standpoint. Although the input dimension is somewhat high by computer model calibration standards, library-based numerical routines provide ready access to calculations for direct/cross stiffness and damping for inputs like the ones listed above. In what follows, we provide some insight into one such solver and the advantages as well as challenges to using it (along with the field data) to better understand and predict the dynamics of our honeycomb seal.
2.1 Isotseal simulator
A simulator called ISOTSEAL, developed at Texas A&M University (Kleynhans and Childs, 1997), offers a relatively speedy evaluation (about one second) of the response(s) of interest for the honeycomb seal under study at BHGE. ISOTSEAL is built on bulk-flow theory, calculating gas seal force coefficients based on seal flow physics. Our BHGE colleagues have developed an R interface mapping seventeen scalar inputs for the honeycomb seal experiment into the format required for ISOTSEAL. Thirteen of those inputs match up with the columns of (i.e., they are ’s); four are tuning parameters , which could not be controlled in the field.
These comprise statoric and rotoric friction coefficients and exponents . They are the friction factors of the honeycomb seal. In the turbulent-lubrication model from bulk-flow theory, the shear stress is a function of the friction coefficient and exponent through the Blasius model , where is the Reynolds number (Hirs, 1973). Applied separately for the stator () and rotor (), friction factors and must be determined empirically from experimental data. To protect BHGE’s intellectual property, but also for practical considerations, we work with friction factors coded to the unit cube.
These are treated as calibration parameters , with the goal of learning their setting via field data and ISOTSEAL simulations.
Although ISOTSEAL is fast and has a reputation for delivering outputs faithful to the underlying physics, we identified several drawbacks in our application. For some input settings ISOTSEAL fails to terminate, especially with friction factors () near the boundary of their physically meaningful ranges.111At least this is the case for the commercial version of the simulator in use at BHGE, paired with their input–mapping front-end. The R wrapper aborts the simulation and returns NA after seven seconds of execution.
For others, where a response is provided, numerical instabilities and diverging approximation are evident. Although evaluations are operationally deterministic, in the sense that providing the same input always yields the same output, the behavior can seem otherwise unpredictable in certain regimes. Even subtle numerical “jitters” of this sort can thwart conventional GP interpolation(Gramacy and Lee, 2012). As we show below, ISOTSEAL’s jitters can be extreme in some regimes. Others have commented on similar drawbacks (Vannarsdall, 2011); modern applications of ISOTSEAL may be pushing the boundaries of its engineering.
Figure 1 shows some example outputs obtained by varying one input at a time in a narrow range, while fixing the others at sensible values (first three rows); and varying all inputs in grids between two arbitrary points (fourth row). The first row in the figure shows three ideal settings: the response is a smooth function of the input over the range(s) entertained. The second row, however, which shows zoomed-in versions of the same input–response scenario, exemplifies a “staircase” or “striation” effect sometimes present at small scales. The third row shows more concerning macro-level behavior over both narrow and wide input ranges. According to BHGE’s rotordynamics experts, these “staircase” and discontinuity features could be related to tolerances imposed on first-order equilibrium and flow equations implemented in ISOTSEAL.
The last row illustrates unpredictable regime-changing behavior and gaps due to termination failure in the simulation. Particular challenges exhibited by the bottom row notwithstanding, dynamics are clearly nonstationary from a global perspective. A great example of this is in the first column of the third row, where the response is at first slowly changing and then more rapidly oscillating. In that example, the regime change is smooth. In other cases, however, as in the middle column of the bottom row, a “noisy” discontinuity separates a hill-like feature from a steadier slope. An ordinary GP model, even with a nugget deployed to smooth over noiselike features by treating them as genuine noise (Gramacy and Lee, 2012), could not accommodate such regime changes, smooth or otherwise.
Consequently, initial attempts to emulate ISOTSEAL-generated response surfaces via the canonical GP in the full (17-dimensional) input space of interest were not successful. Even with space-filling designs sized in the several thousands, pushing the limits of the bottleneck of large matrix decompositions, we were unable to adequately capture the distinct features we saw in smaller, more localized experiments. Modest reductions in the input dimensions—holding some inputs fixed—and, similarly, reductions in the width of the input domain for the remaining coordinates led to unremarkable improvement in terms of accuracy in out-of-sample predictions. Global nonstationarity, local features, numerical artifacts, and high input dimension proved to be a perfect storm. Section 3 uses those unsuccessful proof-of-concept fits as a benchmark, showing how our proposed on-site surrogate offers a far more accurate alternative, at least from a purely out-of-sample emulation perspective.
2.2 Nonlinear least-squares calibration
To obtain a crude calibration to the small amount of field data they had, our BHGE colleagues performed a nonlinear least-squares analysis. Starting in a stable part of the input space, from the perspective of ISOTSEAL behavior, they used a numerical optimizer—a Nash variant of Marquardt NLS via QR linear solver, nlfb (Nash, 2016)—to tune -values, that is, the four friction factors, based on a quadratic loss between simulated and observed output at the input training data sites .
In search for , each new -value tried by the nlfb optimizer triggered calls to ISOTSEAL, one for each row of the design parameters , much in the style of Higdon et al. (2004) but without estimating a bias correction. To cope with failed ISOTSEAL runs, nlfb
monitors the rate of missing values in evaluations. When the missingness rate is below a threshold (e.g., 10%), a predetermined large residual value (100 on the original scale) is imputed for the missed residual to discourage convergence toward solutions nearby. Once above the threshold,nlfb reports an error message and is started afresh.
We repeated this experiment, starting instead from 100 random space-filling values in hopes of improving on our BHGE colleagues’ results with a best value at RMSE and having a strong straw man for later comparison. Because this NLS setup does not model a discrepancy between and , converged solutions have large quadratic loss, even in-sample. Among 100 restarts, two failed; and the other losses, mapped to the scale of by taking the square root, had the following distribution.
The blue/circle marks in Figure 2 show the observed residuals between field data and NLS calibrated ISOTSEAL with the best solution we obtained, , which had three out of four friction factors set at or near their limit values. This restart benefited from a serendipitous initialization, having initial RMSE of 9.219 converging to 6.605. However, Figure 2 shows that many large residuals still remain (blue/circles).
The red/crosses comparator is based on our proposed methodology and is described in subsequent sections. For comparison, and to whet the reader’s appetite, we note that the in-sample RMSE we obtained was 1.125. Out-of-sample results are provided in Section 5.3. We attribute NLS’s relatively poor performance to two features. One is its inability to compensate for biases in ISOTSEAL runs, relative to the outcome of field experiments. Another is that the solutions found were highly localized to the neighborhood of the starting configuration. A post mortem analysis revealed that this was due primarily to large missingness rates.
Although we were confident that we could improve on this methodology and obtain more accurate predictions by correcting for systematic bias between field and simulation in a Bayesian framework, it quickly became apparent that a standard, KOH-style analysis would be fraught with difficulty. In a test run, we used a space-filling design and fit a global GP emulator in the 17-dimensional space of ISOTSEAL runs thus obtained. That surrogate offered nice-looking predictive surfaces and provided posterior surfaces for calibrated friction factors substantially different from those obtained from NLS (e.g., away from the boundary), but unfortunately the surrogates were highly inaccurate out of sample, as we illustrate empirically below.
3 Local design and emulation for calibration
Failed attempts at surrogate modeling ISOTSEAL, either generally or for the specific purpose of calibration to field data [see Section 2.1], motivate our search for a new perspective. Local emulation has been proposed in the recent literature (Gramacy et al., 2015) as a means of circumventing large-data GP surrogate modeling for calibration, leveraging the important insight that surrogate evaluation is required only at field data locations , of which we have relatively few (). But in that context the input dimension was small, and here we are faced with the added challenges of numerical instability, nonstationary dynamics, and missing data. In this section we port that idea to our setting of on-site surrogates, leveraging relatively cheap ISOTSEAL simulation, while mitigating problems of big , big , and challenging simulator dynamics.
3.1 On-site surrogates
On-site surrogates reduce a -dimensional problem into a -dimensional problem by building as many surrogates as there are field data observations, . Let denote a generic design variable setting and
a generic tuning vector (e.g., friction factor inISOTSEAL). Then the mapping from one big surrogate to many smaller ones may be conceptualized by the following chart:
That is, rather than building one big emulator for the entire -dimensional input space , we instead train separate emulators focused on each site where field data has been collected. In this way, OSSs are a divide-and-conquer scheme that swap joint modeling in a large -space, where design coverage and modeling fidelity could at best be thin, for many smaller models in which, separately, ample coverage is attainable with modestly sized design in -space only. Fitting and simulation can be performed in parallel, since the calculations for each field data site , are both operationally and statistically independent. Nonstationary modeling is implicit, since each surrogate focuses on a different part of the input space. If simulations are erratic for some , say, the OSS indexed by can compensate by smoothing over with nonzero nuggets (Gramacy and Lee, 2012). If dynamics are well behaved for other sites , those OSSs can interpolate after the typical fashion.
In some ways, OSSs are akin to an in situ emulator (Gul, 2016). Whereas the in situ emulator is tailored to uncertainty quantification around a nominal input setting, however, the OSS is applied in multitude for each element of in the calibration setting. Another key distinction is the role of design in building the elements of the OSS. Here we propose separate designs at each to learn each , rather than working with design subsets. We use maximin Latin hypercube sample (hereafter maximin LHS) designs because of their space-filling and uniform margin properties (see, e.g., Morris and Mitchell, 1995), via maximinLHS in the R package lhs (Carnell, 2018).
Specifically, at each of the field data sites, we create novel 1000-run maximin LHS designs for friction factors in -dimensional -space. In this way, we separately design a total of ISOTSEAL simulation runs. With about one second for evaluation (for successfully terminating runs and about seven seconds waiting to terminate a failed run), this is a manageable workload requiring about 81 core-hours, or about one day on a modern hyperthreaded multicore workstation.
Let be a vector holding the converged ISOTSEAL runs (out of the 1,000) at the site, for . is the corresponding on-site design matrix. In our ISOTSEAL experiment, where , a total of runs terminated successfully. Most sites (241) had successful runs from a complete on-site maximin LHS. Of the 51 with missing responses of varying multitudes, the smallest was .
On each site, the OSS comprises a fitted GP regression between successful on-site ISOTSEAL run outputs and . Specifically, is built by fitting a stationary zero-mean GP using a scaled and nugget-augmented separable Gaussian power exponential kernel
where is a site-specific scale parameter, is vector of site-specific lengthscales, is a nugget parameter,222Note that the nugget augmentation is applied only when and are identically indexed, i.e., on the diagonal of a symmetric covariance matrix; not simply when their values happen to coincide. and is the Kronecker delta. Denote the set of hyperparameters of the OSS as , for . Although nuggets are usually fit to smooth over noise, here we are including them to smooth over any deterministic numerical “jitters.” Other mean and covariance structures may be reasonable, so in what follows let stand in generically for the estimable quantities of each OSS. Although numerous options for inference exist, we prefer plug-in maximum likelihood estimates (MLEs) , calculated in parallel for each via L-BFGS-B (Byrd et al., 1995) using analytic derivatives via mleGPsep in the laGP package (Gramacy and Sun, 2018; Gramacy, 2016) for R. As we illustrate momentarily, this simple OSS strategy provides far more accurate emulation out-of-sample than does the best global alternative we could muster with a commensurate computational effort.
3.2 Merits of on-site surrogates
To build a suitable global GP competitor, we created an -run maximin LHS in input dimensions, fit a zero-mean GP based on a separable covariance structure (4), and estimated the 19-dimensional hyperparameters via maximum likelihood, identical to the procedure for each described for the OSSs above. We chose 8,000 runs because that demanded a comparable computational effort to the OSS setup described in Section 3.1. Although the ISOTSEAL simulation effort for 8,000 runs is far less than the 292K for the OSSs, the hyperparameter inference effort and subsequent prediction for an -sized design is commensurate with that required for our 292 size OSS calculations. Repeated matrix decompositions in likelihood and derivative calculations in search of the MLE, requiring flops for the global surrogate, represented a heavy burden even when parallelized by multi-threaded linear algebra libraries such as the Intel Math Kernel Library. Similarly threaded calculations of flops were faster even in 292 copies, in part because fewer evaluations were needed to learn hyperparameters in the lower-dimensional -space.333The OSSs learn compared to for the global analog. The latter thus demands more expensive gradient calculations. Moreover, the former generally converges to the same local optima when reinitialized, whereas the latter have many local minima due to nonstationary and locally “jittery” responses. Multiple restarts are required to mitigate the chance of finding vastly inferior local optima.
Since the OSSs were trained on a much larger corpus of simulations, it is perhaps not surprising that they provide more accurate predictions out of sample. To demonstrate that empirically, Figure 3 summarizes the results of emulation accuracy from both global surrogate and OSSs. For our calibration goal, we need accurate emulation only at locations where we have field data . Therefore we entertain out-of-sample prediction accuracy only for those sites. At each of the 292 field input sites , we design with 1,000 runs each, the same amount as the training set, via maximin LHS. In total we collected testing ISOTSEAL runs, which is fewer than we ran since some came back missing. A pair of RMSEs, based on the OSSs and global surrogates, were calculated at each site based on the testing runs located there. The distribution of these values is summarized in Figure 3. From those boxplots, one can easily see that the OSSs yield far more accurate predictions.
Figure 4 supplements those results with a window into the behavior of the OSSs, in three glimpses. The first row shows three relatively well-behaved input settings by varying two -coordinates at , and one at . In all three cases, the three dashed-red lines describing the predictive distribution (via mean, and 95% interval) completely cover the ISOTSEAL simulations in that space. Both flat (middle panel) and wavier dynamics (outer panels) are exhibited, demonstrating a degree of nonstationary flexibility. The horizontal line indicates the field data value, and in two of those cases there is a substantial discrepancy between , and for the range of -values on display. The middle row in the figure shows what happens when ISOTSEAL runs fail to converge, again via two -coordinates for one OSS, at , and one for another . Notice that failures happen more often toward the edges of -space, but not exclusively. In all three cases the extrapolations are sensible and reflect diversity in waviness (first two flatter, third one wavier) that could not be accommodated by a globally stationary model. All three have the corresponding -value within range, but only in the extrapolated regime. The last row of the figure shows how a nugget is used to smooth over bifurcating regime changes in the output from ISOTSEAL, offering a sensible compromise and commensurately inflated uncertainty in order to cope with both regimes. All three cases map to outlying RMSE values (open circles beyond the whiskers OSS boxplot) from Figure 3. Although they are among the hardest to predict out of sample, the overall magnitude of the error is small. Since the corresponding -values (horizontal lines) are far from , and in the -range under study, a substantial degree of bias correction is needed to effectively calibrate in this part of the input space.
3.3 Calibration as optimization with on-site surrogates
Even with accurate OSSs at all field data locations, Bayesian calibration can still be computationally challenging in large-scale computer experiments. In the KOH framework (1), both and a bias correcting GP , via hyperparameters , are unknown and must jointly be estimated. The size of that parameter space, using a separable Gaussian kernel (4) for , is large (19d) in our motivating honeycomb seal application. MCMC in such a high-dimensional space is fraught with computational challenges.
As an alternative to the fully Bayesian method, presented shortly in Section 4 taking advantage of a sparse matrix structure, and to serve as a smart initialization of the resulting MCMC scheme, we propose here an adaptation of Gramacy et al. (2015)’s modularized (Liu et al., 2009) calibration as optimization. Instead of sampling a full posterior distribution, and are calculated as
which explores different values of
via the resulting posterior probability of discrepancy hyperparametersapplied to a data set of residuals . Specifically, is the observed field inputs and discrepancies given a particular . The probability refers to the marginal likelihood of the GP with parameters fit to those residuals via their own “inner” derivative-based optimization routine. The object in Eq. (5) basically encodes the idea that -settings leading to better-fitting GP bias corrections are preferred. A uniform prior is a sensible default; however, we prefer independent in each coordinate as a means of regularizing the search by mildly penalizing boundary solutions, in part because we know that frictions factors at the boundaries of -space lean heavily on the surrogate as runs of ISOTSEAL fail to converge there. Of course, any genuine prior information on could be used here to further guide the calibration.
Actually, this approach is not unlike the NLS one described in Section 2.2, augmented with OSSs (rather than raw ISOTSEAL runs) and with bias correction. Instead of optimizing a least-squares criterion, our GP marginal likelihood-based loss is akin to a spatial Mahalanobis criterion (Bastos and O’Hagan, 2009). In practice, the log of the criteria in Eq. (5) can be optimized numerically with robust library methods such as “L-BFGS-B”, via optim (R Core Team, 2018), or nloptr (Ypma et al., 2017). Since the optimizations are fast but local and since the surface being optimized can have many local optima, we entertain a large set of random initializations—in parallel—in search for the best (most global) solution for and .
To economize on space, we summarize here the outcome of this approach on the honeycomb seal, alongside its fully Bayesian KOH analog. A more detailed discussion of the full Bayesian calibration is provided in Section 4 and a discussion of results in Section 5. As mentioned above, our main use of this procedure is to prime the fully Bayesian KOH MCMC. Foreshadowing somewhat, we can see from Figure 9 that the point estimates so-obtained are not much different from the maximum a posteriori (MAP) found via KOH, yet at a fraction of the computational cost. Since the MCMC is inherently serial and since our randomly initialized optimizations may proceed in parallel, we can get a good in about an hour, whereas getting a good (effective) sample size from the posterior takes about a day.
4 Fully Bayesian calibration via on-site surrogates
The approach in Section 3.3 is Bayesian in the sense that marginal likelihoods are used to estimate hyperparameters to the GP-based on-site surrogates and discrepancy , and priors are entertained for the friction factors
. However, the modularized approach to joint modeling, via residuals from (posterior) predictive quantities paired with optimization-based point inference, makes the setup a poor man’s Bayes at best. In the face of big data—large, and —such a setup may represent the only computationally tractable alternative. However, in our setting with moderate and composed of independently modeled computer experiments of moderate size (), fully Bayesian KOH-style calibration is within reach. As we show below, a careful application of partition inverse identities allows the implicit decomposition of a huge matrix via its sparse structure.
4.1 KOH setup using OSS
Our OSSs for Section 3.1 are trained via -dimensional on-site designs . Their row dimension, , depends on the proportion of ISOTSEAL runs that successfully completed. Collect these outputs of those simulations, each tacitly paired with inputs , as . The KOH framework compensates for surrogate biased computer model predictions under an unknown setting by estimating a discrepancy via field data runs observed at inputs :
stacks identical -dimensional row vectors . Under joint GP priors, for each of OSSs and , the sampling model can be characterized by the following multivariate normal (MVN) distribution.
Generally speaking, would be derived by hyperparameterized pairwise inverse distances between inputs on -space. In our OSS setup, however, it has a special structure owing to the independent surrogates fit at each , for .
Let denote the covariance matrix for the OSS, for example, following Eq. (4). This notation deliberately suppresses dependence on hyperparameters , which is a topic we table momentarily to streamline the discussion here. Similarly, . Let be the matrix of the OSS’s cross-covariances between field data locations, paired with -values, and design locations. Since the OSS is tailored to only, independent of the other , this matrix is zero except in the row. Let be a diagonal matrix holding values. Although expressed as a function of it is not actually a function of because the distance between and itself is zero. Using Eq. (4) would yield . With those definitions, we have the following:
Although is huge, being or roughly billion entries in our honeycomb setup, it is sparse, having several orders of magnitude fewer nonzero entries—about 292 million in our setup. That is still very big, too big even for sparse matrix storage and computation on most machines. Fortunately, the block diagonal structure makes it possible to work with, via more conventional libraries. Toward that end, denote by the huge upper-left block diagonal submatrix from the OSSs. Let represent the remaining (dense) lower-right block, corresponding to the bias. Abstract by and the remaining, symmetric, rows and columns on the edges. Recall that the therein are themselves sparse, comprising a single row of nonzero entries.
Before detailing in Section 4.2 how we use these blocks, we focus on the specific operations required. A fully Bayesian approach to inference for the calibration parameters via posterior involves evaluating a likelihood, whose MVN structure implies
The main computational challenges are manifest in the inverse and determinant calculations, both involving flops in addition to storage, assuming a dense representation. However, substantial savings comes not only from the sparse structure (7) of but also from the fact that only a portion—the edges—involves .
4.2 On-site surrogate decomposition
Partition inverse and determinant equations (e.g., Petersen et al., 2008) provide convenient forms for the requisite decompositions of :
In this way, an otherwise operation may instead by calculated via calculations, potentially in parallel. If some are big, then the burden could still be substantial. However, both are constant with respect to , so only one such decomposition is required, even when entertaining thousands of potential . With in our honeycomb application, these calculations require mere seconds, even in serial.
Similar tricks extend to other quantities involved in Eq. (9). Consider , which appears multiple times in original and transposed forms. We have
and is a vector holding the nonzero part of . In other words, is a matrix comprising column vectors, whose nonzero entries , are arranged in a block structure for columns . If required, each can be updated in parallel for new .
Next consider , which appears in each block of Eq. (9). is dense but is easy to compute because it is just . Recall from Eq .7 that , which requires inversion only once because it is constant in . The next part extends nicely from following Eq. (12), an diagonal matrix whose entries can be calculated alongside the , similarly parallelized over .
Combining and results gives , where “” is the Hadamard product applied columnwise to and where . More concretely,
where are scalar elements of .
The first component, , is composed of computations, constant in . Only the second component, needs to be updated with new .
In summary, OSSs can be exploited to circumvent huge matrix computations involved in likelihood evaluation (8), yielding a structure benefiting from a degree of precalculation, and from parallelization if desired. These features come on top of largely improved emulation accuracy demonstrated in Section 3.2, compared with the global alternative.
4.3 Priors and computation
As briefly described in Section 3.3, we consider two priors on , the friction factors in our motivating honeycomb example. The first is independent uniform, . The second is as a means of regularizing posterior inference. The marginal posterior for is known to sometimes concentrate on the boundaries -space, because of identifiability challenges in the KOH framework (see, e.g., Gramacy et al., 2015). Furthermore, we know that ISOTSEAL is least stable in that region. slightly discourages that boundary and commensurately elevates the posterior density of central values. This choice has the added benefit of providing better mixing in the MCMC described momentarily.
The coupled GPs involved in the KOH setup are hyperparameterized by scales, lengthscales, and nuggets as in Eq. (4). A fully Bayesian analysis would include these in the parameter space for posterior sampling, augmenting the dimension by an order of magnitude in many cases. In other words, the posterior becomes , where , for surrogate and for discrepancy, which would work out to more than thirty parameters in our honeycomb example. Because of that high dimensionality, a common simplifying tactic is to fix those at their MLE or MAP setting , found via numerical optimization. In our OSS setup, with independent surrogates, the burden of hyperparameterization is exacerbated, with being several orders of magnitude higher in dimension, over one thousand for honeycomb. This all but demands a setup where point estimates are first obtained via maximization, following the scheme outlined in Section 3.3. That leaves only for posterior sampling via . Additionally, we initialize our Monte Carlo search of the posterior with values found via Section 3.3.
Following KOH, we employ MCMC (Hastings, 1970; Gelfand and Smith, 1990) to sample from the posterior in a Metropolis-within-Gibbs fashion (see, e.g., Hoff, 2009). Each Gibbs step utilizes a marginal random-walk Gaussian proposal , , . A pilot tuning stage was used to tune the , leading to in the honeycomb example. Figures 8–9 in Section 5.2 indicate good mixing and adequate posterior exploration of the four-dimensional space of friction factors.
5 Empirical results
Before detailing the outcome of this setup on our motivating honeycomb example, we illustrate the methodology in a more controlled setting.
5.1 Illustrative example
Consider a mathematical model with three inputs , following
where is a one-dimensional field input and are two-dimensional calibration parameters. Suppose the real process follows
Mimicking the features of ISOTSEAL, suppose the computer model is unreliable in its evaluation of the mathematical model , sometimes returning NA values. Specifically, suppose the response is missing when the
input is in its upper quartile,, and , where rounds to the nearest integer. Figure 5 provides an illustration.
Each panel in the figure shows the response as a function of for a different setting of . Observe the nonstationary dynamics manifest in increasing waviness of the surface as increases. Similarly, the pattern of missingness becomes more complex for increasing . Therefore, a global surrogate would struggle on two fronts: with stationarity as well as with (nonmissing) coverage of the design in -space.
Now consider observing field realizations of as , where , under a maximin LHS in -space, and two variations on a computer experiment toward a calibrated model. The first involves a global GP surrogate fit to computer model evaluations via a maximin LHS in -space, where 33 (6.6%) came back NA. The second uses OSSs trained on maximin LHSs in -space, paired with for . Of the such simulations, 95 came back missing (4.75%). The sizes of these computer experiment designs were chosen so that the computing demands required for the global and OSS surrogates were commensurate. Counting flops, the global approach requires about , whereas the OSSs need , which can be 10-fold parallelized if desired.
Before turning to calibration, consider first the accuracy of the two surrogates. Mirroring Figure 3 for ISOTSEAL in our honeycomb example, Figure 6 shows the result of an out-of-sample comparison of otherwise identical design.
The story here is similar to the one for ISOTSEAL. Clearly, the OSSs are more accurate. They are better able to capture the nonstationarity nature of computer model nearby to the field sites.
Next, we compare calibration results from global surrogate optimization, OSSs via modularization/optimization [Section 3.3, and OSSs via full Bayes [Section 4]. In this simple toy example, uniform priors are sufficient for good performance. The first row of Figure 7 shows the distributions of converged via Eq. (5) from the optimization approach described in Section 3.3. The left panel corresponds to the lower-fidelity global surrogate and the right panel to the higher-fidelity OSSs. Converged solutions from 500 random initializations are shown. Terrain colors on the ranked log posteriors are provided to aid in visualization. The best single coordinate is indicated by the black dot. For comparison, the true value is shown as red-dashed crosshairs. Although the best values found cluster near the truth, both are sometimes fooled by a posterior ridge in another quadrant of the space.
The second row of Figure 7 shows the posterior distribution of in full (left) and zoomed-in ranges (right). Compared with the OSS-based optimization approach, the KOH analog found ’s tightly coupled around the truth.
In this simple example, posterior uncertainty is low, in part because a relatively large computer experiment could be entertained in a small input dimension. In fact all three methods worked reasonably well. However, as we entertain more realistic settings, such as the honeycomb in 17 dimensions, only the methods based on OSSs are viable computationally (assuming a relatively dense sampling of the computer model is viable).
5.2 KOH versus modularized optimization: on honeycomb
Here we return to our motivating honeycomb seal example, first providing a qualitative comparison between our two approaches based on OSSs, via modularized optimization [Section 3.3] and KOH [Section 4]. We then turn to an out-of-sample comparison, pitting the KOH framework against the initial NLS analysis. Throughout, we use a regularizing independent prior on the components of . Appendix 7 provides an analog presentation under a uniform prior, accompanied by a brief discussion.
The figure indicates clear convergence to the stationary distribution with mixing that is qualitatively quite good. The effective sample sizes (ESS) (Kass et al., 1998), marginally for all four friction factors, are sufficiently high at , , , , respectively.
Figure 8 clearly shows that the posterior is, at least marginally, far more concentrated for the first two friction factors (first row) than for the last two. For a better joint glimpse at the four-dimensional posterior distribution of , the bottom-left panels of Figure 9 show these samples via pairs of coordinates. The points are colored by a rank-transformed log-scaled posterior evaluation as a means of better visualizing the high concentrations in a cramped space. Histograms along the diagonal panels show individual margins; panels on the top-right mirror those on the bottom-left but instead show solutions found by the modular/optimal approach [Section 3.3] in 500 random restarts.
Several notable observations can be drawn from the plots in that figure. For one, consistency is high between the two approaches: KOH and modular/opt. Although the values of log posteriors evaluations are not directly comparable across the models, both agree on most probable values (black dots in the off-diagonal panels). A diversity of solutions from the optimization-based approach indicates that the solver struggles to navigate the log posterior surface but usually finds estimates that are in the right ballpark. The full posterior distribution via KOH indicates that the first two friction factors are well pinned-down by the posterior. However, posterior concentration is more diffuse for the latter two. A complicated correlation structure is evident in .
A similar suite of results under an independent uniform prior is provided in Appendix 7. The story there is more or less similar, except that the posterior sampling concentrates more heavily on the boundary of -space for all four parameters. Considering that we know our ISOTSEAL simulator is less reliable in those regimes, leading to far more missing values and thus requiring greater degree of extrapolation from our OSSs, we prefer the more stable regime (better emulation and MCMC mixing) offered by light penalization under a prior.
5.3 Out-of-sample prediction
To close the loop on our comparison to NLS from Section 2.2, particularly Figure 2 which was based in in-sample prediction, we conclude our empirical work on the honeycomb with an exercise in out-of-sample prediction. The idea is to feed , through an OSS for a new field data location and subsequently correct for bias:
Another option is to use directly, but that can be problematic with ISOTSEAL, whose dynamics are nonstationary, unstable, and unreliable. Therefore, instead we fit a new OSS based on a new on-site design to pair with . As with our training OSSs described in Section 3.1, we utilize a size maximin LHS.
To benchmark predictions obtained in this way, we designed the following leave-one-out (LOO) cross-validation (CV) experiment. Alternately excluding each field data location , we fit 292 LOO discrepancy terms via residuals and . We could build a new OSS for , treating it as a as described above, but instead it is equivalent (and computationally more thrifty) to use the we already have. Based on those calculations, point predictions are composed of
Predictions thus obtained are compared with true outputs and residuals for RMSE calculations. We note that this experiment focuses primarily on bias correction. New are not calculated for each of because that is impractical computationally. Posterior sampling for each , requiring about 20 hours for 10,000 samples, would translate into roughly 200 days of computation in full.
Figure 10 shows those LOO residuals graphically, mimicking the in-sample setup of Figure 2. Only results for via KOH MCMC are shown here since from the modular/opt framework are similar [see Section 5]. The panels in the figure indicate that the KOH discrepancy correction offers substantial improvement over NLS: in-sample NLS residuals are worse than LOO OSS Bayes results. Summarizing those residuals, KOH has an out-of-sample RMSE of , being even smaller than the in-sample NLS value reported in Section2.2. Furthermore, OSS full Bayes LOO RMSE is comparable to its in-sample analog of .
Motivated by a computer model calibration problem stemming from an application to the design of a seal used in turbines, we developed a thrifty new method to address several challenging features. Those challenges include a high-dimensional input space, local instability in computer model simulations, nonstationary simulator dynamics, and modeling for large computer experiments. Taken alone, each of these challenges has solutions that are, at least in some cases, well established in the literature. Taken together, a more deliberate and custom development was warranted. To meet those challenges, we developed the method of on-site surrogates. The construction of OSSs is motivated by the unique structure of the posterior distribution under study in the canonical Kennedy and O’Hagan calibration framework, where predictions are needed only at (usually a limited number of) field data sites, no matter how big the computer experiment is. This unique structure allowed us to map a single, potentially high-dimensional problem, into a multitude of low-dimensional ones where computation can be performed in parallel. Two variations on OSS-based calibration were entertained, one based on simple bias-corrected maximization and the other akin to the original KOH framework. Both were shown to empirically outperform simpler, yet high-powered, alternatives.
Despite its many attractive features, there is clearly much potential to refine this approach, in particular the design and modeling behind the OSSs. While simple Latin hypercube samples and GPs with exponential kernels and nuggets work well, several simple extensions could be quite powerful. The need for such extensions, along at least one avenue, is perhaps revealed by the final row of Figure 4. Those plots show bifurcating ISOTSEAL runs due to numerical instabilities. Although inflated nuggets enable smoothing over those regimes, the result is uniformly high uncertainty for all inputs rather than just near the trouble spot. The reason is that the GP formulation being used is still (locally) stationary. Specifically, the error structure is homoskedastic. Using a heteroskedastic GP instead (Binois et al., 2018), say via hetGP on CRAN (Binois and Gramacy, 2018), could offer a potential remedy. In a follow-in paper Binois et al. (2018) showed how designs for effective hetGP modeling could be built up sequentially, balancing an appropriate amount of exploration and replication in order to effectively learn signal-to-noise relationships in the data. Such an approach could represent an attractive alternative to simple LHSs in -space.
Finally, in this paper we only entertained a single output , at a single frequency, among a multitude of others and at other frequencies. In future work we plan to investigate a multiple output approach to calibration. Much work remains to assess the potential for such an approach, say via simple co-kriging (Ver Hoef and Barry, 1998) or a linear model of co-regionalization (e.g., Wackernagel, 1998). Our BHGE collaborators’ pilot study also indicated that there could potentially be input-dependent variations in the best setting of the friction factors. That is, we could be looking at a , perhaps for a subset of the coordinates of the 17-dimensional input. Whether a simple partition-based or linear scheme might be appropriate, or if something more nonparametric like (Brown and Atamturktur, 2018) is required, remains an open question.
Authors JH, RBG, and MB are grateful for support from National Science Foundation grants DMS-1521702 and DMS-1821258. JH and RBG also gratefully acknowledge funding from a DOE LAB 17-1697 via subaward from Argonne National Laboratory for SciDAC/DOE Office of Science ASCR and High Energy Physics. The work of MB is partially supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract No. DE-AC02-06CH11357. We thank Gail Pieper for her useful language editing. Finally, we wish to thank our collaborators at Baker Hughes/GE for being generous with their data, their time, and their expertise.
7 Appendix: Calibration under uniform prior
Compared with those results, the ones shown here more heavily concentrate on the boundaries of the study region. Also, somewhat more inconsistency exists between the modular/opt results and the fully Bayesian analog. The regularization effect of the leads to better numerics.
- Abramson et al. (2013) Abramson, M., Audet, C., Couture, G., Dennis, Jr., J., Le Digabel, S., and Tribes, C. (2013). “The NOMAD project.” Software available at http://www.gerad.ca/nomad.
- Ba and Joseph (2012) Ba, S. and Joseph, V. (2012). “Composite Gaussian process models for emulating expensive functions.” Annals of Applied Statistics, 6, 4, 1838–1860.
- Bastos and O’Hagan (2009) Bastos, L. and O’Hagan, A. (2009). “Diagnostics for Gaussian Process Emulators.” Technometrics, 51, 4, 425–438.
- Binois and Gramacy (2018) Binois, M. and Gramacy, R. B. (2018). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.3.
- Binois et al. (2018) Binois, M., Huang, J., Gramacy, R. B., and Ludkovski, M. (2018). “Replication or exploration? Sequential design for stochastic simulation experiments.” Technometrics, 1–44.
- Bornn et al. (2012) Bornn, L., Shaddick, G., and Zidek, J. (2012). “Modelling Nonstationary Processes Through Dimension Expansion.” J. of the American Statistical Association, 107, 497, 281–289.
- Brown and Atamturktur (2018) Brown, D. A. and Atamturktur, S. (2018). “Nonparametric Functional Calibration of Computer Models.” Statistica Sinica, 28, 721–742.
- Byrd et al. (1995) Byrd, R., Qiu, P., Nocedal, J., , and Zhu, C. (1995). “A Limited Memory Algorithm for Bound Constrained Optimization.” Journal on Scientific Computing, 16, 5, 1190–1208.
- Carnell (2018) Carnell, R. (2018). lhs: Latin Hypercube Samples. R package version 0.16.
- D’Souza and Childs (2002) D’Souza, R. J. and Childs, D. W. (2002). “A Comparison of Rotordynamic-Coefficient Predictions for Annular Honeycomb Gas Seals Using Three Different Friction-Factor Models.” Journal of Tribology, 124, 3, 524–529.
- Gelfand and Smith (1990) Gelfand, A. E. and Smith, A. F. M. (1990). “Sampling-Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association, 85, 410, 398–409.
- Gramacy (2016) Gramacy, R. (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, Articles, 72, 1, 1–46.
- Gramacy and Lee (2012) Gramacy, R. and Lee, H. (2012). “Cases for the nugget in modeling computer experiments.” Statistics and Computing, 22, 3, 713–722.
- Gramacy and Apley (2015) Gramacy, R. B. and Apley, D. W. (2015). “Local Gaussian process approximation for large computer experiments.” Journal of Computational and Graphical Statistics, 24, 2, 561–578.
- Gramacy et al. (2015) Gramacy, R. B., Bingham, D., Holloway, J. P., Grosskopf, M. J., Kuranz, C. C., Rutter, E., Trantham, M., and Drake, R. P. (2015). “Calibrating a large computer experiment simulating radiative shock hydrodynamics.” Annals of Applied Statistics, 9, 3, 1141–1168.
- Gramacy and Sun (2018) Gramacy, R. B. and Sun, F. (2018). laGP: Local approximate Gaussian process regression. R package version 1.5-2.
- Gul (2016) Gul, E. (2016). “Designs for computer experiments and uncertainty quantification.” Available online: http://hdl.handle.net/1853/55578.
- Hastings (1970) Hastings, W. K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika, 57, 1, 97–109.
- Higdon et al. (2004) Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2004). “Combining field data and computer simulations for calibration and prediction.” SIAM Journal on Scientific Computing, 26, 2, 448–466.
- Hirs (1973) Hirs, G. G. (1973). “A Bulk-Flow Theory for Turbulence in Lubricant Films.” Journal of Lubrication Technology, 95, 2, 137–145.
Hoff, P. D. (2009).
A first course in Bayesian statistical methods. Springer Science & Business Media.
- Kass et al. (1998) Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998). “Markov Chain Monte Carlo in Practice: A Roundtable Discussion.” The American Statistician, 52, 2, 93–100.
- Kennedy and O’Hagan (2001) Kennedy, M. C. and O’Hagan, A. (2001). “Bayesian calibration of computer models.” Journal of the Royal Statistical Society, Series B, 63, 3, 425–464.
- Kleynhans and Childs (1997) Kleynhans, G. and Childs, D. (1997). “The Acoustic Influence of Cell Depth on the Rotordynamic Characteristics of Smooth-Rotor/Honeycomb-Stator Annular Gas Seals.” ASME Journal of Engineering for Gas Turbines and Power, 949–957.
- Le Digabel (2011) Le Digabel, S. (2011). “Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm.” ACM Transactions on Mathematical Software, 37, 4, 44:1–44:15.
- Liu et al. (2009) Liu, F., Bayarri, M., and Berger, J. (2009). “Modularization in Bayesian analysis, with emphasis on analysis of computer models.” Bayesian Analysis, 4, 1, 119–150.
- Morris and Mitchell (1995) Morris, M. D. and Mitchell, T. J. (1995). “Exploratory designs for computational experiments.” Journal of Statistical Planning and Inference, 43, 381–402.
- Nash (2016) Nash, J. C. (2016). nlmrt: Functions for Nonlinear Least Squares Solutions. R package version 2016.3.2.
- Petersen et al. (2008) Petersen, K. B., Pedersen, M. S., et al. (2008). “The matrix cookbook.” Technical University of Denmark, 7, 15.
- Plumlee (2017) Plumlee, M. (2017). “Bayesian calibration of inexact computer models.” Journal of the American Statistical Association, 112, 519, 1274–1285.
- R Core Team (2018) R Core Team (2018). optim: General-purpose Optimization. R package version 3.6.0.
Rasmussen and Williams (2006)
Rasmussen, C. E. and Williams, C. (2006).
Gaussian Processes for Machine Learning. MIT Press.
- Sacks et al. (1989) Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). “Design and analysis of computer experiments.” Statistical science, 4, 409–423.
- Santner et al. (2003) Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The design and analysis of computer experiments. Springer Science & Business Media.
- Tuo and Wu (2015) Tuo, R. and Wu, C. F. J. (2015). “Efficient calibration for imperfect computer models.” Ann. Statist., 43, 6, 2331–2352.
- Vannarsdall (2011) Vannarsdall, M. L. (2011). “Measured Results for a New Hole-Pattern Annular Gas Seal Incorporating Larger Diameter Holes, Comparisons to Results for a Traditional Hole-Pattern Seal and Predictions.” Available electronically from http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9759.
- Ver Hoef and Barry (1998) Ver Hoef, J. and Barry, R. P. (1998). “Constructing and Fitting Models for Cokriging and Multivariate Spatial Prediction.” Journal of Statistical Planning and Inference, 69, 275–294.
- Wackernagel (1998) Wackernagel, H. (1998). Multivariate Geostatistics. New York: Springer.
- Ypma et al. (2017) Ypma, J., Borchers, H. W., and Eddelbuettel, D. (2017). nloptr: R interface to NLopt. R package version 1.0.4.