Bayesian Calibration of Computer Models with Informative Failures

06/13/2020 ∙ by Peter W. Marcy, et al. ∙ 0

There are many practical difficulties in the calibration of computer models to experimental data. One such complication is the fact that certain combinations of the calibration inputs can cause the code to output data lacking fundamental properties, or even to produce no output at all. In many cases the researchers want or need to exclude the possibility of these "failures" within their analyses. We propose a Bayesian (meta-)model in which the posterior distribution for the calibration parameters naturally excludes regions of the input space corresponding to failed runs. That is, we define a statistical selection model to rigorously couple the disjoint problems of binary classification and computer model calibration. We demonstrate our methodology using data from a carbon capture experiment in which the numerics of the computational fluid dynamics are prone to instability.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Computer codes (also called simulators) are often used in the physical sciences to model complex systems and phenomena. Beginning with Sacks et al. (1989), the statistical design/analysis of experiments involving simulators has become so ubiquitous that this subdiscipline needs little introduction; uninitiated readers may consult the plethora of standard references within Fang et al. (2006), Levy and Steinberg (2010), and Santner et al. (2018).

One typical inferential goal involving computer models is calibration

: when actual experimental/field data of a system are available these measurements can be used to infer the most probable values of unobservable input quantities necessary to run the computer model. Historically, calibrating computer models to data took the form of nonlinear regression

(Seber and Wild, 1989, e.g.) in which a least-squares criterion was minimized, an early example being Box and Coutie (1956). Modern calibration analyses often feature fast approximations of the computer model and can go by the names “tuning” (Park, 1991; Cox et al., 2001) or “history-matching” (Craig et al., 1997). The latter takes a Bayes-linear approach and explicitly incorporates a discrepancy term to acknowledge the imperfection of the computer model. Bayesian calibration featuring Gaussian process (GP) surrogates (emulators) and model discrepancy can be found in Kennedy and O’Hagan (2001), Higdon et al. (2008), and a glut of subsequent modifications/applications by many authors.

One impediment to calibration, or even emulation of the computer model, is the possibility that certain combinations of the calibration inputs will result in output absent some necessary property : in other words, a failure. In certain cases the scientists consider these failures data; as such, the parameter combinations leading to failures are deemed implausible, to use the language of history-matching. When this is true, the failed runs should be used to exclude or down-weight those regions of the calibration input space. Without doing so it is possible that these regions will be admitted to the calibration posterior via an emulator which is forced to extrapolate where there is no data. A smaller parameter space could be desirable in its own right, but especially when the calibrated model is to be used for further purposes such as uncertainty analysis (Oakley and O’Hagan, 2002).

Here is some concrete motivation by way of examples. First consider the Allometrically Constrained Growth and Carbon Allocation model of Gemoets et al. (2013) and Fell et al. (2018). Calibrating this code to a particular tree using radius or height is bedeviled by the fact that the model will sometimes report the tree as being dead when the real tree is very much alive (Property

). As the code behaves this way for particular and a priori unknown calibration input configurations, it would be all too reasonable to estimate and utilize the

-constraint on the parameter space. (If the constraint were known, this could be easily accommodated by the implausibility metric for a history-match.) Another example is the COMPAS (Compact Object Mergers: Population Astrophysics and Statistics) model of binary black hole formation (Barrett et al., 2016). Calibrating this code to existing binary black holes using chirp mass only makes sense when (:) a black hole is actually formed– this is not guaranteed for all inputs to the code. The motivation behind our work is the calibration of data to a particular computational fluid dynamics (CFD) model. The code is prone to numerical instability for certain parts of the input space, but its issues are not easily addressed as it is specialized software. Property in this case is the very presence of output. Those regions of parameter space yielding no output pragmatically needed to be ruled out for further upscaling and uncertainty analyses because these were built around the same code.

The outline of the paper is as follows. In Sections 2 and 3 we review the disparate methodologies of calibration and classification upon which our method will build. In Section 4 we combine them. The new methodology will be demonstrated using the CFD code mentioned above with data from a carbon capture experiment in Section 5.

2 Computer Model Calibration Using Gaussian Processes

In this section we review the details of univariate calibration using the framework of Kennedy and O’Hagan (2001). The restriction to a single output is purely for expositional clarity– because our method uses only the input space to couple the two sources of information, it is directly applicable to the case of multivariate output.

The inputs to the computer model are partitioned into two groups. The calibration inputs are quantities that can be changed within the code, but ideally would be fixed at one setting that represents the true state of nature. These calibration parameters represent physical quantities of interest that cannot directly be measured in the real system. The second group contains the variable inputs that correspond to observable or controllable experimental settings. The computer code is executed times to produce simulated data . In addition there are measurements of the true physical process :

where are observational errors. Note that this experimental data is the result of conditions which may differ from those of the simulated data (hence the “” notation above).

The end goal is to use the simulated data to find the values of most consistent with the experimental data. To make this inference the following relationship is assumed:

That is, the real physical system can be decomposed as the simulated system at a “true” setting , plus a systematic discrepancy which depends only on the variable inputs. Kennedy and O’Hagan (2001) use independent GPs to model the unknown functions and so that uncertainty at unobserved input configurations can be accounted for (e.g., O’Hagan and Kingman, 1978; Sacks et al., 1989; Santner et al., 2018). Specifically, is assumed to have mean and covariance kernel ; the discrepancy is assumed to have mean and covariance kernel . In this work, we take the kernels of the independent processes to have the form:

where is a one-dimensional correlation function (such as a squared-exponential or Matérn with specified smoothness) and are correlation length parameters. Under these assumptions, together with the relationships


the joint likelihood for all of the observations is obtained readily. Let denote all the experimental data and denote the collected simulation runs. The distribution for all the observed data is then multivariate normal (MVN):

cov (2)

Above, is the matrix whose -entry is

; other blocks within the covariance are defined similarly. If the non-calibration parameters are collected in the vector

, the posterior distribution for all parameters is


and a priori independence is almost always assumed: . The notation is shorthand for , the density or mass function of the variable , and represents the likelihood function which in this case is proportional to the MVN pdf defined by (2). We have partitioned the parameter space into emulator/discrepancy and calibration parameters to make later sections more transparent.

The posterior given in (3

) can be explored using Markov Chain Monte Carlo (MCMC) with Metropolis-Hastings updates for some of the parameters and Gibbs steps for the remaining. In updating the emulator/discrepancy parameters, the use of Gibbs versus Metropolis will depend on the prior specification. For all analyses in this paper we use the following prior distributions (which assume that the inputs within

and have all been scaled to [0,1], and that the output

has been rescaled by its sample standard deviation):

The use of uniform priors on the square root of the variances was inspired by

Gelman (2006), and we found that this produces better mixing than a typical inverse-gamma specification. All correlation functions in this work are parameterized such that a correlation length of implies a very smooth process; the upper bound of 5.0 for the uniform priors on the ’s works well to keep the covariance of the likelihood well-conditioned without sacrificing the emulator’s predictive ability. Finally, the priors for the calibration parameters will depend on the particular application. Therefore, under this scheme of prior distributions, and can be updated by Gibbs, and the rest of the parameters by Metropolis steps.

2.1 Illustrative Example (intro.)

We now introduce a simple “toy” example with which to illustrate our methodology in stages.

Figure 1: A simple calibration problem with failed “simulator” runs. The full data is given in the left panel, and the input space showing the failed runs is presented in the right panel.
Figure 2: Response surface of completed “simulator” runs for the illustrative example. The experimental data (blue points) are plotted along the slice and are connected by line segments. The green lines highlight the true region of -space for which all runs are successful.

The data are obtained using (1), where the “simulator” in this case is actually a realization of a zero-mean GP on the domain having covariance kernel

with parameter values , , and . The discrepancy function is taken to be and the experimental error variance to be . The hypothetical simulator runs are obtained on a grid of the unit square, but we suppose that 30 of these points result in failures of some kind (so ). The experimental data of the physical system are obtained at the same values as the simulated data (, N=18) using the true value of set to .

The successful simulations are shown in the left panel of Figure 1 (black points/curves) together with the experimental data (blue points); the pattern of the failed runs are shown in the right panel. A different view of both datasets is presented in Figure 2; the response surface is presented with missing regions corresponding to failed simulation runs. As a visual aid, the experimental data are connected by line segments and plotted at (of course, the goal of calibration is to infer this location). The green lines indicate the true region of the calibration input space for which all runs are successful.

We set a uniform prior on and use a fairly standard MCMC routine to explore the posterior distribution of all parameters . A total of 120,000 iterations (20,000 of those for burn-in) were completed, and every third draw was recorded. Diagnostics for all parameters within the thinned chain were deemed adequate. A plot of the estimated posterior density for is given by the black curve in Figure 4. Here it can be seen that the tails of this distribution give positive probability to regions of that give rise to simulator failures for certain values within the -space. If this calibrated model were to be used in a subsequent uncertainty analysis, the additional support would add superfluous variation to the quantity of interest.

3 A Binary Random Field Model for Success/Failure Data

In addition to the completed computer model runs , we assume that there is also a set of input configurations that result in model failures; let these be denoted , where . Latent variable approaches can be used to model the binary outcome of success or failure at arbitrary inputs given the observed outcomes.

In this section we describe Bayesian estimation and prediction under one such model: the binary or “clipped” random field. Within this model, a response variable realization is 1 if and only if a continuous latent random field is positive. This differs from a more standard logistic/probit approach in which a transformed random field (or regression function) determines the non-binary probability that a response value will be 1

(Chib and Greenberg, 1998; Neal, 1999; Rasmussen and Williams, 2006) instead of the outcome itself. The binary random field model will lead to sharper inference and predictions when the successes/failures are strongly correlated in space. We will indicate when and how this model can be extended.

3.1 Bayesian Implementation

We define an indicator random variable

to be 1 if and only if the

th run is successful, and this is tied to a latent continuous random variable

through the relationship . Thus for the given data, is realized to be 1 in the first positions (and 0 for the remaining ) as a result of a latent realization with positive values only in the first entries.

In the Bayesian framework, the introduction of the latent is also called data augmentation (Tanner and Wong, 1987)

and can readily facilitate inference for parameters of certain models. This is indeed the case for binary data when the latent variables are normally distributed because the joint posterior of model parameters and latent variables suggests the use of Gibbs sampling

(Albert and Chib, 1993; Chib and Greenberg, 1998; De Oliveira, 2000). This reasoning is detailed below.

By construction, the model for the observed binary data given the latent values is a point-mass


If the latent data are modeled as depending on parameters , then the joint posterior for all unobservables is proportional to . Furthermore, when the latent data model is MVN, then (4) implies that the full conditional distribution is a truncated MVN and can be sampled with a Gibbs algorithm (Geweke, 1991; Robert, 1995)

. We use a model of this form because it readily admits a posterior predictive process which is also MVN and hence easy to sample from; a tractable predictive distribution is crucial to our methodology, as Section

4 will demonstrate.

We suppose that the latent process defining the space of failures is continuous and well described by a GP– in other words, that there is spatial correlation to the successes and failures. Let , and also suppose that has mean and covariance kernel with form


where are correlation length parameters. Note here that for identifiability, the variance of the process is set to unity (De Oliveira, 2000). Therefore, in this latent model the parameter vector is , and the distribution of is then MVN:


and is the design matrix row-augmented by the experimental and calibration input settings for the runs that fail.

The posterior for all parameters in the classification problem combines the likelihood in (4), the latent prior of (6

), and our choice of priors for hyperparameters:


Within the MCMC, the mean parameter can be updated with a Gibbs step and with a Metropolis step. The elements of the latent vector are updated using the full conditional


We have written the usual conditional distributions in an equivalent form featuring the precision matrix to show that need be inverted only once (Rue and Held, 2005). All the necessary Gibbs steps can be accomplished through, e.g., the R package tmvtnorm (Wilhelm and Manjunath, 2015). All told, the posterior sampling just described is essentially the routine of De Oliveira (2000) Section 3.1, which is contained in steps of Algorithm 1 below.

The motivation behind the use of the latent GP model was the assumption that the failure mechanism is (almost surely) continuous on the latent space. Evidence against this stipulation comes in the form of at least one very small value amongst the estimated correlation lengths, and more conspicuously, poor leave-one-out cross validation (LOOCV) classification rates. This may point the modeler to a specification that accounts for noise, such as probit GP model where the indicator functions of (4) are replaced with . Subsequent equations would have to be adjusted accordingly. Another option is a probit regression model (Albert and Chib, 1993), where the latent mean is expanded from a constant to a function which is linear in , and wherein the covariance model is simplified to errors. This would greatly simplify the computations relating to (7), (8), and (9) below, but we avoided probit regression due to the undesirable model selection surrounding the choice of mean function . In moderate dimensional input spaces, determining a functional basis in to adequately describe potential curvature in the latent process can be problematic, especially with relatively sparse data. This was the case even in the low-dimensional illustrative example, whose pattern of failed runs was given in the right panel of Figure 1.

3.2 Prediction

The posterior predictive distribution for the latent process at new locations given in the rows of the matrix is denoted and has distribution


This immediately yields predictions on the binary scale: , and this type of binary prediction can be used for predictive model assessments. For example, as the MCMC sampling progresses, leave-one-out cross validation conditional upon the hyperparameters can be used to assess goodness-of-fit. The th prediction is , where is obtained from the single inversion of the full used in the updates of the vector. After the MCMC is finished, posterior summaries of the LOOCV classification rate can be calculated for the assessment of latent model adequacy. Making posterior predictions for LOOCV can be seen as repeating step of Algorithm 1 a total of times.

In addition to the predictions for cross-validation, realizations along the “slice” will be of interest, and this motivates two cases. The posterior predictive latent process can be:

  1. (C1)  insensitive to :  

    (C2)     sensitive to :  the covariance kernel is as given above in (5).

Case (C1) will be highly desirable because prediction along the slice amounts to a prediction at a single point. This fact then strongly encourages model selection on the -space: the latent GP model using the kernel of (C1) can be built first and the posterior of LOOCV classification rate can be used to assess this assumption. Prediction at either for (C1) or for (C2) is also a subcase of in Algorithm 1.

Given and

0:   Update hyperparameters of the latent process using a Gibbs step for and Metropolis-Hastings update for using (7)
0:   Update latent values with a sequence of Gibbs steps according to (8): sample    sample    sample
0:   Sample from the posterior predictive distribution (9) at locations given in the rows of some matrix .
Algorithm 1 Metropolis-within-Gibbs algorithm for Bayesian classification and prediction using a latent Gaussian process

Before returning to the illustrative example, we note that the modeler may be interested in better estimating the boundary of the failure-space from sequential batches of simulation runs. Adaptively refining the estimate of the contour under the latent GP model can be accomplished in a principled way using the Expected Improvement procedure of Ranjan et al. (2008) (pg. 530).

3.3 Illustrative Example (cont.)

In this example the failure mechanism is spatially correlated in and implying case (C2) and the need for two correlation length parameters and within the covariance structure:

The use of the Matérn kernel with smoothness was motivated by the tendency for the covariance matrix to become ill-conditioned.

To fit the latent GP model, steps and of Algorithm 1 were performed within an MCMC of 120,000 iterations. LOOCV predictions were performed every 200 iterations, and these resulted in a median posterior classification rate of 1.00 with a 95% credible set of (0.993, 1.00). Additional predictions were made on a dense grid of runs, and the contours for two such realizations are shown in Figure 3. Each pair of green lines in the figure enclose the region for which the latent process is positive for all values of .

Figure 3: Contours of for two realizations (solid/dashed red curves) of the posterior predictive process, and corresponding “admissible” regions whose bounds are defined by the solid/dashed green lines.

4 Calibration with Failed Simulator Runs

After investigating the pattern of simulator failures and the fitting/checking the statistical model for the latent failure mechanism, the results must be discussed with the subject matter experts who provided the simulation data in order to make sense of the failed runs. It could be, for example, that it is actually an extreme combination of variable inputs (outside the range of conditions where the experimental data was obtained) is leading to output absent Property . Or, in the case of altogether missing output, it may be that the code’s numerical solvers are not converging within some configured time limit which is altogether too small. In these two instances it would not be reasonable to exclude regions of the calibration input space. Along the lines of the second instance, Huang et al. (2020) reasoned that they should allow regions where their ISOTSEAL code failed and that they could trust their emulator to extrapolate. Another sensible example of ignoring computer model failures comes from the spot welding example of Bayarri et al. (2007) (Table 3). Even though a high proportion of the runs failed (17/52), the absolute number is small and the failures exhibit no visible pattern as a function of their two ’s and one ; hence these seem to provide no real insight.

It is important to note that whether or not the failures are meaningful (i.e. deemed “data”), if they depend on

, then two calibration analyses can be conducted regardless: one ignoring the failed runs, and one featuring the methodology of this section to utilize them. The GP emulator within the first scenario will have to interpolate or extrapolate for regions of

-space containing failures, potentially leading to the admission of these regions into the posterior distribution for . This might actually be desirable in situations where the experts believe the code failures themselves are a form of discrepancy from reality. But by comparing the posteriors from each analysis, the researchers will not only get some sense of how much the extrapolation changes the answers, but might also gain additional insight into the code.

4.1 Model

The basic idea of our approach is to use a carefully chosen weighted distribution (Patil and Rao, 1977; Bayarri and DeGroot, 1987; Patil, 2002) as a prior on the calibration parameters


where is a non-negative weight function depending on the binary data and additional parameters via the latent function , and is the prior that would otherwise be used in a typical calibration analysis. This explicitly couples calibration to classification. The approach we advocate is to use a selection model (Bayarri and DeGroot, 1987, 1992) on an unknown set that is determined by the model failures:


In the above construction, the admissible set does not (and cannot) depend on , but this stands to reason: if corresponds to a true state of nature, then it should not depend on the specific configuration of experimental settings that were used/observed. This reasoning further suggests sets of the form


for a given tolerance probability . A particular value of will be said to be admissible if it is an element of . When it is non-negative, the quantity acts as a form of discrepancy. It acknowledges a mismatch between observed and “true” failures in that while there may be some failed runs expected along the slice , can still be considered admissible.

As a concrete example, temporarily assume that the latent classifier function

is known and deterministic. The set in (12) then becomes the collection of all such that


for choice of and measure on the space of variable inputs. When


a particular is admissible if and only if the minimum of the classifier function along the slice is positive. The problem of checking admissibility is as such reduced to the problem of function minimization. The reasoning extends to the case when the classifier function is unknown/random given next.

Let be a realization of a posterior predictive GP: , such as the one given in (9). The conditional prior on the calibration parameters using (10) and (11), while assuming the conditions in (14) is


Case (C1) of Section 3.2 reduces the indicator to – i.e., evaluating the criterion at a single point. Case (C2) will necessitate a practical compromise due to the fact that the GP cannot be realized at all points in along . The simplest and most natural approximation is , where for some large space-filling design . In either case, uncertainty in the admissible set will be derived from the stochastic nature of as well as from the uncertainty in the hyperparameters themselves.

On a final note, we return to the discussion at the end of Section 3.1 and admit the possibility of a probit regression model for the classifier function. Given the regression coefficients , a realization at the appropriate slice of the posterior predictive is MVN with noise and hence discontinuous everywhere. So even though the practical check of admissibility (12) on a large design is straightforward: (for pre-specified weights summing to unity), care must be taken to formally define the integral in the analogue of (13). Of course, one fix is to define the admissibility criterion as using only the continuous mean of the latent process


This can allow for closed-form evaluations of admissibility but does not use the full posterior predictive distribution; it has stochasticity only through the regression coefficients.

4.2 Computational Details, Discussion

The full posterior of all the parameters is

  • The first term in (17) is proportional to which is the posterior for a traditional calibration model, except that a marginal prior for is replaced by a conditional prior such as that of (15).

  • The second term is the posterior predictive distribution of the latent process (9) along the slice.

  • The third term is the posterior for the latent process values and hyperparameters of the classification model (7). Here it is explicit that is assumed independent of and (further, the response values of the successful simulations and experimental data).

This clean factorization allows for a straightforward implementation using the results of previous sections. The pseudocode of the whole procedure is given in Algorithm 2.

0:   Update latent values and hyperparameters as in and resulting in and
0:   Update emulator and discrepancy parameters using (3) and the description within Section 2 resulting in
0:   Update calibration parameters with a Metropolis step: propose from distribution having density check admissibility using the indicator from predictions according to one of the following cases     (C1)       (C2)   for some large space-filling design conditional upon admissibility, set with probability min where
Algorithm 2 Metropolis-within-Gibbs algorithm for Bayesian calibration with simulator failures

It is very important to observe that the classification and calibration models can be investigated independently before the coupled analysis. The latent failure process and its hyperparameters are independent of the output values of successful simulation runs, so the classification model should be analyzed first to get some idea how useful the observed failures will be. Starting values and proposal distributions for the classification analysis can be used directly within the MCMC of the coupled model. The calibration model without the use of failed simulator runs can also be fit independently in order to get a reasonable initial proposal distributions for the combined analysis. Fitting both models separately will also allow the researcher to determine whether hyperparameters of either can be fixed, if desired.

How useful are the failed runs? Insight is provided by first considering the case of a fixed, known constraint. For clarity, and without loss of generality, one may further suppose the emulator and discrepancy hyperparameters within are known. Property is guaranteed when the full domain of is restricted to a subset (temporarily suppressing the “” subscript) implying the constrained posterior is


Furthermore, the normalizing constant of the constrained posterior, , is obviously under the full posterior. If this probability is close to 1 then the restriction to is not actually very restrictive. If given a sample , then (18) implies that is a sample from the constrained posterior. An estimate of its normalizing constant is .

In the case of a random constraint, uncertainty in the hyperparameters and predictive process for the latent classifier can be factored into an assessment of how useful the failed runs are. Consider the binary matrix whose columns correspond to draws of the calibration parameters under the traditional posterior (disregarding the failures) and whose rows correspond to draws of the latent posterior ; the entry of is 1 if is predicted as a success for realization of the latent parameters and 0 otherwise. The row-means of this matrix provide a distribution of estimates as the latent quantities vary. If there is relatively little uncertainty in this distribution and it is very close to 1, the researcher need not conduct the full coupled sampling described in Algorithm 2– an approximate sample from the constrained posterior is obtained by simply omitting values from the full sample that are not in the admissible set derived from and .

The column-means of the binary prediction matrix are arguably more informative. The mean of

is the estimated posterior probability

of being admissible (marginalizing over the latent quantities). This collection of values give a distribution on the pointwise probability of admissibility. The proportion of which are approximately zero indicate the fraction of which are almost always classified as failures– values for which the failed runs would be considered informative. On the other hand the proportion which are approximately unity indicate the fraction of full posterior values which would be admissible with or without the use of the failed simulation runs. For the analysts not willing to incorporate the admissibility criterion within their calibration routine, the collection can be used as resampling weights for the traditional posterior draws. If estimation is not the sole purpose of calibration, then the full coupling of Algorithm 2 should be used, as the updates for the discrepancy parameters (vital for prediction) will be affected by the restricted updates.

5 Examples

Here we demonstrate the methodology with the illustrative example as well as the problem that originally motivated our calibration technique.

All proposal distributions for Metropolis steps were taken to be MVN after parameter transformations. For example, the univariate marginal priors on the calibration parameters were first rescaled to [0,1] and then to via a probit transformation . All such MVN proposals had their covariances tuned adaptively according to Haario et al. (1999, 2001).

5.1 Illustrative Example (cont.)

Failures were observed at and (that is, for at least one value of ) suggesting that that the true set of admissible is contained in the interval between these two values. As we shall see, the calibration analysis incorporating failed runs will indeed produce a posterior distribution for having support , though with tails weighted according to the uncertainty in the latent failure process.

After fitting and checking the latent model, the coupled analysis was conducted; Algorithm 2 was used to obtain 120,000 draws of all parameters. The admissibility of a proposed was checked using a draw of the latent posterior predictive process on the slice with containing 50 equally spaced points. Had the entire latent process been realized, the contours would appear akin to the red curves in Figure 3. Each set of red curves would define the boundaries (green lines) of the admissible set, and a proposed would be admissible if it fell within these boundaries. There is a small probability that a realization of the latent process is positive within the naive bounds , and for that instance the admissible set is a union of intervals; this probability is smaller for longer correlation lengths and higher smoothness of the latent GP.

Figure 4: Posterior distributions for calibration parameters with and without using the failed simulations.

The posterior distribution of the calibration parameters without and with using the failures is displayed in Figure 4. In the latter case, there is zero probability given to regions that fail for some , as desired. However the distribution incorporating failed runs is not the full posterior restricted to the naive bounds because there is uncertainty associated with these bounds. In fact, the probability of admissibility (under the posterior for the latent parameters) ranges from about 0.22 to 0.61; to contrast, the point estimate of this value using the naive bounds is nearly 0.7. Regarding pointwise admissibility, 37% of the samples from the full posterior were always classified as failures whereas 20% were always predicted to be successes.

For this toy problem the true calibration parameter and traditional posterior mean/ mode/ median fell squarely within the admissible region, but this need not be. In fact, were the inclusion of failures to rule out a region having high probability under the usual calibration case, this might challenge the researchers’ understanding of the numerical model or physical system. It would also highlight the fact the most probable configurations of parameters were inferred by extrapolation of the response surface to regions which had no successful runs. As such this would be an interesting and important scenario.

5.2 MFIX CFD model

We conclude this paper by applying the methodology to the example that motivated it.

Output  :
: knot 0.975 of the functional breakthrough curve
Experimental Conditions (Variable Inputs)  : Range/Prior
: gas inflow rate (std. L/min) [15.0, 30.0]
: partial pressure of CO (%) [10.0, 20.0]
: coil temperature (C) [39.0, 81.5]
: gas inflow temperature (C) [23.6, 38.4]
CFD Model Parameters (Calibration Inputs)  :
dry adsorption
  : formation enthalpy (J/mol)                   trNorm
  : formation entropy (J/molK)                 trNorm
  : activation energy (J/mol)                       Unif
  : log pre-exponential factor (unitless)      Unif
water physisorption
  : formation enthalpy                                 Unif
  : formation entropy                                   Unif
  : activation energy                                    Unif
  : log pre-exponential factor                       Unif
wet reaction
  : formation enthalpy                                 Unif
: formation entropy                                   Unif
: activation energy                                    Unif
: log pre-exponential factor                       Unif
: particle size (m) ssBeta(4.5, 3.3; 108, 125)
: effective amine proportion when fresh (unitless) trNorm
Table 1: Summary of inputs, outputs, and CFD calibration parameters. Here, trNorm is a N distribution truncated to the interval , and ssBeta is a Beta distribution shifted and scaled to the interval .

The Carbon Capture Simulation Initiative (CCSI, a partnership between Department of Energy laboratories, academic institutions, and industry) has been developing state-of-the-art computational tools to assist in the development of technologies that capture CO from coal-burning power plants. One technology investigated was an adsorber system with a bubbling fluidized bed of 32D1 sorbent (Lane et al., 2014; Spenik et al., 2015). In this system, a mixture of gases that includes CO flows up through the bed of solid sorbent particles and together they form a fluid-like state. The increased mixing between gases and solids better facilitates the adsorption of CO from the gas stream, that is, it increases the adhesion of the CO atoms to the surface of the sorbent particles.

CCSI’s investigation of this carbon capture apparatus was done sequentially using a validation hierarchy (Lai et al., 2016). Experiments of increasing complexity were formulated to isolate subsystems of the entire multiphase flow process; computer models of each idealized system were also formulated such that each could be calibrated to its corresponding physical experiment. In particular we use data from just one of these validation exercises: the case of hot-reacting flow [Sec. 6.3 of Lai et al. (2016)].

The experimental data were obtained by varying four quantities: gas inflow rate, partial pressure of CO, coil temperature, and gas inflow temperature. A Latin Hypercube design of size 52, with 19 points replicated (for a total of points) was used to cover the variable input space; the bounds of 4-dimensional cube can be found in Table 1). Each experiment produced a breakthrough curve describing the CO adsorption over time, and this monotonic functional output could very accurately be captured by five points: the -intercept and the time until 2.5, 25, 50, and 75 percent reductions from the intercept (initial adsorption) via a monotone cubic log-spline. The second of these five outputs (knot 0.975) contained a large percent of the total variation, so we restrict our analysis to this single quantity.

The numerical representation of the experimental system was a computational fluid dynamics (CFD) model implemented within the MFIX (Multiphase Flow with Interphase eXchanges) software (Syamlal et al., 1993). MFIX solves the governing equations for conservations of mass, momentum, energy, and species subject to the boundary and initial conditions in multiphase flow.

The physics of 32D1 reactive flow involve multiple mechanisms including the hydrodynamics of multiphase flow, the transfer of heat, and the reactions between chemical species within the mixed system of the fluidized bed. Three main chemical reactions occur when 32D1 adsorbs carbon dioxide: the reaction of CO with the impregnated amine to form carbamate (dry adsorption, 19), the physical adsorption of HO to the sorbent (water “physisorption”, 20), and the reaction of CO, physisorbed HO, and amine to form bicarbonate (wet reaction, 21). The site fractions of carbamate anion, adsorbed water, and bicarbonate are denoted , , and , respectively.


Above, is the temperature of the reacting system and

is the gas constant. Each ordinary differential equation within the coupled system depends on four parameters,

and the subscripts “”, “”, and “” indicate the appropriate reaction (19, 20, 21). The , , and functions within the rate equations have relatively simple forms but are excluded because of the additional definitions required. Two additional parameters were included for calibration, one related to effective particle size of the sorbent and one to pertaining to amine degradation. The expanded formulation of the physics and a more detailed account of the parameters can be found in Lai et al. (2016).

Figure 5: The successes/failures (black dots / red ’s) versus three calibration inputs of the MFIX CFD code.

A total of simulations were run, but of these failed to converge (implying ). With this large percentage of failed runs the researchers wanted to utilize this information, if possible. This was largely due to necessity as subsequent uncertainty analyses would also be based around MFIX– a code whose numerical solvers are not easily modified. In general, just because a code fails to produce output it does not imply that there is not some numerical regime which could produce reasonable answers at the given input settings. This is why the simulation runs must be discussed with the subject-matter experts; further, it should be decided whether or not calibration should be allowed to include regions where the emulator is forced to extrapolate.

The chemists had indicated that extreme reaction rates would be numerically problematic (and likely to produce unrealistic output even if convergent), but could not provide an explicit relationship between calibration parameter configurations and success/failure. Though the bounds within the prior distributions (Table 1) were a good faith effort to encapsulate previous knowledge of the physics and failure-space boundaries, they could not completely preclude troublesome combinations. Indeed, a number of projection-based exploratory plots of the failures were studied but clear patterns did not emerge, some evidence that three or more of the parameters were jointly causing problems. An exception is found in Figure 5 where it can be seen that model failures frequently result from low and high (i.e. when the dry adsorption occurs too quickly, in a relative sense). As for the low values of producing instabilities, this was not known a priori but it was concluded that the lower bound for water physisorption activation energy was likely too pessimistic.

The latent Bayesian model under case (C1) (no dependence of the failures upon the experimental conditions) and Squared-Exponential latent correlation was run for 200,000 iterations, and after a short burn-in, the chains showed no evidence of non-convergence. Every 200 iterations LOOCV was performed resulting in a classification rate between 80.0% and 98.3% with a median of 91.3%. Given the dimension of calibration input space, these rates were deemed adequate enough to proceed with the coupled calibration under (C1).

Calibrations were conducted without and with the failed simulations. (It should be noted that the results of the analysis without using the failed runs differ from those of Lai et al. (2016) because different models and data were used– the current work used only one of the five outputs and a GP model instead of a Bayesian Smoothing Spline ANOVA model.) The priors for the 14 calibration pa