Abstract
A new algorithm is developed to tackle the issue of sampling non-Gaussian model parameter posterior probability distributions that arise from solutions to Bayesian inverse problems. The algorithm aims to mitigate some of the hurdles faced by traditional Markov chain Monte Carlo (MCMC) samplers by constructing proposal probability densities that are both easy to sample and better approximations to the target density than a simple Gaussian proposal distribution would be. To achieve this, a Gaussian proposal distribution is augmented with a Gaussian process (GP) surface that helps capture nonlinearities in the log-likelihood function. To train the GP surface, an iterative approach is adopted for the optimal selection of points in parameter space. Optimality is sought by maximizing the information gain of the GP surface using a minimum number of forward model simulation runs. The accuracy of the GP-augmented surface approximation is assessed in two ways. The first consists of comparing predictions obtained from the approximate surface with those obtained by running the actual simulation model at holdout points in parameter space. The second consists of a measure based on the relative variance of sample weights obtained from sampling the approximate posterior probability distribution of the model parameters. The efficacy of this new algorithm is tested on inferring reaction rate parameters in 3-node and 6-node network toy problems, which imitate idealized reaction networks in combustion applications.
Keywords: Gaussian process regression; active learning; surrogate models; Bayesian inference; MCMC.
1 Introduction
Mathematical models are constructed to approximate physical systems, and are then used to make predictions about system behavior at a given set of inputs. This constitutes solving the forward problem. Inverse problems, on the other hand, involve using observations to make inferences about the model inputs, or even about the form of the models themselves. Posing the inverse problem in a Bayesian setting allows one to regularize any ill-posedness present, to account for any source of noise in the observations and prior uncertainty in the forward models, and to subsequently infer a posterior probability distribution for the inputs or models, as opposed to inferring a single best set of input values or a single best model [1, 2, 3]. The inferred posterior probability distribution summarizes all available information about the inputs or models, including a quantifiable measure of uncertainty that can, in turn, be propagated forward to provide a measure of uncertainty in the resulting predictions [4, 5].
For most inverse problems of interest, there is no analytical representation of the posterior probability distribution, so statistical information about the distribution is typically extracted using Markov chain Monte Carlo (MCMC) sampling techniques, which entail solving the forward problem many times. As a result, standard MCMC methods are known to struggle when the underlying posterior distribution is complex and when the forward model is computationally expensive [4, 5]. Several approaches exist for tackling these challenges; some focus on devising better sampling strategies [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], while others alleviate the cost of the forward model computations by developing reduced-order models [19, 20, 21, 22, 23, 24, 25] or cheaper surrogate representations that can (locally or globally) approximate the forward model [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]. In this work, we focus on the surrogate representation approach.
The approximation to the forward model can be either deterministic or statistical in nature. For instance, [39, 40, 41, 42, 43, 44] aimed at constructing adaptive polynomial or spline approximations of the target density using deterministic regression techniques, and utilized these approximations as proposal densities for MCMC sampling. On the other hand, studies such as [45, 46, 47, 48] aimed at constructing a probabilistic approximation of the forward model (known as an emulator), which they then used either for Bayesian optimization purposes or for conducting uncertainty and sensitivity analyses. While our underlying goal here is analogous to the former set of studies, the approach that we take is similar to the latter set in that we implement a probabilistic, kernel-based regression method to approximate the target density, and then seek to sequentially improve the approximation, as explained below in more detail.
In this paper, we adopt the surrogate model approach, aiming to construct a faithful approximation of the posterior probability density using Gaussian process regression (GPR). Given that the forward model is an expensive black-box computer simulation, we address the question of how to efficiently select data points for training the Gaussian process (GP), such that we obtain a relatively accurate surrogate model using the minimum number of training points. To this end, we take an active learning approach and iteratively build our surrogate surface. We develop a data selection criterion to decide, at each iteration stage, where in input space we should run the computer simulation next. Several previous studies have tackled a similar problem to the one presented here [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 46, 45, 67]; however, in most of these studies, the data selection criterion implemented is either not appropriate for our current purposes due to its associated computational expense, or is coupled with an unnecessarily complicated GP model. For example, in [50, 51, 52, 53, 54], the selection criteria aim to minimize the predictive error (or simply the predictive variance) of the surrogate model, but they require estimating integrals over the entire input space or computing local sensitivity derivatives, which can be computationally expensive or not even readily feasible. Rasmussen
[60] takes a similar approach to ours, in that he couples an MCMC sampler (specifically, a Hamiltonian Monte Carlo sampler) with a GP model in order to obtain potential data point candidates for training the GP. However, he offers no mathematical justification for how he arrived at the specific selection heuristic used to choose the data point candidates. Moreover, Rasmussen's algorithm seems to require computing first- and second-order covariance derivatives to optimize the GP model, as well as computing the forward model and its partial derivatives several times at each iteration stage. In contrast, beyond solving the forward model at the input point selected for addition to the training set, the algorithm we develop in this paper does not require solving the forward problem, or computing any of its derivatives, while seeking potential data point candidates. In addition, no optimization of the GP model is required at any stage of the algorithm. In fact, we will demonstrate that, despite choosing a naive unoptimized GP model, our algorithm can still construct a faithful surrogate model. Kandasamy et al.
[61] suggest the same data selection criterion as the one we derive here; however, in order to locate the optimal data point that maximizes the criterion, they evaluate the criterion on a coarse grid in input space, which becomes impractical in high dimensions. We circumvent this hurdle by coupling our GP model with an MCMC sampler, which allows us to cheaply locate the optimal data point.

This paper is organized as follows. In Section 2 we briefly introduce Bayesian inference and Gaussian process regression, and then develop the algorithm for the optimal selection of GP training data. Implementation of the developed algorithm on a number of network toy problems of increasing dimensionality is presented in Section 3. Major conclusions are summarized in Section 4.
2 Problem Formulation
2.1 Bayesian parameter inference
We consider a forward problem defined by:
(1)  $d = G(\theta) + \epsilon$

where $d$ is a real-valued $n$-dimensional vector of observed data, and $G$ is a forward model that is a function of a real-valued $m$-dimensional vector of parameters $\theta$, such that $G: \mathbb{R}^m \rightarrow \mathbb{R}^n$. It is assumed that $\theta$ is unknown or uncertain, and it is thus treated as a vector of real-valued random variables. $\epsilon$ is an $n$-dimensional vector of i.i.d. random variables with a probability density $p_\epsilon$, which accounts for the discrepancy between the model predictions and the observed data due to measurement errors only. Note that, as expressed above, the noise is assumed to be additive.

Given the observed data, our goal is to statistically infer the model parameters, $\theta$, using the Bayesian approach. According to Bayes' theorem, the solution to this inverse problem is given by:
(2)  $p(\theta \mid d) = \dfrac{p(d \mid \theta)\, p(\theta)}{p(d)}$

where $p(\theta \mid d)$ represents the posterior probability distribution of $\theta$ after knowledge from the observed data has been incorporated. $p(\theta)$ represents the prior probability distribution of the model parameters, which encodes knowledge about the parameters before the data has been observed. $p(d \mid \theta)$ represents the likelihood function, which describes the probability that a given set of model parameters, $\theta$, gives rise to the observed data, $d$. Using our assumptions about the distribution of $\epsilon$, the likelihood function can be written as:

(3)  $L(\theta) \equiv p(d \mid \theta) = p_\epsilon\big( d - G(\theta) \big)$
If we further assume that the noise, $\epsilon$, is zero-mean Gaussian with a scalar variance, $\sigma^2$, and that the observed data are independent and identically distributed, then the likelihood function takes the form:

(4)  $p(d \mid \theta) = (2\pi\sigma^2)^{-n/2} \exp\left( -\dfrac{\| d - G(\theta) \|^2}{2\sigma^2} \right)$
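For concreteness, the Gaussian log-likelihood corresponding to Eq. (4) can be sketched in a few lines of Python. The function and argument names (`log_likelihood`, `forward_model`) are our own illustrative choices; any callable standing in for $G$ can be substituted:

```python
import numpy as np

def log_likelihood(theta, forward_model, d, sigma2):
    """Log of the Gaussian likelihood in Eq. (4): zero-mean, scalar variance sigma2."""
    r = d - forward_model(theta)          # residual d - G(theta)
    n = d.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * (r @ r) / sigma2
```

Working with the log-likelihood rather than the likelihood itself avoids underflow when the residual norm is large, which matters later when the GP is trained on log-likelihood observations.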
When the forward model is nonlinear, as is typical in most applications, an analytical solution of Eq. (2) becomes intractable. Moreover, if the dimensionality, $m$, of the vector of model parameters, $\theta$, is high, and the forward model is expensive to solve, then a simple grid-based numerical solution of the Bayesian inverse problem becomes prohibitively expensive. In these cases, one usually resorts to a statistical characterization of the posterior probability distribution of $\theta$ through random sampling techniques, such as Markov chain Monte Carlo (MCMC) [3]. However, for the statistical quantities obtained through random sampling to converge to the true ones, the MCMC sampler needs to be run for a long time, which consequently requires many evaluations of the likelihood function. This is especially true for complicated, highly non-Gaussian posterior distributions [5, 68, 69]. Since each evaluation of the likelihood function involves a forward solve of the physical model, direct random sampling can become computationally impractical (or even infeasible) if the physical model is computationally expensive. One way to circumvent this hurdle is to replace the likelihood function with a cheaper surrogate model that faithfully reflects the likelihood response surface in regions of parameter space associated with high probabilities. We choose to employ a kernel method, specifically the Gaussian process regression (GPR) technique, for constructing such a surrogate model, mainly due to its generality and flexibility.
2.2 Gaussian process regression
Gaussian process regression is a nonparametric method that stems from Bayesian linear regression. The latter defines a probability distribution over the weighting parameters of a given set of nonlinear basis functions, whereas the former defines a probability distribution over the regression functions directly. Specifically, given a set of $n_t$ observations $y = \{y_i\}$ at a set of input points $X = \{x_i\}$, a Gaussian process defines a probability distribution over functions, $f(x)$, such that the joint probability distribution of the observations, $y$, is Gaussian with a mean $m(X)$ and a covariance $K$. $K$ is an $n_t \times n_t$ matrix whose elements are composed of the kernel function $k(x_i, x_j)$. Assuming no prior knowledge about the mean, we will take $m = 0$. There are several options that one could choose for the covariance kernel function [70], but in what follows, we will focus on the stationary, isotropic squared-exponential kernel function given by:

(5)  $k(x, x') = \sigma_f^2 \exp\left( -\dfrac{\| x - x' \|^2}{2\ell^2} \right)$

where $\sigma_f^2$ is the variance (the diagonal component of $K$), and $\ell$ is a correlation length-scale parameter. Typically, optimal values for the kernel parameters are determined by optimizing the log-marginal likelihood of the Gaussian process using standard techniques such as the conjugate gradient method [70, 71].
Given that the joint distribution of the outputs at a given set of inputs is normal, the conditional predictive distribution at a new input point, $x_*$, is accordingly also Gaussian, with a mean and variance given by:

(6)  $\mu(x_*) = k_*^T K^{-1} y$

(7)  $s^2(x_*) = k(x_*, x_*) - k_*^T K^{-1} k_*$

where $k_*$ is an $n_t \times 1$ kernel vector composed of the kernel function evaluated at $x_*$ and each of the training input points in $X$. This gives us the predictive mean and the associated uncertainty at a new test input point, given observations at previous training input points.
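Under the zero-mean assumption, Eqs. (5)–(7) amount to a kernel evaluation and a couple of linear solves. The sketch below (function names are ours; a Cholesky factorization would be preferable for larger problems) illustrates the predictive equations:

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    """Isotropic squared-exponential kernel, Eq. (5), between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X_train, y_train, X_test, sigma_f=1.0, ell=1.0, jitter=1e-10):
    """Zero-mean GP predictive mean (Eq. 6) and variance (Eq. 7)."""
    K = sq_exp_kernel(X_train, X_train, sigma_f, ell) + jitter * np.eye(len(X_train))
    Ks = sq_exp_kernel(X_test, X_train, sigma_f, ell)   # rows are k_* vectors
    alpha = np.linalg.solve(K, y_train)
    mu = Ks @ alpha                                     # Eq. (6)
    v = np.linalg.solve(K, Ks.T)
    var = sigma_f**2 - np.sum(Ks * v.T, axis=1)         # Eq. (7)
    return mu, var
```

At a training input the predictive mean reproduces the observation and the variance collapses to (numerically) zero, while far from all training inputs the variance reverts to the prior value $\sigma_f^2$.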
2.3 Optimal selection of training data
Our aim is to build a surrogate model for the likelihood function in Eq. (3), given that the forward model, $G$, is computationally expensive. Thus, it is important to judiciously choose the data for training the GP surface, so that we get a faithful surrogate representation of the likelihood function with the fewest possible training points. It has been shown in numerous studies [61, 72, 52, 49, 50, 51, 59] that an "active learning" or "sequential design" approach to the problem of optimal data selection is more efficient than "passive learning" strategies, and this is the approach that we adopt in this paper.
Assuming that our choice of the GP covariance kernel function is suitable, and that we have initialized our GP surface with a set of initial training input–output data pairs $\{(x_i, y_i)\}$, our task is to determine which data point we should add to our training data set so as to maximize the amount of information gained by our GP surface. In other words, we wish to determine the optimal input point at which to evaluate our likelihood function next. Subsequently, the new data point is added to our training data set, the GP surface is updated, and the process is repeated until the accuracy of the resulting GP surface is deemed satisfactory. Note that we will not be concerned in this study with the issue of selecting multiple new observation points at once; this will be the subject of future studies.
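A minimal sketch of this sequential loop is given below. It is illustrative only: a random candidate pool and the raw predictive variance stand in for the MCMC search and the probability-weighted criterion developed in the next subsections, and all names are our own:

```python
import numpy as np

def _kernel(A, B, ell=1.0):
    """Squared-exponential correlation between two point sets (unit variance)."""
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ell**2)

def active_learn(log_like, lo, hi, n_init=4, n_add=10, ell=1.0, seed=0):
    """Sequential design: start from a few training points, then repeatedly add
    the candidate with the largest GP predictive variance. Each iteration costs
    exactly one new forward solve (the call to log_like at the chosen point)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    X = rng.uniform(lo, hi, size=(n_init, lo.size))       # initial design
    y = np.array([log_like(x) for x in X])
    for _ in range(n_add):
        K = _kernel(X, X, ell) + 1e-8 * np.eye(len(X))
        cand = rng.uniform(lo, hi, size=(256, lo.size))   # candidate pool
        Ks = _kernel(cand, X, ell)
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        x_new = cand[np.argmax(var)]                      # most uncertain candidate
        X = np.vstack([X, x_new])
        y = np.append(y, log_like(x_new))                 # single new forward solve
    return X, y
```

Swapping the candidate pool for MCMC samples and the variance for the probability-weighted variance recovers the algorithm developed below.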
2.3.1 Utility measure
In order to determine the potential amount of information that a given data point carries, we need to define a utility measure that can serve as an information gauge. For a GP surface, each point in input space is associated with a mean and a variance (uncertainty) around that mean. Thus, a data point that maximizes the information gain is one that results in a maximum reduction of uncertainty at a given point in input space. In this case, a possible information gauge would be the pointwise predictive variance given by Eq. (7). However, recall that our ultimate goal is to create an approximation of the likelihood surface that we can then use for MCMC sampling. As a result, we are only interested in an accurate approximation of the regions in input space that are associated with a high posterior probability. Regions that have a low posterior probability are not important for statistically characterizing the posterior probability distribution of $\theta$, and thus do not require an accurate representation by our GP surrogate surface. Consequently, our criterion should weight the pointwise variance by the corresponding posterior probability value at a given input point.
In order to derive a probability-weighted pointwise variance, we start by writing the pointwise variance according to our GP-approximated surface. Let $P(\theta)$ be the true posterior probability at a given input point, and $\tilde{P}(\theta)$ its corresponding GP approximation (note that, as will be seen in the equations that follow, the GP model by itself approximates the log-likelihood). Accordingly,

(8)  $\tilde{P}(\theta) = \dfrac{1}{Z}\, e^{f(\theta)}\, g(\theta), \qquad f(\theta) \sim \mathcal{N}\big( \mu(\theta), s^2(\theta) \big)$

where $Z$ is a normalization factor, $\mu(\theta)$ is the GP predictive mean given by Eq. (6), and $s^2(\theta)$ is the associated GP uncertainty given by Eq. (7). $g(\theta)$ accounts for any prior information or assumptions about the distribution of $\theta$. For example, if our prior assumption is that $\theta \sim \mathcal{N}(m_0, C_0)$, then $g(\theta) \propto \exp\left( -\tfrac{1}{2} (\theta - m_0)^T C_0^{-1} (\theta - m_0) \right)$.
The variance of $\tilde{P}(\theta)$ is given by:

(9)  $\mathrm{Var}\big[ \tilde{P} \big] = \dfrac{g^2}{Z^2} \left( \mathbb{E}\big[ e^{2f} \big] - \mathbb{E}\big[ e^{f} \big]^2 \right)$

where we omitted the dependence on $\theta$ in the last expression to simplify the notation. Using simple calculus, it can be shown that $\mathbb{E}\big[ e^{2f} \big] - \mathbb{E}\big[ e^{f} \big]^2 = e^{2\mu + s^2} \big( e^{s^2} - 1 \big)$. This leads to:

(10)  $\mathrm{Var}\big[ \tilde{P}(\theta) \big] = \dfrac{g(\theta)^2}{Z^2}\, e^{2\mu(\theta) + s^2(\theta)} \big( e^{s^2(\theta)} - 1 \big)$
The above equation gives us our sought-after pointwise probability-weighted predictive variance, which we will employ as our criterion for determining the next best data point to add to our training data set. In other words, we will choose to evaluate the likelihood at the input point that maximizes Eq. (10), and add this new data point to our GP training data set. If Eq. (10) happens to admit multiple maxima, then we will randomly pick one out of the set of possible maxima.
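Since the likelihood can span many orders of magnitude, a criterion of this form is best evaluated in log space. A sketch (the constant factor $1/Z^2$ is dropped, as it does not affect the maximizer; function and argument names are ours):

```python
import numpy as np

def log_utility(mu, s2, log_g):
    """Logarithm of the probability-weighted predictive variance of Eq. (10),
    log[ g^2 * exp(2*mu + s2) * (exp(s2) - 1) ], up to an additive constant.
    Evaluating in log space avoids overflow for large |mu|."""
    # expm1 is accurate for small s2; the floor guards against log(0) at s2 = 0
    log_em1 = np.log(np.maximum(np.expm1(s2), 1e-300))
    return 2.0 * log_g + 2.0 * mu + s2 + log_em1
```

The utility grows with both the GP uncertainty $s^2$ and the (approximate) posterior probability, so maximizing it targets uncertain points that also matter for sampling.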
2.3.2 Searching for the optimal point
Having derived our measure of optimality using the variance of the approximated posterior probability density, we now need a way to locate the point that maximizes Eq. (10), especially when our input space is high-dimensional. Since the surface that we are seeking to optimize could potentially be multimodal and/or non-smooth, common optimization techniques are not the best resort, as they are known to struggle in such cases. Instead, we will seek to locate our optimal point via MCMC sampling. Specifically, we will sample the surface given by Eq. (10) using the emcee implementation of the affine-invariant ensemble MCMC sampler [73, 74], which we will hereafter refer to as the Hammer sampler. (We should note that the use of Hammer is not necessary; any other MCMC sampler could be employed. We chose Hammer because of its efficiency in sampling highly non-Gaussian surfaces.) From the set of samples obtained from Hammer, we pick the sample point with the highest probability-weighted variance as given by Eq. (10). Note that, in order to prevent the covariance kernel matrix from becoming ill-conditioned, we also require that our potential new data input point be a certain distance away from the input points that are already in the training set. We found that a constraint on the Euclidean norm given by:

$\| x_{\mathrm{new}} - x_i \| \geq 0.2\, \ell \quad \forall\, x_i \in X$

where $\ell$ is the correlation length-scale defined in Eq. (5), suffices for this purpose. If our potential sample point violates this condition, then we pick instead the next best sample point that satisfies it. Note that the above distance criterion is needed due to the potential incompatibility between the smoothness imposed by the underlying covariance kernel function and that of the data; had we opted to optimize the GP model, this condition would no longer be necessary. Moreover, a factor different from 0.2 could in principle have been chosen, but we picked it as a reasonable value that does not sacrifice flexibility for smoothness.
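Given MCMC samples of the utility surface and their utility values, the constrained selection step can be sketched as follows (names are ours; in practice the samples would come from the Hammer sampler):

```python
import numpy as np

def pick_next_point(samples, utilities, X_train, ell, factor=0.2):
    """Return the highest-utility sample lying at least factor*ell (Euclidean
    distance) from every current training input; fall back to the next-best
    sample whenever the constraint is violated."""
    order = np.argsort(utilities)[::-1]                  # best candidates first
    for i in order:
        d = np.linalg.norm(X_train - samples[i], axis=1)
        if np.all(d >= factor * ell):
            return samples[i]
    return None                                          # no admissible sample found
```

The distance check costs only one norm computation per candidate, so rejecting the top candidate and moving down the ranked list is cheap compared to a forward solve.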
2.3.3 Accuracy measure
As mentioned before, we aim to construct a faithful GP surrogate of the likelihood function. Thus, we need a measure by which we can judge the accuracy of our approximation at each iteration stage. For this purpose, we will rely on two different quality measures. The first quality measure is based on the empirical average absolute error between predictions obtained from the approximate surface and predictions obtained from the true surface at a given set of holdout points in parameter space. The holdout points are obtained by running the Hammer sampler on both the approximate and the true surfaces, and using the sample points after burn-in as our holdout points. We resort to two sets of samples so that we do not bias our quality measure towards sample points whose true likelihood is either high or small (in other words, we would like to check the quality of our approximate surface in areas where the true likelihood is large, and also in areas where the true likelihood is small but is incorrectly classified as large by our approximation). Accordingly, this gives us two error measures: one weighted by the true probability distribution, $P$, and the other weighted by the approximate probability distribution, $\tilde{P}$. We designate these as $e_P$ and $e_{\tilde{P}}$, respectively, and they are of the form:

(11)  $e_P = \dfrac{1}{N} \sum_{\theta_i \sim P} \big| P(\theta_i) - \tilde{P}(\theta_i) \big|, \qquad e_{\tilde{P}} = \dfrac{1}{N} \sum_{\theta_i \sim \tilde{P}} \big| P(\theta_i) - \tilde{P}(\theta_i) \big|$

where $N$ is the number of samples generated by the Hammer sampler.
The second quality measure, $R$, is based on the relative variance of the sample weights, $w_i$, which is computed as follows [6, 75]:

(12)  $R = \dfrac{1}{N} \sum_{i=1}^{N} \big( w_i - \bar{w} \big)^2, \qquad w_i = \dfrac{P(\theta_i)}{\tilde{P}(\theta_i)}$

where $w_i$ is the weight of a sample point $\theta_i$ generated from Hammer by sampling the surrogate surface, and $\bar{w}$ is the mean of the weights. Note that $w_i \geq 0$, and that $R = 0$ when $\tilde{P} = P$ (in which case, the surrogate surface is an exact representation of the true surface). One has to be careful, though, that the mean of the sample weights is close to 1. Otherwise, the variance of the weights could still be zero even though the surrogate surface is far off from the true surface. This happens if the ratio of $P$ to $\tilde{P}$ is some constant, in which case the weights would need to be rescaled by normalization with this constant factor.
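A sketch of this diagnostic, operating on unnormalized log-probabilities and applying the rescaling mentioned above so that the weights average to one (function and argument names are ours):

```python
import numpy as np

def weight_diagnostic(log_p_true, log_p_surr):
    """Relative variance of the sample weights w_i = P(theta_i)/P~(theta_i),
    in the spirit of Eq. (12). Rescaling the weights to unit mean absorbs any
    constant offset between the two unnormalized surfaces."""
    w = np.exp(log_p_true - log_p_surr)
    w = w / w.mean()                       # enforce mean(w) = 1
    return np.mean((w - 1.0) ** 2)
```

A value near zero indicates the surrogate matches the true surface (up to a constant factor) over the sampled region; large values flag regions where the two surfaces disagree.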
3 Implementation
In what follows, we test the efficacy of our training algorithm on network toy problems of increasing dimensionality. The main motivation behind this study is the inference of reaction rate parameters in combustion problems, so the toy problems are meant to imitate an idealized reaction network model for the reaction kinetics in combustion applications. For each problem, we first run the Hammer sampler on the true posterior distribution in order to obtain information about the mean and the covariance of the underlying posterior surface. The number of walkers used by the Hammer sampler is twice the dimensionality of the problem, and the walkers are initialized by drawing samples from a Gaussian distribution. We then use this information on the sample mean and covariance to construct a Gaussian prior distribution, $g(\theta)$. Note that this initial Gaussian approximation does not give us complete information about the posterior surface, since all of our test models are nonlinear, which makes the posterior distribution non-Gaussian by construction. Our aim is to test whether our GP-augmented surrogate surface can help resolve the nonlinearities that a Gaussian proposal fails to capture. Moreover, it is not necessary to run the sampler on the true surface for a long time. It is sufficient to obtain only preliminary information about the mean and covariance of the true distribution, as long as the support of our initial Gaussian approximation is large enough that it does not preclude high-probability regions that the sampler did not manage to visit during the preliminary sampling stage. This could be achieved, for example, by inflating the sample variance used in constructing the Gaussian prior by some factor.

We will assume that the true likelihood function has a Gaussian form, which leads to the GP likelihood surrogate:
(13)  $\tilde{L}(\theta) = \exp\big( \mu(\theta) \big)$

Consequently, our GP-augmented surrogate for the posterior distribution becomes:

(14)  $\tilde{P}(\theta) = \dfrac{1}{Z} \exp\big( \mu(\theta) \big)\, g(\theta)$

where, as mentioned in Section 2.3.1, $Z$ is a normalization factor. $\mu(\theta)$ is given by Eq. (6) after initializing the GP surface with a certain number of training points, while $\mu(\theta) = 0$ before adding any training points, and each training observation $y_i$ represents an observation of the true log-likelihood function, $y_i = \log L(\theta_i)$. Note that, comparing Eq. (14) above with the previous Eq. (8), only the GP predictive mean is included, without the GP variance. The GP variance was used to derive the sought-after utility measure, but for prediction purposes (in this case, to predict the value of the posterior surrogate at a given input $\theta$), only the mean is needed. The GP variance is needed only as a way to assess the uncertainty or confidence in our posterior approximation.
For all of the test problems presented below, we adopt the isotropic squared-exponential kernel given by Eq. (5) as our covariance kernel function. We initialize the covariance kernel parameters and keep them fixed thereafter. The kernel variance is set according to the maximum absolute value of the true log-likelihood function, which is estimated from observations of log-likelihood values for samples proposed by Hammer (both the samples that were eventually accepted and those that were rejected) when sampling the true posterior to construct the Gaussian prior distribution, as described above. The correlation length-scale is chosen as a value that is usually a good initial guess when no prior knowledge of the correlation length-scales is available: neither too small nor too large. We will not attempt to optimize or update the kernel parameters as we add GP training points, as one of the goals of this preliminary study is to demonstrate that such optimization is not necessary for our training algorithm to work. Of course, one expects the efficiency of the training algorithm to improve when optimal kernel parameters are implemented rather than non-optimal ones, but the enhanced efficiency then comes at the extra computational cost and added complexity of optimizing for the kernel parameters. A comparison between the computational cost and complexity of implementing an isotropic kernel with fixed parameters, an isotropic kernel with optimized parameters, and an anisotropic kernel with optimized parameters will be the subject of future investigations.
3.1 3node network model
We start by considering a 3-node network model, as illustrated by the schematic in Fig. 1. The reaction rate, $r$, across each pair of nodes is given by an Arrhenius-type rate law:

$r_{ij,k} = A_{ij}\, C_k \exp\big( -E_{ij}\, \beta_k \big)$

where $A$, $E$, $\beta$, and $C$ are the pre-exponential factor, activation energy, thermodynamic inverse temperature, and concentration of the reaction species, respectively. The indices $ij$ refer to the nodes across which a reaction is taking place, while the index $k$ corresponds to the specific experimental conditions under which the reactions occur. It is assumed that the reactions are irreversible, and that they proceed sequentially across each of the nodes, with the first node always being node 1 and the terminal node being the highest-numbered node in the network (node 3 in this case). The eventual observed end-output of each experiment corresponds to the total time it takes the slowest reaction pathway to complete (similar to a rate-limiting step in an actual reaction mechanism). The time for a given reaction pathway is given by the sum of the inverse of the reaction rate across each of the nodes involved in that pathway. For example, the total time of the pathway $1 \rightarrow 2 \rightarrow 3$ is $t = 1/r_{12} + 1/r_{23}$.
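Assuming the Arrhenius-type rate law reconstructed above, the forward model for such a network reduces to a few lines. The data layout (dictionaries keyed by node pairs) and function names here are our own illustrative choices:

```python
import numpy as np

def pathway_time(path, A, E, beta, C):
    """Total time of one reaction pathway: sum over its edges of 1/r, with
    r_ij = A_ij * C * exp(-E_ij * beta) (assumed Arrhenius-type rate)."""
    t = 0.0
    for i, j in zip(path[:-1], path[1:]):
        r = A[(i, j)] * C * np.exp(-E[(i, j)] * beta)
        t += 1.0 / r
    return t

def observed_time(paths, A, E, beta, C):
    """Experimental end-output: the time of the slowest (rate-limiting) pathway."""
    return max(pathway_time(p, A, E, beta, C) for p in paths)
```

For the 3-node network the admissible pathways from node 1 to node 3 are the direct edge `[1, 3]` and the two-step route `[1, 2, 3]`, and each synthetic experiment evaluates `observed_time` at its own $(\beta_k, C_k)$ conditions before noise is added.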
3.1.1 2D case
We start out with a 2D scenario, in which we assume that only two of the activation energies are uncertain, with all other parameters known. For inferring the uncertain parameters, we rely on observations from a set of 7 synthetic noisy experiments, not all of which are equally informative. The synthetic noisy observation data are obtained by numerically evaluating the model for each experiment using the true reaction parameters, and then adding zero-mean Gaussian noise, at a prescribed noise level, to the computed output from each experiment. The true network reaction rate parameters are shown in Table 1, and Table 2 lists the conditions for each experiment.
Nodes   (1,2)   (1,3)   (2,3)
A        5       2       1
E        1       3       2
Experiment
1      10     0.5    10     0.01
2      20     0.5    20     0.1
3      0.5    20     2      0.01
4      2      30     0.5    0.1
5      2      20     0.5    0.01
6      5      20     5      0.01
7      0.5    30     0.5    0.1
Fig. (2) shows the unnormalized true 2D posterior distribution of the uncertain rate parameters, given data observed using the first 6 experiments shown in Table 2. The distribution was obtained by numerically evaluating the true posterior on a grid. Note that the Gaussian prior used in evaluating the true posterior distribution is not the same as the Gaussian prior, $g(\theta)$, used to construct the surrogate posterior. To test whether we can recover this 2D probability distribution using our sequentially trained GP-augmented surrogate surface, we started by constructing our Gaussian prior distribution and initializing the GP with 16 training points on a grid. The initial training points used are shown as black dots in Fig. (2). The initial Gaussian prior distribution is shown in Fig. (2(a)), while Fig. (2(b)) shows the final surface after sequentially training the GP surface with 200 additional observation points using the search algorithm described in Section 2.3.2. The additional observation points selected are marked as black crosses in Fig. (2). As can be seen from the figures, our training algorithm is able to select observation points in areas where the true posterior probability is high, and also in areas that the original Gaussian prior incorrectly classified as important, resulting in a surrogate surface that correctly captures the nonlinear curvature of the true distribution. Notice also that, despite using an isotropic covariance kernel whose parameters have not been optimized beforehand, we were still able to learn the underlying anisotropic character of the true surface through proper selection of training points. Note that there was no need to implement the accuracy measures introduced in Section 2.3.3, since in 2D one can easily check the quality of the approximation visually.
The warped, stretched structure of the 2D posterior distribution is a signature of the fact that the data provided by the 6 experiments do not carry a sufficient amount of information to pinpoint the underlying true reaction parameters. This raises the interesting question of whether the posterior probability distribution would be better constrained if the experiments carried more information about the parameters. To this end, we repeated the 2D parameter inference exercise above using the additional data provided by experiment 7 in Table 2.

Fig. (4) shows the unnormalized true 2D posterior distribution, again evaluated on a grid, along with the 216 data input points used to train the GP. The Gaussian prior and the resulting GP-augmented surrogate surface are shown in Fig. (5). As expected, with the addition of a more informative experiment, the posterior is much more constrained. However, the mode of the posterior distribution remains slightly off from the true parameter values. This is due to the noise in the data and the nonlinear manner in which the activation energy parameter enters the model, which makes the output steeply sensitive to the activation energy and thus prohibits exact identification of the truth. Another thing to notice is that the posterior still exhibits a slight curvature, which the surrogate surface manages to capture perfectly.
3.1.2 6D case
Having demonstrated that our sequentially trained GP-augmented surrogate surface is capable of successfully reconstructing non-Gaussian 2D posterior probability distributions, we move on to testing the algorithm on the full 6D scenario, where we assume that all of the reaction rate parameters, namely the three pre-exponential factors and the three activation energies, are uncertain. We rely on the same set of noisy data from the 7 experiments shown in Table 2, with the same noise level as before. We use observations from all 7 experiments for the inference this time, since otherwise our problem becomes ill-posed and would require remedies that are beyond the scope of this study.
Fig. (6) shows 2D contour projections of the true 6D posterior probability distribution along the dimensions labeled in the figure. The probability contours were obtained by running the Hammer sampler on the true posterior probability surface, collecting the samples after burn-in, projecting them along the given directions, and then estimating the probability using a Gaussian bivariate kernel density estimator (KDE). Rather than dealing with the logarithm of the pre-exponential parameters in order to impose their positivity constraint, we instead chose to assign a negligibly low probability to negative values, thus preventing the Hammer sampler from visiting non-admissible regions. By comparing the scales of the axes in Fig. (6), one can notice that the degree of uncertainty in the pre-exponential parameters is higher than that in the activation energy parameters. This is again due to the fact that the experiments are more informative of the latter than the former.

In order to construct our surrogate surface, we started as before by constructing the Gaussian prior, $g(\theta)$, and initializing the GP surface with a set of training points. The training points for initializing the GP surface were obtained by collecting sample points from a random selection of 10 iterations of the Hammer chain (after burn-in) utilized in constructing $g(\theta)$. Note that we did not impose the minimum distance constraint, mentioned in Section 2.3.2, on this initial set of training points. The resulting surrogate surface was then sequentially updated with 300 additional observation points chosen according to our search algorithm (with the minimum distance constraint imposed this time). 2D projections of the additional 300 training points selected by the search algorithm are shown overlaid on the corresponding projections of the true posterior distribution in Fig. (6). For clarity, the initial GP training points have not been included in the plots. The initial Gaussian prior and the final surrogate distributions are shown in the left and right columns, respectively, of Fig. (7). The rows in Fig. (7) correspond to projections of the prior and surrogate distributions along the same directions as those shown in Figs. (5(a))–(5(d)). The probability contours of the surrogate posterior distribution were obtained via KDE in the same manner as those in Fig. (6); the contours of the initial Gaussian prior, however, were obtained by numerically evaluating the 2D marginal distributions on a grid in the range shown.
As can be seen from the figures, the surrogate surface successfully captures all of the nonlinearities present in the true posterior distribution, which the initial Gaussian prior approximation fails to represent.
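The surrogate construction itself, a Gaussian approximation whose log-density is corrected by a GP fitted to the residual log-posterior at the training points, can be sketched as below. The squared-exponential kernel with fixed, hand-picked hyperparameters mirrors the paper's use of a stationary, isotropic GP with unoptimized kernel parameters, but the specific values, names, and toy 1D densities here are assumptions for illustration.

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sig=1.0):
    """Stationary, isotropic squared-exponential kernel (fixed parameters)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sig ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def make_surrogate(X_train, log_true_vals, log_gauss, ell=1.0, sig=1.0):
    """GP-augmented surrogate: log pi_tilde(x) = log pi_G(x) + GP mean of the
    residual (log pi - log pi_G) interpolated through the training points."""
    resid = log_true_vals - np.array([log_gauss(x) for x in X_train])
    K = rbf(X_train, X_train, ell, sig) + 1e-8 * np.eye(len(X_train))
    alpha = np.linalg.solve(K, resid)          # kernel weights
    def log_surrogate(x):
        k = rbf(np.atleast_2d(x), X_train, ell, sig)[0]
        return log_gauss(x) + k @ alpha
    return log_surrogate

# Toy usage: a skewed 1-D log-density corrected over its Gaussian part.
log_true = lambda x: -0.5 * x[0] ** 2 - 0.2 * x[0] ** 3   # non-Gaussian target
log_gauss = lambda x: -0.5 * x[0] ** 2                    # Gaussian approximation
X_train = np.linspace(-1.5, 1.5, 7).reshape(-1, 1)
vals = np.array([log_true(x) for x in X_train])
log_surr = make_surrogate(X_train, vals, log_gauss, ell=0.5)
```

At the training points the GP interpolates the residual, so the surrogate reproduces the true log-posterior there while the Gaussian part alone does not.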
To better quantify the accuracy of the surrogate approximation, especially since a 6D surface is not easy to visualize directly, and to be able to judge the efficiency of our sequential training algorithm, we resort to the accuracy measures given by Eqs. (11)–(12). Figs. (8(a))–(8(c)) show the evolution of the two average absolute error measures and the R measure, respectively, as we sequentially add training points to the GP-augmented surface. Note that the zeroth iteration corresponds to the GP-augmented surface that has already been trained with the initializing set of GP training points. As can be seen from Figs. (8(a))–(8(b)), the initial values of the two error measures are not appreciably large, indicating that the initial GP-augmented surface is not vastly different from the true posterior surface. Notice also that the error computed over samples from the surrogate initially exceeds that computed over samples from the true posterior, which means that the initial GP-augmented surface is more accurate near the mode of the posterior distribution than farther away. This is expected, since (in most cases) a Gaussian approximation near the mode of the true posterior distribution should locally be fairly accurate. Considering also that the covariance kernel correlation length-scale is not too large, the initial GP-augmented surface starts out fairly lumpy as it adjusts its mean predictions with the addition of training points. So in computing the former error, more sample points initially come from regions where the surrogate surface is still not very accurate and where the true probability is low. On the other hand, the sample points used to compute the latter error are always more concentrated near the mode of the true posterior distribution, where the surrogate surface is initially relatively more accurate. This underscores the importance of relying on both measures to properly judge the accuracy of the surrogate surface. Another thing to notice is that both error values exhibit an overall monotonic decrease as more observation points are added, which confirms that the search algorithm is selecting new training points that are indeed informative.
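Since Eqs. (11)–(12) are not reproduced in this section, the sketch below assumes the two average absolute error measures take the form the surrounding text describes: the mean of |log π − log π̃| over samples drawn from the surrogate on one hand, and over samples drawn from the true posterior on the other.

```python
import numpy as np

def avg_abs_errors(log_true, log_surr, samp_surr, samp_true):
    """Two average absolute log-density errors: over samples drawn from the
    surrogate (which probe wherever the surrogate places mass, including
    low-probability regions of the truth) and over samples drawn from the
    true posterior (which concentrate near its mode)."""
    err = lambda pts: float(np.mean([abs(log_true(x) - log_surr(x)) for x in pts]))
    return err(samp_surr), err(samp_true)

# Toy usage: a surrogate that is too wide relative to the target.
rng = np.random.default_rng(1)
log_true = lambda x: -0.5 * x ** 2              # standard-normal target
log_surr = lambda x: -0.5 * (x / 1.5) ** 2      # over-dispersed surrogate
e_surr, e_true = avg_abs_errors(log_true, log_surr,
                                1.5 * rng.standard_normal(500),  # from surrogate
                                rng.standard_normal(500))        # from target
```

Because the over-dispersed surrogate sends sample points into the tails, where the log-density mismatch is largest, the surrogate-sampled error exceeds the target-sampled one, which is exactly the asymmetry discussed above.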
Moreover, the decrease in error appears to be steepest at the beginning, and becomes more gradual later on. This too confirms that the algorithm is seeking training points that are maximally informative in a global sense, as desired.
Observing now Fig. (8(c)), we can notice (i) that the initial R measure is only slightly larger than its minimum value, which indicates that the initial surrogate approximation, while not exact, is still appreciably close to the true posterior surface; (ii) that the R measure exhibits an overall monotonic decreasing trend as more training points are added to the GP surface; and (iii) that the steepest decrease occurs when the first few training points are added, while the rate of decrease becomes more gradual later on. These observations are consistent with, and lend further support to, our earlier remarks regarding the absolute error trends.
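The R measure can likewise be sketched under the assumption, consistent with the description in the abstract, that it is based on the relative variance of the importance weights w = π/π̃ computed over samples drawn from the surrogate; the exact normalization used in the paper may differ.

```python
import numpy as np

def r_measure(log_true, log_surr, samp_surr):
    """Relative variance of the importance weights w = pi(x) / pi_tilde(x)
    over samples drawn from the surrogate. A perfect surrogate yields
    constant weights and a measure of exactly zero."""
    logw = np.array([log_true(x) - log_surr(x) for x in samp_surr])
    w = np.exp(logw - logw.max())               # rescale for numerical stability;
    return float(np.var(w) / np.mean(w) ** 2)   # the ratio is scale-invariant

# Toy usage: the measure vanishes for a perfect surrogate and grows with mismatch.
rng = np.random.default_rng(2)
x = rng.standard_normal(2000)                   # samples from the surrogate
perfect = r_measure(lambda t: -0.5 * t ** 2, lambda t: -0.5 * t ** 2, x)
imperfect = r_measure(lambda t: -0.5 * (t / 0.8) ** 2, lambda t: -0.5 * t ** 2, x)
```

A convenient side effect of the scale invariance is that neither density needs to be normalized, which suits the unnormalized posteriors that arise in Bayesian inference.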
3.2 6-node network model
In this section, we consider a slightly more complicated 6-node network model, illustrated in the schematic in Fig. (9). Apart from being composed of more reactive nodes, this 6-node network model is governed by the same assumptions and reaction mechanisms as the 3-node network model. In the current model, however, we assume that only the pre-exponential parameters are uncertain, which makes our inference problem 7D.
To infer the uncertain parameters, we rely, as before, on observations from a set of 20 synthetic noisy experiments, this time with a higher noise level in order to increase the degree of non-Gaussianity of the posterior distribution and make the problem slightly more challenging. The true network reaction rate parameters are shown in Table 3, and Table 4 lists the conditions for each experiment. Note that each of the 10 experiments shown in Table 4 was carried out at one setting of an additional experimental condition; the remaining 10 experiments (not listed) were carried out at the same conditions as those listed in the table, but at a second setting of that condition.
Nodes  (1,2)  (1,4)  (2,3)  (2,5)  (3,4)  (4,6)  (5,6) 
5  2  2  4  4  3  2  
7  2  3  6  5  4  1 
Experiment  

1  10  0.1  10  10  10  0.1  10 
2  0.1  10  10  0.1  10  10  0.1 
3  0.1  10  0.1  10  0.1  0.1  10 
4  10  0.1  10  10  10  10  10 
5  10  10  10  10  10  0.1  10 
6  10  10  10  10  0.1  10  10 
7  10  10  0.1  10  10  10  10 
8  10  10  10  0.1  10  10  10 
9  10  10  10  10  10  10  0.1 
10  0.1  10  10  10  10  10  10 
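The synthetic-data setup described above can be sketched as follows. The multiplicative-Gaussian noise model, the stand-in exponential-decay forward model, and all names here are illustrative assumptions; the paper's actual network model and noise convention may differ.

```python
import numpy as np

def synthetic_observations(forward_model, true_params, conditions, noise_level, rng):
    """Run the forward model at the true parameters for every experimental
    condition, then corrupt each output with zero-mean Gaussian noise whose
    standard deviation is noise_level times the signal magnitude."""
    clean = np.array([forward_model(true_params, c) for c in conditions])
    return clean + noise_level * np.abs(clean) * rng.standard_normal(clean.shape)

# Toy usage with a stand-in exponential-decay "reaction" model.
model = lambda p, c: c["y0"] * np.exp(-p[0] * c["t"])
conds = [{"y0": 10.0, "t": t} for t in (0.1, 0.5, 1.0, 2.0)]
rng = np.random.default_rng(3)
y_obs = synthetic_observations(model, [0.7], conds, noise_level=0.05, rng=rng)
```

Raising `noise_level` widens and distorts the resulting posterior, which is the mechanism used above to make the 7D problem more challenging than the 6D one.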
Fig. (10) shows 2D contour projections of the true 7D posterior probability distribution along the subset of dimensions labeled in the figure. Notice that, due to the higher noise level in the data, the range of support of the posterior probability is this time wider than the range of support (along the directions of the pre-exponential parameters) of the 6D posterior probability in Section 3.1.2. Moreover, the location of the mode of the posterior distribution is nudged slightly away from the underlying true value of the parameters. Comparing Figs. (10(a))–(10(c)) with Fig. (10(d)), we can see, consistent with what we would expect, that the probability contours for parameters along different reaction pathways flare out more in the diagonal direction, whereas the probability contours for parameters that contribute to a shared reaction pathway are stretched in directions parallel to the axes.
Following the same methods used in Section 3.1.2 to construct the surrogate surface approximation, and skipping the qualitative comparison of the probability contours for the sake of brevity, we move on to check the quality of our surrogate approximation using the accuracy measures described earlier. Figs. (11(a))–(11(c)) show the evolution of the two average absolute error measures and the R measure, respectively, as we sequentially add training points to the GP-augmented surface. Again, overall monotonic decreasing trends are observed as more training points are sequentially fed to the GP surface, with the rate of decrease being highest at the beginning and becoming more gradual later on. This time, however, as is evident from the initial values of the absolute errors and the R measure, the initial GP-augmented surface does not start out as a close approximation to the true posterior surface. In fact, almost an order of magnitude reduction in the absolute errors and the R measure is achieved by the end of the iterations.
To rule out the possibility that our training algorithm is no better than a simple random selection of training points, we repeated the 7D exercise above using the same methods and constraints, except that, instead of relying on Eq. (10) as the criterion for selecting a training point at each iteration, we sampled the current surrogate posterior and randomly selected one of the (post burn-in) sample points to be the next training data point. The resulting accuracy measures are shown in the insets of Figs. (11(a))–(11(c)). Contrary to the trends observed when using the training algorithm, the average absolute error measures using this random selection technique failed to converge, as can be seen from the inset in Fig. (11(b)). The eventual increasing trend in the error computed over samples from the true posterior, despite the overall decreasing trends in the other error measure and in the R measure, is an indication that the GP has collapsed. The GP is tending towards correctly capturing the low-probability regions, which gives rise to the decrease in the former error, at the expense of misrepresenting the high-probability regions, which causes the increase in the latter. The combination of these two effects leads to a decrease in the R measure, even though the surrogate surface is progressively deviating from the true surface. This justifies our earlier caution against relying exclusively on any one of the accuracy measures. The collapse of the GP when training points are selected randomly underscores the importance of correctly weighting the potential observation points using a utility measure similar to the one proposed in Section 2.3.1.
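The contrast between random selection and utility-driven selection can be illustrated with a sketch of the selection step. The variance-times-density utility below is an assumed stand-in for the probability-weighted measure of Section 2.3.1 (Eq. (10) is not reproduced here), combined with the minimum-distance constraint of Section 2.3.2.

```python
import numpy as np

def select_training_point(candidates, pred_var, log_density, existing, min_dist):
    """Score each candidate by GP predictive variance weighted by the surrogate
    posterior density (an assumed variance-times-density utility), then return
    the best-scoring candidate that respects the minimum-distance constraint
    against the points already in the training set."""
    utility = pred_var * np.exp(log_density - log_density.max())
    for idx in np.argsort(utility)[::-1]:          # best-scoring first
        x = candidates[idx]
        if all(np.linalg.norm(x - e) >= min_dist for e in existing):
            return x
    return None                                    # all candidates too close

# Toy usage: the high-variance, high-probability candidate wins.
cands = np.array([[0.0, 0.0], [2.0, 0.0], [0.1, 0.1]])
var = np.array([1.0, 1.0, 0.1])                    # GP predictive variances
logp = np.array([0.0, -4.0, 0.0])                  # surrogate log-densities
best = select_training_point(cands, var, logp,
                             existing=[np.array([5.0, 5.0])], min_dist=0.5)
```

The density weighting is what a passive (random) learner lacks: by favoring points that are both uncertain and probable, the active learner keeps refining the high-probability region rather than drifting into the tails.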
4 Conclusion
In this paper, we sought to construct a surrogate approximation of the posterior probability distribution that results during the Bayesian inference of model parameters for expensive forward models, in order to tackle the bottleneck of having to solve the forward model multiple times during MCMC sampling. The surrogate surface was built iteratively using a stationary, isotropic GP model with fixed (unoptimized) kernel parameters. To help maximize the efficiency of the GP training process, an algorithm was developed which seeks, at each iteration, the optimal data point to add to the GP training set using a pointwise probabilityweighted utility measure.
Motivated by the inference of reaction rate parameters in combustion applications, the capability of the algorithm to recover 2D, 6D, and 7D posterior probability distributions was tested on 3-node and 6-node network models. Starting with an initial Gaussian approximation of the posterior, and despite employing a possibly non-optimal GP model, the algorithm was able to successfully reconstruct the true 2D posterior distributions and to achieve almost an order of magnitude improvement in accuracy after about 100 iterations in the higher-dimensional cases. When comparing our sequential learning algorithm against a passive (random) learner in the 7D scenario, the latter crashed after about 30 iterations, which is a testament to the importance of judiciously selecting the GP training points when constructing surrogates of posterior probability distributions. These experiences warrant further testing and development of the current sequential learning algorithm, particularly with regard to incorporating more sophisticated GP models and selecting multiple new training points at once.
Acknowledgments
This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics Program under contract DEAC0205CH11231. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under contract DEAC0205CH11231.
References
 [1] Marco Iglesias and Andrew M Stuart. Inverse problems and uncertainty quantification. SIAM News, volume July/August, 2014.
 [2] Albert Tarantola. Popper, bayes and the inverse problem. Nature physics, 2(8):492, 2006.
 [3] Albert Tarantola. Inverse problem theory and methods for model parameter estimation, volume 89. siam, 2005.
 [4] Youssef M Marzouk, Habib N Najm, and Larry A Rahn. Stochastic spectral methods for efficient bayesian solution of inverse problems. Journal of Computational Physics, 224(2):560–586, 2007.
 [5] James Martin, Lucas C. Wilcox, Carsten Burstedde, and Omar Ghattas. A stochastic newton mcmc method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
 [6] Matthias Morzfeld, Marcus S. Day, Ray W. Grout, George Shu Heng Pau, Stefan A. Finsterle, and John B. Bell. Iterative importance sampling algorithms for parameter estimation. SIAM Journal on Scientific Computing, 2018.
 [7] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman. Dram: efficient adaptive mcmc. Statistics and Computing, 16(4):339–354, 2006.
 [8] J Andrés Christen and Colin Fox. Markov chain monte carlo using an approximation. Journal of Computational and Graphical statistics, 14(4):795–810, 2005.
 [9] J Andrés Christen, Colin Fox, et al. A general purpose sampling algorithm for continuous distributions (the t-walk). Bayesian Analysis, 5(2):263–281, 2010.
 [10] Amit Apte, Martin Hairer, AM Stuart, and Jochen Voss. Sampling the posterior: An approach to non-gaussian data assimilation. Physica D: Nonlinear Phenomena, 230(1-2):50–64, 2007.
 [11] Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
 [12] Paul Dostert, Yalchin Efendiev, Thomas Y Hou, and Wuan Luo. Coarse-gradient langevin algorithms for dynamic data integration and uncertainty quantification. Journal of Computational Physics, 217(1):123–142, 2006.
 [13] Yalchin Efendiev, Thomas Hou, and Wuan Luo. Preconditioning markov chain monte carlo simulations using coarse-scale models. SIAM Journal on Scientific Computing, 28(2):776–803, 2006.
 [14] JBMBJ Berger, A Dawid, DHA Smith, and M West. Markov chain monte carlo-based approaches for inference in computationally intensive inverse problems. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, page 181. Oxford University Press, 2003.
 [15] Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
 [16] John Geweke and Hisashi Tanizaki. On markov chain monte carlo methods for nonlinear and non-gaussian state-space models. Communications in Statistics - Simulation and Computation, 28(4):867–894, 1999.
 [17] Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
 [18] Tiangang Cui, Kody JH Law, and Youssef M Marzouk. Dimension-independent likelihood-informed mcmc. Journal of Computational Physics, 304:109–137, 2016.
 [19] Tan Bui-Thanh, Karen Willcox, and Omar Ghattas. Model reduction for large-scale systems with high-dimensional parametric input space. SIAM Journal on Scientific Computing, 30(6):3270–3288, 2008.
 [20] Tan Bui-Thanh, Karen Willcox, and Omar Ghattas. Parametric reduced-order models for probabilistic analysis of unsteady aerodynamic applications. AIAA Journal, 46(10):2520–2529, 2008.
 [21] SR Arridge, JP Kaipio, V Kolehmainen, M Schweiger, E Somersalo, T Tarvainen, and M Vauhkonen. Approximation errors and model reduction with an application in optical diffusion tomography. Inverse Problems, 22(1):175, 2006.
 [22] Jingbo Wang and Nicholas Zabaras. Using bayesian statistics in the estimation of heat source in radiation. International Journal of Heat and Mass Transfer, 48(1):15–29, 2005.

 [23] Gianluigi Rozza, Dinh Bao Phuong Huynh, and Anthony T Patera. Reduced basis approximation and a posteriori error estimation for affinely parametrized elliptic coercive partial differential equations. Archives of Computational Methods in Engineering, 15(3):1, 2007.
 [24] Roger G Ghanem and Pol D Spanos. Stochastic finite element method: Response statistics. In Stochastic finite elements: a spectral approach, pages 101–119. Springer, 1991.
 [25] Olivier Le Maître and Omar M Knio. Spectral methods for uncertainty quantification: with applications to computational fluid dynamics. Springer Science & Business Media, 2010.
 [26] Salman Habib, Katrin Heitmann, David Higdon, Charles Nakhleh, and Brian Williams. Cosmic calibration: Constraints from the matter power spectrum and the cosmic microwave background. Physical Review D, 76(8):083503, 2007.
 [27] Katrin Heitmann, Martin White, Christian Wagner, Salman Habib, and David Higdon. The coyote universe. i. precision determination of the nonlinear matter power spectrum. The Astrophysical Journal, 715(1):104, 2010.
 [28] Velamur Asokan Badri Narayanan and Nicholas Zabaras. Stochastic inverse heat conduction using a spectral approach. International Journal for Numerical Methods in Engineering, 60(9):1569–1593, 2004.
 [29] Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer model calibration using highdimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.
 [30] Marc C Kennedy and Anthony O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464, 2001.
 [31] Marc C Kennedy, Clive W Anderson, Stefano Conti, and Anthony O’Hagan. Case studies in gaussian process modelling of computer codes. Reliability Engineering & System Safety, 91(1011):1301–1309, 2006.
 [32] Ilias Bilionis, Nicholas Zabaras, Bledar A Konomi, and Guang Lin. Multioutput separable gaussian process: Towards an efficient, fully bayesian paradigm for uncertainty quantification. Journal of Computational Physics, 241:212–239, 2013.
 [33] BJN Blight and L Ott. A bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1):79–88, 1975.
 [34] JR Koehler and AB Owen. 9 computer experiments. Handbook of statistics, 13:261–308, 1996.
 [35] Max D Morris, Toby J Mitchell, and Donald Ylvisaker. Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics, 35(3):243–255, 1993.
 [36] Jeremy Oakley and Anthony O’hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.
 [37] Anthony O’Hagan and John Frank Charles Kingman. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–42, 1978.
 [38] Patrick R Conrad, Youssef M Marzouk, Natesh S Pillai, and Aaron Smith. Accelerating asymptotically exact mcmc for computationally intensive models via local approximations. Journal of the American Statistical Association, 111(516):1591–1607, 2016.
 [39] W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive rejection metropolis sampling within gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 44(4):455–472, 1995.
 [40] L. Martino, J. Read, and D. Luengo. Independent doubly adaptive rejection metropolis sampling within gibbs sampling. IEEE Transactions on Signal Processing, 63(12):3123–3138, 2015.
 [41] Luca Martino, Roberto Casarin, Fabrizio Leisen, and David Luengo. Adaptive independent sticky mcmc algorithms. EURASIP Journal on Advances in Signal Processing, 2018(1):5, 2018.

 [42] Renate Meyer, Bo Cai, and François Perron. Adaptive rejection metropolis sampling using lagrange interpolation polynomials of degree 2. Computational Statistics & Data Analysis, 52(7):3408–3423, 2008.
 [43] Bo Cai, Renate Meyer, and François Perron. Metropolis–hastings algorithms with adaptive proposals. Statistics and Computing, 18(4):421–433, Dec 2008.
 [44] Wei Shao, Guangbao Guo, Fanyu Meng, and Shuqin Jia. An efficient proposal distribution for metropolis-hastings using a b-splines technique. Computational Statistics & Data Analysis, 57(1):465–478, 2013.
 [45] Daniel Busby. Hierarchical adaptive experimental design for gaussian process emulators. Reliability Engineering & System Safety, 94(7):1183–1193, 2009. Special Issue on Sensitivity Analysis.
 [46] L. Martino, J. Vicent, and G. Camps-Valls. Automatic emulator and optimized lookup table generation for radiative transfer models. In 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 1457–1460, July 2017.
 [47] A. O'Hagan. Bayesian analysis of computer code outputs: A tutorial. Reliability Engineering & System Safety, 91(10):1290–1300, 2006. The Fourth International Conference on Sensitivity Analysis of Model Output (SAMO 2004).
 [48] Zheng Wang, Shuicheng Yan, and Changshui Zhang. Active learning with adaptive regularization. Pattern Recognition, 44(10):2375–2383, 2011. Semi-Supervised Learning for Visual Content Analysis and Understanding.

 [49] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
 [50] David A Cohn. Neural network exploration using optimal experiment design. Neural Networks, 9(6):1071–1083, 1996.
 [51] David A Cohn. Minimizing statistical bias with queries. In Advances in neural information processing systems, pages 417–423, 1997.
 [52] Sambu Seo, Marko Wallat, Thore Graepel, and Klaus Obermayer. Gaussian process regression: Active data selection and test point rejection. In Mustererkennung 2000, pages 27–34. Springer, 2000.
 [53] David JC MacKay. Informationbased objective functions for active data selection. Neural computation, 4(4):590–604, 1992.

 [54] Robert B Gramacy, Herbert KH Lee, and William G Macready. Parameter space exploration with gaussian process trees. In Proceedings of the Twenty-First International Conference on Machine Learning, page 45. ACM, 2004.
 [55] Robert B Gramacy and Herbert K H Lee. Bayesian treed gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.
 [56] Robert B Gramacy and Herbert KH Lee. Adaptive design and analysis of supercomputer experiments. Technometrics, 51(2):130–145, 2009.
 [57] Robert B Gramacy and Daniel W Apley. Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
 [58] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFLCONF161318, 2003.
 [59] Gerhard Paass and Jörg Kindermann. Bayesian query construction for neural network models. In Advances in Neural Information Processing Systems, pages 443–450, 1995.
 [60] Carl E. Rasmussen. Gaussian processes to speed up hybrid monte carlo for expensive bayesian integrals. 2003.
 [61] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. Query efficient posterior estimation in scientific experiments via bayesian active learning. Artificial Intelligence, 243:45–56, 2017.
 [62] Jerome Sacks, William J Welch, Toby J Mitchell, and Henry P Wynn. Design and analysis of computer experiments. Statistical science, pages 409–423, 1989.
 [63] Carla Currin. A bayesian approach to the design and analysis of computer experiments. Technical report, ORNL Oak Ridge National Laboratory (US), 1988.
 [64] Carla Currin, Toby Mitchell, Max Morris, and Don Ylvisaker. Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86(416):953–963, 1991.
 [65] Roland Preuss and Udo von Toussaint. Global optimization employing gaussian processbased bayesian surrogates. Entropy, 20(3):201, 2018.
 [66] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 [67] J. Andrés Christen and Bruno Sansó. Advances in the sequential design of computer experiments based on active learning. Communications in Statistics - Theory and Methods, 40(24):4467–4483, 2011.
 [68] Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
 [69] Matthew D Parno and Youssef M Marzouk. Transport map accelerated markov chain monte carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
 [70] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. MIT Press, 2006.
 [71] David JC MacKay. Introduction to gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168:133–166, 1998.
 [72] Valerii Vadimovich Fedorov. Theory of optimal experiments. Elsevier, 1972.
 [73] Jonathan Goodman, Jonathan Weare, et al. Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80, 2010.
 [74] Daniel ForemanMackey, David W Hogg, Dustin Lang, and Jonathan Goodman. emcee: the mcmc hammer. Publications of the Astronomical Society of the Pacific, 125(925):306, 2013.
 [75] Eric VandenEijnden and Jonathan Weare. Data assimilation in the low noise regime with application to the kuroshio. Monthly Weather Review, 141(6):1822–1841, 2013.