# Adaptive Gaussian process surrogates for Bayesian inference

We present an adaptive approach to the construction of Gaussian process surrogates for Bayesian inference with expensive-to-evaluate forward models. Our method relies on the fully Bayesian approach to training Gaussian process models and utilizes the expected improvement idea from Bayesian global optimization. We adaptively construct training designs by maximizing the expected improvement in fit of the Gaussian process model to the noisy observational data. Numerical experiments on model problems with synthetic data demonstrate the effectiveness of the obtained adaptive designs compared to the fixed non-adaptive designs in terms of accurate posterior estimation at a fraction of the cost of inference with forward models.


## 1 Introduction

Computer simulations are used in many science domains to study complex physical phenomena for which targeted experiments to test hypotheses would otherwise be too time consuming or impossible to conduct. The simulation models generally have parameters whose values influence the output of the model. Experimental data are often used to assess the accuracy of the simulation model. However, these observations are often noisy, giving rise to an inverse problem.

Our work is concerned with the design of efficient tools for the solution of inverse problems encountered in science domains such as engineering, cosmology, or combustion. The goal of inference is to estimate, from a set of observations, the parameters of interest that serve as inputs to the computational model. We consider applications in which the computational model is expensive (several minutes to hours per run on a modern supercomputer) and the observational data are noisy. The Bayesian approach provides a statistical framework for solving inverse problems with noisy and incomplete data. The solution of the inverse problem in the Bayesian framework is a posterior distribution that describes the degree of confidence about the parameters of interest. Typically, this distribution does not have an analytical form and is represented by the samples obtained with posterior sampling approaches such as Markov chain Monte Carlo (MCMC). These methods commonly require repeated evaluations of the forward model, which, in our setting, quickly becomes prohibitive from a computational point of view. Thus, our main challenge is to reduce the number of model evaluations that are required to find posterior distributions of the model's parameters. In this paper, we address this challenge by employing Gaussian process models as surrogates of the computationally expensive model.

Approximate methods (also called surrogates or metamodels) have been employed by many authors to accelerate inference tasks. Among the approximation methods used in the context of inference are projection-based model reduction [32, 7], stochastic spectral methods [30, 29], and Gaussian process regression [21, 13, 15]. Typically, the surrogate model is built over the support of the prior distribution on the parameters of interest, making it a "global" approximation. As argued in [25], it can be sufficient to have a "localized" surrogate, one that is accurate only in the region where the posterior measure concentrates. In cases where the prior distribution is "broad" and the posterior is highly concentrated, the localized surrogate approach can lead to a significant reduction in the number of forward model evaluations required.

In [25], the authors perform Bayesian inference using Polynomial Chaos (PC) surrogates that are adaptively constructed over probability distributions chosen to approximate the posterior in the sense of Kullback–Leibler (K-L) divergence. Candidate distributions are chosen from a parameterized family by minimizing the approximate K-L divergence, and the localized PC surrogates are built with respect to the chosen distributions.

In this paper, we pursue a similar idea of localized surrogates but with Gaussian process (GP) models as surrogates instead of PC. We treat the GP surrogate as a Bayesian surrogate described by a predictive distribution that encodes the available information from a limited number of forward model evaluations. We make use of the fully Bayesian formalism developed in [1] to account for the uncertainty in the observational data as well as epistemic uncertainty arising from a limited number of simulations.

Our main contribution is the design of an adaptive algorithm to guide the selection of training inputs for the construction of the GP surrogate. Our algorithm aims at building a GP surrogate that is effective for the purpose of solving a specific inverse problem. In each step of the algorithm, we maximize an acquisition function that quantifies a potential improvement in the fit of the GP model to the observational data. This greedy approach explores the prior distribution of the parameters sufficiently in order to inform the surrogate model globally while emphasizing the regions that are most likely to be of interest for the construction of the posterior.

The acquisition function in our algorithm is commonly used in Bayesian global optimization under the name of expected improvement. In that context, it is used within an algorithm called Efficient Global Optimization (EGO) [17] to find the global optimum of an expensive-to-evaluate function. We, however, do not merely apply EGO to our objective. The difference between our approach of employing the expected improvement function and the EGO approach is explained in Section 4.1.

Previous work that applied Bayesian optimization and EGO to the solution of inverse problems is reported in [34]. The authors of this work apply the EGO algorithm directly to minimize the error between the model and the experimental data. However, the authors do not treat the problem in the fully Bayesian setting as we do here. A recent work [38] also considers a sequential design strategy for the solution of inverse problems with GP emulators. In this work, however, the forward model is assumed to be a realization of the GP model which is known completely (i.e., with fixed hyperparameters of the covariance function, see Section 3). We do not make such assumptions. Furthermore, our choice of acquisition function leads to a more tractable auxiliary problem.

Active learning and Bayesian optimization methods have also been applied to the related problem of estimating the likelihood functions for Bayesian inference. Gaussian process models have been applied to directly approximate the likelihood, for example, in [19, 42]. In these works, the training points for the GP model are chosen adaptively based on a measure of uncertainty such as predictive variance or entropy. A similar approach is developed in [3], where the authors propose a method for approximating high-dimensional expensive-to-evaluate probability density functions (p.d.f.'s) with adaptive Gaussian approximations. The p.d.f.'s of interest are the ones arising from the Bayesian solution of inverse problems with Gaussian priors and likelihoods. The proposed method is based on Gaussian processes with covariance functions utilizing the Hessian of the negative log-likelihood of the posterior density. The training points are selected adaptively by maximizing the squared error between the true posterior density and the GP-based predictor. This approach requires derivatives of the forward model.

Our approach differs from the methods described above in that we build the surrogate model for the forward simulation rather than for the likelihood function. While approximating the likelihood function directly might prove advantageous if it is multi-modal and the forward model is highly nonlinear, we argue that having a surrogate of the forward model has its own merits. In particular, once the surrogate is built, it can be used not only for estimating the parameter posterior but also for forward uncertainty propagation and prediction, albeit in a limited way due to the localized nature of the surrogate.

Regarding the choice of Gaussian process models as surrogates, we refer to the recent comparison of surrogate-based uncertainty quantification methods conducted in [33]. Gaussian process models have several advantages over polynomial chaos, for example, in terms of their flexibility and the freedom in the choice of the design. Gaussian process models also prove to be more suitable for modeling nonlinear simulator behavior, and provide estimates of the prediction uncertainty. This last feature is particularly important for the method developed here.

The remainder of this paper is organized as follows. In Section 2, we provide the definition of the inverse problem and a brief summary of Bayesian inference. In Section 3, we review Gaussian process models for the single and the multiple output cases and Bayesian inference with GP models. This section provides a review of the existing methodology for training GP models and motivates our modeling choices. It culminates in the derivation of the GP-based likelihood function used for inference. We describe our adaptive approach to constructing the GP models in Section 4 and we show its performance in numerical experiments in Section 5. Here, we contrast our approach with the commonly used randomized designs that are not goal-oriented, such as Latin hypercube designs [37, Section 5.2.2]. We do not compare our results to any other adaptive or sequential experimental designs that aim to reduce the predictive errors in GP regression, see, for example, [36, 12] for the summaries. The objectives of such designs are different from those considered in our work. In Section 6, we draw conclusions and outline future research directions.

## 2 Bayesian inference

We start by formulating the inference problem and introducing the notation. Let the vector of parameters of interest be denoted by $\boldsymbol{\theta} \in \mathbb{R}^p$. These parameters serve as an input into the simulation model (i.e., the computer code) that represents a given physical system. Let $f: \mathbb{R}^p \to \mathbb{R}^q$ denote a mapping from the inputs to the outputs of the deterministic forward model. The components of the output vector will be denoted by $f_i$: $f(\boldsymbol{\theta}) = (f_1(\boldsymbol{\theta}), \dots, f_q(\boldsymbol{\theta}))^T$. Multiple outputs arise, for example, if the forward model depends on an additional (deterministic) variable $x$ that takes on $q$ values; in this case, each component of the output represents the value for a fixed $x_i$: $f_i(\boldsymbol{\theta}) = f(\boldsymbol{\theta}, x_i)$, $i = 1, \dots, q$. For example, $x$ could represent time in time-dependent problems.

The goal of inference is to learn the parameters $\boldsymbol{\theta}$ from the direct observations of the physical system. We will denote such observations (experimental data) of the output quantities by a vector $z \in \mathbb{R}^q$. The measured quantities are never perfect and contain measurement noise that will be denoted by a vector $e \in \mathbb{R}^q$. As in classical statistical inverse problems [18], we will view all the variables as random and use capital letters to represent them. Lower case letters will be reserved for their realizations.

We assume the following statistical model for the measurements with additive noise:

$$Z_i = f_i(\Theta) + E_i, \quad i = 1, \dots, q.$$

We further assume that the components of the measurement noise are normally distributed and potentially correlated. The probability density of the measurement noise is thus given by a $q$-variate normal density:

$$p_E(e) = \mathcal{N}_q(e \mid 0_q, \boldsymbol{\Sigma}_E) \coloneqq (2\pi)^{-q/2} |\boldsymbol{\Sigma}_E|^{-1/2} \exp\left[-\tfrac{1}{2} e^T \boldsymbol{\Sigma}_E^{-1} e\right]$$

with known covariance $\boldsymbol{\Sigma}_E$; here $0_q$ denotes a vector of $q$ zeros.

From the assumed statistical model it follows that $Z$ conditioned on $\Theta = \boldsymbol{\theta}$ is distributed like $\mathcal{N}_q(f(\boldsymbol{\theta}), \boldsymbol{\Sigma}_E)$, which leads to the following measurement likelihood function:

$$L(\boldsymbol{\theta} \mid z) \coloneqq p_E(z - f(\boldsymbol{\theta})) = \mathcal{N}_q(z - f(\boldsymbol{\theta}) \mid 0_q, \boldsymbol{\Sigma}_E). \qquad (1)$$

Here, we use the notation as in [24, Section 6.3], i.e., the likelihood is the density of the data considered as a function of the parameters $\boldsymbol{\theta}$ for fixed $z$.

Assuming the Bayesian framework, any prior information on the parameters is encoded in the prior density function $p(\boldsymbol{\theta})$. Given the prior and the observed measurements $z$, the solution of the inverse problem is the posterior density obtained by applying Bayes' rule:

$$p(\boldsymbol{\theta} \mid z) \propto L(\boldsymbol{\theta} \mid z)\, p(\boldsymbol{\theta}). \qquad (2)$$

The posterior density usually does not have a closed form and is explored via sampling, for example, with MCMC methods [26]. Applying MCMC requires repeated evaluations of the likelihood function $L(\boldsymbol{\theta} \mid z)$. Since these evaluations involve computing the forward model $f(\boldsymbol{\theta})$, they are expensive, making direct application of MCMC methods infeasible. The goal of the next sections is to develop a method for approximating the forward model with a Bayesian surrogate that allows an efficient and accurate computation of the likelihood function. We use a Gaussian process (GP) model as the surrogate model. Before describing the proposed method, we review the standard GP methodology in the next section.
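As a minimal illustration of (1)–(2), the unnormalized log-posterior combines a Gaussian log-likelihood with a log-prior. The sketch below uses a toy forward model and a flat prior; the names `forward` and `log_prior` are placeholders for illustration, not part of the paper:

```python
import numpy as np

def log_likelihood(theta, z, forward, Sigma_E):
    """Gaussian measurement log-likelihood from Eq. (1), up to an additive constant."""
    r = z - forward(theta)
    return -0.5 * r @ np.linalg.solve(Sigma_E, r)

def log_posterior(theta, z, forward, Sigma_E, log_prior):
    """Unnormalized log-posterior from Bayes' rule, Eq. (2)."""
    return log_likelihood(theta, z, forward, Sigma_E) + log_prior(theta)

# Toy example: forward model f(theta) = (theta, theta^2), observed with unit noise.
f = lambda th: np.array([th[0], th[0] ** 2])
z = np.array([1.0, 1.0])
Sigma_E = np.eye(2)
flat_prior = lambda th: 0.0  # flat prior on a bounded region
val = log_posterior(np.array([1.0]), z, f, Sigma_E, flat_prior)  # residual is zero here
```

An MCMC sampler would repeatedly call `log_posterior`, and each call triggers a forward-model evaluation, which is exactly the cost the surrogate is meant to remove.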

## 3 Gaussian Process models

In this section, we describe a GP model for the single and the multiple output cases, followed by Bayesian inference with GP models.

### 3.1 Single output case

We start with a one-dimensional GP model for the case of a single output $f$. Formally, a Gaussian process model $f^{GP}$ is a collection of random variables such that any finite number of them has a joint Gaussian distribution [35, Chapter 2]. This distribution is characterized by its mean and covariance functions. In the following, we take the mean function to be zero, and we let the covariance function be the squared exponential:

$$\mathrm{cov}\left(f^{GP}(\boldsymbol{\theta}), f^{GP}(\boldsymbol{\theta}')\right) = c(\boldsymbol{\theta}, \boldsymbol{\theta}') \coloneqq \sigma_c^2 \exp\left[-\sum_{i=1}^{p} \frac{(\theta_i - \theta_i')^2}{\ell_i^2}\right]. \qquad (3)$$

This covariance function expresses prior information about $f^{GP}$: it prescribes the common variance $\sigma_c^2$ to the values at different inputs $\boldsymbol{\theta}$ and expresses correlations through the distance between inputs weighted by a characteristic length-scale $\ell_i$ in each input dimension $i$. We will denote the vector of the parameters of the covariance function by $\boldsymbol{\psi}$,

$$\boldsymbol{\psi} = (\sigma_c, \ell_1, \dots, \ell_p)^T \in \mathbb{R}^{p+1},$$

and refer to them as hyperparameters. We will write the covariance function in the form $c(\boldsymbol{\theta}, \boldsymbol{\theta}'; \boldsymbol{\psi})$ to emphasize its dependence on the hyperparameters $\boldsymbol{\psi}$.

The choice of the zero mean function does not affect our methodology, but is assumed for convenience. In practice, it is common to model the mean using a fixed basis, which leads to the introduction of additional regression parameters, see, e.g., [21]. The modeling choice of the covariance function is usually more important as it encodes certain assumptions on the smoothness of $f$. We choose the squared exponential covariance (3) for reasons of its interpretability and widespread use. However, depending on the application problem and the underlying physical process at hand, other choices might be more appropriate, see, e.g., [35, Section 4.2].
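For concreteness, the squared exponential covariance (3) can be evaluated between two sets of inputs as follows. This is a NumPy sketch; the function name and the array-shape conventions are our own choices:

```python
import numpy as np

def sq_exp_cov(A, B, sigma_c, lengthscales):
    """Squared exponential covariance (3): sigma_c^2 * exp(-sum_i (a_i - b_i)^2 / ell_i^2).

    A: (n, p) and B: (m, p) arrays of inputs; lengthscales: (p,) vector of ell_i."""
    d = (A[:, None, :] - B[None, :, :]) / lengthscales  # pairwise weighted differences
    return sigma_c ** 2 * np.exp(-np.sum(d ** 2, axis=-1))

Theta = np.array([[0.0], [1.0]])
C = sq_exp_cov(Theta, Theta, sigma_c=1.0, lengthscales=np.array([1.0]))
# C[0, 0] = sigma_c^2 = 1; C[0, 1] = exp(-1) since the inputs are one length-scale apart
```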

Besides specifying the mean and the covariance functions, constructing a GP model requires choosing a set of input parameter values for training: $\boldsymbol{\theta}^{(j)}_{\mathrm{train}}$, $j = 1, \dots, n_{\mathrm{train}}$. Together with the corresponding values of the forward model, we form the training set $\mathcal{D}$:

$$\mathcal{D} \coloneqq \left\{\boldsymbol{\theta}^{(j)}_{\mathrm{train}},\, f(\boldsymbol{\theta}^{(j)}_{\mathrm{train}})\right\}_{j=1}^{n_{\mathrm{train}}}.$$

Given the training set $\mathcal{D}$ and the hyperparameters $\boldsymbol{\psi}$, the distribution of $f^{GP}$ at a test input $\boldsymbol{\theta}$ is given by

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}, \boldsymbol{\psi}) = \mathcal{N}\left(f^{GP} \mid m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}),\, V(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})\right), \qquad (4)$$

with the mean and the variance given by

$$m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}) = c_\psi^T C_\psi^{-1} y, \qquad (5a)$$

$$V(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}) = c(\boldsymbol{\theta}, \boldsymbol{\theta}; \boldsymbol{\psi}) - c_\psi^T C_\psi^{-1} c_\psi. \qquad (5b)$$

Here, $c_\psi$ is the vector of covariances between the test input $\boldsymbol{\theta}$ and the inputs in the training set given the hyperparameters $\boldsymbol{\psi}$, $(c_\psi)_i = c(\boldsymbol{\theta}, \boldsymbol{\theta}^{(i)}_{\mathrm{train}}; \boldsymbol{\psi})$, $y = (f(\boldsymbol{\theta}^{(1)}_{\mathrm{train}}), \dots, f(\boldsymbol{\theta}^{(n_{\mathrm{train}})}_{\mathrm{train}}))^T$ is the vector of training outputs, and $C_\psi$ is the matrix of covariances between the inputs in the training set given the hyperparameters $\boldsymbol{\psi}$:

$$(C_\psi)_{i,j} = c(\boldsymbol{\theta}^{(i)}_{\mathrm{train}}, \boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \boldsymbol{\psi}), \quad i, j = 1, \dots, n_{\mathrm{train}}.$$
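The predictive equations (5a)–(5b) amount to two linear solves against the training covariance matrix. A minimal NumPy sketch follows; the small jitter term for numerical stability is our addition, not part of the formulas:

```python
import numpy as np

def gp_predict(theta, Theta_train, y, cov, jitter=1e-10):
    """Predictive mean (5a) and variance (5b) of a zero-mean GP.

    cov(A, B) returns the covariance matrix between two input sets."""
    C = cov(Theta_train, Theta_train) + jitter * np.eye(len(y))  # C_psi
    c = cov(Theta_train, theta[None, :]).ravel()                 # c_psi
    m = c @ np.linalg.solve(C, y)                                # m = c^T C^{-1} y
    V = cov(theta[None, :], theta[None, :])[0, 0] - c @ np.linalg.solve(C, c)
    return m, V

# Squared exponential covariance with unit hyperparameters (illustrative choice).
cov = lambda A, B: np.exp(-np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))
Theta_train = np.array([[0.0], [1.0]])
y = np.sin(Theta_train).ravel()
m, V = gp_predict(np.array([0.0]), Theta_train, y, cov)
# At a training input the GP interpolates: m is (numerically) sin(0) = 0 and V is (numerically) 0.
```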

To train a GP model means to prescribe the hyperparameters $\boldsymbol{\psi}$ using the training set $\mathcal{D}$. This is commonly done using the evidence framework [28]: hyperparameters are chosen by maximizing the logarithm of the marginal likelihood (or evidence) of the training values:

$$\boldsymbol{\psi}^* = \operatorname*{arg\,max}_{\boldsymbol{\psi} \in B_\psi} \log L(\boldsymbol{\psi} \mid \mathcal{D}), \qquad (6)$$

where $B_\psi$ is a compact subset of $\mathbb{R}^{p+1}$, and

$$L(\boldsymbol{\psi} \mid \mathcal{D}) = \mathcal{N}_{n_{\mathrm{train}}}(y \mid 0_{n_{\mathrm{train}}}, C_\psi). \qquad (7)$$

In practice, the maximization of the log-marginal likelihood in (6) is performed by using a multi-start strategy to avoid getting trapped in local maxima. For a small number of training inputs, the slope of the log-marginal likelihood can be very low, leading to multiple hyperparameter values being consistent with the training data. In such cases, choosing the hyperparameter values by maximizing the log-marginal likelihood can become unreliable and produce estimates with high empirical variances [10]. Furthermore, the predictive variance of the GP model (5b) with the plug-in estimator (6) is known to underestimate the true mean-squared prediction error of the model [43].
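The log-marginal likelihood (7) and a crude multi-start search can be sketched as follows. The candidate grid and the synthetic data are illustrative assumptions; in practice one would use gradient-based optimizers with random restarts:

```python
import numpy as np

def log_marginal_likelihood(psi, Theta_train, y):
    """log L(psi | D) for a zero-mean GP with the squared exponential covariance (3), Eq. (7)."""
    sigma_c, ell = psi[0], psi[1:]
    d = (Theta_train[:, None, :] - Theta_train[None, :, :]) / ell
    C = sigma_c ** 2 * np.exp(-np.sum(d ** 2, axis=-1)) + 1e-10 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(y) * np.log(2 * np.pi))

# Crude multi-start: evaluate a grid of hyperparameter candidates and keep the best.
rng = np.random.default_rng(0)
Theta_train = rng.uniform(0.0, 1.0, (8, 1))
y = np.sin(3.0 * Theta_train).ravel()
candidates = [np.array([s, l]) for s in (0.5, 1.0, 2.0) for l in (0.1, 0.5, 1.0)]
psi_star = max(candidates, key=lambda p: log_marginal_likelihood(p, Theta_train, y))
```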

An alternative way of training a GP model is to adopt a fully Bayesian perspective. Instead of taking the point-estimate of the hyperparameters as in (6), we can condition the predictive distribution (4) on the hyperparameter distribution. It has been reported that the fully Bayesian approach gives wider confidence bounds than predictors based on plug-in estimators, thus better accounting for the uncertainty about the covariance function [14]. In the following we adopt the fully Bayesian approach to the GP model training. That is, we specify a prior on the hyperparameters, $p(\boldsymbol{\psi})$, and use the likelihood function $L(\boldsymbol{\psi} \mid \mathcal{D})$ from (7) to obtain the hyperparameter posterior using MCMC methods:

$$p(\boldsymbol{\psi} \mid \mathcal{D}) \propto L(\boldsymbol{\psi} \mid \mathcal{D})\, p(\boldsymbol{\psi}).$$

As a result, we obtain an ensemble of $n_\psi$ samples of the hyperparameter vector distributed according to $p(\boldsymbol{\psi} \mid \mathcal{D})$: $\{\boldsymbol{\psi}^{(j)}\}_{j=1}^{n_\psi}$.

The predictive distribution of the GP model at a test point $\boldsymbol{\theta}$ can then be obtained by marginalizing over (integrating out) the hyperparameters:

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}) = \int p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}, \boldsymbol{\psi})\, p(\boldsymbol{\psi} \mid \mathcal{D})\, d\boldsymbol{\psi},$$

where $p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}, \boldsymbol{\psi})$ is given by (4). Using the samples of the hyperparameter posterior computed with MCMC, this integral can be discretized as follows:

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}) \approx \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}, \boldsymbol{\psi}^{(j)}). \qquad (8)$$

Thus, we obtain a Gaussian mixture model of the predictive distribution. The mean of this model is simply the average of the means from (5a) with $\boldsymbol{\psi} = \boldsymbol{\psi}^{(j)}$,

$$m(\boldsymbol{\theta}; \mathcal{D}) = \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}), \qquad (9)$$

and the variance can be obtained using the means $m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})$ together with $V(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})$ from (5b) as follows:

$$V(\boldsymbol{\theta}; \mathcal{D}) = \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} V(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}) + \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \left(m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right)^2 - \left(\frac{1}{n_\psi} \sum_{j=1}^{n_\psi} m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right)^2. \qquad (10)$$
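Equations (9) and (10) are the mean and variance of an equally weighted Gaussian mixture; given the per-sample predictive moments they reduce to a few lines (a sketch with made-up moment values):

```python
import numpy as np

def mixture_moments(means, variances):
    """Mean (9) and variance (10) of the predictive Gaussian mixture (8).

    means, variances: per-hyperparameter-sample moments m(.; D, psi_j), V(.; D, psi_j)."""
    means, variances = np.asarray(means), np.asarray(variances)
    m = means.mean()
    # Law of total variance: average within-component variance plus variance of the means.
    V = variances.mean() + (means ** 2).mean() - m ** 2
    return m, V

m, V = mixture_moments([0.0, 2.0], [1.0, 1.0])
# m = 1.0; V = 1.0 + (0 + 4)/2 - 1.0 = 2.0
```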

### 3.2 Multiple output case

Gaussian process models for the multi-output case are discussed in detail in [6], where the authors employ a matrix-normal distribution for the training data to account for possible correlations between the outputs. In our work, we take a simplified approach from [2] and [1] that assumes that the outputs are conditionally independent given the covariance function. This treatment of the multi-output case is similar to [6] if a diagonal correlation matrix and a constant mean are used, and is based on an assumption that the regularity of the outputs is approximately the same. In order to account for potentially different scales of the outputs, we normalize the training outputs as described below.

Let $f_i$ denote the $i$-th output of the forward model, $i = 1, \dots, q$. Now, for the same set of training inputs $\{\boldsymbol{\theta}^{(j)}_{\mathrm{train}}\}_{j=1}^{n_{\mathrm{train}}}$, we have $q$ sets of output values which we will denote by $\mathcal{D}_i$:

$$\mathcal{D}_i = \left\{\boldsymbol{\theta}^{(j)}_{\mathrm{train}},\, f_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}})\right\}_{j=1}^{n_{\mathrm{train}}}, \quad i = 1, \dots, q.$$

We will write $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_q\}$. Assuming conditionally independent outputs, the marginal likelihood of the training outputs becomes

$$L(\boldsymbol{\psi} \mid \mathcal{D}) = \prod_{i=1}^{q} L(\boldsymbol{\psi} \mid \mathcal{D}_i), \qquad (11)$$

with the one-dimensional likelihoods given by

$$L(\boldsymbol{\psi} \mid \mathcal{D}_i) = \mathcal{N}_{n_{\mathrm{train}}}(y_i \mid 0_{n_{\mathrm{train}}}, C_\psi).$$

Here, the training outputs $y_i$ represent scaled responses:

$$y_i = \left(\hat{f}_i(\boldsymbol{\theta}^{(1)}_{\mathrm{train}}), \dots, \hat{f}_i(\boldsymbol{\theta}^{(n_{\mathrm{train}})}_{\mathrm{train}})\right)^T \qquad (12)$$

with

$$\hat{f}_i(\boldsymbol{\theta}) = \frac{f_i(\boldsymbol{\theta}) - m_i}{V_i^{1/2}},$$

where

$$m_i = \frac{1}{n_{\mathrm{train}}} \sum_{j=1}^{n_{\mathrm{train}}} f_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}), \qquad V_i = \frac{1}{n_{\mathrm{train}}} \sum_{j=1}^{n_{\mathrm{train}}} \left(f_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}) - m_i\right)^2.$$
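The output scaling in (12) is a per-output standardization over the training set. A sketch, with an illustrative output matrix:

```python
import numpy as np

def scale_outputs(F):
    """Standardize each output column: y_i = (f_i - m_i) / V_i^{1/2}, as in Eq. (12).

    F: (n_train, q) matrix of forward-model outputs at the training inputs."""
    m = F.mean(axis=0)
    V = ((F - m) ** 2).mean(axis=0)  # population variance, matching the text's V_i
    return (F - m) / np.sqrt(V), m, V

F = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
Y, m, V = scale_outputs(F)
# Each column of Y now has zero mean and unit (population) variance.
```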

Similarly to the single-output case, the likelihood (11) is used to obtain the samples of the hyperparameter posterior $p(\boldsymbol{\psi} \mid \mathcal{D})$: $\{\boldsymbol{\psi}^{(j)}\}_{j=1}^{n_\psi}$. Under the standing assumption of conditional independence,

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}, \boldsymbol{\psi}) = \prod_{i=1}^{q} p(f_i^{GP} \mid \boldsymbol{\theta}, \mathcal{D}_i, \boldsymbol{\psi}),$$

which leads to the following predictive density of the combined output vector $f^{GP} = (f_1^{GP}, \dots, f_q^{GP})^T$:

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}) \approx \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \prod_{i=1}^{q} p(f_i^{GP} \mid \boldsymbol{\theta}, \mathcal{D}_i, \boldsymbol{\psi}^{(j)}). \qquad (13)$$

In a compact form, (13) can be written as a mixture of $q$-variate Gaussians:

$$p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D}) \approx \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \mathcal{N}_q\left(f^{GP} \mid m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}),\, \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right), \qquad (14)$$

where for each $j = 1, \dots, n_\psi$ (applying the necessary re-scaling)

$$m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}) = \left(m_1(\boldsymbol{\theta}; \mathcal{D}_1, \boldsymbol{\psi}^{(j)}), \dots, m_q(\boldsymbol{\theta}; \mathcal{D}_q, \boldsymbol{\psi}^{(j)})\right)^T \qquad (15)$$

with the components $m_i$ as in (5a), and

$$\boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}) = \mathrm{diag}\left(V_1(\boldsymbol{\theta}; \mathcal{D}_1, \boldsymbol{\psi}^{(j)}), \dots, V_q(\boldsymbol{\theta}; \mathcal{D}_q, \boldsymbol{\psi}^{(j)})\right) \qquad (16)$$

with the components $V_i$ as in (5b).

with as in (5b). The mean of the mixture distribution (14) is given by

$$m(\boldsymbol{\theta}; \mathcal{D}) = \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}), \qquad (17)$$

and the covariance is given by

$$\boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}) = \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}) + \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\, m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})^T - m(\boldsymbol{\theta}; \mathcal{D})\, m(\boldsymbol{\theta}; \mathcal{D})^T.$$

### 3.3 Bayesian inference with GP models

Now, we re-formulate problem (2) using the multi-output GP surrogate with the predictive distribution given by (14). While the simplest approach would be to substitute $f(\boldsymbol{\theta})$ in the likelihood definition in (1) with the mean vector $m(\boldsymbol{\theta}; \mathcal{D})$ in (17), such an approach would ignore the uncertainty of the surrogate. The availability of uncertainty estimates is the strength of the GP model and should therefore be exploited. Hence, we follow the approach in [1] and substitute $L(\boldsymbol{\theta} \mid z)$ in (1) with $L(\boldsymbol{\theta} \mid z, \mathcal{D})$, the so-called $\mathcal{D}$-restricted likelihood function defined as follows:

$$L(\boldsymbol{\theta} \mid z, \mathcal{D}) \coloneqq \int L(\boldsymbol{\theta} \mid z, f^{GP})\, p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D})\, df^{GP}, \qquad (18)$$

where $L(\boldsymbol{\theta} \mid z, f^{GP})$ is the likelihood from (1) evaluated with $f^{GP}$ instead of $f(\boldsymbol{\theta})$:

$$L(\boldsymbol{\theta} \mid z, f^{GP}) \coloneqq \mathcal{N}_q(z - f^{GP} \mid 0_q, \boldsymbol{\Sigma}_E). \qquad (19)$$

Next, we plug the mixture approximation of $p(f^{GP} \mid \boldsymbol{\theta}, \mathcal{D})$ from (14) and the likelihood from (19) into (18) and integrate the product of the two Gaussians:

$$L(\boldsymbol{\theta} \mid z, \mathcal{D}) \approx \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \int \mathcal{N}_q(f^{GP} \mid z, \boldsymbol{\Sigma}_E)\, \mathcal{N}_q\left(f^{GP} \mid m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)}),\, \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right) df^{GP}$$

$$= \sum_{j=1}^{n_\psi} \frac{k^{(j)}}{n_\psi} \exp\left[-\frac{1}{2}\left(z - m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right)^T \left(\boldsymbol{\Sigma}_E + \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right)^{-1} \left(z - m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right)\right] \qquad (20)$$

with

$$k^{(j)} \coloneqq (2\pi)^{-q/2} \left|\boldsymbol{\Sigma}_E + \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right|^{-1/2}, \quad j = 1, \dots, n_\psi.$$

The obtained $\mathcal{D}$-restricted likelihood function incorporates both the measurement errors and the uncertainty of the GP model. Once the posterior samples of the hyperparameters are obtained, the approximation of $L(\boldsymbol{\theta} \mid z, \mathcal{D})$ in (20) at a given test input requires, for each $\boldsymbol{\psi}^{(j)}$, $O(q\, n_{\mathrm{train}}^2)$ operations for computing the predictive means and covariances (15)–(16) and $O(q^3)$ operations for inverting the sum of the noise and the GP covariances, bringing the total cost to $O(n_\psi (q\, n_{\mathrm{train}}^2 + q^3))$ operations. In the special case of uncorrelated measurement noise and with the assumption of conditional independence of the outputs that we make, the matrix $\boldsymbol{\Sigma}_E + \boldsymbol{\Sigma}^{GP}$ is diagonal and the cost of inverting it becomes $O(q)$ instead of $O(q^3)$. The dominant cost then becomes that of computing the predictive means and the variances. In our target applications, this cost is negligible since the number of training inputs is small and the cost of the forward model evaluation is large. In the numerical examples, we consider small numbers of training inputs. In general, the choice of $n_{\mathrm{train}}$ is motivated by design considerations and may depend on the smoothness of the forward model mapping and on the dimension of the input space. For cases in which the number of training inputs is large, various approximations of the covariance matrix can be considered, see, for example, [35, Chapter 8].
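The mixture form of (20) can be evaluated directly once the per-sample predictive means and covariances are available. A sketch, with $k^{(j)}$ the Gaussian normalizing constant and illustrative single-component inputs:

```python
import numpy as np

def restricted_likelihood(z, means, covs, Sigma_E):
    """D-restricted likelihood (20): average of Gaussians N(z | m_j, Sigma_E + Sigma_GP_j)."""
    q = len(z)
    total = 0.0
    for m_j, S_j in zip(means, covs):
        S = Sigma_E + S_j
        r = z - m_j
        k = (2 * np.pi) ** (-q / 2) / np.sqrt(np.linalg.det(S))  # normalizing constant k^(j)
        total += k * np.exp(-0.5 * r @ np.linalg.solve(S, r))
    return total / len(means)

z = np.array([0.0, 0.0])
L = restricted_likelihood(z, [z.copy()], [np.eye(2)], np.eye(2))
# Single component centred at z: L = (2*pi)^{-1} * det(2I)^{-1/2} = 1 / (4*pi)
```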

Analogously to (2), the $\mathcal{D}$-restricted likelihood leads to the $\mathcal{D}$-restricted posterior $p(\boldsymbol{\theta} \mid z, \mathcal{D})$:

$$p(\boldsymbol{\theta} \mid z, \mathcal{D}) \propto L(\boldsymbol{\theta} \mid z, \mathcal{D})\, p(\boldsymbol{\theta}). \qquad (21)$$

Next, we analyze the approximation of $L(\boldsymbol{\theta} \mid z, \mathcal{D})$ in (20) and develop a sequential adaptive strategy for selecting training inputs based on the current data $\mathcal{D}$.

## 4 Adaptive construction of GP models

Given the current training set $\mathcal{D}$, it is of interest how to select additional training inputs in order to make the GP-based likelihood $L(\boldsymbol{\theta} \mid z, \mathcal{D})$ more accurately represent the unattainable (due to its cost) "true" likelihood $L(\boldsymbol{\theta} \mid z)$. In particular, we would like to correctly capture the modes of the true likelihood. Therefore, we attempt to find the minima of the "true" misfit function

$$g(\boldsymbol{\theta}) \coloneqq (z - f(\boldsymbol{\theta}))^T \boldsymbol{\Sigma}_E^{-1} (z - f(\boldsymbol{\theta})). \qquad (22)$$

We start by defining the misfit function of the $\mathcal{D}$-restricted likelihood (20):

$$g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}) \coloneqq (z - m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}))^T \left(\boldsymbol{\Sigma}_E + \boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})\right)^{-1} (z - m(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})). \qquad (23)$$

With this definition we can re-write (20) as

$$L(\boldsymbol{\theta} \mid z, \mathcal{D}) \approx \sum_{j=1}^{n_\psi} \frac{k^{(j)}}{n_\psi} \exp\left[-\frac{1}{2}\, g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right].$$

The important properties of $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})$ are summarized in the following proposition.

###### Proposition 4.1

The misfit function $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})$ in (23) has the following properties:

(a) it interpolates the true misfit function $g(\boldsymbol{\theta})$ at the inputs in the training set $\mathcal{D}$;

(b) it is continuously differentiable with respect to $\boldsymbol{\theta}$.

Proof:

(a) Observe that for each output $i$, $m_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \mathcal{D}_i, \boldsymbol{\psi}) = f_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}})$ and $V_i(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \mathcal{D}_i, \boldsymbol{\psi}) = 0$ for $j = 1, \dots, n_{\mathrm{train}}$ due to the interpolative properties of the GP model (see [37, Section 4.1]). Then, from (15) and (16) we have $m(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \mathcal{D}, \boldsymbol{\psi}) = f(\boldsymbol{\theta}^{(j)}_{\mathrm{train}})$ and $\boldsymbol{\Sigma}^{GP}(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \mathcal{D}, \boldsymbol{\psi}) = 0$. Hence,

$$g(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}; \mathcal{D}, \boldsymbol{\psi}) = g(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}), \quad \text{for } j = 1, \dots, n_{\mathrm{train}} \text{ and all } \boldsymbol{\psi}.$$

(b) The predictive mean and variance of a Gaussian process inherit their smoothness properties from the underlying covariance function $c$. The squared exponential function (3) considered here is in fact infinitely differentiable.

Outside of the training set $\mathcal{D}$, $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})$ provides an estimate of the misfit between the GP model with the hyperparameter vector $\boldsymbol{\psi}$ and the measurement data $z$. By treating $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})$ as a random function of $\boldsymbol{\psi}$ for a given test input $\boldsymbol{\theta}$, with distribution induced by $p(\boldsymbol{\psi} \mid \mathcal{D})$, we can explore the minima of $g(\boldsymbol{\theta})$ using an auxiliary "acquisition function". Specifically, we employ the expected improvement idea from Bayesian optimization [17].

Denote the best (i.e., the smallest) misfit value for the points in the training set as

$$g_{\min} \coloneqq \min_{j = 1, \dots, n_{\mathrm{train}}} g(\boldsymbol{\theta}^{(j)}_{\mathrm{train}}). \qquad (24)$$

Consider the following problem:

$$\max_{\boldsymbol{\theta} \in B_\theta} I(\boldsymbol{\theta}) \coloneqq \frac{1}{n_\psi} \sum_{j=1}^{n_\psi} \left[g_{\min} - g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}^{(j)})\right]_+, \qquad (25)$$

where $[\,\cdot\,]_+$ denotes the positive part function, $[x]_+ = \max(x, 0)$, and $B_\theta$ is a closed and bounded subset of $\mathbb{R}^p$. The idea behind formulation (25) is to find the input that offers the largest expected improvement in the fit to the measurement data under the mixture GP model, conditioned on the misfit being smaller than the current best true misfit value. More specifically, consider the following relative improvement in fit function:

$$I_{\mathrm{rel}}(\boldsymbol{\theta}; \boldsymbol{\psi}) \coloneqq \left[1 - \frac{g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})}{g_{\min}}\right]_+.$$

For fixed $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$, if $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi}) \geq g_{\min}$, the improvement in fit is zero: the misfit between the GP model corresponding to $\boldsymbol{\psi}$ and the measurement data at the given input is the same or larger than the current best misfit value. This is reflected in $I_{\mathrm{rel}}(\boldsymbol{\theta}; \boldsymbol{\psi}) = 0$. The maximum achievable relative improvement value is $1$, which corresponds to the case when the GP model with a fixed $\boldsymbol{\psi}$ takes the exact value of the measurement data at a given $\boldsymbol{\theta}$. The fact that we have a mixture of GP models for a given $\boldsymbol{\theta}$, induced by $p(\boldsymbol{\psi} \mid \mathcal{D})$, means that, for the same $\boldsymbol{\theta}$, $I_{\mathrm{rel}}$ can take a range of values from $0$ to $1$. By taking the average of the relative improvement values at a given $\boldsymbol{\theta}$, we obtain an "expected improvement" under the current GP mixture model. This motivates the formulation in (25) (with the objective multiplied by $g_{\min}$). By adding a maximizer of (25) to the training set $\mathcal{D}$, we strive to improve our GP model in a way that improves the likelihood approximation $L(\boldsymbol{\theta} \mid z, \mathcal{D})$.
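Given the per-sample predictive moments at a candidate input, the misfit (23) and the acquisition objective in (25) take only a few lines. A sketch with two illustrative hyperparameter samples (the moment values are made up for the example):

```python
import numpy as np

def misfit(z, m, Sigma_E, Sigma_GP):
    """GP-based misfit g(theta; D, psi), Eq. (23)."""
    r = z - m
    return r @ np.linalg.solve(Sigma_E + Sigma_GP, r)

def expected_improvement(z, means, covs, Sigma_E, g_min):
    """Acquisition I(theta) from (25): average positive part of g_min - g(theta; D, psi_j)."""
    g = np.array([misfit(z, m, Sigma_E, S) for m, S in zip(means, covs)])
    return np.maximum(g_min - g, 0.0).mean()

z = np.array([0.0])
Sigma_E = np.eye(1)
# Two hyperparameter samples: one predicts z exactly, one misses by 1 (zero GP variance here).
means = [np.array([0.0]), np.array([1.0])]
covs = [np.zeros((1, 1)), np.zeros((1, 1))]
I = expected_improvement(z, means, covs, Sigma_E, g_min=0.5)
# g = [0.0, 1.0]; per-sample improvements = [0.5, 0.0]; I = 0.25
```

Maximizing `expected_improvement` over the candidate set would then select the next training input.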

The strategy described above is a greedy one-step look-ahead strategy that is focused on finding the modes of the likelihood function $L(\boldsymbol{\theta} \mid z)$. This strategy is designed to explore the parameter space globally just enough to make sure that the GP-based likelihood $L(\boldsymbol{\theta} \mid z, \mathcal{D})$ does not have modes in the regions where the true likelihood is flat, and to generate a sufficient number of training inputs locally around the regions of the modes of $L(\boldsymbol{\theta} \mid z)$, where the GP accuracy is needed the most. Thus, our strategy balances exploration and exploitation: the global search reduces the uncertainty in the GP model while the local search samples in regions where the data misfit is likely to be minimized. The pseudocode of the full algorithm is described in Algorithm 1.

The relative improvement function motivates a natural stopping criterion for our algorithm. Specifically, we terminate the algorithm if the maximum average relative improvement value is smaller than some threshold $\epsilon$. In the numerical examples, this threshold is set to a small fixed value. For the formulation (25) that we use for the solution, this means that we terminate when the objective value is less than or equal to $\epsilon\, g_{\min}$.

### 4.1 Difference with the original expected improvement criterion

In the original paper [17] that popularized the expected improvement idea, this criterion was applied to the problem of finding the global minimum of a model function approximated by a Gaussian process. Since integration was performed with respect to a Gaussian variable, the expected improvement function could be derived in closed form. It was then maximized using a branch-and-bound algorithm. The closed-form expression from [17] does not apply in our case since the distribution of the misfit function at a given $\boldsymbol{\theta}$ is not normal but determined by $p(\boldsymbol{\psi} \mid \mathcal{D})$. In this sense, our expected improvement in fit criterion is different from the expected improvement in the Bayesian optimization literature, where the hyperparameters are usually fixed prior to computing the expectation with respect to the random GP variable. A notable exception is the approach taken in [39], where the original expected improvement criterion is marginalized over the hyperparameter distribution.

### 4.2 Analysis of the proposed algorithm

Problem (25) is well-defined: the misfit function $g(\boldsymbol{\theta}; \mathcal{D}, \boldsymbol{\psi})$ is continuous, see part (b) of Proposition 4.1, and the maximization is performed over a compact set $B_\theta$. Thus, there exists a solution to problem (25). In case of multiple maximizers, we select any one of them; we will consider the simultaneous selection of multiple maximizers in the future.

Furthermore, observe that, as Algorithm 1 progresses, the value of $g_{\min}$ either decreases or stays the same. The optimal value of (25) might not decrease with every iteration; however, as our numerical experiments in Section 5 demonstrate, with the addition of new training points, it does decrease and eventually falls below the threshold value. The following simple observation ensures that new information about the forward model is obtained in every iteration.

###### Proposition 4.2

At each iteration, if the new training input selected by Algorithm 1 is added to the training set $\mathcal{D}$, it is necessarily distinct from the other points in $\mathcal{D}$.

Proof: Suppose that at a given iteration the maximizer $\boldsymbol{\theta}^*$ of problem (25) is already in the training set $\mathcal{D}$. Due to part (a) of Proposition 4.1 and (24), we have $g(\boldsymbol{\theta}^*; \mathcal{D}, \boldsymbol{\psi}^{(j)}) = g(\boldsymbol{\theta}^*) \geq g_{\min}$ for $j = 1, \dots, n_\psi$. Therefore, $I(\boldsymbol{\theta}^*) = 0$, which means that the stopping criterion in line 6 of Algorithm 1 has been satisfied. The algorithm terminates without adding $\boldsymbol{\theta}^*$ to the training set.

If the hyperparameters of the GP covariance function were known and fixed, the predictive mean would be the best linear unbiased predictor, and the mean squared prediction error, given by the predictive variance, would decrease with the addition of each new training point. As more data accumulated, under certain assumptions on the generating process, the mean of the GP model would converge to the model function and the variance would go to zero; see, for example, [40] for the case of stationary covariance functions with known hyperparameters.

In the case of hyperparameters estimated from the data, as considered here, the predictive mean estimator is neither linear nor unbiased; therefore, it becomes difficult to make statements about the mean squared prediction error and the accuracy of its estimates.

The convergence of the standard expected improvement algorithm of [17] for a fixed Gaussian process prior was analyzed in [41]. The results in that paper, while interesting from a theoretical point of view, are not applicable in practice, since the prior on the GP model cannot be known in advance. Convergence rates of the expected improvement strategy for finding the global minimum of a function modeled by a Gaussian process with a fixed prior have been derived in [4]. The authors also extended their results to the case where the parameters of the GP prior are estimated from the data by maximizing the marginal likelihood, and provided an automatic choice of the parameters that retains the convergence rate of the fixed prior. In both papers, convergence results were stated in the norm of the Reproducing Kernel Hilbert Space associated with the chosen covariance function of the Gaussian process [35, Section 6.1]. As explained in Section 4.1, we apply the expected improvement to the misfit function, which depends on the mean and the variance of a Gaussian process but is not a Gaussian process itself. Thus, the above-mentioned results cannot be directly applied in our case, and further detailed analysis of the proposed algorithm is needed.

## 5 Numerical experiments

In this section, we demonstrate our method on a one-dimensional model problem with a single output and on a two-dimensional source inversion problem with multiple outputs. Implementation details and additional experiments on a higher-dimensional problem are reported in the supplementary materials A and B.

### 5.1 One-dimensional example

We start by testing the proposed algorithm on a univariate scalar function. This allows us to illustrate the steps of the algorithm and to provide intuition behind it. We use the forward model function

 f(θ) = (θ² − 5θ + 6) / (θ² + 1),  θ ∈ [−6, +6]. (26)

We generate measurement data by evaluating the function at a fixed input and adding zero-mean Gaussian noise. The function defined in (26) and the measurement are shown in Figure 1.
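For reference, the forward model (26) and the synthetic-data step take only a few lines of Python. The "true" input and the noise standard deviation below are placeholders, since their numerical values are not reproduced in this passage.

```python
import numpy as np

def forward(theta):
    # One-dimensional forward model (26): rational function on [-6, 6].
    return (theta**2 - 5.0 * theta + 6.0) / (theta**2 + 1.0)

# Synthetic measurement: evaluate at an assumed "true" input and add
# zero-mean Gaussian noise (both values are illustrative only).
rng = np.random.default_rng(42)
theta_true = 2.5      # placeholder, not from the paper
sigma_noise = 0.1     # placeholder, not from the paper
z = forward(theta_true) + rng.normal(0.0, sigma_noise)
```

Note that the numerator of (26) vanishes at θ = 2 and θ = 3, so the model output crosses zero twice inside the domain.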

We use three equidistant input values and the corresponding function values as the initial design. Next, we apply Algorithm 1 in the following way. For the hyperparameter estimation, we start with uniform priors on the covariance parameters (see (1)). Using MCMC, we obtain posterior samples (see A for details). Figure 1(a) shows the predictive means (see (5a)) corresponding to different hyperparameter vectors, together with the confidence regions around them based on the predictive variances (5b). We observe that a small number of initial training inputs results in a broad hyperparameter posterior with a variety of corresponding GP models. The variances at the untested inputs are quite large, leading to wide confidence regions. This ensemble of GP models consistent with the initial training data gives a better idea of the uncertainty of the surrogate model than any single surface approximation would, for example, the one corresponding to the maximum likelihood estimate of the hyperparameters (6).

Using the ensemble of GP models, we now formulate the optimization problem (25). The (scaled) objective function is presented in Figure 1(b). By comparing Figures 1(a) and 1(b), we observe that the expected improvement is largest in the regions where the majority of predictive means are close to the measurement value; it also grows with the distance to the points in the training set. This is the desired behavior: we want to explore the regions with the highest uncertainty that are most likely to result in values at the level of the measurement.

We solve problem (25) using gradient-based optimization, as described in the supplement A, with initial points taken equidistantly on the parameter interval. The obtained maximizers are shown as blue circles in Figure 1(b); the larger red circle corresponds to the largest maximum. We add this maximizer to the training set and proceed. We terminate the algorithm once the stopping criterion in line 6 of Algorithm 1 is satisfied. At the final step, the means of the predictive distributions corresponding to the ensemble of hyperparameters look as shown in Figure 1(c). We observe that our adaptive algorithm has primarily added to the training set inputs in the right part of the domain, where the function (26) takes values closest to the measurement. With the large number of training inputs in this region, the GP models are indistinguishable from the true function and the variances are low. In the left part of the domain, on the other hand, the adaptive algorithm has added only two more inputs to the training set; as a result, there is still considerable uncertainty associated with the GP predictions in that region. However, most of the ensemble means in this region are far from the measurement value, and the uncertainty of the ensemble values is not sufficiently large to imply that the measurement data could have originated in this region and to warrant further exploration. This is reflected in the maximum value of the relative expected improvement being below the threshold (see Figure 1(d)).
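A minimal sketch of the multistart strategy just described, using SciPy's L-BFGS-B from equidistant starting points, is shown below. The helper name, the number of starts, and the test objective are all illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def multistart_maximize(objective, lo, hi, n_starts=13):
    # Maximize a possibly multimodal scalar objective on [lo, hi] by
    # running gradient-based local searches from equidistant start points
    # and keeping the best local maximum found.
    best_x, best_val = None, -np.inf
    for x0 in np.linspace(lo, hi, n_starts):
        # minimize the negated objective to perform maximization
        res = minimize(lambda x: -objective(x[0]), x0=[x0],
                       bounds=[(lo, hi)], method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = float(res.x[0]), float(-res.fun)
    return best_x, best_val
```

Because the expected improvement surface is multimodal (one mode per promising region), a single local search could miss the global maximizer; the equidistant restarts guard against that.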

The obtained mixture GP approximation is presented in Figure 2(a). Here, we plot the mean of the ensemble as in (9) and base the confidence region on (10). The inputs in the final training set are shown as red circles. With this final design, we evaluate and plot the approximate restricted likelihood function (20) (see Figure 3(a)). In the same figure, we plot the "true" likelihood evaluated using the forward model function (26). We observe that the two are identical.
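Assuming (9) and (10) are the usual mixture moments (the ensemble mean, and the within-plus-between variance given by the law of total variance), they can be computed from the per-sample predictive means and variances as follows. This is a sketch under that assumption, not the paper's code.

```python
import numpy as np

def mixture_moments(means, variances):
    # Moments of an equally weighted GP mixture over hyperparameter
    # posterior samples. `means` and `variances` have shape
    # (n_samples, n_test): one row per hyperparameter draw.
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    m = means.mean(axis=0)
    # law of total variance: average within-sample variance
    # plus the between-sample variance of the predictive means
    v = variances.mean(axis=0) + means.var(axis=0)
    return m, v
```

The between-sample term is what widens the confidence region where the hyperparameter posterior is broad, even if each individual GP is confident.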

Finally, we contrast our adaptively constructed GP-based likelihood with one constructed using a naive fixed design. The mixture GP model constructed with equidistant training inputs is shown in Figure 2(b), and the restricted likelihood based on this design is shown in Figure 3(b). We observe that the GP-based likelihood using the naive non-adaptive design is of considerably worse quality than the one built adaptively, even though the GP model corresponding to the naive design looks reasonably good and has narrow confidence regions.

### 5.2 Source inversion

Next, we consider the source inversion problem studied in [30] and later in [25]. The forward model is given by a diffusion equation in two dimensions:

 ∂u/∂t − ∇²u = s(x, t),  x ∈ Ω ≔ [0, 1]², (27a)
 ∇u · n = 0,  x ∈ ∂Ω, (27b)
 u(x, 0) = 0,  x ∈ Ω. (27c)

The source term is given by

 s(x, t) = (a / (2πh²)) exp(−‖θ − x‖² / (2h²)) for 0 ≤ t ≤ τ, and s(x, t) = 0 for t > τ.

The parameters a, h, and τ have fixed values. The location of the source center is denoted by θ and is the (two-dimensional) parameter of interest. We solve (27) in FEniCS [27] using a uniform finite element mesh with piecewise-linear Lagrange elements and backward Euler time discretization with a fixed time step.
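The source term above is straightforward to code. The default parameter values below are placeholders, since the fixed values of a, h, and τ are not reproduced in this passage.

```python
import numpy as np

def source(x, t, theta, a=1.0, h=0.1, tau=0.3):
    # Gaussian source of amplitude a and width h centered at theta,
    # active only until the shutoff time tau (values are placeholders).
    if t > tau:
        return 0.0
    x = np.asarray(x, float)
    theta = np.asarray(theta, float)
    return a / (2.0 * np.pi * h**2) * np.exp(
        -np.sum((theta - x)**2) / (2.0 * h**2))
```

At the source center the term attains its peak value a / (2πh²), and it decays rapidly at distances larger than a few multiples of h.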

The measurements are taken at several times on a uniform spatial grid covering the domain, resulting in a total of 18 measurements. The measurement noise is assumed to be a vector of independent zero-mean Gaussian random variables. Thus, the model is:

 z = f(θ) + e,  eᵢ ∼ N(0, σᵢ²),  i = 1, …, 18.

We fix the noise variances σᵢ² to a common value for all i. The measurement data is generated by solving the forward model at the true source location and adding noise. To avoid the obvious "inverse crime" [18, Section 1.2], the measurement data is generated with a finer grid and a smaller time step.

We start with an initial design of inputs arranged in a Latin hypercube (see the blue squares in Figure 4(a)). The priors for the hyperparameters are chosen analogously to the one-dimensional example. The hyperparameter posterior is obtained with MCMC using the likelihood function (11) with normalized outputs (12).

We run Algorithm 1 with fixed values of its algorithmic parameters. To solve (25), we initialize the optimization algorithm with points from a two-dimensional Sobol sequence. If the resulting maximum expected improvement is less than the threshold, we perform another search initialized at additional Sobol points.
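The Sobol-sequence initialization can be sketched with SciPy's quasi-Monte Carlo module. The number of points below is illustrative; the paper's exact count is not given in this passage.

```python
from scipy.stats import qmc

# Two-dimensional Sobol sequence over the unit square, matching the
# domain of the source location. Using a power-of-two count (here 16)
# preserves the balance properties of the sequence.
sampler = qmc.Sobol(d=2, scramble=False)
starts = sampler.random(n=16)  # shape (16, 2), points in [0, 1)^2
```

Low-discrepancy starting points cover the square more evenly than i.i.d. uniform draws, which helps the multistart search locate the global maximizer of the expected improvement surface.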

Figure 6 shows the iteration history of Algorithm 1. The red solid line shows, at each iteration, the ratio of the reference minimum of the misfit to the current optimal misfit value, where the reference minimum is found by exhaustively searching the region around the true source location and is used here only as a reference (this value is, of course, unknown in practice). The blue dashed line shows one minus the relative expected improvement at the maximizer of problem (25) at each iteration. As the algorithm progresses, we expect both lines to approach 1.

Comparing Figure 6 with the order in which the inputs were added to the training set (see the numbers in Figure 4(a)), we can make a few observations. At the initial stages, the inputs that maximize the expected improvement are located in the interior of the domain, around the true source location. The relative expected improvement value is high, and the optimal misfit value remains unchanged. Upon addition of one of these interior inputs, the optimal misfit value drops, and the algorithm starts adding inputs corresponding to high variance. As expected, these inputs lie on the boundary, where the uncertainty is highest. During this stage, the relative expected improvement steadily decreases. Finally, the last added input leads to a further reduction of the optimal misfit value. At that point, the maximum relative expected improvement drops below the threshold value and the algorithm terminates. The final design in Figure 4(a) and the Latin hypercube design in Figure 4(b), which we use for comparison below, contain the same number of training inputs.