Convergence of Gaussian Process Regression with Estimated Hyper-parameters and Applications in Bayesian Inverse Problems

08/31/2019, by Aretha L. Teckentrup et al.

This work is concerned with the convergence of Gaussian process regression. A particular focus is on hierarchical Gaussian process regression, where hyper-parameters appearing in the mean and covariance structure of the Gaussian process emulator are a-priori unknown, and are learnt from the data, along with the posterior mean and covariance. We work in the framework of empirical Bayes, where a point estimate of the hyper-parameters is computed, using the data, and then used within the standard Gaussian process prior to posterior update. We provide a convergence analysis that (i) holds for any continuous function f to be emulated; and (ii) shows that convergence of Gaussian process regression is unaffected by the additional learning of hyper-parameters from data, and is guaranteed in a wide range of scenarios. As the primary motivation for the work is the use of Gaussian process regression to approximate the data likelihood in Bayesian inverse problems, we provide a bound on the error introduced in the Bayesian posterior distribution in this context.


1 Introduction

Mathematical modelling and simulation are indispensable tools frequently used to inform decisions and assess risk. In practice, the parameters appearing in the models are often unknown, and have to be inferred from indirect observations. This leads to an inverse problem, where one infers the parameters of the model given incomplete, noisy observations of the model outputs. Adopting a Bayesian approach [16, 38], we incorporate our prior knowledge of the parameters into a probability distribution, referred to as the prior distribution, and obtain a more accurate representation of the parameters in the posterior distribution, which results from conditioning the prior distribution on the observations.

The goal of simulations is typically to (i) sample from the posterior distribution, using methods such as Markov chain Monte Carlo (MCMC), and/or (ii) compute a point estimate of the parameters, such as the most likely value under the posterior distribution (known as the maximum a-posteriori (MAP) estimate). Both of these tasks quickly become computationally infeasible when the mathematical model involved is complex. In many applications, for example when the forward model is given by a partial differential equation, computing one instance of the forward model is computationally very expensive, and the sheer number of model evaluations required for the sampling and/or optimisation is prohibitively large.

This drawback of fully Bayesian inference for complex models was recognised several decades ago in the statistics literature, and resulted in key papers which had a profound influence on methodology [32, 17, 26]. These papers advocated the use of a Gaussian process surrogate model (also called emulator) to approximate the solution of the governing equations, and in particular the data likelihood, at a much lower computational cost.

The focus of this work is on the convergence analysis of Gaussian process surrogate models, in the case where the hyper-parameters in the distribution of the Gaussian process are a-priori unknown and inferred as part of the construction of the surrogate. This situation is of significant importance and interest, for, amongst others, the following reasons. Firstly, by correctly tuning the hyper-parameters, we will obtain a Gaussian process surrogate model that mimics closely the behaviour of the function we are approximating, resulting in a smaller error in the approximation. Secondly, the variance of the Gaussian process surrogate model is often used to represent the error in the approximation. However, for this interpretation to make sense, the hyper-parameters have to be chosen correctly. For example, the variance of the Gaussian process surrogate model can artificially be driven to zero by letting the marginal variance of the covariance kernel go to zero, but the error does not vanish in reality.

We adopt an empirical Bayes approach, also known as a plug-in approach, where we compute an estimate of the hyper-parameters, and plug this into the predictive equations for a Gaussian process surrogate model with known hyper-parameters. We present a convergence analysis of these hierarchical Gaussian process surrogate models, which shows that convergence of the mean and variance of the Gaussian process emulator is guaranteed under very mild assumptions on the estimated hyper-parameters. In particular, the convergence rates of the hierarchical Gaussian process emulator are the same as the convergence rates obtained for Gaussian process emulators with fixed, known values of the hyper-parameters, if the estimated hyper-parameters converge to the known values.

As particular examples of covariance kernels used to construct the emulators, we consider the family of Matèrn kernels and the family of separable (or multiplicative/tensor-product) Matèrn kernels. As we will see in section 4, the type of covariance kernel one should employ depends on the structure and smoothness of the function being emulated. The use of Matèrn kernels corresponds to assuming a certain Sobolev smoothness, whereas the use of separable Matèrn kernels assumes a tensor-product Sobolev structure. For more details, see section 3.

The question of how the estimation of hyper-parameters influences the error in Gaussian process emulators is not new, and has been dealt with in the spatial statistics literature [35, 36, 28, 10, 33, 40, 6]. A summary of relevant results is given in section 2. As evident from this summary, these results are of a different nature to our new results presented in section 3, and to the type of error bounds needed to justify the use of Gaussian process emulators in Bayesian inverse problems [37]. In particular, our results (i) are non-asymptotic results, (ii) give bounds for a fixed, deterministic function being emulated, rather than averaging over a certain distribution of functions, and (iii) do not require any notion of equivalence or contiguity of the Gaussian process distributions with the true and the estimated hyper-parameters.

A further distinction to previous studies is that we do not require a notion of "true" values of the hyper-parameters. The customary (and often necessary) definition in spatial statistics (cf [35, 36, 28, 10]) is to choose the true parameter values such that the function being emulated is a sample of the corresponding Gaussian process. In our analysis, we do not require any such assumption on the function being emulated. True parameter values in our context would simply represent a good choice of hyper-parameters, and can be defined in any way that the user finds suitable (including the customary definition above). Likewise, the estimated hyper-parameters can be defined in many suitable ways, e.g. through maximum likelihood or maximum a-posteriori estimation (cf [14]) or cross-validation (cf [41]). Our results are independent of how the hyper-parameters are estimated, and also do not rely on the sequence of estimated hyper-parameters being convergent.

1.1 Our Contributions

In this paper, we make the following contributions to the analysis of Gaussian process regression:

  1. We provide a convergence analysis of Gaussian process regression with estimated hyper-parameters, which shows convergence of the emulators to the true function as the number of design points tends to infinity.

  2. We justify the use of hierarchical Gaussian process emulators to approximate the forward model in Bayesian inverse problems, by bounding the error introduced in the posterior distribution. Previous results, well known in the spatial statistics literature and summarised in section 2, are not sufficient for this purpose.

1.2 Paper Structure

The paper is organised as follows. Section 2 introduces hierarchical Gaussian process regression, and summarises relevant results from the spatial statistics literature. Section 3 analyses the error in hierarchical Gaussian process regression in a wide range of scenarios. We set up the Bayesian inverse problem of interest in section 4, and section 5 then considers the use of hierarchical Gaussian process emulators to approximate the posterior distribution in the Bayesian inverse problem. Section 6 provides a summary and discussion of the main results.

2 Hierarchical Gaussian Process Regression

We want to use Gaussian process regression (also known as Gaussian process emulation or kriging) to derive a computationally cheaper approximation to a given function , where . Our focus here is on the case where the hyper-parameters defining the Gaussian process emulator are unknown a-priori, and are inferred as part of the construction of the emulator. We will denote these hyper-parameters by , and treat them using an empirical Bayes approach.

2.1 Set-up

Let now be an arbitrary function. To derive the Gaussian process emulator of , we again use a Bayesian procedure and assign a Gaussian process prior distribution to :

(2.1)

To avoid confusion between the true function and its prior distribution, we have added the subscript zero in the above prior. Here, are now hyper-parameters defining the mean function and the two-point covariance function , assumed to be positive-definite for all , for any compact subset . Particular examples of covariance kernels are the Matèrn and separable Matèrn families discussed in sections 2.2 and 2.3. For the mean function , we can for example use polynomials, in which case the hyper-parameters are typically the unknown polynomial coefficients. We will write when we want to explicitly distinguish between the hyper-parameters appearing in the mean and covariance function, respectively.

We further put a prior distribution on , with Lebesgue density . The joint prior distribution on is then given by

Then, given data in the form of a set of distinct design points , together with corresponding function values

we condition the prior distribution on the observed data to obtain the posterior distribution

The distribution is again a Gaussian process, with explicitly known mean function and covariance kernel :

(2.2)
(2.3)

where , is the matrix with entry equal to and . These are the well-known formulae for Gaussian process emulation [29], here adopting notation that will enable us to make use of the analysis of such emulators in [37]. When we wish to make explicit the dependence on the prior mean , we will denote the predictive mean in (2.2) by .
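For reference, the predictive equations (2.2) and (2.3) take the following well-known form; the notation used here is generic and illustrative, and need not coincide with the paper's own symbols. Writing X_n = {x_1, ..., x_n} for the design points, y for the vector of observed function values, m(·; θ) for the prior mean and k(·, ·; θ) for the covariance kernel,

\[
m_n(x; \theta) = m(x; \theta) + k(x, X_n; \theta)\, K(\theta)^{-1} \bigl( y - m(X_n; \theta) \bigr),
\]
\[
k_n(x, x'; \theta) = k(x, x'; \theta) - k(x, X_n; \theta)\, K(\theta)^{-1}\, k(X_n, x'; \theta),
\]

where K(θ) denotes the n-by-n matrix with (i, j) entry k(x_i, x_j; θ), and k(x, X_n; θ) denotes the row vector with j-th entry k(x, x_j; θ).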

The marginal distribution

is typically not available in closed form, since the integrals involved are intractable. In practice one therefore often uses a plug-in approach, also known as empirical Bayes. This consists of calculating an estimate of using the data , and then approximating

This corresponds to approximating the distribution by a Dirac measure at . For the remainder of this work, we will use

(2.4)

as a Gaussian process emulator of . The process in (2.4) is also referred to as the predictive process, and we shall refer to and as the predictive mean and the predictive variance, respectively.
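For illustration, the following is a minimal sketch (in Python) of the plug-in procedure just described, assuming a zero prior mean, a Matèrn covariance with fixed smoothness, and maximum likelihood estimation of the remaining hyper-parameters. It uses scikit-learn's Matern kernel, whose scaling convention need not coincide with (2.5), and all function and variable names are illustrative rather than the paper's notation.

# Hedged sketch of the empirical Bayes / plug-in emulator (2.4): estimate the
# hyper-parameters by maximising the log marginal likelihood, then insert the
# estimate into the predictive equations (2.2)-(2.3). Zero prior mean assumed.
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process.kernels import Matern

JITTER = 1e-10  # small diagonal term for numerical stability

def neg_log_marginal_likelihood(log_theta, X, y, nu):
    sigma2, lam = np.exp(log_theta)          # marginal variance, correlation length
    K = sigma2 * Matern(length_scale=lam, nu=nu)(X) + JITTER * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))

def plug_in_emulator(X, y, X_new, nu=1.5):
    """Return the predictive mean and variance at X_new with estimated (sigma2, lam)."""
    res = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(X, y, nu))
    sigma2, lam = np.exp(res.x)              # the plug-in estimate
    K = sigma2 * Matern(length_scale=lam, nu=nu)(X) + JITTER * np.eye(len(X))
    k_new = sigma2 * Matern(length_scale=lam, nu=nu)(X_new, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_new @ alpha                                  # predictive mean (2.2)
    v = np.linalg.solve(L, k_new.T)
    var = sigma2 - np.sum(v ** 2, axis=0)                 # predictive variance (2.3)
    return mean, var

# Example: emulate f(x) = sin(2*pi*x) from n = 10 design points in [0, 1].
X = np.linspace(0, 1, 10).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
X_new = np.linspace(0, 1, 200).reshape(-1, 1)
mean, var = plug_in_emulator(X, y, X_new)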

The remainder of this section is devoted to a literature review on issues related to computing a good estimate of the hyper-parameters from the data , and analysing the effect that approximating has on the outcome of the regression. For this purpose, the notion of identifiability plays an important role.

Definition 2.1.

[18] Two probability measures and , defined on the same probability space , are called equivalent if they are mutually absolutely continuous, i.e. if for any event we have if and only if . The measures and are called orthogonal if there exists an event such that and .

It can be shown that two Gaussian measures are either equivalent or orthogonal [13, 15].

Definition 2.2.

[44] A random field model on is called identifiable if different values of give rise to orthogonal Gaussian measures.

Following [44], the random field model is identifiable if it is theoretically possible to learn the true value of after obtaining an infinite number of observations of on . A relevant result in this context is the Cameron-Martin Theorem.

Proposition 2.3.

(Cameron-Martin Theorem, [5]) Let , for , be two probability measures such that under , is a Gaussian process on with mean and covariance function . Then and are equivalent if and only if is an element of the reproducing kernel Hilbert space associated to .

In particular, Proposition 2.3 implies that models with polynomial mean functions , where the parameters represent the coefficients or the degree of the polynomial, are in most cases not identifiable, since polynomials are typically contained in the reproducing kernel Hilbert space associated to . This is, in particular, the case for the Matèrn and separable Matèrn kernels presented below.

2.2 Matèrn Covariance Kernels

Covariance functions frequently used in applications are the Matèrn covariance functions

(2.5)

with hyper-parameters . Here, denotes the Gamma function, and denotes the modified Bessel function of the second kind [22]. The parameter is usually referred to as the (marginal) variance, as the correlation length and as the smoothness parameter. The expression for the Matèrn covariance kernel simplifies for particular choices of . Notable examples include the exponential covariance kernel with , and the Gaussian covariance kernel in the limit .
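For concreteness, a short sketch of a Matèrn covariance of the form (2.5) is given below, using the modified Bessel function from SciPy. The scaling of the distance by sqrt(2*nu)/lambda is one common convention and is an assumption here; the paper's exact parameterisation may differ, and the names sigma2, lam and nu are illustrative.

# Hedged sketch of a Matern covariance with hyper-parameters
# (sigma2, lam, nu) = (marginal variance, correlation length, smoothness).
import numpy as np
from scipy.special import gamma, kv

def matern(r, sigma2=1.0, lam=1.0, nu=0.5):
    """Matern covariance evaluated at the distance r = |x - x'| (vectorised in r)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    out = np.full_like(r, sigma2)                       # limiting value at r = 0
    pos = r > 0
    s = np.sqrt(2.0 * nu) * r[pos] / lam                # one common scaling convention
    out[pos] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return out

# nu = 1/2 recovers the exponential kernel sigma2 * exp(-r / lam):
r = np.linspace(0.0, 2.0, 5)
print(np.allclose(matern(r, nu=0.5), np.exp(-r)))       # True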

We have the following results on the identifiability of the Matèrn model [45, 44, 2].

Proposition 2.4.

Let , for , be two probability measures such that under , is a Gaussian process on with mean 0 and Matèrn covariance function with variance , correlation length and smoothness parameter .

  • If , then and are orthogonal.

  • If and , then is equivalent to if and only if .

  • If and , then and are orthogonal if or .

Proof.

Part was proven in [44], and part in [2]. Part follows from the fact , for all and , for all . ∎

It follows from Proposition 2.4 that the hyper-parameters are not identifiable in the setting of , and in general, maximum likelihood (ML) and maximum a-posteriori (MAP) estimators of the hyper-parameters do not contract around the true parameter values as [44]. Only the quantity is identifiable in this setting. To alleviate this problem, the recent paper [14] discusses good choices for the prior distribution on the hyper-parameters , such that the MAP estimate gives a good estimate of the true value of .

We would also like to point out here that the identifiability issues in Proposition 2.4 are related to the fact that our parameter space is bounded, which means that we are dealing with what is usually referred to as in-fill asymptotics. If were unbounded, we would be dealing with increasing domain asymptotics, where all parameters are identifiable also in the setting of [45].

2.3 Separable Matèrn Covariance Kernels

As an alternative to the classical Matèrn covariance functions in the previous section, one can consider using their separable versions (also called multiplicative or tensor-product versions). These are obtained by taking the product of one-dimensional Matèrn covariance functions:

(2.6)

Since the marginal variances only enter as multiplicative pre-factors, the hyper-parameters in this case are and , leading to . A particular example is the separable exponential covariance kernel, which corresponds to and hence takes the form

(2.7)
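A corresponding sketch of the separable construction (2.6) is given below: the kernel is a product of one-dimensional Matèrn correlations, one per coordinate, each with its own correlation length and smoothness, multiplied by a single marginal variance. The scaling convention and all names are again illustrative assumptions rather than the paper's notation.

# Hedged sketch of a separable (tensor-product) Matern kernel: a single marginal
# variance sigma2 times a product of one-dimensional Matern correlations.
import numpy as np
from scipy.special import gamma, kv

def matern_1d(r, lam, nu):
    """One-dimensional Matern correlation at distance r >= 0."""
    if r == 0.0:
        return 1.0
    s = np.sqrt(2.0 * nu) * r / lam
    return (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)

def separable_matern(x, y, sigma2, lams, nus):
    """Product over coordinates j of matern_1d(|x_j - y_j|; lams[j], nus[j])."""
    value = sigma2
    for xj, yj, lam_j, nu_j in zip(np.atleast_1d(x), np.atleast_1d(y), lams, nus):
        value *= matern_1d(abs(xj - yj), lam_j, nu_j)
    return value

# With all nu_j = 1/2 this reduces to the separable exponential kernel (2.7):
x, y, lams = np.array([0.2, 0.7]), np.array([0.5, 0.1]), np.array([0.3, 0.8])
print(np.isclose(separable_matern(x, y, 1.0, lams, [0.5, 0.5]),
                 np.exp(-np.sum(np.abs(x - y) / lams))))        # True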

The separable versions of Matèrn kernels can have better properties than the classical Matèrn kernels in terms of identifiability, as illustrated by the following result.

Proposition 2.5.

[43, Theorem 1] Provided , the Gaussian process model on is identifiable.

The case of general separable Matèrn covariance kernels appears to be open, but related results in this direction can be found in [20, 21, 9]. Note that for , the classical and separable Matèrn kernels coincide and Proposition 2.4 applies. In particular, for the stationary Ornstein-Uhlenbeck process given by and , only the quantity is identifiable on the bounded domain .

2.4 Optimal Prediction Error

In this section, we review some results on how the estimate influences the prediction error of . To this end, we assume that there exists a true parameter value such that , and we view the predictive mean as an estimator of . The prediction error of is defined as its mean square error

where the expectation is taken with respect to the distribution .

Classically, the question of how the misspecification of influences the prediction error has been addressed in the setting where is approximated by , with fixed and independent of (see e.g. [35, 36]).

Proposition 2.6.

[35, Theorem 1] Suppose as . Let be a point in , and suppose the design points form a sequence in not containing but having as a limit point (possibly along a sub-sequence). Denote by the measure corresponding to , and by the measure corresponding to . If the measures and are equivalent, then

The dependency of the estimate on the evaluations complicates the analysis. For example, the predictive mean is no longer a linear function of . The following result was proven in [28].[1]

[1] We are here considering a zero mean in the Gaussian process prior (2.1). The result in [28] is proved for universal kriging, where one assumes an unknown constant mean in (2.1). The predictive equations (2.2)-(2.3) are different in the two settings. However, since the proof of [28, Theorem 2.1] only relies on the best unbiased predictor property of the Gaussian process emulator, as well as properties of Gaussian random variables, the same proof technique can be applied in the zero mean (or simple kriging) set-up considered here.

Definition 2.7.

Sequences of probability measures and , defined on the same sequence of measurable spaces , are mutually contiguous if for any sequence , we have as if and only if as .

Proposition 2.8.

[28, Theorem 2.1] Suppose as . Let be a point in , and suppose the design points form a sequence in not containing but having as a limit point (possibly along a sub-sequence). For , denote by the distribution of under , and by the distribution of under . If the sequences and are mutually contiguous, then

Verifiable conditions for checking the mutual contiguity of the measures and in Proposition 2.8 are discussed in [28], although the examples for which these conditions have been verified remain limited. To the authors' knowledge, the question of mutual contiguity remains open for general parametric families of covariance functions such as the Matèrn family.

3 Error Analysis of Hierarchical Gaussian Process Regression

In this section, we are concerned with the convergence of the hierarchical Gaussian process emulator to the function . Although the main idea behind the error estimates in this section is related to section 2.4, we are here interested in error bounds which (i) are non-asymptotic, (ii) bound the error for a fixed function , rather than averaging over a distribution on , and (iii) are flexible with respect to the definition of the estimated and the true hyper-parameters, so do not require any notion of equivalence or contiguity of the corresponding distributions. Furthermore, the error analysis here will be performed in norms amenable to the use of the hierarchical Gaussian process emulators as surrogate models in Bayesian inverse problems, see section 5 for details.

Since the error analysis depends on various properties of the covariance kernel, such as the corresponding reproducing kernel Hilbert space (also known as the native space or Cameron-Martin space), we will consider two particular examples, namely the classical and the separable Matèrn covariance kernels already considered in sections 2.2 and 2.3.

The definition of the estimated parameter values is open, and our analysis does not require any assumptions on how these estimates are computed. We do not require that the sequence converges, nor do we require the parameters to be identifiable. We could for example use maximum likelihood or maximum a-posteriori estimators, choose to minimise the error , or use a combination of different approaches for different hyper-parameters. Note, however, that we do not want to minimise the predictive variance , since this can be made arbitrarily small by letting . We want to choose such that is a good representation of our remaining uncertainty about the function , after observing the .
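As one purely illustrative example of such a choice, the following sketch selects the correlation length by leave-one-out cross-validation over a grid, with the smoothness held fixed; the grid, the loss and the helper names are assumptions rather than a prescription of this paper.

# Hedged sketch: choose the correlation length by leave-one-out cross-validation.
# Uses the standard kriging identity that the i-th LOO residual of the predictive
# mean equals (K^{-1} y)_i / (K^{-1})_{ii}; the smoothness nu is held fixed.
import numpy as np
from sklearn.gaussian_process.kernels import Matern

def loo_error(lam, X, y, nu=1.5):
    K = Matern(length_scale=lam, nu=nu)(X) + 1e-10 * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    residuals = (K_inv @ y) / np.diag(K_inv)
    return np.sum(residuals ** 2)

def estimate_correlation_length(X, y, grid=np.logspace(-2, 1, 30), nu=1.5):
    """Grid search for the correlation length minimising the LOO error."""
    return min(grid, key=lambda lam: loo_error(lam, X, y, nu))

Note that the marginal variance cancels in these leave-one-out residuals, so it is not identified by this criterion and must be chosen in another way, for example by maximum likelihood.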

We would like to quantify the performance of the estimated mean and covariance functions and . In particular, in light of the error bounds required for the Bayesian posterior distribution in section 5, we are interested in the quantities and . We recall the following fundamental results, which hold for any kernel .

Proposition 3.1.

[42] The function , given by (2.2) with , is the minimal norm interpolant of in the native space corresponding to , with interpolation points :

In particular, .

Proposition 3.2.

[33, 37] Suppose is given by (2.3). Then

where is given by (2.2) with .

Given the set of design points , we define the fill distance , separation radius and mesh ratio by

(3.1)

The fill distance (also known as the maximin distance) is the maximum distance any point in can be from , and the separation radius is half the smallest distance between any two distinct points in . The three quantities above provide measures of how uniformly the design points are distributed in .

The fill distance and the separation radius are decreasing functions of , and these quantities will tend to zero as tends to infinity for space-filling designs. The best possible rate of convergence for any choice of and is (see e.g. [30]). The mesh ratio , on the other hand, is a non-decreasing function of . Point sets for which can be bounded uniformly in , i.e. sets for which the fill distance and the separation radius decrease at the same rate with , are called quasi-uniform. In general, however, the mesh ratio can be strictly increasing in .
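For concreteness, the following sketch approximates the three quantities in (3.1) for a random design in the unit square; the supremum defining the fill distance is approximated by a maximum over a dense cloud of candidate points, and all names are illustrative.

# Hedged sketch: approximate the fill distance, separation radius and mesh ratio
# of a design in [0,1]^2. The fill distance (a supremum over the whole domain)
# is approximated by a maximum over a dense set of candidate points.
import numpy as np
from scipy.spatial.distance import cdist

def fill_distance(design, candidates):
    """Largest distance from any candidate point to its nearest design point."""
    return cdist(candidates, design).min(axis=1).max()

def separation_radius(design):
    """Half the smallest distance between two distinct design points."""
    D = cdist(design, design)
    np.fill_diagonal(D, np.inf)
    return 0.5 * D.min()

rng = np.random.default_rng(0)
design = rng.uniform(size=(50, 2))           # n = 50 design points
candidates = rng.uniform(size=(20000, 2))    # dense cloud standing in for the domain
h = fill_distance(design, candidates)
q = separation_radius(design)
print(h, q, h / q)                           # fill distance, separation radius, mesh ratio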

3.1 Matèrn Covariance Kernels

Suppose we use a Matèrn covariance kernel , defined in (2.5), to construct the hierarchical Gaussian process emulator , defined in (2.4).

3.1.1 Predictive Mean

We first consider the predictive mean . The main result is given in Theorem 3.4. To prove explicit error bounds, recall the following characterisation of the native space (also known as reproducing kernel Hilbert space) of the Matèrn kernel.

Proposition 3.3.

[42] Let , with , for some compact set . Then the native space is equal to the Sobolev space as a vector space, and the native space norm and the Sobolev norm are equivalent.

There hence exist constants and such that for all , there holds

We then have the following result on the convergence of to as . In particular, it shows that we obtain convergence in a wide range of scenarios, under very mild assumptions on the estimated hyper-parameters. If the estimated hyper-parameters converge, we obtain the same convergence rate as in the case where all the hyper-parameters are fixed at the limiting value, cf [37, Proposition 3.4]. Note that Theorem 3.4 trivially also applies to the special case , where a fixed value of the hyper-parameter is used.

Theorem 3.4.

(Convergence in of ) Suppose we have a sequence of estimates , for some compact set . Assume

  (a) is a bounded Lipschitz domain that satisfies an interior cone condition,

  (b) the native space is isomorphic to the Sobolev space ,

  (c) , for some , with , and ,

  (d) for all ,

  (e) for some , the quantities and satisfy , with , and .

Then there exists a constant , which is independent of and , such that

for any and .

Proof.

First, we note that it follows from (2.2) that . By [23, Theorems 4.1 and 4.2], we have, for fixed ,

for some constant independent of , and . An inspection of the proof shows that the constant is such that . Using Proposition 3.1, an appropriate choice for is hence , which is a locally bounded function of . The claim then follows by the compactness of . ∎

Assumption (a) in Theorem 3.4 is an assumption on the domain being sufficiently regular, containing no sharp corners or cusps, and is satisfied, for example, for the unit cube . Assumption (b) reiterates that the native space of Matèrn kernels is a Sobolev space. Theorem 3.4 applies in fact not just to Matèrn kernels, but to any kernel which has a Sobolev space as native space, including the compactly supported Wendland functions [42].

Assumption (c) is an assumption on the regularity of the function being emulated. We point out here that this assumption is rather mild, and modulo some technicalities (cf Remark 3.6), this assumption simply means that should be at least continuous. We also point out that Theorem 3.4 does not require the function to be in the native space of any of the kernels . The smoothness of , denoted by , can be either greater or smaller than the estimated smoothness . The best possible convergence rates are obtained when the estimated smoothness matches the true smoothness of (cf Remark 3.5). Recall that is decreasing in , whereas is either constant or increasing in . If we are underestimating the smoothness of , then , and we do not achieve the best possible exponent in . If we are overestimating the smoothness of , then and we obtain a positive power of .

Assumption (d) ensures that the chosen mean has at least the same regularity as . This can be relaxed, but less regularity in would lead to lower convergence rates in the error, so in practice, one should ensure that is sufficiently smooth.

The quantities and in assumption (e) can be thought of as and , respectively. If exists, then this can be substituted for both quantities. Assumption (e) is the only assumption we make on the estimated hyper-parameters, other than that , for some compact set . In particular, this means that the only assumptions required on and are that they are bounded away from zero and infinity. For the estimated smoothness , we again essentially require ; however, due to some technical issues in the proof (cf Remark 3.6), we require a slightly larger lower bound on .

Remark 3.5.

(Choice of ) Under the assumptions of Theorem 3.4, we have

for any . The best possible convergence rate in of this bound is obtained when , i.e. when is in the reproducing kernel Hilbert space corresponding to : . This is different to defining such that is a sample of the Gaussian process , since samples of a Gaussian process are almost surely not in the corresponding reproducing kernel Hilbert space. This point has already been noted in [33].

Remark 3.6.

(Valid choice of ) The Sobolev space is a reproducing kernel Hilbert space for any . The restriction on in assumption (c) of Theorem 3.4 is hence slightly stronger than expected, requiring that the integer part of is greater than . This is due to a technical detail in the proofs of [23, Theorems 4.1 and 4.2], and may be fixable. The same comment applies to assumption (e) in Theorem 3.4, and assumptions (d) and (f) in Theorem 3.9.

3.1.2 Predictive Variance

Next, we investigate the predictive variance . An application of Proposition 3.2 gives the following result on the convergence of to , which in particular shows that we obtain the same rate of convergence as when the hyper-parameters are fixed, cf [37].

Theorem 3.7.

(Convergence in of ) Let the assumptions of Theorem 3.4 hold. Then there exists a constant , independent of , such that

for any and .

Proof.

An application of Proposition 3.2 gives

The Sobolev embedding theorem (see e.g. [1]), together with Theorem 3.4, then gives

Finally, using Proposition 3.3 and the compactness of proves the claim. ∎

3.2 Separable Matèrn Covariance Kernels

Rather than the Matèrn kernels employed in the previous section, suppose now that we use a separable Matèrn covariance kernel , as defined in (2.6), to define the Gaussian process emulator (2.4). Due to the tensor product structure of the kernel , we will assume that our parameter domain also has a tensor product structure , with bounded.

3.2.1 Predictive Mean

We again start with the predictive mean . The main result in this section is Theorem 3.9. We have the following analogue of Proposition 3.3, characterising the native space of separable Matèrn covariance kernels on the tensor-product domain .

Proposition 3.8.

[24, 31] Suppose , with bounded. Let be a separable Matèrn covariance kernel with , for some compact set . Then the native space is equal to the tensor product Sobolev space as a vector space, and the native space norm and the Sobolev norm are equivalent.

There hence exist constants and such that for all , there holds

We will write if for all .

For our further analysis, we now want to make use of the convergence results from [24]; related results can also be found in [31] and the references therein. For the design points , we will use Smolyak sparse grids [4]. For , we choose a sequence , , of nested sets of points in . We then define the sparse grid as the set of points

(3.2)

where for a multi-index