# Semivariogram Hyper-Parameter Estimation for Whittle-Matérn Priors in Bayesian Inverse Problems

We present a detailed mathematical description of the connection between Gaussian processes with covariance operators defined by the Matérn covariance function and Gaussian processes with precision (inverse-covariance) operators defined by the Green's functions of a class of elliptic stochastic partial differential equations (SPDEs). We will show that there is an equivalence between these two Gaussian processes when the domain is infinite -- for us, R or R^2 -- which breaks down when the domain is finite due to the effect of boundary conditions on Green's functions of PDEs. We show how this connection can be re-established using extended domains. We then introduce the semivariogram method for obtaining point estimates of the Matérn covariance hyper-parameters, which specifies the Gaussian prior needed for stabilizing the inverse problem. We implement the method on one- and two-dimensional image deblurring test cases to show that it works on practical examples. Finally, we define a Bayesian hierarchical model, assuming hyper-priors on the precision and Matérn hyper-parameters, and then sample from the resulting posterior density function using Markov chain Monte Carlo (MCMC), which yields distributional approximations for the hyper-parameters.


## 1 Introduction

Inverse problems are ubiquitous in science and engineering. They are characterized by 1) the estimation of parameters in a mathematical model from measurements of model output, and 2) a high-dimensional parameter space, typically resulting from the discretization of a function defined on a computational domain. For typical inverse problems, the problem of estimating model parameters from measurements is ill-posed, which motivates the use of regularization in the deterministic setting and the choice of a prior probability density in the Bayesian setting. In this paper, we consider linear models of the form

 b = Ax + ϵ,  ϵ ∼ N(0, λ^{−1} I_M), (1)

where b ∈ R^M is the vector of measurements, A ∈ R^{M×N} is the forward model matrix, x ∈ R^N is the vector of unknown parameters, and ϵ ∼ N(0, λ^{−1} I_M) means that ϵ is an M-dimensional Gaussian random vector with mean 0 and covariance matrix λ^{−1} I_M, with I_M denoting the M×M identity. The random vector ϵ in (1) determines the likelihood function

 p(b|x,λ) ∝ λ^{M/2} exp(−(λ/2)∥Ax − b∥²), (2)

where ∝ denotes proportionality, and we include λ^{M/2} from the normalization constant because it will be needed in the hierarchical model described below. The maximizer of p(b|x,λ) with respect to x is known as the maximum likelihood estimator, and we denote it by x_ML. As stated above, due to ill-posedness, x_ML is unstable with respect to errors in b, i.e., small changes in b result in large relative changes in x_ML.

There are various methods to stabilize the solution of inverse problems, but they all involve some form of regularization. In this paper, we take the Bayesian approach [1], which requires the definition of a prior probability density function on x. We make the assumption that the prior is Gaussian of the form x|δ ∼ N(0, (δP)^{−1}), which has probability density function

 p(x|δ) ∝ det(δP)^{1/2} exp(−(δ/2) x^T P x), (3)

where P is the precision (inverse-covariance) matrix, and we include det(δP)^{1/2} from the normalization constant for the hierarchical model described below.

Now that we have defined the prior (3) and the likelihood (2), using Bayes’ law, we multiply them together to obtain the posterior density function

 p(x|b,λ,δ) ∝ p(b|x,λ) p(x|δ) ∝ λ^{M/2} det(δP)^{1/2} exp(−(λ/2)∥Ax − b∥² − (δ/2) x^T P x), (4)

whose maximizer, x_{λ,δ}, is known as the maximum a posteriori (MAP) estimator. The MAP estimator can be equivalently expressed

 x_{λ,δ} = argmin_x { (λ/2)∥Ax − b∥² + (δ/2) x^T P x } = (λA^T A + δP)^{−1} λA^T b.

### 1.1 The Matérn Class of Covariance Matrices and Whittle-Matérn Priors

It remains to define the prior covariance matrix C = P^{−1}. The Matérn class of covariance matrices has garnered much praise [2] for its flexibility in capturing many covariance structures and its allowance of direct control of the degree of correlation in the vector x [3]. The Matérn covariance matrix is defined by the Matérn covariance function, which was first formulated by Matérn in 1947 [4],

 C(r) = σ² (κr)^ν K_ν(κr) / (2^{ν−1} Γ(ν)), (5)

where r ≥ 0 is the separation distance, K_ν is the modified Bessel function of the second kind of order ν [5], Γ is the gamma function, κ > 0 is the scale parameter, ν > 0 is the smoothness parameter, and σ² is the marginal variance. Omitting σ² gives the Matérn correlation function. In the isotropic case, when the covariance depends only on the distance between elements, given the covariance parameters κ and ν, one can obtain the covariance matrix of a vector x with spatial positions u_1, …, u_N by letting

 [C]_{ij} = Cov(x_i, x_j) = C(∥u_i − u_j∥),

where C(·) is defined by (5). The precision matrix P is then obtained by inverting C.
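To make this construction concrete, here is a minimal sketch (our own illustrative grid, parameter values, and function name, not code from the paper) that evaluates (5) and assembles the dense covariance and precision matrices:

```python
import numpy as np
from scipy.special import kv, gamma

def matern_cov(r, sigma2=1.0, nu=1.5, kappa=10.0):
    """Matérn covariance function (5); C(0) = sigma2 (the marginal variance)."""
    r = np.asarray(r, dtype=float)
    c = np.full(r.shape, sigma2)          # C(0) = sigma2 by continuity
    nz = r > 0
    kr = kappa * r[nz]
    c[nz] = sigma2 * (kr ** nu) * kv(nu, kr) / (2 ** (nu - 1) * gamma(nu))
    return c

# Covariance matrix [C]_ij = C(||u_i - u_j||) for points on a 1-D grid
u = np.linspace(0.0, 1.0, 50)
C = matern_cov(np.abs(u[:, None] - u[None, :]))
P = np.linalg.inv(C)                      # dense precision matrix
```

A convenient sanity check: for ν = 1/2 the Matérn covariance reduces to the exponential covariance σ² exp(−κr).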

The parameters of the Matérn covariance function are not as straightforward to interpret as the parameters of some other covariance functions. When ν is small, the spatial process is said to be rough, and when it is large, the process is smooth [3, 6]. Figure 1 shows how the covariance function behaves with different values of ν and κ: on the left, κ is fixed and ν varies, while on the right, ν is fixed and κ varies. Note that as ν increases, the behavior at small lags changes, leading to more correlation at smaller distances and a larger practical range, which is defined to be the distance at which the correlation is equal to 0.05. Meanwhile, as κ increases, or ν decreases, the decay rate of the covariance increases considerably. It is typical to think of κ as an inverse range parameter, since as κ decreases (increases), the practical range increases (decreases). However, for the Matérn covariance, the parameter ν also affects the practical range. In [7], the range approximation √(8ν)/κ is used.

Despite the benefits of using the Matérn class of covariance matrices, its use is problematic for inverse problems because computing the precision matrix P = C^{−1}, which is what appears in the posterior (4), requires inverting a dense matrix. Fortunately, the Matérn covariance function has a direct connection to a class of elliptic SPDEs [7] whose numerical discretization yields sparse precision matrices, P, that are computationally feasible to work with even when N is large. Connections of this type were first shown to exist by Whittle in [8], where he showed the connection held for a special case of the Matérn covariance class. Hence, priors that depend on this connection are often referred to as Whittle-Matérn priors. The connection between the general Matérn covariance function and SPDEs has been used in a wide range of applications for defining computationally feasible priors for high-dimensional problems [9, 10, 11]. Moreover, work has been done in establishing convergence theorems for, and lattice approximations of, these Whittle-Matérn priors [12].

The remainder of the paper is organized as follows. In Section 2, we describe in detail the connection between zero-mean Gaussian processes with the Matérn covariance operator and those that arise as solutions of a class of elliptic SPDEs. In Section 3, we show how to estimate the hyper-parameters in the Whittle-Matérn prior using the semivariogram method, and then we show how to use this approach to define the prior when solving a Bayesian inverse problem. In Section 4, we instead assume hyper-priors on the hyper-parameters and then use MCMC to sample from the resulting posterior distribution, which yields a distribution over – rather than point estimates of – the hyper-parameters. For both approaches, we present numerical tests on one- and two-dimensional image deblurring test cases. We end with conclusions in Section 5.

## 2 Whittle-Matérn Class Priors via SPDEs

In this section, we will show that the Whittle-Matérn class of priors can be specified as the solution of the SPDE

 (κ² − Δ)^{β/2} x(u) = W(u),  u ∈ R^d,  β = ν + d/2,  κ, ν > 0, (6)

where Δ is the Laplacian operator in one or two dimensions (i.e., d = 1 or 2), and W(u) is spatial Gaussian white noise, which we define below. Although this connection has been shown to exist [8, 7, 10], here we provide a significantly more detailed derivation of this result than we have seen elsewhere.

### 2.1 Preliminary Definitions

Before deriving the solution of (6), we need some preliminary definitions.

#### 2.1.1 Gaussian Fields

A stochastic process x(u), with u ∈ D ⊆ R^d, is a Gaussian field [13] if for any n ≥ 1 and any locations u_1, …, u_n ∈ D, the vector (x(u_1), …, x(u_n))^T is a normally distributed random vector with mean (E[x(u_1)], …, E[x(u_n)])^T, where E denotes expected value, and covariance matrix [C]_{ij} = Cov(x(u_i), x(u_j)), for i, j = 1, …, n. The covariance function is defined c(u, u′) = Cov(x(u), x(u′)). It is necessary that the covariance function be positive definite, i.e., for any locations u_1, …, u_n, with n ≥ 1, the covariance matrix defined above is positive definite. The Gaussian field is called stationary if the mean is constant and the covariance function satisfies c(u, u′) = c(u − u′), and isotropic if c(u, u′) = c(∥u − u′∥).

#### 2.1.2 White Noise

The term white noise [14, 15] comes from light. White light is a homogeneous mix of wavelengths, as opposed to colored light, which is a heterogeneous mix of wavelengths. In a similar way, white noise contains a homogeneous mix of all the different basis functions. The mixing of these basis functions is determined by a random process. When this random process is Gaussian, we have Gaussian white noise. Consider a domain D ⊆ R^d and let {ϕ_j}_{j=1}^∞ be an orthonormal basis of L²(D). Then Gaussian white noise is defined by

 W(u) = Σ_{j=1}^∞ ξ_j ϕ_j(u),  ξ_j iid∼ N(0, η²).

If we are dealing with spatial Gaussian white noise, then u refers to location. With this definition, it is clear that Gaussian white noise has mean zero: E[W(u)] = Σ_{j=1}^∞ E[ξ_j] ϕ_j(u) = 0. Moreover, one can show that E[W(u)W(u′)] = η² δ(u − u′), where δ is the Dirac delta function [16]; in what follows we take η = 1. A well-known and very important property of the Dirac delta function is that it satisfies the sifting property: ∫_{R^d} δ(u − v) f(v) dv = f(u).
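As an illustration of this definition, the sketch below simulates truncated white-noise realizations using the sine basis of L²(0,1) with η = 1; the basis choice, truncation level, and grid are our own assumptions, not the paper's.

```python
import numpy as np

J, n = 200, 512
u = (np.arange(n) + 0.5) / n                          # midpoint grid on (0, 1)
j = np.arange(1, J + 1)
Phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, u))   # phi_j(u), orthonormal in L^2(0,1)

rng = np.random.default_rng(0)
xi = rng.standard_normal((2000, J))                   # xi_j iid N(0, 1), i.e., eta = 1
W = xi @ Phi                                          # truncated white-noise realizations

# The Gram matrix approximates the identity (orthonormality of the basis);
# as J grows, the empirical covariance of W approaches a scaled identity,
# the discrete analogue of the Dirac delta covariance.
G = Phi @ Phi.T / n
```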

#### 2.1.3 Green’s Functions

We now consider differential equations of the form Lx(u) = f(u), u ∈ R^d, where L is a linear differential operator. A Green's function, g(u,v), of L is any solution of Lg(u,v) = δ(v − u) [17], where δ is the Dirac delta function. The Green's function can be used to solve Lx = f; specifically, since

 Lx(u) = f(u) = ∫_{R^d} δ(u − v) f(v) dv = ∫_{R^d} L g(u,v) f(v) dv = L ∫_{R^d} g(u,v) f(v) dv,

where the last equality holds because L is linear and acts only upon the first argument, u, of g(u,v) [18]. The solution of Lx = f is therefore given by

 x(u) = ∫_{R^d} g(u,v) f(v) dv. (7)

### 2.2 The Gaussian Field Solution of the SPDE (6)

First, we derive the Green’s function for (6), which is the solution of

 (κ² − Δ)^{β/2} g(u,v) = δ(v − u). (8)

Using (7), the solution to (6) is given by

 x(u) = ∫_{R^d} g(u,v) W(v) dv, (9)

making x(u) a Gaussian field, since it is a linear transformation of Gaussian white noise. To derive the Green's function g(u,v) in (9), note that on R^d the operator (κ² − Δ)^{β/2} is translation invariant, so that g(u,v) = g(u − v); we therefore define g(u) := g(u, 0). Then (8) implies

 (κ² − Δ)^{β/2} g(u) = δ(u). (10)

Taking the Fourier transform of both sides of (10) yields [19, 20]

 F{(κ² − Δ)^{β/2} g(u)}(ω) = F{δ(u)}(ω) ⟹ (κ² + ∥ω∥²)^{β/2} ĝ(ω) = 1,

and thus

 ĝ(ω) = (κ² + ∥ω∥²)^{−β/2}, (11)

which is the Fourier transform of the Green’s function.

We can now compute the mean and covariance of the Gaussian field, x(u), defined by (9). Since the Green's function is a strictly positive, symmetric, and rapidly decaying function, we can apply Fubini's theorem [21] to obtain the mean of x(u):

 E[x(u)] = E[∫_{R^d} g(u,v) W(v) dv] = ∫_{R^d} g(u,v) E[W(v)] dv = 0.

Since x(u) has mean zero, the covariance is given by

 Cov(x(u), x(u′)) = E[x(u) x(u′)]
 = E[∫_{R^d} g(u,v) W(v) dv ∫_{R^d} g(u′,v′) W(v′) dv′]
 = ∫_{R^d} (∫_{R^d} E[W(v)W(v′)] g(u,v) dv) g(u′,v′) dv′
 = ∫_{R^d} (∫_{R^d} δ(v − v′) g(u,v) dv) g(u′,v′) dv′
 = ∫_{R^d} g(u,v′) g(u′,v′) dv′.

If we define C(u,u′) := Cov(x(u), x(u′)), the previous result implies that if L = (κ² − Δ)^{β/2}, then for our linear L acting only on u′,

 L C(u,u′) = L ∫_{R^d} g(u,v′) g(u′,v′) dv′ = ∫_{R^d} (L g(u′,v′)) g(u,v′) dv′ = ∫_{R^d} δ(u′ − v′) g(u,v′) dv′ = g(u,u′). (12)

Next, we assume stationarity so that the covariance only depends on the relative locations of the points, i.e., C(u,u′) = C(r) with r = u − u′. Then g(u,u′) = g(r), and (12) can be expressed (κ² − Δ)^{β/2} C(r) = g(r). If we take the Fourier transform of both sides of this equation, and appeal to (11), we obtain

 F{(κ² − Δ)^{β/2} C(r)}(ω) = F{g(r)}(ω) ⟹ (κ² + ∥ω∥²)^{β/2} Ĉ(ω) = ĝ(ω) ⟹ Ĉ(ω) = (κ² + ∥ω∥²)^{−β}.

Since the Laplacian, Δ, is invariant under rotations and translations, we have radial symmetry, which is analogous to isotropy in the covariance. Thus we can let s = ∥ω∥ and, abusing notation slightly, r = ∥r∥ to obtain the equivalent expression

 Ĉ(s) = (κ² + s²)^{−β}.

To transform back to the original (r) space, we use the Hankel transform and its relationship to the radially symmetric Fourier transform, i.e.,

 s^{(d−2)/2} Ĉ(s) = (2π)^{d/2} ∫₀^∞ J_{(d−2)/2}(sr) r^{(d−2)/2} C(r) r dr, (13)

where C is the original (untransformed) covariance function and J_{(d−2)/2} is the Bessel function of the first kind of order (d−2)/2; see [22, Section 2] for a proof. Note that if f(r) = r^{(d−2)/2} C(r) and F(s) = (2π)^{−d/2} s^{(d−2)/2} Ĉ(s), then (13) implies

 F(s) = ∫₀^∞ J_{(d−2)/2}(sr) f(r) r dr,

which is the Hankel transform of f [23]. Hence, using the inverse Hankel transform, we obtain

 f(r) = ∫₀^∞ J_{(d−2)/2}(sr) F(s) s ds. (14)

Plugging the expressions for f and F into (14) and solving for C(r) yields

 C(r) = (2π)^{−d/2} r^{−(d−2)/2} ∫₀^∞ J_{(d−2)/2}(sr) s^{(d−2)/2} (κ² + s²)^{−β} s ds = (2π)^{−d/2} r^{−(d−1)/2} ∫₀^∞ J_{(d−2)/2}(sr) (sr)^{1/2} s^{(d−1)/2} (κ² + s²)^{−β} ds.

Finally, using the integral identity [24, Eq. 20, p. 24, vol. II] and some algebra, we obtain

 C(r) = κ^{d/2−β} r^{β−d/2} K_{d/2−β}(κr) / ((2π)^{d/2} 2^{β−1} Γ(β)). (15)

Using the fact that K_{−ν}(·) = K_ν(·), and defining ν = β − d/2 with σ² = Γ(ν)/(Γ(β)(4π)^{d/2} κ^{2ν}), it can be shown that (15) is exactly the Matérn covariance function (5). Thus, we have proved the following theorem.

###### Theorem 1

The solution of (6) is a Gaussian field with mean zero and Matérn covariance function defined by (5).

### 2.3 The Effect of a Finite Domain and Boundary Conditions

The proof of Theorem 1 above assumed that the domain was all of R^d. However, when solving inverse problems, x is defined on a finite domain D ⊂ R^d. In such cases, boundary conditions must be assumed, and the equivalence between the Gaussian fields defined by the SPDE (6) and those defined by the Matérn covariance function no longer holds. To see this, consider the case where d = 1 and D = [0,1] with Dirichlet (zero) boundary conditions, x(0) = x(1) = 0. Additionally, we assume ν = 3/2, so that the exponent β/2 = (ν + d/2)/2 of the differential operator is equal to one, making the discretization straightforward. In this case, (6) simplifies to

 (κ² − Δ) x(u) = W(u),  u ∈ R,  κ > 0.

Using a uniform mesh on [0,1] with step size h = 1/n yields the numerical discretization

 (κ² I_n + (1/h²)L) x = δ^{−1/2} ξ,  ξ ∼ N(0, I_n),

where L is the standard discretization of the negative Laplacian on a uniform mesh [25], and δ is the scaling parameter for the prior. Then the probability density for x is given by

 x|δ,κ ∼ N(0, δ^{−1}(κ² I + (1/h²)L)^{−2}),

or equivalently,

 p(x|δ,κ) ∝ det((δ/h⁴)(h²κ² I + L)²)^{1/2} exp(−(δ/(2h⁴)) x^T (h²κ² I + L)² x). (16)

We generate 100,000 samples from (16) for each of several values of κ, calculate the empirical correlation between the samples, and compare this with the theoretical correlation defined by the Matérn covariance function. We do this for three values of κ, including κ = 5 and 10, and plot the results in Figure 2, together with the Matérn correlation function. Recall that κ is inversely related to the degree of correlation in the prior, i.e., larger κ corresponds to lower correlation. The plots on the top in Figure 2 suggest that the larger the correlation (and the smaller κ) is, the more impact the boundary conditions have, which makes intuitive sense.
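This kind of experiment can be sketched numerically as follows (an illustrative sketch, not the paper's implementation): we sample the discretized one-dimensional SPDE with Dirichlet boundary conditions and compare the empirical correlation about the midpoint with the Matérn correlation. Here κ is taken large so that boundary effects are negligible, and ν = 3/2 so that the operator exponent is one; the grid size and sample count are our own choices.

```python
import numpy as np
from scipy.special import kv
from math import gamma

n, kappa, nu = 200, 20.0, 1.5                      # nu = 3/2: operator exponent one in 1-D
h = 1.0 / n
Lap = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
A_op = kappa**2 * np.eye(n) + Lap / h**2           # discretized (kappa^2 - Laplacian), Dirichlet BCs

rng = np.random.default_rng(0)
X = np.linalg.solve(A_op, rng.standard_normal((n, 20000)))   # prior samples x = A^{-1} xi

i = n // 2                                         # midpoint of [0,1], far from the boundary
sd = X.std(axis=1)
corr_emp = (X * X[i]).mean(axis=1) / (sd * sd[i])  # empirical correlation with x_i

def matern_corr(r):
    """Matérn correlation: (kappa r)^nu K_nu(kappa r) / (2^(nu-1) Gamma(nu))."""
    kr = kappa * r
    return (kr ** nu) * kv(nu, kr) / (2 ** (nu - 1) * gamma(nu))
```

For ν = 3/2 the Matérn correlation has the closed form (1 + κr) exp(−κr), which makes the comparison easy to verify.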

Fortunately, we can restore the connection between the Gaussian fields defined by the SPDE and by the Matérn covariance function, which is desirable because then the hyper-parameters in the SPDE can be estimated using the semivariogram method described in Section 3. The restoration requires extending the computational domain: in one dimension, we replace [0,1] by the symmetric extension [1−a, a], with a ≥ 1; e.g., a = 2 gives the extended domain [−1, 2]. We then generate realizations for the κ values on the extended domain and compute the empirical correlation only for points in [0,1]. The results are plotted on the bottom in Figure 2, where it is clear that the empirical correlations are nearly indistinguishable from those obtained using the Matérn correlation function for each value of κ. We note that as κ decreases, the extension necessary to preserve the connection rises sharply: small values of κ require substantially larger a. However, a small κ implies that correlation is likely to persist across most of the region (depending on the ν value), which rarely occurs in practice, making a moderate κ a reasonable assumption.

To determine a value of a that extends the domain far enough to restore the Matérn/SPDE connection, but not so far as to introduce unnecessary computational cost, we look to the Matérn correlation function itself. We want to extend the domain far enough so that all values in [0,1] are nearly uncorrelated with the values at the end of the extended domain. In tests, it was found that we should always extend the domain at least slightly. If we let r_{0.10} be the distance at which the Matérn correlation is approximately equal to 0.10, then our tests showed that setting

 a = max{ r_{0.10}, ⌈n(1 + ⌈ν + 0.5⌉/κ)⌉/n },

where ⌈·⌉ denotes ceiling, restores the connection to the Matérn covariance on [0,1]. For the smaller values of κ considered above, correspondingly larger values of a are required, as shown in Figure 2.

The same results hold in two dimensions. Consider the case where d = 2 and D = [0,1]² with Dirichlet (zero) boundary conditions, x(u) = 0 for u ∈ ∂D, where ∂D denotes the boundary of D. In two dimensions, L in (16) is replaced by I ⊗ L + L ⊗ I, where ⊗ denotes Kronecker product, which corresponds to the standard discretization of the negative Laplacian for a uniform grid on [0,1]². We now assume ν = 1, so the exponent of the differential operator is again equal to one. Since we are assuming isotropy, we need only consider the distance between points when calculating the correlation. The distance, however, can extend as far as √2 in our domain since we are working in the unit square. Following the same procedure as in the one-dimensional case, we obtain the results shown in Figure 3. The plot on the left shows the disconnect between the true and empirical correlation when using the domain [0,1]², and the plot on the right shows the reconnection on the extended domain.

For the above discussion, we focused on zero boundary conditions. Similar results hold if periodic boundary conditions are assumed, in which case L can be diagonalized by the fast Fourier transform (FFT) [25], assuming that n is a power of 2. The FFT-based diagonalization can be exploited to greatly reduce computational cost; thus, when extending the domain in two dimensions, it is advantageous to use periodic boundary conditions and an extended domain chosen so that the discretized operator defined on it can be diagonalized by the FFT.

Finally, in our numerical experiments above, we chose specific values of ν, but other values of ν can be chosen. The general form of the prior density, with ν included as a hyper-parameter, is

 p(x|δ,ν,κ) ∝ det(δ_h (h²κ² I + L)^{ν+d/2})^{1/2} exp(−(δ_h/2) x^T (h²κ² I + L)^{ν+d/2} x), (17)

where δ_h = δ/h^{2(ν+d/2)}. If ν + d/2 is a non-integer, a fractional power of h²κ² I + L must be computed, which is possible, generally speaking, if we have a diagonalization of h²κ² I + L in hand, but the resulting precision matrix is not sparse. Such a diagonalization is typically computable in one-dimensional examples, even with dense matrices. In two dimensions, however, an efficient diagonalization is possible only if periodic boundary conditions are assumed. We will restrict the exponent ν + d/2 to be an integer in this paper to preserve the sparsity of the precision matrix.

### 2.4 Computing MAP Estimators for Whittle-Matérn Priors

Using Bayes’ law, we multiply the prior (17) by the likelihood (2) to obtain the posterior density function

 p(x|b,λ,δ,ν,κ) ∝ p(b|x,λ) p(x|δ,ν,κ) ∝ exp(−(λ/2)∥Ax − b∥² − (δ_h/2) x^T (h²κ² I + L)^{ν+d/2} x).

The maximizer of p(x|b,λ,δ,ν,κ) is known as the MAP estimator, and it can be computed by solving

 x_α = argmin_x { (1/2)∥Ax − b∥² + (α/2) x^T (h²κ² I + L)^{ν+d/2} x } = (A^T A + α(h²κ² I + L)^{ν+d/2})^{−1} A^T b, (18)

where α = δ_h/λ. Assuming we know ν and κ, α can be estimated using a regularization parameter selection method. We use generalized cross validation (GCV):

 α = argmin_{η>0} { ∥A(A^T A + η(h²κ² I + L)^{ν+d/2})^{−1} A^T b − b∥² / tr(I − A(A^T A + η(h²κ² I + L)^{ν+d/2})^{−1} A^T)² }. (19)

In practice, ν is often fixed [26, 9] and κ is estimated manually. This can be time consuming, subjective, and unintuitive, so we present a method for selecting these parameters next.
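The following sketch carries out (18) and (19) on a small synthetic one-dimensional deblurring problem; the blur matrix, noise level, κ, and exponent ν + d/2 = 2 are illustrative assumptions rather than the paper's test case.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n = 100
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
A = np.exp(-(t[:, None] - t[None, :])**2 / (2 * 0.03**2))
A /= A.sum(axis=1, keepdims=True)                 # row-normalized Gaussian blur matrix
x_true = np.sin(2 * np.pi * t)**2
rng = np.random.default_rng(0)
b = A @ x_true + 0.01 * rng.standard_normal(n)    # blurred, noisy data

Lap = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
kappa = 10.0
P = np.linalg.matrix_power(h**2 * kappa**2 * np.eye(n) + Lap, 2)  # exponent nu + d/2 = 2

def gcv(eta):
    """GCV objective (19): squared residual norm over squared residual trace."""
    A_eta = A @ np.linalg.solve(A.T @ A + eta * P, A.T)
    r = A_eta @ b - b
    return (r @ r) / np.trace(np.eye(n) - A_eta)**2

alpha = minimize_scalar(gcv, bounds=(1e-10, 1e2), method='bounded').x
x_map = np.linalg.solve(A.T @ A + alpha * P, A.T @ b)   # the MAP estimator (18)
```

For large problems one would avoid forming the dense matrices and use the sparse precision structure, but the small dense version shows the workflow.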

## 3 The Semivariogram Method for Estimating ν and κ

In the inverse problem formulation above, the components of the vector x correspond to values of an unknown function at numerical mesh points within a spatial region D. This motivates using methods from spatial statistics to estimate the Whittle-Matérn prior hyper-parameters ν and κ. One such method uses a variogram, and a corresponding semivariogram [27], which requires the assumption of intrinsic stationarity, i.e., that the elements of x have constant mean and that the variance of the difference between elements is constant throughout the region. This is a weaker assumption than is required by many other parameter estimation tools, which is one of the reasons variograms have become popular in spatial statistical applications [28], and it is the reason we will use semivariograms here.

The semivariogram is defined by γ(r) = (1/2) Var[X(u_i) − X(u_j)], where r = u_i − u_j and X is a spatial process. Due to our stationarity assumption, Var[X(u_i)] = Var[X(u_j)] = σ², which we use to derive the following alternative expression for γ(r):

 γ(r) = (1/2)(Var[X(u_i)] + Var[X(u_j)] − 2 Cov[X(u_i), X(u_j)]) = σ² − Cov[X(u_i), X(u_j)].

Thus, the semivariogram simplifies to the difference between the variance in the region and the covariance between two points separated by the lag r. The variogram is formally defined as 2γ(r); hence the terms variogram and semivariogram are often used interchangeably. To remain consistent, we will continue to refer to γ(r) as a semivariogram throughout the paper.

We now need a way to estimate the semivariogram from given data. For this, we use what is known as the sample, or empirical, semivariogram. Assuming that X is isotropic, so that the semivariogram depends only on the distance r = ∥u_i − u_j∥, the empirical semivariogram can be expressed

 γ̂(r) = (1/(2n(r))) Σ_{(i,j) : ∥u_i − u_j∥ = r} (x(u_i) − x(u_j))², (20)

where x is a realization of X, and n(r) is the number of point pairs separated by a distance r. The values γ̂(r) are often referred to as the semivariance values. In a typical semivariogram, the semivariance values increase as r increases, since points tend to be less similar the further apart they are, which increases the variance of their differences.
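In practice the pairwise distances are grouped into tolerance bins; a minimal binned implementation of (20) in one dimension might look as follows (the bin count and maximum-distance fraction are illustrative choices):

```python
import numpy as np

def empirical_semivariogram(x, u, n_bins=15, max_frac=0.3):
    """Binned empirical semivariogram, cf. (20): average half squared
    differences over all point pairs whose separation falls in each bin."""
    D = np.abs(u[:, None] - u[None, :])            # pairwise distances
    V = 0.5 * (x[:, None] - x[None, :])**2         # half squared differences
    iu = np.triu_indices_from(D, k=1)              # count each pair once
    d, v = D[iu], V[iu]
    edges = np.linspace(0.0, max_frac * (u.max() - u.min()), n_bins + 1)
    centers, gamma_hat = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (d > lo) & (d <= hi)
        if m.any():
            centers.append(0.5 * (lo + hi))
            gamma_hat.append(v[m].mean())
    return np.array(centers), np.array(gamma_hat)

# A smooth field has small semivariance at short lags that grows with distance
u = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * u)
centers, gamma_hat = empirical_semivariogram(x, u)
```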

Although the empirical semivariogram is useful in obtaining semivariance values from data, it is not ideal for modeling data for various reasons (see [28] for details); thus it is typical to fit a semivariogram model to the empirical semivariogram. Since our prior distribution for x has a Matérn covariance, we will use the theoretical Matérn semivariogram model [4, 2] given by

 γ(r,θ) = { 0 if r = 0;  a₀ + (σ² − a₀)(1 − (1/(2^{ν−1}Γ(ν)))(κr)^ν K_ν(κr)) if r > 0 }, (21)

where a₀ is the nugget, σ² is the sill, and θ = (a₀, σ², ν, κ). The nugget is the term given to the semivariance value at a distance just greater than zero, and the sill is the total variance contribution, or the semivariance value where the model levels out. The sill, σ², is also the variance parameter in the Matérn covariance function (5). We can estimate a₀, σ², ν, and κ by fitting the semivariogram model to the empirical semivariogram.

There are a number of ways to fit the semivariogram model to the empirical semivariogram. We use weighted least squares, as is commonly done [28], choosing the θ that minimizes

 W(θ) = Σ_r (n(r)/(2[γ(r,θ)]²)) (γ̂(r) − γ(r,θ))². (22)

To minimize W(θ), we adapt the MATLAB codes from [29, 30]. More specifically, we adapt [29] for computing the empirical semivariances and we adapt [30] for minimizing W(θ). Although it is possible to optimize both ν and κ continuously, we will require the exponent ν + d/2 to be an integer.

For an illustration, we look to the one-dimensional deblurring example in Section 2.3.2 of [25] and construct the empirical semivariogram (20) using the data b. Fitting the model yields optimized values of ν and κ, a corresponding practical range, and estimates of the sill and nugget. A plot of the resulting fitted Matérn semivariogram model together with the empirical semivariogram is given in Figure 4.

The values of ν and κ obtained by fitting the Matérn semivariogram model to b, as described in the previous paragraph, can be used to define the Whittle-Matérn prior (17). The sill, σ², and the nugget, a₀, are not especially useful outside of fitting the semivariogram model because they do not correspond to any parameter in (17). They are helpful only in determining the best estimates for ν and κ. Any contribution these parameters may have made to the prior distribution will be accounted for in the regularization parameter, α. Therefore, after fitting the semivariogram models, σ² and a₀ are discarded.

With estimates for ν and κ in hand, the MAP estimator, x_α, can then be computed as in Section 2.4, from which we can recompute ν and κ by fitting the Matérn semivariogram model to the empirical semivariogram of x_α. Repeating this process iteratively yields the following algorithm.

The Semivariogram Method for MAP Estimation with Whittle-Matérn Prior:

0. Estimate ν and κ by fitting a Matérn semivariogram model to the data b.

1. Define the prior (17) using ν and κ, compute α using (19), and compute x_α using (18).

2. Update ν and κ by fitting a Matérn semivariogram model to x_α.

3. Return to step 1 and repeat until ν and κ stabilize.

### 3.1 Numerical Experiments

We now implement the semivariogram method on deblurring examples in both one and two dimensions. Recall that the connection between the Matérn covariance and the Whittle-Matérn prior depends on a stationarity assumption, which the following examples may not exhibit. For simplicity, we will still assume stationarity and acknowledge that future work should be done in the case when no stationarity or isotropy is present.

#### 3.1.1 One-Dimensional Results

When implementing the semivariogram method for the deblurring problem in Section 2.3.2 of [25], we fit the semivariograms using 15 approximately equally spaced distances, up to a maximum distance of nearly 30% of the way across the region, yielding the 15 semivariance values that we fit using the Matérn semivariogram model. We chose this cutoff because it balances the need to capture the covariance structure at short distances, which are well known to be the most important [28], with that at longer distances.

We implement three iterations of the semivariogram method to obtain the Whittle-Matérn parameter values ν and κ and the regularization parameter α. A plot of the resulting x_α is given in Figure 5. It is plotted together with the true image on the left and with the Tikhonov reconstruction on the right, and the relative error between the true image and each reconstruction is computed. When using periodic boundary conditions instead of zero boundary conditions, we obtain nearly identical results.

In addition to giving a good estimate for x, the semivariogram method has another advantage: fitting a semivariogram to b in step 0 yields estimates of ν and κ that we use to determine how far we need to extend our domain in order to maintain the connection between the SPDE solution and the Matérn covariance. In the above example, the parameter values computed in step 0 suggested that only a slight extension of the computational domain was needed in order to maintain the connection.

#### 3.1.2 Two-Dimensional Results

We now consider a two-dimensional image deblurring test case similar to that in [25, Section 3.1.3]. We assume periodic boundary conditions on the extended domain, but due to the restriction from the extended domain to D, circulant structure is lost in the forward model matrix, and hence, linear system solves must be done using an iterative method. As in [25, Section 3.1.3], we use preconditioned conjugate gradient (PCG) iteration, both for computing α using GCV and for computing x_α. We attempt to deblur a 128×128 image of Main Hall on the University of Montana (UM) campus. To do this, we begin with a 256×256 image, given in Figure 6, and then restrict to the center 128×128 image. This smaller image in the middle will be thought of as being on the domain D, and the larger, full image will then be defined on the extended domain.

To obtain b, we first perform the blurring operation on the full 256×256 true image plotted in Figure 6. Since this is a color image, the deblurring process is done individually for the red, green, and blue intensity arrays. We then restrict to the central 128×128 pixels (with boundaries denoted in Figure 6) to obtain the blurred image plotted on the left in Figure 7. We seek an estimate of x in the same central subregion. Omnidirectional semivariograms with 25 approximately equally spaced distances are used: we add ten additional distances relative to the one-dimensional case, and reduce the maximum distance considered from one-fifth to one-tenth of the maximum, because there are many more pairs of points in the two-dimensional case. The semivariogram method is used to obtain estimates of ν and κ for each of the red, green, and blue color bands, along with the regularization parameter α. We also compute the Tikhonov solution, as defined in [25, Section 3.1.3]. The two solutions are plotted in Figure 7, where it appears that the solution with the Whittle-Matérn prior, plotted on the right, is better able to reconstruct the true image than is the Tikhonov solution, plotted in the middle.

A more objective (non-visual) comparison of quality for the two reconstructions is presented in Table 1, which contains the following statistics calculated for the elements of the color image: the mean, the variance, the first and third quartiles, the correlation between true and estimated values, the maximum, minimum, and median, and finally, the mean absolute error (MAE) and the mean squared error (MSE). From the table, it is clear that the distribution of color intensities when using the Whittle-Matérn prior is much more accurate than when using the identity covariance with the Tikhonov solution. Note, specifically, that the median and the first and third quartiles line up well with the true percentiles. Additionally, the variance is lower and the correlation between true and estimated values is much higher for the Whittle-Matérn solution. The MAE and MSE of the residuals are also much lower for this solution. Therefore, all evidence points to the conclusion that the Whittle-Matérn prior, with ν and κ estimated using the semivariogram method, yields a better solution than the Tikhonov solution.

In both of the previous examples, it is probable that both the Tikhonov and Whittle-Matérn reconstructions could be improved with a more carefully chosen α value. We did not fine-tune the α selection procedure and simply used the value chosen by GCV.

## 4 Hierarchical Modelling and Markov chain Monte Carlo

In the previous examples, we calculated point estimates of ν and κ using semivariograms and of α using GCV. Although these estimates are useful by themselves, an approach that allows for the quantification of uncertainty in the hyper-parameters is desirable.

Sampling x and the hyper-parameters simultaneously with a fully Bayesian approach can be done using the Metropolis algorithm or, as is suggested in [31], via slice sampling. The problem is that these variables, especially the hyper-parameters, are so highly correlated that the sampling can be wildly inefficient [32] without careful selection of the prior distributions and considerable adjusting of the proposal distributions [26]. In our tests in one dimension, the effective sample size (ESS) [25] was more than 100 times smaller than the number of samples generated.

Due to the inefficiency of sampling from this posterior, we will fix ν, as is commonly done [26, 9], setting it equal to the estimated value determined by the semivariogram method. Since ν, d, and h are fixed in these examples, we will simplify notation and refer to the prior precision matrix (h²κ² I + L)^{ν+d/2} as P_{ν,κ}. We now focus on quantifying the uncertainty in λ, δ, and κ. To this end, we employ Markov chain Monte Carlo (MCMC) methods to obtain samples from the posterior p(x,λ,δ,κ|b,ν).

First, we assume λ and δ are random variables with Gamma-distributed hyper-priors, as in [25]:

 p(λ) ∝ λ^{α_λ−1} exp(−β_λ λ),
 p(δ) ∝ δ^{α_δ−1} exp(−β_δ δ),

where α_λ, β_λ, α_δ, β_δ > 0 are fixed shape and rate parameters. Gamma hyper-priors lead to Gamma conditional posterior densities for λ and δ (a property known as conjugacy), which can be sampled from directly. Since no such conjugacy exists for κ, we assign it a uniform hyper-prior, κ ∼ U(0, κ_max), where κ_max is a selected upper bound for κ. We assign the uniform hyper-prior in an effort to be as data-driven as possible. By Bayes' law, the posterior density function is given by

 p(x,λ,δ,κ|b,ν) ∝ p(b|x,λ) p(x|δ,ν,κ) p(λ) p(δ) p(κ)
 ∝ λ^{M/2+α_λ−1} δ^{N/2+α_δ−1} det(P_{ν,κ})^{1/2} exp(−(λ/2)∥Ax − b∥² − (δ/2) x^T P_{ν,κ} x − β_λ λ − β_δ δ),  0 < κ < κ_max,

where P_{ν,κ} = (h²κ² I + L)^{ν+d/2} and N is the length of x. It can be shown that the conditional probability distributions for x, λ, and δ are given by

 x|b,λ,δ,ν,κ ∼ N((λA^T A + δP_{ν,κ})^{−1} λA^T b, (λA^T A + δP_{ν,κ})^{−1}),
 λ|b,x,δ,ν,κ ∼ Γ(M/2 + α_λ, (1/2)∥Ax − b∥² + β_λ),
 δ|b,x,λ,ν,κ ∼ Γ(N/2 + α_δ, (1/2) x^T P_{ν,κ} x + β_δ),

and the conditional probability density for κ is

 p(κ|x,λ,δ,ν,b) ∝ det(P_{ν,κ})^{1/2} exp(−(δ/2) x^T P_{ν,κ} x),  0 < κ < κ_max.

Since no conjugacy exists for κ, we will use the Metropolis-Hastings algorithm to sample from p(κ|x,λ,δ,ν,b). For the proposal density, we use a log-normal proposal, i.e.,

 q(κ|κ_k) ∝ (1/κ) exp(−(1/(2ρ²))(ln κ − ln κ_k)²), (23)

where the step size ρ is chosen offline. A proposed sample κ* ∼ q(·|κ_k) is then accepted with probability

 r(κ*, κ_k) = min{1, (p(κ*|x_k, λ_{k+1}, δ_{k+1}, ν, b) q(κ_k|κ*)) / (p(κ_k|x_k, λ_{k+1}, δ_{k+1}, ν, b) q(κ*|κ_k))}.

That is, we set κ_{k+1} = κ* with probability r(κ*, κ_k), and otherwise we set κ_{k+1} = κ_k.
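Putting the conditional distributions and the Metropolis-Hastings step together yields a Metropolis-within-Gibbs sampler. The sketch below runs it on a small synthetic problem with a trivial forward model; the hyper-prior values, step size ρ, κ_max, and problem dimensions are all illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
h = 1.0 / n
Lap = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
A = np.eye(n)                                  # trivial forward model, for illustration only
x_true = np.sin(2 * np.pi * (np.arange(n) + 0.5) * h)
b = A @ x_true + 0.05 * rng.standard_normal(n)

def P_of(kappa):                               # P_{nu,kappa} = (h^2 kappa^2 I + L)^(nu + d/2), exponent 2
    return np.linalg.matrix_power(h**2 * kappa**2 * np.eye(n) + Lap, 2)

def log_p_kappa(kappa, x, delta):              # log conditional density for kappa, up to a constant
    P = P_of(kappa)
    _, logdet = np.linalg.slogdet(P)
    return 0.5 * logdet - 0.5 * delta * (x @ P @ x)

a_lam = a_del = 1.0
b_lam = b_del = 1e-4                           # Gamma hyper-prior parameters (illustrative)
rho, kappa_max = 0.3, 100.0
lam, delta, kappa = 1.0, 1.0, 10.0
x = b.copy()
samples = []
for k in range(500):
    P = P_of(kappa)
    # x | rest ~ N((lam A'A + delta P)^{-1} lam A'b, (lam A'A + delta P)^{-1})
    H = lam * (A.T @ A) + delta * P
    mean = np.linalg.solve(H, lam * (A.T @ b))
    x = mean + np.linalg.cholesky(np.linalg.inv(H)) @ rng.standard_normal(n)
    # lam, delta | rest: conjugate Gamma updates (numpy's gamma takes shape and scale)
    lam = rng.gamma(n / 2 + a_lam, 1.0 / (0.5 * np.sum((A @ x - b)**2) + b_lam))
    delta = rng.gamma(n / 2 + a_del, 1.0 / (0.5 * (x @ P @ x) + b_del))
    # kappa | rest: Metropolis-Hastings with the log-normal proposal (23)
    kappa_star = np.exp(np.log(kappa) + rho * rng.standard_normal())
    if kappa_star < kappa_max:
        log_r = (log_p_kappa(kappa_star, x, delta) - log_p_kappa(kappa, x, delta)
                 + np.log(kappa_star) - np.log(kappa))   # q(k_k|k*)/q(k*|k_k) = k*/k_k
        if np.log(rng.uniform()) < log_r:
            kappa = kappa_star
    samples.append((lam, delta, kappa))
```

Chains of (λ, δ, κ) samples collected this way give the distributional approximations of the hyper-parameters discussed in the text; for large problems, the dense solves and determinants would be replaced by sparse or FFT-based computations.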