 # Bernstein-von Mises theorems and uncertainty quantification for linear inverse problems

We consider the statistical inverse problem of approximating an unknown function f from a linear measurement corrupted by additive Gaussian white noise. We employ a nonparametric Bayesian approach with standard Gaussian priors, for which the posterior-based reconstruction of f corresponds to a Tikhonov regulariser f̅ with a Cameron-Martin space norm penalty. We prove a semiparametric Bernstein-von Mises theorem for a large collection of linear functionals of f, implying that semiparametric posterior estimation and uncertainty quantification are valid and optimal from a frequentist point of view. The result is illustrated and further developed for some examples both in mildly and severely ill-posed cases. For the problem of recovering the source function in elliptic partial differential equations, we also obtain a nonparametric version of the theorem that entails the convergence of the posterior distribution to a fixed infinite-dimensional Gaussian probability measure with minimal covariance in suitable function spaces. As a consequence, we show that the distribution of the Tikhonov regulariser f̅ is asymptotically normal and attains the information lower bound, and that credible sets centred at f̅ have correct frequentist coverage and optimal diameter.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Inverse problems arise naturally in a variety of scientific disciplines, where the relationship between the quantity of interest and the data collected in an experiment is determined by the physics of the underlying system and can be mathematically modelled. Real world measurements are always discrete and carry statistical noise, which is often most naturally modelled by independent Gaussian random variables. The observation scheme then gives rise to an inverse regression model of the form

 Mi=(Af)i+Wi,i=1,…,n, Wiiid∼N(0,1), (1)

where the forward operator is assumed to be linear between separable Hilbert spaces .

However, formulation and analysis of the inverse problem is usually best done by restrictions from an underlying continuous model. This guarantees, among other things, discretisation invariance, allowing to switch consistently between different discretisations [11, 29, 30, 48]. Thus in this paper we consider the continuous (nonparametric) linear inverse problem of recovering an unknown function from a noisy indirect measurement

 Mε=Af+εW. (2)

Model (2) is asymptotically equivalent to (1) when and is assumed to be Gaussian white noise in the separable Hilbert space [4, 42]. Note that while can be defined by its actions on , it does not take values there almost surely.

In the present paper we follow the Bayesian approach to inverse problems, employing a standard nonparametric procedure based on a centred Gaussian prior for , see[11, 48]. The solution to the statistical inverse problem is then the conditional distribution of given

, the mean or mode of which can be used as a point estimator. The main appeal of the method is, however, that it automatically delivers quantification of the uncertainty in the reconstruction, obtained through credible sets, i.e. regions of the parameter space with specified high posterior probabilities. In many applications this methodology can be efficiently implemented using modern MCMC algorithms that allow fast sampling from the posterior distribution.

In the Bayesian approach the prior distribution serves as a regularisation tool, and it is natural to ask whether the methodology delivers correct, prior-independent - and if so, in some sense optimal - inference on the unknown parameter in the small noise limit. These questions can be addressed under the frequentist assumption that is in reality generated through the scheme (2) by a fixed true solution (instead of being randomly drawn from ). We then investigate how the posterior distribution concentrates around when . The speed of convergence can be characterised through posterior contraction rates, first studied in [17, 46], and further investigated by [1, 2, 10, 24, 26, 28, 27, 25, 41, 55] among the others. See also [37, 39, 40] for results relative to non-linear inverse problems.

However, determining whether the associated uncertainty quantification is objectively valid requires finer analysis of the posterior distribution. The central question is: Do credible sets have the correct frequentist coverage in the small noise limit? That is, do we have, for some ,

 Π(f∈C|Mε)≈1−α⇔P(f†∈C(M†ε))≈1−α, (3)

with a small as ? The importance of the above questions is not restricted just to the Bayesian paradigm. In linear Bayesian inverse problems with Gaussian priors the conditional mean estimator coincides with the maximum a posteriori (MAP) estimator, which in turn can be shown to coincide with a Tikhonov regulariser with a Cameron-Martin space norm penalty, see [10, 20]. Thus, if (3) holds for a credible set centred at the posterior mean, we can use as an (asymptotic) frequentist confidence region for the Tikhonov regulariser .

Obtaining optimal contraction rates is not enough to answer the above question even in the parametric case. For finite-dimensional models the Bernstein–von Mises (BvM) theorem establishes, under mild assumptions, that the posterior distribution is approximated by a Gaussian distribution which is centred at the maximum likelihood estimator for

and has minimal covariance matrix. As a consequence, credible sets coincide asymptotically with frequentist confidence regions (see, e.g., [53, Chapter 10]). On the other hand, understanding the frequentist properties of nonparametric credible sets presents a more delicate matter. It was observed by , and later in , that the BvM phenomenon may fail to hold even in a simple nonparametric regression model, where credible balls in are shown to have null asymptotic coverage.

Positive results, both in the direct and inverse setting, have been obtained in subsequent developments [6, 7, 26, 31, 49]. In particular, [6, 7] showed that a natural way of investigating the nonparametric BvM phenomenon is from a semiparametric perspective, by showing the weak convergence of the posterior to a fixed infinite-dimensional Gaussian distribution on a large enough function space. Recently, this program has been successfully adjusted for inverse problems: a semiparametric result was obtained in  for geodesic X-ray transforms, while a nonparametric BvM theorem was proved in  for the non-linear inverse problem of recovering the potential term in an elliptic partial differential equation (PDE); see also  for non-linear inverse problems with jump processes.

In this paper we follow ideas presented in  by extending the results to linear inverse problems of the general form (2). In particular, we prove BvM theorems for functionals , with a large family of test functions , which entails the convergence of to a limiting Gaussian process with optimal covariance structure that recovers the semiparametric information lower bound. As a consequence, we deduce the statistical efficiency of plug-in Tikhonov regularisers

and that credible intervals centred at such estimators constitute asymptotically valid and optimal confidence intervals. The applicability of the general theory is illustrated by deriving sufficient conditions on the test functions for the BvM phenomenon to occur in case of recovering an unknown source function in elliptic boundary value problems (BVP), and in a BVP for the heat equation. We then show for the elliptic BVP example, in which the properties of the crucial ’inverse Fisher information’ operator

are well-understood (by PDE theory), that the techniques employed previously can be refined to further relax the assumptions on the test functions to depend only on the smoothing properties of . Finally, by requiring a slightly stronger smoothness, we adapt the program laid out in  to the problem at hand, and obtain a nonparametric BvM theorem which implies that certain nonparametric credible sets built around the Tikhonov regulariser have asymptotically correct coverage and optimal diameter. Note that we do not make additional assumptions about the smoothness of . Instead of assuming a source condition to achieve convergence in a desired space, we study the convergence in a larger space which is defined by the smoothness of .

This article is organised as follows: we introduce the general setting in Section 2.1, and state and prove the semiparametric BvM theorem for linear functionals of the unknown in Section 2.2. In Section 3 we deduce from the previous results the asymptotic normality of and the coverage properties of credible intervals. Section 4 is dedicated to the examples. Finally, in Section 5 we formulate the nonparametric BvM theorem for the problem of recovering the source function in an elliptic BVP. Appendix A provides some background on the theory of semiparametric statistical efficiency.

## 2 General posterior results

### 2.1 The Bayesian approach for linear inverse problems

We are interested in the following continuous (nonparametric) model for indirect measurements

 Mε=Af+εW,ε>0. (4)

The forward operator is assumed to be linear, bounded and injective, and are separable Hilbert spaces of real valued functions defined on , which are -, -dimensional Riemannian manifolds respectively. Below we often denote , . The forward operator has a well defined adjoint for which

 ⟨Af,g⟩W2=⟨f,A∗g⟩W1,

for all and .

###### Condition 1.

Suppose there exists a separable Hilbert space , such that is continuous and that is dense in the norm of . In particular, for some ,

 ∥Af∥W2≤c∥f∥¯¯¯¯¯W, ∀f∈¯W. (5)

Note that the more smoothing the forward operator is, the larger we can choose . For example, if we assume that is a -times smoothing elliptic differential operator we may choose .

The noise amplitude is modelled by . The measurement noise is a centred Gaussian white noise process with covariance

 E(W(φ)W(ψ))=⟨φ,ψ⟩W2,φ,ψ∈W2.

Below we often write for the random variable . Observing data means that we observe a realisation of the Gaussian process with marginal distributions ; and we denote by the probability space supporting .

Let be the law of for fixed . Arguing as in [37, Section 7.4], we can use the law of as a common dominating measure, and then apply the Cameron–Martin theorem (e.g. Proposition 6.1.5 in ) to define the likelihood function

 f↦ℓ(f)=ℓ(f,Mε)=logpf(Mε)=logdPMfdPM0(Mε)=1ε2⟨Mε,Af⟩W2−12ε2∥Af∥2W2. (6)

We assume that follows a prior measure which is a Gaussian Borel probability measure on and denote its reproducing kernel Hilbert space (RKHS) or Cameron-Martin space by . Noticing (again, as argued in [37, Section 7.4]) that given by (6

) can be taken to be jointly measurable, we can then use Bayes’ theorem to deduce that the posterior distribution of

arising from observation (4) can be written as

 Π(B|Mε) =∫Bpf(Mε)dΠ(f)∫¯¯¯¯¯Wpf(Mε)dΠ(f)B∈B¯¯¯¯¯W % a Borel set in ¯¯¯¯¯¯W. (7)

We are interested in analysing under the assumption that the measurement is generated from a true deterministic unknown , that is, when in (4) . In the following we assume that the prior satisfies the following concentration condition for a given .

###### Condition 2.

Let be a Gaussian Borel probability measure on a separable Hilbert space and let be its RKHS. Define the concentration function of for a fixed as

 ϕΠ,f†(δ)=infg∈VΠ,∥g−f†∥¯¯¯¯W≤δ⎧⎨⎩∥g∥2VΠ2−logΠ(f:∥f∥¯¯¯¯¯W≤δ)⎫⎬⎭. (8)

We assume that for a fixed and some sequence , such that , as , satisfies

 (9)

The above condition characterises the asymptotic behaviour of the small ball probabilities , when , and guarantees that the prior puts enough mass around in -norm. The concentration function of Gaussian priors is well studied and Condition 2 is rather mild, see e.g. [18, 54] or [19, Chapter 2.6].

### 2.2 A semiparametric Bernstein–von Mises theorem

Next we formulate a semiparametric Bernstein–von Mises theorem for linear inverse problems. Theorem 3 below states the convergence of random laws in probability, which means that for any metric for weak convergence of laws, the real random variables converge to zero in probability, see .

###### Theorem 3.

Let be the law generating , where fulfils Condition 1, is white noise in , and is the noise level. Let be the posterior distribution arising from observations (4) and prior with satisfying Condition 2 for a given .

Let be such that , for all , and suppose , for some . Then we have in -probability

 L(ε−1(⟨f,ψ⟩W1−ˆΨ)|Mε)→LN(0,∥A˜ψ∥2W2) (10)

as , where

 ˆΨ=⟨f†,ψ⟩W1−ε⟨A˜ψ,W⟩W2. (11)

Note that, since is dense and is assumed to be uniformly continuous, we can extend continuously to .

###### Proof.

The proof follows ideas developed in  for the special case of being the X-ray transform. We start by showing that it is enough to consider convergence of instead of with some large enough set . Here denotes the posterior arising from the prior restricted to and renormalised. The second step is to find an appropriate set

. We then proceed to study the moment generating function of

under the posterior and finally conclude that converges weakly to .

###### Lemma 4.

Let be a Gaussian Borel probability measure in . Suppose that for a fixed its concentration function satisfies Condition 2 with some sequence , such that . Let be the posterior distribution arising from the measurement (4). Then for any Borel set for which

 Π(Dcε)≲e−K(δε/ε)2,for some K>3, (12)

and all small enough we have

 Π(Dcε|Mε)→0and% ∥Π(⋅|Mε)−ΠDε(⋅|Mε)∥TV→0

when in -probability. Above is the posterior arising from the prior restricted to and renormalised.

###### Proof.

We start by noting that one can write and furthermore

 Π(B∩Dε)−ΠDε(B)=Π(B∩Dε)Π(¯¯¯¯¯¯W)−Π(B∩Dε)Π(Dε)=−Π(Dcε)ΠDε(B)

which implies . Hence it suffices to prove the first limit.

We can write

 Π(B|Mε) =∫Beℓ(f)−ℓ(f†)dΠ(f)∫¯¯¯¯¯Weℓ(f)−ℓ(f†)dΠ(f)B∈B¯¯¯¯¯W.

Under , we have for any (see [37, Lemma 3])

 ℓ(f)−ℓ(g)=1ε⟨A(f−g),W⟩W2−12ε2∥A(f−f†)∥2W2+12ε2∥A(g−f†)∥2W2 (13)

and hence

 ℓ(f)−ℓ(f†) =1ε⟨A(f−f†),W⟩W2−12ε2∥A(f−f†)∥2W2.

Let be any probability measure on the set . Applying Jensen’s inequality to the exponential function we get for any

 PMf†⎛⎝∫Beℓ(f)−ℓ(f†)dν(f)≤e−(1+˜C)(δεε)2⎞⎠ =PMf†⎛⎝Eν(eℓ(f)−ℓ(f†))≤e−(1+˜C)(δεε)2⎞⎠ ≤P(Eν(1ε⟨A(f−f†),W⟩W2−12ε2∥A(f−f†)∥2W2)≤−(1+˜C)(δεε)2).

Denote where, using again Jensen’s inequality, we can estimate

 CZ =1ε2E(∫B⟨A(f−f†),W⟩W2dν(f))2 =1ε2E(Eν⟨A(f−f†),W⟩W2)2 ≤1ε2Eν(E⟨A(f−f†),W⟩2W2) =1ε2∫B∥A(f−f†)∥2W2dν(f)≤(δεε)2.

We can then conclude

 PMf†⎛⎝∫Beℓ(f)−ℓ(f†)dν(f)≤e−(1+˜C)(δεε)2⎞⎠ ≤P(∣∣∣1ε∫B⟨A(f−f†),W⟩W2dν(f)∣∣∣≥(12+˜C)(δεε)2) =P(|Z−EZ|≥(12+˜C)(δεε)2)≤e−(1/2+˜C)22(δε/ε)2

where the last inequality follows from standard Gaussian tail bound .

Next we choose and let

 Fε={f:∫Beℓ(f)−ℓ(f†)dν(f)≤e−32(δε/ε)2}.

Using the above with we see that . Using Markov’s inequality, denoting the expectation with respect to , it suffices to to prove that

tends to zero, when . Since we see that

 EMf†(Π(Dcε|Mε)1Fε)≤EMf†(1Fε)≤e−δ2ε2ε2.

The second term can be written as

 EMf†(Π(Dcε|Mε)1Fcε) =EMf†⎛⎜⎝∫Dcεeℓ(f)−ℓ(f†)dΠ(f)∫¯¯¯¯¯Weℓ(f)−ℓ(f†)dΠ(f)1Fcε⎞⎟⎠ ≤EMf†⎛⎜⎝∫Dcεeℓ(f)−ℓ(f†)dΠ(f)∫Beℓ(f)−ℓ(f†)Π(B)dν(f)1Fcε⎞⎟⎠ ≤e2(δε/ε)2Π(f:∥A(f−f†)∥2W2≤δ2ε)∫DcεEMf†(eℓ(f)−ℓ(f†))dΠ(f).

Denote

 ˜ϕΠ,f†(δ)=infAg∈V˜Π,∥A(g−f†)∥W2≤δ⎡⎣∥Ag∥2V˜Π2−logΠ(f:∥Af∥W2≤δ)⎤⎦,

where is the RKHS of . Following the approach of [19, Proposition 2.6.19] we next show that

 Π(f:∥A(f−f†)∥2W2≤δ2)≥e−˜ϕΠ,f†(δ/2).

Let be such that . Then . We denote . Using the Cameron-Martin theorem [3, Corollary 2.4.3.] and the fact that is a centred Gaussian random variable we can write

 Π( f:∥A(f−f†)∥W2≤δ)≥Π(f:∥A(f−g)∥W2≤δ/2) =12(Π−Ag(f:∥Af∥W2≤δ2)+ΠAg(f:∥Af∥W2≤δ2)) =12(∫{∥˜f∥W2≤δ2}d˜Π−˜g(˜f)d˜Π(˜f)d˜Π(˜f)+∫{∥˜f∥W2≤δ2}d˜Π˜g(˜f)d˜Π(˜f)d˜Π(˜f)) =12⎛⎜⎝∫{∥˜f∥W2≤δ2}(e−⟨˜g,˜f⟩V˜Π+e⟨˜g,˜f⟩V˜Π)e−∥˜g∥2V˜Π2d˜Π(˜f)⎞⎟⎠ ≥e−∥Ag∥2V˜Π2Π(f:∥Af∥W2≤δ2)

where and . The last inequality follows from the fact for all . We can then conclude

 EMf†(Π(Dcε|Mε)1Fcε)≤e2(δε/ε)2e˜ϕΠ,f†(δε/2)Π(Dcε)

since .

Note that is assumed to be linear and injective and hence the Cameron-Martin space of is isometric to . We know that

 ∥Af∥W2≤c∥f∥¯¯¯¯¯W

which implies

 −logΠ(f:∥Af∥W2≤δ)≤−logΠ(f:∥f∥¯¯¯¯¯W≤δc).

We also have and hence for all . Thus by (9) and assumption (12) we can conclude

 EMf†(Π(Dcε|Mε)1Fcε)≤e2(δε/ε)2eϕΠ,f†(δε/2c)Π(Dcε)≤e(3−K)(δε/ε)2.

Let be such that

 |⟨ψ,φ⟩W1|≤∥φ∥¯¯¯¯¯W

for all and with some . Note that, since is dense and is uniformly continuous, we can extend continuously to . When we have and the standard Gaussian tail bound guarantees for all that

 Π(f:|⟨˜ψ,f⟩VΠ|∥˜ψ∥VΠ>tδεε)≤e−t2δ2ε2ε2.

Hence we can choose

 Dε={f:|⟨˜ψ,f⟩VΠ|∥˜ψ∥VΠ≤Tδεε},T>√6 (14)

and restrict to studying the posterior distribution arising from the prior .

###### Lemma 5.

Assume condition 2. For fulfilling the conditions in Theorem 3 define random variables

 ˆΨ=⟨f†,ψ⟩W1−ε⟨A˜ψ,W⟩W2.

Then for all we have

 EΠDε(eτε(⟨f,ψ⟩W1−ˆΨ)|Mε)=eτ22∥A˜ψ∥2W2(1+oPMf†(1)) (15)

when .

###### Proof.

Denote . Then the left hand side of (15) can be written as

 EΠDε (eτε⟨f−f†,ψ⟩W1+τ⟨A˜ψ,W⟩W2|Mε) =∫¯¯¯¯¯Weτε⟨f−f†,ψ⟩W1+τ⟨A˜ψ,W⟩W2+ℓ(fτ)−ℓ(fτ)+ℓ(f)dΠDε(f)∫¯¯¯¯¯Weℓ(f)dΠDε(f).

Using (13) we see that under

 ℓ(f)−ℓ(fτ)=−12ε2(∥A(f−f†)∥2W2−∥A(fτ−f†)∥W2)−1ε⟨A(τε˜ψ),W⟩W2=τε⟨A(f−f†),A˜ψ⟩W2+τ22∥A˜ψ∥W2−τ⟨A˜ψ,W⟩W2 (16)

and hence

 EΠDε(eτε(⟨f,ψ⟩W1−ˆΨ)|Mε)=eτ22∥A˜ψ∥W2∫Dεeℓ(fτ)dΠ(f)∫Dεeℓ(f)dΠ(f). (17)

Let be the shifted law of , . Then by the Cameron-Martin theorem (see [3, 19]) we get

 ∫Dε,τeℓ(g)dΠτ(g)dΠ(g)dΠ(g)∫Dεeℓ(g)dΠ(g)=∫Dε,τeℓ(g)eτε⟨˜ψ,g⟩VΠ−(τε)22∥˜ψ∥VΠdΠ(g)∫Dεeℓ(g)dΠ(g). (18)

Above . We see that , when . Using the definition of we can write

 εsupg∈Dε,τ|⟨˜ψ,g⟩VΠ|=εsupf∈Dε|⟨˜ψ,f+τε˜ψ⟩VΠ|≤Tδε∥˜ψ∥VΠ+|τ|ε2∥˜ψ∥2