 # Posterior contraction for empirical Bayesian approach to inverse problems under non-diagonal assumption

We investigate an empirical Bayesian nonparametric approach to a family of linear inverse problems with Gaussian prior and Gaussian noise. We consider a class of Gaussian prior probability measures with covariance operator indexed by a hyperparameter that quantifies regularity. By introducing two auxiliary problems, we construct an empirical Bayes method and prove that this method can automatically select the hyperparameter. In addition, we show that this adaptive Bayes procedure provides optimal contraction rates up to a slowly varying term and an arbitrarily small constant, without knowledge about the regularity index. Our method needs not the prior covariance, noise covariance and forward operator have a common basis in their singular value decomposition, enlarging the application range compared with the existing results.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Inverse problems for partial differential equations arise naturally from medical imaging, seismic exploration and so on. There are two difficulties for such types of problems: one is non-uniqueness and another one is instability. Bayesian approach formulates inverse problems as statistical inference problems, which enables us to overcome both of these difficulties. Because of that, Bayes’ inverse method has attracted a lot of interest in recent years; see for instance

[4, 5, 7, 9, 15, 16, 21, 26]. However, compared with the classical regularization techniques, the development of a theory of Bayesian posterior consistency is still in its infancy.

For clarity, let us provide some basic settings. Let be a separable Hilbert space, with norm and inner product , and let be a self-adjoint and positive-definite linear operator with bounded inverse. Then, we usually assume the following problem

 (1.1) y=A−1u+1√nξ,

where is noise and is a noisy observation of . The inverse problem can be thought to find from . Following the framework shown in , we denote as the prior measure and as the noise measure and assume that

• Prior: ,

• Noise: ,

where is a Gaussian measure with zero mean and covariance operator . Let us denote

to be the posterior measure which is the solution of the inverse problem. Usually, algorithms, such as Markov chain Monte Carlo (MCMC), can be employed to probe the posterior probability measure

. If we assume the range of is included in the domain of , then (1.1) can be rewritten as follow

 (1.2) d=Tu+1√nη,

where , and . Obviously, we have is a white Gaussian noise. We denote to be the posterior measure and still denote to be the prior measure. Here we will be particularly interested in the small noise limit where . Under the frequentist setting, the data are generated by

 (1.3) d†=Tu†+1√nη,η∼N(0,I),

where is a fixed element of . Our aim is to show that the posterior probability measure contracts to a Dirac measure centered on the fixed true solution . In the following, we denote

to be the law of a random variable

and denote to be the expectation corresponding to the distribution of the data . A formal description of the concept of posterior consistency can be stated as follow.

Definition 1. A sequence of Bayesian inverse problems is posterior consistent for with rate if for (1.3), there exists positive sequence such that

 (1.4) E0(μdn{u:∥u−u†∥≥Mnϵn})→0,∀Mn→∞.

Until now, there are a number of studies for posterior consistency and inconsistency for Bayes’ inverse method. When the forward operator is a nonlinear operator, Vollmer  provides a general framework and studies an inverse problem for elliptic equation in detail. Recently, Knapik and Salomond  provide a general framework for investigating the posterior consistency of linear inverse problems, which allows non-Gaussian priors. However, due to the difficulties brought by nonlinearity and non-Gaussian, the existing results are mainly focus on the situation that the forward operator is linear, the prior measure is Gaussian and the noise measure is also Gaussian . When , and are all simultaneous diagonalizable, Knapik et al  provide a roadmap for what is to be expected regarding posterior consistency. This work reveals an important fact that is the optimal convergence rates can be obtained if and only if the regularity of the prior and the truth are matched. Therefore, Knapik et al 

propose an empirical Bayes procedure which provides an estimate of the regularity of the prior through the data. By choosing the regularity index adaptively, the optimal convergence rates are obtained up to a slowly varying factor. Later, Szabó et al

 introduce the “polished tail” condition and investigate the frequentist coverage of Bayesian credible sets by choosing the regularity of the prior through an empirical Bayes method. Recently, by employing abstract tools from regularization theory, Agapiou and Mathé  study the posterior consistency by choosing a non-centered prior through empirical Bayes method. In 2018, Trabs  obtained optimal convergence rates up to logarithmic factors when the forward linear operator depends on an unknown parameter.

When , and are not simultaneous diagonalizable, optimal contraction rates can not be obtained by employing similar methods developed for the simultaneous diagonalizable case. Concerning this case, theories about partial differential equation (PDE) have been employed to obtain nearly optimal convergence rates in  by using precision operators. Using properties of the hypoelliptic operators, Kekkonen et al  obtain nearly optimal convergence rates for Bayesian inversion with hypoelliptic operators. Recently, Mathé  studies the posterior consistency for non-commuting operators by employing the ideas of link conditions originating from classical regularization method in Hilbert scales. Besides these studies, to the best of our knowledge, there are little investigations of the posterior consistency for the non-simultaneous diagonalizable case. In order to achieve the optimal contraction rates, the regularity of the true solution should be known in the above mentioned investigations. Hence, how to generalize the empirical Bayes procedure developed for the simpler simultaneous diagonalizable case to the non-diagonal case is crucial for the applicability of the posterior consistent theory.

In [19, 27]

, the structures of the posterior probability measure for the regularity index are fully analyzed. For the non-diagonal case, the posterior probability measure for the regularity index is hard to define. The reason is that probability density functions can not be defined in infinite-dimensional space and some integrals can not be calculated easily to obtain the log-likelihood of the regularity index obtained in

. Hence, we can hardly define an empirical Bayes procedure for the non-diagonal case. To overcome these difficulties, we notice that regularity index is derived from data and the posterior contraction rates are only depending on the regularity index of the prior measure and the true function . So, if we introduce some artificial diagonal problem with the same regularity properties as for the non-diagonal problem, we may obtain optimal posterior contraction rates relying on the relations between the artificial diagonal problem and the non-diagonal problem.

For a positive constant , we consider Gaussian prior measure . Firstly, we consider the following problem

 (1.5) d=m+1√nη,

where . Then the prior measure for obviously is . Concerning this problem, the forward operator and the covariance operator of the noise are all the identity operator . Hence, the posterior contraction rates of the problem (1.5) seems can be obtained by the ideas developed for the diagonal case. However, the operator

appears in the prior probability measure, which makes difficult to derive estimations of the corresponding eigenvalues. By constructing an artificial prior probability measure with similar regularity properties as for the prior probability measure, we can construct maximum likelihood-based estimate for the regularity index and prove the optimal contraction rates for (

1.5) up to a slowly varying term and an arbitrarily small constant. At last, we transform the results for problem (1.5) to the original problem (1.2) which can be achieved by employing the method developed in .

The outline of this paper is as follows. In Section 2, we provide a brief introduction about Hilbert scales and give some essential assumptions about the covariance operators of the prior and noise probability measure. In Section 3, we exhibit two auxiliary problems. Relying on detailed analysis about the two auxiliary problems, we prove the posterior consistency of our original problem. In Section 4, two examples are given, which illustrate the usefulness of our results. In Section 5, we provide a brief summary and outlook. At last, the proofs of some auxiliary lemmas and properties are collected in Section 6.

## 2. Basic settings and Assumptions

In this section we present basic settings and show some important assumptions made in this paper. For the reader’s convenience, let us provide an explanation for some frequently used notations firstly.

Notations:

• The set of all bounded linear operators mapping from some Hilbert space to is denoted by , and the corresponding operator norm is denoted by .

• The range of some operator is denoted by , and the domain of the operator is denoted by .

• For a linear operator , its dual operator is denoted by .

• The notation stands for the expectation corresponding to the distribution of the data generated from the truth.

• For two sequences and of numbers, we denote by () that there are such that () for large enough. If , we write .

### 2.1. Preliminaries

Firstly, we will present a brief introduction about Hilbert scales  which provides powerful tools for measuring the smoothness of the noise, the forward operator and the samples of the prior. Let be a self-adjoint, positive-definite, trace class, linear operator with eigensystem . Considering , we know that is a densely defined, unbounded, symmetric and positive-definite operator in . Denote to be the norm defined on the Hilbert space . We define the Hilbert scales , with , where

 Sf:=∞⋂n=0D(C−n),(u,v)Ht:=(C−t/2u,C−t/2v),∥u∥Ht:=∥∥C−t/2u∥∥.

In addition, the norms defined above possess the following properties.

###### Lemma 2.1.

(Proposition 8.19 in ) Let be the Hilbert scale induced by the operator defined above. Then the following assertions hold:

1. Let . Then the space is densely and continuously embedded in .

2. If , then and is the dual space of .

3. Let

then the following interpolation inequality holds

 ∥u∥Hr≤∥u∥s−rs−qHq∥u∥r−qs−qHs,

where .

Next, let us provide some necessary notations about infinite sequences. For a sequence , we denote the -norm by , that is, . The hyperrectangle and Sobolev space of sequence of order and radius are the sets

 (2.1) Θβ(R)={ms∈ℓ2:supi≥1i1+2βm2i≤R},
 (2.2) Sβ(R)={ms∈ℓ2:∞∑i=1i2βm2i≤R}.

For a sequence , the hyperrectangle norm and Sobolev norm can be defined by

 (2.3) ∥ms∥2h,β=supi≥1i1+2βm2i,∥ms∥2β=∞∑i=1i2βm2i.
###### Definition 2.2.

A sequence is self-similar if, for some fixed positive constants , and ,

 (2.4) ρN∑i=Nm2i≥ϵMN−2β,∀N≥N0.

The class of self-similar elements of are denoted by . The parameters and are fixed and omitted from the notation.

This definition is employed by Giné and Nickl  and Bull  to remove a “small” set of undesirable truths from the model, which allows to generate candidate confidence sets for the true parameter that are routinely used in practice. In , the authors propose the polished tail condition which includes the set of self-similar sequences. The set of self-similar sequences has been shown natural once one has adopted the Bayesian setup with variable regularity prior probability measures.

### 2.2. Assumptions

In this section, we give the main assumptions employed in our work. We assume that the prior probability measure and the probability measure of the noise have the following form

 μ0=N(0,Cα),Pnoise=N(0,C1),

where is a self-adjoint, positive-definite, trace class, linear operator and is assumed to be a self-adjoint, positive-definite linear operator. The operator is assumed to be a self-adjoint and positive-definite, linear operator with bounded inverse, .

Let us firstly present the following assumptions which describe the relations between Hilbert space and the space of sequences.

Assumptions 1: The covariance operator has eigenpairs on Hilbert space . For the singular values , there exists a positive constant such that

 (2.5) λ2i≍i−2pd,∀i=1,2,….

The formula (2.5) means that there exist a series of constants and a positive constant such that with . Let , for all , we have

 (2.6) ∞∑i=1i−2pαd<∞.

Inequality (2.6) reflects that the operator is a trace class operator for .

Secondly, we give the assumptions which are mainly concerned with the interrelations between the three operators , and . Similar to the assumptions presented in , these assumptions reflect the ideas that

 (2.7) C1≃Cβ,A−1≃Cℓ,

for some , , where is used loosely to indicate two operators which induce equivalent norms. Specifically speaking, we make the following assumptions.

Assumptions 2: Let us denote and . Suppose there exist , , and for all such that

1. , where ;

2. ;

3. ;

4. ;

5. ;

6. ,  ;

7. .

At last, we assume the truth belongs to with . Denote for , and . Since

 (2.8) ∥u†∥2Hγ=∞∑i=1λ−2γi(u†i)2≍∞∑i=1i2pdγ(u†i)2,

we know that for some . In addition, we easily find that for some . Concerning the truth, we make the following assumption which is crucial for our estimations.

Assumption 3: We assume that for some , and the sequence induced by belongs to for some with .

## 3. Posterior contraction

In this section, we will present our main results and show the proof details. Due to the proofs are quite technical, it is worth to give a sketch of our main ideas firstly.

As stated in the introduction, we will introduce two auxiliary problems which are crucial for our analysis. Hence, it is necessary to give a clear summarization of the three problems employed in our work.

Original Problem: Under the assumption , we summarize the essential elements of the original inverse problem as follow

• Forward operator: ,

• Data: ,

• Prior probability measure: with ,

• Noise probability measure: .

Since the operator and can not be diagonalized simultaneously, we introduce to transform the forward operator to be the identity operator and propose the following transformed problem.

Transformed Problem: For the transformed inverse problem, the necessary elements can be summarized as follow

• Forward operator: ,

• Data: ,

• Prior probability measure: with ,

• Noise probability measure: .

The covariance operator of the transformed problem can not be diagonalized under the eigenbasis presented in Assumptions 1. So we can hardly obtain useful estimations of the corresponding eigenvalues, which inspired us to introduce the following artificial diagonal problem.

Artificial Diagonal Problem: For the artificial diagonal problem, the essential elements can be summarized as follow

• Forward operator: ,

• Data: ,

• Prior probability measure: with ,

• Noise probability measure: .

We will recast the artificial diagonal problem into the framework introduced in [19, 27] and prove a corresponding posterior consistency result. Then relations of the artificial diagonal problem and the transformed problem will be explored to provide a posterior contraction estimation for the transformed problem. At last, the general approach developed in  will be employed to transfer the posterior contraction estimation to our original problem.

### 3.1. Posterior contraction for the artificial diagonal problem

Denote . Relying on Assumptions 1, we easily know that

 (3.1) mi∼N(0,λ2(Δ−1+α)i)=N(0,c2(Δ−1+α)ii−2pd(Δ−1+α)).

Denote , and , equality (1.5) can be rewritten as follow

 (3.2) di=c(Δ−1+α)i~mi+1√nηi,for i=1,2,…

with and . For convenience, let us define

 (3.3) ~α:=pd(Δ−1+α)−12.

Since and , we find that . From the formula (3.3), we easily know that . Following the formula (2.2) employed in , we introduce log-likelihood for (relative to an infinite product of -distribution) as follow

 (3.4) ℓn(~α)=−12∞∑i=1(log(1+ni1+2~α)−n2i1+2~α+nd2i).

For the artificial diagonal problem, we denote to be the truth which is equal to , and denote and with . For convenience, let us define

 (3.5) ~m†:= ∞∑i=1~m†iϕi.

Similarly, for problem (3.2), we denote to be the truth. Considering and

 ∥m†∥Hγ+Δ−1=∥Tu†∥Hγ+Δ−1≍∥C12(Δ−1)u†∥Hγ+Δ−1=∥u†∥Hγ<∞,

we find that . Relying on Assumptions 1, we have

 ∥m†∥2Hγ+Δ−1= ∞∑i=1λ−2(γ+Δ−1)i(m†i)2 = ∞∑i=1c−2(γ+Δ−1)ii2pd(γ+Δ−1)(m†i)2.

The above equality indicates that

 (3.6) ∥m†s∥~β≲∥m†∥Hγ+Δ−1≲∥m†s∥~β,

where . Through similar deductions, we can also find that

 (3.7) ∥~m†s∥~β≲∥~m†∥Hγ+Δ−1≲∥~m†s∥~β.

With these preparations, we can define

 (3.8) hn(~α)=1+2~αn1/(1+2~α)logn∞∑i=1n2i1+2~α(logi)(~m†i)2(i1+2~α+n)2,

which is similar to the formula (5.1) defined in . In addition, we define

 (3.9) ~α––n (3.10) ¯¯¯¯~αn =inf{~α>0:hn(~α)>L(logn)2},

where and the infimum of the empty set is considered . Before going further, let us present some estimates about and .

###### Lemma 3.1.

For any constant , there exist and such that

 (3.11) inf~m†s∈S~β(R)~α––n≥ ~β−C0/logn, (3.12) sup~m†s∈Θ~βss(R)∩S~β(R)¯¯¯¯~αn≤ ~β+C1(loglogn)/logn,

for large enough.

The proof details are shown in the Appendix. Then we define

 (3.13) ^~αn:=argmax~α∈[0,logn]ℓn(~α)−C1loglognlogn,

where is a positive constant appeared in Lemma 3.1. Through some small modifications of the proof of Theorem 1 presented in , we have

 (3.14) inf~m†s∈S~β(R)∩Θ~βss(R)P0(~α––n−C1loglognlogn≤^~αn≤¯¯¯¯~αn−C1loglognlogn)→1.

Obviously, when restricted to the following interval

 (3.15) In:=[~α––n−C1loglognlogn,¯¯¯¯~αn−C1loglognlogn],

we have

 (3.16) ^~αn≤~βand^~αn≥~β−C2loglognlogn,

where . The notation will be used in all of the sections below.

Considering formula (3.3), we define

 (3.17) ^αn:=dp(^~αn+12)+1−Δ,

which is the estimation for the regularity index of the prior of . For , we easily deduce that

 (3.18) ^αn≤γ+d2p,and^αn≥γ+d2p−dC2loglognplogn.

Now, we provide the following theorem which gives the convergence rates estimation for our artificial diagonal problem.

###### Theorem 3.2.

Let be the posterior mean estimator for our artificial diagonal problem when the regularity index is , then we have

 (3.19) supm†s∈S~β(R)∩Θ~βss(R)E0{sup~α∈In∥^mdn(~α)−m†∥2}=O(ϵ2n),

where and .

Since the proof is not the main ingredient of our work, it has been postponed to the Appendix.

### 3.2. Posterior contraction for the transformed problem

Denote for . Now let us come back to the following transformed problem

 (3.20) d=m+1√nη,

where

 (3.21) η∼N(0,I)andm∼μm0=N(0,M(^αn))

with . Before diving into the discussions on posterior contraction, let us provide some important estimates. Consider the equation

 (3.22) (nM(α)+I)u=r,

Define the bilinear form ,

 (3.23) B(u,v):=(n1/2C−1/2T∗u,n1/2C−1/2T∗v)+(u,v),∀u,v∈H.
###### Definition 3.3.

Let . An element is called a weak solution of (3.22), if

 B(u,v)=(r,v),∀v∈H.
###### Lemma 3.4.

Under the Assumptions 2, for any , there exists a unique weak solution of (3.22). In addition, if with , the weak solution .

Since the proof is simple, we postpone the proof to the Appendix. Relying on the definition of weak solution and interpolation inequalities, we can prove the following lemma.

###### Lemma 3.5.

For any constant , under the Assumptions 2, the following norm bound holds:

 (3.24) ∥C12(α+Δ−1)−s2(nM(α)+I)−1C12(α+Δ−1)−s2∥B(H)≲n−1+sα+Δ−1.

For fluency of the description, we defer the proof to the Appendix. Denote to be the posterior probability measure for our original problem and to be the posterior probability measure for the transformed problem. The two posterior probability measures are all Gaussian due to our assumptions. Using the relations between and , we know that

 μmd^αn{m:∥m−m†∥≥Mn~ϵn}=μd^αn{u:∥Tu−Tu†∥≥Mn~ϵn},

where and are positive sequences satisfy and , respectively. Let us denote and to be the expectation and the conditional mean estimator with respect to the posterior probability measure , respectively. By Markov’s inequality, we find that

 (3.25) supm†s∈S~β(R)∩Θ~βss(R)E0μd^αn{u:∥Tu−Tu†∥≥Mn~ϵn}≤1M2n~ϵ2nsupm†s∈S~β(R)∩Θ~βss(R)E0sup~α∈InEα,n(∥Tu−Tu†∥2)+o(1).

Denote to be the expectation with respect to a probability measure with zero mean and the same covariance operator as the posterior probability measure of the original problem. Let , then we know that . Obviously, and are independent. Since

 (3.26) Eα,n∥m−m†∥2=Ec,α∥wn+^mn(α)−m†∥2,

we can insert the posterior mean estimator of our artificial diagonal problem into the righthand side of (3.25). Relying on Assumptions 2 and some calculations, we can prove the following theorem.

###### Theorem 3.6.

For every positive constant and an arbitrarily small positive constant , we have

 (3.27) supm†s∈S~β(R)∩Θ~βss(R)E0μd^αn{u:∥Tu−Tu†∥≥Mn~ϵn}→0,asn→∞,

where and .

###### Proof.

Denote , then the conditional mean estimator for the artificial diagonal problem has the following form

 (3.28) ^mdn(α)=m†−1nFdn(α)−1m†+1√nCα+Δ−1Fdn(α)−1η.

For the non-diagonal problem with , the conditional mean estimator can be written as follow

 (3.29) ^mn(α)=m†−1nFn(α)−1m†+1√nM(α)Fn(α)−1η,

where . Following (3.26) and recalling that and are independent, we have

 E0sup~α∈InEα,n∥m−m†∥2≤ E0sup~α∈InEc,α∥wn∥2+2E0sup~α∈In∥^mn(α)−^mdn(α)∥2 (3.30) +2E0sup~α∈In∥^mdn(α)−m†∥2 = I+II+III.

For the term III, Theorem 3.2 gives an appropriate estimation. For the term I, let us denote the covariance operator as which can be written as follow

 (3.31) Cpn=(nM(α)+I)−1M(α).

Considering , both of and are bounded linear operators. Because of is a trace class operator, we know that the operator is a compact operator and can be diagonalized, which implies that

. Then for a white noise

, we have the following estimate

 (3.32) I=sup~α∈InTr(Cpn)=sup~α∈InTr(M(α)1/2(nM(α)+I)−1M(α)1/2)=sup~α∈InE∥∥M(α)1/2(nM(α)+I)−1M(α)1/2ζ∥∥2=sup~α∈InE(Cs/2ζ,C−s/2M(α)1/2(nM(α)+I)−1M(α)1/2C−s/2Cs/2ζ)≤sup~α∈In∥∥C−s/2M(α)1/2(nM(α)+I)−1M(α)1/2C−s/2∥∥B(H)E∥Cs/2ζ∥2.

Choosing