 # Bayesian inverse problems with unknown operators

We consider the Bayesian approach to linear inverse problems when the underlying operator depends on an unknown parameter. Allowing for finite dimensional as well as infinite dimensional parameters, the theory covers several models with different levels of uncertainty in the operator. Using product priors, we prove contraction rates for the posterior distribution which coincide with the optimal convergence rates up to logarithmic factors. In order to adapt to the unknown smoothness, an empirical Bayes procedure is constructed based on Lepski's method. The procedure is illustrated in numerical examples.


## 1 Introduction

Bayesian procedures for solving inverse problems have become increasingly popular in recent years, cf. Stuart. In the inverse problem literature the underlying operator of the forward problem is typically assumed to be known. In practice, however, there might be some uncertainty in the operator which has to be taken into account by the procedure. While there are some frequentist approaches in the statistical literature to solve inverse problems with an unknown operator, the Bayesian point of view has not yet been analysed. The aim of this work is to fill this gap.

Let f be a function on a domain D and let Kϑ : L²(D) → L²(Q), ϑ ∈ Θ, be an injective, continuous linear operator depending on some parameter ϑ. We consider the linear inverse problem

 Y=Kϑf+εZ, (1.1)

where Z is Gaussian white noise in L²(Q) and ε > 0 is the noise level, which converges to zero asymptotically. If the operator is known, the inverse problem to recover f non-parametrically, i.e. as an element of the infinite dimensional space L²(D), from the observation Y is well studied, see for instance Cavalier. The Bayesian approach has been analysed by Knapik et al. with Gaussian priors, by Ray with non-conjugate priors and in many subsequent articles including [1, 2, 21, 23]. Also non-linear inverse problems have been successfully solved via Bayesian methods, for example, [3, 9, 27, 28, 29, 35].

Focussing on linear inverse problems, we will extend the Bayesian methodology to unknown operators. To this end, the unknown parameter ϑ is introduced in (1.1), where Kϑ may depend non-linearly on ϑ. Unknown operators are relevant in numerous applications. Examples include semi-blind and blind deconvolution for image analysis, where the operator is a convolution with some unknown convolution kernel [4, 20, 32]. More general integral operators such as singular layer potential operators appear in the context of partial differential equations, cf. the examples in [7, 15]. If the coefficients of the underlying PDE are unknown, then the operator itself is only partially known. A typical example of this type is the backwards heat equation where the solution of the PDE (with Dirichlet boundary conditions) is observed at some time and the aim is to estimate the initial value function f. Here, we take into account an unknown diffusivity parameter ϑ. The solution depends linearly on f and the resulting operator Kϑ admits a singular value decomposition (SVD) with respect to the sine basis with ϑ-dependent singular values, cf. Section 6. In particular, the resulting inverse problem is severely ill-posed.

Even without measurement errors the target function is in general no longer identifiable for unknown operators, i.e., there may be several pairs (f, ϑ) solving the equation y = Kϑf. For instance, if Kϑ admits an SVD with known orthonormal systems and unknown singular values ϑ = (ϑk)k, then Kϑf = Kcϑ(f/c) for any function f and any scalar c ≠ 0. We thus require some extra information.

There are different approaches in the inverse problem literature to deal with this identifiability problem, particularly in the context of semi-blind or blind deconvolution. One approach is to find the so-called minimum norm solution which has a minimal distance to some a priori estimates of f and the operator, cf. [4, 20]. Another idea is to assume that some approximation of the unknown operator is available for the reconstruction of f, cf. [18, 32]. Similarly, we may assume to have some noisy observation of the unknown parameter ϑ, which then allows us to construct an estimator for f.

In this paper we will study this last setting. More precisely, we suppose that the parameter set Θ is (a subset of) some Hilbert space and we consider the additional sample

 T=ϑ+δW (1.2)

where W is white noise on Θ, independent of Z, and δ > 0 is some noise level. Thereby, ϑ is considered as a nuisance parameter and we will not impose any regularity assumptions on ϑ. Our aim is the estimation of f from the observations (1.1) and (1.2). This setting includes several exemplary models with different levels of uncertainty in the operator:

1. If Θ is finite dimensional, we have a parametric characterization of the operator and T can be understood as an independent estimator of ϑ.

2. Cavalier and Hengartner have studied the case where the eigenfunctions of Kϑ are known, but only noisy observations of the singular values are available: Tk = ϑk + δWk with i.i.d. standard normal random variables Wk. In this case Θ = ℓ², supposing Kϑ is Hilbert–Schmidt, and ϑ = (ϑk)k⩾1 is the sequence of singular values of Kϑ.

3. Efromovich and Koltchinskii, Hoffmann and Reiß as well as Marteau have assumed the operator to be completely unknown and considered additional observations of the form

 L=K+δW

where the operator K is blurred by some independent white noise W on the space of linear operators from L²(D) to L²(Q) with some noise level δ. Fixing bases of L²(D) and L²(Q), respectively, K is characterised by an infinite matrix and W can be identified with a random matrix consisting of i.i.d. standard Gaussian entries.
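To fix ideas, the observation scheme (1.1)–(1.2) can be simulated in its sequence-space form. The following sketch generates data from a diagonal operator as in Model B; the signal, singular values and noise levels are purely illustrative choices, not taken from the paper:

```python
import math
import random

random.seed(1)

N = 200                  # truncation level of the discretised sequence model
eps, delta = 0.01, 0.02  # noise levels of Y and T (illustrative)

# hypothetical truth: Sobolev-type signal and polynomially decaying singular values
f0 = [math.sin(k) / k ** 2 for k in range(1, N + 1)]
theta0 = [k ** -1.5 for k in range(1, N + 1)]   # mildly ill-posed operator

# observations (1.1) and (1.2) in sequence space, coefficient by coefficient
Y = [theta0[k] * f0[k] + eps * random.gauss(0, 1) for k in range(N)]
T = [theta0[k] + delta * random.gauss(0, 1) for k in range(N)]
```

In this discretisation the three models above differ only in how much of `theta0` is considered unknown.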

In contrast to the just mentioned articles [6, 11, 16, 25], we will investigate the Bayesian approach. We thus put a prior distribution Π on the pair (f, ϑ). Denoting the probability density of the observations (Y, T) under the parameters (f, ϑ) with respect to some reference measure by p_{f,ϑ}, the posterior distribution given the observations is given by Bayes’ theorem:

$$\Pi(B\,|\,Y,T)=\frac{\int_B p_{f,\vartheta}(Y,T)\,\mathrm{d}\Pi(f,\vartheta)}{\int_{L^2\times\Theta}p_{f,\vartheta}(Y,T)\,\mathrm{d}\Pi(f,\vartheta)}\tag{1.3}$$

for measurable subsets B ⊆ L² × Θ. Due to the white noise model, the density p_{f,ϑ} inherits the nice structure of the normal distribution, cf. Section 2. Although we cannot hope for nice conjugate pairs of prior and posterior distributions due to the non-linear structure of ϑ ↦ Kϑf, there are efficient Markov chain Monte Carlo algorithms to draw from the posterior, cf. Tierney.

To analyse the behaviour of the posterior distribution, we will take a frequentist point of view and assume the observations are generated under some true, but unknown f₀ and ϑ₀. In a first step we will identify general conditions on a prior under which the posterior for f concentrates in a neighbourhood of f₀ with a certain rate of contraction ξ_{ε,δ}: We show for some constant D the convergence

$$\Pi\big(f\in L^2(D):\|f-f_0\|>D\xi_{\varepsilon,\delta}\,\big|\,Y,T\big)\to0\tag{1.4}$$

in probability as ε and δ go to zero. This contraction result verifies that the whole probability mass of the posterior distribution is asymptotically located in a small ball around f₀ with a radius of order ξ_{ε,δ}. Hence, draws from the posterior distribution will be close to the unknown function with high probability. This especially implies that the posterior mean and the posterior median are consistent estimators of the unknown function f₀. Interestingly, the difficulty of recovering f from Y is the same in all three above mentioned models.

The proof of the contraction result follows general principles developed by Ghosal et al. The analysis of the posterior distribution requires controlling both the numerator in (1.3) and the normalising constant. To find a lower bound for the latter, a so-called small ball probability condition is imposed, ensuring that the prior puts some minimal weight in a neighbourhood of the truth. Given this bound, the contraction theorem can be shown by constructing sufficiently powerful tests of the hypothesis f = f₀ against the alternative ∥f − f₀∥ > Dξ_{ε,δ} for the constant D from (1.4). To find the test, we follow Giné and Nickl and use a plug-in test based on a frequentist estimator. This estimator is obtained by the Galerkin projection method, as proposed in [11, 16] for Model C.

The main difficulty is that without structural assumptions an infinite dimensional nuisance parameter ϑ cannot be consistently estimated. We thus cannot expect a concentration of the posterior for ϑ. Why should the posterior for f then concentrate around the truth? Fortunately, f is regular, such that a finite dimensional projection suffices to reconstruct f with high accuracy. Under the reasonable assumption that this projection of Kϑf depends only on a finite dimensional projection of ϑ, we can indeed estimate f without estimating the full parameter ϑ. Similarly, we show in the Bayesian setting that a concentration of this finite dimensional projection is sufficient, resulting in a small ball probability condition depending only on finite dimensional projections of f and ϑ.

The conditions of the general result are verified in the mildly ill-posed case and in the severely ill-posed case, assuming some Sobolev regularity of f. We use a truncated product prior for f and a product prior on ϑ. Choosing the truncation level of the prior in an optimal way, the resulting contraction rates coincide, up to logarithmic factors, with the minimax optimal rates which are known in several models. These rates are indeed the same as for the known operator case, cf. Ray, if δ ≲ ε.

Since the optimal level depends on the unknown regularity of f, a data-driven procedure to select it is desirable. There are basically two ways to tackle this problem. Setting a hyper prior on the truncation level, a hierarchical Bayes procedure could be considered. Alternatively, although not purely Bayesian, we can try to select the level empirically from the observations and then use this choice in the Bayes procedure. Both possibilities are only rarely studied for inverse problems. For known operators, the few articles on this topic include Knapik et al., who consider both constructions with a Gaussian prior on f, and Ray, who has considered a sieve prior which can be interpreted as a hierarchical Bayes procedure. We will follow the second path and choose the truncation level empirically using Lepski’s method, which yields an easy to implement procedure (note that the former work used a maximum likelihood approach to estimate the level). We prove that the final adaptive procedure attains the same rate as the optimized non-adaptive method.
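Lepski's method compares estimators across truncation levels and selects the coarsest level that is statistically indistinguishable from all finer ones. A generic sketch of this selection rule (the function name, the distance and the constant `kappa` are our own illustrative choices, not the paper's):

```python
import math

def lepski(estimates, noise_bounds, kappa=1.2):
    """Select a truncation level by Lepski's method.

    estimates[j]    : coefficient list of the estimator at truncation level j
                      (finer levels have more coefficients)
    noise_bounds[j] : bound on the stochastic error at level j (increasing in j)
    Returns the smallest j whose estimate is compatible with every finer level.
    """
    J = len(estimates)

    def dist(a, b):
        # L2 distance, treating the missing coarse coefficients of a as zero
        n = min(len(a), len(b))
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n))
                         + sum(x * x for x in b[len(a):]))

    for j in range(J):
        if all(dist(estimates[j], estimates[k]) <= kappa * noise_bounds[k]
               for k in range(j + 1, J)):
            return j
    return J - 1
```

For instance, if all refinements only add small coefficients, the coarsest level is kept; a large coefficient appearing at a finer level forces the selection to move up.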

This paper is organised as follows: The posterior distribution is derived in Section 2. The general contraction theorem is presented in Section 3. In Section 4 specific rates for Sobolev regular functions in the mildly and the severely ill-posed case are determined using a truncated product prior. An adaptive choice of the truncation level is constructed in Section 5. In Section 6 we discuss the implementation of the Bayes method using a Markov chain Monte Carlo algorithm and illustrate the method in two numerical examples. All proofs are postponed to Section 7.

## 2 Setting and posterior distribution

Let us fix some notation: ⟨·,·⟩ and ∥·∥ denote the scalar product and the norm of L²(D) or L²(Q). We write A ≲ B if there is some universal constant c > 0 such that A ⩽ cB. If A ≲ B and B ≲ A, we write A ∼ B. We recall that the noise process Z in (1.1) is the standard iso-normal process, i.e., ⟨Z,g⟩ is N(0, ∥g∥²)-distributed for any g ∈ L²(Q) and the covariances are given by

 E[⟨Z,g1⟩⟨Z,g2⟩]=⟨g1,g2⟩for all g1,g2∈L2(Q).

Note that Z cannot be realised as an element of L²(Q), but only as a Gaussian process (⟨Z,g⟩)_{g∈L²(Q)}.

The observation scheme (1.1) is equivalent to observing

 ⟨Y,g⟩=⟨Kϑf,g⟩+ε⟨Z,g⟩for all g∈L2(Q).

Choosing an orthonormal basis (φk)_{k⩾1} of L²(Q) with respect to the standard L²-scalar product, we obtain the series representation

 Yk:=⟨Y,φk⟩=⟨Kϑf,φk⟩+εZk

for i.i.d. standard normal random variables Zk := ⟨Z, φk⟩. Note that the distribution of (Zk)_{k⩾1} does not depend on the choice of the basis. If Kϑ is compact, it might be tempting to choose (φk) from the singular value decomposition of Kϑ, simplifying the coefficients ⟨Kϑf, φk⟩. However, such a basis of eigenfunctions will in general depend on the unknown ϑ and thus cannot be used. Since the (Zk) are i.i.d., the distribution of the vector (Yk)_{k⩾1} is given by

 PYϑ,f=⨂k⩾1N(⟨Kϑf,φk⟩,ε2).

By Kakutani’s theorem, cf. Da Prato [8, Theorem 2.7], this law is equivalent to the law P^Y_0 of the white noise εZ. Writing Y = Kϑf + εZ with some abuse of notation, since Y is not in L²(Q), we obtain the density

$$\frac{\mathrm{d}P^Y_{\vartheta,f}}{\mathrm{d}P^Y_0}=\exp\Big(\frac1\varepsilon\langle K_\vartheta f,Z\rangle-\frac1{2\varepsilon^2}\sum_{k\geqslant1}\langle K_\vartheta f,\varphi_k\rangle^2\Big)=\exp\Big(\frac1{\varepsilon^2}\langle K_\vartheta f,Y\rangle-\frac1{2\varepsilon^2}\|K_\vartheta f\|^2\Big),$$

where we have used Y = εZ under P^Y_0 for the second equality.

Since any continuous operator can be described by an infinite matrix, we may assume without loss of generality that Θ is a sequence space. The distribution P^T_ϑ of the observation T is then similarly given by a product of normal distributions, being equivalent to the law P^T_0 of δW. Writing ⟨ϑ,T⟩ and ∥ϑ∥ for the corresponding series, we obtain the density

$$\frac{\mathrm{d}P^T_{\vartheta}}{\mathrm{d}P^T_0}=\exp\Big(\frac1{\delta^2}\langle\vartheta,T\rangle-\frac1{2\delta^2}\|\vartheta\|^2\Big).$$

Therefore, the likelihood of the observations (Y, T) with respect to P^Y_0 ⊗ P^T_0 is given by

$$\frac{\mathrm{d}\big(P^Y_{\vartheta,f}\otimes P^T_{\vartheta}\big)}{\mathrm{d}\big(P^Y_0\otimes P^T_0\big)}=\exp\Big(\frac1{\varepsilon^2}\langle K_\vartheta f,Y\rangle-\frac1{2\varepsilon^2}\|K_\vartheta f\|^2+\frac1{\delta^2}\langle\vartheta,T\rangle-\frac1{2\delta^2}\|\vartheta\|^2\Big).\tag{2.1}$$

Applying a prior Π on the parameter pair (f, ϑ), we obtain the posterior distribution

$$\Pi(B\,|\,Y,T)=\frac{\int_B\exp\big(\varepsilon^{-2}\langle K_\vartheta f,Y\rangle-(2\varepsilon^2)^{-1}\|K_\vartheta f\|^2+\delta^{-2}\langle\vartheta,T\rangle-(2\delta^2)^{-1}\|\vartheta\|^2\big)\,\mathrm{d}\Pi(f,\vartheta)}{\int_{L^2\times\Theta}\exp\big(\varepsilon^{-2}\langle K_\vartheta f,Y\rangle-(2\varepsilon^2)^{-1}\|K_\vartheta f\|^2+\delta^{-2}\langle\vartheta,T\rangle-(2\delta^2)^{-1}\|\vartheta\|^2\big)\,\mathrm{d}\Pi(f,\vartheta)},\quad B\in\mathcal{B},\tag{2.2}$$

with the Borel σ-algebra 𝓑 on L² × Θ. Under the frequentist assumption that Y and T are generated under some true f₀ and ϑ₀, we obtain the representation

$$\Pi(B\,|\,Y,T)=\frac{\int_B p_{f,\vartheta}(Z,W)\,\mathrm{d}\Pi(f,\vartheta)}{\int_{L^2\times\Theta}p_{f,\vartheta}(Z,W)\,\mathrm{d}\Pi(f,\vartheta)},\quad B\in\mathcal{B},\tag{2.3}$$

for

$$p_{f,\vartheta}(z,w):=\exp\Big(\frac1\varepsilon\langle K_\vartheta f-K_{\vartheta_0}f_0,z\rangle-\frac1{2\varepsilon^2}\|K_\vartheta f-K_{\vartheta_0}f_0\|^2+\frac1\delta\langle\vartheta-\vartheta_0,w\rangle-\frac1{2\delta^2}\|\vartheta-\vartheta_0\|^2\Big)$$

corresponding to the density of the law of the observations under (f, ϑ) with respect to their law under (f₀, ϑ₀).

Note that even if a Gaussian prior is chosen, the posterior distribution is in general not Gaussian, since ϑ ↦ Kϑf might be non-linear. Hence, the posterior distribution cannot be calculated explicitly in most cases, but has to be approximated by an MCMC algorithm, see for instance Tierney and Section 6.
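As an illustration of such an MCMC approximation, the following random-walk Metropolis–Hastings sketch targets a posterior of the form (2.2) in a low-dimensional sequence-space model with a diagonal operator. All numerical choices (truth, prior, step size, chain length) are illustrative assumptions, not taken from the paper:

```python
import math
import random

random.seed(0)
N, eps, delta = 3, 0.1, 0.1

# synthetic observations from a diagonal operator (hypothetical truth)
f0, th0 = [1.0, 0.5, 0.25], [1.0, 0.6, 0.4]
Y = [th0[k] * f0[k] + eps * random.gauss(0, 1) for k in range(N)]
T = [th0[k] + delta * random.gauss(0, 1) for k in range(N)]

def log_post(f, th):
    # log-likelihood as in (2.1) plus a standard Gaussian log-prior on all coefficients
    ll = sum((th[k] * f[k]) * Y[k] / eps ** 2
             - (th[k] * f[k]) ** 2 / (2 * eps ** 2)
             + th[k] * T[k] / delta ** 2
             - th[k] ** 2 / (2 * delta ** 2)
             for k in range(N))
    return ll - 0.5 * sum(x * x for x in f + th)

# random-walk Metropolis-Hastings on the joint state (f, theta),
# initialised at the plug-in values f_k = Y_k / T_k and theta = T
f, th = [Y[k] / T[k] for k in range(N)], list(T)
trace = []
for _ in range(5000):
    f_prop = [x + 0.05 * random.gauss(0, 1) for x in f]
    th_prop = [x + 0.05 * random.gauss(0, 1) for x in th]
    if math.log(random.random()) < log_post(f_prop, th_prop) - log_post(f, th):
        f, th = f_prop, th_prop
    trace.append(f[0])

post_mean = sum(trace[2500:]) / len(trace[2500:])  # posterior mean of f_1 after burn-in
```

The acceptance step only requires evaluating the log-likelihood (2.1), so no conjugacy is needed.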

## 3 Contraction rates

For simplicity we suppose throughout that L²(D) = L²(Q) and assume Kϑ to be self-adjoint. The general case is discussed in Remark 7.

Taking a frequentist point of view, we assume that the observations (1.1) and (1.2) are generated by some fixed unknown f₀ ∈ L²(D) and ϑ₀ ∈ Θ. As a first main result, the following theorem gives general conditions on the prior which ensure a contraction rate for the posterior distribution from (2.3) around the true f₀.

Let (φλ)_{λ∈Λ} be an orthonormal basis of L²(D). We use here the double-index notation λ = (j, l), which is especially common for wavelet bases, but the single-indexed notation is also included if the second index takes only one value. For any index λ = (j, l) we write |λ| := j. Let moreover (Vj)_{j⩾0} be a sequence of approximation spaces with dimensions dj associated to the basis. We impose the following compatibility assumption on the operator and the approximation spaces:

###### Assumption 1.

There is some m ∈ ℕ₀ such that Kϑ(Vj) ⊆ V_{j+m} for any ϑ ∈ Θ and all j ⩾ 0.

If Kϑ is compact and admits an orthonormal basis of eigenfunctions which does not depend on ϑ, then this assumption is trivially satisfied for m = 0 and approximation spaces spanned by the eigenfunctions. On the other hand, the assumption allows for more flexibility in the choice of the approximation spaces and can be compared to Condition 1 by Ray. As a typical example, the possibly ϑ-dependent eigenfunctions of Kϑ may be the trigonometric basis of L²([0,1]) while the Vj are generated by band-limited wavelets.

Having the basis and thus (Vj) fixed, we write ∥A∥_{Vj→Vj} for the operator norm of any bounded linear operator A: Vj → Vj, where Vj is equipped with the L²-norm. We denote by Pj the orthogonal projection of L²(D) onto Vj and define the operator

$$K_{\vartheta,j}:=P_jK_\vartheta|_{V_j}$$

as the restriction of Kϑ to an operator from Vj to Vj. Note that K_{ϑ,j} is given by a finite dimensional matrix.

###### Assumption 2.

Let K_{ϑ,j} depend only on the finite dimensional projection Pjϑ of ϑ, i.e. K_{ϑ,j} = K_{Pjϑ,j}. Moreover, let ϑ ↦ K_{ϑ,j} be Lipschitz continuous with respect to ϑ in the following sense:

$$\|K_{\vartheta,j}-K_{\vartheta',j}\|_{V_j\to V_j}\leqslant L\|P_j(\vartheta-\vartheta')\|_j\quad\text{for all }\vartheta,\vartheta'\in\Theta,$$

where L is a constant independent of j and where ∥·∥j is a norm on PjΘ. We suppose moreover that the projected white noise PjW satisfies a Gaussian concentration inequality with respect to ∥·∥j.

Although the projections on L²(D) and on Θ are both denoted by Pj, it will always be clear from the context which one is used, such that this abuse of notation is quite convenient. Since K_{ϑ,j} is fully described by a matrix, we naturally have an upper bound on its operator norm in terms of its entries. Let us illustrate the previous assumptions in the Models A, B and C from the introduction:

###### Examples 3.
1. In Model A we have a finite dimensional parameter space Θ with a fixed dimension. Assumption 1 is, for instance, satisfied if Kϑ is a convolution operator with a kernel whose Fourier transform has compact support and if we choose a band-limited wavelet basis. Note that in this case we do not have to know the SVD of Kϑ. For Assumption 2 we may choose ∥·∥j as the Euclidean distance on Θ, leading to a finite dimensional Lipschitz condition. The concentration inequality then follows from the Gaussian concentration of W.

2. In Model B let Kϑ be compact, let (ei)_{i⩾1} be an orthonormal basis consisting of eigenfunctions with corresponding eigenvalues (ρ_{ϑ,i})_{i⩾1} and let (φλ) be a wavelet basis fulfilling Assumption 1. Then the assumptions are satisfied if there is some constant C₂ such that ⟨ei, φλ⟩ ≠ 0 only if i ⩽ C₂d_{|λ|}. Since then Pj ei = 0 for any i > C₂dj, we moreover have for any v ∈ Vj

$$\big\|(K_{\vartheta,j}-K_{\vartheta',j})v\big\|^2=\Big\|P_j\sum_{i\geqslant1}(\rho_{\vartheta,i}-\rho_{\vartheta',i})\langle e_i,v\rangle e_i\Big\|^2\leqslant\sup_{i\leqslant C_2d_j}|\rho_{\vartheta,i}-\rho_{\vartheta',i}|^2\sum_{i\leqslant C_2d_j}\langle e_i,v\rangle^2\leqslant\sup_{i\leqslant C_2d_j}|\rho_{\vartheta,i}-\rho_{\vartheta',i}|^2\,\|v\|^2.$$

We thus choose ∥·∥j as the supremum norm over the first C₂dj coefficients. Since the (Wk) are i.i.d. Gaussian, we have for some c > 0

$$P\Big(\sup_{k\leqslant C_2d_j}|W_k|>\kappa+\sqrt{c\log d_j}\Big)\leqslant2C_2d_je^{-(\kappa^2+c\log d_j)/2}\leqslant2C_2e^{-\kappa^2/2}.$$

Therefore, Assumption 2 is satisfied.

3. In Model C the projected operators K_{ϑ,j} are given by dj × dj matrices. Assumption 1 is satisfied if and only if all these matrices are band matrices with some fixed bandwidth independent of ϑ and j. To verify Assumption 2, ∥·∥j can be chosen as the operator norm (spectral norm) of these matrices. The Lipschitz condition is then obviously satisfied. Moreover, PjW is a random matrix whose entries are i.i.d. standard Gaussian random variables. A standard result for i.i.d. random matrices is the bound √dj for the expected operator norm, cf. [33, Cor. 2.3.5]. Together with the Borell–Sudakov–Tsirelson concentration inequality for Gaussian processes, cf. [14, Thm. 2.5.8], we immediately obtain the concentration inequality in Assumption 2.
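The Gaussian maximum bound derived for Model B above can be checked by a quick Monte Carlo experiment; the dimension and constants below are arbitrary illustrative choices:

```python
import math
import random

random.seed(42)

d = 200        # number of i.i.d. standard Gaussian coordinates (stands in for C2*d_j)
kappa = 2.0
threshold = kappa + math.sqrt(2 * math.log(d))   # kappa + sqrt(c log d) with c = 2

trials = 2000
exceed = sum(
    max(abs(random.gauss(0, 1)) for _ in range(d)) > threshold
    for _ in range(trials)
)
rate = exceed / trials                     # empirical exceedance probability
bound = 2 * math.exp(-kappa ** 2 / 2)      # union bound 2*d*e^{-(kappa^2 + 2 log d)/2}
```

The empirical exceedance rate stays (far) below the dimension-free bound, reflecting that the union bound is rather loose.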

Finally, the degree of ill-posedness of the inverse problem can be quantified by the smoothing effect of the operator:

###### Assumption 4.

For a decreasing sequence (σj)_{j⩾0} and some constant C ⩾ 1, let the operator satisfy C⁻¹σj∥v∥ ⩽ ∥Kϑv∥ ⩽ Cσj∥v∥ for all v ∈ Vj ⊖ V_{j−1} and all ϑ ∈ Θ.

Note that Assumptions 1 and 4 with σj → 0 imply that Kϑ is compact, because it can be approximated by a sequence of operators with finite dimensional ranges. The rate of decay of σj will determine the degree of ill-posedness of the inverse problem. If σj decays polynomially or exponentially, we obtain a mildly or severely ill-posed problem, respectively.

Recall that the nuisance parameter ϑ cannot be consistently estimated without additional assumptions. Therefore, we study the contraction rate of the marginal posterior distribution of f. While we allow for a general prior Πf on f, we will use a product prior on ϑ. For probability densities (βk)_{k⩾1} on ℝ we thus consider prior distributions of the form

 dΠ(f,ϑ)=dΠf(f)⊗⨂k⩾1βk(ϑk)dϑk. (3.1)
###### Theorem 5.

Consider the model (1.1), (1.2) generated by some f₀ ∈ L²(D) and ϑ₀ ∈ Θ with noise levels ε = εn and δ = δn for n ∈ ℕ, respectively, and let Assumptions 1, 2 and 4 be satisfied. Let (Πn)n be a sequence of prior distributions of the form (3.1) on the Borel σ-algebra on L² × Θ. Let (ξn)n, (κn)n be two positive sequences converging to zero and (jn)n a sequence of integers with d_{jn} → ∞. Suppose κn/(εn ∨ δn) → ∞ as n → ∞ as well as

$$d_{j_n}\leqslant c_1\frac{\kappa_n^2}{(\varepsilon_n\vee\delta_n)^2},\qquad\frac{\kappa_n}{\sigma_{j_n}}\leqslant c_2\xi_n\qquad\text{and}\qquad\frac{\delta_n}{\sigma_{j_n}}\sqrt{d_{j_n}}\to0$$

for constants c1, c2 > 0 and all n. Let (Fn)n be a sequence of subsets of L²(D) and C1 > 0 a constant such that

$$\Pi_n\big(L^2\setminus F_n\big)\leqslant e^{-(C_1+4)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}.\tag{3.2}$$

Moreover, assume for sufficiently large n that

$$\Pi_n\Big((f,\vartheta)\in V_{j_n}\times\Theta:\frac{\|P_{j_n+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2}{\varepsilon_n^2}+\frac{\|P_{j_n+m}(\vartheta-\vartheta_0)\|^2}{\delta_n^2}\leqslant\frac{\kappa_n^2}{(\varepsilon_n\vee\delta_n)^2}\Big)\geqslant e^{-C_1\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}.\tag{3.3}$$

Then there exists a finite constant D > 0 such that the posterior distribution from (2.3) satisfies

$$\Pi_n\big(f\in V_{j_n}:\|f-f_0\|>D\xi_n\,\big|\,Y,T\big)\to0\tag{3.4}$$

as n → ∞ in probability under the true parameters (f₀, ϑ₀).

Theorem 5 states that the posterior distribution is consistent and asymptotically concentrates its whole probability mass in a ball around the true f₀ with decaying radius Dξn, that is, the posterior “contracts to f₀” with the rate ξn. This result is similar to Ray [30, Theorem 2.1], who has proven a corresponding theorem for known operators. However, the contraction rate is now determined by the maximum εn ∨ δn instead of εn alone, which is natural in view of the results by Hoffmann and Reiß, who have included the case of an unknown operator in their frequentist analysis.

To gain some intuition on the interplay between jn and the noise levels, let us set for simplicity m = 0 in Assumption 1 and consider f ∈ Vjn. Using Assumption 4 (with Lemma 14) and Assumption 2, we can then decompose

$$\begin{aligned}\|f-f_0\|&\leqslant\|P_{j_n}f_0-f_0\|+\|f-P_{j_n}f_0\|\\&\lesssim\|P_{j_n}f_0-f_0\|+\sigma_{j_n}^{-1}\|K_\vartheta f-K_\vartheta P_{j_n}f_0\|\\&\leqslant\|P_{j_n}f_0-f_0\|+\sigma_{j_n}^{-1}\|K_\vartheta f-K_{\vartheta_0}P_{j_n}f_0\|+\sigma_{j_n}^{-1}L\|P_{j_n}(\vartheta-\vartheta_0)\|_j\|f_0\|.\end{aligned}$$

The first term in the last line is the approximation error ∥Pjn f₀ − f₀∥, bounded by the regularity of f₀. It corresponds to the classical bias. Indeed, the prior sequence is concentrated on a subset of Vjn due to (3.2), such that the projection of f₀ to the level jn serves as a reference point for the prior and the deterministic error remains bounded by the approximation error. The last two terms in the previous display correspond to the stochastic errors in Y and T and are of the order σjn⁻¹κn owing to the minimal spread of the prior imposed by the small ball probability condition (3.3). In particular, we recover the ill-posedness of the inverse problem due to σjn in the denominator. To obtain the best possible contraction rate, we need to choose jn in a way that balances the deterministic and the stochastic error. The conditions on the dimension djn are mild technical assumptions.

The crucial small ball probability assumption (3.3) ensures that the prior sequence has some minimal mass in a neighbourhood of the underlying f₀ and ϑ₀. The distance from the truth is measured in a (semi-)metric which reflects the structure of our inverse problem. If εn and δn were of the same order, it would be sufficient if both projected distances in (3.3) are smaller than κn. However, condition (3.3) is more subtle. Firstly, the maximum of εn and δn on the right-hand side within the probability introduces some difficulties. The prior has to weight a smaller neighbourhood of f₀ or ϑ₀, respectively, depending on whether εn is smaller than δn or the other way around. If, for instance, the contraction rate is determined by εn, the prior still has to put enough probability on the smaller δn-ball around ϑ₀. We see such effects also in the construction of lower bounds, where in the extreme case there may be a positive distance between f and f₀ while the laws of the observations remain close. Secondly, (3.3) depends only on finite dimensional projections of both f and ϑ. This is particularly important as we do not assume any regularity conditions on ϑ, such that we cannot expect the projection remainder ϑ₀ − Pjϑ₀ to be small.

To allow for this relaxed small ball probability condition, the contraction statement is restricted to the set Vjn. The result can be extended to the whole of L²(D) by appropriate constructions of the prior; in particular, if the support of Πn is contained in Vjn, we can immediately replace Vjn by L²(D) in (3.4). Another possibility is a general product prior if the basis is chosen according to the singular value decomposition of the operator.

To prove Theorem 5, we use the techniques by Ghosal et al. [12, Thm. 2.1], cf. also [14, Thm. 7.3.5]. A main step is the construction of tests for the testing problem

$$H_0:f=f_0\qquad\text{vs.}\qquad H_1:f\in F_n,\ \|f-f_0\|\geqslant D\xi_n.$$

To this end, we first study a frequentist estimator of f which then allows us to construct a plug-in test as proposed by Giné and Nickl.

The natural estimator for ϑ is T itself. In order to estimate f, we use a linear Galerkin method based on the perturbed operator K_T, similar to the approaches in [11, 16]. We thus aim for a solution f̂ ∈ Vj of

$$\langle K_T\hat f_{\varepsilon,\delta},v\rangle=\langle Y,v\rangle\quad\text{for all }v\in V_j.\tag{3.5}$$

Choosing v from a basis of Vj, we obtain a system of linear equations depending only on the projected operator K_{T,j}. There is a unique solution if K_{T,j} is invertible. Noting that for the unperturbed operator Assumption 4 yields a bound on the inverse in terms of σj (cf. Lemma 14 below), we set

$$\hat f_j:=\begin{cases}K_{T,j}^{-1}P_jY,&\text{if }\|K_{T,j}^{-1}\|_{V_j\to V_j}\leqslant\tau/\sigma_j,\\0,&\text{otherwise},\end{cases}\tag{3.6}$$

for a projection level j and a cut-off parameter τ > 0. Adopting ideas from [13, 16], we obtain the following non-asymptotic concentration result for the estimator f̂j.
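For a diagonal operator as in Model B, the Galerkin estimator (3.6) reduces to a coefficientwise division with a cut-off; a minimal sketch (the diagonal specialisation and the parameter names are ours, not the paper's):

```python
def galerkin_estimator(Y, T, sigma_j, tau=2.0):
    """Truncated Galerkin estimator (3.6), specialised to a diagonal operator:
    K_{T,j} is diag(T), so the norm of its inverse is 1 / min_k |T_k|."""
    inv_norm = 1.0 / min(abs(t) for t in T)
    if inv_norm <= tau / sigma_j:
        return [y / t for y, t in zip(Y, T)]   # solve the Galerkin system (3.5)
    return [0.0] * len(Y)                      # cut-off case of (3.6)
```

For instance, `galerkin_estimator([0.5, 0.2], [1.0, 0.5], sigma_j=0.5)` inverts the operator, while a near-singular `T` triggers the cut-off and returns the zero function.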

###### Proposition 6.

Let j ∈ ℕ and κ > 0. Under Assumptions 2 and 4 there are constants such that, if κ and δ√dj/σj are sufficiently small, then f̂j from (3.6) fulfils

$$P_{f,\vartheta}\Big(\|\hat f_j-f\|\geqslant C\sigma_j^{-1}(\|f\|\vee1)\kappa+\|f-P_jf\|\Big)\leqslant3e^{-\kappa^2/(\varepsilon\vee\delta)^2}.$$

Note that some care will be needed to analyse the above mentioned tests, since the stochastic error term also depends on the unknown function f and, for instance, a Gaussian prior on f will not sufficiently concentrate on a fixed ball {f : ∥f∥ ⩽ R}.

###### Remark 7.

While the assumption that Kϑ is self-adjoint simplifies the analysis and the presentation of our approach, the methodology can be generalised to general compact operators Kϑ : L²(D) → L²(Q). In this case Assumption 4 should be replaced by a corresponding assumption on the normal operator K*ϑKϑ, which is consistent with the original condition, cf. Remark 15. The Galerkin projection method (3.5) can then be generalised by solving

$$\langle K_T^*K_T\hat f_{\varepsilon,\delta},v\rangle=\langle Y,K_Tv\rangle\quad\text{for all }v\in V_j,$$

cf. Cohen et al. [7, Appendix A]. This modified estimator should behave similarly to the one above, such that we can construct the tests which we need to prove Theorem 5. The rest of the proof of the contraction theorem and the subsequent results would remain as before.

## 4 A truncated product prior and the resulting rates

For the sake of clarity we fix an S-regular wavelet basis (φ_{j,l}) of L²([0,1]) with the associated approximation spaces (Vj). We write Pj as before. Investigating a bounded domain, we have in particular dj ≲ 2^j. The regularity of f will be measured in the Sobolev balls

$$H^s(R):=\Big\{f\in L^2([0,1]):\|f\|_{H^s}^2:=\sum_{j=-1}^{\infty}2^{2sj}\sum_l\langle f,\varphi_{j,l}\rangle^2\leqslant R^2\Big\},\quad s\in\mathbb{R}.\tag{4.1}$$

We will use Jackson’s inequality and the Bernstein inequality: For s ⩽ t and g ∈ Vj, we have

$$\|(\mathrm{Id}-P_j)f\|_{H^s}\lesssim2^{-j(t-s)}\|f\|_{H^t}\quad\text{and}\quad\|g\|_{H^t}\lesssim2^{j(t-s)}\|g\|_{H^s}.\tag{4.2}$$
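The decay in Jackson's inequality can be observed numerically: projecting a smooth function onto piecewise constants on dyadic intervals (the Haar-type case), the L²-error should roughly halve per level, matching the rate 2^{-j} for t − s = 1. A small self-contained check with an arbitrary test function:

```python
import math

def proj_error(f, j, n=4096):
    """L2 distance between f and its projection onto piecewise constants
    over 2^j dyadic intervals of [0,1], computed via a Riemann sum on n points."""
    vals = [f((i + 0.5) / n) for i in range(n)]
    per = n // 2 ** j                       # sample points per dyadic interval
    err2 = 0.0
    for b in range(2 ** j):
        block = vals[b * per:(b + 1) * per]
        mean = sum(block) / per             # the L2-projection is the local mean
        err2 += sum((v - mean) ** 2 for v in block) / n
    return math.sqrt(err2)

# for a Lipschitz function the error decays like 2^{-j}
errs = [proj_error(lambda x: math.sin(2 * x), j) for j in range(2, 7)]
```

Successive errors shrink by a factor close to 1/2 per level, as predicted.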
###### Remark 8.

The subsequent analysis applies also to the trigonometric as well as the sine basis in the case of periodic functions. Considering more specifically , we may set for . Since holds for any , it is then easy to see that the inequalities (4.2) are satisfied for if is replaced by . Alternatively we may set which gives exactly (4.2).

For ϑ we use the product prior as in (3.1) with a fixed density β = βk for all k. For f we also apply a product prior. More precisely, we take a prior determined by the random series

$$f(x)=\sum_{|k|\leqslant J}\tau_{|k|}\Phi_k\varphi_k(x),\quad x\in[0,1],$$

for a sequence of scaling factors (τ_{|k|}), i.i.d. random coefficients Φk (independent of ϑ) distributed according to a density α and a cut-off level J. Hence,

$$\mathrm{d}\Pi(\vartheta,f)=\prod_{|k|\leqslant J}\tau_{|k|}^{-d}\,\alpha\big(\tau_{|k|}^{-1}f_k\big)\,\mathrm{d}f_k\cdot\prod_{k\geqslant1}\beta(\vartheta_k)\,\mathrm{d}\vartheta_k.\tag{4.3}$$

Under appropriate conditions on the densities α and β and on the scaling sequence (τ_{|k|}) we will verify the conditions of Theorem 5.
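Sampling from the truncated product prior (4.3) is straightforward; the sketch below takes α and β standard normal and a polynomial scaling τ_k, both purely illustrative choices consistent with Assumption 9 below:

```python
import random

random.seed(7)

def sample_prior(J, s=1.5, K_theta=50):
    """One draw from the product prior (4.3): f has J scaled coefficients
    tau_k * Phi_k with tau_k = k^{-s-1/2} (mimicking Sobolev regularity s);
    theta is an unscaled Gaussian sequence, truncated at K_theta for sampling."""
    f = [k ** (-s - 0.5) * random.gauss(0, 1) for k in range(1, J + 1)]
    theta = [random.gauss(0, 1) for _ in range(K_theta)]
    return f, theta

f_draw, theta_draw = sample_prior(J=100)
```

The polynomial decay of the scaling factors makes draws of f lie in a Sobolev ball with high probability, while no regularity is imposed on ϑ.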

###### Assumption 9.

There are constants Γ, γ > 0 such that the densities α and β satisfy

$$\alpha(x)\wedge\beta(x)\geqslant\Gamma e^{-\gamma|x|^2}\quad\text{for all }x\in\mathbb{R}.$$

Assumption 9 is very weak and is satisfied for many distributions with unbounded support, for example, the Gaussian, Cauchy and Laplace distributions or Student’s t-distribution. Uninformative priors where α or β is constant are also included. A consequence of the previous assumption is that any random variable Φ with probability density α (or β) satisfies

 P(|Φ−x|⩽κ)⩾Γ∫|y|⩽κe−γ|