Linear inverse problems have been studied since long in the statistical and numerical analysis literature; see, e.g., alquier2011inverse , bissantz2007inverseregularization , cavalier2008nonparametricinverseproblems , cavalier2002sharp , cohen2004adaptive_galerkin , donoho1995WVD , kaipio2006statistical , kirsch2011introduction , wahba1977integraloperator
, and references therein. Emphasis in these works has been on the signal-in-white noise model,
where the parameter of interest lies in some infinite-dimensional function space, is a linear operator with values in a possibly different space, is white noise, and is the noise level. Applications of linear inverse problems include, e.g., computerized tomography, see natterer2001mathematics_CTisakov2013inverse , and scattering theory, see colton2012inverse_acoustic_electromagnetic .
Arguably, in practice one does not have access to a full record of observations on the unknown function as in the idealised model (1), but rather one indirectly observes it at a finite number of points. This statistical setting can be conveniently formalised as follows: let the signal of interest be an element in a Hilbert space of functions defined on a compact interval . The forward operator maps to another Hilbert space . We assume that , are subspaces of , typically collections of functions of certain smoothness as specified in the later sections, and that the design points are chosen deterministically,
Assuming continuity of and defining
i.i.d. standard Gaussian random variables, our observations are the pairsand we are interested in estimating . A prototype example we think of is the case when is the solution operator in the Dirichlet problem for the heat equation acting on the initial condition ; see Example 2.8 below for details.
Model (3) is related to the inverse regression model studied e.g. in birke2010inverse_regression and bissantz2012inverse_regression . Although the setting we consider is somewhat special, our contribution is arguably the first one to study from a theoretical point of view a nonparametric Bayesian approach to estimation of in the inverse problem setting with partial observations (see aad:book:2017 for a monographic treatment of modern Bayesian nonparametrics). In the context of the signal-in-white noise model (1), a nonparametric Bayesian approach has been studied thoroughly in knapik2011bayesianmild and knapik2013bayesianextreme , and techniques from these works will turn out to be useful in our context as well. Our results will deal with derivation of posterior contraction rates and study of asymptotic frequentist coverage of Bayesian credible sets. A posterior contraction rate can be thought of as a Bayesian analogue of a convergence rate of a frequentist estimator, cf. vandervaart2000posterior and aad:book:2017 . Specifically, we will show that as the sample size , the posterior distribution concentrates around the ‘true’ parameter value, under which data have been generated, and hence our Bayesian approach is consistent and asymptotically recovers the unknown ‘true’ . The rate at which this occurs will depend on the smoothness of the true parameter and the prior and the ill-posedness degree of the problem. Correct combinations of these values lead to optimal posterior contraction rates (up to logarithmic factors). Furthermore, a Bayesian approach automatically provides uncertainty quantification in parameter estimation through the spread of the posterior distribution, specifically by means of posterior credible sets. We will give an asymptotic frequentist interpretation of these sets in our context. In particular, we will see that the frequentist coverage will depend on a combination of smoothness of the true parameter and the prior, and the ill-posedness of the problem. Oversmoothing priors lead to zero coverage, while undersmoothing priors produce highly conservative results.
The article is organized as follows: in Section 2, we give a detailed description of the problem, introduce the singular value decomposition and convert the model (3) into an equivalent truncated sequence model that is better amenable to our theoretical analysis. We show how a Gaussian prior in this sequence model leads to a Gaussian posterior and give an explicit characterisation of the latter. Our main results on posterior contraction rates and Bayesian credible sets are given in Section 3, followed by simulation examples in Section 4 that illustrate our theoretical results. Section 5 contains the proofs of the main theorems, while the technical lemmas used in the proofs are collected in Section 6.
The notational conventions we use in this work are the following: definitions are marked by the symbol; denotes the absolute value and indicates the norm related to the space ; is understood as the canonical inner product in the inner product space ; subscripts are omitted when there is no danger of confusion;
denotes the Gaussian distribution with meanand covariance operator ; subscripts and may be used to emphasize the fact that the distribution is defined on the space or on the abstract space ; denotes the covariance or the covariance operator, depending on the context; for positive sequences of real numbers, the notation and mean respectively that there exist positive constants independent of such that or hold for all ; finally, indicates that the ratio is asymptotically bounded from zero and infinity, while means as .
2 Sequence model
2.1 Singular value decomposition
We impose a common assumption on the forward operator from the literature on inverse problems, see, e.g., alquier2011inverse , bissantz2007inverseregularization and cavalier2008nonparametricinverseproblems .
Operator is injective and compact.
It follows that is also compact and in addition self-adjoint. Hence, by the spectral theorem for self-adjoint compact operators, see conway1990courseonFA , we have a representation where and are the eigenbasis on
and eigenvalues, respectively, (corresponding to the operator), and are the Fourier coefficients of . This decomposition of is known as the singular value decomposition (SVD), and are also called singular values.
It is easy to show that the conjugate basis of the orthonormal basis is again an orthonormal system in and gives a convenient basis for , the range of in . Furthermore, the following relations hold (see alquier2011inverse ),
Recall a standard result (see, e.g., haase2014functional ): a Hilbert space is isometric to and Parseval’s identity holds; here are the Fourier coefficients with respect to some known and fixed orthonormal basis.
We will employ the eigenbasis of to define the Sobolev space of functions. This will define the space in which the unknown function resides.
We say is in the Sobolev space with smoothness parameter , if it can be written as with and if its norm is finite.
The above definition agrees with the classical definition of the Sobolev space if the eigenbasis is the trigonometric basis, see, e.g., tsybakov2008introduction . With a fixed basis, which is always the case in this article, one can identify the function and its Fourier coefficients . Thus, we use to denote both the function space and the sequence space. For example, it is easy to verify that (correspondingly ), for any nonnegative , and when .
Recall that Then we have if and if decays exponentially fast. Such a lifting property is beneficial in the forward problem, since it helps to obtain a smooth solution. However, in the context of inverse problems it leads to a difficulty in recovery of the original signal , since information on it is washed out by smoothing. Hence, in the case of inverse problems one does not talk of the lifting property, but of ill-posedness, see cavalier2008nonparametricinverseproblems .
An inverse problem is called mildly ill-posed, if as , and extremely ill-posed, if with as , where is strictly positive in both cases.
In the rest of the article, we will confine ourselves to the following setting.
As an immediate consequence of the lifting property, we have .
We conclude this section with two canonical examples of the operator .
Example 2.7 (mildly ill-posed case: Volterra operator knapik2011bayesianmild )
The classical Volterra operator and its adjoint are
The eigenvalues, eigenfunctions of
The eigenvalues, eigenfunctions ofand the conjugate basis are given by
Example 2.8 (extremely ill-posed case: heat equation knapik2013bayesianextreme )
Consider the Dirichlet problem for the heat equation:
where is defined on and satisfies . The solution of (5) is given by
where are the coordinates of in the basis .
For the solution map , the eigenvalues of are , the eigenbasis and conjugate basis coincide and .
2.2 Equivalent formulation
In this subsection we develop a sequence formulation of the model (3), which is very suitable for asymptotic Bayesian analysis. First, we briefly discuss the relevant results that provide motivation for our reformulation of the problem.
In Examples 2.7 and 2.8, the sine and cosine bases form the eigenbasis. In fact, the Fourier basis (trigonometric polynomials) frequently arises as an eigenbasis for various operators, e.g. in the case of differentiation, see efromovich1998differentiation , or circular deconvolution, see cavalier2002sharp . For simplicity, we will use Fourier basis as a primary example in the rest of the article. Possible generalization to other bases is discussed in Remark 2.10.
Restriction of our attention to the Fourier basis is motivated by its special property: discrete orthogonality. The next lemma illustrates this property for the sine basis (Example 2.8).
Lemma 2.9 (discrete orthogonality)
Let be the sine basis, i.e.
Discrete orthogonality holds:
Here is the Kronecker delta.
Fix . For any fixed and all , there exits only one depending only on the parity of , such that for the equality
holds, while for all such that
For other trigonometric bases, discrete orthogonality can also be attained. Thus, the conjugate eigenbasis in Example 2.7 is discretely orthogonal with design points . We refer to akansu2010GDFTs and references therein for details. With some changes in the arguments, our asymptotic statistical results still remain valid with such modifications of design points compared to (2). We would like to stress the fact that restricting attention to bases with discrete orthogonality property does constitute a loss of generality. However, there exist classical bases other than trigonometric bases that are discretely orthogonal (possibly after a suitable modification of design points). See, for instance, quarteroni2010numerical for an example of Lagrange polynomials.
Motivated by the observations above, we introduce our central assumption on the basis functions.
Using the shorthand notation
we obtain for that
By Assumption 2.11, we have
which leads to (via Cauchy-Schwarz)
Hence, for a mildly ill-posed problem, i.e. the following bound holds, uniformly in the ellipsoid
for any when .
If the problem is extremely ill-posed, i.e. we use the inequality
Since , it follows that is up to a constant bounded from above by . Hence
In knapik2011bayesianmild , knapik2013bayesianextreme , the Gaussian prior is employed on the coordinates of the eigenbasis expansion of . If the sum is convergent, and hence this prior is the law of a Gaussian element in .
In our case, we consider the same type of the prior with an additional constraint that only the first components of the prior are non-degenerate, i.e. , where is as above. In addition, we assume the prior on is independent of the noise in (8). With these assumptions in force, we see for Furthermore, the posterior can be obtained from the product structure of the model and the prior via the normal conjugacy,
We also introduce
where We conclude this section with a useful fact that will be applied in later sections:
3 Main results
3.1 Contraction rates
In this section, we determine the rate at which the posterior distribution concentrates on shrinking neighbourhoods of the ‘true’ parameter as the sample size grows to infinity.
Assume the observations in (3) have been collected under the parameter value . Thus our observations given in (8) have the law . We will use the notation to denote the posterior distribution given in (12).
Theorem 3.12 (Posterior contraction: mildly ill-posed problem)
If the problem is mildly ill-posed as with , the true parameter with , and furthermore by letting with and any positive satisfying , we have, for any and ,
if , then ;
if and , then ;
if , then for every scaling ,
Thus we recover the same posterior contraction rates as obtained in knapik2011bayesianmild , at the cost of an extra constraint . The frequentist minimax convergence rate for mildly ill-posed problems in the white noise setting with is see cavalier2008nonparametricinverseproblems . We will compare our result to this rate. Our theorem states that in case (i.) the posterior contraction rate reaches the frequentist optimal rate if the regularity of the prior matches the truth and the scaling factor is fixed. Alternatively, as in case (ii.), the optimal rate can also be attained by proper scaling, provided a sufficiently regular prior is used. In all other cases the contraction rate is slower than the minimax rate. Our results are similar to those in knapik2011bayesianmild in the white noise setting. The extra constraint that we have in comparison to that work demands an explanation. As (10) shows, the size of negligible terms in (8) decreases as the smoothness of the transformed signal increases. In order to control , a minimal smoothness of is required. The latter is guaranteed if for it is known that in that case will be at least continuous, while it may fail to be so if , see tsybakov2008introduction .
The control on from (9) depends on the fact that the eigenbasis possesses the properties in Assumption 2.11. If instead of Assumption 2.11 (ii.) one only assumes for any and , the constraint on the smoothness of has to be strengthened to in order to obtain the same results as in Theorem 3.12, because the condition guarantees that the control on in (10) remains valid.
Now we consider the extremely ill-posed problem. The following result holds.
Theorem 3.14 (Posterior contraction: extremely ill-posed problem)
Let the problem be extremely ill-posed as with , and let the true parameter with Let with and any positive satisfying . Then
for any and , where
if , then ,
if for some , then .
Furthermore, if with , the following contraction rate is obtained:
Since the frequentist minimax estimation rate in extremely ill-posed problems in the white noise setting is (see cavalier2008nonparametricinverseproblems ), Theorem 3.14 shows that the optimal contraction rates can be reached by suitable choice of the regularity of the prior, or by using an appropriate scaling. In contrast to the mildly ill-posed case, we have no extra requirement on the smoothness of . The reason is obvious: because the signal is lifted to by the forward operator , the term (11) converges to zero exponentially fast, implying that in (8) is always negligible.
3.2 Credible sets
In the Bayesian paradigm, the spread of the posterior distribution is a common measure of uncertainty in parameter estimates. In this section we study the frequentist coverage of Bayesian credible sets in our problem.
When the posterior is Gaussian, it is customary to consider credible sets centered at the posterior mean, which is what we will also do. In addition, because in our case the covariance operator of the posterior distribution does not depend on the data, the radius of the credible ball is determined by the credibility level and the sample size . A credible ball centred at the posterior mean from (13) is given by
where the radius is determined by the requirement that
By definition, the frequentist coverage or confidence of the set (17) is
where the probability measure is the one induced by the law ofgiven in (8) with . We are interested in the asymptotic behaviour of the coverage (19) as for a fixed uniformly in Sobolev balls, and also along a sequence changing with .
The following two theorems hold.
Theorem 3.15 (Credible sets: mildly ill-posed problem)
Theorem 3.16 (Credible sets: extremely ill-posed problem)
1, uniformly in , if ;
1, uniformly in with with small enough;
1, for any fixed ,
provided the condition holds;
0, along some with , if .
Moreover, if with and any positive satisfying , the asymptotic coverage of the credible set (17) is
0, for every such that for some .
For the two theorems in this section, the most intuitive explanation is offered by the case . The situations (i.), (ii.) and (iii.) correspond to , and , respectively. The message is that the oversmoothing prior ((iii.) in Theorem 3.15 and (iii.), (iv.) in Theorem 3.16) leads to disastrous frequentist coverage of credible sets, while the undersmoothing prior ((i.) in both theorems) delivers very conservative frequentist results (coverage 1). With the right regularity of the prior (case (ii.)), the outcome depends on the norm of the true parameter . Our results are thus similar to those obtained in the white noise setting in knapik2011bayesianmild and knapik2013bayesianextreme .
4 Simulation examples
In this section we carry out a small-scale simulation study illustrating our theoretical results. Examples we use to that end are those given in Subsection 2.1. These were also used in simulations in knapik2011bayesianmild and knapik2013bayesianextreme .
In the setting of Example 2.7, we use the following true signal,
It is easy to check that .
In the setup of Example 2.8, the initial condition is assumed to be
One can verify that in this case
and for any .
First, we generate noisy observations from our observation scheme (3) at design points in the case of Volterra operator, and in the case of the heat equation. Next, we apply the transform described in (8) and obtain transformed observations . Then, by (12), the posterior of the coefficients with the eigenbasis is given by
Figures 1 and 2 display plots of -credible bands for different sample sizes and different priors. For all priors we assume , and use different smoothness degrees , as shown in the titles of the subplots. In addition, the columns from left to right corresponds to and observations. The (estimated) credible bands are obtained by generating realizations from the posterior and retaining of them that are closest in the -distance to the posterior mean.
Two simulations reflect several similar facts. First, because of the difficulty due to the inverse nature of the problem, the recovery of the true signal is relatively slow, as the posteriors for the sample size are still rather diffuse around the true parameter value. Second, it is evident that undersmoothing priors (the top rows in the figures) deliver conservative credible bands, but still capture the truth. On the other hand, oversmoothing priors lead to overconfident, narrow bands, failing to actually express the truth (bottom rows in the figures). As already anticipated due to a greater degree of ill-posedness, recovery of the initial condition in the heat equation case is more difficult than recovery of the true function in the case of the Volterra operator. Finally, we remark that qualitative behaviour of the posterior in our examples is similar to the one observed in knapik2011bayesianmild and knapik2013bayesianextreme ; for larger samples sizes , discreteness of the observation scheme does not appear to have a noticeably adversary effect compared to the fully observed case in knapik2011bayesianmild and knapik2013bayesianextreme .
5.1 Proof of Lemma 2.9
This proof is a modification of the one of Lemma 1.7 in tsybakov2008introduction . With the following temporary definitions and using Euler’s formula, we have
Observe that when , we have
Similarly, if ,
We fix and discuss different situations depending on .
Since , we always have , and the terms and can be calculated as above. Similarly, since , only when . Moreover, and have the same parity, and so is only possible if is even.
In this case, . This leads to .
Further, if , we have and . Otherwise, if , we have and (since is even), and so
which implies (22) equals .
and . We have . Arguing as above, if is odd, and . If is even, and .
The remaining cases follow the same arguments, and hence we omit the (lengthy and elementary) calculations.
It can be shown that always holds.
When is even, one obtains , where . Otherwise, for odd , where .
5.2 Proof of Theorem 3.12
In this proof we use the notation . To show
we first apply Markov’s inequality,
) and the bias-variance decomposition,
where is given in (12). Because is deterministic,
Since is assumed, it suffices to show that the terms in brackets are bounded by a constant multiple of uniformly in in the Sobolev ellipsoid.
Using (14), we obtain