Bayesian linear inverse problems in regularity scales

02/25/2018 ∙ by Shota Gugushvili et al.

We obtain rates of contraction of posterior distributions in inverse problems defined by scales of smoothness classes. We derive abstract results for general priors, with contraction rates determined by Galerkin approximation. The rate depends on the amount of prior concentration near the true function and on the prior mass of functions with inferior Galerkin approximation. We apply the general result to non-conjugate series priors, which are shown to give near optimal and adaptive recovery in some generality, to Gaussian priors, and to mixtures of Gaussian priors, where the latter are also shown to be near optimal and adaptive. The proofs are based on general testing and approximation arguments, without explicit calculations on the posterior distribution. We are thus not restricted to priors based on the singular value decomposition of the operator. We illustrate the results with examples of inverse problems resulting from differential equations.


1 Introduction

In a statistical inverse problem one observes a noisy version of a transformed signal and wishes to recover the unknown parameter . In this paper we consider linear inverse problems of the type

(1.1)

where is a known bounded linear operator between separable Hilbert spaces and , and is a stochastic ‘noise’ process, multiplied by the scalar ‘noise level’ . The problem is to infer from the observation . To this end we assume that the forward operator is injective, but we shall be interested in the case that the inverse , defined on the range of , is not continuous (equivalently, the range of is not closed in ). The problem of recovering from is then ill-posed, and regularization methods are necessary in order to ‘invert’ the operator . These consist of constructing an approximation to the inverse, with natural properties such as boundedness and a domain that includes the data, and applying this approximation to the observation. By the discontinuity of the inverse, the noise present in the observation is necessarily amplified, and regularization focuses on balancing the approximation error against the size of the amplified noise, in order to obtain a solution that is as close as possible to the true signal. In this article we study this problem through the rate of convergence of the regularized solutions to the true parameter as the noise level tends to zero. In particular, we consider contraction rates of posterior distributions resulting from a Bayesian approach to the problem.
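For concreteness, a generic way to write such a model is the following display, in illustrative notation that need not coincide with the paper's own symbols: K for the forward operator, μ for the unknown signal, Z for the noise process and ε for the noise level.

\[
Y^{(\varepsilon)} \;=\; K\mu + \varepsilon Z, \qquad K\colon \mathbb{H}_1 \to \mathbb{H}_2 \ \text{bounded, linear and injective},
\]

with $\mu \in \mathbb{H}_1$ the unknown signal, $Z$ the noise process and $\varepsilon > 0$ the noise level.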

There is a rich literature on inverse problems. The case that the noise is a bounded deterministic perturbation has been particularly well studied, and various general procedures and methods to estimate the convergence rates of regularized solutions have been proposed. See the monographs [15, 29]. The case of stochastic noise is less studied, but is receiving increasing attention. In this paper we shall be mostly interested in the case that is white noise indexed by the Hilbert space , i.e. the isonormal process, which is characterized by the requirement that is a zero-mean Gaussian variable with variance , for every , where and are the inner product and norm in . Actually the isonormal process cannot be realized as a Borel-measurable map into , and hence we need to interpret eq. 1.1 in a generalized sense. In our measurement model the observation will be a stochastic process such that

(1.2)

where is the isonormal process, i.e. a zero-mean Gaussian process with covariance function . The processes and are viewed as measurable maps in the sample space , with its product σ-field. Statistical sufficiency considerations show that the observation can also be reduced to the vector , which takes values in the sample space , for any orthonormal basis of . Since in that case the variables are stochastically independent standard normal variables, the coordinates of this vector are independent random variables with normal distributions with means and variance . This is known as the Gaussian sequence model in statistics, albeit presently the ‘drift function’ involves the operator . See [27, 5] and references therein.
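In the same illustrative notation, with $(e_i)$ an orthonormal basis of the image space, the sufficiency reduction just described reads

\[
Y_i \;:=\; \bigl\langle Y^{(\varepsilon)}, e_i \bigr\rangle \;=\; \langle K\mu, e_i \rangle + \varepsilon Z_i, \qquad Z_i \ \text{i.i.d.}\ N(0,1), \quad i = 1, 2, \dots,
\]

which is the Gaussian sequence model with drift coefficients $\langle K\mu, e_i\rangle$.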

An alternative method to give a rigorous interpretation to white noise is to embed into a bigger space in which can be realized as a Borel measurable map, or to think of as a cylindrical process. See e.g., [45]. For a set of functions on an interval, one can also realize as a stochastic integral relative to Brownian motion, which takes its values in the ‘abstract Wiener space’ attached to . We shall not follow these constructions, as they imply the stochastic process version eq. 1.2, which is easier to grasp and will be the basis for our proofs.

It is also possible to consider the model eq. 1.1 with a noise variable that takes its values inside the Hilbert space . In this paper we briefly note some results on this ‘coloured noise’ model, but our main focus is model eq. 1.2.

The study of statistical (nonparametric) linear inverse problems was initiated by Wahba in the 1970s in [53]. The 1990s paper [13] used wavelet shrinkage methods; around 2000, the authors of [10] investigated eq. 1.1 in the setting of linear partial differential equations, and a systematic study of Gaussian sequence models was presented in [9]. A review of work until 2008 is given in [8]. The connection of regularization methods to the Bayesian approach was recognized early on. However, the study of the recovery properties of posterior distributions was started only in [30, 31]. A review of the Bayesian approach to inverse problems, with many examples, is given in [46].

In the present paper we follow the Bayesian approach. This consists of putting a probability measure on , the prior, that quantifies one’s prior beliefs on , and next, after collecting the data, updating the prior to the posterior measure, through Bayes’ formula. As always, this is the conditional distribution of given in the model, where follows the prior measure , a Borel probability distribution on , and given the variable has the conditional distribution on determined by eq. 1.2. For a given the latter conditional distribution is dominated by its distribution under . The Radon-Nikodym densities of the conditional distributions can be chosen jointly measurable in , and by Bayes’ formula the posterior distribution of is the Borel measure on given by

(1.3)

The form of the densities is given by the (abstract) Cameron-Martin formula, but will not be needed in the following. In the Bayesian paradigm the posterior distribution encompasses all the necessary information for inference on . An attractive feature of the Bayesian approach is that it not only offers an estimation procedure, through a measure of ‘center’ of the posterior distribution, but also provides a way to conduct uncertainty quantification, through the spread in the posterior distribution.
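In illustrative notation, with $\Pi$ the prior and $p_\mu$ the density of the data under $\mu$ relative to a fixed dominating law (for instance the law under $\mu = 0$), the posterior distribution of eq. 1.3 takes the familiar form

\[
\Pi\bigl(B \mid Y^{(\varepsilon)}\bigr) \;=\; \frac{\int_B p_\mu\bigl(Y^{(\varepsilon)}\bigr)\, d\Pi(\mu)}{\int p_\mu\bigl(Y^{(\varepsilon)}\bigr)\, d\Pi(\mu)}, \qquad B \ \text{a Borel set in the parameter space}.
\]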

One hopes that as the noise level tends to zero, i.e. , the posterior measures eq. 1.3 will contract to a Dirac measure at if in reality is generated through the model eq. 1.2 with . We shall be interested in the rate of contraction. Following [17, 19, 20] we say that a sequence is a rate of posterior contraction to if, for a fixed sufficiently large constant , and ,

(1.4)

We shall use the general approach to establishing such rates of contraction, based on a prior mass condition and a testing condition, as explained in [19].
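Spelled out in illustrative symbols (M for the constant, ξ_ε for the rate sequence, μ_0 for the true signal, and a generic norm on the parameter space), the contraction statement of eq. 1.4 requires

\[
\Pi\bigl(\mu : \|\mu - \mu_0\| \ge M\,\xi_\varepsilon \mid Y^{(\varepsilon)}\bigr) \;\longrightarrow\; 0 \qquad \text{in probability, as } \varepsilon \to 0,
\]

under the law of $Y^{(\varepsilon)}$ generated by the true signal $\mu_0$; the sequence $\xi_\varepsilon$ is then called a rate of posterior contraction.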

Much of the existing work on statistical inverse problems is based on the singular value decomposition (SVD) of the operator ; see, e.g., [8]. When is compact, the operator , where is the adjoint of , can be diagonalized with respect to an orthonormal eigenbasis , with eigenvalues tending to zero. The observation can then be reduced to noisy observations on the Fourier coefficients of in the eigenbasis, which are multiples of the Fourier coefficients of , and the problem is to recover the latter. In the frequentist setup thresholding or other regularization methods can be applied to reduce the weight of estimates of coefficients corresponding to smaller eigenvalues, for which the noise will overpower the signal. In the Bayesian setup one may design a prior by letting the Fourier coefficients be (independent) random variables, with smaller variances for smaller eigenvalues. These singular value methods have several disadvantages, as pointed out in [12, 13]. First, the eigenbasis functions might not be easy to compute. Second, and more importantly, these functions are directly linked to the operator , and need not be related to the function space (smoothness class) that is thought to contain the true signal . Consequently, the parameter of interest may not have a simple, parsimonious representation in the eigenbasis expansion, see [13]. Furthermore, it is natural to consider series expansions of the signal in bases other than the eigenbasis, for instance when one can only measure noisy coefficients of the signal in a given basis expansion, due to a particular experimental setup. See [21, 38] for further discussion.
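A sketch of the SVD reduction just described, again in illustrative notation: if $K^*K e_i = \kappa_i^2 e_i$ with singular values $\kappa_i \downarrow 0$ and $K e_i = \kappa_i f_i$ for an orthonormal system $(f_i)$, then

\[
\bigl\langle Y^{(\varepsilon)}, f_i \bigr\rangle \;=\; \kappa_i\, \mu_i + \varepsilon Z_i, \qquad \mu_i = \langle \mu, e_i\rangle, \quad Z_i \ \text{i.i.d.}\ N(0,1),
\]

and a typical SVD-based prior models the coefficients $\mu_i$ as independent $N(0, \tau_i^2)$ variables, with variances $\tau_i^2$ that decrease along with the $\kappa_i$.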

One purpose of the present paper is to work with priors that directly relate to common bases (e.g., spline or wavelet bases) and function spaces, rather than to the operator through its singular value decomposition. We succeed in this aim under the assumption that the operator respects a given scale of function spaces. A canonical example is the scale of Sobolev spaces, with the operator being an integral operator. This Sobolev space setup with a wavelet basis was investigated in [13, 12]. In deterministic inverse problems a more general setup, in which acts along a scale of nested Hilbert spaces (a Hilbert scale), was initiated by Natterer in [39] and further developed in, amongst others, [38, 37, 26]. In the Bayesian context Hilbert scales were used in [16], under the assumption that the noise is a proper Gaussian element in , and in [1], under rather intricate assumptions.

A second purpose of the present paper is to allow priors that are not necessarily Gaussian. In the linear inverse problem Gaussian priors are convenient, as they lead to Gaussian posterior distributions, which can be studied by direct means. Most of the results on Bayesian inverse problems fall in this framework ([30, 31], [16], [1]), an exception being [43].

Thus in this paper we investigate a Bayesian approach to linear inverse problems that is not based on the SVD and does cover non-conjugate, non-Gaussian priors.

The white noise model represents a limiting case (in an appropriate sense) of the inverse regression model, in which the observations are noisy point evaluations of the transformed signal with independent standard normal errors. Insights gained in inverse problems in the white noise model shed light on the behaviour of statistical procedures in the inverse regression model, which is the one encountered in actual practice, as the signal can typically be observed only on a discrete grid of points. It is then sometimes possible to extend theoretical results obtained in the white noise setting to the inverse regression setting.
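A generic form of the inverse regression model referred to here, with illustrative symbols (design points $x_i$, noise standard deviation $\sigma$, sample size $n$), is

\[
Y_i \;=\; (K\mu)(x_i) + \sigma Z_i, \qquad Z_i \ \text{i.i.d.}\ N(0,1), \quad i = 1, \dots, n,
\]

with the white noise model arising, informally, in the limit of a dense design with noise level $\varepsilon \asymp \sigma/\sqrt{n}$.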

The paper is organized as follows. In Section 2 we introduce our setup in greater detail, along with the assumptions used in this article, and present some examples for illustration. We present a general contraction theorem in Section 3, and apply it to two main special cases, series priors and Gaussian priors, in Section 4 and Section 5. Since the simple Gaussian prior is not fully adaptive, we introduce Gaussian mixture priors to obtain adaptation in Section 6. In Section 7 we discuss several extensions of the present work. Section 8 contains the proofs, and an appendix presents background to some of the tools we need in the proofs.

Notation 1.1.

The symbols mean up to a positive multiple independent of (or of another asymptotic parameter). The constant may be stated explicitly in a subscript, and e.g. means that it depends on .

2 Setup

In this section we formalize the structure of the inverse problem that will be worked out in this article.

Smoothness classes

The function in eq. 1.1 is an element of a Hilbert space . We embed this space as the space in a ‘scale of smoothness classes’, defined as follows.

Definition 2.1 (Smoothness scale).

For every the space is an infinite-dimensional, separable Hilbert space, with inner product and induced norm . The spaces satisfy the following conditions:

  1. For the space is a dense subspace of and , for .

  2. For and viewed as an element of ,

    (2.1)

The notion of scales of smoothness classes is standard in the literature on inverse problems. In the preceding definition we have stripped it to the bare essentials needed in our general result on posterior contraction. Concrete examples, as well as more involved structures such as Hilbert scales, are introduced below.

Remark 2.2.

The norm duality eq. 2.1 is implied if, for , the space can be identified with the dual space of and the embedding is the adjoint of the embedding , after the usual identification of and its dual space . (The three nested spaces then form a ‘Gelfand triple’.) Indeed, by definition the image of under the adjoint is the map from . The norm of this map as an element of is . The norm duality follows if is identified with the element .

Since every is a Hilbert space, one can also identify with itself in the usual way, but this involves the inner product in , and is different from the identification of with the ‘bigger space’ .
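A typical way in which the norm duality eq. 2.1 and the Gelfand triple structure are expressed, writing $S^s$ for the space of smoothness $s$ in the scale (illustrative notation), is

\[
|\langle f, g \rangle_{S^0}| \;\le\; \|f\|_{S^{s}}\, \|g\|_{S^{-s}} \qquad\text{and}\qquad \|f\|_{S^{-s}} \;=\; \sup_{0 \ne g \in S^{s}} \frac{\langle f, g\rangle_{S^0}}{\|g\|_{S^{s}}}, \qquad S^{s} \subset S^{0} \subset S^{-s}.
\]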

We assume that the smoothness scale allows good finite-dimensional approximations, as in the following condition.

Assumption 2.3 (Approximation).

For every and , for some , there exists a -dimensional linear subspace and a number such that as , and such that

(2.2)
(2.3)

This assumption is also common in the literature on inverse problems. The two inequalities eq. 2.2 and eq. 2.3 are known as inequalities of Jackson and Bernstein type, respectively, see, e.g., [6]. The approximation property eq. 2.2 shows that ‘smooth elements’ are well approximated in by their projection onto a finite-dimensional space , with approximation error tending to zero as the dimension of tends to infinity. Naturally one expects the numbers that control the approximation to be decreasing in both and . In our examples we shall mostly have polynomial dependence , in the case that consists of functions on a -dimensional domain. The stability property eq. 2.3 quantifies the smoothness norm of the projections in terms of the approximation numbers. Both conditions are assumed up to a maximal order of smoothness , and it follows from eq. 2.3 that must be contained in the space .
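In illustrative notation, with $V_D$ the $D$-dimensional approximation space, $\delta_{D,\beta}$ the approximation numbers and $S^\beta$ the smoothness-$\beta$ space, the Jackson- and Bernstein-type inequalities eq. 2.2 and eq. 2.3 typically take the form

\[
\inf_{v \in V_D} \|f - v\|_{S^0} \;\le\; \delta_{D,\beta}\, \|f\|_{S^\beta} \quad \text{for all } f \in S^\beta, \qquad\text{and}\qquad \|v\|_{S^\beta} \;\lesssim\; \delta_{D,\beta}^{-1}\, \|v\|_{S^0} \quad \text{for all } v \in V_D,
\]

for every smoothness level $\beta$ up to the maximal order.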

The approximation property eq. 2.2 can also be stated in terms of the ‘approximation numbers’ of the canonical embedding . The th approximation number of a general bounded linear operator between normed spaces is defined as

(2.4)

where the infimum is taken over all linear operators of rank less than . It is immediate from the definitions that the numbers in eq. 2.2 can be taken equal to the approximation numbers . The set of approximation numbers of the canonical embedding describes many characteristics of the smoothness scale . We give a brief discussion in Appendix B.
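Written out, with $T\colon S^\beta \to S^0$ a bounded linear operator and the indexing convention illustrative, the $D$-th approximation number of eq. 2.4 is

\[
a_D(T) \;=\; \inf\bigl\{\, \|T - S\| \;:\; S\colon S^\beta \to S^0 \ \text{linear}, \ \operatorname{rank}(S) < D \,\bigr\},
\]

and for the canonical embedding $\iota\colon S^\beta \hookrightarrow S^0$ one may take $\delta_{D,\beta} = a_D(\iota)$ in eq. 2.2.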

Example 2.4 (Sobolev classes).

The most important examples of smoothness classes satisfying Definition 2.1 are fractional Sobolev spaces on a bounded domain . For a natural number the Sobolev space of order can be defined by

Here is the space of generalized functions on (distributions), i.e. the topological dual space of the space of infinitely differentiable functions with compact support in ; the sum ranges over the multi-indices with ; and is the differential operator

The definition can be extended to in several ways. All constructions are equivalent to the Besov space , see [49, 48].

It is well known that the approximation numbers of the scale of Sobolev spaces satisfy Assumption 2.3 with , see [25].
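As a concrete point of reference, these are standard facts about Sobolev spaces, stated here in illustrative notation rather than quoted from [25] or [49, 48]: for an integer $\beta \ge 0$ and a bounded domain $\Omega \subset \mathbb{R}^d$,

\[
H^\beta(\Omega) = \bigl\{ f \in L^2(\Omega) : \partial^\alpha f \in L^2(\Omega) \ \text{for all } |\alpha| \le \beta \bigr\}, \qquad \|f\|_{H^\beta}^2 = \sum_{|\alpha| \le \beta} \|\partial^\alpha f\|_{L^2}^2,
\]

and $D$-dimensional spline, wavelet or finite-element spaces typically give approximation numbers of the order $\delta_{D,\beta} \asymp D^{-\beta/d}$.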

Example 2.5 (Sequence spaces).

Suppose is a given orthonormal sequence in a given Hilbert space , and is a given sequence of numbers. For , define as the set of all elements with , equipped with the norm

Then is embedded in , for every , and the norms are increasing in . Every space is a Hilbert space; in fact is isometric to under the map , where we have identified the series with their coefficients for simplicity of notation.

For , we equip the elements of , where , with the norm as in the display, which is now automatically finite, and next define as the metric completion of under this norm. The space is isometric to the set of all sequences with equipped with the norm given on the right hand side of the preceding display, but the series may not possess a concrete meaning, for instance as a function if is a function space.

By Parseval’s identity the inner product on is given by , and the norm duality eq. 2.1 follows with the help of the Cauchy-Schwarz inequality.

The natural approximation spaces for use in Assumption 2.3 are . The inequalities eq. 2.2 and eq. 2.3 are then satisfied with the approximation numbers taken equal to .
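A minimal numerical sketch of the sequence-space setup, with illustrative choices that are not fixed by the text above: weights $s_i = i$, smoothness norm $\|f\|_\beta^2 = \sum_i s_i^{2\beta} f_i^2$, projection onto the span of the first $D$ basis elements, and Jackson constant $\delta_{D,\beta} = (D+1)^{-\beta}$.

    import numpy as np

    # Illustrative smoothness norm on sequence space: ||f||_beta^2 = sum_i s_i^(2*beta) * f_i^2.
    # Neither the weights s_i = i nor this sign convention is prescribed by the paper.
    def seq_norm(coefs, beta, weights):
        return np.sqrt(np.sum(weights ** (2.0 * beta) * coefs ** 2))

    n = 10_000
    idx = np.arange(1, n + 1)
    weights = idx.astype(float)
    f = idx ** (-2.0)          # a signal whose coefficients decay like i^(-2)

    beta = 1.0
    for D in (10, 100, 1000):
        tail = f.copy()
        tail[:D] = 0.0                       # f minus its projection onto span(e_1, ..., e_D)
        err = seq_norm(tail, 0.0, weights)   # approximation error in the S^0 norm
        # Jackson-type bound with delta_{D, beta} = (D + 1)^(-beta) (illustrative choice):
        bound = (D + 1) ** (-beta) * seq_norm(f, beta, weights)
        print(D, err, bound, err <= bound)

With these conventions the bound holds exactly, since the tail coefficients satisfy $f_i^2 \le s_i^{-2\beta}(D+1)^{-2\beta} s_i^{2\beta} f_i^2$ for $i > D$.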

The forward operator in the model eq. 1.1 is a bounded linear operator between the separable Hilbert spaces and , and is assumed to be smoothing. The following assumption makes this precise. This assumption is satisfied in many examples and is common in the literature (for instance [12, 39, 22]).

In Definition 2.1 the space is embedded as in the smoothness scale and hence has norm .

Assumption 2.6 (Smoothing property of ).

For some the operator is injective and bounded and, for every ,

(2.5)
Example 2.7 (Svd).

If the operator is compact, then the positive self-adjoint operator possesses a countable orthonormal basis of eigenfunctions , which can be arranged so that the corresponding sequence of eigenvalues decreases to zero. If is injective, then all eigenvalues, whose roots are known as the singular values of , are strictly positive. Suppose that there exists such that, for some given sequence ,

(2.6)

If we construct the smoothness classes from the basis and the numbers as in Example 2.5, then eq. 2.5 is satisfied.

Indeed, we can write in polar decomposition as , for a partial isometry , and then have , so that .

Thus constructions using the singular value decomposition of can always be accommodated in the more general setup described in the preceding.

For more interesting illustrations of the preceding setup, consider linear differential equations of the form

where is a differential operator. Under appropriate boundary conditions, the solution can often be expressed in terms of the Green’s function associated with , through a kernel operator

(2.7)

The operator typically lifts a function to a Sobolev space of functions, as in Example 2.4. The ill-posedness arises when one observes the state with noise (which degrades the smoothness) and tries to recover the source function . For illustration we include two concrete examples from the literature.
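In illustrative notation, with $G$ the Green's function of the differential operator on a domain $\Omega$, the kernel operator of eq. 2.7 has the generic form

\[
(K f)(x) \;=\; \int_{\Omega} G(x, y)\, f(y)\, dy, \qquad x \in \Omega,
\]

so that the observed state is $u = Kf$ and the inverse problem is to recover the source $f$ from a noisy version of $u$.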

Example 2.8 (Poisson equation).

The following example can be found in Sections 10.4 and 11.2 in [24]. Let be the periodic Sobolev spaces of (generalized) functions satisfying the boundary condition . Consider the following boundary value problem,

with the Dirichlet boundary condition: . The unique solution is given by

where

The operator is Hilbert-Schmidt and hence compact, and therefore has no bounded inverse. On the other hand the inverse exists as a bounded operator , and is given by .

When , the Sobolev norm is equivalent to (page 217 in [4]). Since , we have , i.e. .

Since the kernel is symmetric, is self-adjoint. Moreover, is an isomorphism between and , as shown above. Hence

by a norm duality argument, for all . This shows that eq. 2.5 holds with .
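A standard concrete instance of this type of problem, stated here only for orientation and not necessarily in the exact form used in [24], is the one-dimensional Dirichlet problem

\[
-u''(x) = f(x) \ \text{ on } (0,1), \qquad u(0) = u(1) = 0, \qquad u(x) = \int_0^1 G(x,y)\, f(y)\, dy, \quad G(x,y) = (x \wedge y)\bigl(1 - (x \vee y)\bigr).
\]

The kernel $G$ is symmetric and square integrable, so the solution operator $f \mapsto u$ is self-adjoint and Hilbert-Schmidt, and it gains two orders of smoothness, in line with a smoothing condition of the form eq. 2.5 of degree two.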

Example 2.9 (Symm’s equation [29]).

Consider the Laplace equation in a bounded set with boundary condition on the boundary . The single-layer potential, a boundary integral

solves the boundary value problem if and only if the density , belonging to the space of continuous functions on , solves Symm’s equation

(2.8)

Assume the boundary has a parametrization of the form , for some -periodic analytic function such that for all . Then Symm’s equation takes the following form,

where . As shown in Theorem 3.18 from [29], the operator satisfies eq. 2.5, with and being periodic Sobolev spaces on .

3 General Result

In this section we present a general theorem on posterior contraction. We form the posterior distribution as in eq. 1.3, given a prior on the space and an observation , whose conditional distribution given is determined by the model eq. 1.2. We study this random distribution under the assumption that follows the model eq. 1.2 for a given ‘true’ function , which we assume to be an element of in a given smoothness scale , as in Definition 2.1.

The result is based on an extension of the testing approach of [20] to the inverse problem eq. 1.2. The inverse problem is handled with the help of the Galerkin method, which is a well known strategy in numerical analysis to solve the operator equation for , in particular for differential and integral operators. The Galerkin method has several variants, which are useful depending on the properties of the operator involved. Here we use the least squares method, which is of general application; for other variants and background, see e.g., [29]. In Appendix A we give a self-contained derivation of the necessary inequalities, exactly in our framework. We note that the Galerkin method only appears as a tool to state and derive a posterior contraction rate. In our context it does not enter into the solution of the inverse problem, which is achieved through the Bayesian method.

Let be the image under of a finite-dimensional approximation space linked to the smoothness scale as in Assumption 2.3, and let be the orthogonal projection onto . If is injective, then is a bijection between the finite-dimensional vector spaces and , and hence for every there exists such that . The element is called the Galerkin solution to in . By the projection theorem in Hilbert spaces it is characterized by the property that together with the orthogonality relations

(3.1)

The idea of the Galerkin inversion is to project the (complex) object onto the finite-dimensional space , and next find the inverse image of the projection, in the finite-dimensional space , as in the diagram:

Clearly the Galerkin solution to an element is itself, but in general is an approximation to , which improves with increasing , at the price of increasing complexity. The following theorem uses a dimension that balances approximation against complexity, where the complexity is implicitly determined by a testing criterion.
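A minimal numerical sketch of this least-squares Galerkin inversion; everything below is illustrative (the discretized operator K, the basis B spanning the approximation space, and the data g are stand-ins, not objects defined in the paper), and the computation uses the normal-equation form of the orthogonality relations.

    import numpy as np

    def galerkin_lsq(K, B, g):
        """Least-squares Galerkin solution: find mu_D in span(B) minimizing ||K mu_D - g||.

        K : (m, n) array, discretization of the forward operator (illustrative)
        B : (n, D) array, columns spanning the approximation space V_D
        g : (m,) array, element to be (approximately) inverted
        """
        A = K @ B                                 # image of the basis of V_D under K
        # The coefficients c solve the normal equations (K B)^T (K B) c = (K B)^T g,
        # i.e. the orthogonality relations <K mu_D - g, K v> = 0 for all v in V_D.
        c, *_ = np.linalg.lstsq(A, g, rcond=None)
        return B @ c                              # Galerkin solution mu_D

    # Toy illustration: a smoothing (integration-like) operator on a grid,
    # inverted on a low-dimensional space spanned by sine functions.
    n = 200
    x = np.linspace(0, 1, n)
    K = np.tril(np.ones((n, n))) / n              # crude discretized integration operator
    mu_true = np.sin(2 * np.pi * x)
    g = K @ mu_true
    D = 10
    B = np.column_stack([np.sin(np.pi * (k + 1) * x) for k in range(D)])
    mu_D = galerkin_lsq(K, B, g)
    print(np.linalg.norm(mu_D - mu_true) / np.sqrt(n))   # small, since mu_true lies in span(B)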

Theorem 3.1.

For smoothness classes as in Definition 2.1, assume that for some , and let denote the Galerkin solution to relative to linear subspaces associated to as in Assumption 2.3. Let for some , and for such that , and such that , and some , assume

(3.2)
(3.3)
(3.4)

Consider prior probability distributions

on satisfying

(3.5)
(3.6)

Then the posterior distribution in the model eq. 1.2 contracts at the rate at , i.e. for a sufficiently large constant we have , in probability under the law of given by eq. 1.2 with .

Proof.

The Kullback-Leibler divergence and variation between the distributions of under two functions and are given by and twice this quantity, respectively. (E.g., Lemma 8.3 in [20].) Therefore the neighbourhoods in (8.19) of [20] contain the ball . By assumption eq. 3.5 this has prior mass at least .

Because the quotient of the left sides of eq. 3.5 and eq. 3.6 is , the posterior probability of the set tends to zero, by Theorem 8.20 in [20].

By a variation of Theorem 8.22 in [20] it is now sufficient to show the existence of tests such that, for some ,

Indeed, when the prior mass condition (8.20) in Theorem 8.22 of [20] can be strengthened to (8.22), as happens in our setup in view of eq. 3.5, it suffices to verify (8.24) only for a single value of . Furthermore, we can apply Theorem 8.22 with the metrics in order to reduce the restriction to .

Fix any orthonormal basis of and define

where . The latter is a “standard normal vector in the finite-dimensional space ”: because are i.i.d. standard normal variables, the variable is -distributed, for every .

Let the operator be defined as , where is the inverse of , which is well defined on the range of . Then by definition is equal to the Galerkin solution to . By the preceding display is a well-defined Gaussian random element in , satisfying

(3.7)

The variable is a Gaussian random element in with strong and weak second moments

In both cases the inequality on at the far right side follows from eq. A.3.

The first inequality shows that the first moment of the variable is bounded above by . By Borell’s inequality (e.g. Lemma 3.1 in [36] and subsequent discussion), applied to the Gaussian random variable in , we see that there exist positive constants and such that, for every ,

For and , and satisfying eq. 3.2, eq. 3.3 and eq. 3.4 this yields, for some ,

(3.8)

We apply this to bound the error probabilities of the tests

(3.9)

where is a given constant, to be determined.

Under , the decomposition eq. 3.7 is valid with , and hence . By the triangle inequality it follows that implies that . By eq. A.5 the assumption that implies that , for some , which at is further bounded by , by assumption eq. 3.4. Hence the probability of an error of the first kind satisfies

For , the right side is bounded by , by eq. 3.8.

Under the decomposition eq. 3.7 gives that . By the triangle inequality implies that . For such that and , we have . Hence the probability of an error of the second kind satisfies

For , this is bounded by , by eq. 3.8.

We can first choose large enough so that , and next large enough so that , to finish the proof. ∎

Inequality eq. 3.5 is the usual prior mass condition for the ‘direct problem’ of estimating (see [17]). It determines the rate of contraction of the posterior distribution of to . The rate of contraction of the posterior distribution of is slower due to the necessity of (implicitly) inverting the operator . The theorem shows that the rate depends on the combination of the prior, through eq. 3.6, and the inverse problem, through the various approximation rates.

The theorem applies to a true function that is ‘smooth’ of order (i.e., ). For a prior that is constructed to give an optimal contraction rate for multiple values of simultaneously, the theorem may not give the best result. The following theorem refines Theorem 3.1 by considering a mixture prior of the form

(3.10)

where is a prior on , for every given ‘hyperparameter’ running through some measurable space, and is a prior on this hyperparameter. The idea is to adapt the prior to multiple smoothness levels through the hyperparameter .
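In illustrative notation, with $\alpha$ the hyperparameter, $\Pi_\alpha$ the conditional prior and $\lambda$ the hyperprior, the mixture prior of eq. 3.10 reads

\[
\Pi \;=\; \int \Pi_\alpha \, d\lambda(\alpha),
\]

so that a draw from $\Pi$ is obtained by first drawing $\alpha \sim \lambda$ and then drawing the signal from $\Pi_\alpha$.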

Theorem 3.2.

Consider the setup and assumptions of Theorem 3.1 with a prior of the form eq. 3.10. Assume that eq. 3.2, eq. 3.3, eq. 3.4 and eq. 3.5 hold, but replace eq. 3.6 by the pair of conditions, for numbers and and every ,

(3.11)
(3.12)

Then the posterior distribution in the model eq. 1.2 contracts at the rate at , i.e. for a sufficiently large constant we have , in probability under the law of given by eq. 1.2 with .

Proof.

We take the parameter of the model as the pair , which receives the joint prior given by and . With abuse of notation, we denote this prior also by . The likelihood still depends on only, but the joint prior gives rise to a posterior distribution on the pair , which we also denote by , by a similar abuse of notation.

By eq. 3.10, eq. 3.11 and eq. 3.12,