Well-posedness of Bayesian inverse problems in quasi-Banach spaces with stable priors

by   T. J. Sullivan, et al.

The Bayesian perspective on inverse problems has attracted much mathematical attention in recent years. Particular attention has been paid to Bayesian inverse problems (BIPs) in which the parameter to be inferred lies in an infinite-dimensional space, a typical example being a scalar or tensor field coupled to some observed data via an ODE or PDE. This article gives an introduction to the framework of well-posed BIPs in infinite-dimensional parameter spaces, as advocated by Stuart (Acta Numer. 19:451--559, 2010) and others. This framework has the advantage of ensuring uniformly well-posed inference problems independently of the finite-dimensional discretisation used for numerical solution. Recently, this framework has been extended to the case of a heavy-tailed prior measure in the family of stable distributions, such as an infinite-dimensional Cauchy distribution, for which polynomial moments are infinite or undefined. It is shown that analogues of the Karhunen--Loève expansion for square-integrable random variables can be used to sample such measures on quasi-Banach spaces. Furthermore, under weaker regularity assumptions than those used to date, the Bayesian posterior measure is shown to depend Lipschitz continuously in the Hellinger and total variation metrics upon perturbations of the misfit function and observed data.


page 1

page 2

page 3

page 4


A Hadamard fractioal total variation-Gaussian (HFTG) prior for Bayesian inverse problems

This paper studies the infinite-dimensional Bayesian inference method wi...

Random tree Besov priors – Towards fractal imaging

We propose alternatives to Bayesian a priori distributions that are freq...

Posterior Convergence of α-Stable Sheets

This paper is concerned with the theoretical understanding of α-stable s...

Posterior Convergence Analysis of α-Stable Sheets

This paper is concerned with the theoretical understanding of α-stable s...

On the well-posedness of Bayesian inverse problems

The subject of this article is the introduction of a weaker concept of w...

Solving linear Bayesian inverse problems using a fractional total variation-Gaussian (FTG) prior and transport map

The Bayesian inference is widely used in many scientific and engineering...

Stein variational gradient descent on infinite-dimensional space and applications to statistical inverse problems

For solving Bayesian inverse problems governed by large-scale forward pr...

1 Introduction

The Bayesian perspective on inverse problems has attracted much mathematical attention in recent years (Kaipio and Somersalo, 2005; Stuart, 2010). Particular attention has been paid to Bayesian inverse problems (BIPs) in which the parameter to be inferred lies in an infinite-dimensional space , a typical example being a scalar or tensor field coupled to some observed data via an ODE or PDE. Numerical solution of such infinite-dimensional BIPs must necessarily be performed in an approximate manner on a finite-dimensional subspace , but it is profitable to delay discretisation to the last possible moment and consider the original infinite-dimensional problem as the primary object of study, since infinite-dimensional well-posedness results and algorithms descend to in a discretisation-independent way, whereas careless early discretisation may lead to a sequence of well-posed finite-dimensional BIPs or algorithms whose stability properties degenerate as the discretisation dimension increases.

The infinite-dimensional (or, in statistical language, non-parametric) context presents a number of computational and theoretical challenges. The development of sampling algorithms that are well defined in infinite-dimensional spaces, and hence robust in a high finite dimension, is an interesting topic in itself; the preconditioned Crank–Nicolson (pCN) method of Cotter et al. (2013) is a representative example in this area. On a more fundamental level, though, the specification of the Bayesian prior and likelihood must be treated with care in order to ensure that the posterior measure is well posed in the sense of Hadamard, i.e. is well defined and depends ‘nicely’ upon the observed data etc.

It has become clear over the last two decades that it is not sufficient to study only the finite-dimensional formulation of BIPs, and some analysis of the infinite-dimensional limit is essential. In the Finnish school of inverse problems, this notion of infinite-dimensional well-posedness became known as discretisation invariance. Its importance was highlighted by a seminal paper on edge-preserving image reconstruction using the discrete total variation prior (Lassas and Siltanen, 2004). The model problem in this paper was the reconstruction using pixels of a one-dimensional continuum image from

-dimensional linear observations subject to additive Gaussian noise. Corresponding to a prior belief that the true image is piecewise smooth with a few jump discontinuities, the discrete total variation prior on

is used. However, Lassas and Siltanen (2004)

showed that there is no way to consistently scale the discrete TV prior so that the Bayesian prior and posterior (and derived summary quantities such as conditional means and maximum a posteriori estimators) all have meaningful limits as

. This negative result inspired a search for priors that would provide edge-preserving and discretisation-invariant Bayesian image reconstruction.

It has recently been suggested, based upon numerical experiments, that Cauchy difference priors may be suitable priors for edge-preserving Bayesian image reconstruction (Markkanen et al., 2016). However, in this case, the well-posedness theory of Stuart (2010) must be extended to the heavy-tailed setting. In this respect, this work forms part of an ongoing investigation of BIPs with heavy-tailed or other complicated structure far from the simple Gaussian regime (Dashti et al., 2012; Hosseini, 2017; Hosseini and Nigam, 2017; Lassas et al., 2009).

2 Bayesian inverse problems

This section reviews the essential notions for the study of BIPs in function spaces, following the style of Stuart (2010). We fix two spaces and ; the space contains an unknown that we wish to recover from data in . We posit that is in fact a noisily perturbed version of , where is a known function, referred to as a forward operator. The prototypical example to keep in mind is the case of additive Gaussian noise:


where is a centred -valued Gaussian random variable with covariance operator . Naturally, many other noise models can arise in practice. The challenge of recovering from is known as an inverse problem.

Inverse problems are typically ill posed in the sense of having no solution in the strict sense, or multiple solutions, or solutions that depend very sensitively upon the problem setup, and in particular upon the data . To deal with the existence problem, one typically relaxes the notion of solution and seeks a minimiser of a suitable misfit functional . In the additive Gaussian case, the appropriate functional is a weighted least-squares functional:


Such problems can still be ill-conditioned, and in order to rectify this and also incorporate prior beliefs about (e.g. that should be small in some norm) one typically minimises a regularised version of the misfit:


For example, on , the regularisation functional

could be the Euclidean norm, corresponding to ridge regression; or the

-norm, corresponding to the sparsity-promoting LASSO; or the TV norm alluded to in the introduction.

The Bayesian interpretation of the inverse problem is to interpret both and as random variables. One first posits a prior measure for — in the case of PDE-constrained inverse problems, typically encodes beliefs about the smoothness of in that it charges a suitable Sobolev or Hölder space with full mass. Eq. (2.1) then determines the conditional distribution of , the likelihood; the Bayesian solution to the inverse problem is the conditional distribution of , the posterior distribution. If and

were finite sets, then this posterior would be given in terms of the prior and likelihood by Bayes’ rule; informally, the posterior is the likelihood times the prior, normalised to be a probability measure. It is also helpful to think of the prior

as having probability density proportional to , and the posterior as having density proportional to .

However, particularly when is an infinite-dimensional space, Bayes’ rule must be stated more carefully, since there is no uniform reference measure (counting measure or Lebesgue measure) with respect to which one can take densities. This generalisation of Bayes’ rule is the content of Theorem 4.1 below. It is also of interest to study the stability of the Bayesian posterior under approximation/error in the observed data, the likelihood, or the prior; this is the content of Theorem 4.2 and the discussion thereafter.

3 Stable distributions in quasi-Banach spaces

Cauchy distributions have recently been proposed as discretisation-invariant edge-preserving priors by Markkanen et al. (2016). Cauchy distributions are an example of more general stable distributions, which are themselves generalised by infinitely divisible distributions (Hosseini, 2017). This section surveys how stable distributions can be defined and sampled by a Karhunen–Loève-like expansion that is valid not just on Banach but even more general spaces. While separability and completeness remain desirable attributes, it turns our that the triangle inequality can be easily relaxed.

A quasi-Banach space

is a vector space

equipped with a quasinorm , i.e. a function that is positive-definite and -homogeneous and satisfies, for some choice of constant , the weakened triangle inequality


and that is Cauchy-complete with respect to this quasinorm. Examples of such spaces include and for .

A -valued random variable is said to be stable of order if, when are independent and identically distributed copies of , is equal in distribution to for some . Equivalently, in terms of the law of and the rescaling ,


Stability is a particularly appealing property if the aim is to construct prior measures for BIPs that are ‘physically consistent’ in the sense of remaining in the same model class regardless of discretisation or coordinate choices, at least when the ‘physical quantity’ obeys an additive law, e.g. the amount of mass or charge contained within a given physical region.

The stable distributions on

are completely classified by four parameters: the index of stability

mentioned above, a skewness parameter

, a scale parameter , and a location parameter . We denote the unique such distribution by . The case

is the case of a Gaussian (normal) distribution; the case

, is the Cauchy distribution. Crucially, when , only has finite moments up to, but not including, order (the exception being , in which case has moments of all orders).

To define and sample -valued stable random variables, we take a Karhunen–Loève-style approach, i.e. we resort to series expansions with real stable random coefficients in a countable, unconditional, normalised, Schauder basis of , e.g. a polynomial, Fourier, or wavelet basis. We also assume that the basis and are such that the synthesis operator is a continuous embedding of the sequence space of coefficients into , i.e.


When is a Banach space, this assumption holds with for any choice of basis , since it is just the triangle inequality for an unconditionally convergent series in . Since , whenever (3.3) holds for it also holds with replaced by any . If inequality (3.3) can be reversed, possibly with a different constant, then the basis is known as a -frame for . The case is the well-known notion of a Riesz basis.

The convergence theorem for such series expansions — which states that sufficiently rapid decay of the scale parameters implies almost sure convergence of the series in — is broadly the same as the well-known Gaussian case , but with some logarithmic corrections (Sullivan, 2017):

Theorem 3.1.

Consider , where are independent, with , , , and, in addition,


with the convention that . Then almost surely. Furthermore, for with , in as and, in particular,


In particular, in the case , we can define -valued Cauchy random variables with moments up to but not including order by using scalar Cauchy-distributed random coefficients whose scale parameters are ‘slightly better than ’ in the sense that is finite.

4 Well-posedness of BIPs with stable priors

The well-posedness of the posterior with respect to perturbations of the problem setup is a topic of natural interest. Since there are many choices of (inequivalent) topology or metric upon the space of probability measures on , this well-posedness can be quantified in many ways (Deza and Deza, 2016, Chapter 14). We choose to use the Hellinger metric defined by


where is an arbitrary choice of reference measure with respect to which both and are absolutely continuous. The Hellinger topology is stronger than the weak topology, equivalent to the total variation topology, and weaker than the relative entropy (Kullback–Leibler) topology. It appears quite naturally in the well-posedness analysis of BIPs, and has the advantage of controlling the expected values of square-integrable quantities of interest:


The basic result ensuring that the posterior distribution for exists and satisfies the appropriate generalisation of Bayes’ rule is as follows (Stuart, 2010; Sullivan, 2017):

Theorem 4.1.

Let and be quasi-Banach spaces and let be a prior distribution on . Suppose that is locally bounded, Carathéodory (i.e. continuous in and measurable in ). Suppose also that, for every , there exists so that


and such that . Then, whenever , it follows that the normalising constant is strictly positive and finite and


does indeed define a Borel probability measure on , which is Radon if is Radon, and is the posterior distribution of conditioned upon the data .

The possibility of a lower bound on , as in Eq. (4.3), that tends to as or seems counterintuitive from a finite-dimensional perspective. However, such do arise naturally in the infinite-dimensional setting, and indeed are unavoidable: for example, in the case of a Gaussian prior and likelihood, the naïve formulation of as a quadratic misfit would be almost surely infinite, so the Cameron–Martin theorem must be used to ‘subtract off the infinite part of ’, making it almost surely finite, but at the cost of a global lower bound of (Stuart, 2010, Remark 3.8).

In the case of a Cauchy prior, Theorems 3.1 and 4.1 combine to say, informally, that a BIP with Cauchy prior has well-defined posterior provided that does not escape to any faster than logarithmically. This is to be contrasted with the polynomial growth rate that is permitted under a Gaussian prior.

The BIP can also be shown to be well-posed in the sense of depending continuously in the Hellinger metric upon perturbations of the data and the potential (Stuart, 2010; Sullivan, 2017). Indeed, with a possibly different constant, the BIP inherits the local Lipschitz continuity of :

Theorem 4.2.

Suppose that, in addition to the conditions of Theorem 4.1, for every , there exists such that, for all with ,


Suppose also that . Then the posterior depends locally Lipschitz continuously in the Hellinger metric upon the observed data in the sense that, for all , there exists a constant such that


Similar results to Theorem 4.2 can be shown for the well-posedness of with respect to perturbation of , e.g. as a result of incorporating an approximate numerical solution of the forward operator , which might be an ODE or PDE solution operator. Again, similarly to (4.5), the key property is for the approximation error in to be controlled by a bound that is exponentially integrable with respect to the prior ; the error rate for the forward problem, e.g. as a function of mesh size, then transfers to the BIP with a possibly different constant prefactor.

On the other hand, the well-posedness of BIPs with respect to the prior is a subtle topic, especially if the model is misspecified (i.e. there is no parameter value in that corresponds to the ‘truth’). Of particular note in this setting is the brittleness phenomenon highlighted by Owhadi et al. (2015a, b): not only does depend upon , as would be expected, but it can do so in a highly discontinuous way, in the sense that any pre-specified quantity of interest can be made to have any desired posterior expectation value after arbitrarily small perturbation in the common-moments, weak, total variation, or Hellinger topologies. This brittleness phenomenon takes place in the limit as the data resolution becomes infinitely fine, and it is a topic of current research whether approaches such as coarsening can in general yield robust inferences (Miller and Dunson, 2015).


The author is supported by the Free University of Berlin within the Excellence Initiative of the German Research FOundation (DFG).