Solving Bayesian inverse problems KaipioSomersalo ; Stuart2010 is a challenging task in many domains of application, due to the complexity of the posterior distribution. One of the primary sources of complexity is the dimension of the parameters to be inferred, which is often high or in principle infinite—e.g., when the posterior is a distribution over functions or their finite-dimensional discretization. High dimensionality presents difficulties for posterior sampling: care is required to design sampling algorithms that mix effectively while remaining robust under refinement of the discretization. High dimensionality also raises significant hurdles to the use of model reduction or approximation schemes (e.g., Conrad2016 ; Cui2015 ; manzoni2016accurate ; rubio2018fast ) that attempt to reduce the cost of likelihood or forward model evaluations.
Successful strategies for high-dimensional Bayesian inverse problems typically exploit the presence of some low-dimensional structure. A common structure in inverse problems is that the posterior is a low-dimensional update of the prior, in the sense that change from prior to posterior is most prominent on a low-dimensional subspace of the parameter space. Put another way, the likelihood is influential, relative to the prior, only on a low-dimensional subspace (we will make these intuitive notions more precise later). Sources of such structure include the smoothing properties of the forward operator and limitations on the number or accuracy of the observations. In the linear–Gaussian case, this structure is already well understood. For instance, low-rank approximations of the (prior-preconditioned) Hessian of the log-likelihood have been used in flath2011fast to approximate the posterior covariance. In Spantani2015 , this approach is shown to yield optimal
approximations of the posterior covariance and of the posterior mean. A heuristic extension of this approach to nonlinear forward models, known as thelikelihood-informed subspace (LIS) method, is proposed in Cui2014_LIS and shown to perform well in many applications beskos2017multilevel ; lamminpaa2017likelihood . A similar idea underlies the active subspace method applied to Bayesian problems Constantine2015 . Identifying the subspace on which changes from the prior to posterior are most prominent also provides a foundation for infinite-dimensional MCMC algorithms that “split” the parameter space, such as the dimension-independent likelihood-informed (DILI) MCMC samplers of Cui2016_DILI . In Cui2016_joint , LIS-based dimension reduction is shown to be an effective prelude to model order reduction, where the forward model is replaced with a computationally inexpensive approximation. Yet all of these dimension reduction approaches, outside of the linear–Gaussian setting, are essentially heuristics. Optimal low-dimensional approximations—or approximations endowed with certified error bounds—have not yet been developed for the nonlinear/non-Gaussian setting. Such approximations are the subject of this paper.
Other forms of dimension reduction for Bayesian inverse problems have also been proposed, besides the “update” form of dimension reduction described above. For instance, in Li2006 ; marzouk2009dimensionality , a truncated Karhunen-Loève decomposition of the prior distribution is used to reduce the parameter dimension of the entire inverse problem. This approach exploits only the low-dimensional structure of the prior distribution, however, and does not take advantage of structure in the likelihood function or forward model. In lieberman2010parameter , a greedy algorithm is used to identify a parameter subspace capable of reproducing the forward model. This approach, in contrast, does not take advantage of the prior correlation structure, and moreover can remove directions that are uninformed by the likelihood but that still retain large variation under the true posterior. More recent results on the intrinsic dimension of linear Bayesian inverse problems Agapiou2017 reinforce the idea that the update from prior to posterior should instead be a central object, e.g., when characterizing the performance of importance sampling schemes.
In this paper we propose a methodology to detect and exploit the low-dimensional structure of the update from prior to posterior. Our approach addresses Bayesian inverse problems with nonlinear forward operators, non-Gaussian priors, and non-Gaussian observation noise. The basic idea is to approximate the likelihood function by a ridge function, i.e., a function that depends non-trivially only on a few linear combinations of the parameters. More precisely, we seek a controlled approximation of the likelihood, such that the Kullback–Leibler (KL) divergence from the resulting posterior approximation to the exact posterior is below some user-defined threshold. To attain this goal, we derive an upper bound on the KL divergence and construct the ridge approximation so that the bound falls below the threshold. This bound is obtained via logarithmic Sobolev inequalities, an important class of inequalities in measure theory gross1975logarithmic ; guionnet2003lectures ; Otto1999 with many implications on concentration of measure phenomena boucheron2013concentration ; ledoux1999concentration . Using logarithmic Sobolev inequalities requires some assumptions on the prior distribution and on the regularity of the likelihood function. In particular, we need the gradient of the log-likelihood to be square-integrable over the posterior distribution. A similar methodology has been proposed in zahm2018gradient to reduce the input dimension of multivariate functions. In that paper, a bound on the function approximation error in norm is obtained via Poincaré inequalities, another class of Sobolev inequality.
The benefits of using our upper bound are threefold. First it provides a test that reveals the low effective dimension of the problem: a quickly decaying spectrum of the second moment matrix of the log-likelihood gradient ensures the existence of a low-dimensional ridge approximation. Second, unlike the original KL divergence, the upper bound can easily be minimized over the approximation class we consider. Doing so drives the construction of the ridge approximation of the likelihood function. Third, the upper bound is a certified
error estimate that allows control of the KL divergence, a quantity not readily accessible in practice.
The proposed dimension reduction method requires computing (i) the second moment matrix of the log-likelihood gradient and (ii) a conditional expectation of the likelihood function. In practice, we approximate both quantities with Monte Carlo estimates, requiring samples from the posterior for (i) and samples from the prior for (ii). The resulting (random) posterior approximation is not exactly a minimizer of the upper bound. We show, however, that value of the upper bound corresponding to this random approximation is close to its minimum value, provided that the sample sizes are sufficiently large. This quasi-optimality result is given in expectation for (ii) and in high probability for (i). In particular, we show that the number of posterior samples to approximate (i) should scale in proportion to therank of the matrix to be estimated, which can be much smaller than the ambient dimension. Finally, even if the method only requires a limited number of posterior samples, it can be difficult to obtain the posterior samples in the first place. Thus, we also propose several alternatives for computing (i) that do not require sampling from the exact posterior. These alternatives include an iterative procedure that builds a sequence of low-dimensional posterior approximations in order to obtain an accurate final estimate.
The outline of the paper is as follows. In Section 2 we introduce the theoretical tools to derive and to minimize the error bound. We demonstrate the benefits of this method on an analytical example. In Section 3 we propose algorithms for the numerical construction of the low-dimensional posterior approximation. We give a theoretical analysis of the convergence of these sample-based estimators. In order to provide some context for our developments, we show in Section 4 how the proposed methodology compares with existing dimension reduction methods, including a comparison with a Karhunen-Loève decomposition. Finally, Section 5 illustrates the benefits of our approach on two numerical examples.
2 Dimension reduction for the approximation of high-dimensional distributions
be a probability distribution defined over the Borel setsof . Given a measurable function such that , let be the probability distribution such that
In the context of Bayesian inverse problems, one can view as the prior distribution and as the posterior distribution, while represents, up to a multiplicative constant, the likelihood function. We consider the problem of approximating the posterior by a probability distribution such that
where is a linear projector with rank and is a Borel function called the profile function. Throughout this paper we identify the projector with its matrix representation . Notice that is not restricted to be orthogonal: it can be any matrix which satisfies and , but not necessarily
. Any vectorcan be uniquely decomposed as
denotes the identity matrix. If, the approximation of by consists essentially in replacing the high-dimensional likelihood by a function of fewer variables. Indeed, depends only on the variable and is constant along . Yet cannot itself be considered a low-dimensional distribution, since its support could be the same as that of .
We use the Kullback–Leibler divergence to measure the dissimilarity between probability distributions. It is defined as
for any probability distributions and such that is absolutely continuous with respect to , and otherwise. Given a prescribed tolerance , our goal is to build an approximation of of the form (1) such that
Of course if is the identity matrix and , the distribution is exactly the posterior , so that (3) is trivially satisfied. In that case, the rank of is and there is no dimension reduction. The goal of this section is to give sufficient conditions under which we can build an approximation such that (3) holds with a rank that is significantly smaller than .
Instead of approximating by as given by (1), we could have considered an approximation of the form
with a Borel function and a matrix of full row rank. Functions of the form are a particular kind of ridge function pinkus_2015 . They are used in various domains to approximate multivariate functions; see Cohen2012b ; fornasier2012learning ; tyagi2014learning , for example. However, any function of the form can be written as and vice-versa, where is some Borel function and is some rank- projector. More precisely, for any we can show that
This means that approximating by or by are the essentially the same problem: we do not gain approximation power in using instead of .
2.1 Optimal profile function
We begin by characterizing the optimal profile function , defined as a minimizer of over the set of positive and measurable functions, where .
In this section, we assume that the projector is given (fixed).
We will address the problem of constructing later in Section 2.2.
Let us denote by the -algebra generated by . It is defined by where denotes the pre-image of under . The following lemma corresponds to the Doob–Dynkin lemma; see, for example, Lemma 1.13 in Mikosch1998 . It states that, for any projector , the set of all functions of the form for some measurable function is exactly the set of all -measurable functions.
Let be a projector. Given a measurable function defined on , the function is -measurable. Conversely, given a -measurable function , there exists a measurable function defined on such that .
We denote by the conditional expectation of given under the distribution . It is the unique -measurable function that satisfies
for any -measurable function . We consider the probability distribution defined by
This distribution is well defined because is a positive function (the conditional expectation preserves the sign) and because (by letting in (4)). By definition of the Kullback–Leibler divergence (2), we can write
for any measurable function . From this inequality we deduce that any function which satisfies is a minimizer of . By Lemma 1 such a function exists because is a -measurable function.
The conditional expectation is known to be the best approximation of a function with respect to the -norm, meaning that it minimizes among all -measurable functions . Here we showed that it is also optimal with respect to the Kullback–Leibler divergence. As pointed out in Banerjee2005 , is also an optimal approximation of with respect to the class of expected Bregman divergences, which includes the Kullback–Leibler divergence and the -norm distance.
The following proposition gives an explicit expression for the conditional expectation provided the prior distribution admits a Lebesgue density. The proof is given in Section 7.1.
Proposition 1 (Explicit expression for the conditional expectation)
Let be a probability distribution on which admits a Lebesgue density . Given a rank- projector , we denote by a matrix whose columns form a basis of . Let be the conditional probability density on defined by
for all and any , with the convention whenever the denominator of (8) is zero. Then, for any measurable function , the conditional expectation can be written as
We conclude this section with a remarkable property of the conditional expectation . Let be any projector whose kernel is the same as that of . Then the relations and hold; see, for instance, the proof of Proposition 2.2 in zahm2018gradient . Then Lemma 1 ensures that any -measurable function is -measurable and vice-versa. In other words, being -measurable or -measurable are equivalent such that, by definition of the conditional expectation (4), we have
In other words, the conditional expectation is invariant with respect to the image of —so that the error , seen as a function of the projector , is actually only a function of
. In the context of Bayesian inference, this property means that when the optimal profile functionis used in (1), the important feature of to be discovered is its kernel, and not its image. Thus reducing the dimension of a Bayesian inverse problem consists in identifying the subspace on which the data are not informative.
2.2 A controlled approximation
In this section we show how to build a projector with a sufficiently large rank so that , defined by , satisfies
for some prescribed tolerance . We argue that under some conditions on the likelihood , the rank of can be significantly smaller than .
We make the following assumption on the prior distribution . Here, the notation means that the matrix is positive semidefinite.
The distribution admits a Lebesgue density such that , where and are two functions defined on which satisfy the following two properties.
is twice continuously differentiable and there exists a symmetric positive definite matrix such that for all we have
is a bounded function such that
for some .
Let us make some comments on this assumption. First consider the case , which means that is a constant function so that . Assumption 1(a) corresponds to a strong convexity property of the function . Intuitively, it means that is “at least quadratically convex.” Now, when , Assumption 1(b) means that holds for all , where and are two constants that satisfy .
Example 1 (Gaussian distribution)
Any Gaussian prior with mean and invertible covariance matrix satisfies Assumption 1 with and . Indeed, the density associated to is such that where the function satisfies .
Example 2 (Gaussian mixture)
Let with and for all , where means that is a positive definite matrix. The density of the Gaussian mixture can be written as follows,
where and . For any the function is quadratic and, by assumption, its Hessian satisfies . This ensures that is bounded from above, and so is the function . Furthermore, the relation holds for all , which means that is bounded from below. As a consequence we have so that satisfies Assumption 1 with and for some finite .
Assumption 1 provides sufficient conditions for the prior distribution to satisfy the following logarithmic Sobolev inequality:
for any smooth enough function , where denotes the norm on such that for all . This result relies on Bakry–Émery theorem bakry1985diffusions ; bobkov2000brunn ; Otto1999 , which uses Assumption 1(a), and on Holley–Stroock perturbation lemma holley1987logarithmic , which uses Assumption 1(b). More precisely, Proposition 3.1 in bobkov2000brunn states that in the case (i.e., is constant), Assumption 1(a) is sufficient to have (12). Then, following the lines of the original proof in holley1987logarithmic (see also the proof of Theorem 1.9 in Gozlan2013 , for instance), we have that (12) still holds when is such that .
The following proposition shows that Assumption 1 also implies another class of inequalities, which we call the subspace logarithmic Sobolev inequalities.
Theorem 2.1 (Subspace logarithmic Sobolev inequality)
Let be a probability distribution that satisfies Assumption 1 for some and . Then the relation
holds for any continuously differentiable function such that and for any projector .
Let be a distribution which satisfies the subspace logarithmic Sobolev inequality (13) for some and , and let be such that for some continuously differentiable function such that . Then for any projector , the distribution such that satisfies
where and where is the symmetric positive matrix defined by
holds and shows that is the mean squared error of the approximation of the random vector by , whereis called the reconstruction error. In contrast to , the reconstruction error is quadratic in . The following proposition gives a closed form expression for a minimizer of over the set of the rank- projectors. It corresponds to Proposition 2.6 in zahm2018gradient where is replaced by .
Let be a symmetric positive definite matrix and be a symmetric positive semidefinite matrix. Denote by the -th generalized eigenpair of the matrix pencil , meaning with and , where . For any we have
Furthermore, a solution to (17) is the following -orthogonal projector (i.e., satisfying for all ) given by
where is the
-th generalized eigenvalue of. This relation holds for any . Then for any , the choice is sufficient to obtain . Observe that a strong decay in the generalized eigenvalues of ensures that . In particular, if is rank deficient, we have for all so that implies . The matrix can be seen as a test that reveals the low effective dimensionality of the posterior distribution: a strong decay in the generalized spectrum of , or certainly a rank deficiency of , ensures that there exists an approximation of such that with small .
2.3 Prior-based dimension reduction
To provide some context for the preceding developments, we now derive a prior-based dimension reduction method which requires knowledge only of the prior distribution . Observe that for any projector we have
Here denotes the spectral norm of the symmetric positive semidefinite matrix , meaning . To derive this inequality, we used the fact that holds true for any symmetric positive semidefinite matrix , in particular for . Together with Corollary 1, the previous relation allows us to write
for any . By Proposition 2, the rank- projector which minimizes
is given as the orthogonal (self-adjoint) projector onto the leading eigenspace of. This projector is entirely determined by the knowledge of the matrix , which itself comes from the prior though Assumption 1. This explains the phrase prior-based dimension reduction.
When is a minimizer of , the right-hand side of (19) is proportional to the sum of the last eigenvalues of . Thus a sharp decay in the spectrum of indicates a low effective dimension of the inverse problem. Notice, however, that the bound (19) still involves the quantity . Therefore, a certified approximation (meaning one that controls the error ) would still require knowing and cannot in general be realized using prior information alone.
Finally, let us consider the Gaussian prior for which . In that case the minimizer of is the orthogonal projector onto the leading eigenspace of the prior covariance . This type of projector is also encountered in the Karhunen–Loève decomposition; see Section 4.1 for a detailed discussion.
2.4 An illustrative example
To illustrate how sharp the bound given by Corollary 1 can be, we consider a simple example for which the Kullback–Leibler divergence is computable in closed form. This allows a comparison of the error and its upper bound .
Assume that the prior
is the standard Gaussian distributionand let the likelihood function be given by
where is a symmetric positive semidefinite matrix. The Lebesgue density of is , so that satisfies Assumption 1 with and . The posterior defined by is also Gaussian with zero mean and covariance . In this setting, the matrix defined by (15) can be written as follows
Consider the generalized eigenvalue problem which, since , is simply the standard eigenvalue problem . Notice that is a rational function in , so that and
have the same eigenvectors and their eigenvalues satisfy the relation
for all , where is the -th largest eigenvalue of . According to Proposition 2, a projector minimizing over the set of rank- projectors is . Corollary 1 ensures that the distribution defined by is such that
To analyze the sharpness of this inequality, we now compute . Using Proposition 1, we can express the conditional expectation as follows:
where . Then such that is Gaussian with zero mean and covariance . The Kullback–Leibler divergence from to admits the following closed form expression:
To continue the calculation, one needs the eigenvalues of . Let
be the orthogonal matrix containing the eigenvectors ofand let so that . By construction, where , so that From this relation, we deduce that the -th eigenvalue of is when and otherwise. Then we have
We now analyze the deficit in inequality (20). With a Taylor expansion as goes to zero for all , we can write
If the function is nearly constant () along the subspace , then the upper bound (20) is close to . In that case, the upper bound is, up to a multiplicative factor of two, the same as the true error.
In this particular example, the projector obtained by minimizing the upper bound in fact yields the optimal approximation of in Kullback–Leibler divergence, for any given rank . This can be shown by Theorem 2.3 in Spantani2015 . Of course, minimizing the upper bound does not produce the optimal projector in general.
3 Building the approximation
In this section we propose and analyze algorithms for the numerical construction of a low-rank projector and of a profile function , such that the distribution given by is a controlled approximation of the posterior distribution . Recall that in the previous section (see (7)) we obtained the following decomposition of the Kullback–Leibler divergence:
where is given by