1 Introduction
Solving Bayesian inverse problems KaipioSomersalo ; Stuart2010 is a challenging task in many domains of application, due to the complexity of the posterior distribution. One of the primary sources of complexity is the dimension of the parameters to be inferred, which is often high or in principle infinite—e.g., when the posterior is a distribution over functions or their finitedimensional discretization. High dimensionality presents difficulties for posterior sampling: care is required to design sampling algorithms that mix effectively while remaining robust under refinement of the discretization. High dimensionality also raises significant hurdles to the use of model reduction or approximation schemes (e.g., Conrad2016 ; Cui2015 ; manzoni2016accurate ; rubio2018fast ) that attempt to reduce the cost of likelihood or forward model evaluations.
Successful strategies for highdimensional Bayesian inverse problems typically exploit the presence of some lowdimensional structure. A common structure in inverse problems is that the posterior is a lowdimensional update of the prior, in the sense that change from prior to posterior is most prominent on a lowdimensional subspace of the parameter space. Put another way, the likelihood is influential, relative to the prior, only on a lowdimensional subspace (we will make these intuitive notions more precise later). Sources of such structure include the smoothing properties of the forward operator and limitations on the number or accuracy of the observations. In the linear–Gaussian case, this structure is already well understood. For instance, lowrank approximations of the (priorpreconditioned) Hessian of the loglikelihood have been used in flath2011fast to approximate the posterior covariance. In Spantani2015 , this approach is shown to yield optimal
approximations of the posterior covariance and of the posterior mean. A heuristic extension of this approach to nonlinear forward models, known as the
likelihoodinformed subspace (LIS) method, is proposed in Cui2014_LIS and shown to perform well in many applications beskos2017multilevel ; lamminpaa2017likelihood . A similar idea underlies the active subspace method applied to Bayesian problems Constantine2015 . Identifying the subspace on which changes from the prior to posterior are most prominent also provides a foundation for infinitedimensional MCMC algorithms that “split” the parameter space, such as the dimensionindependent likelihoodinformed (DILI) MCMC samplers of Cui2016_DILI . In Cui2016_joint , LISbased dimension reduction is shown to be an effective prelude to model order reduction, where the forward model is replaced with a computationally inexpensive approximation. Yet all of these dimension reduction approaches, outside of the linear–Gaussian setting, are essentially heuristics. Optimal lowdimensional approximations—or approximations endowed with certified error bounds—have not yet been developed for the nonlinear/nonGaussian setting. Such approximations are the subject of this paper.Other forms of dimension reduction for Bayesian inverse problems have also been proposed, besides the “update” form of dimension reduction described above. For instance, in Li2006 ; marzouk2009dimensionality , a truncated KarhunenLoève decomposition of the prior distribution is used to reduce the parameter dimension of the entire inverse problem. This approach exploits only the lowdimensional structure of the prior distribution, however, and does not take advantage of structure in the likelihood function or forward model. In lieberman2010parameter , a greedy algorithm is used to identify a parameter subspace capable of reproducing the forward model. This approach, in contrast, does not take advantage of the prior correlation structure, and moreover can remove directions that are uninformed by the likelihood but that still retain large variation under the true posterior. More recent results on the intrinsic dimension of linear Bayesian inverse problems Agapiou2017 reinforce the idea that the update from prior to posterior should instead be a central object, e.g., when characterizing the performance of importance sampling schemes.
In this paper we propose a methodology to detect and exploit the lowdimensional structure of the update from prior to posterior. Our approach addresses Bayesian inverse problems with nonlinear forward operators, nonGaussian priors, and nonGaussian observation noise. The basic idea is to approximate the likelihood function by a ridge function, i.e., a function that depends nontrivially only on a few linear combinations of the parameters. More precisely, we seek a controlled approximation of the likelihood, such that the Kullback–Leibler (KL) divergence from the resulting posterior approximation to the exact posterior is below some userdefined threshold. To attain this goal, we derive an upper bound on the KL divergence and construct the ridge approximation so that the bound falls below the threshold. This bound is obtained via logarithmic Sobolev inequalities, an important class of inequalities in measure theory gross1975logarithmic ; guionnet2003lectures ; Otto1999 with many implications on concentration of measure phenomena boucheron2013concentration ; ledoux1999concentration . Using logarithmic Sobolev inequalities requires some assumptions on the prior distribution and on the regularity of the likelihood function. In particular, we need the gradient of the loglikelihood to be squareintegrable over the posterior distribution. A similar methodology has been proposed in zahm2018gradient to reduce the input dimension of multivariate functions. In that paper, a bound on the function approximation error in norm is obtained via Poincaré inequalities, another class of Sobolev inequality.
The benefits of using our upper bound are threefold. First it provides a test that reveals the low effective dimension of the problem: a quickly decaying spectrum of the second moment matrix of the loglikelihood gradient ensures the existence of a lowdimensional ridge approximation. Second, unlike the original KL divergence, the upper bound can easily be minimized over the approximation class we consider. Doing so drives the construction of the ridge approximation of the likelihood function. Third, the upper bound is a certified
error estimate that allows control of the KL divergence, a quantity not readily accessible in practice.
The proposed dimension reduction method requires computing (i) the second moment matrix of the loglikelihood gradient and (ii) a conditional expectation of the likelihood function. In practice, we approximate both quantities with Monte Carlo estimates, requiring samples from the posterior for (i) and samples from the prior for (ii). The resulting (random) posterior approximation is not exactly a minimizer of the upper bound. We show, however, that value of the upper bound corresponding to this random approximation is close to its minimum value, provided that the sample sizes are sufficiently large. This quasioptimality result is given in expectation for (ii) and in high probability for (i). In particular, we show that the number of posterior samples to approximate (i) should scale in proportion to the
rank of the matrix to be estimated, which can be much smaller than the ambient dimension. Finally, even if the method only requires a limited number of posterior samples, it can be difficult to obtain the posterior samples in the first place. Thus, we also propose several alternatives for computing (i) that do not require sampling from the exact posterior. These alternatives include an iterative procedure that builds a sequence of lowdimensional posterior approximations in order to obtain an accurate final estimate.The outline of the paper is as follows. In Section 2 we introduce the theoretical tools to derive and to minimize the error bound. We demonstrate the benefits of this method on an analytical example. In Section 3 we propose algorithms for the numerical construction of the lowdimensional posterior approximation. We give a theoretical analysis of the convergence of these samplebased estimators. In order to provide some context for our developments, we show in Section 4 how the proposed methodology compares with existing dimension reduction methods, including a comparison with a KarhunenLoève decomposition. Finally, Section 5 illustrates the benefits of our approach on two numerical examples.
2 Dimension reduction for the approximation of highdimensional distributions
Let
be a probability distribution defined over the Borel sets
of . Given a measurable function such that , let be the probability distribution such thatIn the context of Bayesian inverse problems, one can view as the prior distribution and as the posterior distribution, while represents, up to a multiplicative constant, the likelihood function. We consider the problem of approximating the posterior by a probability distribution such that
(1) 
where is a linear projector with rank and is a Borel function called the profile function. Throughout this paper we identify the projector with its matrix representation . Notice that is not restricted to be orthogonal: it can be any matrix which satisfies and , but not necessarily
. Any vector
can be uniquely decomposed aswhere
denotes the identity matrix. If
, the approximation of by consists essentially in replacing the highdimensional likelihood by a function of fewer variables. Indeed, depends only on the variable and is constant along . Yet cannot itself be considered a lowdimensional distribution, since its support could be the same as that of .We use the Kullback–Leibler divergence to measure the dissimilarity between probability distributions. It is defined as
(2) 
for any probability distributions and such that is absolutely continuous with respect to , and otherwise. Given a prescribed tolerance , our goal is to build an approximation of of the form (1) such that
(3) 
Of course if is the identity matrix and , the distribution is exactly the posterior , so that (3) is trivially satisfied. In that case, the rank of is and there is no dimension reduction. The goal of this section is to give sufficient conditions under which we can build an approximation such that (3) holds with a rank that is significantly smaller than .
Remark 1
Instead of approximating by as given by (1), we could have considered an approximation of the form
with a Borel function and a matrix of full row rank. Functions of the form are a particular kind of ridge function pinkus_2015 . They are used in various domains to approximate multivariate functions; see Cohen2012b ; fornasier2012learning ; tyagi2014learning , for example. However, any function of the form can be written as and viceversa, where is some Borel function and is some rank projector. More precisely, for any we can show that
This means that approximating by or by are the essentially the same problem: we do not gain approximation power in using instead of .
2.1 Optimal profile function
We begin by characterizing the optimal profile function , defined as a minimizer of over the set of positive and measurable functions, where .
In this section, we assume that the projector is given (fixed).
We will address the problem of constructing later in Section 2.2.
Let us denote by the algebra generated by . It is defined by where denotes the preimage of under . The following lemma corresponds to the Doob–Dynkin lemma; see, for example, Lemma 1.13 in Mikosch1998 . It states that, for any projector , the set of all functions of the form for some measurable function is exactly the set of all measurable functions.
Lemma 1
Let be a projector. Given a measurable function defined on , the function is measurable. Conversely, given a measurable function , there exists a measurable function defined on such that .
We denote by the conditional expectation of given under the distribution . It is the unique measurable function that satisfies
(4) 
for any measurable function . We consider the probability distribution defined by
This distribution is well defined because is a positive function (the conditional expectation preserves the sign) and because (by letting in (4)). By definition of the Kullback–Leibler divergence (2), we can write
(5)  
(6)  
(7) 
where and are two normalizing constants defined by and . To go from (5) to (6), we used relation (4) with , which is a measurable function. By (7) we have
for any measurable function . From this inequality we deduce that any function which satisfies is a minimizer of . By Lemma 1 such a function exists because is a measurable function.
Remark 2
The conditional expectation is known to be the best approximation of a function with respect to the norm, meaning that it minimizes among all measurable functions . Here we showed that it is also optimal with respect to the Kullback–Leibler divergence. As pointed out in Banerjee2005 , is also an optimal approximation of with respect to the class of expected Bregman divergences, which includes the Kullback–Leibler divergence and the norm distance.
The following proposition gives an explicit expression for the conditional expectation provided the prior distribution admits a Lebesgue density. The proof is given in Section 7.1.
Proposition 1 (Explicit expression for the conditional expectation)
Let be a probability distribution on which admits a Lebesgue density . Given a rank projector , we denote by a matrix whose columns form a basis of . Let be the conditional probability density on defined by
(8) 
for all and any , with the convention whenever the denominator of (8) is zero. Then, for any measurable function , the conditional expectation can be written as
(9) 
We conclude this section with a remarkable property of the conditional expectation . Let be any projector whose kernel is the same as that of . Then the relations and hold; see, for instance, the proof of Proposition 2.2 in zahm2018gradient . Then Lemma 1 ensures that any measurable function is measurable and viceversa. In other words, being measurable or measurable are equivalent such that, by definition of the conditional expectation (4), we have
In other words, the conditional expectation is invariant with respect to the image of —so that the error , seen as a function of the projector , is actually only a function of
. In the context of Bayesian inference, this property means that when the optimal profile function
is used in (1), the important feature of to be discovered is its kernel, and not its image. Thus reducing the dimension of a Bayesian inverse problem consists in identifying the subspace on which the data are not informative.2.2 A controlled approximation
In this section we show how to build a projector with a sufficiently large rank so that , defined by , satisfies
for some prescribed tolerance . We argue that under some conditions on the likelihood , the rank of can be significantly smaller than .
We make the following assumption on the prior distribution . Here, the notation means that the matrix is positive semidefinite.
Assumption 1
The distribution admits a Lebesgue density such that , where and are two functions defined on which satisfy the following two properties.

is twice continuously differentiable and there exists a symmetric positive definite matrix such that for all we have
(10) 
is a bounded function such that
(11) for some .
Let us make some comments on this assumption. First consider the case , which means that is a constant function so that . Assumption 1(a) corresponds to a strong convexity property of the function . Intuitively, it means that is “at least quadratically convex.” Now, when , Assumption 1(b) means that holds for all , where and are two constants that satisfy .
Example 1 (Gaussian distribution)
Any Gaussian prior with mean and invertible covariance matrix satisfies Assumption 1 with and . Indeed, the density associated to is such that where the function satisfies .
Example 2 (Gaussian mixture)
Let with and for all , where means that is a positive definite matrix. The density of the Gaussian mixture can be written as follows,
where and . For any the function is quadratic and, by assumption, its Hessian satisfies . This ensures that is bounded from above, and so is the function . Furthermore, the relation holds for all , which means that is bounded from below. As a consequence we have so that satisfies Assumption 1 with and for some finite .
Assumption 1 provides sufficient conditions for the prior distribution to satisfy the following logarithmic Sobolev inequality:
(12) 
for any smooth enough function , where denotes the norm on such that for all . This result relies on Bakry–Émery theorem bakry1985diffusions ; bobkov2000brunn ; Otto1999 , which uses Assumption 1(a), and on Holley–Stroock perturbation lemma holley1987logarithmic , which uses Assumption 1(b). More precisely, Proposition 3.1 in bobkov2000brunn states that in the case (i.e., is constant), Assumption 1(a) is sufficient to have (12). Then, following the lines of the original proof in holley1987logarithmic (see also the proof of Theorem 1.9 in Gozlan2013 , for instance), we have that (12) still holds when is such that .
The following proposition shows that Assumption 1 also implies another class of inequalities, which we call the subspace logarithmic Sobolev inequalities.
Theorem 2.1 (Subspace logarithmic Sobolev inequality)
Let be a probability distribution that satisfies Assumption 1 for some and . Then the relation
(13) 
holds for any continuously differentiable function such that and for any projector .
The proof is given in Section 7.2. The subspace logarithmic Sobolev inequality (13) allows us to derive an upper bound for the Kullback–Leibler divergence .
Corollary 1
Let be a distribution which satisfies the subspace logarithmic Sobolev inequality (13) for some and , and let be such that for some continuously differentiable function such that . Then for any projector , the distribution such that satisfies
(14) 
where and where is the symmetric positive matrix defined by
(15) 
Proof
Corollary 1 provides an upper bound for that is proportional to , where is given in (15). As noticed in the previous proof, the relation
(16) 
holds and shows that is the mean squared error of the approximation of the random vector by , where
. In principal component analysis,
is called the reconstruction error. In contrast to , the reconstruction error is quadratic in . The following proposition gives a closed form expression for a minimizer of over the set of the rank projectors. It corresponds to Proposition 2.6 in zahm2018gradient where is replaced by .Proposition 2
Let be a symmetric positive definite matrix and be a symmetric positive semidefinite matrix. Denote by the th generalized eigenpair of the matrix pencil , meaning with and , where . For any we have
(17) 
Furthermore, a solution to (17) is the following orthogonal projector (i.e., satisfying for all ) given by
(18) 
Corollary 1 and Proposition 2 ensure that, provided satisfies the subspace logarithmic Sobolev inequality (13) and provided is defined as in (18), the approximation of defined by is such that
where is the
th generalized eigenvalue of
. This relation holds for any . Then for any , the choice is sufficient to obtain . Observe that a strong decay in the generalized eigenvalues of ensures that . In particular, if is rank deficient, we have for all so that implies . The matrix can be seen as a test that reveals the low effective dimensionality of the posterior distribution: a strong decay in the generalized spectrum of , or certainly a rank deficiency of , ensures that there exists an approximation of such that with small .2.3 Priorbased dimension reduction
To provide some context for the preceding developments, we now derive a priorbased dimension reduction method which requires knowledge only of the prior distribution . Observe that for any projector we have
Here denotes the spectral norm of the symmetric positive semidefinite matrix , meaning . To derive this inequality, we used the fact that holds true for any symmetric positive semidefinite matrix , in particular for . Together with Corollary 1, the previous relation allows us to write
(19) 
for any . By Proposition 2, the rank projector which minimizes
is given as the orthogonal (selfadjoint) projector onto the leading eigenspace of
. This projector is entirely determined by the knowledge of the matrix , which itself comes from the prior though Assumption 1. This explains the phrase priorbased dimension reduction.When is a minimizer of , the righthand side of (19) is proportional to the sum of the last eigenvalues of . Thus a sharp decay in the spectrum of indicates a low effective dimension of the inverse problem. Notice, however, that the bound (19) still involves the quantity . Therefore, a certified approximation (meaning one that controls the error ) would still require knowing and cannot in general be realized using prior information alone.
Finally, let us consider the Gaussian prior for which . In that case the minimizer of is the orthogonal projector onto the leading eigenspace of the prior covariance . This type of projector is also encountered in the Karhunen–Loève decomposition; see Section 4.1 for a detailed discussion.
2.4 An illustrative example
To illustrate how sharp the bound given by Corollary 1 can be, we consider a simple example for which the Kullback–Leibler divergence is computable in closed form. This allows a comparison of the error and its upper bound .
Assume that the prior
is the standard Gaussian distribution
and let the likelihood function be given bywhere is a symmetric positive semidefinite matrix. The Lebesgue density of is , so that satisfies Assumption 1 with and . The posterior defined by is also Gaussian with zero mean and covariance . In this setting, the matrix defined by (15) can be written as follows
Consider the generalized eigenvalue problem which, since , is simply the standard eigenvalue problem . Notice that is a rational function in , so that and
have the same eigenvectors and their eigenvalues satisfy the relation
for all , where is the th largest eigenvalue of . According to Proposition 2, a projector minimizing over the set of rank projectors is . Corollary 1 ensures that the distribution defined by is such that
(20) 
To analyze the sharpness of this inequality, we now compute . Using Proposition 1, we can express the conditional expectation as follows:
(21) 
where . Then such that is Gaussian with zero mean and covariance . The Kullback–Leibler divergence from to admits the following closed form expression:
To continue the calculation, one needs the eigenvalues of . Let
be the orthogonal matrix containing the eigenvectors of
and let so that . By construction, where , so that From this relation, we deduce that the th eigenvalue of is when and otherwise. Then we haveWe now analyze the deficit in inequality (20). With a Taylor expansion as goes to zero for all , we can write
If the function is nearly constant () along the subspace , then the upper bound (20) is close to . In that case, the upper bound is, up to a multiplicative factor of two, the same as the true error.
In this particular example, the projector obtained by minimizing the upper bound in fact yields the optimal approximation of in Kullback–Leibler divergence, for any given rank . This can be shown by Theorem 2.3 in Spantani2015 . Of course, minimizing the upper bound does not produce the optimal projector in general.
3 Building the approximation
In this section we propose and analyze algorithms for the numerical construction of a lowrank projector and of a profile function , such that the distribution given by is a controlled approximation of the posterior distribution . Recall that in the previous section (see (7)) we obtained the following decomposition of the Kullback–Leibler divergence:
(22) 
where is given by . The first term in this decomposition depends only on