1 Introduction
Given observation data for a system with unknown parameters, Bayesian inference provides an optimal probability framework for learning the parameters by updating their prior distribution to a posterior distribution. However, many conventional methods for solving highdimensional Bayesian inference problems face the curse of dimensionality, i.e., the computational complexity grows rapidly, often exponentially, with respect to (w.r.t.) the number of parameters. To address the curse of dimensionality, the intrinsic properties of the posterior distribution, such as its smoothness, sparsity, and intrinsic lowdimensionality, have been exploited to reduce the parameter correlation and develop efficient methods whose complexity grows slowly or remains the same with increasing dimension. By exploiting the geometry of the loglikelihood function, accelerated Markov chain Monte Carlo (MCMC) methods have been developed to reduce the sample correlation or increase effective sample size independent of the dimension
(Girolami & Calderhead, 2011; Martin et al., 2012; Petra et al., 2014; Constantine et al., 2016; Cui et al., 2016; Beskos et al., 2017). Nevertheless, these random and essentially serial sampling methods remain prohibitive for largescale inference problems with expensive likelihoods. Deterministic methods using sparse quadratures (Schwab & Stuart, 2012; Schillings & Schwab, 2013; Chen & Schwab, 2015, 2016) were shown to converge rapidly with dimensionindependent rates for problems with smooth and sparse posteriors. However, for posteriors lacking smoothness or sparsity, the convergence deteriorates significantly, despite incorporation of Hessianbased transformations (Schillings & Schwab, 2016; Chen et al., 2017).Transportbased variational inference is another type of deterministic method that seeks a transport map in a function space (represented by, e.g., polynomials, kernels, or neural networks) that pushes the prior to the posterior by minimizing the difference between the transported prior and the posterior, measured in, e.g., Kullback–Leibler divergence
(Marzouk et al., 2016; Liu & Wang, 2016; Blei et al., 2017; Bigoni et al., 2019; Detommaso et al., 2019). In particular, kernelbased Stein variational methods, using gradientbased (SVGD) (Liu & Wang, 2016; Chen et al., 2018; Liu & Zhu, 2018) and Hessianbased (SVN) (Detommaso et al., 2018; Wang & Li, 2020) optimization methods, are shown to achieve fast convergence in relatively low dimensions. Nonetheless, the convergence and accuracy of these methods deteriorates in high dimensions due to the curse of dimensionality in kernel representation. This can be partially addressed by a localized SVGD on Markov blankets, which relies on a conditional independence structure of the target distribution (Zhuo et al., 2018; Wang et al., 2018), or by a projected SVN that projects the parameter in a lowdimensional subspace (Chen et al., 2019b).Contributions: Here, we propose, analyze, and apply a projected SVGD method to tackle the curse of dimensionality for highdimensional nonlinear Bayesian inference problems, which relies on the fundamental property that data typically only inform a lowdimensional subspace of highdimensional parameters, e.g., (Bashir et al., 2008; BuiThanh & Ghattas, 2012; BuiThanh et al., 2013; Spantini et al., 2015; Isaac et al., 2015; Cui et al., 2016; Chen et al., 2017, 2019a; Chen & Ghattas, 2019; Bigoni et al., 2019)
and references therein. Specifically, our contributions are: (1) we perform dimension reduction by projecting the parameters to a lowdimensional subspace constructed using the gradient of the loglikelihood, and push the prior samples of the projection coefficients to their posterior by pSVGD; (2) we prove the equivalence of the projected transport in the coefficient space and the transport in the projected parameter space; (3) we provide an upper bound for the projection error committed in the posterior (with the likelihood approximated by an optimal profile function) in terms of the eigenvalues of a gradient information matrix; (4) we propose adaptive and parallel algorithms to efficiently approximate the optimal profile function and the gradient information matrix; and (5) we demonstrate the accuracy (compared to SVGD) and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores for a highdimensional nonlinear Bayesian inference problem.
The major differences of this work compared to pSVN (Chen et al., 2019b): (1) pSVGD uses only gradient information of the loglikelihood, which is available for many models, while pSVN requires Hessian information, which may not be available for complex models and codes in practical applications; (2) the upper bound for the projection error w.r.t. the posterior in terms of eigenvalues is much sharper than that for pSVN in terms of projection error in parameters; (3) here we theoretically prove the equivalence of the projected transport for the coefficient and the transport for the projected parameters; (4) we also investigate the convergence of pSVGD w.r.t. the number of parameters and the scalability of pSVGD w.r.t. the number of data points.
2 Preliminaries
2.1 Bayesian Inference
Let denote a random parameter of dimension , which has a continuous prior density . Let denote a set of i.i.d. observation data. Let denote, up to a multiplicative constant, a continuous likelihood of at given . Then the posterior density of parameter conditioned on data , denoted as , is given by Bayes’ rule as
(1) 
where is the normalization constant defined as
(2) 
whose computation is typically intractable, especially for a large and with a complex geometry in , e.g., multimodal, rich local behavior, etc. The central task of Bayesian inference is to draw samples of the parameter from its posterior distribution with density
, and compute some statistical quantity of interest, e.g., the mean and variance of the parameter
or some function of .2.2 Stein Variational Gradient Descent (SVGD)
SVGD is one type of variational inference method that seeks an approximation of the posterior density by a function in a predefined function set , which is realized by minimizing the Kullback–Leibler (KL) divergence that measures the difference between two densities, i.e.,
(3) 
where , i.e., the average of with respect to the density , which vanishes when . In particular, a transport based function set is considered as , where is a pushforward map that pushes the prior density to a new density through an invertible transport map in a space . Let be given by
(4) 
where is a differentiable perturbation map w.r.t. , and is small enough so that is invertible. It is shown in (Liu & Wang, 2016) that
(5) 
where is the Stein operator given by
(6) 
Moreover, by choosing the space
, a tensor product of a reproducing kernel Hilbert space (RKHS)
with kernel , the steepest descent direction of w.r.t. , denoted as , is explicitly given by(7) 
where
(8) 
Following this optimization perspective, we define a sequence of transport maps for as
(9) 
and , the density pushed from by the pushforward map . The function is the steepest descent direction of , which is given as in (7) with replaced by . is known as a learning rate that is suitably updated in each iteration. Convergence of to the posterior as is studied in (Liu, 2017). To this end, the sequence of transport maps provide a practical algorithm for approximately drawing samples from the posterior as: first draw samples from the prior , then update them subsequently as
(10) 
where , given by
(11) 
is a sample average approximation (SAA) of with samples . The first term of (11) drives the samples to regions of high posterior density and the second term acts as a repulsive force that pushes the samples away from each other to avoid collapsing to local modes of the posterior. Note that while the posterior defined in (1) involves the intractable normalization constant , does not, thus making the sample update by (11) computationally efficient. For the kernel , a common choice is Gaussian as in (Liu & Wang, 2016),
(12) 
where is the bandwidth, e.g., with med representing the median of sample distances.
The repulsive force term in (11) plays a critical role in determining the empirical distribution of the samples. However, it is known that the kernel function in general suffers from the curse of dimensionality (Ramdas et al., 2015; Zhuo et al., 2018; Wang et al., 2018), which makes the repulsive term useless in high dimensions and leads to samples not representative of the posterior, as observed in (Zhuo et al., 2018; Wang et al., 2018), in which the SVGD sample variance becomes very inaccurate for large .
3 Projected SVGD
3.1 Dimension reduction by projection
Let denote an orthogonal basis that spans the parameter space , where span a subspace of dimension and span its complement subspace . We define a linear projector of rank , , as
(13) 
where represents the projection matrix and
is the coefficient vector with element
for . By this projection, we can decompose as(14) 
where the projected parameter and its complement , with
and identity matrix
. By defining , and with elements for , we have . For any , with the coefficient vector of the projection (13), and any , we can decompose the prior density as(15) 
where the marginal density is defined as
(16) 
and the conditional density as
(17) 
We seek a profile function such that is a good approximation of the likelihood function for a given projector . We define a particular profile function as the marginalization or conditional expectation of w.r.t. the conditional density , i.e.,
(18) 
The following proposition, established in (Zahm et al., 2018), shows that is the optimal profile function w.r.t. the KL divergence between the posterior density defined in (1) and the projected posterior density defined as
(19) 
where .
Proposition 1.
Remark 1.
Once the projector is provided, defined in (18) is the optimal by Proposition 1. However, practical evaluation of is challenging because (1) the basis is typically not available; (2) it involves two dimensional integrals in (16) and (18), whose evaluation may be expensive. Serveral approximations were introduced in (Zahm et al., 2018). We propose an efficient approximation in Section 3.5 that is most suitable for pSVGD.
3.2 Construction of the basis
The basis introduced in last section plays a critical role for the accuracy of the approximation of the posterior density by the optimal projected posterior density . To construct a good basis, we first make the following assumption on the prior density.
Assumption 1.
The prior density satisfies for the two functions such that:
(1) is twice continuously differentiable, and there exists a symmetric positive definite matrix such that
for any at any ;
(2) is bounded in , and there exists a constant such that the maximum difference is bounded by
Remark 2.
Assumption (1) implies that the function is at least quadratically convex such that the density decays rapidly departing from the origin. Assumption (2) guarantees that for some constants . Priors satisfying the two assumptions are Gaussian or subGaussian whose tails decay at least as fast as that of Gaussian. For example, a parameter
satisfies Assumption 1 with and .By we denote a gradient information matrix, which is defined as the average of the outer product of the gradient of the loglikelihood w.r.t. the posterior, i.e.,
(21) 
By we denote the generalized eigenpairs of , with given in Assumption 1, which correspond to the largest eigenvalues , i.e.,
(22) 
In particular, for the development of pSVGD, we require instead of the conventional , being if and zero otherwise. We make the following important observation: By definition of , the eigenvalue measures the sensitivity of the data w.r.t. the parameters along direction , i.e., the data mostly inform parameters in directions corresponding to large eigenvalues . For small close to zero, the variation of the likelihood in direction is negligible.
More rigorously, the following proposition, as shown in (Zahm et al., 2018), provides an upper bound for the projection error committed in the optimally projected posterior in terms of the eigenvalues for .
Proposition 2.
Under Assumption 1, for the projector defined in (13) with the basis
taken as the eigenvectors of the generalized eigenvalue problem (
22), for any continuously differentiable likelihood function satisfying , we have(23) 
To distinguish different dimensions, we consider a relative projection error defined as .
Remark 3.
Proposition 2 implies that if the eigenvalues decay rapidly, then the projected posterior is close to the true posterior in KL divergence for a small . Moreover, as , if , which is usually true because of the essential illposedness of the inference problem due to parameter correlation, i.e., the observation data cannot inform all parameter dimensions if is large.
Remark 4.
Construction of the basis by is challenging since involves an integration w.r.t. the posterior distribution. We propose an algorithm in Section 3.5 for an adaptive approximation of and construction of the basis.
3.3 Bayesian inference of the coefficient
For any with decomposition (14), or equivalently
(24) 
by the density decomposition (15) we have
(25) 
For this decomposition, the projected posterior for in (19) can be written for and as
(26) 
We define a new function of the coefficient vector as
(27) 
which is indeed a density function of because by definition
(28) 
where in the second equality we used the property that the randomness of in can be fully represented by that of in due to . Therefore, we can view and as the prior and posterior densities of , and as the likelihood function. Then by the density decomposition (26), or equivalently
(29) 
sampling from the projected posterior distribution with density can be conceptually realized by (1) sampling from the posterior distribution with density , which is to solve a problem of Bayesian inference of the coefficient vector of dimension , and (2) sampling from the conditional distribution with density , and (3) assembling by the decomposition (24). However, this sampling scheme remains impractical since is in general not available or very expensive to compute, especially for large . We defer sampling to Section 3.5 which describes a practical algorithm and focus on sampling from the posterior by projected SVGD.
3.4 Projected SVGD
To sample from the posterior distribution of the coefficient vector with density given by Bayes’ rule in (27), we employ the SVGD method presented in Section 2.2 in the coefficient space , with . Specifically, we define a projected transport map as
(30) 
with a differentiable perturbation map , and a small enough such that is invertible. Following the argument in (Liu & Wang, 2016) on the result (5) for SVGD, we obtain
(31) 
where is the Stein operator given by
(32) 
To find the perturbation map , we consider the space , a tensor product of RKHS defined with kernel . Then the steepest descent direction of w.r.t. , denoted as , is explicitly given by
(33) 
where
(34) 
By the same optimization perspective in the first step, we can define a sequence of transport maps for
(35) 
and push forward the density as . Convergence of the density to the posterior density as can be obtained by following the same argument as in (Liu, 2017). Therefore, to draw samples from the posterior , we can first draw samples from the prior with density , and subsequently update them as
(36) 
where is a SAA of with samples , i.e., is given by
(37) 
where practical evaluation of the gradient of the logposterior , with the posterior density defined in (27), is presented in Section 3.5.
The kernel can be specified as in (12), i.e.,
(38) 
To account for data impact in different directions informed by the eigenvalues of (22), we propose to replace in (38) by with for the likelihood and for the prior.
The following theorem, proved in Appendix A, establishes the connection between pSVGD in the coefficient space and SVGD in the projected parameter space.
Theorem 1.
Under the condition for , with the kernel defined in (38) and defined in (12), at the optimal profile function in (18), we have the equivalence of the projected transport map for the coefficient in the steepest direction (33) and the transport map for the projected parameter in the steepest direction (7), as
(39) 
In particular, we have
(40) 
Remark 5.
The equivalence (39) implies that the pSVGD only updates the projected parameter , or the parameter in the subspace formed by the basis , while leaving the parameter in the complement subspace unchanged. We propose an adaptive scheme (Algorithm 3) that also updates the parameter in the complement subspace.
Remark 6.
The condition is satisfied for such that its marginal for , i.e., defined in (16), is the same as the anchored density at . Meanwhile, Gaussian priors satisfy this condition since where both and are Gaussian densities.
3.5 Practical algorithms
Sampling from the projected posterior defined in (19) with the optimal profile function defined in (18), involves, by the decomposition (27), sampling from the posterior by pSVGD and sampling from the conditional distribution with density . The sampling is impractical because of two challenges summarized as follows:

The basis for the complement subspace, used in (16) for the marginal prior density and in (18) for the optimal profile function , are generally not available or their computation by solving the generalized eigenvalue problem (22) is too expensive for large dimension . Meanwhile, both (16) and (18) involve highdimensional integrals.

The matrix defined in (21) for the construction of the basis involves integration w.r.t. the posterior distribution of the parameter . However, drawing samples from the posterior to evaluate the integral turns out to be the central task of the Bayesian inference.
Challenge 1 can be practically addressed by using the fundamental property pointed out in Section 3.2, i.e., the variation of likelihood in the complement subspace is negligible, which means that the posterior distribution meaningfully differs from the prior distribution only in the subspace . Therefore, to draw samples from the posterior, we only need to proceed by the abstract Algorithm 1.

Draw samples from the prior distribution, e.g.,

Make parameter decomposition by projection

Push the projected parameter in the subspace to the projected posterior distribution (19) as

Assemble the samples to approximate the posterior
Two important factors in Algorithm 1 have to be specified to make it practically operational: (i) the projector in step 2, and (ii) the transport map in step 3.
We address (ii) at first by proposing a practical pSVGD algorithm given the projector with basis . By the fact that the likelihood has negligible variation in the subspace , for any sample , , we can approximate the optimal profile function in (18) as
(41) 
which is equivalent to using one sample such that to approximate the integral (18). Similarly, we can specify the projected prior density at as
(42) 
By this specification, we avoid computing not only but also the highdimensional integral w.r.t. in both (16) and (18), thus addressing challenge 1. Moreover, under this specification, the gradient of the logposterior in (37) for the pSVGD method can be evaluated as
(43) 
Given the projector with basis , we summarize the pSVGD transport of samples in Algorithm 2.
In particular, by leveraging the property that the samples can be updated in parallel, we implement a parallel version of pSVGD using MPI for information communication in processor cores, each with different samples, thus producing different samples in total.
To construct the projector with basis , we approximate in (21) by SAA with posterior samples , i.e.,
(44) 
Since the posterior samples are not available, we propose to adaptively construct the basis with samples transported from the prior samples by pSVGD, which addresses challenge 2. This procedure is summarized in Algorithm 3. We remark that by the adaptive construction, we push the samples to their posterior in each subspace spanned by (possibly) different basis with different for different . With exploiting all the datainformed dimensions as , the samples may represent the true posterior without committing projection error as bounded in Proposition 2.
4 Numerical Experiments
We present a nonlinear Bayesian inference problem with highdimensional parameters to demonstrate the accuracy of pSVGD compared to SVGD, and the convergence and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores. A linear inference example, whose posterior is analytically given, is presented in Appendix B to demonstrate the accuracy of pSVGD compared to SVGD. The code is available at https://github.com/cpempire/pSVGD.
We consider a parametertoobservable map
(45) 
where is a nonlinear discrete solution map of the lognormal diffusion model in a unit square domain
(46) 
imposed with Dirichlet boundary condition on the top boundary and on bottom boundary, and homogeneous Neumann boundary condition on the left and right boundaries. is a divergence operator, and is a gradient operator. and are discretized by finite elements with piecewise linear elements in a uniform mesh of triangles of size . and are the nodal values of and . We consider a Gaussian distribution for with covariance , which leads to a Gaussian distribution for , where is discretized from . is a pointwise observation map at points equally distributed in . We consider an additive Gaussian noise with and for data
(47) 
which leads to the likelihood function
(48) 
We use a DILIMCMC algorithm (Cui et al., 2016) to generate effective posterior samples and use them to compute a reference sample variance. We run SVGD and the adaptive pSVGD (with ) using 256 samples and 200 iterations for different dimensions, both using line search to seek the step size . The comparison of accuracy can be observed in Figure 1. We can see that SVGD samples fail to capture the posterior distribution in high dimensions and become worse with increasing dimension, while pSVGD samples represent the posterior distribution well, measured by sample variance, and the approximation remains accurate with increasing dimension.
The accuracy of pSVGD can be further demonstrated by the significant decay (about 7 orders of magnitude) of the eigenvalues for different dimensions in the top of Figure 2. Only about 50 dimensions (with small relative projection error, about , committed in the posterior by Proposition 2) are preserved out of from 289 to 16,641 dimensions, representing over 300 dimension reduction for the last case. The similar decays of the eigenvalues in the projection rank and the averaged step norm in the number of iterations shown in Figure 2 imply that pSVGD is scalable w.r.t. the parameter dimension. Moreover, the similar decays for different sample size in Figure 3 demonstrate that pSVGD is scalable w.r.t. the number of samples . Furthermore, as displayed in the top of Figure 4, with increasing number of i.i.d. observation data points in a refined mesh of size , the eigenvalues decay at almost the same rate with similar relative projection error , and lead to similar reduction for such that , which implies weak scalability of pSVGD w.r.t. the number of data points. Lastly, from the bottom of Figure 4 by the nearly decay of CPU time we can see that pSVGD achieves strong parallel scalability (in computing gradient, kernel, and sample update) w.r.t. the number of processor cores for the same work with samples.
5 Conclusions and Future Work
We proposed, analyzed, and demonstrated pSVGD for Bayesian inference in high dimensions to tackle the critical challenge of curse of dimensionality. The projection can be adaptively constructed by using only the information of the gradient of the loglikelihood, which is available in SVGD algorithm, where the projection error committed in the posterior can be bounded by the truncated eigenvalues. We proved that pSVGD for the coefficient is equivalent to SVGD for the projected parameter under suitable assumptions. The accuracy (compared to SVGD), convergence, and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores were demonstrated by a highdimensional nonlinear Bayesian inference problem.
Further analysis of the adaptive pSVGD and its application to other highdimensional inference problems, e.g., Bayesian neural networks, is of interest. Another promising direction is to study the correlation, reduction by projection, and scalability in data dimension for big data applications.
References
 Bashir et al. (2008) Bashir, O., Willcox, K., Ghattas, O., van Bloemen Waanders, B., and Hill, J. Hessianbased model reduction for largescale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844–868, 2008.
 Beskos et al. (2017) Beskos, A., Girolami, M., Lan, S., Farrell, P. E., and Stuart, A. M. Geometric MCMC for infinitedimensional inverse problems. Journal of Computational Physics, 335:327 – 351, 2017.
 Bigoni et al. (2019) Bigoni, D., Zahm, O., Spantini, A., and Marzouk, Y. Greedy inference with layers of lazy maps. arXiv preprint arXiv:1906.00031, 2019.
 Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 BuiThanh & Ghattas (2012) BuiThanh, T. and Ghattas, O. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.
 BuiThanh et al. (2013) BuiThanh, T., Ghattas, O., Martin, J., and Stadler, G. A computational framework for infinitedimensional bayesian inverse problems part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
 Chen & Ghattas (2019) Chen, P. and Ghattas, O. Hessianbased sampling for highdimensional model reduction. International Journal for Uncertainty Quantification, 9(2), 2019.
 Chen & Schwab (2015) Chen, P. and Schwab, C. Sparsegrid, reducedbasis Bayesian inversion. Computer Methods in Applied Mechanics and Engineering, 297:84 – 115, 2015.
 Chen & Schwab (2016) Chen, P. and Schwab, C. Sparsegrid, reducedbasis Bayesian inversion: Nonaffineparametric nonlinear equations. Journal of Computational Physics, 316:470 – 503, 2016.
 Chen et al. (2017) Chen, P., Villa, U., and Ghattas, O. Hessianbased adaptive sparse quadrature for infinitedimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172, 2017.
 Chen et al. (2019a) Chen, P., Villa, U., and Ghattas, O. Taylor approximation and variance reduction for PDEconstrained optimal control problems under uncertainty. Journal of Computational Physics, 2019a. To appear.
 Chen et al. (2019b) Chen, P., Wu, K., Chen, J., O’LearyRoseberry, T., and Ghattas, O. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, pp. 15104–15113, 2019b.
 Chen et al. (2018) Chen, W. Y., Mackey, L., Gorham, J., Briol, F.X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018.
 Constantine et al. (2016) Constantine, P. G., Kent, C., and BuiThanh, T. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
 Cui et al. (2016) Cui, T., Law, K. J., and Marzouk, Y. M. Dimensionindependent likelihoodinformed MCMC. Journal of Computational Physics, 304:109–137, 2016.
 Detommaso et al. (2018) Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.
 Detommaso et al. (2019) Detommaso, G., Kruse, J., Ardizzone, L., Rother, C., Köthe, U., and Scheichl, R. HINT: Hierarchical invertible neural transport for general and sequential Bayesian inference.
Comments
There are no comments yet.