# Projected Stein Variational Gradient Descent

The curse of dimensionality is a critical challenge in Bayesian inference for high dimensional parameters. In this work, we address this challenge by developing a projected Stein variational gradient descent (pSVGD) method, which projects the parameters into a subspace that is adaptively constructed using the gradient of the log-likelihood, and applies SVGD for the much lower-dimensional coefficients of the projection. We provide an upper bound for the projection error with respect to the posterior and demonstrate the accuracy (compared to SVGD) and scalability of pSVGD with respect to the number of parameters, samples, data points, and processor cores.

## Authors

• 50 publications
• 20 publications
05/29/2020

### Bayesian Neural Network via Stochastic Gradient Descent

The goal of bayesian approach used in variational inference is to minimi...
11/02/2020

### Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities

We propose a high dimensional Bayesian inference framework for learning ...
02/25/2020

### Stein variational reduced basis Bayesian inversion

We propose and analyze a Stein variational reduced basis method (SVRB) t...
12/02/2019

### On the geometry of Stein variational gradient descent

Bayesian inference problems require sampling or approximating high-dimen...
07/20/2017

### Learning to Draw Samples with Amortized Stein Variational Gradient Descent

We propose a simple algorithm to train stochastic neural networks to dra...
01/31/2020

### Learning Unitaries by Gradient Descent

We study the hardness of learning unitary transformations by performing ...
12/20/2014

### Explorations on high dimensional landscapes

Finding minima of a real valued non-convex function over a high dimensio...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given observation data for a system with unknown parameters, Bayesian inference provides an optimal probability framework for learning the parameters by updating their prior distribution to a posterior distribution. However, many conventional methods for solving high-dimensional Bayesian inference problems face the curse of dimensionality, i.e., the computational complexity grows rapidly, often exponentially, with respect to (w.r.t.) the number of parameters. To address the curse of dimensionality, the intrinsic properties of the posterior distribution, such as its smoothness, sparsity, and intrinsic low-dimensionality, have been exploited to reduce the parameter correlation and develop efficient methods whose complexity grows slowly or remains the same with increasing dimension. By exploiting the geometry of the log-likelihood function, accelerated Markov chain Monte Carlo (MCMC) methods have been developed to reduce the sample correlation or increase effective sample size independent of the dimension

(Girolami & Calderhead, 2011; Martin et al., 2012; Petra et al., 2014; Constantine et al., 2016; Cui et al., 2016; Beskos et al., 2017). Nevertheless, these random and essentially serial sampling methods remain prohibitive for large-scale inference problems with expensive likelihoods. Deterministic methods using sparse quadratures (Schwab & Stuart, 2012; Schillings & Schwab, 2013; Chen & Schwab, 2015, 2016) were shown to converge rapidly with dimension-independent rates for problems with smooth and sparse posteriors. However, for posteriors lacking smoothness or sparsity, the convergence deteriorates significantly, despite incorporation of Hessian-based transformations (Schillings & Schwab, 2016; Chen et al., 2017).

Transport-based variational inference is another type of deterministic method that seeks a transport map in a function space (represented by, e.g., polynomials, kernels, or neural networks) that pushes the prior to the posterior by minimizing the difference between the transported prior and the posterior, measured in, e.g., Kullback–Leibler divergence

(Marzouk et al., 2016; Liu & Wang, 2016; Blei et al., 2017; Bigoni et al., 2019; Detommaso et al., 2019). In particular, kernel-based Stein variational methods, using gradient-based (SVGD) (Liu & Wang, 2016; Chen et al., 2018; Liu & Zhu, 2018) and Hessian-based (SVN) (Detommaso et al., 2018; Wang & Li, 2020) optimization methods, are shown to achieve fast convergence in relatively low dimensions. Nonetheless, the convergence and accuracy of these methods deteriorates in high dimensions due to the curse of dimensionality in kernel representation. This can be partially addressed by a localized SVGD on Markov blankets, which relies on a conditional independence structure of the target distribution (Zhuo et al., 2018; Wang et al., 2018), or by a projected SVN that projects the parameter in a low-dimensional subspace (Chen et al., 2019b).

Contributions: Here, we propose, analyze, and apply a projected SVGD method to tackle the curse of dimensionality for high-dimensional nonlinear Bayesian inference problems, which relies on the fundamental property that data typically only inform a low-dimensional subspace of high-dimensional parameters, e.g., (Bashir et al., 2008; Bui-Thanh & Ghattas, 2012; Bui-Thanh et al., 2013; Spantini et al., 2015; Isaac et al., 2015; Cui et al., 2016; Chen et al., 2017, 2019a; Chen & Ghattas, 2019; Bigoni et al., 2019)

and references therein. Specifically, our contributions are: (1) we perform dimension reduction by projecting the parameters to a low-dimensional subspace constructed using the gradient of the log-likelihood, and push the prior samples of the projection coefficients to their posterior by pSVGD; (2) we prove the equivalence of the projected transport in the coefficient space and the transport in the projected parameter space; (3) we provide an upper bound for the projection error committed in the posterior (with the likelihood approximated by an optimal profile function) in terms of the eigenvalues of a gradient information matrix; (4) we propose adaptive and parallel algorithms to efficiently approximate the optimal profile function and the gradient information matrix; and (5) we demonstrate the accuracy (compared to SVGD) and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores for a high-dimensional nonlinear Bayesian inference problem.

The major differences of this work compared to pSVN (Chen et al., 2019b): (1) pSVGD uses only gradient information of the log-likelihood, which is available for many models, while pSVN requires Hessian information, which may not be available for complex models and codes in practical applications; (2) the upper bound for the projection error w.r.t. the posterior in terms of eigenvalues is much sharper than that for pSVN in terms of projection error in parameters; (3) here we theoretically prove the equivalence of the projected transport for the coefficient and the transport for the projected parameters; (4) we also investigate the convergence of pSVGD w.r.t. the number of parameters and the scalability of pSVGD w.r.t. the number of data points.

## 2 Preliminaries

### 2.1 Bayesian Inference

Let denote a random parameter of dimension , which has a continuous prior density . Let denote a set of i.i.d. observation data. Let denote, up to a multiplicative constant, a continuous likelihood of at given . Then the posterior density of parameter conditioned on data , denoted as , is given by Bayes’ rule as

 p(x)=1Zf(x)p0(x), (1)

where is the normalization constant defined as

 Z=∫Rdf(x)p0(x)dx, (2)

whose computation is typically intractable, especially for a large and with a complex geometry in , e.g., multimodal, rich local behavior, etc. The central task of Bayesian inference is to draw samples of the parameter from its posterior distribution with density

, and compute some statistical quantity of interest, e.g., the mean and variance of the parameter

or some function of .

### 2.2 Stein Variational Gradient Descent (SVGD)

SVGD is one type of variational inference method that seeks an approximation of the posterior density by a function in a predefined function set , which is realized by minimizing the Kullback–Leibler (KL) divergence that measures the difference between two densities, i.e.,

 q∗=argminq∈QDKL(q|p), (3)

where , i.e., the average of with respect to the density , which vanishes when . In particular, a transport based function set is considered as , where is a pushforward map that pushes the prior density to a new density through an invertible transport map in a space . Let be given by

 T(x)=x+ϵϕ(x), (4)

where is a differentiable perturbation map w.r.t. , and is small enough so that is invertible. It is shown in (Liu & Wang, 2016) that

 ∇ϵDKL(T♯p0|p)∣∣ϵ=0=−Ex∼p0[trace(Apϕ(x))], (5)

where is the Stein operator given by

 Apϕ(x)=∇xlogp(x)ϕ(x)T+∇xϕ(x). (6)

Moreover, by choosing the space

, a tensor product of a reproducing kernel Hilbert space (RKHS)

with kernel , the steepest descent direction of w.r.t. , denoted as , is explicitly given by

 ϕ∗p0,p(⋅)=Ex∼p0[Apk(x,⋅)], (7)

where

 Apk(x,⋅)=∇xlogp(x)k(x,⋅)+∇xk(x,⋅). (8)

Following this optimization perspective, we define a sequence of transport maps for as

 Tℓ(x)=x+ϵℓϕ∗pℓ,p(x), (9)

and , the density pushed from by the pushforward map . The function is the steepest descent direction of , which is given as in (7) with replaced by . is known as a learning rate that is suitably updated in each iteration. Convergence of to the posterior as is studied in (Liu, 2017). To this end, the sequence of transport maps provide a practical algorithm for approximately drawing samples from the posterior as: first draw samples from the prior , then update them subsequently as

 xℓ+1m=xℓm+ϵl^ϕ∗pℓ,p(xℓm),m=1,…,N, (10)

where , given by

 1NN∑n=1∇xℓnlogp(xℓn)k(xℓn,xℓm)+∇xℓnk(xℓn,xℓm), (11)

is a sample average approximation (SAA) of with samples . The first term of (11) drives the samples to regions of high posterior density and the second term acts as a repulsive force that pushes the samples away from each other to avoid collapsing to local modes of the posterior. Note that while the posterior defined in (1) involves the intractable normalization constant , does not, thus making the sample update by (11) computationally efficient. For the kernel , a common choice is Gaussian as in (Liu & Wang, 2016),

 k(x,x′)=exp(−||x−x′||22h), (12)

where is the bandwidth, e.g., with med representing the median of sample distances.

The repulsive force term in (11) plays a critical role in determining the empirical distribution of the samples. However, it is known that the kernel function in general suffers from the curse of dimensionality (Ramdas et al., 2015; Zhuo et al., 2018; Wang et al., 2018), which makes the repulsive term useless in high dimensions and leads to samples not representative of the posterior, as observed in (Zhuo et al., 2018; Wang et al., 2018), in which the SVGD sample variance becomes very inaccurate for large .

## 3 Projected SVGD

### 3.1 Dimension reduction by projection

Let denote an orthogonal basis that spans the parameter space , where span a subspace of dimension and span its complement subspace . We define a linear projector of rank , , as

 Prx:=r∑i=1ψiψTix=Ψrw,∀x∈Rd, (13)

where represents the projection matrix and

is the coefficient vector with element

for . By this projection, we can decompose as

 x=xr+x⊥, (14)

where the projected parameter and its complement , with

. By defining , and with elements for , we have . For any , with the coefficient vector of the projection (13), and any , we can decompose the prior density as

 p0(Ψrw+Ψ⊥v⊥)=pr0(w)p⊥0(v⊥|w), (15)

where the marginal density is defined as

 pr0(w)=∫Rd−rp0(Ψrw+Ψ⊥v⊥)dv⊥, (16)

and the conditional density as

 p⊥0(v⊥|w)=p0(Ψrw+Ψ⊥v⊥)/pr0(w). (17)

We seek a profile function such that is a good approximation of the likelihood function for a given projector . We define a particular profile function as the marginalization or conditional expectation of w.r.t. the conditional density , i.e.,

 g∗(Prx)=∫Rd−rf(Prx+Ψ⊥v⊥)p⊥0(v⊥|w)dv⊥. (18)

The following proposition, established in (Zahm et al., 2018), shows that is the optimal profile function w.r.t. the KL divergence between the posterior density defined in (1) and the projected posterior density defined as

 pr(x):=1Zrg(Prx)p0(x), (19)

where .

###### Proposition 1.

Given any projector defined in (13), let and denote the projected posterior densities defined in (19) for any given profile function and the profile function defined in (18), respectively, then we have

 DKL(p|p∗r)≤DKL(p|pr), (20)

where is the posterior density defined in (1).

###### Remark 1.

Once the projector is provided, defined in (18) is the optimal by Proposition 1. However, practical evaluation of is challenging because (1) the basis is typically not available; (2) it involves two -dimensional integrals in (16) and (18), whose evaluation may be expensive. Serveral approximations were introduced in (Zahm et al., 2018). We propose an efficient approximation in Section 3.5 that is most suitable for pSVGD.

### 3.2 Construction of the basis

The basis introduced in last section plays a critical role for the accuracy of the approximation of the posterior density by the optimal projected posterior density . To construct a good basis, we first make the following assumption on the prior density.

###### Assumption 1.

The prior density satisfies for the two functions such that:

(1) is twice continuously differentiable, and there exists a symmetric positive definite matrix such that

 zT∇2V(x)z≥zTΓz

for any at any ;

(2) is bounded in , and there exists a constant such that the maximum difference is bounded by

 supx∈RdW(x)−infx∈RdW(x)≤log(γ).
###### Remark 2.

Assumption (1) implies that the function is at least quadratically convex such that the density decays rapidly departing from the origin. Assumption (2) guarantees that for some constants . Priors satisfying the two assumptions are Gaussian or sub-Gaussian whose tails decay at least as fast as that of Gaussian. For example, a parameter

satisfies Assumption 1 with and .

By we denote a gradient information matrix, which is defined as the average of the outer product of the gradient of the log-likelihood w.r.t. the posterior, i.e.,

 H=∫Rd(∇xlogf(x))(∇xlogf(x))Tp(x)dx. (21)

By we denote the generalized eigenpairs of , with given in Assumption 1, which correspond to the largest eigenvalues , i.e.,

 Hψi=λiΓψi. (22)

In particular, for the development of pSVGD, we require instead of the conventional , being if and zero otherwise. We make the following important observation: By definition of , the eigenvalue measures the sensitivity of the data w.r.t. the parameters along direction , i.e., the data mostly inform parameters in directions corresponding to large eigenvalues . For small close to zero, the variation of the likelihood in direction is negligible.

More rigorously, the following proposition, as shown in (Zahm et al., 2018), provides an upper bound for the projection error committed in the optimally projected posterior in terms of the eigenvalues for .

###### Proposition 2.

Under Assumption 1, for the projector defined in (13) with the basis

taken as the eigenvectors of the generalized eigenvalue problem (

22), for any continuously differentiable likelihood function satisfying , we have

 DKL(p|p∗r)≤γ2d∑i=r+1λi. (23)

To distinguish different dimensions, we consider a relative projection error defined as .

###### Remark 3.

Proposition 2 implies that if the eigenvalues decay rapidly, then the projected posterior is close to the true posterior in KL divergence for a small . Moreover, as , if , which is usually true because of the essential ill-posedness of the inference problem due to parameter correlation, i.e., the observation data cannot inform all parameter dimensions if is large.

###### Remark 4.

Construction of the basis by is challenging since involves an integration w.r.t. the posterior distribution. We propose an algorithm in Section 3.5 for an adaptive approximation of and construction of the basis.

### 3.3 Bayesian inference of the coefficient

For any with decomposition (14), or equivalently

 x=Ψrw+Ψ⊥w⊥, (24)

by the density decomposition (15) we have

 p0(x)=pr0(w)p⊥0(w⊥|w). (25)

For this decomposition, the projected posterior for in (19) can be written for and as

 pr(x)=1Zrg(Ψrw)pr0(w)p⊥0(w⊥|w). (26)

We define a new function of the coefficient vector as

 pr(w):=1Zrg(Ψrw)pr0(w), (27)

which is indeed a density function of because by definition

 Zr=Ex∼p0[g(Prx)]=Ew∼pr0[g(Ψrw)], (28)

where in the second equality we used the property that the randomness of in can be fully represented by that of in due to . Therefore, we can view and as the prior and posterior densities of , and as the likelihood function. Then by the density decomposition (26), or equivalently

 pr(x)=pr(w)p⊥0(w⊥|w), (29)

sampling from the projected posterior distribution with density can be conceptually realized by (1) sampling from the posterior distribution with density , which is to solve a problem of Bayesian inference of the coefficient vector of dimension , and (2) sampling from the conditional distribution with density , and (3) assembling by the decomposition (24). However, this sampling scheme remains impractical since is in general not available or very expensive to compute, especially for large . We defer sampling to Section 3.5 which describes a practical algorithm and focus on sampling from the posterior by projected SVGD.

### 3.4 Projected SVGD

To sample from the posterior distribution of the coefficient vector with density given by Bayes’ rule in (27), we employ the SVGD method presented in Section 2.2 in the coefficient space , with . Specifically, we define a projected transport map as

 Tr(w)=w+ϵϕr(w), (30)

with a differentiable perturbation map , and a small enough such that is invertible. Following the argument in (Liu & Wang, 2016) on the result (5) for SVGD, we obtain

 ∇ϵDKL(Tr♯pr0|pr)∣∣ϵ=0=−Ew∼pr0[trace(Aprϕr(w))], (31)

where is the Stein operator given by

 Aprϕr(w)=∇wlogpr(w)ϕr(w)T+∇wϕr(w). (32)

To find the perturbation map , we consider the space , a tensor product of RKHS defined with kernel . Then the steepest descent direction of w.r.t. , denoted as , is explicitly given by

 ϕr,∗pr0,pr(⋅)=Ew∼pr0[Aprkr(w,⋅)], (33)

where

 Aprkr(w,⋅)=∇wlogpr(w)kr(w,⋅)+∇wkr(w,⋅). (34)

By the same optimization perspective in the first step, we can define a sequence of transport maps for

 Trℓ(w)=w+ϵℓϕ∗,rprℓ,pr(w), (35)

and push forward the density as . Convergence of the density to the posterior density as can be obtained by following the same argument as in (Liu, 2017). Therefore, to draw samples from the posterior , we can first draw samples from the prior with density , and subsequently update them as

 wℓ+1m=wℓm+ϵℓ^ϕr,∗prℓ,pr(wℓm),m=1,…,N, (36)

where is a SAA of with samples , i.e., is given by

 1NN∑n=1∇wℓnlogpr(wℓn)kr(wℓn,wℓm)+∇wℓnkr(wℓn,wℓm), (37)

where practical evaluation of the gradient of the log-posterior , with the posterior density defined in (27), is presented in Section 3.5.

The kernel can be specified as in (12), i.e.,

 kr(w,w′)=exp(−||w−w′||22h). (38)

To account for data impact in different directions informed by the eigenvalues of (22), we propose to replace in (38) by with for the likelihood and for the prior.

The following theorem, proved in Appendix A, establishes the connection between pSVGD in the coefficient space and SVGD in the projected parameter space.

###### Theorem 1.

Under the condition for , with the kernel defined in (38) and defined in (12), at the optimal profile function in (18), we have the equivalence of the projected transport map for the coefficient in the steepest direction (33) and the transport map for the projected parameter in the steepest direction (7), as

 Tr(w)=ΨTrT(Prx). (39)

In particular, we have

 ∇wlogpr(w)=ΨTr∇xlogpr(Prx), (40)

with the posteriors defined in (27) and defined in (19).

###### Remark 5.

The equivalence (39) implies that the pSVGD only updates the projected parameter , or the parameter in the subspace formed by the basis , while leaving the parameter in the complement subspace unchanged. We propose an adaptive scheme (Algorithm 3) that also updates the parameter in the complement subspace.

###### Remark 6.

The condition is satisfied for such that its marginal for , i.e., defined in (16), is the same as the anchored density at . Meanwhile, Gaussian priors satisfy this condition since where both and are Gaussian densities.

### 3.5 Practical algorithms

Sampling from the projected posterior defined in (19) with the optimal profile function defined in (18), involves, by the decomposition (27), sampling from the posterior by pSVGD and sampling from the conditional distribution with density . The sampling is impractical because of two challenges summarized as follows:

1. The basis for the complement subspace, used in (16) for the marginal prior density and in (18) for the optimal profile function , are generally not available or their computation by solving the generalized eigenvalue problem (22) is too expensive for large dimension . Meanwhile, both (16) and (18) involve high-dimensional integrals.

2. The matrix defined in (21) for the construction of the basis involves integration w.r.t. the posterior distribution of the parameter . However, drawing samples from the posterior to evaluate the integral turns out to be the central task of the Bayesian inference.

Challenge 1 can be practically addressed by using the fundamental property pointed out in Section 3.2, i.e., the variation of likelihood in the complement subspace is negligible, which means that the posterior distribution meaningfully differs from the prior distribution only in the subspace . Therefore, to draw samples from the posterior, we only need to proceed by the abstract Algorithm 1.

Two important factors in Algorithm 1 have to be specified to make it practically operational: (i) the projector in step 2, and (ii) the transport map in step 3.

We address (ii) at first by proposing a practical pSVGD algorithm given the projector with basis . By the fact that the likelihood has negligible variation in the subspace , for any sample , , we can approximate the optimal profile function in (18) as

 g∗(Prxℓn)=f(Prxℓn+x⊥n), (41)

which is equivalent to using one sample such that to approximate the integral (18). Similarly, we can specify the projected prior density at as

 pr0(wℓn)=p0(Ψrwℓn+x⊥n). (42)

By this specification, we avoid computing not only but also the high-dimensional integral w.r.t.  in both (16) and (18), thus addressing challenge 1. Moreover, under this specification, the gradient of the log-posterior in (37) for the pSVGD method can be evaluated as

 ∇wℓnlogpr(wℓn)=ΨTr∇xlogp(Ψrwℓn+x⊥n). (43)

Given the projector with basis , we summarize the pSVGD transport of samples in Algorithm 2.

In particular, by leveraging the property that the samples can be updated in parallel, we implement a parallel version of pSVGD using MPI for information communication in processor cores, each with different samples, thus producing different samples in total.

To construct the projector with basis , we approximate in (21) by SAA with posterior samples , i.e.,

 ^H:=1MM∑m=1∇xlogf(xm)(∇xlogf(xm))T. (44)

Since the posterior samples are not available, we propose to adaptively construct the basis with samples transported from the prior samples by pSVGD, which addresses challenge 2. This procedure is summarized in Algorithm 3. We remark that by the adaptive construction, we push the samples to their posterior in each subspace spanned by (possibly) different basis with different for different . With exploiting all the data-informed dimensions as , the samples may represent the true posterior without committing projection error as bounded in Proposition 2.

## 4 Numerical Experiments

We present a nonlinear Bayesian inference problem with high-dimensional parameters to demonstrate the accuracy of pSVGD compared to SVGD, and the convergence and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores. A linear inference example, whose posterior is analytically given, is presented in Appendix B to demonstrate the accuracy of pSVGD compared to SVGD. The code is available at https://github.com/cpempire/pSVGD.

We consider a parameter-to-observable map

 h(x)=O∘S(x), (45)

where is a nonlinear discrete solution map of the log-normal diffusion model in a unit square domain

 −∇⋅(ex∇u)=0,in (0,1)2, (46)

imposed with Dirichlet boundary condition on the top boundary and on bottom boundary, and homogeneous Neumann boundary condition on the left and right boundaries. is a divergence operator, and is a gradient operator. and are discretized by finite elements with piecewise linear elements in a uniform mesh of triangles of size . and are the nodal values of and . We consider a Gaussian distribution for with covariance , which leads to a Gaussian distribution for , where is discretized from . is a pointwise observation map at points equally distributed in . We consider an additive Gaussian noise with and for data

 y=h(x)+ξ, (47)

which leads to the likelihood function

 f(x)=exp(−12||y−h(x)||2Σ−1ξ). (48)

We use a DILI-MCMC algorithm (Cui et al., 2016) to generate effective posterior samples and use them to compute a reference sample variance. We run SVGD and the adaptive pSVGD (with ) using 256 samples and 200 iterations for different dimensions, both using line search to seek the step size . The comparison of accuracy can be observed in Figure 1. We can see that SVGD samples fail to capture the posterior distribution in high dimensions and become worse with increasing dimension, while pSVGD samples represent the posterior distribution well, measured by sample variance, and the approximation remains accurate with increasing dimension.

The accuracy of pSVGD can be further demonstrated by the significant decay (about 7 orders of magnitude) of the eigenvalues for different dimensions in the top of Figure 2. Only about 50 dimensions (with small relative projection error, about , committed in the posterior by Proposition 2) are preserved out of from 289 to 16,641 dimensions, representing over 300 dimension reduction for the last case. The similar decays of the eigenvalues in the projection rank and the averaged step norm in the number of iterations shown in Figure 2 imply that pSVGD is scalable w.r.t. the parameter dimension. Moreover, the similar decays for different sample size in Figure 3 demonstrate that pSVGD is scalable w.r.t. the number of samples . Furthermore, as displayed in the top of Figure 4, with increasing number of i.i.d. observation data points in a refined mesh of size , the eigenvalues decay at almost the same rate with similar relative projection error , and lead to similar reduction for such that , which implies weak scalability of pSVGD w.r.t. the number of data points. Lastly, from the bottom of Figure 4 by the nearly decay of CPU time we can see that pSVGD achieves strong parallel scalability (in computing gradient, kernel, and sample update) w.r.t. the number of processor cores for the same work with samples.

## 5 Conclusions and Future Work

We proposed, analyzed, and demonstrated pSVGD for Bayesian inference in high dimensions to tackle the critical challenge of curse of dimensionality. The projection can be adaptively constructed by using only the information of the gradient of the log-likelihood, which is available in SVGD algorithm, where the projection error committed in the posterior can be bounded by the truncated eigenvalues. We proved that pSVGD for the coefficient is equivalent to SVGD for the projected parameter under suitable assumptions. The accuracy (compared to SVGD), convergence, and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores were demonstrated by a high-dimensional nonlinear Bayesian inference problem.

Further analysis of the adaptive pSVGD and its application to other high-dimensional inference problems, e.g., Bayesian neural networks, is of interest. Another promising direction is to study the correlation, reduction by projection, and scalability in data dimension for big data applications.

## References

• Bashir et al. (2008) Bashir, O., Willcox, K., Ghattas, O., van Bloemen Waanders, B., and Hill, J. Hessian-based model reduction for large-scale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844–868, 2008.
• Beskos et al. (2017) Beskos, A., Girolami, M., Lan, S., Farrell, P. E., and Stuart, A. M. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327 – 351, 2017.
• Bigoni et al. (2019) Bigoni, D., Zahm, O., Spantini, A., and Marzouk, Y. Greedy inference with layers of lazy maps. arXiv preprint arXiv:1906.00031, 2019.
• Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
• Bui-Thanh & Ghattas (2012) Bui-Thanh, T. and Ghattas, O. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.
• Bui-Thanh et al. (2013) Bui-Thanh, T., Ghattas, O., Martin, J., and Stadler, G. A computational framework for infinite-dimensional bayesian inverse problems part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
• Chen & Ghattas (2019) Chen, P. and Ghattas, O. Hessian-based sampling for high-dimensional model reduction. International Journal for Uncertainty Quantification, 9(2), 2019.
• Chen & Schwab (2015) Chen, P. and Schwab, C. Sparse-grid, reduced-basis Bayesian inversion. Computer Methods in Applied Mechanics and Engineering, 297:84 – 115, 2015.
• Chen & Schwab (2016) Chen, P. and Schwab, C. Sparse-grid, reduced-basis Bayesian inversion: Nonaffine-parametric nonlinear equations. Journal of Computational Physics, 316:470 – 503, 2016.
• Chen et al. (2017) Chen, P., Villa, U., and Ghattas, O. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172, 2017.
• Chen et al. (2019a) Chen, P., Villa, U., and Ghattas, O. Taylor approximation and variance reduction for PDE-constrained optimal control problems under uncertainty. Journal of Computational Physics, 2019a. To appear.
• Chen et al. (2019b) Chen, P., Wu, K., Chen, J., O’Leary-Roseberry, T., and Ghattas, O. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, pp. 15104–15113, 2019b.
• Chen et al. (2018) Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018.
• Constantine et al. (2016) Constantine, P. G., Kent, C., and Bui-Thanh, T. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
• Cui et al. (2016) Cui, T., Law, K. J., and Marzouk, Y. M. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
• Detommaso et al. (2018) Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.
• Detommaso et al. (2019) Detommaso, G., Kruse, J., Ardizzone, L., Rother, C., Köthe, U., and Scheichl, R. HINT: Hierarchical invertible neural transport for general and sequential Bayesian inference.