A Stein variational Newton method

Stein variational gradient descent (SVGD) was recently proposed as a general purpose nonparametric variational inference algorithm [Liu & Wang, NIPS 2016]: it minimizes the Kullback-Leibler divergence between the target distribution and its approximation by implementing a form of functional gradient descent on a reproducing kernel Hilbert space. In this paper, we accelerate and generalize the SVGD algorithm by including second-order information, thereby approximating a Newton-like iteration in function space. We also show how second-order information can lead to more effective choices of kernel. We observe significant computational gains over the original SVGD algorithm in multiple test cases.

• 6 publications
• 17 publications
• 27 publications
• 22 publications
• 6 publications
08/16/2016

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

We propose a general purpose variational inference algorithm that forms ...
04/19/2022

A stochastic Stein Variational Newton method

Stein variational gradient descent (SVGD) is a general-purpose optimizat...
03/30/2017

Diving into the shallows: a computational perspective on large-scale shallow learning

In this paper we first identify a basic limitation in gradient descent-b...
06/18/2020

When Does Preconditioning Help or Hurt Generalization?

While second order optimizers such as natural gradient descent (NGD) oft...
11/13/2017

Analyzing and Improving Stein Variational Gradient Descent for High-dimensional Marginal Inference

Stein variational gradient descent (SVGD) is a nonparametric inference m...
05/24/2021

It has long been a goal to efficiently compute and use second order info...
11/15/2017

Variational Adaptive-Newton Method for Explorative Learning

We present the Variational Adaptive Newton (VAN) method which is a black...

1 Introduction

Approximating an intractable probability distribution via a collection of samples—in order to evaluate arbitrary expectations over the distribution, or to otherwise characterize uncertainty that the distribution encodes—is a core computational challenge in statistics and machine learning. Common features of the target distribution can make sampling a daunting task. For instance, in a typical Bayesian inference problem, the posterior distribution might be strongly non-Gaussian (perhaps multimodal) and high dimensional, and evaluations of its density might be computationally intensive.

There exist a wide range of algorithms for such problems, ranging from parametric variational inference blei2017variational

to Markov chain Monte Carlo (MCMC) techniques

gilks1995markov

. Each algorithm offers a different computational trade-off. At one end of the spectrum, we find the parametric mean-field approximation—a cheap but potentially inaccurate variational approximation of the target density. At the other end, we find MCMC—a nonparametric sampling technique yielding estimators that are consistent, but potentially slow to converge. In this paper, we focus on Stein variational gradient descent (SVGD)

liu2016svgd , which lies somewhere in the middle of the spectrum and can be described as a particular nonparametric variational inference method blei2017variational , with close links to the density estimation approach in AnderesCoram2012 .

The SVGD algorithm seeks a deterministic coupling between a tractable reference distribution of choice (e.g., a standard normal) and the intractable target. This coupling is induced by a transport map that can transform a collection of reference samples into samples from the desired target distribution. For a given pair of distributions, there may exist infinitely many such maps villani2008optimal ; several existing algorithms (e.g., tabak2013family ; rezende2015variational ; marzouk2016sampling ) aim to approximate feasible transport maps of various forms. The distinguishing feature of the SVGD algorithm lies in its definition of a suitable map . Its central idea is to approximate as a growing composition of simple maps, computed sequentially:

 T=T1∘⋯∘Tk∘⋯, (1)

where each map is a perturbation of the identity map along the steepest descent direction of a functional that describes the Kullback–Leibler (KL) divergence between the pushforward of the reference distribution through the composition and the target distribution. The steepest descent direction is further projected onto a reproducing kernel Hilbert space (RKHS) in order to give a nonparametric closed form aronszajn1950theory . Even though the resulting map is available explicitly without any need for numerical optimization, the SVGD algorithm implicitly approximates a steepest descent iteration on a space of maps of given regularity.

A primary goal of this paper is to explore the use of second-order information (e.g., Hessians) within the SVGD algorithm. Our idea is to develop the analogue of a Newton iteration—rather than gradient descent—for the purpose of sampling distributions more efficiently. Specifically, we design an algorithm where each map is now computed as the perturbation of the identity function along the direction that minimizes a certain local quadratic approximation of . Accounting for second-order information can dramatically accelerate convergence to the target distribution, at the price of additional work per iteration. The tradeoff between speed of convergence and cost per iteration is resolved in favor of the Newton-like algorithm—which we call a Stein variational Newton method (SVN)—in several numerical examples.

The efficiency of the SVGD and SVN algorithms depends further on the choice of reproducing kernel. A second contribution of this paper is to design geometry-aware Gaussian kernels that also exploit second-order information, yielding substantially faster convergence towards the target distribution than SVGD or SVN with an isotropic kernel.

In the context of parametric variational inference, second-order information has been used to accelerate the convergence of certain variational approximations, e.g., khan2017adaptivenewton ; khal2017viRMSprop ; marzouk2016sampling . In this paper, however, we focus on nonparametric variational approximations, where the corresponding optimisation problem is defined over an infinite-dimensional RKHS of transport maps. More closely related to our work is the Riemannian SVGD algorithm liu2017riemannian , which generalizes a gradient flow interpretation of SVGD liu2017stein to Riemannian manifolds, and thus also exploits geometric information within the inference task.

The rest of the paper is organized as follows. Section 2 briefly reviews the SVGD algorithm, and Section 3 introduces the new SVN method. In Section 4 we introduce geometry-aware kernels for the SVN method. Numerical experiments are described in Section 5. Proofs of our main results and further numerical examples addressing scaling to high dimensions are given in the Appendix. Code and all numerical examples are collected in our GitHub repository github .

2 Background

Suppose we wish to approximate an intractable target distribution with density on via an empirical measure, i.e., a collection of samples. Given samples from a tractable reference density on , one can seek a transport map such that the pushforward density of under , denoted by , is a close approximation to the target .111 If is invertible, then . There exist infinitely many such maps villani2008optimal . The image of the reference samples under the map, , can then serve as an empirical measure approximation of (e.g., in the weak sense liu2016svgd ).

Variational approximation.

Using the KL divergence to measure the discrepancy between the target and the pushforward , one can look for a transport map that minimises the functional

 T↦DKL(T∗p||π) (2)

over a broad class of functions. The Stein variational method breaks the minimization of (2) into several simple steps: it builds a sequence of transport maps to iteratively push an initial reference density towards . Given a scalar-valued RKHS with a positive definite kernel , each transport map is chosen to be a perturbation of the identity map

along the vector-valued RKHS

, i.e.,

The transport maps are computed iteratively. At each iteration , our best approximation of is given by the pushforward density . The SVGD algorithm then seeks a transport map that further decreases the KL divergence between and ,

 Q↦Jpl[Q]\coloneqqDKL((I+Q)∗pl||π), (4)

for an appropriate choice of . In other words, the SVGD algorithm seeks a map such that

 Jpl[Q]

where denotes the zero map. By construction, the sequence of pushforward densities becomes increasingly closer (in KL divergence) to the target . Recent results on the convergence of the SVGD algorithm are presented in liu2017stein .

The first variation of at along can be defined as

 DJpl[S](V)\coloneqqlimτ→01τ(Jpl[S+τV]−Jpl[S]). (6)

Assuming that the objective function is Fréchet differentiable, the functional gradient of at is the element of such that

 DJpl[S](V)=⟨∇Jpl[S],V⟩Hd∀V∈Hd, (7)

where denotes an inner product on .

In order to satisfy (5), the SVGD algorithm defines as a perturbation of the identity map along the steepest descent direction of the functional evaluated at the zero map, i.e.,

 Tl+1=I−ε∇Jpl[0], (8)

for a small enough . It was shown in liu2016svgd that the functional gradient at has a closed form expression given by

 −∇Jpl[0](z)=Ex∼pl[k(x,z)∇xlogπ(x)+∇xk(x,z)]. (9)

Empirical approximation.

There are several ways to approximate the expectation in (9). For instance, a set of particles can be generated from the initial reference density and pushed forward by the transport maps . The pushforward density can then be approximated by the empirical measure given by the particles , where for , so that

 −∇Jpl[0](z)≈G(z)\coloneqq1n∑nj=1[k(xlj,z)∇xljlogπ(xlj)+∇xljk(xlj,z)]. (10)

The first term in (10) corresponds to a weighted average steepest descent direction of the log-target density with respect to . This term is responsible for transporting particles towards high-probability regions of . In contrast, the second term can be viewed as a “repulsion force” that spreads the particles along the support of , preventing them from collapsing around the mode of . The SVGD algorithm is summarised in Algorithm 1.

3 Stein variational Newton method

Here we propose a new method that incorporates second-order information to accelerate the convergence of the SVGD algorithm. We replace the steepest descent direction in (8) with an approximation of the Newton direction.

Functional Newton direction.

Given a differentiable objective function , we can define the second variation of at along the pair of directions as

 D2Jpl[0](V,W)\coloneqqlimτ→01τ(DJpl[τW](V)−DJpl[0](V)).

At each iteration, the Newton method seeks to minimize a local quadratic approximation of . The minimizer of this quadratic form defines the Newton direction and is characterized by the first order stationarity conditions

 D2Jpl[0](V,W)=−DJpl[0](V),∀V∈Hd. (11)

We can then look for a transport map that is a local perturbation of the identity map along the Newton direction, i.e.,

 Tl+1=I+εW, (12)

for some that satisfies (5). The function is guaranteed to be a descent direction if the bilinear form in (11) is positive definite. The following theorem gives an explicit form for and is proven in Appendix.

Theorem 1.

The variational characterization of the Newton direction in (11) is equivalent to

 d∑i=1⟨d∑j=1⟨hij(y,z),wj(z)⟩H+∂iJpl[0](y),vi(y)⟩H=0, (13)

for all , where

 hij(y,z)=Ex∼pl[−∂2ijlogπ(x)k(x,y)k(x,z)+∂ik(x,y)∂jk(x,z)]. (14)

We propose a Galerkin approximation of (13). Let be an ensemble of particles distributed according to , and define the finite dimensional linear space . We look for an approximate solution in —i.e.,

 wj(z)=n∑k=1αkjk(xk,z) (15)

for some unknown coefficients —such that the residual of (13) is orthogonal to . The following corollary gives an explicit characterization of the Galerkin solution and is proven in the Appendix.

Corollary 1.

The coefficients are given by the solution of the linear system

where is a vector of unknown coefficients, is the evaluation of the symmetric form (14) at pairs of particles, and where represents the evaluation of the first variation at the -th particle.

In practice, we can only evaluate a Monte Carlo approximation of and in (31) using the ensemble .

Inexact Newton.

The solution of (31) by means of direct solvers might be impractical for problems with a large number of particles or high parameter dimension , since it is a linear system with unknowns. Moreover, the solution of (31) might not lead to a descent direction (e.g., when is not log-concave). We address these issues by deploying two well-established techniques in nonlinear optimisation wright1999numerical . In the first approach, we solve (31) using the inexact Newton–conjugate gradient (NCG) method (wright1999numerical, , Chapters 5 and 7), wherein a descent direction can be guaranteed by appropriately terminating the conjugate gradient iteration. The NCG method only needs to evaluate the matrix-vector product with each and does not construct the matrix explicitly, and thus can be scaled to high dimensions. In the second approach, we simplify the problem further by taking a block-diagonal approximation of the second variation, breaking (31) into decoupled linear systems

 Hs,sαs=∇Js,s=1,…n. (17)

Here, we can either employ a Gauss-Newton approximation of the Hessian in or again use inexact Newton–CG, to guarantee that the approximation of the Newton direction is a descent direction.

Both the block-diagonal approximation and inexact NCG are more efficient than solving for the full Newton direction (31). In addition, the block-diagonal form (17) can be solved in parallel for each of the blocks, and hence it may best suit high-dimensional applications and/or large numbers of particles. In the Appendix, we provide a comparison of these approaches on various examples. Both approaches provide similar progress per SVN iteration compared to the full Newton direction.

Leveraging second-order information provides a natural scaling for the step size, i.e., . Here, the choice performs reasonably well in our numerical experiments (Section 5 and the Appendix). In future work, we will refine our strategy by considering either a line search or a trust region step. The resulting Stein variational Newton method is summarised in Algorithm 2.

4 Scaled Hessian kernel

In the Stein variational method, the kernel weighs the contribution of each particle to a locally averaged steepest descent direction of the target distribution, and it also spreads the particles along the support of the target distribution. Thus it is essential to choose a kernel that can capture the underlying geometry of the target distribution, so the particles can traverse the support of the target distribution efficiently. To this end, we can use the curvature information characterised by the Hessian of the logarithm of the target density to design anisotropic kernels.

Consider a positive definite matrix that approximates the local Hessian of the negative logarithm of the target density, i.e., . We introduce the metric

 Mπ\coloneqqEx∼π[A(x)], (18)

to characterise the average curvature of the target density, stretching and compressing the parameter space in different directions. There are a number of computationally efficient ways to evaluate such an

—for example, the generalised eigenvalue approach in

stonewton2012 and the Fisher information-based approach in girolami2011riemann . The expectation in (18) is taken against the target density , and thus cannot be directly computed. Utilising the ensemble in each iteration, we introduce an alternative metric

 Mpl\coloneqq1n∑ni=1A(xli), (19)

to approximate . Similar approximations have also been introduced in the context of dimension reduction for statistical inverse problems; see cui2014likelihood . Note that the computation of the metric (19) does not incur extra computational cost, as we already calculated (approximations to) at each particle in the Newton update.

Given a kernel of the generic form , we can then use the metric to define an anisotropic kernel

 kl(x,x′)=f(1g(d)∥x−x′∥2Mpl),

where the norm is defined as and is a positive and real-valued function of the dimension . For example, with , the Gaussian kernel used in the SVGD of liu2016svgd can be modified as

 kl(x,x′)\coloneqqexp(−12d∥x−x′∥2Mpl). (20)

The metric induces a deformed geometry in the parameter space: distance is greater along directions where the (average) curvature is large. This geometry directly affects how particles in SVGD or SVN flow—by shaping the locally-averaged gradients and the “repulsion force” among the particles—and tends to spread them more effectively over the high-probability regions of .

The dimension-dependent scaling factor plays an important role in high dimensional problems. Consider a sequence of target densities that converges to a limit as the dimension of the parameter space increases. For example, in the context of Bayesian inference on function spaces, e.g., stuart2010 , the posterior density is often defined on a discretisation of a function space, whose dimensionality increases as the discretisation is refined. In this case, the -weighed norm is the square of the discretised norm under certain technical conditions (e.g., the examples in Section 5.2 and the Appendix) and converges to the functional norm as . With an appropriate scaling , the kernel may thus exhibit robust behaviour with respect to discretisation if the target distribution has appropriate infinite-dimensional limits. For high-dimensional target distributions that do not have a well-defined limit with increasing dimension, an appropriately chosen scaling function can still improve the ability of the kernel to discriminate inter-particle distances. Further numerical investigation of this effect is presented in the Appendix.

5 Test cases

We evaluate our new SVN method with the scaled Hessian kernel on a set of test cases drawn from various Bayesian inference tasks. For these test cases, the target density is the (unnormalised) posterior density. We assume the prior distributions are Gaussian, that is, , where and are the prior mean and prior covariance, respectively. Also, we assume there exists a forward operator mapping from the parameter space to the data space. The relationship between the observed data and unknown parameters can be expressed as , where is the measurement error and

is the identity matrix. This relationship defines the likelihood function

and the (unnormalised) posterior density .

We will compare the performance of SVN and SVGD, both with the scaled Hessian kernel (20

) and the heuristically-scaled isotropic kernel used in

liu2016svgd . We refer to these algorithms as SVN-H, SVN-I, SVGD-H, and SVGD-I, where ‘H’ or ‘I’ designate the Hessian or isotropic kernel, respectively. Recall that the heuristic used in the ‘-I’ algorithms involves a scaling factor based on the number of particles and the median pairwise distance between particles liu2016svgd

. Here we present two test cases, one multi-modal and the other high-dimensional. In the Appendix, we report on additional tests. First, we evaluate the performance of SVN-H with different Hessian approximations: the exact Hessian (full Newton), the block diagonal Hessian, and a Newton–CG version of the algorithm with exact Hessian. Second, we provide a performance comparison between SVGD and SVN on a high-dimensional Bayesian neural network. Finally, we provide further numerical investigations of the dimension-scalability of our scaled kernel.

5.1 Two-dimensional double banana

The first test case is a two-dimensional bimodal and “banana” shaped posterior density. The prior is a standard multivariate Gaussian, i.e., and

, and the observational error has standard deviation

. The forward operator is taken to be a scalar logarithmic Rosenbrock function, i.e.,

 F(x)=log((1−x1)2+100(x2−x21)2),

where . We take a single observation , with

being a random variable drawn from the prior and

.

Figure 1 summarises the outputs of four algorithms at selected iteration numbers, each with particles initially sampled from the prior . The rows of Figure 1 correspond to the choice of algorithms and the columns of Figure 1 correspond to the outputs at different iteration numbers. We run 10, 50, and 100 iterations of SVN-H. To make a fair comparison, we rescale the number of iterations for each of the other algorithms so that the total cost (CPU time) is approximately the same. It is interesting to note that the Hessian kernel takes considerably less computational time than the Isotropic kernel. This is because, whereas the Hessian kernel is automatically scaled, the Isotropic kernel calculates the distance between the particles at each iterations to heuristically rescale the kernel.

The first row of Figure 1 displays the performance of SVN-H, where second-order information is exploited both in the optimisation and in the kernel. After only 10 iterations, the algorithm has already converged, and the configuration of particles does not visibly change afterwards. Here, all the particles quickly reach the high probability regions of the posterior distribution, due to the Newton acceleration in the optimisation. Additionally, the scaled Hessian kernel seems to spread the particles into a structured and precise configuration.

The second row shows the performance of SVN-I, where the second-order information is used exclusively in the optimisation. We can see the particles quickly moving towards the high-probability regions, but the configuration is much less structured. After 47 iterations, the algorithm has essentially converged, but the configuration of the particles is noticeably rougher than that of SVN-H.

SVGD-H in the third row exploits second-order information exclusively in the kernel. Compared to SVN-I, the particles spread more quickly over the support of the posterior, but not all the particles reach the high probability regions, due to slower convergence of the optimisation. The fourth row shows the original algorithm, SVGD-I. The algorithm lacks both of the benefits of second-order information: with slower convergence and a more haphazard particle distribution, it appears less efficient for reconstructing the posterior distribution.

5.2 100-dimensional conditioned diffusion

The second test case is a high-dimensional model arising from a Langevin SDE, with state and dynamics given by

 dut=βu(1−u2)(1+u2)dt+dxt,u0=0. (21)

Here is a Brownian motion, so that , where . This system represents the motion of a particle with negligible mass trapped in an energy potential, with thermal fluctuations represented by the Brownian forcing; it is often used as a test case for MCMC algorithms in high dimensions cui2016dimension . Here we use and . Our goal is to infer the driving process and hence its pushforward to the state .

The forward operator is defined by , where are equispaced observation times in the interval , i.e., . By taking , we define an observation , where is a Brownian motion path and . For discretization, we use an Euler-Maruyama scheme with step size ; therefore the dimensionality of the problem is . The prior is given by the Brownian motion , described above.

Figure 2 summarises the outputs of four algorithms, each with particles initially sampled from . Figure 2 is presented in the same way as Figure 1 from the first test case. The iteration numbers are scaled, so that we can compare outputs generated by various algorithms using approximately the same amount of CPU time. In Figure 2, the path in magenta corresponds to the solution of the Langevin SDE in (21) driven by the true Brownian path . The red points correspond to the 20 noisy observations. The blue path is the reconstruction of the magenta path, i.e., it corresponds to the solution of the Langevin SDE driven by the posterior mean of . Finally, the shaded area represents the marginal 90% credible interval of each dimension (i.e., at each time step) of the posterior distribution of .

We observe excellent performance of SVN-H. After 50 iterations, the algorithm has already converged, accurately reconstructing the posterior mean (which in turn captures the trends of the true path) and the posterior credible intervals. (See Figure 3 and below for a validation of these results against a reference MCMC simulation.) SVN-I manages to provide a reasonable reconstruction of the target distribution: the posterior mean shows fair agreement with the true solution, but the credible intervals are slightly overestimated, compared to SVN-H and the reference MCMC. The overestimated credible interval may be due to the poor dimension scaling of the isotropic kernel used by SVN-I. With the same amount of computational effort, SVGD-H and SVGD-I cannot reconstruct the posterior distribution: both the posterior mean and the posterior credible intervals depart significantly from their true values.

In Figure 3, we compare the posterior distribution approximated with SVN-H (using particles and 100 iterations) to that obtained with a reference MCMC run (using the DILI algorithm of cui2016dimension with an effective sample size of ), showing an overall good agreement. The thick blue and green paths correspond to the posterior means estimated by SVN-H and MCMC, respectively. The blue and green shaded areas represent the marginal 90% credible intervals (at each time step) produced by SVN-H and MCMC. In this example, the posterior mean of SVN-H matches that of MCMC quite closely, and both are comparable to the data-generating path (thick magenta line). (The posterior means are much smoother than the true path, which is to be expected.) The estimated credible intervals of SVN-H and MCMC also match fairly well along the entire path of the SDE.

6 Discussion

In general, the use of Gaussian reproducing kernels may be problematic in high dimensions, due to the locality of the kernel francois2005locality . While we observe in Section 4 that using a properly rescaled Gaussian kernel can improve the performance of the SVN method in high dimensions, we also believe that a truly general purpose nonparametric algorithm using local kernels will inevitably face further challenges in high-dimensional settings. A sensible approach to coping with high dimensionality is also to design algorithms that can detect and exploit essential structure in the target distribution, whether it be decaying correlation, conditional independence, low rank, multiple scales, and so on. See spantini2017inference ; wang2017structured for recent efforts in this direction.

7 Acknowledgements

G. Detommaso is supported by the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (EP/L015684/1) and by a scholarship from the Alan Turing Institute. T. Cui, G. Detommaso, A. Spantini, and Y. Marzouk acknowledge support from the MATRIX Program on “Computational Inverse Problems” held at the MATRIX Institute, Australia, where this joint collaboration was initiated. A. Spantini and Y. Marzouk also acknowledge support from the AFOSR Computational Mathematics Program.

Appendix A Proof of Theorem 1

The following proposition is used to prove Theorem 1.

Proposition 1.

Define the directional derivative of as the first variation of at along a direction ,

 DJp[S](V):=limτ→01τ(Jp[S+τV]−Jp[S]).

The first variation takes the form

 (22)
Proof.

Given the identity map and a transport map in the form of , the pullback density of is defined as

 T∗π=π(T(x))|det∇xT(x)|=π(x+S(x)+τV(x))∣∣det(I+∇xS(x)+τ∇xT(x))∣∣.

The perturbed objective function takes the form

 Jp[S+τV] =DKL((I+S+τV)∗p ∥ π) =DKL(p ∥ (I+S+τV)∗π) =∫p(x)logp(x)dx−∫p(x)(logπ(x+S(x)+τV(x)) +log∣∣det(I+∇xS(x)+τ∇xV(x))∣∣)dx.

Thus we have

 Jp[S+τV] −Jp[S]=−∫p(x)(logπ(x+S(x)+τV(x))−logπ(x+S(x))(i))dx (23)

Performing a Taylor expansion of the terms (i) and (ii) in (23), we have

 (i) =τ(∇xlogπ(x+S(x)))⊤V(x)+O(τ2), (ii) =τtrace((I+∇xS(x))−1∇xV(x))+O(τ2),

where is the partial derivative of evaluated at . Plugging the above expression into (23) and the definition of the directional derivative, we obtain

 (24)

The Fréchet derivative of evaluated at , satisfies

 DJp[S](V)=⟨∇Jp[S],V⟩Hd,∀V∈Hd,

and thus we can use Proposition 1 to prove Theorem 1.

Proof of Theorem 1.

The second variation of at along directions takes the form

 D2Jp[0](V,W):=limτ→01τ(DJp[τW](V)−DJp[0](V)).

Following Proposition 3, we have

 D2Jp[0](V,W)= limτ→01τ(DJp[τW](V)−DJp[0](V)) = −Ex∼p[limτ→01τ(∇xlogπ(x+τW(x))−∇xlogπ(x)(i))⊤V(x)] −Ex∼p[trace(limτ→01τ[(I+τ∇xW(x))−1−I](ii)∇xV(x))]. (25)

By Taylor expansion, the limits (i) and (ii) of the above equation can be written as

 (i) =∇2xlogπ(x)W(x), (ii) =−∇xW(x).

Thus, the second variation of at along directions becomes

 D2Jp[0](V,W)=−Ex∼p[W(x)⊤∇2xlogπ(x)V(x)−trace(∇xW(x)∇xV(x))]. (26)

Using the reproducing property of , i.e.

 vi(x)=⟨k(x,⋅),vi(⋅)⟩H, wj(x)=⟨k(x,⋅),wj(⋅)⟩H ∇xvi(x)=⟨∇xk(x,⋅),vi(⋅)⟩Hd, ∇xwi(x)=⟨∇xk(x,⋅),wi(⋅)⟩Hd

we then have

 Ex∼p[W(x)⊤∇2xlogπ(x)V(x)]=d∑i=1d∑j=1⟨⟨Ex∼p[∂2ijlogπ(x)k(x,y)k(x,z)],wj(z)⟩H,vi(y)⟩H

and

 Ex∼p[trace(∇xW(x)∇xV(x))]=d∑i=1d∑j=1⟨⟨Ex∼p[∂ik(x,y)∂jk(x,z)],wj(z)⟩H,vi(y)⟩H.

Plugging the above identities into (26), the second variation can be expressed as

 D2Jp[0](V,W)=d∑i=1d∑j=1⟨⟨hij(y,z),wj(z)⟩H,vi(y)⟩H,

where

 hij(y,z):=Ex∼p[−∂2ijlogπ(x)k(x,y)k(x,z)+∂ik(x,y)∂jk(x,z)].

Hence the result. ∎

Appendix B Proof of Corollary 1

Proof.

Here we drop the subscript . The ensemble of particles defines a linear function space . In the Galerkin approach, we seek a solution such that the residual of the Newton direction

 (27)

is zero for all possible . This way, we can approximate each component of the function as

 wj(z)=n∑k=1αkjk(xk,z), (28)

for a collection of unknown coefficients . We define to be the test function where for all .

We first project the Newton direction (27) onto for all . Applying the reproducing property of the kernel, this leads to

 d∑j=1⟨hij(xs,z),wj(z)⟩Hd+∂iJpl[0](xs)=0,i=1,…,d,s=1,…,n. (29)

Plugging (28) into (29), we obtain the fully discrete set of equations

 d∑j=1n∑ℓ=1hij(xs,xk)αkj+∂iJpl[0](xs)=0,i=1,…,d,s=1,…,n,k=1,…,n. (30)

We denote the coefficient vector for each , the block Hessian matrix for each pair of and , and for each . Then equation (30) can be expressed as

 n∑k=1Hs,kαk=∇Js,s=1,…,n. (31)

c.1 Comparison between the full and inexact Newton methods

Here we compare three different Stein variational Newton methods: SVNfull denotes the method that solves the fully coupled Newton system in equation (16) of the main paper, with no approximations; SVNCG denotes the method that applies inexact Newton–CG to the fully coupled system (16); and SVNbd employs the block-diagonal approximation given in equation (17) of the main paper.

We first make comparisons using the two-dimensional double banana distribution presented in Section 5.1. We run our test case for particles and 20 iterations. Figure 4 shows the contours of the target density and the samples produced by each of the three algorithms. Compared to the full Newton method, both the block-diagonal approximation and the inexact Newton–CG generate results of similar quality.

We use an additional nonlinear regression test case for further comparisons. In this case, the forward operator is given by

 F(x)=c1x31+c2x2,

where and

are some fixed coefficients sampled independently from a standard normal distribution. A data point is then given by

, where and . We use a standard normal prior distribution on .

We run our test case for particles and 20 iterations. Figure 5 shows contours of the posterior density and the samples produced by each of the three algorithms. Again, both the block-diagonal approximation and the inexact Newton–CG generate results of similar quality to those of the full Newton method.

These numerical results suggest that the block-diagonal approximation and the inexact Newton–CG can be effective methods for iteratively constructing the transport maps in SVN. We will adopt these approximate SVN strategies on large-scale problems, where computing the full Newton direction is not feasible.

c.2 Bayesian neural network

In this test case, we set up a Bayesian neural network as described in liu2016svgd . We use the open-source “yacht hydrodynamics” data set and denote the data by , where is an input, is the corresponding scalar prediction, and . We divide the data into a training set of input–prediction pairs and a validation set of additional pairs. For each input, we model the corresponding prediction as

 yi=f(xi,w)+εi,

where denotes the neural network with weight vector and is an additive Gaussian error. The dimension of the weight vector is . We endow the weights with independent Gaussian priors, . The inference problem then follows from the likelihood function,

 L(D|w,γ)=(γ2π)m2exp(−γ2m∑i=1(f(x,w)−yi)2),

and the prior density,

 π0(w|λ)=(λ2π)m2exp(−γ2m∑i=1w2j),

where and

play the role of hyperparameters.

Performance comparison of SVN-H with SVGD-I.

We compare SVN-H with the original SVGD-I algorithm on this Bayesian neural network example, with hyperparameters fixed to (which provides a very uninformative prior distribution) and . First, we run a line-search with Newton–CG to find the posterior mode . Figure 6 shows that neural network predictions at the posterior mode almost perfectly match the validation data.

Then, we randomly initialise particles around the mode, i.e., by independently drawing . As in the previous test cases, we make a fair comparison of SVN-H and SVGD-I by taking 10, 20, and 30 iterations of SVN-H and rescaling the number of iterations of SVGD-I to match the computational costs of the two algorithms. Because this test case is very high-dimensional, rather than storing the entire Hessian matrix and solving the Newton system we use the inexact Newton–CG approach within SVN, which requires only matrix-vector products and yields enormous memory savings. Implementation details can be found in our GitHub repository.

Figure 7 shows distributions of the error on the validation set, as resulting from posterior predictions. To obtain these errors, we use the particle representation of the posterior on the weights to evaluate posterior predictions on the validation inputs . Then we evaluate the error of each of these predictions. The red line represents the mean of these errors at each validation input

, and the shaded region represents the 90% credible interval of these error distribution. Although both algorithms “work” in the sense of producing errors of small range overall, SVN-H yields distributions of prediction error with smaller means and considerably reduced variances, compared to SVGD-I.

c.3 Scalability of kernels in high dimensions

Discretization-invariant posterior distribution.

Here we illustrate the dimension scalability of the scaled Hessian kernel, compared to the isotropic kernel used in liu2016svgd . We consider a linear Bayesian inverse problem in a function space setting stuart2010 : the forward operator is a linear functional , where the function is defined for . The scalar observation , where is Gaussian with zero mean and standard deviation . The prior is a Gaussian measure where is the Laplace operator , with zero essential boundary conditions.

Discretising this problem with finite differences on a uniform grid with degrees of freedom, we obtain a Gaussian prior density with zero mean and covariance matrix , where is the finite difference approximation of the Laplacian. Let the vector denote the discretised function , and let the corresponding discretised parameter be denoted by (overloading notation for convenience). Then the finite-dimensional forward operator can be written as . After discretization, the posterior has a Gaussian density of the form , where

 mpos=yσ2Cposa,Cpos=(K−1+1σ2aa⊤)−1.

To benchmark the performance of various kernels, we construct certain summaries of the posterior distribution. In particular, we use our SVN methods with the scaled Hessian kernel (SVN-H) and the isotropic kernel (SVN-I) to estimate the component-wise average of the posterior mean, , and the trace of the posterior covariance, , for problems discretised at different resolutions . We run each experiment with particles and iterations of SVN. We compare the numerical estimates of these quantities to the analytically known results. These comparisons are summarised in Tables 1 and 2.

From Table 1, we can observe that all algorithms almost perfectly recover the average of the posterior mean up to the first three significant figures. However, Table 2 shows that SVN-H does a good job in estimating the trace of the posterior covariance consistently for all dimensions, whereas SVN-I under-estimates the trace—suggesting that particles are under-dispersed and not correctly capturing the uncertainty in the parameter . This example suggests that the scaled Hessian kernel can lead to a more accurate posterior reconstruction for high-dimensional distributions than the isotropic kernel.

A posterior distribution that is not discretization invariant.

Now we examine the dimension-scalability of various kernels in a problem that does not have a well-defined limit with increasing parameter dimension. We modify the linear Bayesian inverse problem introduced above: now the prior covariance is the identity matrix, i.e., and the vector

used to define the forward operator is drawn from a uniform distribution,

. This way, the posterior is not discretization invariant. We perform the same set of numerical experiments as above and summarise the results in Tables 3 and 4. Although the target distribution used in this case is not discretization invariant, the scaled Hessian kernel is still reasonably effective in reconstructing the target distributions of increasing dimension (according to the summary statistics below), whereas the isotropic kernel under-estimates the target variances for all values of dimension that we have tested.