1 Introduction
Posterior inference via Monte Carlo sampling is a central topic in Bayesian methods. Recent years have seen the development of scalable sampling methods that facilitate big-data analysis. Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a family of scalable Bayesian sampling algorithms for efficient posterior sampling
[WT11, CFG14, DFB14, CDC15]. The underlying mathematical principle of SGMCMC relies on the theory of Itô diffusions, described by linear stochastic differential equations (SDEs), whose coefficients do not depend on the density of the solution and are designed such that the corresponding stationary distribution matches a target distribution. By numerically solving such an SDE, SGMCMC generates samples that provably approximate a target distribution well [TTV16, VZT16, CDC15]. However, due to their Markovian nature, the samples are typically highly correlated, leading to low sample efficiency, an undesirable property of SGMCMC.
On the other hand, recently emerging particle-based sampling methods maintain a set of interacting particles, which are optimized iteratively to minimize some distance between the target distribution and the empirical distribution formed by the particles. In this way, one maintains an optimal set of particles at each step, mitigating the correlated-sample issue in SGMCMC. A representative work in this direction is Stein variational gradient descent (SVGD) [LW16a]. Recent developments have shown that the underlying mathematical principle of SVGD is based on a family of nonlinear SDEs, in the sense that the coefficients of the SDE depend on the current density of the particles. Despite numerous practical successes [LW16a, FWL17, LRLP17, HTAL17, LZ18]
, little theory has been developed to understand the convergence properties of the algorithm. A recent theoretical development interpreted SVGD as a special type of gradient flow in the space of probability measures, and developed theory to characterize its asymptotic convergence behavior
[Liu17]. To unify SGMCMC and SVGD, a recent work [CZW18]
proposed a particle-optimization sampling framework by interpreting both as Wasserstein gradient flows (WGFs). Generally speaking, a WGF is a partial differential equation (PDE) defined on the space of probability measures, describing the evolution of a density function over time. In
[CZW18], the authors define a WGF by combining the corresponding Fokker-Planck equations for both SGMCMC and SVGD, and solve it with deterministic particle approximations. However, due to the diffusion nature of the flow, the deterministic-particle approximation for SGMCMC leads to a hard-to-control error, which is challenging for theoretical analysis.
Based on the unified framework in [CZW18], we propose to solve WGFs with stochastic particle approximations, leading to the family of stochastic particle-optimization sampling (SPOS) algorithms. The idea is that, instead of solving the WGF with an uncontrollable deterministic approximation to the diffusion term, we solve it stochastically by injecting random noise into the updates. Remarkably, for the first time, we develop non-asymptotic convergence theory for the family of SPOS algorithms, considering both convex and nonconvex energy functions. Different from existing theory for SGMCMC-based algorithms [TTV16, VZT16, CDC15, RRT17, ZLC17, XCZG18], our development relies on the theory of nonlinear SDEs, which is more involved and less explored in the SDE literature. In particular, our development borrows ideas from the analysis of granular media equations [Mal03, CGM08]. Within our theoretical framework, we provide a formal understanding of a pitfall of SVGD, i.e., particles tend to collapse to a single point under certain conditions; this is avoided in the proposed SPOS framework due to the injected random noise. Please refer to Section M in the Supplementary Material (SM) for detailed distinctions between our work and existing work. Our experimental results demonstrate the advantages of our framework over existing methods.
All proofs and experiments are presented in the Supplementary Material (SM).
2 Preliminaries
This section introduces necessary preliminaries, along with notation used in this paper. For the sake of clarity, throughout the paper, we use bold letters to denote variables in continuous-time diffusions and model definitions, e.g., θ_t in (1) defined below (indexed by "time" t). By contrast, normal unbold letters are used to denote parameters in algorithms (discrete solutions of continuous-time diffusions), e.g., θ_k in (3) below (indexed by "iteration" k). For conciseness, all the proofs as well as some extra experimental results are presented in the SM. A discussion of the complexity of our algorithm is also included in Section L of the SM.
2.1 Stochastic gradient MCMC
In Bayesian sampling, we aim to generate random samples from a posterior distribution p(θ|X) ∝ p(X|θ)p(θ), where θ ∈ ℝ^d represents the model parameter with a prior distribution p(θ), and X ≜ {x_i}_{i=1}^N represents the observed data with likelihood p(X|θ). Define the potential energy as U(θ) ≜ −log p(X|θ) − log p(θ). SGMCMC algorithms belong to the family of diffusion-based sampling methods, where a continuous-time diffusion process is designed such that its stationary distribution matches the target posterior distribution. The diffusion process is driven by a specific stochastic differential equation (SDE). For example, in stochastic gradient Langevin dynamics (SGLD), the SDE takes the following form:
dθ_t = −∇_θU(θ_t) dt + √(2β⁻¹) dW_t    (1)
where t is the time index; β > 0 is the temperature parameter; and W_t is a d-dimensional Brownian motion. More instances of SDEs corresponding to other SGMCMC algorithms can be defined by specifying different drift forms and potentially other diffusion coefficients. We focus on SGLD and (1) in this paper, and refer interested readers to [MCF15]
for a more detailed description of general SGMCMC algorithms. Denote the probability density function of θ_t in (1) as ν_t, and write a·b ≜ a⊤b for two vectors a and b. It is known that ν_t is characterized by the following Fokker-Planck (FP) equation [Ris89]:

∂ν_t/∂t = ∇_θ·(ν_t ∇_θU(θ_t)) + β⁻¹∇_θ·∇_θ ν_t    (2)
where the stationary distribution ν_∞ equals our target distribution according to [CH87]. SGLD generates samples from ν_∞ by numerically solving the SDE (1). For scalability, it replaces ∇_θU(θ) in each iteration with an unbiased stochastic estimate evaluated on a randomly sampled subset of X, i.e., U(θ) is approximated by Ũ(θ) ≜ −(N/B)Σ_{x∈Π} log p(x|θ) − log p(θ), where Π is a random subset of X with size B in each iteration. Based on the above settings, SGLD uses the Euler method with stepsize h to numerically solve (1) and obtains the update equation: θ_{k+1} = θ_k − h∇_θŨ(θ_k) + √(2β⁻¹h) ξ_k, where ξ_k ∼ N(0, I_d).
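To make the update concrete, the following is a minimal self-contained sketch (our own illustration, not the paper's implementation): SGLD with β = 1 and the full gradient in place of the minibatch estimate, targeting a standard Gaussian, for which U(θ) = θ²/2 and hence ∇U(θ) = θ.

```python
import numpy as np

# Minimal SGLD sketch (illustrative; full gradient, beta = 1).
# Target: N(0, 1), i.e. U(theta) = theta^2 / 2 and grad_U(theta) = theta.
def sgld(grad_U, theta0, h=0.01, n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_iter)
    for k in range(n_iter):
        # theta_{k+1} = theta_k - h * grad_U(theta_k) + sqrt(2h) * xi_k
        theta = theta - h * grad_U(theta) + np.sqrt(2 * h) * rng.standard_normal()
        samples[k] = theta
    return samples

samples = sgld(grad_U=lambda t: t, theta0=0.0)[5000:]  # discard burn-in
print(samples.mean(), samples.var())  # close to the target's 0 and 1
```

Discarding a burn-in prefix, the sample mean and variance approach the target's 0 and 1; smaller stepsizes reduce discretization bias at the cost of slower mixing.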
2.2 Stein variational gradient descent
Different from SGMCMC, SVGD is a deterministic particle-optimization algorithm that is able to generate samples from a target distribution. In the algorithm, a set of particles interact with each other, driven toward high-density regions of the parameter space while a repulsive force keeps them away from one another. The update equations of the particles follow the steepest-descent direction of the KL divergence between the current empirical distribution of the particles and the target distribution, in an RKHS induced by a kernel function κ(·,·) [LW16a]. Formally, [LW16a] derived the following update rule for particles {θ_k^{(i)}}_{i=1}^M at the k-th iteration with stepsize h:
θ_{k+1}^{(i)} = θ_k^{(i)} + (h/M) Σ_{j=1}^M [κ(θ_k^{(j)}, θ_k^{(i)}) ∇_{θ_k^{(j)}} log p(θ_k^{(j)}|X) + ∇_{θ_k^{(j)}} κ(θ_k^{(j)}, θ_k^{(i)})]    (3)
where the first term in the bracket drives particles toward high-density modes, and the second term serves as a repulsive force that pushes particles away from each other. Different from SGMCMC, only the particles at the current iteration, {θ_k^{(i)}}_{i=1}^M, are used to approximate the target distribution.
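The update rule can be sketched in a few lines of NumPy; the following is our own illustrative 1-D version (with a fixed RBF bandwidth rather than the median heuristic commonly used in practice), targeting a standard Gaussian, for which ∇_θ log p(θ) = −θ.

```python
import numpy as np

# Illustrative 1-D SVGD sketch (fixed RBF bandwidth; standard Gaussian target).
def svgd_step(theta, grad_logp, h=0.1, bw=1.0):
    diff = theta[:, None] - theta[None, :]      # diff[i, j] = theta_i - theta_j
    K = np.exp(-diff**2 / (2 * bw**2))          # RBF kernel matrix
    # first term: kernel-weighted gradients; second term: repulsive kernel gradients
    phi = (K @ grad_logp(theta) + (diff / bw**2 * K).sum(axis=1)) / len(theta)
    return theta + h * phi

rng = np.random.default_rng(0)
theta = rng.normal(5.0, 0.5, size=50)           # particles start far from the mode
for _ in range(500):
    theta = svgd_step(theta, lambda t: -t)
print(theta.mean(), theta.std())  # near 0, with nonzero spread
```

The repulsive term is what keeps the particles from merging; deleting it makes every particle run plain gradient ascent on log p and pile up at the mode.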
2.3 Particleoptimization based sampling methods
SGMCMC and SVGD, though closely related, behave very differently as algorithms, e.g., stochastic and non-interactive versus deterministic and interactive particle updates. Recently, [CZW18] proposed a deterministic particle-optimization framework that unifies both SGMCMC and SVGD. Specifically, the authors viewed both SGMCMC and SVGD as Wasserstein gradient flows (WGFs) on the space of probability measures, and derived several deterministic particle-optimization techniques for particle evolution, in the spirit of SVGD. For SGMCMC, the FP equation (2) for SGLD is a special type of WGF. Together with an interpretation of SVGD as a special case of the Vlasov equation from the nonlinear-PDE literature, [CZW18] proposed a general form of PDE to characterize the evolution of the density of the model parameter, denoted as ν_t at time t, with ν_∞ matching our target (posterior) distribution, i.e.,
∂ν_t/∂t = ∇_θ·(ν_t ∇_θU(θ)) − ∇_θ·(ν_t E_{ϑ∼ν_t}[F(θ, ϑ)]) + β⁻¹∇_θ·∇_θ ν_t    (4)
where F(θ, ϑ) is a function controlling the interaction of particles in the PDE system. For example, in SVGD, [CZW18] showed that F endows the following form:
F(θ, ϑ) = −κ(ϑ, θ)∇_ϑU(ϑ) + ∇_ϑκ(ϑ, θ)    (5)
where κ(·,·) is a kernel function such as the RBF kernel. In the following, we introduce a new unary function K such that κ(θ, ϑ) can be rewritten as K(θ − ϑ). Hence, (4) with F defined in (5) can be equivalently rewritten as:
∂ν_t/∂t = ∇_θ·(ν_t ∇_θU(θ)) + ∇_θ·(ν_t E_{ϑ∼ν_t}[K(θ − ϑ)∇_ϑU(ϑ) + ∇K(θ − ϑ)]) + β⁻¹∇_θ·∇_θ ν_t    (6)
where ϑ is a random sample from ν_t but independent of θ. Importantly,
Proposition 1 ([CZW18])
The stationary distribution of (6) equals our target distribution, i.e., ν_∞(θ) = p(θ|X).
[CZW18] proposed to solve (4) numerically with deterministic particle-optimization algorithms such as the blob method. Specifically, the continuous density ν_t is approximated by a set of M particles {θ_t^{(i)}}_{i=1}^M that evolve over time t, i.e., ν_t ≈ (1/M)Σ_{i=1}^M δ(θ − θ_t^{(i)}), where δ(θ − θ_t^{(i)}) = +∞ if θ = θ_t^{(i)} and 0 otherwise. Note the diffusion term in (4) is no longer well defined when adopting this particle approximation for ν_t. Consequently, it needs nontrivial approximations, e.g., by the discrete gradient flows or blob methods proposed in [CZW18]. We omit the details here for simplicity.
3 Stochastic Particle-Optimization Sampling Algorithms
The deterministic particle-approximation methods proposed by [CZW18] to approximately solve the WGF problem (4) introduce approximation errors for the diffusion term that are hard to control analytically. To overcome this problem, we propose to solve (4) stochastically, replacing the diffusion term β⁻¹∇_θ·∇_θν_t with a Brownian motion. Specifically, first note that this term is contributed by Brownian motion, i.e., solving the SDE dθ_t = √(2β⁻¹)dW_t is equivalent to solving the corresponding FP equation ∂ν_t/∂t = β⁻¹∇_θ·∇_θν_t. Consequently, we decompose the RHS of (4) into two parts: a deterministic part consisting of the drift and interaction terms, and a stochastic part consisting of the diffusion term. Our idea is to solve the first part deterministically under a PDE setting, and to solve the second part stochastically based on its corresponding SDE. When adopting a particle approximation for the density ν_t, the solutions of both parts are represented in terms of particles {θ_t^{(i)}}_{i=1}^M. Thus we can combine the solutions of the two parts directly to approximate the original exact solution of (4). Similar to the results for SVGD in Section 3.3 of [Liu17], we first formally show in Theorem 2 that when approximating ν_t with particles, i.e., ν_t ≈ (1/M)Σ_i δ(θ − θ_t^{(i)}), the deterministic part of the PDE can be transformed into a system of deterministic differential equations with interacting particles.
Theorem 2
When approximating ν_t in (4) with particles {θ_t^{(i)}}_{i=1}^M, the deterministic part of the PDE reduces to the following system of differential equations describing the evolution of the particles over time:
dθ_t^{(i)} = −∇_θU(θ_t^{(i)}) dt − (1/M) Σ_{j=1}^M [K(θ_t^{(i)} − θ_t^{(j)}) ∇_θU(θ_t^{(j)}) + ∇K(θ_t^{(i)} − θ_t^{(j)})] dt, ∀i    (7)
On the other hand, by solving the diffusion part stochastically with its equivalent SDE counterpart, we arrive at the following stochastic differential equation system, describing the evolution of the particles {θ_t^{(i)}} over time t:
dθ_t^{(i)} = −∇_θU(θ_t^{(i)}) dt − (1/M) Σ_{j=1}^M [K(θ_t^{(i)} − θ_t^{(j)}) ∇_θU(θ_t^{(j)}) + ∇K(θ_t^{(i)} − θ_t^{(j)})] dt + √(2β⁻¹) dW_t^{(i)}, ∀i    (8)
Our intuition is that if the particle evolution (8) can be solved exactly, the solution of (6) will be well approximated by the particles {θ_t^{(i)}}_{i=1}^M. In our theory, we show this intuition is indeed true. In practice, however, solving (8) exactly is typically infeasible, and thus numerical methods are adopted. Furthermore, in the case of big data, following SGMCMC, ∇_θU is typically replaced by a stochastic version ∇_θŨ evaluated on a minibatch of data of size B for computational feasibility. Based on the Euler method [CDC15] with a stepsize h, (8) leads to the following updates for the particles at the k-th iteration:
θ_{k+1}^{(i)} = θ_k^{(i)} − h∇_θŨ(θ_k^{(i)}) − (h/M) Σ_{j=1}^M [K(θ_k^{(i)} − θ_k^{(j)}) ∇_θŨ(θ_k^{(j)}) + ∇K(θ_k^{(i)} − θ_k^{(j)})] + √(2β⁻¹h) ξ_k^{(i)},  ξ_k^{(i)} ∼ N(0, I_d)    (9)
We call the algorithm with particle update equation (9) stochastic particle-optimization sampling (SPOS, described in Algorithm 1), in the sense that particles are optimized stochastically with extra random Gaussian noise. Intuitively, the added Gaussian noise enhances the ability of the algorithm to jump out of local modes, leading to better ergodic properties than standard SVGD. This serves as one of our motivations for generalizing SVGD to SPOS. To illustrate the advantage of introducing the noise term, we compare SPOS and SVGD on sampling a difficult multi-mode distribution, with the density function given in Section A of the SM. The particles are initialized on a local mode close to zero. Figure 1 plots the final locations of the particles along with the true density, showing that particles in SPOS are able to reach different modes, while they are all trapped at one mode in SVGD. This pitfall of SVGD will be studied formally in Section 4.4.
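The effect of the injected noise can be reproduced with a toy experiment (our own illustration, not the paper's exact setup): run an SVGD-style kernel drift plus Gaussian noise on an equal-weight mixture of N(−3, 1) and N(3, 1), with all particles initialized at the left mode. With noise (SPOS-style), particles can cross the low-density barrier; setting the noise scale to zero recovers a deterministic SVGD-style update.

```python
import numpy as np

# Hedged SPOS-style sketch: SVGD drift plus sqrt(2 * beta_inv * h) Gaussian noise.
# Target: 0.5 * N(-3, 1) + 0.5 * N(3, 1).
def grad_logp(t):
    w = 1.0 / (1.0 + np.exp(-6.0 * t))   # responsibility of the +3 component
    return -(t - 3.0) * w - (t + 3.0) * (1.0 - w)

def spos_step(theta, rng, h=0.05, bw=1.0, beta_inv=1.0):
    diff = theta[:, None] - theta[None, :]
    K = np.exp(-diff**2 / (2 * bw**2))
    drift = (K @ grad_logp(theta) + (diff / bw**2 * K).sum(axis=1)) / len(theta)
    noise = np.sqrt(2.0 * beta_inv * h) * rng.standard_normal(len(theta))
    return theta + h * drift + noise     # beta_inv = 0 gives the SVGD-style update

rng = np.random.default_rng(0)
theta = rng.normal(-3.0, 0.3, size=50)   # all particles start at the left mode
for _ in range(4000):
    theta = spos_step(theta, rng)
print((theta < -1).sum(), (theta > 1).sum())  # particles should spread over both modes
```

On this toy target the noise lets particles hop between the modes, whereas the purely deterministic update keeps them near their initialization.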
4 Non-Asymptotic Convergence Analysis: the Convex Case
In this section, we develop non-asymptotic convergence theory for the proposed SPOS algorithm when the energy function U is convex. The nonconvex case is discussed in Section 5. We prove non-asymptotic convergence rates for the SPOS algorithm under the 1-Wasserstein metric W₁, a special case of the p-Wasserstein metric defined as W_p(μ, ν) ≜ (inf_{ζ∈Γ(μ,ν)} ∫ ‖θ − θ′‖^p dζ(θ, θ′))^{1/p}, where Γ(μ, ν) is the set of joint distributions over (θ, θ′) with marginal distributions μ and ν. Note that SPOS reduces to SVGD when β⁻¹ = 0, thus our theory also sheds light on the convergence behavior of SVGD, for which non-asymptotic theory is currently missing, despite the asymptotic theory developed recently [Liu17, LLN18]. It is worth noting that part of our proofs generalizes techniques for analyzing granular media equations in [Mal03, CGM08].
4.1 Basic setup and assumptions
Due to the exchangeability of the particle system in (8), if we initialize all the particles with the same distribution, they endow the same distribution at each time t, which we denote as ν_t for each θ_t^{(i)}. Similar arguments hold for the particle system in (9), and thus we denote the distribution of each θ_k^{(i)} as ρ_k. To this end, our analysis aims at bounding W₁(ρ_k, ν_∞), since ν_∞ equals our target distribution according to Proposition 1. Before proceeding to our theoretical results, we first present the following basic assumptions.
Assumption 1
Assume U and K satisfy the following conditions:

There exist positive constants m_U and m_K, such that ⟨θ − θ′, ∇U(θ) − ∇U(θ′)⟩ ≥ m_U‖θ − θ′‖² and ⟨θ − θ′, ∇K(θ) − ∇K(θ′)⟩ ≤ −m_K‖θ − θ′‖².

∇U is bounded by H_U and Lipschitz continuous, i.e., ‖∇U(θ)‖ ≤ H_U and ‖∇U(θ) − ∇U(θ′)‖ ≤ L_U‖θ − θ′‖; U(0) = 0.

K is Lipschitz continuous; ∇K is bounded by H_K and Lipschitz continuous

and K is an even function, i.e., K(−θ) = K(θ).
Note the first bullet implies U is a convex function and K is a concave function. For an RBF kernel, the latter can be achieved by setting the bandwidth large enough and, for simplicity, only considering the concave region. This assumption is used to reveal an undesired property of SVGD developed below; we do not need it when analyzing a nonconvex energy function in Section 5. The condition "U(0) = 0" in the second bullet is reasonable, as U in our setting corresponds to an unnormalized log-posterior, which can be shifted so that the condition holds for a specific problem. Since we often care about a bounded space in practice, the third bullet can be realized due to the continuity of K and ∇K.
The high-level idea of bounding the distance between the distribution of the SPOS particles and ν_∞ in this section is to decompose it via the triangle inequality as follows:
(10) 
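As an aside on the metric itself: for two equal-size samples in one dimension, the empirical W₁ distance has a closed form, the average distance between sorted (quantile-matched) samples. A small self-contained numerical check (our own illustration):

```python
import numpy as np

def w1_empirical(a, b):
    # 1-Wasserstein distance between two equal-size 1-D empirical measures:
    # the optimal coupling matches sorted samples (quantiles).
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

# Point masses delta_0 and delta_3 are at W1 distance 3.
print(w1_empirical([0.0], [3.0]))          # 3.0
# Shifting a distribution by c moves it by exactly c in W1.
rng = np.random.default_rng(0)
x = rng.standard_normal(100000)
print(w1_empirical(x, x + 2.0))            # 2.0 (up to rounding)
```

This sorted-sample formula is specific to one dimension; in higher dimensions W₁ requires solving an optimal-transport problem.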
4.2 Bounds with stochastic particle approximation
We first bound two of the terms in the decomposition (10) with the following theorems.
Theorem 3
Under Assumption 1, there exist positive constants, independent of the particle number M and time t, such that
(11) 

According to Theorem 3, the particle-approximation error vanishes as the number of particles M grows. Furthermore, by letting M → ∞, we obtain an important intermediate result used to prove the following theorem.
Theorem 4
To ensure the bound decreases over time, one needs to choose the parameters small enough that the stated condition holds. This also sheds light on a failure case of SVGD (where β⁻¹ = 0), discussed in Section 4.4.
4.3 Bounds with a numerical solution
To bound the remaining numerical-error term, we adopt techniques from [RRT17, XCZG18] for analyzing the behavior of SGLD, and derive the following result for our SPOS algorithm:
Theorem 5
Under Assumption 1, for a fixed stepsize h that is small enough, the corresponding numerical error is bounded as
(13) 
where B is the fixed size of the minibatch in each iteration, and the remaining constants in the bound are positive and independent of h.
The dependence on the minibatch size in the bound above makes the bound relatively loose. Fortunately, we can remove this dependence by considering a decreasing-stepsize SPOS algorithm, stated in Theorem 6.
Theorem 6
Under Assumption 1, for a decreasing stepsize sequence and a growing minibatch size in each iteration, the corresponding error term is bounded, for stepsizes small enough, as
(14) 
where B₀ is the initial minibatch size, and the remaining constants in the bound are positive and independent of the stepsizes.
Note the minibatch size increases very slowly, e.g., by only 15 after a large number of iterations, thus it would not noticeably affect algorithm efficiency. According to Theorem 6, the bound approaches zero as the number of iterations increases. By directly combining the results of Theorems 3–6, one can easily bound the target W₁ distance. Detailed statements are given in Theorems 15–16 in Section H of the SM.
4.4 A Pitfall of SVGD
Based on the above analysis, we now formally expose a pitfall of SVGD, i.e., particles in SVGD tend to collapse to a local mode under certain conditions. Inspired by the analysis of granular media equations in [Mal03, CGM08], we measure this by calculating the expected distance between particles, called the expected particle distance (EPD). First, we bound the EPD for the proposed SPOS algorithm in Theorem 7.
Theorem 7
Under Assumption 1, further assume every particle is initialized with the same probability law. Then for a suitably chosen β, the EPD of SPOS is bounded by a positive constant depending on β, M, and the constants in Assumption 1.

There are two interesting cases. When the initial EPD is larger than the bound, the EPD decreases toward the bound over time, representing the phenomenon of an attractive force between particles. When the initial EPD is smaller than the bound, the EPD increases to the same bound, representing the phenomenon of a repulsive force between particles; e.g., when particles are initialized at the same value (EPD = 0), they are pushed away from each other until the EPD increases to the aforementioned bound.
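The degenerate case can be checked numerically. If all particles start at exactly the same point, an SVGD-style drift is identical across particles (the repulsive kernel gradients vanish at zero separation), so the empirical EPD stays at zero forever, whereas injected noise immediately separates the particles. A hedged sketch with a standard Gaussian target (our own illustration):

```python
import numpy as np

def epd(theta):
    # empirical expected particle distance: mean pairwise |theta_i - theta_j|
    return float(np.abs(theta[:, None] - theta[None, :]).mean())

def step(theta, h=0.1, bw=1.0, noise_scale=0.0, rng=None):
    # SVGD-style drift for a standard Gaussian target (grad log p = -theta),
    # plus optional injected Gaussian noise (the SPOS-style modification).
    diff = theta[:, None] - theta[None, :]
    K = np.exp(-diff**2 / (2 * bw**2))
    drift = (K @ (-theta) + (diff / bw**2 * K).sum(axis=1)) / len(theta)
    theta = theta + h * drift
    if noise_scale > 0:
        theta = theta + noise_scale * rng.standard_normal(len(theta))
    return theta

rng = np.random.default_rng(0)
collapsed = np.full(20, 1.5)                 # all particles at the same point
svgd, spos = collapsed.copy(), collapsed.copy()
for _ in range(200):
    svgd = step(svgd)                                         # deterministic
    spos = step(spos, noise_scale=np.sqrt(2 * 0.1), rng=rng)  # with noise
print(epd(svgd), epd(spos))  # 0.0 for the deterministic run; positive with noise
```

With noise, the particles settle at a strictly positive EPD, matching the repulsion-versus-noise balance described above; the deterministic run never escapes the collapsed configuration.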
Intuitively, the EPD for SVGD can be obtained by taking the limit β⁻¹ → 0. Corollary 8 formally characterizes the particle-degeneracy phenomenon of SVGD, which has been empirically studied in [ZLS18].
Corollary 8
Under the same conditions as in Theorem 7, the EPD in SVGD is bounded by a quantity that vanishes as t → ∞ in the regime described below.

We would like to emphasize two points: 1) Corollary 8 indicates that, in the corresponding regime, particles in SVGD would collapse to a single point as t → ∞. In practice, we usually find that particles are trapped in a local mode instead of fully collapsing. This is due to two reasons: numerical errors inject noise into the particles, and some particles fall outside the concave region of K stated in Assumption 1, which is required for the theory to hold. Both make the empirical EPD deviate from the true particle distance. 2) Corollary 8 also applies when the energy function U is nonconvex; our proof in the SM covers the nonconvex case as well. Consequently, this serves as a strong theoretical motivation to apply SPOS instead of SVGD in deep learning.
5 Non-Asymptotic Convergence Analysis: the Nonconvex Case
Since the nonconvex case is much more complicated than the convex case, we rely on different assumptions and adopt another distance metric, denoted d(·, ·), to characterize the convergence behavior of SPOS. Specifically, d(μ₁, μ₂) ≜ |E_{μ₁}[φ] − E_{μ₂}[φ]| for a known continuous test function φ satisfying Assumption 2 below. Note such a metric has also been adopted in [VZT16, CDC15]. Our analysis considers the M particles jointly as a variable in the product space. In addition, we consider the particles obtained when full gradients are adopted in (9), and denote their distribution accordingly.
Our high-level idea is to decompose the target distance as follows:
(15) 
Our second idea is to concatenate the particles at each time into a single vector, i.e., defining the new parameter at time t as Θ_t ≜ (θ_t^{(1)⊤}, …, θ_t^{(M)⊤})⊤ ∈ ℝ^{Md}. Consequently, the nonlinear SDE system (8) can be turned into a linear SDE, which means Θ_t is driven by the following linear SDE:
dΘ_t = F_M(Θ_t) dt + √(2β⁻¹) dW_t^{(Md)}    (16)
where F_M : ℝ^{Md} → ℝ^{Md} is the vector function stacking the particle drifts in (8), and W_t^{(Md)} is a Brownian motion of dimension Md. Similarly, we can define the concatenated parameter for the full-gradient case. Hence, it can be seen that through the decomposition in (15), the bound related to the nonlinear SDE system (8) reduces to that of a linear SDE. The second term reflects the geometric ergodicity of a linear dynamical system under a numerical method. It is known that even if a dynamical system has an exponential convergence rate to its equilibrium, its corresponding numerical method might not. Our bound for this term is essentially a specification of the result of [MSH02], which has also been applied by [XCZG18]. The third term reflects the numerical error of a linear SDE, which has been studied in related literature such as [CDC15]. To this end, we adopt standard assumptions used in the analysis of SDEs [VZT16, CDC15], rephrased in Assumption 2.
Assumption 2 (Assumption in [VZT16, CDC15])
For the linear SDE (16) and a Lipschitz function φ, let ψ be the solution functional of the Poisson equation Lψ = φ − ν̄(φ), where L denotes the infinitesimal generator of the linear SDE (16) and ν̄ its equilibrium distribution. Assume ψ and its derivatives up to 4th order, D^kψ, are bounded by a function V, i.e., ‖D^kψ‖ ≤ C_k V^{p_k} for k = 0, 1, …, 4. Furthermore, the expectation of V^p along the iterates is bounded, sup_k E[V^p(Θ_k)] < ∞, and V is smooth such that sup_{s∈(0,1)} V^p(sΘ + (1 − s)Θ′) ≤ C(V^p(Θ) + V^p(Θ′)) for all Θ, Θ′ and p ≤ 2 max_k p_k.
Assumption 3
∇U, K and ∇K are Lipschitz continuous; U satisfies the dissipative property, i.e., ⟨θ, ∇_θU(θ)⟩ ≥ m‖θ‖² − b for some m, b > 0; and Remark 4.2 applies in the nonconvex setting as well.

Assumption 2 is necessary to control the gap between a numerical solution and the exact solution of an SDE. Specifically, it is used to bound the second and third terms in (15). Relying purely on the dissipative condition in Assumption 3, as in nonconvex optimization with SGMCMC [RRT17, XCZG18], would induce a bound increasing linearly w.r.t. time t; thus it is not suitable for our goal. Finally, the last condition in Assumption 3 is mild and reasonable, because we expect the particles to approximate all distributions equally well in the asymptotic limit of t → ∞ by ergodicity, due to the injected noise. How to remove or replace this assumption is interesting future work.
Based on the assumptions above, the bounds for the second and third terms in (15) are summarized below.
Theorem 9

It is seen that in order to make the