Posterior inference via Monte Carlo sampling is a central topic in Bayesian methods. Recent years have seen the development of scalable sampling methods to facilitate big-data analysis. Stochastic gradient Markov chain Monte Carlo (SG-MCMC) is a family of scalable Bayesian sampling algorithms for efficient posterior sampling [WT11, CFG14, DFB14, CDC15]. The underlying mathematical principle of SG-MCMC relies on the theory of Itô diffusions, i.e., linear stochastic differential equations (SDEs) with appropriately designed coefficients such that the corresponding stationary distribution matches a target distribution. By numerically solving such an SDE, SG-MCMC generates samples that approximate the target distribution well in theory [TTV16, VZT16, CDC15]. However, due to their Markovian nature, the samples are typically highly correlated, leading to low sample efficiency, an undesirable property of SG-MCMC.
On the other hand, recently emerging particle-based sampling methods maintain a set of interacting particles, which are optimized iteratively to minimize some distance between the target distribution and the empirical distribution formed by the particles. In this way, one maintains an optimal set of particles at each step, mitigating the correlated-sample issue of SG-MCMC. A representative work in this direction is Stein variational gradient descent (SVGD) [LW16a]. Recent developments have shown that the underlying mathematical principle of SVGD is based on a family of nonlinear SDEs, in the sense that the coefficients of the SDE depend on the current density of the particles. Despite numerous practical successes [LW16a, FWL17, LRLP17, HTAL17, LZ18], little theory has been developed to understand the convergence properties of the algorithm. A recent theoretical development interpreted SVGD as a special type of gradient flow in the space of probability measures, and developed theory disclosing its asymptotic convergence behavior [Liu17].
To unify SG-MCMC and SVGD, a recent work [CZW18] proposed a particle-optimization sampling framework by interpreting both as Wasserstein gradient flows (WGFs). Generally speaking, a WGF is a partial differential equation (PDE) defined on the space of probability measures, describing the evolution of a density function over time. In [CZW18], the authors define a WGF by combining the corresponding Fokker-Planck equations for SG-MCMC and SVGD, and solve it with deterministic particle approximations. However, due to the diffusion term, the deterministic-particle approximation for SG-MCMC leads to a hard-to-control error, making theoretical analysis challenging.
Based on the unified framework in [CZW18], we propose to solve WGFs with stochastic particle approximation, leading to stochastic particle-optimization sampling (SPOS) algorithms. The idea is that instead of solving the WGF with an uncontrollable deterministic approximation to the diffusion term, we solve it stochastically by injecting random noise into the updates. Remarkably, for the first time, we develop nonasymptotic convergence theory for the family of SPOS algorithms, considering both convex and nonconvex energy functions. Different from existing theory for SG-MCMC-based algorithms [TTV16, VZT16, CDC15, RRT17, ZLC17, XCZG18], our development relies on the theory of nonlinear SDEs, which is more involved and less explored in the SDE literature. In particular, our development borrows ideas from the theory of granular media equations [Mal03, CGM08]. Within our theoretical framework, we provide a formal theoretical understanding of a pitfall of SVGD, i.e., particles tend to collapse to a single point under certain conditions, whereas this can be avoided in the proposed SPOS framework due to the injected random noise. Please refer to Section M in the Supplementary Material (SM) for detailed distinctions between our work and existing work. Our experimental results demonstrate the advantages of our framework compared to existing methods.
All proofs and experiments are presented in the Supplementary Material (SM).
This section introduces necessary preliminaries, along with notation used in this paper. For clarity, throughout the paper we use bold letters to denote variables in continuous-time diffusions and model definitions, e.g., $\boldsymbol{\theta}_t$ in (1) defined below (indexed by "time" $t$). By contrast, normal unbold letters denote parameters in algorithms (discrete solutions of continuous-time diffusions), e.g., $\theta_k$ in (3) below (indexed by "iteration" $k$). For conciseness, all proofs as well as some extra experimental results are presented in the SM. A discussion of the complexity of our algorithm is also included in Section L of the SM.
2.1 Stochastic gradient MCMC
In Bayesian sampling, we aim to generate random samples from a posterior distribution $p(\theta | \mathbf{X}) \propto p(\mathbf{X} | \theta) p(\theta)$, where $\theta \in \mathbb{R}^d$ represents the model parameter with prior distribution $p(\theta)$, and $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^N$ represents the observed data with likelihood $p(\mathbf{X} | \theta) = \prod_{i=1}^N p(\mathbf{x}_i | \theta)$. Define the potential energy as $U(\theta) \triangleq -\log p(\mathbf{X} | \theta) - \log p(\theta)$. SG-MCMC algorithms belong to the family of diffusion-based sampling methods, where a continuous-time diffusion process is designed such that its stationary distribution matches the target posterior distribution. The diffusion process is driven by a specific stochastic differential equation (SDE). For example, in stochastic gradient Langevin dynamics (SGLD), the SDE takes the following form:
$\mathrm{d}\boldsymbol{\theta}_t = -\beta^{-1} \nabla U(\boldsymbol{\theta}_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}\mathcal{W}_t$,  (1)
where $\boldsymbol{\theta}_t \in \mathbb{R}^d$; $t$ is the time index; $\beta^{-1} > 0$ is the temperature parameter; and $\mathcal{W}_t$ is a $d$-dimensional Brownian motion. More instances of SDEs corresponding to other SG-MCMC algorithms can be defined by specifying different forms of the drift and potentially other diffusion coefficients. We focus on SGLD and (1) in this paper, and refer interested readers to [MCF15] for a more detailed description of general SG-MCMC algorithms. Denote the probability density function of $\boldsymbol{\theta}_t$ in (1) as $\rho_t$, and write $\mathbf{a} \cdot \mathbf{b} \triangleq \mathbf{a}^{\top} \mathbf{b}$ for two vectors $\mathbf{a}$ and $\mathbf{b}$. It is known that $\rho_t$ is characterized by the following Fokker-Planck (FP) equation [Ris89]:
$\frac{\partial \rho_t}{\partial t} = \nabla \cdot \left(\beta^{-1} \rho_t \nabla U(\boldsymbol{\theta}_t) + \beta^{-1} \nabla \rho_t\right)$,  (2)
where the stationary distribution $\rho_\infty$ equals our target distribution $p(\theta | \mathbf{X})$ according to [CH87]. SGLD generates samples from $p(\theta | \mathbf{X})$ by numerically solving the SDE (1). For scalability, it replaces $\nabla U(\theta)$ in each iteration with an unbiased estimate obtained by randomly sampling a subset of $\mathbf{X}$, i.e., $\nabla U(\theta)$ is approximated by $\nabla \tilde{U}(\theta) = -\frac{N}{B} \sum_{i \in \mathcal{B}} \nabla \log p(\mathbf{x}_i | \theta) - \nabla \log p(\theta)$, where $\mathcal{B}$ is a random subset of $\{1, \ldots, N\}$ with size $B$ in each iteration. Based on the above settings, SGLD uses the Euler method with stepsize $\epsilon_k$ to numerically solve (1) and obtains the update equation $\theta_{k+1} = \theta_k - \epsilon_k \beta^{-1} \nabla \tilde{U}(\theta_k) + \sqrt{2 \beta^{-1} \epsilon_k}\, \xi_k$, with $\xi_k \sim \mathcal{N}(0, \mathrm{I}_d)$.
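As a concrete illustration, the SGLD update above can be sketched in a few lines of code. The toy target (a standard Gaussian, for which the full gradient is used in place of a minibatch estimate), the stepsize, and all variable names are illustrative assumptions for this demo, not settings from the paper:

```python
import numpy as np

def sgld_step(theta, grad_U, eps, beta=1.0, rng=None):
    """One SGLD update: a gradient step on the potential U, scaled by the
    temperature, plus Gaussian noise with variance 2*eps/beta."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * eps / beta)
    return theta - (eps / beta) * grad_U(theta) + noise

# Toy target N(0, 1): U(theta) = theta^2 / 2, so grad U(theta) = theta.
rng = np.random.default_rng(0)
theta, samples = np.zeros(1), []
for k in range(20000):
    theta = sgld_step(theta, lambda th: th, eps=1e-2, rng=rng)
    if k >= 10000:                  # discard burn-in
        samples.append(theta[0])
samples = np.array(samples)
print(samples.mean(), samples.var())  # both close to the true values 0 and 1
```

In a big-data setting, `grad_U` would return the minibatch estimate $\nabla \tilde{U}$ rather than the full gradient; the update itself is unchanged.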
2.2 Stein variational gradient descent
Different from SG-MCMC, SVGD is a deterministic particle-optimization algorithm that generates samples from a target distribution. In the algorithm, a set of particles interact with each other, driven toward high-density regions of the parameter space while a repulsive force keeps them away from one another. The update equations of the particles follow the steepest descent direction of the KL divergence between the current empirical distribution of the particles and the target distribution, in an RKHS induced by a kernel function $K(\cdot, \cdot)$ [LW16a]. Formally, [LW16a] derived the following update rule for particles $\{\theta_k^{(i)}\}_{i=1}^M$ at the $k$-th iteration with stepsize $\epsilon_k$:
$\theta_{k+1}^{(i)} = \theta_k^{(i)} + \frac{\epsilon_k}{M} \sum_{j=1}^{M} \left[-K(\theta_k^{(j)}, \theta_k^{(i)})\, \nabla_{\theta_k^{(j)}} U(\theta_k^{(j)}) + \nabla_{\theta_k^{(j)}} K(\theta_k^{(j)}, \theta_k^{(i)})\right]$,
where the first term in the bracket encourages particles to move toward high-density modes, and the second term serves as a repulsive force that pushes particles away from one another. Different from SG-MCMC, only the particles at the current iteration, $\{\theta_k^{(i)}\}_{i=1}^M$, are used to approximate the target distribution.
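To make the update concrete, a minimal SVGD sketch with an RBF kernel follows. The kernel bandwidth, the toy target, the stepsize, and the particle count are illustrative assumptions for this demo:

```python
import numpy as np

def svgd_step(X, grad_log_p, eps=0.05, h=1.0):
    """One SVGD update for particles X of shape (M, d): kernel-smoothed
    gradient (attraction) plus kernel gradient (repulsion)."""
    M = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]              # (M, M, d): x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    grad_K = diff * K[..., None] / h ** 2             # grad wrt x_j of K(x_j, x_i)
    phi = (K @ grad_log_p(X) + grad_K.sum(axis=1)) / M
    return X + eps * phi

# Toy target N(0, 1): grad log p(x) = -x.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=0.5, size=(50, 1))      # start far from the target
for _ in range(1000):
    X = svgd_step(X, lambda x: -x)
print(X.mean(), X.var())   # particles center near 0 and spread out
```

Note that the update is fully deterministic: the only randomness is in the initialization of the particles.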
2.3 Particle-optimization based sampling methods
SG-MCMC and SVGD, though closely related, behave very differently as algorithms, e.g., stochastic and noninteractive versus deterministic and interactive particle updates. Recently, [CZW18] proposed a deterministic particle-optimization framework that unifies SG-MCMC and SVGD. Specifically, the authors viewed both SG-MCMC and SVGD as Wasserstein gradient flows (WGFs) on the space of probability measures, and derived several deterministic particle-optimization techniques for particle evolution, in the spirit of SVGD. For SG-MCMC, the FP equation (2) for SGLD is a special type of WGF. Together with an interpretation of SVGD as a special case of the Vlasov equation in the nonlinear-PDE literature, [CZW18] proposed a general form of PDE to characterize the evolution of the density of the model parameter $\boldsymbol{\theta}$, denoted $\rho_t$ at time $t$, with $\rho_\infty$ matching our target (posterior) distribution, i.e.,
where $F$ is a function controlling the interaction of particles in the PDE system. For example, in the case of SVGD, [CZW18] showed that the corresponding terms take the following forms:
where $\boldsymbol{\vartheta}$ is a random sample drawn from $\rho_t$ but independent of $\boldsymbol{\theta}_t$. Importantly,
Proposition 1 ([CZW18])
The stationary distribution $\rho_\infty$ of (6) equals our target distribution, i.e., $\rho_\infty = p(\theta | \mathbf{X})$.
[CZW18] proposed to solve (4) numerically with deterministic particle-optimization algorithms such as the blob method. Specifically, the continuous density $\rho_t$ is approximated by a set of $M$ particles $\{\boldsymbol{\theta}_t^{(i)}\}_{i=1}^M$ that evolve over time $t$, i.e., $\rho_t(\theta) \approx \frac{1}{M} \sum_{i=1}^M \delta_{\boldsymbol{\theta}_t^{(i)}}(\theta)$, where $\delta_{\boldsymbol{\theta}_t^{(i)}}(\theta) = 1$ if $\theta = \boldsymbol{\theta}_t^{(i)}$ and 0 otherwise. Note that the gradient term $\nabla \rho_t$ in (4) is no longer well defined when adopting a particle approximation for $\rho_t$. Consequently, it needs nontrivial approximations, e.g., by the discrete-gradient-flow or blob methods proposed in [CZW18]. We omit the details here for simplicity.
3 Stochastic Particle-Optimization Sampling Algorithms
The deterministic particle-approximation methods proposed by [CZW18] to approximately solve the WGF problem (4) introduce approximation errors for $\nabla \rho_t$ that are hard to control analytically. To overcome this problem, we propose to solve (4) stochastically, replacing the diffusion term $\beta^{-1} \nabla \cdot (\nabla \rho_t)$ with Brownian motion. Specifically, first note that this term is contributed by Brownian motion, i.e., solving the SDE $\mathrm{d}\boldsymbol{\theta}_t = \sqrt{2\beta^{-1}}\,\mathrm{d}\mathcal{W}_t$ is equivalent to solving the corresponding FP equation $\partial_t \rho_t = \beta^{-1} \nabla \cdot (\nabla \rho_t)$. Consequently, we decompose the RHS of (4) into two parts: a deterministic part $F_1(\rho_t)$ and the diffusion part $F_2(\rho_t) = \beta^{-1} \nabla \cdot (\nabla \rho_t)$. Our idea is to solve $F_1$ deterministically under a PDE setting, and to solve $F_2$ stochastically based on its corresponding SDE. When adopting a particle approximation for the density $\rho_t$, the solutions of both parts are represented in terms of particles $\{\boldsymbol{\theta}_t^{(i)}\}_{i=1}^M$. Thus we can combine the solutions from the two parts directly to approximate the original exact solution of (4). Similar to the results for SVGD in Section 3.3 of [Liu17], we first formally show in Theorem 2 that when approximating $\rho_t$ with particles, i.e., $\rho_t(\theta) \approx \frac{1}{M} \sum_{i=1}^M \delta_{\boldsymbol{\theta}_t^{(i)}}(\theta)$, the PDE can be transformed into a system of deterministic differential equations with interacting particles.
When approximating $\rho_t$ in (4) with particles $\{\boldsymbol{\theta}_t^{(i)}\}_{i=1}^M$, the PDE reduces to the following system of differential equations describing the evolution of the particles over time:
On the other hand, by solving the diffusion part stochastically with its equivalent SDE counterpart, we arrive at the following differential-equation system, describing the evolution of the particles over time $t$:
Our intuition is that if the particle evolution (8) can be solved exactly, the solution of (6) will be well approximated by the particles $\{\boldsymbol{\theta}_t^{(i)}\}_{i=1}^M$. In our theory, we show this intuition is indeed true. In practice, however, solving (8) exactly is typically infeasible, so numerical methods are adopted. Furthermore, in the case of big data, following SG-MCMC, $\nabla U$ is typically replaced by its stochastic counterpart $\nabla \tilde{U}$, evaluated on a minibatch of data of size $B$ for computational feasibility. Based on the Euler method [CDC15] with a stepsize $\epsilon_k$, (8) leads to the following updates for the particles at the $k$-th iteration:
We call the algorithm with particle update equations (3) stochastic particle-optimization sampling (SPOS, described in Algorithm 1), in the sense that particles are optimized stochastically with extra random Gaussian noise. Intuitively, the added Gaussian noise enhances the ability of the algorithm to jump out of local modes, leading to better ergodic properties than standard SVGD. This serves as one of our motivations for generalizing SVGD to SPOS. To illustrate the advantage of introducing the noise term, we compare SPOS and SVGD on sampling a difficult multi-modal distribution, with the density function given in Section A of the SM. The particles are initialized on a local mode close to zero. Figure 1 plots the final locations of the particles along with the true density, showing that particles in SPOS are able to reach different modes, while in SVGD they are all trapped at a single mode. This pitfall of SVGD is studied formally in Section 4.4.
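Putting the pieces together, a minimal SPOS sketch is the SVGD-style deterministic drift plus injected Gaussian noise. The RBF kernel, the bimodal toy target, and all hyperparameters below are illustrative assumptions, not the paper's exact equation (3):

```python
import numpy as np

def spos_step(X, grad_log_p, eps=0.05, beta=1.0, h=1.0, rng=None):
    """One SPOS update: deterministic SVGD-style drift + Langevin-style noise."""
    if rng is None:
        rng = np.random.default_rng()
    M = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    grad_K = diff * K[..., None] / h ** 2
    drift = (K @ grad_log_p(X) + grad_K.sum(axis=1)) / M
    noise = rng.normal(size=X.shape) * np.sqrt(2.0 * eps / beta)
    return X + eps * drift + noise

# Bimodal toy target: mixture 0.5 N(-3, 1) + 0.5 N(3, 1).
def grad_log_p(x):
    a = np.exp(-0.5 * (x + 3) ** 2)
    b = np.exp(-0.5 * (x - 3) ** 2)
    return (-(x + 3) * a - (x - 3) * b) / (a + b)

rng = np.random.default_rng(2)
X = rng.normal(loc=-3.0, scale=0.1, size=(100, 1))   # start at one mode
for _ in range(3000):
    X = spos_step(X, grad_log_p, rng=rng)
print((X > 0).mean())  # a nontrivial fraction reaches the second mode
```

Setting the noise to zero (`beta` very large) recovers the SVGD-style update, under which the particles typically stay trapped near the initial mode in this example.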
4 Non-Asymptotic Convergence Analysis: the Convex Case
In this section, we develop non-asymptotic convergence theory for the proposed SPOS algorithm when the energy function $U$ is convex; the nonconvex case is discussed in Section 5. We prove non-asymptotic convergence rates for SPOS under the 1-Wasserstein metric $W_1$, a special case of the $p$-Wasserstein metric defined as
$W_p(\mu, \nu) = \left(\inf_{\zeta \in \Gamma(\mu, \nu)} \int \|\theta_1 - \theta_2\|^p \,\mathrm{d}\zeta(\theta_1, \theta_2)\right)^{1/p}$,
where $\Gamma(\mu, \nu)$ is the set of joint distributions on $\mathbb{R}^d \times \mathbb{R}^d$ with marginal distributions $\mu$ and $\nu$. Note that SPOS reduces to SVGD when no noise is injected (i.e., $\beta \to \infty$); thus our theory also sheds light on the convergence behavior of SVGD, for which non-asymptotic theory is currently missing, despite the asymptotic theory developed recently [Liu17, LLN18]. It is worth noting that parts of our proofs are generalizations of techniques for analyzing granular media equations in [Mal03, CGM08].
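For intuition, in one dimension the $W_1$ distance between two equal-size empirical distributions has a simple closed form via sorted samples (a standard fact; the numbers below are just a toy check):

```python
import numpy as np

def w1_empirical(x, y):
    """W_1 between two 1-D empirical measures with equally many atoms:
    the optimal coupling matches samples in sorted order."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

x = np.array([2.0, 0.0, 1.0])
y = np.array([3.0, 1.0, 2.0])
print(w1_empirical(x, y))  # 1.0: each sorted atom is transported distance 1
```

This makes the metric easy to estimate when comparing a particle set against samples from a reference distribution in one dimension.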
4.1 Basic setup and assumptions
Due to the exchangeability of the particle system in (8), if we initialize all the particles with the same distribution, they share the same distribution at each time $t$, which we denote as $\nu_t$ for each $\boldsymbol{\theta}_t^{(i)}$. Similar arguments hold for the particle system in (3), and we thus denote the distribution of each $\theta_k^{(i)}$ as $\mu_k$. Consequently, our analysis aims at bounding $W_1(\mu_k, \rho_\infty)$, since the stationary distribution $\rho_\infty$ equals our target distribution according to Proposition 1. Before proceeding to our theoretical results, we first present the following basic assumptions.
Assume $U$ and $K$ satisfy the following conditions:
There exist positive constants $m_U$ and $m_K$ such that $\langle \theta_1 - \theta_2, \nabla U(\theta_1) - \nabla U(\theta_2) \rangle \geq m_U \|\theta_1 - \theta_2\|^2$ and $\langle \theta_1 - \theta_2, \nabla K(\theta_1) - \nabla K(\theta_2) \rangle \leq -m_K \|\theta_1 - \theta_2\|^2$.
$U$ is bounded by $H_U$ and $\nabla U$ is $L_U$-Lipschitz continuous, i.e., $|U(\theta)| \leq H_U$ and $\|\nabla U(\theta_1) - \nabla U(\theta_2)\| \leq L_U \|\theta_1 - \theta_2\|$.
$K$ is $L_K$-Lipschitz continuous; $\nabla K$ is bounded by $H_K$ and $L_{\nabla K}$-Lipschitz continuous; and $K$ is an even function, i.e., $K(-\theta) = K(\theta)$.
Note the first bullet indicates $U$ to be a convex function and $K$ to be a concave function. For an RBF kernel, the latter can be achieved by setting the bandwidth large enough and considering only the concave region, for simplicity. This assumption is used to reveal an undesired property of SVGD developed below; we do not need such an assumption when analyzing a nonconvex energy function in Section 5. The boundedness condition in the second bullet is reasonable, as $U$ in our setting corresponds to an unnormalized log-posterior, which can be shifted appropriately for a specific problem. Since we typically care about a bounded space in practice, the third bullet can be satisfied due to the continuity of $U$ and $K$.
The high-level idea of bounding the target distance in this section is to decompose it as follows:
4.2 Bounds with stochastic particle approximation
We first bound the relevant terms with the following theorems.
Under Assumption 1, there exist some positive constants, independent of the number of particles and time, such that
According to Theorem 3, we can bound the term as . Furthermore, by letting , we have , which is an important intermediate result to prove the following theorem.
To ensure the bound decreases over time, one needs to choose the parameters small enough. This also sheds light on a failure case of SVGD (where $\beta \to \infty$), discussed in Section 4.4.
4.3 Bounds with a numerical solution
Under Assumption 1, for a fixed stepsize that is small enough, the corresponding term is bounded as
where $B$ is the fixed size of the minibatch in each iteration and the remaining quantities are some positive constants.
The dependence on the fixed stepsize in the bound above makes it relatively loose. Fortunately, we can improve the bound and remove this dependence by considering a decreasing-stepsize SPOS algorithm, as stated in Theorem 6.
Under Assumption 1, for a decreasing stepsize and a suitably increasing minibatch size in each iteration, the corresponding term is bounded, for stepsizes small enough, as
where $B_0$ is the initial minibatch size and the remaining quantities are some positive constants.
Note the minibatch size increases at a very low speed, e.g., only by 15 after a huge number of iterations; thus it would not affect algorithm efficiency. According to Theorem 6, the bound approaches zero as $k \to \infty$. By directly combining the results of Theorems 3–6, one can easily bound the target distance. Detailed statements are given in Theorems 15–16 in Section H of the SM.
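To illustrate how mild a slowly growing minibatch is in practice, consider the following sketch; the specific schedule here ($\epsilon_k \propto (k+1)^{-1/3}$ and $B_k = B_0 + \lfloor \log(k+1) \rfloor$) is an illustrative assumption, not the paper's exact choice:

```python
import math

def schedule(k, eps0=0.1, B0=32):
    """Illustrative decreasing stepsize with a logarithmically growing minibatch."""
    eps_k = eps0 / (k + 1) ** (1.0 / 3.0)
    B_k = B0 + int(math.log(k + 1))
    return eps_k, B_k

print(schedule(0))        # (0.1, 32)
print(schedule(10**6))    # after a million iterations, B has grown only by 13
```

A logarithmic growth rate means the per-iteration gradient cost stays essentially constant even over millions of iterations.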
4.4 A Pitfall of SVGD
Based on the above analysis, we now formally show a pitfall of SVGD: particles in SVGD tend to collapse to a local mode under certain conditions. Inspired by work on analyzing granular media equations [Mal03, CGM08], we measure this by calculating the expected distance between particles, called the expected particle distance (EPD). First, we bound the EPD of the proposed SPOS algorithm in Theorem 7.
Under Assumption 1, further assume that every particle is initialized with the same probability law, and choose the parameters such that the stated stability condition holds. Then the EPD of SPOS is bounded as:
There are two interesting cases: when the initial EPD is larger than the bound, the EPD decreases toward the bound over time $t$, representing an attraction force between particles; when the initial EPD is smaller than the bound, the EPD increases toward the same bound, representing a repulsive force between particles, e.g., when particles are initialized with the same value (zero EPD), they are pushed away from each other until the EPD increases to the aforementioned bound.
Under the same conditions as Theorem 7, the EPD in SVGD is bounded as:
We would like to emphasize two points: 1) In the case considered above, Corollary 8 indicates that particles in SVGD would collapse to a single point as $t \to \infty$. In practice, we usually find that particles are trapped in a local mode instead of fully collapsing. This is due to two reasons: numerical errors inject noise into the particles, and some particles in SVGD fall outside the concave region of $K$ stated in Assumption 1, which is required for the theory to hold. Both effects make the empirical EPD deviate from the true particle distance. 2) Corollary 8 also applies when the energy function is nonconvex; our proof in the SM considers the nonconvex case as well. Consequently, this serves as a strong theoretical motivation to apply SPOS instead of SVGD in deep learning.
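The EPD itself is straightforward to estimate from a particle set; the particle configurations below are made up purely for illustration:

```python
import numpy as np

def epd(X):
    """Expected particle distance: average pairwise Euclidean distance
    over all ordered pairs of distinct particles in X (shape (M, d))."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    M = X.shape[0]
    return dist.sum() / (M * (M - 1))   # exclude the zero diagonal

collapsed = np.zeros((10, 2))                           # all particles coincide
spread = np.vstack([np.zeros((5, 2)), np.full((5, 2), 3.0)])
print(epd(collapsed), epd(spread))  # 0.0 for collapsed particles, positive otherwise
```

Tracking this statistic over iterations is a simple practical diagnostic for the collapse behavior discussed above.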
5 Non-Asymptotic Convergence Analysis: the Nonconvex Case
Since the nonconvex case is much more involved than the convex case, we rely on different assumptions and adopt another distance metric, denoted $d(\cdot, \cdot)$, to characterize the convergence behavior of SPOS in this setting. Specifically, $d(\mu, \nu)$ is defined as $|\mathbb{E}_\mu[\phi] - \mathbb{E}_\nu[\phi]|$ for a known Lipschitz-continuous test function $\phi$ satisfying Assumption 2 below. Note that such a metric has also been adopted in [VZT16, CDC15]. Our analysis considers the particles $(\theta_k^{(1)}, \ldots, \theta_k^{(M)})$ jointly as variables in $\mathbb{R}^{Md}$. In addition, we use $\tilde{\theta}_k^{(i)}$ to denote the particles when full gradients are adopted in (3); the distribution of these particles is denoted accordingly.
Our high-level idea of bounding this distance is to decompose it as follows:
Our second idea is to concatenate the particles at each time into a single vector representation, i.e., defining the new parameter at time $t$ as $\Theta_t \triangleq (\boldsymbol{\theta}_t^{(1)}, \ldots, \boldsymbol{\theta}_t^{(M)}) \in \mathbb{R}^{Md}$. Consequently, the nonlinear SDE system (8) can be turned into a linear SDE, which means $\Theta_t$ is driven by the following linear SDE:
where the drift is a vector-valued function and the Brownian motion has dimension $Md$. Similarly, we can define the corresponding process for the full-gradient case. Hence, it can be seen that through such a decomposition in (15), the bound related to the nonlinear SDE system (8) reduces to that of a linear SDE. The second term reflects the geometric ergodicity of a linear dynamical system under a numerical method: it is known that even if a dynamical system has an exponential convergence rate to its equilibrium, its corresponding numerical method might not. Our bound for this term is essentially a specialization of the result of [MSH02], which has also been applied by [XCZG18]. The third term reflects the numerical error of a linear SDE, which has been studied in related literature such as [CDC15]. Accordingly, we adopt standard assumptions used in the analysis of SDEs [VZT16, CDC15], rephrased in Assumption 2.
For the linear SDE (16) and a Lipschitz function $\phi$, let $\psi$ be the solution functional of the Poisson equation $\mathcal{L}\psi = \phi - \bar{\phi}$, where $\mathcal{L}$ denotes the infinitesimal generator of the linear SDE (16) and $\bar{\phi}$ the stationary average of $\phi$. Assume that $\psi$ and its derivatives up to 4th order, $\mathcal{D}^l \psi$, are bounded by a function $\mathcal{V}$, i.e., $\|\mathcal{D}^l \psi\| \leq C_l \mathcal{V}^{p_l}$ for $l = 0, 1, 2, 3, 4$ and constants $C_l, p_l > 0$. Furthermore, the expectation of $\mathcal{V}$ along the process is bounded, i.e., $\sup_t \mathbb{E}[\mathcal{V}^p(\Theta_t)] < \infty$, and $\mathcal{V}$ is smooth such that $\sup_{s \in (0,1)} \mathcal{V}^p(s\Theta + (1-s)\Theta') \leq C\left(\mathcal{V}^p(\Theta) + \mathcal{V}^p(\Theta')\right)$ for all $\Theta, \Theta'$ and $p \leq \max_l 2p_l$.
$F$, $\nabla U$, and $\nabla K$ are Lipschitz continuous with respective constants; $U$ satisfies the dissipative property, i.e., $\langle \theta, \nabla U(\theta) \rangle \geq m\|\theta\|^2 - b$ for some $m, b > 0$; and Remark 4.2 applies to the nonconvex setting.
Assumption 2 is necessary to control the gap between a numerical solution and the exact solution of an SDE; specifically, it is used to bound the second and third terms above. Purely relying on the dissipative condition in Assumption 3, as in nonconvex optimization with SG-MCMC [RRT17, XCZG18], would induce a bound increasing linearly w.r.t. time $t$, and is thus not suitable for our goal. Finally, the last condition in Assumption 3 is mild and reasonable, because we expect the particles to be able to approximate all distributions equally well in the asymptotic limit of $t \to \infty$ by ergodicity, due to the injected noise. How to remove or replace this assumption is an interesting direction for future work.
Based on the assumptions above, the bounds for the corresponding terms are summarized below.
It is seen that in order to make the