1 Introduction
Sampling has been an effective tool for approximate Bayesian inference, which becomes increasingly important in modern machine learning. In the setting of big data, recent research has developed scalable Bayesian sampling algorithms such as stochastic gradient Markov Chain Monte Carlo (SGMCMC)
[WT11] and Stein variational gradient descent (SVGD) [LW16]. These methods have facilitated important realworld applications and achieved impressive results, such as topic modeling [GCH15, LZS16], matrix factorization [CFG14, DFB14, cBCR16], differential privacy [WFS15, LCLC17], Bayesian optimization [SKFH16]and deep neural networks
[LCCC16]. Generally speaking, these methods use gradient information of a target distribution to generate samples, leading to more effective algorithms compared to traditional sampling methods. Recently, [CZW18] proposed a particleoptimization Bayesian sampling framework based on Wasserstein gradient flows, which unified SGMCMC and SVGD in a new sampling framework called particleoptimization sampling (POS). Very recently, [ZZC18] discovered that SVGD endows some unintended pitfall, i.e. particles tend to collapse under some conditions. As a result, a remedy was proposed to inject random noise into SVGD update equations in the POS framework, leading to stochastic particleoptimization sampling (SPOS) algorithms [ZZC18]. Remarkably, for the first time, nonasymptotic convergence theory was developed for SPOS (SVGDtype algorithms) in [ZZC18].In another aspect, in order to deal with largescale datasets, many gradientbased methods for optimization and sampling use stochastic gradients calculated on a minibatch of a dataset for computational feasibility. Unfortunately, extra variance is introduced into the algorithms, which would potentially degrade their performance. Consequently, variance control has been an important and interesting work for research. Some efficient solutions such as SAGA [DBLJ14] and SVRG [JZ13] were proposed to reduce variance in stochastic optimization. Subsequently, [DRP16] introduced these techniques in SGMCMC for Bayesian sampling, which also has achieved great success in practice.
Since SPOS has enjoyed the best of both worlds by combining SGMCMC and SVGD, it will be of greater value to further reduce its gradient variance. While both algorithm and theory have been developed for SPOS, no work has been done to investigate its variancereduction techniques. Compared with the research on SGMCMC where variance reduction has been well explored by recent work such as [DRP16, CFM18, ZXG18], it is much more challenging for SPOS to control the variance of stochastic gradients. This is because from a theoretical perspective, SPOS corresponds to nonlinear stochastic differential equations (SDE), where fewer existing mathematical tools can be applied for theoretical analysis. Furthermore, the fact that many particles are used in an algorithm makes it difficult to improve its performance by adding modifications to the way they interact with each other.
In this paper, we take the first attempt to study variancereduction techniques in SPOS and develop corresponding convergence theory. We adopt recent ideas on variance reduction in SGMCMC and stochasticoptimization algorithms, and propose three variancereduced SPOS algorithms, denoted as SAGA particleoptimization sampling (SAGAPOS), SVRG particleoptimization sampling (SVRGPOS) and a variant of SVRGPOS without fullgradient computations, denoted as SVRGPOS. For all these variants, we prove rigorous theoretical results on their nonasymptotic convergence rates in terms of 2Wasserstein metrics. Importantly, our theoretical results demonstrate significant improvements of convergence rates over standard SPOS. Remarkably, when comparing our convergence rates with those of variancereduced stochastic gradient Langevin dynamics (SGLD), our theory indicates faster convergence rates of variancereduced SPOS when the number of particles is large enough. Our theoretical results are verified by a number of experiments on both synthetic and real datasets.
2 Preliminaries
2.1 Stochastic gradient MCMC
In Bayesian sampling, one aims at sampling from a posterior distribution , where represents the model parameter, and is the dataset. Let , where
is referred to as the potential energy function, and is the normalizing constant. We further define the full gradient and individual gradient used in our paper:
We can define a stochastic differential equation, an instance of Itó diffusion whose stationary distribution equals to the target posterior distribution . For example, consider the following 1storder Langevin dynamic:
(1) 
where is the time index; is dimensional Brownian motion, and a scaling factor. By the FokkerPlanck equation [Kol31, Ris89], the stationary distribution of (1) equals to .
SGMCMC algorithms are discretized numerical approximations of Itó diffusions (1). To make algorithms efficient in a bigdata setting, the computationallyexpensive term is replaced with its unbiased stochastic approximations with a random subset of the dataset in each interation, e.g. can be approximated by a stochastic gradient:
where is a random subset of with size . The above definition of reflects the fact that we only have information from data points in each iteration. This is the resource where the variance we try to reduce comes from. We should notice that is also used in standard SVGD and SPOS. As an example, SGLD is a numerical solution of (1), with an update equation: , where means the step size and .
2.2 Stein variational gradient descent
Different from SGMCMC, SVGD initializes a set of particles, which are iteratively updated to approximate a posterior distribution. Specifically, we consider a set of particles drawn from some distribution . SVGD tries to update these particles by doing gradient descent on the interactive particle system via
where is a function perturbation direction chosen to minimize the KL divergence between the updated density induced by the particles and the posterior . The standard SVGD algorithm considers
as the unit ball of a vectorvalued reproducing kernel Hilbert space (RKHS)
associated with a kernel . In such a setting, [LW16] shows that(2) 
When approximating the expectation with an empirical distribution formed by a set of particles and adopting stochastic gradients , we arrive at the following update for the particles:
(3) 
SVGD then applies (3) repeatedly for all the particles.
2.3 Stochastic particleoptimization sampling
In this paper, we focus on RBF kernel due to its wide use in both theoretical analysis and practical applications. Hence, we can use a function to denote the kernel . According to the work of [CZW18, ZZC18], the stationary distribution of the
in the following partial differential equation equals to
.(4) 
When approximating the in Eq.(4) with an empirical distribution formed by a set of particles , [ZZC18] derive the following diffusion process characterizing the SPOS algorithm.
(5) 
It is worth noting that if we set the initial distribution of all the particles to be the same, the system of these M particles is exchangeable. So the distributions of all the are identical and can be denoted as . When solving the above diffusion process with a numerical method and adopting stochastic gradients , one arrives at the SPOS algorithm of [ZZC18] with the following update equation:
(6) 
where . And SPOS will apply update (2.3) repeatedly for all the particles . Detailed theoretical results for SPOS are reviewed in the Supplementary Material (SM).
3 Variance Reduction in SPOS
In standard SPOS, each particle is updated by adopting . Due to the fact that one can only access data points in each update, the increased variance of the “noisy gradient” would cause a slower convergence rate. A simple way to alleviate this is to increase by using larger minibatches. Unfortunately, this would bring more computational costs, an undesired side effect. Thus more effective variancereduction methods are needed for SPOS. Inspired by recent work on variance reduction in SGLD, e.g., [DRP16, CFM18, ZXG18], we develop three different variancereduction algorithms for SPOS based on SAGA [DBLJ14] and SVRG [JZ13] in stochastic optimization.
3.1 SagaPos
SAGAPOS generalizes the idea of SAGA [DBLJ14] to an interactive particleoptimization system. For each particle , we use as an approximation for each individual gradient
. An unbiased estimate of the full gradient
is calculated as:(7) 
In each iteration, will be partially updated under the following rule: if , and otherwise. The algorithm is described in Algorithm 1.
Compared with standard SPOS, SAGAPOS also enjoys highly computational efficiency, as it does not require calculation of each to get the full gradient in each iteration. Hence, the computational time of SAGAPOS is almost the same as that of POS. However, our analysis in Section 4 shows that SAGAPOS endows a better convergence rate.
From another aspect, SAGAPOS has the same drawback of SAGAbased algorithms, which requires memory scaling at a rate of in the worst case. For each particle , one needs to store N gradient approximations . Fortunately, as pointed out by [DRP16, CFM18], in some applications, the memory cost scales only as for SAGALD, which corresponds to for SAGAPOS as particles are used.

When compared with SAGALD, it is worth noting that particles are used in both SPOS and SAGAPOS. This makes the memory complexity times worse than SAGALD in training, thus SAGAPOS does not seem to bring any advantages over SAGALD. However, this intuition is not correct. As indicated by our theoretical results in Section 4, when the number of particles is large enough, the convergence rates of our algorithms are actually better than those of variancereduced SGLD counterparts.
3.2 SvrgPos
Under limited memory, we propose SVRGPOS, which is based on the SVRG method of [JZ13]. For each particle , ones needs to store a stale parameter , and update it occasionally for every iterations. At each update, we need to further conduct a global evaluation of full gradients at , i.e., . An unbiased gradient estimate is then calculated by leveraging both and as:
(8) 
The algorithm is depicted in Algorithm 2, where one only needs to store and , instead of gradient estimates of all the individual . Hence the memory cost scales as , almost the same as that of standard SPOS.
We note that although SVRGPOS alleviates the storage requirement of SAGAPOS remarkably, it also endows downside that full gradients, , are needed to be recomputed every iterations, leading to high computational cost in a bigdata scenario.

Similar to SAGAPOS, according to our theory in Section 4, SVRGPOS enjoys a faster convergence rate than SVRDLD – its SGLD counterpart, although times more space are required for the particles. This provides a tradeoff between convergence rates and space complexity. Previous work has shown that SAGA typically outperforms SVRG [DRP16, CFM18] in terms of convergence speed. The conclusion applies to our case, which will be verified both by theoretical analysis in Section 4 and experiments in Section 5.
3.3 SvrgPos
The need of full gradient computation in SVRGPOS motives the development of SVRGPOS. Our algorithm is also inspired by the recent work of SVRGLD on reducing the computational cost in SVRGLD [ZXG18]. The main idea in SVRGPOS is to replace the full gradient computation every iterations with a subsampled gradient, i.e., to uniformly sample data points where are random samples from with replacement. Given the subsampled data, and are updated as: . The full algorithm is shown in Algorithm 3.
4 Convergence Analysis
In this section, we prove nonasymptotic convergence rates for the SAGAPOS, SVRGPOS and SVRGPOS algorithms under the 2Wasserstein metric, defined as
where
is the set of joint distributions on
with marginal distribution and . Let denote our target distribution, and the distribution of derived via (2.3) after iterations. Our analysis aims at bounding . We first introduce our assumptions.Assumption 1
and satisfy the following conditions:

There exist two positive constants and , such that and .

is bounded and Lipschitz continuous with i.e. and ; is Lipschitz continuous for some and bounded by some constant .

is an even function, i.e., .
Assumption 2
There exists a constant such that .
Assumption 3
There exits a constant such that for all ,

Assumption 1 is adopted from [ZZC18] which analyzes the convergence property of SPOS. The first bullet of Assumption 1 suggests is a strongly convex function, which is the general assumption in analyzing SGLD [DK17, DM16] and its variancereduced variants [ZXG18, CFM18]. It is worth noting that although some work has been done to investigate the nonconvex case, it still has significant value to analysis the convex case, which are more instructive and meaningful to address the practical issues [DK17, DM16, ZXG18, CFM18]. All of the , , and could scale linearly with . can satisfy the above assumptions by setting the bandwidth large enough, since we mainly focus on some bounded space in practice. Consequently, can also be Lipschitz continuous and bounded by ; K can also be Hessian Lipschitz with some positive constant
For the sake of clarity, we define some constants which will be used in our theorems.
Now we present convergence analysis for our algorithms, where is some positive constant independent of T.
Theorem 1
Theorem 2
Let denote the distribution of the particles after iterations with SVRGPOS in Algorithm 2. Under Assumption 1 and 2, if we choose Option and set the step size , the batch size and the epoch length , the convergence rate of SVRGPOS is bounded for all T, which mod = 0, as
(10) 
If we choose Option and set the step size , the convergence rate of SVRGPOS is bounded for all as
(11) 
Theorem 3
Since the complexity has been discussed in the Section 3, we mainly focus on discussing the convergence rates here. Due to space limit, we move the comparison between convergence rates of the standard SPOS and its variancereduced counterparts such as SAGAPOS into the SM. Specifically, adopting the standard framework of comparing different variancereduction techniques in SGLD [DRP16, CFM18, ZXG18], we focus on the scenario where , , and all scale linearly with with . In this case, the dominating term in Theorem 1 for SAGAPOS is the last term, . Thus to achieve an accuracy of , we would need the stepsize . For SVRGPOS, the dominating term in Theorem 2 is for Option I and for Option II. Hence, for an accuracy of , the corresponding step sizes are and , respectively. Due to the fact that the mixing time for these methods is roughly proportional to the reciprocal of step size [CFM18], it is seen that when is small enough, one can have , which causes SAGAPOS converges faster than SVRGPOS (Option I). Similar results hold for Option II since the factor in would make the step size even smaller. More theoretical results are given in the SM.

We have provided theoretical analysis to support the statement of in Remark 3.2 . Moreover, we should also notice in SAGAPOS, stepsize has an extra factor, , compared with the step size used in SAGALD [CFM18]^{*}^{*}*For fair comparisons with our algorithms, we consider variancereduced versions of SGLD with independent chains.. This means SAGAPOS with more particles ( is large) would outperform SAGALD. SVRGPOS and SVRGPOS have similar conclusions. This theoretically supports the statements of Remark 3.1 and in Remark 3.2. Furthermore, an interesting result from the above discussion is that when in SVRGPOS, there is an extra factor compared to the stepsize in SVRGLD [CFM18]. Since the order of is higher than , one expects that the improvement of SVRGPOS over SVRGLD is much more significant than that of SAGAPOS over SAGALD. This conclusion will be verified in our experiments.
5 Experiments
We conduct experiments to verify our theory, and compare SAGAPOS, SVRGPOS and SVRGPOS with existing representative Bayesian sampling methods with/without variancereduction techniques, e.g. SGLD and SPOS without variance reduction; SAGALD, SVRGLD and SVRGLD with variance reduction. For SVRGPOS, we focus on Option I in Algorithm 2 to verify our theory.
5.1 Synthetic lognormal distribution
We first evaluate our proposed algorithms on a lognormal synthetic data, defined as where . We calculate logMSE of the sampled “mean” w.r.t. the true value, and plot the logMSE versus number of passes through data [CFM18], like other variancereduction algorithms in Figure 1, which shows that SAGAPOS and SVRGPOS converge the fastest among other algorithms. It is also interesting to see SPOS even outperforms both SAGALD and SVRGLD.
5.2 Bayesian logistic regression
Following related work such as [DRP16]
, we test the proposed algorithms for Bayesianlogisticregression (BLR) on four publicly available datasets from the UCI machine learning repository:
(69014), (7688), (115120) and (10000018), where means a dataset of data points with dimensionality . The first three datasets are relatively small, and the last one is a large dataset suitable to evaluate scalable Bayesian sampling algorithms.Specifically, consider a dataset with samples, where and . The likelihood of a BLR model is written as with regression coefficient , which is assumed to be sampled from a standard multivariate Gaussian prior for simplicity. The datasets are split into 80% training data and 20% testing data. Optimized constant stepsizes are applied for each algorithm via grid search. Following existing work, we report testing accuracy and loglikelihood versus the number of data passes for each dataset, averaging over 10 runs with 50 particles. The minibatch size is set to 15 for all experiments.
5.2.1 Variancereduced SPOS versus SPOS
We first compare SAGAPOS, SVRGPOS and SVRGPOS with SPOS without variance reduction proposed in [ZZC18]. The testing accuracies and loglikelihoods versus number of passes through data on the four datasets are plotted in Figure 2. It is observed that SAGAPOS converges faster than both SVRGPOS and SVRGPOS, all of which outperform SPOS significantly. On the largest dataset SUSY, SAGAPOS starts only after one pass through data, which then converges quickly, outperforming other algorithms. And SVRGPOS outperforms SVRGPOS due to the dataset SUSY is so large. All of these phenomena are consistent with our theory.
5.2.2 Variancereduced SPOS versus variancereduced SGLD
Next we compare the three variancereduced SPOS algorithms with its SGLD counterparts, i.e., SAGALD, SVRGLD and SVRGLD. The results are plotted in Figure 3. Similar phenomena are observed, where both SAGAPOS and SVRGPOS outperform SAGALD and SVRGLD, respectively, consistent with our theoretical results discussed in Remark 3.1 and 3.2. Interestingly, in the PIMA dataset case, SVRGLD is observed to perform even worse (converge slower) than standard SGLD. Furthermore, as discussed in Remark 4, our theory indicates that the improvement of SVRGPOS over SVRGLD is more significant than that of SAGAPOS over SAGALD. This is indeed true by inspecting the plots in Figure 3.
5.2.3 Impact of number of particles
Finally we examine the impact of number of particles to the convergence rates. As indicated by Theorems 13, for a fixed number of iterations , the convergence error in terms of 2Wasserstein distance decreases with increasing number of particles. To verify this, we run SAGAPOS and SVRGPOS for BLR with the number of particles ranging between . The test loglikelihoods versus iteration numbers are plotted in Figure 4, demonstrating consistency with our theory.
6 Conclusion
We propose several variancereduction techniques for stochastic particleoptimization sampling, and for the first time, develop nonasymptotic convergence theory for the algorithms in terms of 2Wasserstein metrics. Our theoretical results indicate the improvement of convergence rates for the proposed variancereduced SPOS compared to both standard SPOS and the variancereduced SGLD algorithms. Our theory is verified by a number of experiments on both synthetic data and real data for Bayesian Logistic regression. Leveraging both our theory and empirical findings, we recommend the following algorithm choices in practice: SAGAPOS is preferable when storage is not a concern; SVRGPOS is a better choice when storage is a concern and full gradients are feasible to calculate; Otherwise, SVRGPOS is a good choice and works well in practice.
References
 [cBCR16] U. Şimşekli, R. Badeau, A. T. Cemgil, and G. Richard. Stochastic QuasiNewton Langevin Monte Carlo. In ICML, 2016.
 [CFG14] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
 [CFM18] Niladri Chatterji, Nicolas Flammarion, YiAn Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient monte carlo. ICML, 2018.
 [CZW18] C. Chen, R. Zhang, W. Wang, B. Li, and L. Chen. A unified particleoptimization framework for scalable Bayesian sampling. In UAI, 2018.
 [DBLJ14] Aaron Defazio, Francis Bach, and Simon LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. Nips, 2014.
 [DFB14] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
 [DK17] Arnak S. Dalalyan and Avetik Karagulyan. Userfriendly guarantees for the langevin monte carlo with inaccurate gradient. arxiv preprint arxiv:1710.00095v2, 2017.
 [DM16] Alain Durmus and Eric Moulines. Highdimensional bayesian inference via the unadjusted langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
 [DRP16] A. Dubey, S. J. Reddi, B. Póczos, A. J. Smola, and E. P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In NIPS, 2016.
 [GCH15] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.

[JZ13]
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.
NIPS, 2013.  [Kol31] A. Kolmogoroff. Some studies in machine learning usin
Comments
There are no comments yet.