Sampling has been an effective tool for approximate Bayesian inference, which becomes increasingly important in modern machine learning. In the setting of big data, recent research has developed scalable Bayesian sampling algorithms such as stochastic gradient Markov Chain Monte Carlo (SG-MCMC)[WT11] and Stein variational gradient descent (SVGD) [LW16]. These methods have facilitated important real-world applications and achieved impressive results, such as topic modeling [GCH15, LZS16], matrix factorization [CFG14, DFB14, cBCR16], differential privacy [WFS15, LCLC17], Bayesian optimization [SKFH16]
and deep neural networks[LCCC16]. Generally speaking, these methods use gradient information of a target distribution to generate samples, leading to more effective algorithms compared to traditional sampling methods. Recently, [CZW18] proposed a particle-optimization Bayesian sampling framework based on Wasserstein gradient flows, which unified SG-MCMC and SVGD in a new sampling framework called particle-optimization sampling (POS). Very recently, [ZZC18] discovered that SVGD endows some unintended pitfall, i.e. particles tend to collapse under some conditions. As a result, a remedy was proposed to inject random noise into SVGD update equations in the POS framework, leading to stochastic particle-optimization sampling (SPOS) algorithms [ZZC18]. Remarkably, for the first time, non-asymptotic convergence theory was developed for SPOS (SVGD-type algorithms) in [ZZC18].
In another aspect, in order to deal with large-scale datasets, many gradient-based methods for optimization and sampling use stochastic gradients calculated on a mini-batch of a dataset for computational feasibility. Unfortunately, extra variance is introduced into the algorithms, which would potentially degrade their performance. Consequently, variance control has been an important and interesting work for research. Some efficient solutions such as SAGA [DBLJ14] and SVRG [JZ13] were proposed to reduce variance in stochastic optimization. Subsequently, [DRP16] introduced these techniques in SG-MCMC for Bayesian sampling, which also has achieved great success in practice.
Since SPOS has enjoyed the best of both worlds by combining SG-MCMC and SVGD, it will be of greater value to further reduce its gradient variance. While both algorithm and theory have been developed for SPOS, no work has been done to investigate its variance-reduction techniques. Compared with the research on SG-MCMC where variance reduction has been well explored by recent work such as [DRP16, CFM18, ZXG18], it is much more challenging for SPOS to control the variance of stochastic gradients. This is because from a theoretical perspective, SPOS corresponds to nonlinear stochastic differential equations (SDE), where fewer existing mathematical tools can be applied for theoretical analysis. Furthermore, the fact that many particles are used in an algorithm makes it difficult to improve its performance by adding modifications to the way they interact with each other.
In this paper, we take the first attempt to study variance-reduction techniques in SPOS and develop corresponding convergence theory. We adopt recent ideas on variance reduction in SG-MCMC and stochastic-optimization algorithms, and propose three variance-reduced SPOS algorithms, denoted as SAGA particle-optimization sampling (SAGA-POS), SVRG particle-optimization sampling (SVRG-POS) and a variant of SVRG-POS without full-gradient computations, denoted as SVRG-POS. For all these variants, we prove rigorous theoretical results on their non-asymptotic convergence rates in terms of 2-Wasserstein metrics. Importantly, our theoretical results demonstrate significant improvements of convergence rates over standard SPOS. Remarkably, when comparing our convergence rates with those of variance-reduced stochastic gradient Langevin dynamics (SGLD), our theory indicates faster convergence rates of variance-reduced SPOS when the number of particles is large enough. Our theoretical results are verified by a number of experiments on both synthetic and real datasets.
2.1 Stochastic gradient MCMC
In Bayesian sampling, one aims at sampling from a posterior distribution , where represents the model parameter, and is the dataset. Let , where
is referred to as the potential energy function, and is the normalizing constant. We further define the full gradient and individual gradient used in our paper:
We can define a stochastic differential equation, an instance of Itó diffusion whose stationary distribution equals to the target posterior distribution . For example, consider the following 1st-order Langevin dynamic:
SG-MCMC algorithms are discretized numerical approximations of Itó diffusions (1). To make algorithms efficient in a big-data setting, the computationally-expensive term is replaced with its unbiased stochastic approximations with a random subset of the dataset in each interation, e.g. can be approximated by a stochastic gradient:
where is a random subset of with size . The above definition of reflects the fact that we only have information from data points in each iteration. This is the resource where the variance we try to reduce comes from. We should notice that is also used in standard SVGD and SPOS. As an example, SGLD is a numerical solution of (1), with an update equation: , where means the step size and .
2.2 Stein variational gradient descent
Different from SG-MCMC, SVGD initializes a set of particles, which are iteratively updated to approximate a posterior distribution. Specifically, we consider a set of particles drawn from some distribution . SVGD tries to update these particles by doing gradient descent on the interactive particle system via
where is a function perturbation direction chosen to minimize the KL divergence between the updated density induced by the particles and the posterior . The standard SVGD algorithm considers
as the unit ball of a vector-valued reproducing kernel Hilbert space (RKHS)associated with a kernel . In such a setting, [LW16] shows that
When approximating the expectation with an empirical distribution formed by a set of particles and adopting stochastic gradients , we arrive at the following update for the particles:
SVGD then applies (3) repeatedly for all the particles.
2.3 Stochastic particle-optimization sampling
In this paper, we focus on RBF kernel due to its wide use in both theoretical analysis and practical applications. Hence, we can use a function to denote the kernel . According to the work of [CZW18, ZZC18], the stationary distribution of the
in the following partial differential equation equals to.
It is worth noting that if we set the initial distribution of all the particles to be the same, the system of these M particles is exchangeable. So the distributions of all the are identical and can be denoted as . When solving the above diffusion process with a numerical method and adopting stochastic gradients , one arrives at the SPOS algorithm of [ZZC18] with the following update equation:
where . And SPOS will apply update (2.3) repeatedly for all the particles . Detailed theoretical results for SPOS are reviewed in the Supplementary Material (SM).
3 Variance Reduction in SPOS
In standard SPOS, each particle is updated by adopting . Due to the fact that one can only access data points in each update, the increased variance of the “noisy gradient” would cause a slower convergence rate. A simple way to alleviate this is to increase by using larger minibatches. Unfortunately, this would bring more computational costs, an undesired side effect. Thus more effective variance-reduction methods are needed for SPOS. Inspired by recent work on variance reduction in SGLD, e.g., [DRP16, CFM18, ZXG18], we develop three different variance-reduction algorithms for SPOS based on SAGA [DBLJ14] and SVRG [JZ13] in stochastic optimization.
SAGA-POS generalizes the idea of SAGA [DBLJ14] to an interactive particle-optimization system. For each particle , we use as an approximation for each individual gradient
. An unbiased estimate of the full gradientis calculated as:
In each iteration, will be partially updated under the following rule: if , and otherwise. The algorithm is described in Algorithm 1.
Compared with standard SPOS, SAGA-POS also enjoys highly computational efficiency, as it does not require calculation of each to get the full gradient in each iteration. Hence, the computational time of SAGA-POS is almost the same as that of POS. However, our analysis in Section 4 shows that SAGA-POS endows a better convergence rate.
From another aspect, SAGA-POS has the same drawback of SAGA-based algorithms, which requires memory scaling at a rate of in the worst case. For each particle , one needs to store N gradient approximations . Fortunately, as pointed out by [DRP16, CFM18], in some applications, the memory cost scales only as for SAGA-LD, which corresponds to for SAGA-POS as particles are used.
When compared with SAGA-LD, it is worth noting that particles are used in both SPOS and SAGA-POS. This makes the memory complexity times worse than SAGA-LD in training, thus SAGA-POS does not seem to bring any advantages over SAGA-LD. However, this intuition is not correct. As indicated by our theoretical results in Section 4, when the number of particles is large enough, the convergence rates of our algorithms are actually better than those of variance-reduced SGLD counterparts.
Under limited memory, we propose SVRG-POS, which is based on the SVRG method of [JZ13]. For each particle , ones needs to store a stale parameter , and update it occasionally for every iterations. At each update, we need to further conduct a global evaluation of full gradients at , i.e., . An unbiased gradient estimate is then calculated by leveraging both and as:
The algorithm is depicted in Algorithm 2, where one only needs to store and , instead of gradient estimates of all the individual . Hence the memory cost scales as , almost the same as that of standard SPOS.
We note that although SVRG-POS alleviates the storage requirement of SAGA-POS remarkably, it also endows downside that full gradients, , are needed to be re-computed every iterations, leading to high computational cost in a big-data scenario.
Similar to SAGA-POS, according to our theory in Section 4, SVRG-POS enjoys a faster convergence rate than SVRD-LD – its SGLD counterpart, although times more space are required for the particles. This provides a trade-off between convergence rates and space complexity. Previous work has shown that SAGA typically outperforms SVRG [DRP16, CFM18] in terms of convergence speed. The conclusion applies to our case, which will be verified both by theoretical analysis in Section 4 and experiments in Section 5.
The need of full gradient computation in SVRG-POS motives the development of SVRG-POS. Our algorithm is also inspired by the recent work of SVRG-LD on reducing the computational cost in SVRG-LD [ZXG18]. The main idea in SVRG-POS is to replace the full gradient computation every iterations with a subsampled gradient, i.e., to uniformly sample data points where are random samples from with replacement. Given the sub-sampled data, and are updated as: . The full algorithm is shown in Algorithm 3.
4 Convergence Analysis
In this section, we prove non-asymptotic convergence rates for the SAGA-POS, SVRG-POS and SVRG-POS algorithms under the 2-Wasserstein metric, defined as
is the set of joint distributions onwith marginal distribution and . Let denote our target distribution, and the distribution of derived via (2.3) after iterations. Our analysis aims at bounding . We first introduce our assumptions.
and satisfy the following conditions:
There exist two positive constants and , such that and .
is bounded and -Lipschitz continuous with i.e. and ; is -Lipschitz continuous for some and bounded by some constant .
is an even function, i.e., .
There exists a constant such that .
There exits a constant such that for all ,
Assumption 1 is adopted from [ZZC18] which analyzes the convergence property of SPOS. The first bullet of Assumption 1 suggests is a strongly convex function, which is the general assumption in analyzing SGLD [DK17, DM16] and its variance-reduced variants [ZXG18, CFM18]. It is worth noting that although some work has been done to investigate the non-convex case, it still has significant value to analysis the convex case, which are more instructive and meaningful to address the practical issues [DK17, DM16, ZXG18, CFM18]. All of the , , and could scale linearly with . can satisfy the above assumptions by setting the bandwidth large enough, since we mainly focus on some bounded space in practice. Consequently, can also be -Lipschitz continuous and bounded by ; K can also be Hessian Lipschitz with some positive constant
For the sake of clarity, we define some constants which will be used in our theorems.
Now we present convergence analysis for our algorithms, where is some positive constant independent of T.
Let denote the distribution of the particles after iterations with SVRG-POS in Algorithm 2. Under Assumption 1 and 2, if we choose Option and set the step size , the batch size and the epoch length , the convergence rate of SVRG-POS is bounded for all T, which mod = 0, as
If we choose Option and set the step size , the convergence rate of SVRG-POS is bounded for all as
Since the complexity has been discussed in the Section 3, we mainly focus on discussing the convergence rates here. Due to space limit, we move the comparison between convergence rates of the standard SPOS and its variance-reduced counterparts such as SAGA-POS into the SM. Specifically, adopting the standard framework of comparing different variance-reduction techniques in SGLD [DRP16, CFM18, ZXG18], we focus on the scenario where , , and all scale linearly with with . In this case, the dominating term in Theorem 1 for SAGA-POS is the last term, . Thus to achieve an accuracy of , we would need the stepsize . For SVRG-POS, the dominating term in Theorem 2 is for Option I and for Option II. Hence, for an accuracy of , the corresponding step sizes are and , respectively. Due to the fact that the mixing time for these methods is roughly proportional to the reciprocal of step size [CFM18], it is seen that when is small enough, one can have , which causes SAGA-POS converges faster than SVRG-POS (Option I). Similar results hold for Option II since the factor in would make the step size even smaller. More theoretical results are given in the SM.
We have provided theoretical analysis to support the statement of in Remark 3.2 . Moreover, we should also notice in SAGA-POS, stepsize has an extra factor, , compared with the step size used in SAGA-LD [CFM18]***For fair comparisons with our algorithms, we consider variance-reduced versions of SGLD with independent chains.. This means SAGA-POS with more particles ( is large) would outperform SAGA-LD. SVRG-POS and SVRG-POS have similar conclusions. This theoretically supports the statements of Remark 3.1 and in Remark 3.2. Furthermore, an interesting result from the above discussion is that when in SVRG-POS, there is an extra factor compared to the stepsize in SVRG-LD [CFM18]. Since the order of is higher than , one expects that the improvement of SVRG-POS over SVRG-LD is much more significant than that of SAGA-POS over SAGA-LD. This conclusion will be verified in our experiments.
We conduct experiments to verify our theory, and compare SAGA-POS, SVRG-POS and SVRG-POS with existing representative Bayesian sampling methods with/without variance-reduction techniques, e.g. SGLD and SPOS without variance reduction; SAGA-LD, SVRG-LD and SVRG-LD with variance reduction. For SVRG-POS, we focus on Option I in Algorithm 2 to verify our theory.
5.1 Synthetic log-normal distribution
We first evaluate our proposed algorithms on a log-normal synthetic data, defined as where . We calculate log-MSE of the sampled “mean” w.r.t. the true value, and plot the log-MSE versus number of passes through data [CFM18], like other variance-reduction algorithms in Figure 1, which shows that SAGA-POS and SVRG-POS converge the fastest among other algorithms. It is also interesting to see SPOS even outperforms both SAGA-LD and SVRG-LD.
5.2 Bayesian logistic regression
Following related work such as [DRP16]
, we test the proposed algorithms for Bayesian-logistic-regression (BLR) on four publicly available datasets from the UCI machine learning repository:(690-14), (768-8), (1151-20) and (100000-18), where means a dataset of data points with dimensionality . The first three datasets are relatively small, and the last one is a large dataset suitable to evaluate scalable Bayesian sampling algorithms.
Specifically, consider a dataset with samples, where and . The likelihood of a BLR model is written as with regression coefficient , which is assumed to be sampled from a standard multivariate Gaussian prior for simplicity. The datasets are split into 80% training data and 20% testing data. Optimized constant stepsizes are applied for each algorithm via grid search. Following existing work, we report testing accuracy and log-likelihood versus the number of data passes for each dataset, averaging over 10 runs with 50 particles. The minibatch size is set to 15 for all experiments.
5.2.1 Variance-reduced SPOS versus SPOS
We first compare SAGA-POS, SVRG-POS and SVRG-POS with SPOS without variance reduction proposed in [ZZC18]. The testing accuracies and log-likelihoods versus number of passes through data on the four datasets are plotted in Figure 2. It is observed that SAGA-POS converges faster than both SVRG-POS and SVRG-POS, all of which outperform SPOS significantly. On the largest dataset SUSY, SAGA-POS starts only after one pass through data, which then converges quickly, outperforming other algorithms. And SVRG-POS outperforms SVRG-POS due to the dataset SUSY is so large. All of these phenomena are consistent with our theory.
5.2.2 Variance-reduced SPOS versus variance-reduced SGLD
Next we compare the three variance-reduced SPOS algorithms with its SGLD counterparts, i.e., SAGA-LD, SVRG-LD and SVRG-LD. The results are plotted in Figure 3. Similar phenomena are observed, where both SAGA-POS and SVRG-POS outperform SAGA-LD and SVRG-LD, respectively, consistent with our theoretical results discussed in Remark 3.1 and 3.2. Interestingly, in the PIMA dataset case, SVRG-LD is observed to perform even worse (converge slower) than standard SGLD. Furthermore, as discussed in Remark 4, our theory indicates that the improvement of SVRG-POS over SVRG-LD is more significant than that of SAGA-POS over SAGA-LD. This is indeed true by inspecting the plots in Figure 3.
5.2.3 Impact of number of particles
Finally we examine the impact of number of particles to the convergence rates. As indicated by Theorems 1-3, for a fixed number of iterations , the convergence error in terms of 2-Wasserstein distance decreases with increasing number of particles. To verify this, we run SAGA-POS and SVRG-POS for BLR with the number of particles ranging between . The test log-likelihoods versus iteration numbers are plotted in Figure 4, demonstrating consistency with our theory.
We propose several variance-reduction techniques for stochastic particle-optimization sampling, and for the first time, develop nonasymptotic convergence theory for the algorithms in terms of 2-Wasserstein metrics. Our theoretical results indicate the improvement of convergence rates for the proposed variance-reduced SPOS compared to both standard SPOS and the variance-reduced SGLD algorithms. Our theory is verified by a number of experiments on both synthetic data and real data for Bayesian Logistic regression. Leveraging both our theory and empirical findings, we recommend the following algorithm choices in practice: SAGA-POS is preferable when storage is not a concern; SVRG-POS is a better choice when storage is a concern and full gradients are feasible to calculate; Otherwise, SVRG-POS is a good choice and works well in practice.
- [cBCR16] U. Şimşekli, R. Badeau, A. T. Cemgil, and G. Richard. Stochastic Quasi-Newton Langevin Monte Carlo. In ICML, 2016.
- [CFG14] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
- [CFM18] Niladri Chatterji, Nicolas Flammarion, Yi-An Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient monte carlo. ICML, 2018.
- [CZW18] C. Chen, R. Zhang, W. Wang, B. Li, and L. Chen. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018.
- [DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Nips, 2014.
- [DFB14] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
- [DK17] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. arxiv preprint arxiv:1710.00095v2, 2017.
- [DM16] Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
- [DRP16] A. Dubey, S. J. Reddi, B. Póczos, A. J. Smola, and E. P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In NIPS, 2016.
- [GCH15] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.NIPS, 2013.
- [Kol31] A. Kolmogoroff. Some studies in machine learning using the game of checkers. Mathematische Annalen, 104