Introduction
With the increasing size of datasets of interest to machine learning, stochastic gradient Markov Chain Monte Carlo (SGMCMC) has been established as an effective tool for large-scale Bayesian learning, with applications in topic modeling
[Gan et al.2015, Liu, Zhu, and Song2016], matrix factorization [Chen, Fox, and Guestrin2014, Ding et al.2014, Şimşekli et al.2016], differential privacy [Wang, Fienberg, and Smola2015], Bayesian optimization [Springenberg et al.2016] and deep neural networks [Li et al.2016]. Typically, in each iteration of an SGMCMC algorithm, a minibatch of data is used to generate the next sample, yielding computational efficiency comparable to stochastic optimization. While a large number of SGMCMC algorithms have been proposed, their optimal convergence rates generally appear to share the same form, and are typically slower than stochastic gradient descent (SGD)
[Chen, Ding, and Carin2015]. The impact of stochastic gradient noise comes from a higher-order term (see Lemma 1 below), which was omitted in the analysis of [Chen, Ding, and Carin2015]. In other words, current theoretical analysis for SGMCMC does not consider the impact of minibatch size (corresponding to stochastic gradient noise), making the underlying convergence theory w.r.t. minibatch size unclear. Recent work by [Dubey et al.2016] on applying variance reduction in stochastic gradient Langevin dynamics (SGLD) claims to improve the convergence rate of standard SGLD [Welling and Teh2011, Teh, Thiery, and Vollmer2016, Vollmer, Zygalakis, and Teh2016]. The theoretical analysis in [Dubey et al.2016] omits certain aspects of variance reduction in SGLD that we seek to address here: i) how does the minibatch size (or equivalently the stochastic gradient noise) affect the convergence rate of an SGMCMC algorithm? and ii) how can one effectively reduce the stochastic gradient noise in SGMCMC to improve its convergence rate, from both an algorithmic and a theoretical perspective? For i), we provide theoretical results on the convergence rates of SGMCMC w.r.t. minibatch size. For ii), we propose a practical variance-reduction technique for SGMCMC, as well as theory to analyze improvements of the corresponding convergence rates. The resulting SGMCMC algorithm is referred to as variance-reduction SGMCMC (vrSGMCMC).
For a clearer description, we first define notation. In a Bayesian model, our goal is typically to evaluate the posterior average of a test function $\phi(\mathbf{x})$, defined as $\bar{\phi} \triangleq \int \phi(\mathbf{x}) \rho(\mathbf{x}) \mathrm{d}\mathbf{x}$, where $\rho(\mathbf{x})$ is the target posterior distribution with the possibly augmented model parameters $\mathbf{x}$ (see Section Preliminaries). Let $\{\mathbf{x}_l\}_{l=1}^{L}$ be the samples generated from an SGMCMC algorithm. We use the sample average, $\hat{\phi}_L \triangleq \frac{1}{L}\sum_{l=1}^{L}\phi(\mathbf{x}_l)$, to approximate $\bar{\phi}$. The corresponding bias and mean square error (MSE) are defined as $\big|\mathbb{E}\hat{\phi}_L - \bar{\phi}\big|$ and $\mathbb{E}\big(\hat{\phi}_L - \bar{\phi}\big)^2$, respectively. In vrSGMCMC, unbiased estimations of full gradients are used, leading to the same bias bound as standard SGMCMC
[Chen, Ding, and Carin2015]. As a result, we focus here on analyzing the MSE bound for vrSGMCMC. Specifically, we first analyze how minibatch size affects the MSE convergence rate of standard SGMCMC, summarized in two cases: i) for a limited computation budget, the optimal MSE bound is achieved when using full gradients in the algorithm; ii) for a large enough computational budget, i.e., in a long-run setting, stochastic gradients with minibatches of size one are preferable. This indicates that stochastic gradient noise hurts SGMCMC at the beginning of the algorithm. While it is computationally infeasible to use full gradients in practice, a remedy to overcome this issue is to use relatively small minibatches with variance reduction techniques to reduce stochastic gradient noise. Consequently, we propose a practical variance-reduction scheme, making SGMCMC computationally efficient in a big-data setting. Finally, we develop theory to analyze the benefit of the proposed variance-reduction technique and empirically show improvements of vrSGMCMC over standard SGMCMC algorithms.
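To make the bias and MSE definitions concrete, the following toy check (our own illustration, not from the paper) estimates both quantities for an idealized sampler that draws exact posterior samples from a standard Gaussian, with test function $\phi(x)=x^2$, so the true posterior average is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: x ** 2   # test function
phi_bar = 1.0            # true posterior average under rho = N(0, 1)

L = 2000       # samples per run
n_runs = 200   # independent runs to estimate the expectations
# Idealized sampler: exact posterior draws (real SGMCMC samples would be
# approximate and correlated, but the definitions are the same).
phi_hats = np.array([phi(rng.normal(0.0, 1.0, size=L)).mean() for _ in range(n_runs)])
bias = abs(phi_hats.mean() - phi_bar)            # |E[phi_hat] - phi_bar|
mse = float(np.mean((phi_hats - phi_bar) ** 2))  # E[(phi_hat - phi_bar)^2]
```

For exact draws the bias is essentially zero and the MSE decays as $O(1/L)$; SGMCMC adds discretization and stochastic-gradient terms on top of this.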
Preliminaries
SGMCMC is a family of scalable Bayesian sampling algorithms, developed recently to generate approximate samples from a posterior distribution . Here
represents a model parameter vector and
represents the data available to learn the model. In general, SGMCMC algorithms are discretized numerical approximations of continuous-time Itô diffusions [Chen, Ding, and Carin2015, Ma, Chen, and Fox2015], which are equipped with stationary distributions coincident with the target posterior distributions. An Itô diffusion is written as
$$\mathrm{d}\mathbf{X}_t = F(\mathbf{X}_t)\,\mathrm{d}t + g(\mathbf{X}_t)\,\mathrm{d}\mathbf{W}_t\,, \qquad (1)$$
where $\mathbf{X}_t \in \mathbb{R}^n$ is the state variable, $t$ is the time index, and $\mathbf{W}_t \in \mathbb{R}^n$ is $n$-dimensional Brownian motion. Typically, $\mathbf{X}$ is an augmentation of the model parameters $\boldsymbol{\theta}$, so $\boldsymbol{\theta} \subseteq \mathbf{X}$. Functions $F: \mathbb{R}^n \rightarrow \mathbb{R}^n$ and $g: \mathbb{R}^n \rightarrow \mathbb{R}^{n \times n}$ are assumed to satisfy the Lipschitz continuity condition [Ghosh2011].
According to [Ma, Chen, and Fox2015], all SGMCMC algorithms can be formulated by defining appropriate functions $F$ and $g$ in (1). For example, the stochastic gradient Langevin dynamics (SGLD) model corresponds to $\mathbf{X} = \boldsymbol{\theta}$, $F(\mathbf{X}_t) = -\nabla_{\boldsymbol{\theta}} U(\boldsymbol{\theta})$, and $g(\mathbf{X}_t) = \sqrt{2}\,\mathbf{I}_n$, where $U(\boldsymbol{\theta})$ denotes the unnormalized negative log-posterior. Similar formulas can be defined for other SGMCMC algorithms, such as stochastic gradient Hamiltonian Monte Carlo (SGHMC) [Chen, Fox, and Guestrin2014] and stochastic gradient thermostats (SGNHT) [Ding et al.2014].
An SGMCMC algorithm is usually developed by numerically solving the corresponding Itô diffusion and replacing the full gradient with an unbiased estimate from a minibatch of data in each iteration. For example, in SGLD, this yields an update equation of $\boldsymbol{\theta}_{l+1} = \boldsymbol{\theta}_l - h \nabla_{\boldsymbol{\theta}} \tilde{U}_l(\boldsymbol{\theta}_l) + \sqrt{2h}\,\boldsymbol{\zeta}_l$ for the $l$th iteration, where $h$ is the stepsize, $\boldsymbol{\zeta}_l \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$, and $\tilde{U}_l$ is the minibatch estimate of $U$. This brings two sources of error into the chain: numerical error (from discretization of the differential equation) and stochastic noise error from use of minibatches. In particular, [Chen, Ding, and Carin2015] proved the following bias and MSE bounds for general SGMCMC algorithms:
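As an illustration of the SGLD update just described, here is a minimal sketch (our own toy code, not the paper's implementation), assuming a conjugate Gaussian model $x_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$ so the stochastic gradient is available in closed form:

```python
import numpy as np

def sgld(grad_u_tilde, data, theta0, h, n_iters, n, rng):
    """Minimal SGLD: theta <- theta - h * grad_U_tilde + sqrt(2h) * N(0, 1) noise."""
    N = len(data)
    theta, samples = theta0, []
    for _ in range(n_iters):
        idx = rng.choice(N, size=n, replace=False)
        theta = theta - h * grad_u_tilde(theta, data[idx], N) \
                + rng.normal(0.0, np.sqrt(2.0 * h))
        samples.append(theta)
    return np.array(samples)

# Conjugate toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 1), so
# U(theta) = theta^2/2 + sum_i (theta - x_i)^2/2 up to constants.
def grad_u_tilde(theta, batch, N):
    # Unbiased stochastic gradient: prior term + rescaled minibatch likelihood term.
    return theta - (N / len(batch)) * np.sum(batch - theta)

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=1000)
samples = sgld(grad_u_tilde, data, theta0=0.0, h=1e-4, n_iters=5000, n=10, rng=rng)
posterior_mean = samples[1000:].mean()  # discard burn-in; approximates the posterior mean
```

The $N/n$ rescaling of the minibatch term is what makes the stochastic gradient unbiased, and the $\sqrt{2h}$ Gaussian injection is the discretized Brownian motion of (1).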
Lemma 1 ([Chen, Ding, and Carin2015])
Here $\Delta V_l \triangleq \mathcal{L} - \tilde{\mathcal{L}}_l$, where $\mathcal{L}$ is the infinitesimal generator of the Itô diffusion (1), defined as $\mathcal{L}f(\mathbf{X}_t) \triangleq \big(F(\mathbf{X}_t)\cdot\nabla_{\mathbf{x}} + \tfrac{1}{2}\big(g(\mathbf{X}_t)g(\mathbf{X}_t)^{\top}\big):\nabla_{\mathbf{x}}\nabla_{\mathbf{x}}^{\top}\big)f(\mathbf{X}_t)$, for any compactly supported twice differentiable function $f$. $\mathbf{a}\cdot\mathbf{b} \triangleq \mathbf{a}^{\top}\mathbf{b}$ for two vectors $\mathbf{a}$ and $\mathbf{b}$, $\mathbf{A}:\mathbf{B} \triangleq \operatorname{tr}(\mathbf{A}^{\top}\mathbf{B})$ for two matrices $\mathbf{A}$ and $\mathbf{B}$. $\|\cdot\|$ is defined as the standard operator norm acting on the space of bounded functions, e.g., $\|\mathcal{L}f\| \triangleq \sup_{\mathbf{x}}|\mathcal{L}f(\mathbf{x})|$ for a function $f$. $\tilde{\mathcal{L}}_l$ is the same as $\mathcal{L}$ except for the substitution of the stochastic gradient $\nabla_{\boldsymbol{\theta}}\tilde{U}_l$ for the full gradient $\nabla_{\boldsymbol{\theta}} U$ due to the usage of a stochastic gradient in the $l$th iteration. By substituting the definitions of $\mathcal{L}$ and $\tilde{\mathcal{L}}_l$, typically we have $\Delta V_l = \big(\nabla_{\boldsymbol{\theta}}\tilde{U}_l - \nabla_{\boldsymbol{\theta}} U\big)\cdot\nabla_{\boldsymbol{\theta}}$.
By using an unbiased estimate of the true gradient, the term in the bias bound in Lemma 1 vanishes, indicating that stochastic gradients (or equivalently minibatch size) only affect the MSE bound. Consequently, we focus on improving the MSE bound with the proposed variancereduction SGMCMC framework.
Practical VarianceReduction SGMCMC
We first motivate the necessity of variance reduction in SGMCMC by analyzing how minibatch size affects the MSE bound. A practical variance-reduction scheme is then proposed, which is efficient from both computational and storage perspectives. A comparison with existing variance-reduction SGMCMC approaches is also highlighted. Previous research has revealed that the convergence of diffusion-based MCMC scales at an order of w.r.t. dimension [Durmus et al.2016]. Since our interest is in SGMCMC, we follow standard analysis [Vollmer, Zygalakis, and Teh2016] and do not consider the impact of dimension in our analysis.
The necessity of variance reduction: a theoretical perspective
It is clear from Lemma 1 that the variance of noisy stochastic gradients plays an important role in the MSE bound of an SGMCMC algorithm. What is unclear is how exactly minibatch size affects the convergence rate. Intuitively, minibatch size appears to play the following roles in SGMCMC: i) smaller minibatch sizes introduce larger variance into stochastic gradients; ii) smaller minibatch sizes allow an algorithm to run faster (thus more samples can be obtained in a given amount of computation time). To balance the two effects, in addition to using the standard assumptions for SGMCMC (which basically requires the coefficients of Itô diffusions to be smooth and bounded, and is deferred to Assumption 2 in the Appendix), we assume that the algorithms with different minibatch sizes all run for a fixed computational time/budget in the analysis, as stated in Assumption 1.
Assumption 1
For a fair comparison, all SGMCMC algorithms with different minibatch sizes are assumed to run for a fixed amount of computation time/budget . Further, we assume that linearly depends on the minibatch size and the sample size , i.e., .
For simplicity, we rewrite the gradient of the log-likelihood for data in the th iteration as: . We first derive the following lemma about the property of , which is useful in the subsequent developments, e.g., to guarantee a positive bound in Theorem 3 and an improved bound for the proposed vrSGMCMC (Theorem 6).
Lemma 2
Under Assumption 2, given in the th iteration, , where the expectation is taken over the randomness of an SGMCMC algorithm. (The same meaning applies to other expectations in the paper if not explicitly specified.)
We next generalize Lemma 1 by incorporating the minibatch size into the MSE bound. The basic idea in our derivation is to associate with each data
a binary random variable,
, to indicate whether data $i$ is included in the current minibatch or not. These depend on each other such that in order to guarantee minibatches of size . Consequently, the stochastic gradient in the th iteration can be rewritten as: . Substituting the above gradient formula into the proof of standard SGMCMC [Chen, Ding, and Carin2015] and further summing out results in an alternative MSE bound for SGMCMC, stated in Theorem 3. In the analysis, we assume a first-order numerical integrator for simplicity, e.g., the Euler method, though the results generalize easily to $K$th-order integrators.

Theorem 3
Under Assumption 2, let the minibatch size of an SGMCMC be , . The finitetime MSE is bounded, for a constant independent of , as:
Theorem 3 represents the bound in terms of minibatch size and sample size . Note that in our finite-time setting, and are considered to be constants. Consequently, is also a bounded constant in our analysis. To bring in the computational budget , based on Assumption 1, e.g., , the optimal MSE bound w.r.t. stepsize in Theorem 3 can be written as: . After further optimizing the bound w.r.t. the minibatch size by setting the derivative of the above MSE bound to zero, the optimal minibatch size can be written as . To guarantee the optimal minibatch size to be a finite integer, the computational budget is required to scale at the order of when varying . When considering both and as impact factors, the optimal minibatch size becomes more interesting, and is summarized in Corollary 4. (Note we only have that for some unknown constant , i.e., the specific value of is unknown.)
Corollary 4
Under Assumption 1 and 2, we have three cases of optimal minibatch sizes, each corresponding to different levels of computational budget.

When the computational budget is small, e.g., , the optimal MSE bound is decreasing w.r.t. in range . The minimum MSE bound is achieved at .

When the computational budget is large, e.g., , the optimal MSE bound is increasing w.r.t. in range . The minimum MSE bound is achieved at .

When the computational budget is in between the above two cases, the optimal MSE bound first increases then decreases w.r.t. in range . The optimal MSE bound is obtained either at or at , depending on .
In many machine learning applications, the computational budget is limited, leading the algorithm to the first case of Corollary 4, i.e., . According to Corollary 4, processing full data (i.e., no minibatch) is required to achieve the optimal MSE bound, which is computationally infeasible when is large (which motivated use of minibatches in the first place). A practical way to overcome this is to use small minibatches and adopt variancereduction techniques to reduce the stochastic gradient noise.
A practical variance reduction algorithm
For practical use, we require that a variance-reduction method achieve both computational and storage efficiency. While variance reduction has been studied extensively in stochastic optimization, it has been applied much less often in SGMCMC. In this section we propose a vrSGMCMC algorithm, a simple extension of the algorithm in [Dubey et al.2016] that is more computationally practical in large-scale applications. A convergence theory is also developed in Section Convergence rate.
The proposed vrSGMCMC is illustrated in Algorithm 1. Similar to stochastic optimization [Schmidt, Le Roux, and Bach2016], the idea of variance reduction is to balance the gradient noise with a less-noisy old gradient, i.e., a stochastic gradient is calculated based on a previous sample, using a larger minibatch than that of the current stochastic gradient, resulting in a less noisy estimate. In each iteration of our algorithm, an unbiased stochastic gradient is obtained by combining the above two versions of gradients in an appropriate way (see in Algorithm 1). Such a construction of stochastic gradients essentially inherits a low variance with theoretical guarantees (detailed in Section Convergence rate). In Algorithm 1, the whole parameter is decomposed into the model parameter and the remaining algorithm-specific parameter , e.g., the momentum parameter. The expression “” means assigning the corresponding model parameter from to . The old gradient is denoted as , calculated with a minibatch of size . The current stochastic gradient is calculated on a minibatch of size . We use to denote a function which generates the next sample with an SGMCMC algorithm, based on the current sample , input stochastic gradient , and step size .
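The gradient construction in Algorithm 1 can be sketched as follows. This is a minimal illustration with our own hypothetical names (`grad_i`, `g_old`, etc., are not from the paper), using a quadratic toy loss so per-datum gradients are trivial; the old gradient is computed on a larger minibatch of size B at a stale parameter, and the current minibatch only supplies a low-variance correction:

```python
import numpy as np

def vr_stochastic_gradient(grad_i, theta, theta_old, g_old, idx, N):
    """SVRG-style estimator with a minibatch-based old gradient:
    g = (N/n) * sum_{i in idx} [grad_i(theta) - grad_i(theta_old)] + g_old.
    Unbiased whenever g_old is itself an unbiased estimate of the full
    gradient at theta_old (here computed from a larger minibatch of size B)."""
    n = len(idx)
    correction = sum(grad_i(theta, i) - grad_i(theta_old, i) for i in idx)
    return (N / n) * correction + g_old

# Toy quadratic loss: per-datum gradient of 0.5*(theta - x_i)^2 is theta - x_i.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=500)
N = len(data)
grad_i = lambda theta, i: theta - data[i]

theta_old = 0.5
B = 100                                      # old-gradient minibatch, B > n
old_idx = rng.choice(N, size=B, replace=False)
g_old = (N / B) * sum(grad_i(theta_old, i) for i in old_idx)

theta = 0.6                                  # current sample, close to theta_old
idx = rng.choice(N, size=10, replace=False)  # current minibatch, n = 10
g = vr_stochastic_gradient(grad_i, theta, theta_old, g_old, idx, N)
full_g = sum(grad_i(theta, i) for i in range(N))
```

Because the current sample is close to the stale one, the correction term has low variance, so `g` tracks `full_g` much more closely than a plain minibatch gradient of the same size would; the residual error comes only from the old-gradient estimate and shrinks as B grows.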
One should note that existing variance-reduction algorithms, e.g., [Johnson and Zhang2013], use a similar concept to construct low-variance gradients. However, most algorithms use the whole training data set to compute in Algorithm 1, which is computationally infeasible in large-scale settings. Moreover, we note that, as in stochastic optimization [Reddi et al.2016, AllenZhu and Hazan2016], instead of using a single parameter sample to compute , similar methods can be adopted to compute based on an average of old parameter samples. The theoretical analysis can be readily adapted to such cases, which is omitted here for simplicity. More references are discussed in Section Related Work.
Comparison with existing variancereduction SGMCMC algorithms
The most related variance-reduction SGMCMC algorithm we are aware of is recent work on variance-reduction SGLD (SVRGLD) [Dubey et al.2016]. SVRGLD shares a similar flavor with our scheme from the algorithmic perspective, except that when calculating the old gradient , the whole training data set is used in SVRGLD. As mentioned above, this brings a computational challenge for large-scale learning. Although the problem is mitigated by using a moving-average estimation of the stochastic gradient, this scheme does not match their theory. A more distinctive advantage of vrSGMCMC over SVRGLD [Dubey et al.2016] is in terms of theoretical analysis. Concerning SVRGLD, i) the authors did not show theoretically in which cases variance reduction is useful in SGLD, and ii) it is not clear from their theory whether SVRGLD is able to speed up the convergence rate compared to standard SGLD. Specifically, the MSE of SVRGLD was shown to be bounded by , compared to for SGLD, where are constants. By inspecting the above bounds, it is not clear whether SVRGLD improves on SGLD because the two bounds are not directly comparable. (The first term in the min of the SVRGLD bound is strictly larger than the first term of the SGLD bound (if that term is used in the “min”), making the bounds not easily compared.) More detailed explanations are provided in Appendix E.
Convergence rate
We derive convergence bounds for Algorithm 1 and analyze the improvement of vrSGMCMC over the corresponding standard SGMCMC. Using an approach similar to that in Section The necessity of variance reduction: a theoretical perspective, we first introduce additional binary random variables, , to indicate which data points are included in calculating the old gradient in Algorithm 1. This results in the expression for the stochastic gradient used in the th iteration: . It is easy to verify that the above stochastic gradient is an unbiased estimate of the true gradient in the th iteration (see Appendix C).
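The unbiasedness can be verified directly. In generic notation (ours, not the paper's: $I$ is the current minibatch of size $n$, $J$ the old-gradient minibatch of size $B$, $\tilde\theta$ the old parameter, and $f_i$ the per-datum negative log-likelihood terms), taking expectations over the two independent minibatch draws gives

```latex
\mathbb{E}\!\left[\frac{N}{n}\sum_{i\in I}\big(\nabla f_i(\theta)-\nabla f_i(\tilde\theta)\big)
  + \frac{N}{B}\sum_{j\in J}\nabla f_j(\tilde\theta)\right]
  = \sum_{i=1}^{N}\big(\nabla f_i(\theta)-\nabla f_i(\tilde\theta)\big)
  + \sum_{j=1}^{N}\nabla f_j(\tilde\theta)
  = \sum_{i=1}^{N}\nabla f_i(\theta),
```

since each datum is included in $I$ (resp. $J$) with probability $n/N$ (resp. $B/N$), which cancels the $N/n$ (resp. $N/B$) rescaling.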
In order to see how Algorithm 1 reduces the variance of stochastic gradients, from Lemma 1 it suffices to study , as the minibatch size only impacts this term. For notational simplicity, similar to the defined in Section The necessity of variance reduction: a theoretical perspective, we denote , which is similar to but evaluated on the old parameter . Intuitively, since the old gradient is calculated from to balance the stochastic gradient noise (calculated from ), and are expected to be close to each other. Lemma 5 formalizes this intuition, a key result in proving our main theorem, where we only consider the update interval and stepsize as factors. In the lemma below, following [Chen et al.2016] (Assumption 1), we further assume the gradient function to be Lipschitz.
Lemma 5
Under Assumption 2, and assuming the gradient function to be Lipschitz (Assumption 1 in [Chen et al.2016]), and are close to each other in expectation, i.e., .
In the Appendix, we further simplify in the MSE bound by decomposing it into several terms. Finally, we arrive at our main theorem for the proposed vrSGMCMC framework.
Theorem 6
Under the setting of Lemma 5, let , and . The MSE of vrSGMCMC with a $K$th-order integrator is bounded as:
where , and . Furthermore, we have for , so that .
Note that for a fixed , in the above bound is a higher-order term relative to . As a result, the MSE is bounded by . Because the MSE of standard SGMCMC is bounded by (see Appendix C) and from Theorem 6, we conclude that vrSGMCMC induces a lower MSE bound than the corresponding SGMCMC algorithm, with an improvement of .
It is worth noting that in Algorithm 1, the minibatch for calculating the old gradient is required to be larger than that for calculating the current stochastic gradient, i.e., . Otherwise, in Theorem 6 would become negative, leading to an increased MSE bound compared to standard SGMCMC. This matches the intuition that old gradients need to be more accurate (thus with larger minibatches) than current stochastic gradients in order to reduce the stochastic gradient noise.
Remark 7
In the special case of [Dubey et al.2016], where for SGLD, Theorem 6 gives an MSE bound of , with . According to Lemma 2, is also positive, thus leading to a reduced MSE bound. However, the bound is not necessarily better than that of vrSGMCMC, where a minibatch is used instead of the whole data set to calculate , leading to a significant decrease in computational time.
Remark 8
Following Corollary 4, Theorem 6 can also be formulated in terms of the computational budget . Specifically, according to Algorithm 1, the computational budget would be proportional to . Substituting this into the MSE bound of Theorem 6 gives a reformulated bound of . The optimal MSE w.r.t. and would be complicated since both and depend on and . We omit the details here for simplicity. Nevertheless, our experiments indicate that our algorithm always improves standard SGMCMC algorithms for the same computational time.
Related Work
Variance reduction was first introduced in stochastic optimization, which quickly became a popular research topic and has been actively developed in recent years. [Schmidt, Le Roux, and Bach2013, Schmidt, Le Roux, and Bach2016] introduced perhaps the first variance reduction algorithm, called stochastic average gradient (SAG), where historical gradients are stored and continuously updated in each iteration. Later, stochastic variance reduction gradient (SVRG) was developed to reduce the storage bottleneck of SAG, at the cost of an increased computational time [Johnson and Zhang2013, Zhang, Mahdavi, and Jin2013]. [Defazio, Bach, and LacosteJulien2014] combined ideas of SAG and SVRG and proposed the SAGA algorithm, which improves SAG by using a better and unbiased stochasticgradient estimation.
Variance reduction algorithms were first designed for convex optimization problems, followed by a number of recent works extending the techniques to nonconvex optimization [Reddi et al.2016, Reddi, Sra, and Póczos2016, AllenZhu and Hazan2016, AllenZhu et al.2016], as well as to distributed learning [Reddi et al.2015]. All these algorithms are mostly based on SVRG and are similar in algorithmic form, but differ in the techniques used to prove rigorous theoretical results.
For scalable Bayesian sampling with SGMCMC, however, this topic had been little studied until a recent work on variance reduction for SGLD [Dubey et al.2016]. In this work, the authors adapted the SAG and SVRG ideas to SGLD. Although they provided corresponding convergence results, some fundamental problems, such as how minibatch size affects the convergence rate, were not fully studied. Furthermore, their algorithms suffer from either a high computational or a high storage cost in a big-data setting, because the whole data set needs to be accessed frequently.
To reduce the computational cost of SVRG-based algorithms, the idea of using a minibatch of data to calculate the old gradient (corresponding to the in Algorithm 1) has also been studied in stochastic optimization. Representative works include, but are not limited to, [Harikandeh et al.2015, Frostig et al.2015, Shah et al.2016, Lei and Jordan2016, Lian, Wang, and Liu2017]. The proposed approach adopts similar ideas, with the following main differences: i) our algorithm represents the first work on large-scale Bayesian sampling with a practical (computationally cheap) variance-reduction technique; ii) the techniques used here for the analysis are different and appear to be simpler than those used in stochastic optimization; iii) our theory addresses fundamental questions for variance reduction in SGMCMC, such as those raised in the Introduction.
Experiments
A synthetic experiment
We first test the conclusion of the long-run setting in Corollary 4, which indicates that vrSGMCMC with minibatches of size 1 achieves the optimal MSE bound. To push the algorithm as far into the long-run regime as possible, we test vrSGMCMC on a simple Gaussian model, which runs very fast, so that a small amount of actual wall-clock time corresponds to a large computational budget. The model is defined as: . We generate data samples , and calculate the MSE for minibatch sizes of . The test function is . The results are plotted in Figure 1. We can see from the figure that a minibatch size of 1 achieves the lowest MSE, consistent with the theory (Corollary 4).
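The synthetic study can be mimicked in a few lines. The sketch below is our own reconstruction under assumed specifics (model $x_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$, test function $\phi(\theta) = \theta^2$, and a fixed budget counted in gradient evaluations), since the exact settings are not recoverable here; it compares the MSE of the sample average across minibatch sizes:

```python
import numpy as np

def run_sgld(data, n, budget, h, rng):
    """SGLD with minibatch size n under a fixed budget of gradient evaluations,
    so smaller minibatches buy proportionally more iterations."""
    N = len(data)
    theta, samples = 0.0, []
    for _ in range(budget // n):
        idx = rng.choice(N, size=n, replace=False)
        g = theta - (N / n) * np.sum(data[idx] - theta)  # grad of neg log-posterior
        theta += -h * g + rng.normal(0.0, np.sqrt(2.0 * h))
        samples.append(theta)
    return np.array(samples)

rng = np.random.default_rng(0)
N = 200
data = rng.normal(0.0, 1.0, size=N)
# Conjugate posterior: N(sum(x)/(N+1), 1/(N+1)); phi(theta) = theta^2, so the
# true posterior average is post_mean^2 + post_var.
post_mean = data.sum() / (N + 1)
true_phi = post_mean ** 2 + 1.0 / (N + 1)

mse = {}
for n in [1, 10, N]:
    estimates = [np.mean(run_sgld(data, n, budget=10000, h=5e-4, rng=rng) ** 2)
                 for _ in range(10)]
    mse[n] = float(np.mean([(e - true_phi) ** 2 for e in estimates]))
```

Holding the number of gradient evaluations fixed is the analogue of Assumption 1: minibatch size 1 yields many cheap iterations, while the full batch yields few expensive ones.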
Applications on deep neural networks
We apply the proposed vrSGMCMC framework to Bayesian learning of deep neural networks, including the multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN). The latter two have not been empirically evaluated in previous work. Experiments with Bayesian logistic regression are given in Appendix
F. In the experiments, we are interested in modeling weight uncertainty of neural networks, which is an important topic and has been well studied [HernándezLobato and Adams2015, Blundell et al.2015, Li et al.2016, Louizos and Welling2016]. We achieve this goal by applying priors to the weights (in our case, simple isotropic Gaussian priors) and performing posterior sampling with vrSGMCMC or SGMCMC. We implement vrSGMCMC based on SGLD, and compare it to standard SGLD and SVRGLD [Dubey et al.2016] in our experiments. (The SAGALD algorithm in [Dubey et al.2016] is not compared here because it is too storage-expensive, making the comparison unfair.) For this reason, comparisons to other optimization-based methods such as maximum likelihood are not considered. For simplicity, we set the update interval for the old gradient in Algorithm 1 to . For all the experiments, the minibatch sizes for vrSGMCMC are set to and . To be fair, this corresponds to a minibatch size of in SGLD and SVRGLD. Sensitivity of model performance w.r.t. minibatch size is tested in Section Parameter sensitivity. For a fair comparison, following convention [AllenZhu and Hazan2016, Dubey et al.2016], we plot the number of data passes versus error in the figures. (Since true posterior averages are infeasible, we plot sample averages in terms of accuracy/loss.) Results on the number of data passes versus loss are given in the Appendix. In addition, we use fixed stepsizes in our algorithm for all experiments except for the ResNet model specified below. Following relevant literature [Johnson and Zhang2013, Dubey et al.2016], we tune the stepsizes and plot the best results for all the algorithms to ensure fairness. Note that in our Bayesian setup, it is enough to run an algorithm once, since the uncertainty is encoded in the samples.

Multilayer perceptron
We follow conventional settings [Reddi et al.2016, AllenZhu and Hazan2016] and use a single-layer MLP with 100 hidden units, using the sigmoid activation function as the nonlinear transformation. We test the MLP on the MNIST and CIFAR10 datasets. The stepsizes for both vrSGMCMC and SGLD are set to 0.25 and 0.01 on the two datasets, respectively. Figure 2
plots the number of passes through the data versus test error/loss. Results on the training datasets, including training results for the CNN and RNNbased deep learning models described below, are provided in Appendix
F. It is clear that vrSGMCMC leads to much faster convergence than SGLD, resulting in much lower test errors and loss at the end, especially on the CIFAR10 dataset. SVRGLD, though it leads to potentially lower errors/loss, converges more slowly than vrSGMCMC, due to the high computational cost of calculating the old gradient . As a result, we do not compare vrSGMCMC with SVRGLD in the remaining experiments.

Convolutional neural networks
We use the CIFAR10 dataset, and test two CNN architectures for image classification. The first architecture is a deep convolutional neural network with four convolutional layers, denoted C32C32C64C32, where max-pooling is applied to the output of the first three convolutional layers, and a Dropout layer is applied to the output of the last convolutional layer. The second architecture is a 20-layer deep residual network (ResNet) with the same setup as in
[He et al.2016]. Specifically, we use a stepsize-decrease scheme of for both vrSGMCMC and SGLD, where is the number of iterations so far. Figure 3 plots the number of passes through the data versus test error/loss for both models. Similar to the results on the MLP, vrSGMCMC converges much faster than SGLD, leading to lower test errors and loss. Interestingly, the gap appears larger in the more complicated ResNet architecture; furthermore, the learning curves look much less noisy (smoother) for vrSGMCMC because of the reduced variance in the stochastic gradients.
Recurrent neural networks
The recurrent neural network with LSTM units [Hochreiter and Schmidhuber1997] is a powerful architecture for modeling sequence-to-sequence data. We consider the task of language modeling on two datasets, i.e., the Penn Treebank (PTB) dataset and the WikiText-2 dataset [Merity et al.2016]. PTB is the smaller of the two datasets, containing a vocabulary of size 10,000. We use the default setup of 887,521 tokens for training, 70,390 for validation and 78,669 for testing. WikiText-2 is a larger dataset with 2,088,628 tokens from 600 Wiki articles for training, 217,649 tokens from 60 Wiki articles for validation, and 245,569 tokens from an additional 60 Wiki articles for testing. The total vocabulary size is 33,278.
We adopt the hierarchical LSTM architecture [Zaremba, Sutskever, and Vinyals2014]. The hierarchy depth is set to 2, with each LSTM containing 200 hidden units. The step size is set to 0.5 for both datasets. For more stable training, standard gradient clipping is adopted, where gradients are clipped if their norm exceeds 5. Figure 4 plots the number of passes through the data versus test perplexity on both datasets. The results are consistent with the previous experiments on MLPs and CNNs: vrSGMCMC achieves faster convergence than SGLD, and its learning curves in terms of test error/loss are also much smoother.

Parameter sensitivity
Note that one of the main differences between vrSGMCMC and the recently proposed SVRGLD [Dubey et al.2016] is that the former uses minibatches of size to calculate the old gradient in Algorithm 1, leading to a much more computationally efficient algorithm, with theoretical guarantees. This section tests the sensitivity of model performance to the parameter .
For simplicity, we run the same MLP model described in Section Multilayer perceptron on the CIFAR10 dataset, with the same parameter settings but varying in . Figure 5 plots the number of passes through the data versus test errors/loss, where we use “vrSGMCMCGn” to denote vrSGMCMC with . Interestingly, vrSGMCMC outperforms the baseline SGLD for all values. Notably, when is large enough ( in our case), the corresponding test errors and loss are very close. This agrees with the intuition that computing the old gradient using the whole training data set is not necessarily a good choice for balancing stochastic gradient noise and computational time.
Conclusion
We investigate the impact of minibatches in SGMCMC and propose a practical variancereduction SGMCMC algorithm to reduce the stochastic gradient noise in SGMCMC. Compared to existing variance reduction techniques for SGMCMC, the proposed method is efficient from both computational and storage perspectives. Theory is developed to guarantee faster convergence rates of vrSGMCMC compared to standard SGMCMC algorithms. Extensive experiments on Bayesian learning of deep neural networks verify the theory, obtaining significant speedup compared to the corresponding SGMCMC algorithms.
References
 [AllenZhu and Hazan2016] AllenZhu, Z., and Hazan, E. 2016. Variance reduction for faster nonconvex optimization. In ICML.
 [AllenZhu et al.2016] AllenZhu, Z.; Richtárik, P.; Qu, Z.; and Yuan, Y. 2016. Even faster accelerated coordinate descent using nonuniform sampling. In ICML.
 [Blundell et al.2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.
 [Chen et al.2016] Chen, C.; Ding, N.; Li, C.; Zhang, Y.; and Carin, L. 2016. Stochastic gradient MCMC with stale gradients. In NIPS.
 [Chen, Ding, and Carin2015] Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with highorder integrators. In NIPS.
 [Chen, Fox, and Guestrin2014] Chen, T.; Fox, E. B.; and Guestrin, C. 2014. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
 [Şimşekli et al.2016] Şimşekli, U.; Badeau, R.; Cemgil, A. T.; and Richard, G. 2016. Stochastic QuasiNewton Langevin Monte Carlo. In ICML.
 [Defazio, Bach, and LacosteJulien2014] Defazio, A.; Bach, F.; and LacosteJulien, S. 2014. SAGA: A fast incremental gradient method with support for nonstrongly convex composite objectives. In NIPS.
 [Ding et al.2014] Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.
 [Dubey et al.2016] Dubey, A.; Reddi, S. J.; Póczos, B.; Smola, A. J.; and Xing, E. P. 2016. Variance reduction in stochastic gradient Langevin dynamics. In NIPS.
 [Durmus et al.2016] Durmus, A.; Roberts, G. O.; Vilmart, G.; and Zygalakis, K. C. 2016. Fast Langevin based algorithm for MCMC in high dimensions. Technical Report arXiv:1507.02166.
 [Frostig et al.2015] Frostig, R.; Ge, R.; Kakade, S. M.; and Sidford, A. 2015. Competing with the empirical risk minimizer in a single pass. In COLT.
 [Gan et al.2015] Gan, Z.; Chen, C.; Henao, R.; Carlson, D.; and Carin, L. 2015. Scalable deep Poisson factor analysis for topic modeling. In ICML.
 [Ghosh2011] Ghosh, A. P. 2011. Backward and Forward Equations for Diffusion Processes. Wiley Encyclopedia of Operations Research and Management Science.
 [Harikandeh et al.2015] Harikandeh, R.; Ahmed, M. O.; Virani, A.; Schmidt, M.; Konečný, J.; and Sallinen, S. 2015. Stop wasting my gradients: Practical SVRG. In NIPS.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

 [Hernández-Lobato and Adams2015] Hernández-Lobato, J. M., and Adams, R. P. 2015. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
 [Johnson and Zhang2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS.
 [Lei and Jordan2016] Lei, L., and Jordan, M. I. 2016. Less than a single pass: Stochastically controlled stochastic gradient method. In NIPS.
 [Li et al.2016] Li, C.; Chen, C.; Carlson, D.; and Carin, L. 2016. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
 [Lian, Wang, and Liu2017] Lian, X.; Wang, M.; and Liu, J. 2017. Finite-sum composition optimization via variance reduced gradient descent. In AISTATS.
 [Liu, Zhu, and Song2016] Liu, C.; Zhu, J.; and Song, Y. 2016. Stochastic gradient geodesic MCMC methods. In NIPS.
 [Louizos and Welling2016] Louizos, C., and Welling, M. 2016. Structured and efficient variational deep learning with matrix Gaussian posteriors. In ICML.
 [Ma, Chen, and Fox2015] Ma, Y. A.; Chen, T.; and Fox, E. B. 2015. A complete recipe for stochastic gradient MCMC. In NIPS.
 [Mattingly, Stuart, and Tretyakov2010] Mattingly, J. C.; Stuart, A. M.; and Tretyakov, M. V. 2010. Construction of numerical time-average and stationary measures via Poisson equations. SIAM J. Numer. Anal. 48(2):552–577.
 [Merity et al.2016] Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
 [Reddi et al.2015] Reddi, S. J.; Hefny, A.; Sra, S.; Póczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS.
 [Reddi et al.2016] Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. In ICML.
 [Reddi, Sra, and Póczos2016] Reddi, S. J.; Sra, S.; Póczos, B.; and Smola, A. J. 2016. Fast stochastic methods for non-smooth non-convex optimization. In NIPS.
 [Schmidt, Le Roux, and Bach2013] Schmidt, M.; Le Roux, N.; and Bach, F. 2013. Minimizing finite sums with the stochastic average gradient. Technical Report arXiv:1309.2388.
 [Schmidt, Le Roux, and Bach2016] Schmidt, M.; Le Roux, N.; and Bach, F. 2016. Minimizing finite sums with the stochastic average gradient. Mathematical Programming.
 [Shah et al.2016] Shah, V.; Asteris, M.; Kyrillidis, A.; and Sanghavi, S. 2016. Trading-off variance and complexity in stochastic gradient descent. Technical Report arXiv:1603.06861.
 [Springenberg et al.2016] Springenberg, J. T.; Klein, A.; Falkner, S.; and Hutter, F. 2016. Bayesian optimization with robust Bayesian neural networks. In NIPS.
 [Teh, Thiery, and Vollmer2016] Teh, Y. W.; Thiery, A. H.; and Vollmer, S. J. 2016. Consistency and fluctuations for stochastic gradient Langevin dynamics. JMLR (17):1–33.
 [Vollmer, Zygalakis, and Teh2016] Vollmer, S. J.; Zygalakis, K. C.; and Teh, Y. W. 2016. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. JMLR.
 [Wang, Fienberg, and Smola2015] Wang, Y. X.; Fienberg, S. E.; and Smola, A. 2015. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In ICML.
 [Welling and Teh2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
 [Zaremba, Sutskever, and Vinyals2014] Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
 [Zhang, Mahdavi, and Jin2013] Zhang, L.; Mahdavi, M.; and Jin, R. 2013. Linear convergence with condition number independent access of full gradients. In NIPS.
Appendix A Basic Setup for Stochastic Gradient MCMC
Given data $\mathcal{D} = \{d_1, \ldots, d_N\}$, a generative model $p(\mathcal{D} \,|\, \theta) = \prod_{i=1}^N p(d_i \,|\, \theta)$ with model parameter $\theta$, and prior $p(\theta)$, we want to compute the posterior distribution:
$$p(\theta \,|\, \mathcal{D}) \propto p(\theta) \prod_{i=1}^N p(d_i \,|\, \theta) \triangleq e^{-U(\theta)},$$
where
$$U(\theta) = -\sum_{i=1}^N \log p(d_i \,|\, \theta) - \log p(\theta). \quad (2)$$
Consider the SDE:
$$\mathrm{d}X_t = F(X_t)\,\mathrm{d}t + g(X_t)\,\mathrm{d}W_t, \quad (3)$$
where $X_t \in \mathbb{R}^m$ is the state variable, typically an augmentation of the model parameter, thus $\theta \subseteq X$; $t$ is the time index; $W_t$ is $m$-dimensional Brownian motion; the functions $F: \mathbb{R}^m \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^{m \times m}$ are assumed to satisfy the usual Lipschitz continuity condition [Ghosh2011]. In Langevin dynamics, we have $X_t = \theta$, $F(X_t) = -\nabla_\theta U(\theta)$ and $g(X_t) = \sqrt{2}\, I_m$.
For the SDE in (3), the generator $\mathcal{L}$ is defined as:
$$\mathcal{L}\psi(X_t) = F(X_t) \cdot \nabla \psi(X_t) + \frac{1}{2}\left(g(X_t) g(X_t)^\top\right) : \nabla\nabla \psi(X_t), \quad (4)$$
where $\psi$ is a measurable function, $\nabla$ denotes the derivative operator, $\top$ means transpose, $a \cdot b \triangleq a^\top b$ for two vectors $a$ and $b$, and $A : B \triangleq \operatorname{tr}(A^\top B)$ for two matrices $A$ and $B$. Under certain assumptions, there exists a function $\psi$ on $\mathbb{R}^m$ such that the following Poisson equation is satisfied [Mattingly, Stuart, and Tretyakov2010]:
$$\mathcal{L}\psi = \phi - \bar{\phi}, \quad (5)$$
where $\bar{\phi} \triangleq \int \phi(x)\,\rho(\mathrm{d}x)$ denotes the model average, with $\rho$ being the equilibrium distribution of the SDE (3).
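As a sanity check, using only the definitions above, specializing the generator (4) to Langevin dynamics ($F = -\nabla_\theta U$, $g = \sqrt{2}\, I_m$) gives the familiar form
$$\mathcal{L}\psi(\theta) = -\nabla_\theta U(\theta) \cdot \nabla_\theta \psi(\theta) + \Delta_\theta \psi(\theta),$$
since $g g^\top = 2 I_m$ and $\frac{1}{2}(2 I_m) : \nabla\nabla\psi = \operatorname{tr}(\nabla\nabla\psi) = \Delta\psi$. Integration by parts then verifies that $\rho(\mathrm{d}\theta) \propto e^{-U(\theta)}\,\mathrm{d}\theta$ is the equilibrium distribution: because $\nabla\big(e^{-U}\big) = -\nabla U\, e^{-U}$, we have $\int \Delta\psi\, e^{-U}\,\mathrm{d}\theta = \int \nabla U \cdot \nabla\psi\, e^{-U}\,\mathrm{d}\theta$, so $\int \mathcal{L}\psi(\theta)\, e^{-U(\theta)}\,\mathrm{d}\theta = 0$, consistent with $\bar{\phi}$ in (5) being the posterior average.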
In stochastic gradient Langevin dynamics (SGLD), we update the parameter at step $l$, denoted as $\theta_l$,^{**} using the following discretized method:
$$\theta_{l+1} = \theta_l - \nabla_\theta \tilde{U}_l(\theta_l)\, h + \sqrt{2h}\, \zeta_{l+1},$$
where $h$ is the step size, $\zeta_{l+1}$ is a Gaussian random variable with mean 0 and variance 1, and $\tilde{U}_l$ is an unbiased estimate of $U$ in (2) based on a random minibatch of size $n$, e.g.,
$$\tilde{U}_l(\theta) = -\frac{N}{n} \sum_{i=1}^{n} \log p(d_{\pi_i} \,|\, \theta) - \log p(\theta), \quad (6)$$
where $(\pi_1, \ldots, \pi_n)$ is a size-$n$ subset of a random permutation of $\{1, \ldots, N\}$.
^{**}Strictly speaking, $\theta$ should be indexed by "time" instead of "step", i.e., $\theta_{lh}$ instead of $\theta_l$. We adopt the latter for notational simplicity in the following; this applies to the general case of $X$ as well.
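To make the update above concrete, here is a minimal sketch of SGLD with the minibatch estimate (6), applied to a toy conjugate Gaussian model; the model, step size $h$, and minibatch size $n$ are illustrative choices, not values from the paper.

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, data, theta0, h, n, num_steps, seed=0):
    """SGLD: theta_{l+1} = theta_l - grad(U_tilde(theta_l)) * h + sqrt(2h) * zeta."""
    rng = np.random.default_rng(seed)
    N = len(data)
    theta = float(theta0)
    samples = np.empty(num_steps)
    for l in range(num_steps):
        idx = rng.choice(N, size=n, replace=False)  # random minibatch
        # Unbiased estimate of grad U, cf. (6): rescale the likelihood term by N/n.
        grad_U = -(N / n) * grad_log_lik(theta, data[idx]) - grad_log_prior(theta)
        theta = theta - h * grad_U + np.sqrt(2 * h) * rng.standard_normal()
        samples[l] = theta
    return samples

# Toy model (illustrative): d_i ~ N(theta, 1) with prior theta ~ N(0, 10).
data = np.random.default_rng(1).normal(2.0, 1.0, size=1000)
grad_log_prior = lambda th: -th / 10.0               # d/dtheta log p(theta)
grad_log_lik = lambda th, batch: np.sum(batch - th)  # d/dtheta sum_i log p(d_i | theta)
samples = sgld(grad_log_prior, grad_log_lik, data, theta0=0.0,
               h=1e-4, n=32, num_steps=2000)
```

The $N/n$ rescaling of the likelihood term is what keeps $\tilde{U}_l$ unbiased; with these settings the chain should concentrate near the posterior mean, roughly 2.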
In our analysis, we are interested in the mean square error (MSE) at iteration $L$, defined as
$$\text{MSE} \triangleq \mathbb{E}\big(\hat{\phi} - \bar{\phi}\big)^2,$$
where $\hat{\phi} \triangleq \frac{1}{L}\sum_{l=1}^{L} \phi(\theta_l)$ denotes the sample average, and $\bar{\phi}$ is the true posterior average defined in (5).
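To make the MSE definition concrete, the sketch below estimates $\mathbb{E}(\hat{\phi} - \bar{\phi})^2$ by averaging over independent runs of a (full-gradient) Langevin chain targeting a one-dimensional Gaussian, for which the posterior average $\bar{\phi}$ of the test function $\phi(\theta) = \theta$ is known exactly; all numerical choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.5, 0.25       # target N(mu, sigma2), so U = (theta - mu)^2 / (2 * sigma2)
phi_bar = mu                 # true posterior average of phi(theta) = theta
h, L, runs = 1e-2, 5000, 20  # step size, chain length, independent repetitions

def phi_hat():
    """One Langevin run; returns the sample average (1/L) * sum_l phi(theta_l)."""
    theta, total = 0.0, 0.0
    for _ in range(L):
        grad_U = (theta - mu) / sigma2
        theta += -h * grad_U + np.sqrt(2 * h) * rng.standard_normal()
        total += theta
    return total / L

mse = np.mean([(phi_hat() - phi_bar) ** 2 for _ in range(runs)])
```

Here the MSE is small but nonzero: it combines the burn-in bias of $\hat{\phi}$ with the variance of a time average over correlated samples, which is the quantity the convergence theory bounds.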
In this paper, for a function $f$ in an $L^p$ space, i.e., a space of functions for which the $p$-th power of the absolute value is Lebesgue integrable, we consider the standard norm, defined as ($\|f\|_2$ is simplified as $\|f\|$):
$$\|f\|_p \triangleq \Big(\int |f(x)|^p\, \mathrm{d}x\Big)^{1/p}.$$
In order to guarantee well-behaved SDEs and corresponding numerical integrators, following existing literature such as [Vollmer, Zygalakis, and Teh2016, Chen, Ding, and Carin2015], we impose the following assumptions.
Assumption 2
The SDE (3) is ergodic. Furthermore, the solution of (5) exists, and the solution functional $\psi$ of the Poisson equation (5) satisfies the following properties:

$\psi$ and its derivatives up to third order, $\mathcal{D}^k\psi$, are bounded by a function $\mathcal{V}$, i.e., $\|\mathcal{D}^k\psi\| \leq C_k \mathcal{V}^{p_k}$ for $k = 0, 1, 2, 3$ and constants $C_k, p_k > 0$.

The expectation of $\mathcal{V}$ on $\{\theta_l\}$ is bounded: $\sup_l \mathbb{E}\, \mathcal{V}^p(\theta_l) < \infty$.

$\mathcal{V}$ is smooth such that $\sup_{s \in (0,1)} \mathcal{V}^p\big(s\theta + (1-s)\theta'\big) \leq C\big(\mathcal{V}^p(\theta) + \mathcal{V}^p(\theta')\big)$ for all $\theta, \theta'$, and some $C > 0$.
Appendix B Proofs of Extended Results for Standard SGMCMC
First, according to the definition of $\psi$, we note that $\mathcal{L}\psi = \phi - \bar{\phi}$ for the solution functional $\psi$ of the Poisson equation (5). Since the relevant quantities are assumed to be bounded for a test function, we omit the operator in the following analysis (it only contributes a constant), a slight abuse of notation adopted for conciseness.
The proofs of Lemma 2 and Theorem 3 are closely related. We first prove Theorem 3; the proof of Lemma 2 then follows directly.
Proof [Proof of Theorem 3]
Let , and
then we have
Since
we have , i.e., is an unbiased estimate of .
In addition, we have
When ,
When , because
We have
As a result,
(7) 
Because we assume using a 1s