On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo

February 15, 2018 · Niladri S. Chatterji et al. (UC Berkeley)

We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets.


1 Introduction

One of the major themes in machine learning is the use of stochasticity to obtain procedures that are computationally efficient and statistically calibrated. There are two very different ways in which this theme has played out—one frequentist and one Bayesian. On the frequentist side, gradient-based optimization procedures are widely used to obtain point estimates and point predictions, and stochasticity is used to bring down the computational cost by replacing expensive full-gradient computations with unbiased stochastic-gradient computations. On the Bayesian side, posterior distributions provide information about uncertainty in estimates and predictions, and stochasticity is used to represent those distributions in the form of Monte Carlo (MC) samples. Despite the different conceptual frameworks, there are overlapping methodological issues. In particular, Monte Carlo sampling must move from an out-of-equilibrium configuration towards the posterior distribution and must do so quickly, and thus optimization ideas are relevant. Frequentist inference often involves sampling and resampling, so that efficient approaches to Monte Carlo sampling are relevant.

Variance control has been a particularly interesting point of contact between the two frameworks. In particular, there is a subtlety in the use of stochastic gradients for optimization: although the per-iteration cost is significantly lower when stochastic gradients are used, extra variance is introduced into the procedure at every step, so that the total number of iterations must be larger. A natural question is whether there is a theoretically sound way to manage this tradeoff. This question has been answered affirmatively in a seminal line of research [Schmidt et al., 2017, Shalev-Shwartz and Zhang, 2013, Johnson and Zhang, 2013] on variance-controlled stochastic optimization. Theoretically these methods enjoy the best of the gradient and stochastic-gradient worlds—they converge at the fast rate of full-gradient methods while making use of cheaply computed stochastic gradients.

A parallel line of research has ensued on the Bayesian side in a Monte Carlo sampling framework. In particular, stochastic-gradient Markov chain Monte Carlo (SG-MCMC) algorithms have been proposed in which approximations to Langevin diffusions make use of stochastic gradients instead of full gradients [Welling and Teh, 2011]. There have been a number of theoretical results that establish mixing-time bounds for such Langevin-based sampling methods when the posterior distribution is well behaved [Dalalyan, 2017a, Durmus and Moulines, 2017, Cheng and Bartlett, 2017, Dalalyan and Karagulyan, 2017]. Such results have set the stage for the investigation of variance control within the SG-MCMC framework [Dubey et al., 2016, Durmus et al., 2016, Bierkens et al., 2016, Baker et al., 2017, Nagapetyan et al., 2017, Chen et al., 2017]. Currently, however, the results of these investigations are inconclusive. Dubey et al. [2016] obtain mixing-time guarantees for SAGA Langevin diffusion and SVRG Langevin diffusion (two particular variance-reduced sampling methods) under the strong assumption that the norm of the gradient of the log-posterior is uniformly bounded by a constant. Another approach that has been explored involves calculating the mode of the log-posterior to construct a control variate for the gradient estimate [Baker et al., 2017, Nagapetyan et al., 2017], an approach that makes rather different assumptions. Indeed, the experimental results from these two lines of work are contradictory, reflecting the differences in assumptions.

In this work we aim to provide a unified perspective on variance control for SG-MCMC. Critically, we identify two regimes: we show that when the target accuracy is small, variance-reduction methods are effective, but when the target accuracy is not small (a low-fidelity estimate of the posterior suffices), stochastic gradient Langevin diffusion (SGLD) performs better. These results are obtained via new theoretical techniques for studying stochastic-gradient MC algorithms with variance reduction. We improve upon the techniques used to analyze Langevin diffusion (LD) and SGLD [Dalalyan, 2017a, Dalalyan and Karagulyan, 2017, Durmus and Moulines, 2017] to establish non-asymptotic rates of convergence (in Wasserstein distance) for variance-reduced methods. We also apply control-variate techniques to underdamped Langevin MCMC [Cheng et al., 2017], a second-order diffusion process (CV-ULD). Inspired by proof techniques for variance-reduction methods in stochastic optimization, we design a Lyapunov function to track the progress of convergence and thereby obtain better bounds on the convergence rate. We make the relatively weak assumption that the log posteriors are Lipschitz smooth, strongly convex and Hessian Lipschitz—a relaxation of the strong assumption that the gradients of the log posteriors are globally bounded.

As an example of our results, we are able to show that when using a variance-reduction method steps are required to obtain an accuracy of , versus the iterations required for SGLD, where is the dimension of the data and is the total number of samples. As we will argue, results of this kind support our contention that when the target accuracy is small, variance-reduction methods outperform SGLD.

Main Contributions

We provide sharp convergence guarantees for a variety of variance-reduction methods (SAGA-LD, SVRG-LD, and CV-ULD) under the same set of realistic assumptions (see Sec. 4). This is achieved by a new proof technique that yields bounds on the Wasserstein distance. Our bounds allow us to identify windows of interest where each method performs better than the others (see Fig. 1). The theory is verified with experiments on real-world datasets. We also test the effects of breaking the central limit theorem using synthetic data, and find that in this regime variance-reduced methods fare far better than SGLD (see Sec. 5).

2 Preliminaries

Throughout the paper we aim to make inference on a vector of parameters . The resulting posterior density is then . For brevity we write , for , and . Moving forward we state all results in terms of general sum-decomposable functions (see Assumption 1); however, it is useful to keep the above example in mind as the main motivating example. We let denote the Euclidean norm for a vector . For a matrix we let denote its spectral norm and denote its Frobenius norm.
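For concreteness, one standard way to write this setup (the symbols below are our own choice, since the paper's notation did not survive extraction) is to collect the data $z_1,\dots,z_N$ and define the potential as the negative log-posterior,

  f(x) = \sum_{i=1}^{N} f_i(x), \qquad f_i(x) = -\log p(z_i \mid x) - \frac{1}{N}\log p(x), \qquad p(x \mid z_{1:N}) \propto \exp(-f(x)),

so that each $f_i$ touches a single data point and the prior is split evenly across the $N$ terms.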

Assumptions on : We make the following assumptions about the potential function .

  1. Sum-decomposable: The function is decomposable, .

  2. Smoothness: The functions are twice continuously differentiable on and have Lipschitz-continuous gradients; that is, there exist positive constants such that for all and for all we have . We accordingly characterize the smoothness of with the parameter .

  3. Strong Convexity: is -strongly convex; that is, there exists a constant such that for all , . We also define the condition number .

  4. Hessian Lipschitz: We assume that the function is Hessian Lipschitz; that is, there exists a constant such that for every , .

It is worth noting that and can all scale with .
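In their standard form (written here with generic symbols $f_i$, $L_i$, $m$, and $L_H$, which are our own choices rather than the paper's notation), these conditions read, for all $x, y \in \mathbb{R}^d$,

  \|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|, \qquad
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{m}{2}\|y - x\|^2, \qquad
  \|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_H \|x - y\|,

with an aggregate smoothness constant $L$ built from the $L_i$ and condition number $\kappa = L/m$.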

Wasserstein Distance:

We define the Wasserstein distance between a pair of probability measures ( ) as follows:

where denotes the set of joint distributions such that the first set of coordinates has marginal and the second set has marginal . (See Appendix A for a more formal definition.)
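In generic notation, the order-2 Wasserstein distance between measures $\mu$ and $\nu$ on $\mathbb{R}^d$ is

  W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, d\gamma(x, y),

where $\Gamma(\mu, \nu)$ is the set of couplings, i.e., joint distributions whose first marginal is $\mu$ and whose second marginal is $\nu$.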

Langevin Diffusion: The classical overdamped Langevin diffusion is based on the following Itô Stochastic Differential Equation (SDE):

(1)

where and represents standard Brownian motion [see, e.g., Mörters and Peres, 2010]. It can be shown that under mild conditions like (absolutely integrable) the invariant distribution of Eq. (1) is given by . This fact motivates the Langevin MCMC algorithm where given access to full gradients it is possible to efficiently simulate the discretization,

(2)

where the gradient is evaluated at a fixed point (the previous iterate in the chain) and the SDE (2) is integrated up to time (the step size) to obtain with . Welling and Teh [2011] proposed an alternative algorithm—Stochastic Gradient Langevin Diffusion (SGLD)—for sampling from sum-decomposable functions, where the chain is updated by integrating the SDE:

(3)

and where is an unbiased estimate of the gradient at . The attractive property of this algorithm is that it is computationally tractable for large datasets (when is large). At a high level, the variance-reduction schemes that we study in this paper replace the simple gradient estimate in Eq. (3) (and other variants of Langevin MCMC) with more sophisticated unbiased estimates that have lower variance.
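To make the updates concrete, the sketch below gives one possible Euler–Maruyama implementation of a full-gradient Langevin step and its SGLD variant; the sqrt(2h) noise scaling, the N/b mini-batch rescaling and the function names are standard choices that we assume here, not the paper's exact discretizations (2)-(3).

  import numpy as np

  def langevin_step(x, grad_f, h, rng):
      # Full-gradient (overdamped) Langevin step: x <- x - h * grad f(x) + sqrt(2h) * N(0, I).
      return x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

  def sgld_step(x, grad_fi, N, b, h, rng):
      # SGLD step: replace grad f(x) = sum_i grad f_i(x) by an unbiased mini-batch estimate.
      idx = rng.integers(0, N, size=b)                   # indices drawn with replacement
      g_hat = (N / b) * sum(grad_fi(i, x) for i in idx)  # unbiased estimate of grad f(x)
      return x - h * g_hat + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)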

3 Variance Reduction Techniques

In the seminal work of Schmidt et al. [2017] and Johnson and Zhang [2013], it was observed that the variance of Stochastic Gradient Descent (SGD), when applied to optimizing sum-decomposable strongly convex functions, decreases to zero only if the step size also decays at a suitable rate. This prevents the algorithm from converging at a linear rate, as opposed to methods like batch gradient descent that use the entire gradient at each step. They introduced and analyzed different gradient estimates with lower variance. Subsequently these methods were also adapted to Monte Carlo sampling by Dubey et al. [2016], Nagapetyan et al. [2017], and Baker et al. [2017]. These methods use information from previous iterates and are no longer Markovian. In this section we describe several variants of these methods.

3.1 SAGA Langevin MC

We present a sampling algorithm based on SAGA [Defazio et al., 2014], which was developed as a modification of SAG [Schmidt et al., 2017]. In SAGA, which is presented as Algorithm 1, an approximation of the gradient of each function is stored as and is iteratively updated in order to build an estimate with reduced variance. At each step of the algorithm, if the function is selected in the mini-batch , then the value of the gradient approximation is updated by setting . Otherwise the gradient of is approximated by the previous value . Overall we obtain the following unbiased estimate of the gradient:

(4)

In Algorithm 1 we form this gradient estimate and plug it into the classic Langevin MCMC method driven by the SDE (3). Computationally this algorithm is efficient; essentially it enjoys the oracle query complexity (number of calls to the gradient oracle per iteration) of methods like SGLD but due to the reduced variance of the gradient estimator it converges almost as quickly (in terms of number of iterations) to the posterior distribution as methods such as Langevin MCMC that use the complete gradient at every step. We prove a novel non-asymptotic convergence result in Wasserstein distance for Algorithm 1 in the next section that formalizes this intuition.

The principal downside of this method is its memory requirement. It is necessary to store the gradient estimator for each individual , which essentially means that in the worst case the memory complexity scales as . However in many interesting applications, including some of those considered in the experiments in Sec. 5, the memory costs scale only as since each function depends on a linear function in and therefore the gradient is just a re-weighting of the single data point .

  Input: Gradient oracles , step size , batch size , initial point .
  Initialize .
  for  do
     Draw uniformly with replacement
     Sample
     Update using (4)
     Update .
     Update : for set , for , set
  end for
  Output: Iterates .
Algorithm 1 SAGA Langevin MCMC
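The following is a minimal NumPy sketch of Algorithm 1, assuming the gradient table fits in memory; the N/b scaling of the correction term and the oracle name grad_fi(i, x) are our assumptions for the estimate in (4), not the paper's exact notation.

  import numpy as np

  def saga_ld(grad_fi, N, d, h, b, x0, T, rng):
      # Table g[i] stores an approximation of grad f_i at the point where it was last evaluated.
      x = x0.copy()
      g = np.array([grad_fi(i, x0) for i in range(N)])
      g_sum = g.sum(axis=0)                                # running sum of the table
      samples = []
      for _ in range(T):
          S = rng.integers(0, N, size=b)                   # mini-batch, with replacement
          fresh = {i: grad_fi(i, x) for i in set(S.tolist())}
          # Unbiased estimate in the spirit of (4): table sum plus corrected mini-batch term.
          g_hat = g_sum + (N / b) * sum(fresh[i] - g[i] for i in S)
          x = x - h * g_hat + np.sqrt(2.0 * h) * rng.standard_normal(d)
          for i, gi in fresh.items():                      # refresh table entries in the batch
              g_sum += gi - g[i]
              g[i] = gi
          samples.append(x.copy())
      return samples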

3.2 SVRG Langevin MC

The next algorithm we explore is based on the SVRG method of Johnson and Zhang [2013], which has its roots in the work of Greensmith et al. [2004]. The main idea behind SVRG is to build an auxiliary sequence at which the full gradient is calculated and used as a reference in building a gradient estimate: . Again this estimate is unbiased under the uniform choice of . When this gradient estimate is used to optimize sum-decomposable functions, the variance is small when and are close to the optimum, as is small and is of the order . We expect similar behavior in the case of Monte Carlo sampling, and we thus use this gradient estimate in Algorithm 2. Observe that, crucially—unlike SAGA-based algorithms—this method does not require an estimate of all of the individual , so the memory cost of this algorithm scales in the worst case as . In Algorithm 2 we use the unbiased gradient estimate

(5)

which uses a mini-batch of size .

  Input: Gradient oracles , step size , epoch length , batch size , initial point .
  Initialize ,
  for  do
     if  mod  then
        Option I: Sample and Update
        Update
        Option II: Update
        
     end if
     Draw uniformly with replacement
     Sample
     Update using (5)
     Update .
  end for
  Output: Iterates .
Algorithm 2 SVRG Langevin MCMC
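Below is a minimal sketch of Algorithm 2 in the same style; we refresh the reference point with the current iterate at the start of each epoch, which is one natural reading of the update options above (we do not claim it reproduces Option I or II exactly), and the N/b scaling is again our assumption for the estimate in (5).

  import numpy as np

  def svrg_ld(grad_fi, N, d, h, epoch_len, b, x0, T, rng):
      x = x0.copy()
      samples = []
      for k in range(T):
          if k % epoch_len == 0:
              x_ref = x.copy()                                  # reference (anchor) point
              g_ref = sum(grad_fi(i, x_ref) for i in range(N))  # full gradient at the anchor
          S = rng.integers(0, N, size=b)
          # Unbiased estimate in the spirit of (5): anchor gradient plus corrected mini-batch term.
          g_hat = g_ref + (N / b) * sum(grad_fi(i, x) - grad_fi(i, x_ref) for i in S)
          x = x - h * g_hat + np.sqrt(2.0 * h) * rng.standard_normal(d)
          samples.append(x.copy())
      return samples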

The downside of this algorithm compared to SAGA, however, is that every few steps (an epoch) the full gradient needs to be calculated at . This results in the query complexity of each epoch being . SVRG also has an extra parameter that needs to be set: its hyperparameters are the epoch length, the step size and the batch size, as opposed to just the step size and batch size for Algorithm 1, which makes it harder to tune. It also turns out that, in practice, SVRG seems to be consistently outperformed by SAGA and control-variate techniques for sampling, as observed both in previous work and in our experiments.

3.3 Control Variates with Underdamped Langevin MC

Another approach is to use control variates [Ripley, 2009] to reduce the variance of stochastic gradients. This technique has also been previously explored, both theoretically and experimentally, by Baker et al. [2017] and Nagapetyan et al. [2017]. Similarly to SAGA and SVRG, the idea is to build an unbiased estimate of the gradient at a point :

where the set is the mini-batch and is a fixed point called the centering value. Observe that taking an expectation over the choice of the set yields . A good centering value would ensure that this estimate also has low variance; a natural choice in this regard is the global minimum of , . A motivating example is the case of a Gaussian random variable, where the mean of the distribution and coincide.
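A sketch of this control-variate estimator, with the centering value taken to be an (approximate) minimizer of the potential and with the full gradient at the centering value precomputed once, is given below; the N/b scaling is our assumption.

  import numpy as np

  def cv_gradient(grad_fi, N, b, x, x_hat, g_full_at_x_hat, rng):
      # Control-variate estimate: full gradient at the centering value x_hat plus an unbiased
      # mini-batch estimate of the difference grad f(x) - grad f(x_hat).
      S = rng.integers(0, N, size=b)
      return g_full_at_x_hat + (N / b) * sum(grad_fi(i, x) - grad_fi(i, x_hat) for i in S)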

A conclusion of previous work that applies control variate techniques to stochastic gradient Langevin MCMC is the following—the variance of the gradient estimates can be lowered to be of the order of the discretization error. Motivated by this, we apply these techniques to underdamped Langevin MCMC where the underlying continuous time diffusion process is given by the following second-order SDE:

(6)

where , represents the standard Brownian motion and and are constants. At a high level, the advantage of using a second-order MCMC method like underdamped Langevin MCMC [Cheng et al., 2017], or related methods like Hamiltonian Monte Carlo [see, e.g., Neal et al., 2011, Girolami and Calderhead, 2011], is that the discretization error is lower compared to overdamped Langevin MCMC. However, when stochastic gradients are used [see Chen et al., 2014, Ma et al., 2015, for implementations], this advantage can be lost, as the variance of the gradient estimates dominates the total error. We thus apply control-variate techniques to this second-order method. This reduces the variance of the gradient estimates to the order of the discretization error and enables us to recover faster rates of convergence.

  Input: Gradient oracles , step size , smoothness , batch size .
  Set .
  Set
  for  do
     Draw a set of size u.a.r.
     Update using (8)
     Sample defined in (41)
  end for
  Output: Iterates .
Algorithm 3 CV Underdamped Langevin MCMC
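For intuition only, here is a naive Euler–Maruyama sketch of the underdamped dynamics (6) driven by a control-variate gradient estimate; Algorithm 3 instead integrates the discretization (7) exactly and samples the Gaussian pair with the covariance given in Appendix C, so the friction and scaling constants gamma and u and the simple noise term below are illustrative assumptions, not the paper's integrator.

  import numpy as np

  def cv_uld_euler_step(x, v, cv_grad, h, gamma, u, rng):
      # Naive Euler-Maruyama step for a second-order (underdamped) Langevin SDE of the form
      #   dv = -gamma * v dt - u * grad f(x) dt + sqrt(2 * gamma * u) dB,   dx = v dt,
      # with grad f(x) replaced by the control-variate estimate cv_grad(x).
      noise = np.sqrt(2.0 * gamma * u * h) * rng.standard_normal(v.shape)
      v_new = v + h * (-gamma * v - u * cv_grad(x)) + noise
      x_new = x + h * v
      return x_new, v_new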

The discretization of SDE (6) (which we can simulate efficiently) is

(7)

with initial conditions (the previous iterate of the Markov chain) and is the estimate of the gradient at , defined in (8). We integrate (7) for time (the step size) to get the next iterate of the chain— for some . This MCMC procedure was introduced and analyzed by Cheng et al. [2017], who show that, given access to full gradient oracles, the chain converges in steps (without Assumption 4), as opposed to standard Langevin diffusion, which takes steps (with Assumption 4). With noisy gradients (variance ), however, the mixing time of underdamped Langevin MCMC again degrades to .

In Algorithm 3 we use control variates to reduce the variance and are able to provably recover the fast mixing-time guarantee in Theorem 4.3. Algorithm 3 requires a pre-processing step of calculating the (approximate) minimum of , as opposed to Algorithms 1 and 2; however, since is strongly convex, this pre-processing cost (using, say, SAGA for optimization with stochastic gradients) is small compared to the computational cost of the other steps.

In Algorithm 3 the updates of the gradients are dictated by,

(8)

The random vector that we draw, , conditioned on , is a Gaussian vector with conditional mean and variance that can be calculated explicitly in closed form in terms of the algorithm parameters and . Its expression is presented in Appendix C. Note that is a Gaussian vector and can be sampled in time.

4 Convergence results

In this section we provide convergence results of the algorithms presented above, which improve upon the convergence guarantees for SGLD. Dalalyan and Karagulyan [2017] show that for SGLD run for iterations:

(9)

under Assumptions 2-4, with access to stochastic gradients with bounded variance – . The term involving the variance – dominates the others in many interesting regimes. For the sum-decomposable functions that we study in this paper this is also the case, as the variance of the gradient estimate usually scales linearly with . Therefore the performance of SGLD deteriorates when compared to the convergence guarantees of Langevin diffusion, where . To prove our convergence results we follow the general framework established by Dalalyan and Karagulyan [2017], with the noteworthy difference of working with more sophisticated Lyapunov functions (for Theorems 4.1 and 4.2) inspired by proof techniques in optimization theory. This strengthens the connection between optimization and sampling methods raised in previous work and may potentially be applied to other sampling algorithms (we elaborate on these connections in more detail in Appendix B). This proof technique also allows us to sharpen the convergence guarantees obtained by Dubey et al. [2016] for variance-reduction methods like SAGA and SVRG, by presenting bounds in and by dropping the assumption of uniformly bounded gradients. We now present convergence guarantees for Algorithm 1.

Theorem 4.1.

Let assumptions 1-4 hold. Let be the distribution of the iterate of Algorithm 1 after steps. If we set the step size to be and the batch size then we have the guarantee:

(10)

For the sake of clarity, only results for small step sizes are presented; however, it is worth noting that convergence guarantees hold for any (see details in Appendix B.2). If we consider the regime where and all scale linearly with the number of samples , then for SGLD the dominating term is . If the target accuracy is , SGLD would require the step size to scale as , while for SAGA a step size of is sufficient. The mixing time for both methods is roughly proportional to the inverse step size; thus SAGA provably takes fewer iterations while having almost the same computational complexity per step as SGLD. As in the optimization setting, SAGA Langevin diffusion theoretically recovers the fast rate of Langevin diffusion while using only cheap gradient updates. Next we present our guarantees for Algorithm 2.

Theorem 4.2.

Let assumptions 1-4 hold. Let be the distribution of the iterate of Algorithm 2 after steps.

If we set , , and run then for all mod we have

(11)

If we set and run for iterations then,

(12)

For Option I, if we study the same regime as before, where and scale linearly with , we find that the discretization error is dominated by the term , which is of order . To achieve a target accuracy of we would need . This is less impressive than the guarantee for SAGA; essentially we only gain a constant factor compared to the guarantee for SGLD. This behavior may be explained as follows: at each epoch, a constant decrease of the objective is needed in the classical proof of SVRG when applied to optimization. When the step size is small, the epoch length is required to be large, which washes away the advantages of variance reduction.

For Option II, convergence guarantees similar to those for SAGA are obtained, but worse by a factor of . In contrast to SAGA, this result holds only for small step sizes, with the constants in Eq. (12) blowing up exponentially quickly for larger step sizes (for more details see the proof in Appendix B.1). We also find that, experimentally, SAGA routinely outperforms SVRG both in terms of run-time and iteration complexity to achieve a desired target accuracy. However, it is not clear whether our failure to recover bounds matching SAGA is an artifact of our proof techniques or whether SVRG is less suited to sampling methods. We now state our convergence guarantees for Algorithm 3.

Theorem 4.3.

Let assumptions 1-3 hold. Let be the distribution of the iterate of Algorithm 3 after steps starting with the initial distribution . If we set the step size to be and run Algorithm 3 then we have the guarantee that

(13)

We initialize the chain in Algorithm 3 with , the global minimizer of , as we already need to calculate it to build the gradient estimate. Observe that Theorem 4.3 does not guarantee that the error drops to when , but is proportional to the standard deviation of our gradient estimate. This is in contrast to SAGA- and SVRG-based algorithms, where a more involved gradient estimate is used. The advantage, however, of using this second-order method is that we reach a desired error level at a faster rate, as the step size can be chosen proportional to , which is better than the corresponding results of Theorems 4.1 and 4.2, and without Assumption 4 (Hessian Lipschitzness).

Note that by Lemma C.7 we have the guarantee that ; this motivates the choice of and , with . It is easy to check that under this choice of and , Theorem 4.3 guarantees that . We note that no attempt has been made to optimize the constants. To interpret these results more carefully, let us consider the case when both scale linearly with the number of samples . Here the number of steps is and the batch size is . If we compare this to previous results on control-variate variance-reduction techniques applied to overdamped Langevin MCMC by Baker et al. [2017], the corresponding rates are and ; essentially, it is possible to get a quadratic improvement by using a second-order method even in the presence of noisy gradients. Note, however, that these methods are not viable when the target accuracy is small, as the batch size needs to grow as .


Algorithm Mixing Time Computation
LD
ULD
SGLD
SGULD
SAGA-LD
SVRG-LD (I)
SVRG-LD (II)
CV-LD
CV-ULD
Table 1: Mixing time and computational complexity comparison of Langevin sampling algorithms. All the entries in the table are in Big-O notation which hides constants and poly-logarithmic factors. Note that the guarantees presented for ULD, SGULD, CV-LD and CV-ULD are without the Hessian Lipschitz assumption, 4.
Comparison of Methods.

Here we compare the theoretical guarantees of Langevin MCMC [LD, Durmus and Moulines, 2016], underdamped Langevin MCMC [ULD, Cheng et al., 2017], SGLD [Dalalyan and Karagulyan, 2017], stochastic gradient underdamped Langevin diffusion [SGULD, Cheng et al., 2017], SAGA-LD (Algorithm 1), SVRG-LD (Algorithm 2 with Options I and II), control-variate Langevin diffusion [CV-LD, Baker et al., 2017] and control-variate underdamped Langevin diffusion (CV-ULD, Algorithm 3). We always consider the scenario where and scale linearly with and where (the tall-data regime). We note that the memory cost of all these algorithms except SAGA-LD is ; for SAGA-LD the worst-case memory cost scales as . Next we compare the mixing time ( ), that is, the number of steps needed to provably have error less than measured in , and the computational complexity, which is the mixing time times the query complexity per iteration. In the comparison below we focus on the dependence of the mixing time and computational complexity on the dimension , the number of samples , the condition number , and the target accuracy . The mini-batch size has no effect on the computational complexity of SGLD, SGULD and SAGA-LD, while for SVRG-LD, CV-LD and CV-ULD the mini-batch size is chosen to optimize the upper bound.

Figure 1: Different regimes. The horizontal axis represents the target accuracy and the vertical axis the predicted run-time (number of queries to the gradient oracle) of the different algorithms (SAGA-LD, CV-LD, SGLD).

As illustrated in Fig. 1, we see a qualitative difference in the behavior of variance-reduced algorithms compared to methods like SGLD. In applications like calculating higher-order statistics or computing confidence intervals to quantify uncertainty, it is imperative to calculate the posterior with very high accuracy. In this regime, when the target accuracy , the computational complexity of SGLD starts to grow larger than at rate , whereas the computational cost of variance-reduced methods is lower. For SAGA-LD the computational cost is up until , after which it grows at a rate . CV-ULD also has a computational cost of up until the point where , after which it starts to grow as . When , our bounds predict both SAGA-LD and CV-ULD to have comparable performance ( ), and in some scenarios one might outperform the other. For higher accuracy our results predict that SAGA-LD performs better than CV-ULD. Note that Option II of SVRG also performs well in this regime of small , but not as well as SAGA-LD or CV-ULD.

At the other end of the spectrum, for most classical statistical problems an accuracy of is sufficient and less than a single pass over the data is enough. In this regime, when and we are looking to find a crude solution quickly, our bounds predict that SGLD is the fastest method. Other variance-reduction methods need at least a single pass over the data to initialize.

Our sharp theoretical bounds allow us to classify and accurately identify regimes where the different variance-reduction algorithms are efficient, bridging the gap between experimentally observed phenomena and the theoretical guarantees of previous works. Also noteworthy is that here we compare the algorithms only in the tall-data regime, which grossly simplifies our results in Sec. 4; many other interesting regimes could be considered, for example the fat-data regime where , but we omit this discussion here.

5 Experiments

Figure 2: Number of passes through the datasets versus log held-out probability on test datasets.

In this section we explore the performance of SG-MCMC with variance reduction via experiments. We compare SAGA-LD, SVRG-LD (with Option II), CV-LD, and CV-ULD, and use SGLD as the baseline method.

5.1 Bayesian Logistic Regression

We demonstrate results from sampling a Bayesian logistic regression model. We consider a design matrix comprised of samples, each with covariates, and a binary response variable [Gelman et al., 2004]. If we denote the logistic link function by , a Bayesian logistic regression model of the binary response with likelihood is obtained by introducing regression coefficients with a Gaussian prior , where in the experiments.
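A sketch of one per-datum potential gradient for this model, splitting the Gaussian prior evenly across the N likelihood terms so that the potential is sum-decomposable (the unit prior variance, the variable names and this particular splitting are our own illustrative choices), is:

  import numpy as np

  def sigmoid(t):
      return 1.0 / (1.0 + np.exp(-t))

  def grad_fi(i, theta, X, y, N, prior_var=1.0):
      # f_i(theta) = -log p(y_i | x_i, theta) - (1/N) log p(theta), with y_i in {0, 1} and a
      # N(0, prior_var * I) prior; the logistic negative log-likelihood gradient is
      # (sigmoid(x_i . theta) - y_i) * x_i.
      return (sigmoid(X[i] @ theta) - y[i]) * X[i] + theta / (N * prior_var)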

We make use of three datasets available at the UCI machine learning repository. The first two datasets describe the connections between heart disease and diabetes and various patient-specific covariates. The third dataset captures the generation of supersymmetric particles and its relationship with the kinematic properties of the underlying process. We use part of each dataset to obtain a mean estimate of the parameters and hold out the rest to test its likelihood under the estimated model. The sizes of the datasets used for Bayesian estimation are , , and , respectively.

Performance is measured by the log probability of the held-out dataset under the trained model. We first find the optimal log held-out probability attainable by all the methods currently being tested. We then target levels of log held-out probability increasingly closer to the optimal one with each method. We record the number of passes through the data required for each method to achieve the desired log held-out probability (averaged over trials) for comparison in Fig. 2. We fix the batch size as a constant, to explore whether the overall computational cost for SG-MCMC methods can grow sub-linearly (or even be constant) with the overall size of the dataset . A grid search is performed for the optimal hyperparameters in each algorithm, including an optimal schedule of decreasing step sizes. For CV-LD, we first use stochastic gradient descent with the SAGA variance-reduction method to find the approximate mode . We then calculate the full-data gradient at and initialize the sampling algorithm at .

From the experiments, we recover the three regimes displayed in Fig. 1 with different data sizes and accuracy levels with error . When is large, SGLD performs best for big . When is small, CV-LD/ULD is the fastest for relatively big . When and are both small, so that many passes through the data are required, SAGA-LD is the most efficient method. It is also clear from Fig. 2 that although the CV-LD/ULD methods initially converge fast, there is a non-decreasing error (with the constant mini-batch size) even after the algorithm converges (corresponding to the last term in Eq. (13)). Because CV-LD and CV-ULD both converge fast and have the same non-decreasing error, their performances overlap. Convergence of SVRG-LD is slower than that of SAGA-LD, because the control variate for the stochastic gradient is only updated once every epoch. This attribute, combined with the need to compute the full gradient periodically, makes it less efficient and costlier than SAGA-LD. We also see that the number of passes through the dataset required for SG-MCMC methods (with and without variance reduction) decreases with the dataset size . Close observation shows that although the overall computational cost is not constant with growing , it is sublinear.

5.2 Breaking CLT: Synthetic Log Normal Data

Many works on SG-MCMC assume that the data in the mini-batches obey a central limit theorem (CLT), so that the stochastic gradient noise is Gaussian. But as explained by Bardenet et al. [2017], if the dataset follows a long-tailed distribution, the mini-batch size needed for the CLT to take effect may exceed the size of the entire dataset. We study the effects of breaking this CLT assumption on the behavior of SGLD and its variance-reduced variants.

We use synthetic data generated from a log-normal distribution:

and sample the parameters and according to the likelihood . It is worth noting that this target distribution not only breaks the CLT for a wide range of mini-batch sizes, but also violates Assumptions 2-4.
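A sketch of this synthetic setup (the true parameter values, the flat prior and the log-sigma parameterization below are our own illustrative choices) is:

  import numpy as np

  rng = np.random.default_rng(0)
  N = 10000
  mu_true, sigma_true = 0.0, 1.0
  z = rng.lognormal(mean=mu_true, sigma=sigma_true, size=N)   # long-tailed synthetic data

  def grad_fi(i, theta):
      # theta = (mu, log_sigma); f_i is the negative log-likelihood of the i-th point under a
      # LogNormal(mu, sigma^2) model, parameterized by log_sigma for unconstrained sampling.
      mu, log_sigma = theta
      s2 = np.exp(2.0 * log_sigma)
      r = np.log(z[i]) - mu
      d_mu = -r / s2                   # derivative of f_i with respect to mu
      d_log_sigma = 1.0 - r ** 2 / s2  # derivative of f_i with respect to log_sigma
      return np.array([d_mu, d_log_sigma])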

To see whether each method can perform well when the CLT assumption is strongly violated, we again let the mini-batch size be and grid-search for the optimal hyperparameters for each method. We use mean squared error (MSE) as the convergence criterion and take LD as the baseline method to compare and verify convergence.

From the experimental results, we see that SGLD does not converge to the target distribution. This is because most of the mini-batches only contain data close to the mode of the log-normal distribution; information about the tail is hard to capture with a stochastic gradient. SAGA-LD and SVRG-LD perform well because historical information is recorded in the gradient estimate, so that data in the tail of the distribution is accounted for. As in the previous experiments, CV-LD converges fastest at first, but retains a finite error. LD converges to the same accuracy as SAGA-LD and SVRG-LD after passes through the data. The variance-reduction methods that use long-term memory may be especially suited to this scenario, where the data in the mini-batches violate the CLT assumption.

It is also worth noting that the computational complexity for this problem is higher than in our previous experiments. The number of passes through the entire dataset needed to reach convergence is on the order of even for SAGA-LD and SVRG-LD. It would be interesting to see whether non-uniform subsampling of the dataset [Schmidt et al., 2015] can accelerate the convergence of SG-MCMC even further.

Figure 3: Number of passes through the datasets versus log mean square error (MSE).

6 Conclusions

In this paper, we derived new theoretical results for variance-reduced stochastic gradient MC. Our theory allows us to accurately characterize two major regimes. When a low-accuracy solution is desired and less than one pass over the data is sufficient, SGLD should be preferred. When high accuracy is needed, variance-reduced methods are much more powerful. There are a number of further directions worth pursuing. It would be of interest to connect sampling with advances in finite-sum optimization, specifically accelerated gradient methods [Lin et al., 2015] or single-pass methods [Lei and Jordan, 2017]. Finally, the development of a theory of lower bounds for sampling will be an essential counterpart to this work.

References

  • Baker et al. [2017] J. Baker, P. Fearnhead, E. B. Fox, and C. Nemeth. Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439, 2017.
  • Bardenet et al. [2017] R. Bardenet, A. Doucet, and C. C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.
  • Bierkens et al. [2016] J. Bierkens, P. Fearnhead, and G. Roberts. The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188, 2016.
  • Chen et al. [2017] C. Chen, W. Wang, Y. Zhang, Q. Su, and L. Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. arXiv preprint arXiv:1709.01180, 2017.
  • Chen et al. [2014] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), 2014.
  • Cheng and Bartlett [2017] X. Cheng and P. L. Bartlett. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.
  • Cheng et al. [2017] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: a non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.
  • Clark [1987] D. S. Clark. Short proof of a discrete Gronwall inequality. Discrete applied mathematics, 16(3):279–281, 1987.
  • Dalalyan [2017a] A. Dalalyan. Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 79:651–676, 2017a.
  • Dalalyan [2017b] A. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689. PMLR, 07–10 Jul 2017b.
  • Dalalyan and Karagulyan [2017] A. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.
  • Defazio [2016] A. Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems 29, pages 676–684, 2016.
  • Defazio et al. [2014] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
  • Dubey et al. [2016] K. A. Dubey, S. J. Reddi, S. A. Williamson, B. Poczos, A. J. Smola, and E. P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems 29, pages 1154–1162, 2016.
  • Durmus and Moulines [2016] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
  • Durmus and Moulines [2017] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 06 2017.
  • Durmus et al. [2016] A. Durmus, U. Simsekli, E. Moulines, R. Badeau, and G. Richard. Stochastic gradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information Processing Systems 29, pages 2047–2055, 2016.
  • Gelman et al. [2004] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 2004.
  • Girolami and Calderhead [2011] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol., 73(2):123–214, 2011.
  • Greensmith et al. [2004] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
  • Hofmann et al. [2015] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems 28, pages 2305–2313, 2015.
  • Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
  • Lei and Jordan [2017] L. Lei and M. I. Jordan. Less than a single pass: stochastically controlled stochastic gradient. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 148–156. PMLR, 20–22 Apr 2017.
  • Lin et al. [2015] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, pages 3384–3392. 2015.
  • Ma et al. [2015] Y.-A. Ma, T. Chen, and E. B. Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 28, pages 2899–2907. 2015.
  • Mörters and Peres [2010] P. Mörters and Y. Peres. Brownian Motion, volume 30. Cambridge University Press, 2010.
  • Nagapetyan et al. [2017] T. Nagapetyan, A. B. Duncan, L. Hasenclever, S. J. Vollmer, L. Szpruch, and K. Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
  • Neal et al. [2011] R. M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
  • Pavliotis [2016] G. A. Pavliotis. Stochastic Processes and Applications. Springer, 2016.
  • Ripley [2009] B. D. Ripley. Stochastic Simulation, volume 316. John Wiley & Sons, 2009.
  • Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 09 1951.
  • Schmidt et al. [2015] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In 18th International Conference on Artificial Intelligence and Statistics, 2015.
  • Schmidt et al. [2017] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shalev-Shwartz and Zhang [2013] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14:567–599, 2013.
  • Villani [2008] C. Villani. Optimal Transport: Old and New. Springer Science and Business Media, 2008.
  • Welling and Teh [2011] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688, June 2011.

Appendix A Wasserstein Distance

We formally define the Wasserstein distance in this section. Denote by the Borel -field of . Given probability measures and on , we define a transference plan between and as a probability measure on such that for all sets , and . We denote as the set of all transference plans. A pair of random variables is called a coupling if there exists a such that are distributed according to . With some abuse of notation, we will also refer to as the coupling.

We define the Wasserstein distance of order two between a pair of probability measures as follows:

Finally we denote by the set of transference plans that achieve the infimum in the definition of the Wasserstein distance between and [see, e.g., Villani, 2008, for more properties of ].


Appendix B SVRG and SAGA: Proofs and Discussion

In this section we prove Theorem 4.1 and Theorem 4.2 and include details about Algorithms 1 and 2 that were omitted from the discussion in the main paper. Throughout this section we assume that Assumptions 1-4 hold. First we consider the continuous-time (overdamped) Langevin diffusion process given by the Itô SDE:

(14)

here , is a standard Brownian motion process and is a drift added to the process, with the initial condition . Under fairly mild assumptions, for example if (absolutely integrable), the unique invariant distribution of the process (14) is . The Euler–Maruyama discretization of this process can be written as the Itô SDE:

(15)

Note that, in contrast to the process (14), the process (15) is driven by the drift evaluated at a fixed initial point . In our case we do not have access to the gradient of the function , but only to an unbiased estimate (as defined in Eq. (4) and Eq. (5) for ). This gives rise to the Itô SDE:

(16)

Throughout this section we denote by the iterates of Algorithm 1 or Algorithm 2. We also denote the distribution of the iterate of Algorithm 1 or Algorithm 2 by . With this notation in place we are now ready to present the proofs of Theorem 4.1 and Theorem 4.2.

Proof Overview and Techniques

In both the proofs of Theorem 4.1 and Theorem 4.2 we draw from, and sharpen, techniques established in the literature on analyzing Langevin MCMC methods and variance-reduction techniques in optimization. In both proofs we use Lyapunov functions that are standard in the optimization literature for analyzing these methods; we use them to define Wasserstein distances and adapt methods from the analysis of sampling algorithms to proceed.

B.1 Stochastic Variance Reduced Gradient Langevin Monte Carlo

In the proof of SVRG for Langevin diffusion, it is common to consider a Lyapunov function in addition to the standard 2-norm. We define a Wasserstein distance with respect to the distance – .

Proof of Theorem 4.2.

For any for some integer , let be a random vector drawn from such that it is optimally coupled to , that is, . We also define and (corresponding to the instance of being updated) analogously, such that and and are optimally coupled. We drop the in superscript in the proof to simplify notation. We also assume that the random selection of the set of indices to be updated ( depends on the iteration , but we drop this dependence in the proof to simplify notation) in Algorithm 2 is independent of . We evolve the random variable under the continuous process described by (14),

where the Brownian motion is independent of . Thus, integrating the above SDE up to time (the step size), we get

(17)

where . Note that since , we have that . Similarly we also have that the next iterate is given by

(18)

where is the same normally distributed random variable as in (17). Let us define and . Also define

Now that we have set up the notation, we will prove the first part of this theorem. We proceed in 5 steps. In Step 1 we express in terms of and ; in Step 2 we control the expected value of . In Step 3 we express in terms of and other terms, while in Step 4 we use the characterization of in terms of , combined with the techniques established by Dubey et al. [2016], to bound the expected value of . Finally, in Step 5 we put this all together and establish our result. First we prove the result for Algorithm 2 run with Option I.

Step 1: By Young’s inequality we have that ,

(19)

We will choose at a later stage in the proof to minimize the bound on the right hand side.

Step 2: By Lemma 6 of Dalalyan and Karagulyan [2017] we have the bound,

(20)

Step 3: Next we will bound the other term in (19), . First we express in terms of ,

(21)

Step 4: Using the characterization of in terms of established above, we get

Now we take expectations with respect to all sources of randomness (the Brownian motion and the randomness in the choice of ), conditioned on and (thus is fixed). Recall that, conditioned on and , and are zero mean; thus we get

where denotes conditioning on and . First we bound