# Stochastic Gradient Descent as Approximate Bayesian Inference

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.

## Authors

• 34 publications
• 20 publications
• 76 publications
• ### A Variational Analysis of Stochastic Gradient Algorithms

Stochastic Gradient Descent (SGD) is an important algorithm in machine l...
02/08/2016 ∙ by Stephan Mandt, et al. ∙ 0

• ### A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

This work provides a simplified proof of the statistical minimax optimal...
10/25/2017 ∙ by Prateek Jain, et al. ∙ 0

• ### A Simple Baseline for Bayesian Uncertainty in Deep Learning

We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose ...
02/07/2019 ∙ by Wesley Maddox, et al. ∙ 20

• ### Mix and Match: An Optimistic Tree-Search Approach for Learning Models from Mixture Distributions

We consider a co-variate shift problem where one has access to several m...
07/23/2019 ∙ by Matthew Faw, et al. ∙ 5

• ### On Compression Principle and Bayesian Optimization for Neural Networks

Finding methods for making generalizable predictions is a fundamental pr...
06/23/2020 ∙ by Michael Tetelman, et al. ∙ 13

• ### The Hierarchical Adaptive Forgetting Variational Filter

A common problem in Machine Learning and statistics consists in detectin...
05/15/2018 ∙ by Vincent Moens, et al. ∙ 0

• ### Isotropic SGD: a Practical Approach to Bayesian Posterior Sampling

In this work we define a unified mathematical framework to deepen our un...
06/09/2020 ∙ by Giulio Franzese, et al. ∙ 0

## Code Repositories

### foundations_for_deep_learning

Building a scalable foundation for deep learning

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Stochastic gradient descent (SGD) has become crucial to modern machine learning. SGD optimizes a function by following noisy gradients with a decreasing step size. The classical result of

Robbins and Monro (1951) is that this procedure provably reaches the optimum of the function (or local optimum, when it is nonconvex) (Bouleau and Lepingle, 1994). Recent studies investigate the merits of adaptive step sizes (Duchi et al., 2011; Tieleman and Hinton, 2012), gradient or iterate averaging (Toulis et al., 2016; Défossez and Bach, 2015), and constant step-sizes (Bach and Moulines, 2013; Flammarion and Bach, 2015). Stochastic gradient descent has enabled efficient optimization with massive data, since one can often obtain noisy-but-unbiased gradients very cheaply by randomly subsampling a large dataset.

Recently, stochastic gradients (SG) have also been used in the service of scalable Bayesian Markov Chain Monte Carlo (MCMC) methods, where the goal is to generate samples from a conditional distribution of latent variables given a data set. In Bayesian inference, we assume a probabilistic model with data and hidden variables ; our goal is to approximate the posterior

 p(θ|x)=exp{logp(θ,x)−logp(x)}. (1)

These so-called stochastic gradient MCMC algorithms—such as SG Langevin dynamics (Welling and Teh, 2011), SG Hamiltonian Monte Carlo (Chen et al., 2014), SG thermostats (Ding et al., 2014), and SG Fisher scoring (Ahn et al., 2012)—employ stochastic gradients of to improve convergence and computation of existing sampling algorithms. Also see Ma et al. (2015) for a complete classification of these algorithms.

The similarities between SGD as an optimization algorithm and stochastic gradient MCMC algorithms raise the question of how exactly these two types of algorithms relate to each other. The main questions we try to address in this paper are:

• What is the simplest modification to SGD that yields an efficient approximate Bayesian sampling algorithm?

• How can we construct other sampling algorithms based on variations of SGD such as preconditioning (Duchi et al., 2011; Tieleman and Hinton, 2012), momentum (Polyak, 1964) or Polyak averaging (Polyak and Juditsky, 1992)?

To answer these questions, we draw on the theoretical analysis tools of continuous-time stochastic differential equations (Bachelier, 1900; Gardiner et al., 1985) and variational inference (Jordan et al., 1999).

As a simple example, consider SGD with a constant learning rate (constant SGD). Constant SGD first marches toward an optimum of the objective function and then bounces around its vicinity. (In contrast, traditional SGD converges to the optimum by decreasing the learning rate.) Our analysis below rests on the idea that constant SGD is a stochastic process with a stationary distribution, one that is centered on the optimum and that has a certain covariance structure. The main idea is that we can use this stationary distribution to approximate a posterior. In contrast, stochastic gradient MCMC algorithms take precautions to sample from an asymptotically exact posterior, but at the expense of slower mixing. Our inexact approach enjoys minimal implementation effort and typically faster mixing. It is a hybrid of Monte Carlo and variational algorithms, and complements the toolbox of approximate Bayesian inference.

Here is how it works. We apply constant SGD as though we were trying to minimize the negative log-joint probability

over the model parameters . Constant SGD has several tunable parameters: the constant learning rate, the minibatch size, and the preconditioning matrix (if any) that we apply to the gradient updates. These tuning parameters all affect the stationary distribution of constant SGD; depending on how they are set, this stationary distribution of will be closer to or farther from the posterior distribution . If we set these parameters appropriately, we can perform approximate Bayesian inference by simply running constant SGD.

In more detail, we make the following contributions:

• First, we develop a variational Bayesian view of stochastic gradient descent. Based on its interpretation as a continuous-time stochastic process—specifically a multivariate Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930; Gardiner et al., 1985)

—we compute stationary distributions for a large class of SGD algorithms, all of which converge to a Gaussian distribution with a non-trivial covariance matrix. The stationary distribution is parameterized by the learning rate, minibatch size, and preconditioning matrix.

Results about the multivariate OU process make it easy to minimize the KL divergence between the stationary distribution and the posterior analytically. We can thus relate the optimal step size or preconditioning matrix to the Hessian and noise covariances near the optimum. The optimal preconditioners relate to AdaGrad (Duchi et al., 2011)

(Tieleman and Hinton, 2012), and classical Fisher scoring (Longford, 1987). We demonstrate how these different optimization methods compare when used for approximate inference.

• We show that constant SGD gives rise to a new variational EM algorithm (Bishop, 2006) which allows us to use SGD to optimize hyperparameters while performing approximate inference in a Bayesian model. We demonstrate this by fitting a posterior to a Bayesian multinomial regression model.

• We use our formalism to derive the stationary distribution for SGD with momentum (Polyak, 1964). Our results show that adding momentum only changes the scale of the covariance of the stationary distribution, not its shape. This scaling factor is a simple function of the damping coefficient. Thus we can also use SGD with momentum for approximate Bayesian inference.

• Then, we analyze scalable MCMC algorithms. Specifically, we use the stochastic-process perspective to compute the stationary distribution of Stochastic-Gradient Langevin Dynamics (SGLD) by Welling and Teh (2011) when using constant learning rates, and analyze stochastic gradient Fisher scoring (SGFS) by Ahn et al. (2012). The view from the multivariate OU process reveals a simple justification for this method: we confirm that the preconditioning matrix suggested in SGFS is indeed optimal. We also derive a criterion for the free noise parameter in SGFS that can enhance numerical stability, and we show how the stationary distribution is modified when the preconditioner is approximated with a diagonal matrix (as is often done in practice for high-dimensional problems).

• Finally, we analyze iterate averaging (Polyak and Juditsky, 1992)

, where one successively averages the iterates of SGD to obtain a lower-variance estimator of the optimum. Based on the stochastic-process methodology, we give a shorter derivation of a known result, namely that the convergence speed of iterate averaging cannot be improved by preconditioning the stochastic gradient with any matrix. Furthermore, we show that (under certain assumptions), Polyak iterate averaging can yield an

optimal stochastic-gradient MCMC algorithm, and that this optimal sampler can generate exactly one effectively independent sample per pass through the dataset. This result is both positive and negative; it suggests that iterate averaging can be used as a powerful Bayesian sampler, but it also argues that no SG-MCMC algorithm can generate more than one useful sample per pass through the data, and so the cost of these algorithms must scale linearly with dataset size.

Our paper is organized as follows. In Section 3 we review the continuous-time limit of SGD, showing that it can be interpreted as an OU process. In Section 4 we present consequences of this perspective: the interpretation of SGD as variational Bayes and results around preconditioning and momentum. Section 5 discusses SG Langevin Dynamics and SG Fisher Scoring. In Section 6 we discuss Polyak averaging for optimization and sampling. In the empirical study (Section 7), we show that our theoretical assumptions are satisfied for different models, that we can use SGD to perform gradient-based hyperparameter optimization, and that iterate averaging gives rise to a Bayesian sampler with fast mixing.

## 2 Related Work

Our paper relates to Bayesian inference and stochastic optimization.

Scalable MCMC.

Recent work in Bayesian statistics focuses on making MCMC sampling algorithms scalable by using stochastic gradients. In particular,

Welling and Teh (2011) developed stochastic-gradient Langevin dynamics (SGLD). This algorithm samples from a Bayesian posterior by adding artificial noise to the stochastic gradient which, as the step size decays, comes to dominate the SGD noise. Also see Sato and Nakagawa (2014) for a detailed convergence analysis of the algorithm. Though elegant, one disadvantage of SGLD is that the step size must be decreased to arrive at the correct sampling regime, and as step sizes get small so does mixing speed. Other research suggests improvements to this issue, using Hamiltonian Monte Carlo (Chen et al., 2014) or thermostats (Ding et al., 2014). Shang et al. (2015) build on thermostats and use a similar continuous-time formalism as used in this paper.

Ma et al. (2015) give a complete classification of possible stochastic gradient-based MCMC schemes.

Below, we will analyze properties of stochastic gradient Fisher scoring (SGFS; Ahn et al., 2012), an extention to SGLD. This algorithm speeds up mixing times in SGLD by preconditioning gradients with the inverse gradient noise covariance. Ahn et al. (2012) show that (under some assumptions) SGFS can eliminate the bias associated with running SGLD with non-vanishing learning rates. Our approach to analysis can extend and sharpen the results of Ahn et al. (2012)

. For example, they propose using a diagonal approximation to the gradient noise covariance as a heuristic; in our framework, we can analyze and rigorously justify this choice as a variational Bayesian approximation.

Maclaurin et al. (2016) also interpret SGD as a non-parametric variational inference scheme, but with different goals and in a different formalism. The paper proposes a way to track entropy changes in the implicit variational objective, based on estimates of the Hessian. As such, the authors mainly consider sampling distributions that are not stationary, whereas we focus on constant learning rates and distributions that have (approximately) converged. Note that their notion of hyperparameters does not refer to model parameters but to parameters of SGD.

Stochastic Optimization.   Stochastic gradient descent is an active field (Zhang, 2004; Bottou, 1998). Many papers discuss constant step-size SGD. Bach and Moulines (2013); Flammarion and Bach (2015) discuss convergence rate of averaged gradients with constant step size, while Défossez and Bach (2015) analyze sampling distributions using quasi-martingale techniques. Toulis et al. (2014) calculate the asymptotic variance of SGD for the case of decreasing learning rates, assuming that the data is distributed according to the model. None of these papers consider the Bayesian setting. Dieuleveut et al. (2017) also analyzed SGD with constant step size and its relation to Markov chains. Their analysis resulted in a novel extrapolation scheme to improve the convergence behavior of iterate averaging.

The fact that optimal preconditioning (using a decreasing Robbins-Monro schedule) is achieved by choosing the inverse noise covariance was first shown in (Sakrison, 1965), but here we derive the same result based on different arguments and suggest a scalar prefactor. Note the optimal scalar learning rate of , where is the SGD noise covariance (as discussed in Section 4 or this paper), can also be derived based on stability arguments. This was done in the context of least mean square filters (Widrow and Stearns, 1985).

Finally, Chen et al. (2016) also draw analogies between SGD and scalable MCMC. They suggest annealing the posterior over time to use scalable MCMC as a tool for global optimization. We follow the opposite idea and suggest to use constant SGD as an approximate sampler by choosing appropriate learning rate and preconditioners.

Stochastic differential equations.   The idea of analyzing stochastic gradient descent with stochastic differential equations is well established in the stochastic approximation literature (Kushner and Yin, 2003; Ljung et al., 2012). Recent work focuses on dynamical aspects of the algorithm. Li et al. (2015) discuss several one-dimensional cases and momentum. Li et al. (2017) give a mathematically rigorous justification of the continuous-time limit. Chen et al. (2015) analyze stochastic gradient MCMC and study their convergence properties using stochastic differential equations.

Our work makes use of the same formalism but has a different focus. Instead of analyzing dynamical properties, we focus on stationary distributions. Further, our paper introduces the idea of minimizing KL divergence between multivariate sampling distributions and the posterior.

Variational Inference.   Variational Inference (VI) denotes a set of methods which aim at approximating a Bayesian posterior by a simpler, typically factorized distribution. This is done by minimizing Kullback-Leibler divergence or related divergences between these distributions (Jordan et al., 1999; Opper and Saad, 2001). For the class of models where the conditional distributions are in the exponential family, the variational objective can be optimized by closed-form updates (Ghahramani and Beal, 2000)

, but this is a restricted class of models with conjugate priors. A scalable version of VI, termed Stochastic Variational Inference (SVI), relies on stochastic gradient descent for data subsampling

(Hoffman et al., 2013). For non-conjugate models, Black-Box variational inference (Ranganath et al., 2014) has enabled SVI for a large class of models, but this approach may suffer from high-variance gradients. A modified form of black-box variational inference relies on re-parameterization gradients (Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014; Kucukelbir et al., 2015; Ruiz et al., 2016). This version is limited to continuous latent variables but typically has much lower-variance gradients.

In this paper, we compare against the Gaussian reparameterization gradient version of black-box variational inference as used in Kingma and Welling (2014); Rezende et al. (2014); Kucukelbir et al. (2015) which we refer to as BBVI. We find that our approach performs similarly in practice, but it is different in that it does not optimize the parameters of a simple variational distribution. Rather, it controls the shape of the approximate posterior via the parameters of the optimization algorithm, such as the learning rate or preconditioning matrix.

## 3 Continuous-Time Limit Revisited

We first review the theoretical framework that we use throughout the paper. Our goal is to characterize the behavior of SGD when using a constant step size. To do this, we approximate SGD with a continuous-time stochastic process (Kushner and Yin, 2003; Ljung et al., 2012).

### 3.1 Problem Setup

Consider loss functions of the following form:

 L(θ)=1N∑Nn=1ℓn(θ),g(θ)≡∇θL(θ). (2)

Such loss functions are common in machine learning, where is a loss function that depends on data and parameters . Each is the contribution to the overall loss from a single observation

. For example, when finding a maximum-a-posteriori estimate of a model, the contributions to the loss may be

 ℓn(θ)=−logp(xn|θ)−1Nlogp(θ), (3)

where is the likelihood and is the prior. For simpler notation, we will suppress the dependence of the loss on the data.

From this loss we construct stochastic gradients. Let be a set of random indices drawn uniformly at random from the set . This set indexes functions , and we call a “minibatch” of size . Based on the minibatch, we used the indexed functions to form a stochastic estimate of the loss and a stochastic gradient,

 ^LS(θ)=1S∑n∈Sℓn(θ),^gS(θ)=∇θ^LS(θ). (4)

In expectation the stochastic gradient is the full gradient, i.e., . We use this stochastic gradient in the SGD update

 θ(t+1)=θ(t)−ϵ^gS(θ(t)). (5)

Above and for what follows, we assume a constant (non-decreasing) learning rate .

Eqs. 4 and 5 define the discrete-time process that SGD simulates from. We will approximate it with a continuous-time process that is easier to analyze.

### 3.2 SGD as an Ornstein-Uhlenbeck Process

We now show how to approximate the discrete-time Eq. 5 with a continuous-time Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930). This leads to the stochastic differential equation below in Eq. 11. To justify the approximation, we make four assumptions. We verify its accuracy empirically in Section 7.

###### Assumption 1

Observe that the stochastic gradient is a sum of

independent, uniformly sampled contributions. Invoking the central limit theorem, we assume that the gradient noise is Gaussian with covariance

, hence

 ^gS(θ)≈g(θ)+1√SΔg(θ),Δg(θ)∼N(0,C(θ)). (6)
###### Assumption 2

We assume that the covariance matrix is approximately constant with respect to . As a symmetric positive-semidefinite matrix, this constant matrix factorizes as

 C(θ)≈C=BB⊤. (7)

Assumption 2 is justified when the iterates of SGD are confined to a small enough region around a local optimum of the loss (e.g. due to a small ) that the noise covariance does not vary significantly in that region.

We now define and combine Eqs. 56, and 7 to rewrite the process as

 Δθ(t)=−ϵg(θ(t))+ϵ√SBΔW,ΔW∼N(0,I). (8)

This can be interpreted as a finite-difference equation that approximates the following continuous-time stochastic differential equation:

 dθ(t)=−ϵg(θ)dt+ϵ√SBdW(t). (9)
###### Assumption 3

We assume that we can approximate the finite-difference equation (8) by the stochastic differential equation (9).

This assumption is justified if either the gradients or the learning rates are small enough that the discretization error becomes negligible.

###### Assumption 4

We assume that the stationary distribution of the iterates is constrained to a region where the loss is well approximated by a quadratic function,

 L(θ)=12θ⊤Aθ. (10)

(Without loss of generality, we assume that a minimum of the loss is at .) We also assume that is positive definite.

The symmetric matrix is thus the Hessian at the optimum. Assumption 4 makes sense when the loss function is smooth and the stochastic process reaches a low-variance quasi-stationary distribution around a deep local minimum. The exit time of a stochastic process is typically exponential in the height of the barriers between minima, which can make local optima very stable even in the presence of noise (Kramers, 1940).

SGD as an Ornstein-Uhlenbeck process.   The four assumptions above result in a specific kind of stochastic process, the multivariate Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930):

 dθ(t)=−ϵAθ(t)dt+1√SϵBdW(t) (11)

This connection helps us analyze properties of SGD because the Ornstein-Uhlenbeck process has an analytic stationary distribution that is Gaussian. This distribution will be the core analytic tool of this paper:

 q(θ)∝exp{−12θ⊤Σ−1θ}. (12)

The covariance satisfies

 ΣA+AΣ=ϵSBB⊤. (13)

(More details are in Appendix B.) Without explicitly solving this equation, we see that the resulting covariance is proportional to the learning rate and inversely proportional to the magnitude of and minibatch size . This characterizes the stationary distribution of running SGD with a constant step size.

Discussion of Assumptions 1–4.   Our analysis suggests that constant SGD and Langevin-type diffusion algorithms (Welling and Teh, 2011) are very similar. Both types of algorithms can be characterized by three regimes. First, there is a search phase where the algorithm approaches the optimum. In this early phase, assumptions 1–4 are often violated and it is hard to say anything general about the behavior of the algorithm. Second, there is a phase where SGD has converged to the vicinity of a local optimum. Here, the objective already looks quadratic, but the gradient noise is small relative to the average gradient . Thus SGD takes a relatively directed path towards the optimum. This is the regime where our assumptions should be approximately valid, and where our formalism reveals its use. Finally, in the third phase the iterates are near the local optimum. Here, the average gradient is small and the sampling noise becomes more important. In this final phase, constant SGD begins to sample from its stationary distribution.

Finally, we note that if the gradient noise covariance does not have full rank, then neither will the stationary covariance . This scenario complicates the analysis in Section 4, so below we will assume that has full rank. This could be enforced easily by adding very low-magnitude isotropic artificial Gaussian noise to the stochastic gradients.

## 4 SGD as Approximate Inference

We discussed a continuous-time interpretation of SGD with a constant step size (constant SGD). We now discuss how to use constant SGD as an approximate inference algorithm. To repeat the set-up from the introduction, consider a probabilistic model with data and hidden variables ; our goal is to approximate the posterior in Eq. 1.

We set the loss to be the negative log-joint distribution (Eqs.

2 and 3

), which equals the negative log-posterior up to an additive constant. The classical goal of SGD is to minimize this loss, leading us to a maximum-a-posteriori point estimate of the parameters. This is how SGD is used in many statistical models, including logistic regression, linear regression, matrix factorization, and neural networks. In contrast, our goal here is to tune the parameters of SGD so that its stationary distribution approximates the posterior.

Fig. 1 shows an example. Here we illustrate two Bayesian posteriors—from a linear regression problem (top) and a logistic regression problem (bottom)—along with iterates from a constant SGD algorithm. In these figures, we set the parameters of the optimization to values that minimize the Kullback-Leibler (KL) divergence between the stationary distribution of the OU process and the posterior—these results come from our theorems below. The left plots optimize both a preconditioning matrix and the step size; the middle plots optimize only the step size (both are outlined in Section 4.1). We can see that the stationary distribution of constant SGD can be made close to the exact posterior.

Fig. 2 also compares the empirical covariance of the iterates with the predicted covariance in terms of Eq. 13. The close match supports the assumptions of Section 3.

We will use this perspective in three ways. First, we develop optimal conditions for constant SGD to best approximate the posterior, connecting to well-known results around adaptive learning rates and preconditioners (Duchi et al., 2011; Tieleman and Hinton, 2012). Second, we propose an algorithm for hyperparameter optimization based on constant SGD. Third, we use it to analyze the stationary distribution of stochastic gradient descent with momentum (Polyak, 1964).

### 4.1 Constant Stochastic Gradient Descent

First, we show how to tune constant SGD’s parameters to minimize KL divergence to the posterior; this is a type of variational inference (Jordan et al., 1999). The analysis leads to three versions of constant SGD—one with a constant step size, one with a full preconditioning matrix, and one with a diagonal preconditioning matrix. Each yields samples from an approximate posterior, and each reflects a different tradeoff between efficiency and accuracy. Finally, we show how to use these algorithms to learn hyperparameters.

Assumption 4 from Section 3 says that the posterior is approximately Gaussian in the region that the stationary distribution focuses on,

 f(θ)∝exp{−N2θ⊤Aθ}. (14)

The scalar corrects the averaging in equation 2. Furthermore, in this section we will consider a more general SGD scheme that may involve a preconditioning matrix instead of a scalar learning rate :

 θt+1=θt−H^gS(θ(t)).

We will set the parameters of SGD to minimize the KL divergence between the stationary distribution (Eqs. 12 and 13) and the posterior (Eq. 14). This involves the the learning rate or more generally the preconditioning matrix and the minibatch size :

 {H∗,S∗}=argminH,SKL(q∣∣f).

First, consider a scalar learning rate (or a trivial preconditioner ). The distributions and are both Gaussians. Their means coincide, at the minimum of the loss, and so their KL divergence is

 KL(q||f) =−Eq[logf(θ)]+Eq[logq(θ)] =12(NEq[θ⊤Aθ]−log|NA|−log|Σ|−D) =12(NTr(AΣ)−log|NA|−log|Σ|−D),

where is the determinant and is the dimension of .

We suggest three variants of constant SGD that generate samples from an approximate posterior.

[Constant SGD] Under Assumptions A1-A4, the constant learning rate that minimizes KL divergence from the stationary distribution of constant SGD to the posterior is

 ϵ∗=2SNDTr(BB⊤). (15)

To prove this claim, we face the problem that the covariance of the stationary distribution depends indirectly on through Eq. 13. Inspecting this equation reveals that is independent of and . This simplifies the entropy term . Since is constant, we can neglect it when minimizing KL divergence.

We also need to simplify the term , which still depends on and through . To do this, we again use Eq. 13, from which follows that . The KL divergence is therefore, up to constant terms,

 KL(q||f)c=ϵN2STr(BB⊤)−Dlog(ϵ/S). (16)

Minimizing KL divergence over results in Eq. 15 for the optimal learning rate.

Theorem 4.1 suggests that the learning rate should be chosen inversely proportional to the average of diagonal entries of the noise covariance, and proportional to the ratio between the minibatch size and dataset size.

Since both the posterior and the variational distribution are Gaussian, one might wonder if also the reverse KL divergence is viable. While the KL divergence can be computed up to a constant, we cannot remove its dependence on the unknown stationary distribution using Eq. 13, unless and commute. This setup is discussed in Appendix C.

Instead of using a scalar learning rate, we now consider a positive-definite preconditioning matrix . This gives us more tuning parameters to better approximate the posterior.

[Preconditioned constant SGD] The preconditioner for constant SGD that minimizes KL divergence from the stationary distribution to the posterior is

 H∗=2SN(BB⊤)−1. (17)

To prove this, we need the Ornstein-Uhlenbeck process which corresponds to preconditioned SGD. Replacing the constant learning rate in Eq. 11 with a positive-definite preconditioning matrix results in

 dθ(t)=−HAθ(t)dt+1√SHBdW(t).

All our results carry over after substituting . Eq. 13, after the transformation and multiplication by from the left, becomes

 AΣ+H−1ΣAH=1SBB⊤H. (18)

Using the cyclic property of the trace, this implies that

 Tr(AΣ)=12(Tr(AΣ)+Tr(H−1AΣH))=12STr(BB⊤H). (19)

Consider now the log determinant term, , which still has an implicit dependence on . We first define , hence since , and are symmetric. Eq. 18 can be written as , which is equivalent to . Thus, we see that is independent of . The log determinant term is up to a constant Combining Eq. 19 with this term, the KL divergence is up to a constant

 KL(q||f) c=N2STr(BB⊤H)+log|H|+log|Q|. (20)

Taking derivatives with respect to the entries of results in Eq. 17.

In high-dimensional applications, working with large dense matrices is impractical. In those settings we can constrain the preconditioner to be diagonal. The following corollaries are based on the proof of Theorem 4.1:

The optimal diagonal preconditioner for SGD that minimizes KL divergence to the posterior is . This follows from Eq. 20, where we restrict the preconditioning matrix to be diagonal.

Under assumptions A1-A4, preconditioning with the full inverse noise covariance as in Theorem 4.1 results in samples from the exact posterior.

Consider Eq. 18. Inserting results in which is solved by which is the posterior covariance.

We showed that the optimal diagonal preconditioner is the inverse of the diagonal part of the noise matrix. Similar preconditioning matrices have been suggested earlier in optimal control theory based on very different arguments, see (Widrow and Stearns, 1985). Our result also relates to AdaGrad and its relatives (Duchi et al., 2011; Tieleman and Hinton, 2012), which also adjust the preconditioner based on the square root of the diagonal entries of the noise covariance. In Appendix F we derive an optimal global learning rate for AdaGrad-style diagonal preconditioners. In Section 7, we compare three versions of constant SGD for approximate posterior inference: one with a scalar step size, one with a dense preconditioner, and one with a diagonal preconditioner.

Remark on estimating the noise covariance.   In order to use our theoretical insights in practice, we need to estimate the stochastic gradient noise covariance . We do this in an online manner. As before, let be the full gradient, be the stochastic gradient of the full minibatch, and be the stochastic gradient of the first sample in the minibatch at time (which has a much larger variance if ). For large , we can approximate , and thus obtain an estimator of the noise covariance by . Following Ahn et al. (2012), we can now build an online estimate that approaches by the following recursion,

 Ct=(1−κt)Ct−1+κt(^g1,t−^gS,t)(^g1,t−^gS,t)⊤. (21)

Above, is a decreasing learning rate. Ahn et al. (2012) have proven that such an online average converges to the noise covariance in the optimum at long times (provided that and that is sufficiently large). We found that this online estimator works well in practice, even though our theoretical assumptions would require preconditioning SGD with the true noise covariance in finite time. Regarding the computational overhead of this procedure, note that similar online estimates of the gradient noise are carried out in adaptive SGD schemes such as AdaGrad (Duchi et al., 2011) or RMSProp (Tieleman and Hinton, 2012). When using a diagonal approximation to the noise covariance, the costs are proportional in the number of dimensions and mini-batch size; this efficiency means that the online estimate does not spoil the efficiency of SGD. Full preconditioning scales quadratically in the dimension and is therefore impractical in many real-word setups.

### 4.2 Constant SGD as Variational EM

Consider a supervised probabilistic model with joint distribution that factorizes into a likelihood and prior, respectively. Our goal is to find optimal hyperparameters . Jointly point-estimating and by following gradients of the log joint leads to overfitting or degenerate solutions. This can be prevented in a Bayesian approach, where we treat the parameters as latent variables. In Empirical Bayes (or type-II maximum likelihood), we maximize the marginal likelihood of the data, integrating out the main model parameters:

 λ⋆=argmaxλlogp(y|x,λ)=argmaxλlog∫θp(y,θ|x,λ)dθ.

When this marginal log-likelihood is intractable, a common approach is to use

variational expectation-maximization (VEM)

(Bishop, 2006), which iteratively optimizes a variational lower bound on the marginal log-likelihood over . If we approximate the posterior with some distribution , then VEM tries to find a value for that maximizes the expected log-joint probability .

Constant SGD gives rise to a simple VEM algorithm that applies to a large class of differentiable models. Define . If we interpret the stationary distribution of SGD as a variational approximation to a model’s posterior, we can justify following a stochastic gradient descent scheme on both and :

 θt+1=θt−ϵ∗∇θL(θt,λt);λt+1=λt−ρt∇λL(θt,λt). (22)

While the update for uses the optimal constant learning rate and therefore samples from an approximate posterior, the update uses a decreasing learning rate and therefore converges to a local optimum. The result is a type of VEM algorithm.

We stress that the optimal constant learning rate is not unknown, but can be estimated based on an online estimate of the noise covariance , as given in Eq. 21. In Section 7 we show that gradient-based hyperparameter learning is a cheap alternative to cross-validation.

### 4.3 Stochastic Gradient with Momentum

The continuous-time formalism also allows us to explore extensions of classical SGD. One of the most popular methods is stochastic gradient with momentum (Polyak, 1964; Sutskever et al., 2013). Here, we give a version of this algorithm that allows us to sample from an approximate posterior.

SGD with momentum doubles the dimension of parameter space in introducing an additional momentum variable v that has the same dimension as . Position and momentum are coupled in such a way that the algorithm keeps memory of some of its past gradients. The updates of SGD with momentum are

 v(t+1) = (1−μ)v(t)−ϵ^gS(θ(t)) θ(t+1) = θ(t)+v(t+1).

This involves the damping coefficient . For (infinite damping or overdamping), the momentum information gets lost and we recover SGD.

As before we assume a quadratic objective . Going through the same steps A1-A4 of Section 3 that allowed us to derive the Ornstein-Uhlenbeck process for SGD, we find

 dv = −μvdt−ϵAθdt+1√SϵBdW, (23) dθ = vdt.

We solve this set of stochastic equations asymptotically for the long-time limit. We reformulate the stochastic equation in terms of coupled equations for the moments. (This strategy was also used by

Li et al. (2015) in a more restricted setting). The first moments of the set of coupled stochastic differential equations give

 dE[v]=−μE[v]dt−ϵAE[θ]dt,dE[θ]=E[v]dt.

Note we used that expectations commute with the differential operators. These deterministic equations have the simple solution and , which means that the momentum has expectation zero and the expectation of the position variable converges to the optimum (at ).

In order to compute the stationary distribution, we derive and solve similar equations for the second moments. These calculations are carried out in Appendix D, where we derive the following conditions:

 E[vv⊤] = ϵ2E[θθ⊤]A+ϵ2AE[θθ⊤], (24) μE[vv⊤] = ϵ22SBB⊤.

Eq. 4.3 relate to equilibrium thermodynamics, where is a matrix of expected kinetic energies, while has the interpretation of a matrix of expected potential energies. The first equation says that energy conservation holds in expectation, which is the case for an equilibrium system which exchanges energy with its environment. The second equation relates the covariance of the velocities with , which plays the role of a matrix-valued temperature. This is known as the fluctuation-dissipation theorem (Nyquist, 1928). Combining both equations and using yields

 ΣA+AΣ = ϵμSBB⊤. (25)

This is exactly Eq. 13 of SGD without momentum, however, with the difference that the noise covariance is re-scaled by a factor instead of .

We have shown that , , and play similar roles: only the combination affects the KL divergence to the posterior. Thus, no single optimal constant learning rate exists—many combinations of , , and can yield the same stationary distribution. But different choices of these parameters will affect the dynamics of the Markov chain. For example, Sutskever et al. (2013) observe that, for a given effective learning rate , using a smaller sometimes makes the discretized dynamics of SGD more stable. Also, using very small values of while holding fixed will eventually increase the autocorrelation time of the Markov chain (but this effect is often negligible in practice).

## 5 Analyzing Stochastic Gradient MCMC Algorithms

We have analyzed well-known stochastic optimization algorithms such as SGD, preconditioned SGD and SGD with momentum. We now investigate Bayesian sampling algorithms. A large class of modern MCMC methods rely on stochastic gradients (Welling and Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Ding et al., 2014; Ma et al., 2015). The central idea is to add artificial noise to the stochastic gradient to asymptotically sample from the true posterior.

In practice, however, the algorithms are used in combination with several heuristic approximations, such as non-vanishing learning rates or diagonal approximations to the Hessian. In this section, we use the variational perspective to quantify these biases and help understanding these algorithms better under more realistic assumptions.

### 5.1 SGLD with Constant Rates

To begin with, we we analyze the well-known Stochastic Gradient Langevin Dynamics by Welling and Teh (2011). This algorithm has been carefully analyzed in the long time limit where the stochastic gradient noise vanishes as the learning rate goes to zero, and where mixing becomes infinitely slow (Sato and Nakagawa, 2014). Here we analyze an approximate version of the algorithm, SGLD with a constant learning rate. We are interested in the stationary distribution of this approximate inference algorithm.

The discrete-time process that describes Stochastic Gradient Langevin dynamics is

 θt+1 = θt−ϵ2N^∇θL(θt)+√ϵV(t),

where

is a vector of independent Gaussian noises. Following assumptions 1–4 of Section

3, becomes the Wiener noise and the corresponding continuous-time process is

 dθ = −12ϵNAθdt+√ϵdV+ϵ1√SNBdW.

Above, and are vectors of independent Wiener noises, i.e. . The analog of Eq. 13 is then

 12N(AΣ+ΣA) = I+ϵN2SBB⊤.

In the limit of , we find that , meaning the stationary distribution becomes identical to the posterior. However, for non-zero , there are discrepancies. These correction-terms are positive. This shows that the posterior covariance is generally overestimated by Langevin dynamics, which can be attributed to non-vanishing learning rates at long times.

### 5.2 Stochastic Gradient Fisher Scoring

We now investigate Stochastic Gradient Fisher Scoring (Ahn et al., 2012), a scalable Bayesian MCMC algorithm. We use the variational perspective to rederive the Fisher scoring update and identify it as optimal. We also analyze the sampling distribution of the truncated algorithm, one with diagonal preconditioning (as it is used in practice), and quantify the bias that this induces.

The basic idea here is that the stochastic gradient is preconditioned and additional noise is added to the updates such that the algorithm approximately samples from the Bayesian posterior. More precisely, the update can be cast into the following form:

 θ(t+1)=θ(t)−ϵH^g(θ(t))+√ϵHEW(t). (26)

The matrix is a preconditioner and is Gaussian noise; we control the preconditioner and the covariance of the noise. Stochastic gradient Fisher scoring suggests a preconditioning matrix that leads to samples from the posterior even if the learning rate is not asymptotically small. We show here that this preconditioner follows from our variational analysis.

[Stochastic Gradient Fisher Scoring] Under Assumptions A1-A4, the positive-definite preconditioner in Eq. 26 that minimizes KL divergence from the stationary distribution of SGFS to the posterior is

 H∗=2N(ϵBB⊤+EE⊤)−1. (27)

To prove the claim, we go through the steps of Section 3 to derive the corresponding Ornstein-Uhlenbeck process, For simplicity, we have set the minibatch size to . In Appendix E, we derive the following KL divergence between the posterior and the sampling distribution:

 KL(q||p)=−N4Tr(H(ϵBB⊤+EE⊤))+12log|T|+12log|H|+12log|NA|+D2.

( is constant with respect to , , and .) We can now minimize this KL divergence over the parameters and . When is given, minimizing over gives Eq. 27.

Eq. 27 not only minimizes the KL divergence, but makes it , meaning that the stationary sampling distribution is the posterior. This solution corresponds to the suggested Fisher Scoring update in the idealized case when the sampling noise distribution is estimated perfectly (Ahn et al., 2012). Through this update, the algorithm thus generates posterior samples without decreasing the learning rate to zero. (This is in contrast to Stochastic Gradient Langevin Dynamics by Welling and Teh (2011).)

In practice, however, SGFS is often used with a diagonal approximation of the preconditioning matrix (Ahn et al., 2012; Ma et al., 2015). However, researchers have not explored how the stationary distribution is affected by this truncation, which makes the algorithm only approximately Bayesian. We can quantify its deviation from the exact posterior and we derive the optimal diagonal preconditioner, which follows from the KL divergence in Theorem 5.2:

When approximating the Fisher scoring preconditioner by a diagonal matrix or a scalar , respectively, then

Note that we have not made any assumptions about the noise covariance . We can adjust it in favor of a more stable algorithm. For example, in the interests of stability we might want to set a maximum step size , so that for all . We can adjust such that in Eq. 27 becomes independent of . Solving for yields .

Hence, to keep the learning rates bounded in favor of stability, one can inject noise in dimensions where the variance of the gradient is too small. This guideline is opposite to the advice of Ahn et al. (2012) to choose proportional to , but follows naturally from the variational analysis.

An additional benefit of SGFS over simple constant SGD is that the sum of gradient noise and Gaussian noise will always look “more Gaussian” than the gradient noise on its own. An extreme case is when the gradient covariance is not full rank; in this situation injecting full-rank Gaussian noise could prevent degenerate behavior.

## 6 A Bayesian View on Iterate Averaging

We now apply our continuous-time analysis to the technique of iterate averaging (Polyak and Juditsky, 1992). Iterate averaging was derived as an optimization algorithm, and we analyze this algorithm in Section 6.1 from this perspective. In Section 6.2 we show that iterate averaging can also be used as a Bayesian sampling algorithm.

### 6.1 Iterate Averaging for Optimization

Iterate averaging further refines SGD’s estimate of the optimal parameters by averaging the iterates produced by a series of SGD steps. Polyak and Juditsky (1992) proved the remarkable result that averaged SGD achieves the best possible convergence rate among stochastic gradient algorithms111To be more precise, averaged SGD is optimal among methods that only have access to a stochastic gradient oracle—if more is known about the source of the noise then sometimes better rates are possible (e.g., Johnson and Zhang, 2013; Defazio et al., 2014)., including those that use second-order information such as Hessians. This implies that the convergence speed of iterate averaging cannot be improved when premultiplying the stochastic gradient with any preconditioning matrix. We use stochastic calculus to show that the stationary distribution of iterate averaging is the same for any constant preconditioner. Based on slightly stronger-than-usual assumptions, we give a short proof on why this result holds.

To eliminate asymptotic biases, iterate averaging usually requires a slowly decreasing learning rate. We consider a simplified version with a constant rate, also analyzed in (Zhang, 2004; Nemirovski et al., 2009).

Algorithm.   Before we begin iterate averaging, we assume that we have run ordinary constant SGD for long enough that it has reached its stationary distribution and forgotten its intialization. In this scenario, we use iterate averaging to “polish” the results obtained by SGD.

Iterate averaging estimates the location of the minimum of using a sequence of stochastic gradients , and then computes an average of the iterates in an online manner,

 θt+1 = θt−ϵ^gS(θt), (28) ^μt+1 = tt+1^μt+1t+1θt+1.

After stochastic gradient steps and going over to continuous times, this average is

 ^μ≈1T∫T0θ(t)dt≡^μ′. (29)

The average and its approximation

are random variables whose expected value is the minimum of the objective. (Again, this assumes

is drawn from the SGD process’s stationary distribution.)

The accuracy of this estimator after a fixed number of iterations is characterized by its covariance matrix . Using stochastic calculus, we can compute from the autocorrelation matrix of the stationary distribution of the OU process, shown in Appendix G. We prove that

 D≡E[^μ′^μ′⊤]≈1ϵT(Σ(A−1)⊤+A−1Σ). (30)

This approximation ignores terms of order . This term is small since we assume that the number of iterations is much larger than the inverse learning rate . The covariance shrinks linearly over iterations as we expect from Polyak and Juditsky (1992). We use Eq. 13 to derive

 D≈1TSA−1BB⊤(A−1)⊤.

Note that this covariance depends only on the number of iterations times the minibatch size , not on the step size . Since is the total number of examples that are processed, this means that this iterate averaging scheme’s efficiency does not depend on either the minibatch size or the step size, proven first in (Polyak and Juditsky, 1992).

We can make a slightly stronger statement. If we precondition the stochastic gradients with a positive-definite matrix (for example, the inverse of the Hessian evaluated at the minimum), it turns out that the covariance of the estimator remains unchanged. The resulting Ornstein-Uhlenbeck process is The resulting stationary covariance of preconditioned iterate averaging is the same as above:

 D′≈1T(HA)−1(1√SHB)(1√SHB)⊤((HA)⊤)−1=1TSA−1H−1HBB⊤HH−1A−1=1TSA−1BB⊤A−1. (31)

Both the stationary distribution and the autocorrelation matrix change as a result of the preconditioning, and these changes exactly cancel each other out.

This optimality of iterate averaging was first derived by Polyak and Juditsky (1992), using quasi-martingales. Our derivation is based on stronger assumptions, but is shorter.

### 6.2 Finite-Window Iterate Averaging for Posterior Sampling

Above, we used the continuous-time formalism to quickly rederive known results about the optimality of iterate averaging as an optimization algorithm. Next, as in Section 4, we will analyze iterate averaging as an algorithm for approximate posterior inference.

We will show that (under some optimistic assumptions), iterate averaging requires exactly gradient calls to generate one sample drawn from the exact posterior distribution, where is the number of observations. That is, there exist conditions under which iterate averaging generates one true posterior sample per pass over the data. This result is both exciting and discouraging; it implies that, since our assumptions are all optimistic and iterate averaging is known to saturate the Cramér-Rao bound, no black-box stochastic-gradient MCMC algorithm can generate samples in time sublinear in the number of data points.

In addition to Assumptions 1–4 of Section 3, we need an additional assumption for our theoretical considerations to hold:

###### Assumption 5

Assume that the sample size

is large enough that the Bernstein-von Mises theorem

(Le Cam, 1986) applies (hence the posterior is Gaussian). Also assume that the observed dataset was drawn from the model with parameter . Then , that is, the Fisher information matrix equals the Hessian.

This simplifies equation 13:

 AΣ+ΣA=ϵSBB⊤A5⟹Σ=ϵ2SI. (32)

That is, the sampling distribution of SGD is isotropic.

Stationary distribution.   We will consider the sampling distribution of the average of successive samples from the stationary SGD process with step size . Going to the continuous-time OU formalism, we show in Appendix G that the stationary covariance of the iterate averaging estimator defined in Eq. 30 is

 D=1STA−1+1ϵST2UΛ−2(e−ϵTΛ−I)U⊤, (33)

where is orthonormal, is diagonal, and is the eigendecomposition of the Hessian . We have previously assumed that the posterior has covariance . Thus, to leading order in the ratio , the stationary distribution of fixed-window iterate averaging is a scaled version of the posterior.

If we choose , so that we average the iterates of a single pass through the dataset, then the iterate-averaging sampling distribution will have approximately the same covariance as the posterior,

 D⋆=1NA−1+Sϵ1N2UΛ−2(e−ϵSNΛ−I)U⊤=1NUΛ−1(I+Sϵ1NΛ−1(e−ϵSNΛ−I))U⊤. (34)

and

have identical eigenvectors, and their eigenvalues differ by a factor that goes to zero as

becomes large. Conversely, as approaches zero, all of these eigenvalues approach as in Eq. 32. (This can be shown by taking a second-order Maclaurin approximation of .)

Our analysis gives rise to the Iterate Averaging Stochastic Gradient sampler (IASG), described in Algorithm 1. We now investigate its approximation error and efficiency.

Approximation error, step size, and minibatch size.   We now focus on the correction terms that lead to deviations between the iterate averaging estimator’s covariance and the posterior covariance .

The analysis above tells us that we can ignore these correction terms if we choose a large enough . But the analysis in previous chapters assumes that is small enough that assumptions 1–4 hold. These considerations are in tension.

When is “large enough”? Eq. 34 shows that the relative error is largest in the direction of the smallest eigenvalue of (corresponding to the least-constrained direction in the posterior). We will focus our analysis on the relative error in this direction, which is given by

 errmax≡Sϵ1Nλmin(e−SϵNλmin−1).

We are given and , but we can control . To make small, we must therefore choose for some constant . So larger datasets and lower-variance posteriors let us use smaller stepsizes.

It is hard to say in general how small needs to be to satisfy assumptions 1–4. But if assumption 4 is satisfied (i.e., the cost is approximately quadratic), then assumption 3 (no discretization error) cannot hold if . This is the step size at which the discretized noise-free gradient descent process becomes unstable for quadratic costs. (We define analogous to .)

Combining this observation with the analysis above, we see that this iterate averaging scheme will not work unless

 2Sλmax>ϵS>cNλmin⇒2c>SNλmaxλmin.

That is, we need the dataset size to be large enough relative to the condition number of the Hessian if this simple iterate-averaging scheme is to generate good posterior samples. If the condition number is large relative to , then it may be necessary to replace the scalar step size with a preconditioning matrix to reduce the effective condition number of the Hessian .

Efficiency.   Next, we theoreticaly analyze the efficiency with which iterate averaging can draw samples from the posterior, and compare this method to other approaches. We assume that the cost of analyzing a minibatch is proportional to . We have shown above that we need to average over samples of SGD to create a sample of iterate averaged SGD. Since this averaging induces a strong autocorrelation, we can only use every th sample of the chain of averaged iterates. Furthermore, every sample of SGD incurs a cost at least proportional to where is the dimensionality of . This means that we cannot generate an independent sample from the posterior in less than time; we must analyze at least observations per posterior sample.

We compare this result with the more classical strategy of estimating the posterior mode via Newton’s method (which has an cost) and then estimating the posterior covariance by computing the inverse-Hessian at the mode, again incurring an cost. By contrast, getting an unbiased full-rank estimate of the covariance using MCMC requires generating at least samples, which again costs . If , then this is within a constant cost of the classical approach.

However, it is conceivable that Polyak averaging (Section 6.2) could be used to estimate the first few principal components of the posterior relatively quickly (i.e., in time). This corresponds to finding the smallest principal components of , which cannot be done efficiently in general. A related question is investigated experimentally in Section 7.2.

The analysis above implies an upper bound on the efficiency of stochastic-gradient MCMC (SGMCMC) methods. The argument is this: given that assumptions 1–5 hold, suppose that there exists an SGMCMC method that, for large , is able to generate effectively independent posterior samples using operations. Then, if we wanted to estimate the posterior mode, we could simply average some large number of those samples to obtain an estimator whose covariance would be . This approach would require operations, whereas iterate averaging would require operations to obtain an estimator with the same covariance. But this contradicts the result of Polyak and Juditsky (1992) that no stochastic-gradient-oracle algorithm can outperform iterate averaging. Thus, the optimality of iterate averaging as an optimization algorithm, taken with assumptions 1–5, implies that no SGMCMC algorithm can generate posterior samples in sublinear time222At least, not without exploiting additional knowledge about the source of gradient noise as do methods like SVRG and SAGA (Johnson and Zhang, 2013; Defazio et al., 2014)..

This argument relies on assumptions 1–5 being true, but one can easily construct scenarios in which they are violated. However, these assumptions are all optimistic; there seems (to us) little reason to think that problems that violate assumptions 1–5 will be easier than those that do not.

## 7 Experiments

We test our theoretical assumptions from Section 3 and find good experimental evidence that they are reasonable in some settings. We also investigate iterate averaging and show that the assumptions outlined in 6.2 result in samples from a close approximation to the posterior. We also compare against other approximate inference algorithms, including SGLD (Welling and Teh, 2011), NUTS (Hoffman and Gelman, 2014), and black-box variational inference (BBVI) using Gaussian reparametrization gradients (Kucukelbir et al., 2015). In Section 7.3 we show that constant SGD lets us optimize hyperparameters in a Bayesian model.

### 7.1 Confirming the Stationary Distribution’s Covariance

In this section, we confirm empirically that the stationary distributions of SGD with KL-optimal constant learning rates are as predicted by the Ornstein-Uhlenbeck process.

Real-world data.   We first considered the following data sets.

• The Wine Quality Data Set333P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, ’Wine Quality Data Set’, UCI Machine Learning Repository., containing instances, features, and one integer output variable (the wine rating).

• A data set of Protein Tertiary Structure444Prashant Singh Rana, ’Protein Tertiary Structure Data Set’, UCI Machine Learning Repository., containing instances, features and one output variable.

• The Skin Segmentation Data Set555Rajen Bhatt, Abhinav Dhall, ’Skin Segmentation Dataset’, UCI Machine Learning Repository., containing instances, features, and one binary output variable.

We applied linear regression on data sets and and applied logistic regression on data set . We rescaled the feature to unit length and used a mini-batch of size , and for the three data sets, respectively. The quadratic regularizer was . The constant learning rate was adjusted according to Eq. 15.

Fig. 1 shows two-dimensional projections of samples from the posterior (blue) and the stationary distribution (cyan), where the directions were chosen two be the smallest and largest principal component of the posterior. Both distributions are approximately Gaussian and centered around the maximum of the posterior. To check our theoretical assumptions, we compared the covariance of the sampling distribution (yellow) against its predicted value based on the Ornstein-Uhlenbeck process (red), where very good agreement was found. Since the predicted covariance is based on approximating SGD as a multivariate Ornstein-Uhlenbeck process, we conclude that our modeling assumptions are satisfied to a very good extent. Since the wine dataset is smaller than the skin segmentation data set, it has a broader posterior and therefore requires a larger learning rate to match it. For this reason, discretization effects play a bigger role and the stationary distribution of preconditioned SGD on wine does not exactly match the posterior. The unprojected -dimensional covariances on wine data are also compared in Fig. 2. The rightmost column of Fig. 1 shows the sampling distributions of black box variational inference (BBVI) using the reparametrization trick (Kucukelbir et al., 2015). Our results show that the approximation to the posterior given by constant SGD is not worse than the approximation given by BBVI.

We also computed KL divergences between the posterior and stationary distributions of various algorithms: constant SGD with KL-optimal learning rates and preconditioners, Stochastic Gradient Langevin Dynamics, Stochastic Gradient Fisher Scoring (with and without diagonal approximation) and BBVI. For SG Fisher Scoring, we set the learning rate to of Eq. 15, while for Langevin dynamics we chose the largest rate that yielded stable results ( for data sets , and , respectively). Table 1 summarizes the results. We found that constant SGD can compete in approximating the posterior with the MCMC algorithms under consideration. This suggests that the most important factor is not the artificial noise involved in scalable MCMC, but rather the approximation of the preconditioning matrix.

### 7.2 Iterate Averaging as Approximate MCMC

In the following, we show that under the assumptions specified in Section 6.2, iterate averaging with a constant learning rate and fixed averaging window results in samples from the posterior.

Synthetic data.   In order to strictly satisfy the assumptions outlined in Section 6.2, we generated artificial data that came from the model. We chose a linear regression model with a Gaussian prior with precision . We first generated covariates by drawing them from a

dimensional Gaussian with unit covariance. We drew the true weight vector from the prior. We then generated the corresponding response variables from the linear regression model.

In a first experiment, we confirmed that iterate averaging may result in samples from the exact posterior, shown in Fig. 3. The left panel shows the empirical covariance matrix of the iterates of SGD. The middle and right panel show the posterior covariance and empirical covariance of the averaged iterates, respectively. As predicted by our theory, there is a very strong resemblance which demonstrates that constant-rate iterate averaging with the right rates and averaging windows may actually result in samples from the posterior. To generate this plot, we then ran constant SGD with a constant learning rate for iterations and a minibatch size and used an averaging window of , as predicted by the theory presented earlier in order to achieve samples from the posterior.

Next, we analyzed the convergence speed of Iterate-Averaged Stochastic Gradients (IASG, see Algorithm 1) compared to related methods. Fig. 4 shows these comparisons for three modern scalable MCMC algorithms: SGLD (Welling and Teh, 2011) and NUTS (Hoffman and Gelman, 2014). We investigated how quickly these algorithms could give us estimates of the posterior covariances. To better visualize the convergence behavior, we focussed on diagonal entries of the posterior covariance, the marginal variances. As before, the data were generated from the model such the theoretical assumptions of 6.2 applied. We then ran the samplers for up to effective passes through the data. For IASG and SGLD we used a minibatch size of and an averaging window of . The constant learning rate of IASG was and for SGLD we decreased the learning rate according to the Robbins-Monro schedule of where we found to be optimal. NUTS automatically adjusts its learning rate and uses non-stochastic gradients.

The left column of Fig. 4 shows the results of these experiments. We found the convergence speeds of the samplers to be highly dependent on whether we optimized the samplers in the maximum posterior mode (termed MAP-initialization: this partially eleminates the initial bias and burn-in phase) or whether the samplers were initialized randomly, as in a real-world application. Thus, we show both results: while the the right column shows random initializations, the left one shows MAP-initialization. The rows of Fig. 4 show results of IASG (top), SGLD (middle), and NUTS (bottom). Each entry shows the smallest and largest marginal variance of the posterior over iterations, as estimated from these methods, where we excluded the first 10 iterations to avoid large biases. We also give the standard deviations for these estimates based on 100 independent Markov chains. The solid red lines show the ground truth of these marginal variances. Fig. 5 shows additional results on the same experiments, where we display the posterior estimates of the three different samplers under the two different initializations.

We found that in both initializations, IASG can find a fast approximate solution to the posterior. It uses stochastic gradients which gives it a competitive advantage over NUTS (which uses full gradients) in particular in the early search phase of the sampler. Langevin dynamics behaves similarly at early iterations (it also employs stochastic gradients). However, compared to IASG, we see that the Langevin algorithm has a much larger standard error when estmating the posterior covariances. This is because it uses a decreasing Robbins-Monro learning rate that slows down equilibration at long times. In contrast, IASG uses a constant learning rate and therefore converges fast. Note that in practice a large variance may be as bad as a large bias, especially if posterior estimates are based on a single Markov chain. This is also evident in Fig.

5, which shows that for random initialization, both NUTS and SGLD reveal less of the structure of the true posterior covariance compared to IASG.

When initialized in the posterior maximum, we see that all algorithms perform reasonably well (though SGLD’s estimates are still highly variable even after sweeps through the dataset). IASG converges to a stable estimate much faster than SGLD or NUTS, but produces slightly biased estimates of the smallest variance.

### 7.3 Optimizing Hyperparameters

We test the hypothesis of Section 4.2, namely that constant SGD as a variational algorithm gives rise to a variational EM algorithm where we jointly optimize hyperparameters using gradients while drawing samples form an approximate posterior. To this end, we experimented with a Bayesian multinomial logistic (a.k.a. softmax) regression model with normal priors. The negative log-joint is

 L≡−logp(y,θ|x)=λ2∑d,kθ2dk−DK2log(λ)+DK2log2π+∑nlog∑kexp{∑dxndθdk}−∑dxndθdyn, (35)

where indexes examples, indexes features and indexes classes. is the feature vector for the th example and is the class for that example. Eq. 35 has the degenerate maximizer , , which has infinite posterior density which we hope to avoid in our approach.

Real-world data.

In all experiments, we applied this model to the MNIST dataset (

training examples, test examples, features) and the cover type dataset ( training examples, testing examples, features).

Fig. 6 shows the validation loss achieved by maximizing equation 35 over for various values of . This gives rise to the continuous blue curve. The value for constant SGD was obtained using Eq. 22, hence using constant learning rates for and decreasing learning rates for . BBVI was carried out by optimizing a variational lower bound in mean-field variational inference, and optimizing hyperparameters based on this lower bound.

The results suggest that BBVI and constant SGD yield similar results. Thus, constant SGD can be used as an inexpensive alternative to cross-validation or other VEM methods for hyperparameter selection.

## 8 Conclusions

In this paper, we built on a stochastic process perspective of stochastic gradient descent and various extensions to derive several new results. Under specified assumptions, SGD is approximated by a multivariate Ornstein-Uhlenbeck process, which possesses an analytic solution. We computed the stationary distribution of constant SGD and analyzed its properties.

We analyzed SGD together with several extensions, such as momentum, preconditioning, and iterate averaging. The shape of the stationary distribution is controlled by the parameters of the algorithm such as the constant learning rate, preconditioning matrix, or averaging period. We can thus tune these parameters to minimize the Kullback-Leibler divergence between the stationary distribution and a Bayesian posterior. This view uses these stochastic optimization algorithms as approximate inference. We also analyzed stochastic gradient Langevin dynamics and stochastic gradient Fisher scoring and were able to analyze approximation errors for these algorithms.

The Bayesian view on constant-rate SGD allows us to use this algorithm as a new variational EM algorithm. We suggested and tested a double SGD scheme which uses decreasing learning rates on the hyperparameters and constant learning rates on the main parameters. We showed that this is both easy to implement and prevents us from finding degenerate solutions; it is a cheap alternative to cross-validation for many complex models.

Last, our analysis suggests the many similarities between sampling and optimization algorithms that can be explored using the stochastic process perspective. A future direction might be to explore similarities in the noise characteristics of black-box variational inference algorithms and Langevin-type MCMC. Further exploring the use of iterate averaging as a Bayesian algorithm is another interesting avenue for further studies.

We would like to thank Yingzhen Li and Thomas Hofmann for their valuable feedback on our manuscript.

## Appendix A Examples: Ornstein-Uhlenbeck Formalism

Let us illustrate the Ornstein-Uhlenbeck formalism based on two simple examples. First, consider the following quadratic loss,

 L(θ)=−12N∑Nn=1||xn−θ||2. (36)

Let us define as the empirical mean of the data points. The gradient is , and the stochastic gradient is . Because the gradient is linear in , the noise covariance is just the covariance of the data: We can shift the parameter , resulting in . Note that the Hessian is just unity. According to Eq. 38,

 q(θ)∝exp{−S2ϵ(θ−¯x)⊤Σ−1x(θ−¯x)}.

as the resulting stationary distribution. Next, consider linear regression, where we minimize

 L(θ)=−12N∑n(yn−x⊤nθ)2. (37)

We can write the stochastic gradient as , where and are estimates based on a mini-batch of size . The sampling noise covariance is , where . We see that the noise covariance is quadratic, but unfortunately it cannot be further simplified.

Fig. 1 shows the objective function of linear regression (blue) and the sampling distribution of stochastic gradient descent (yellow) on simulated data. We see that both distributions do not coincide, because the sampling distribution is also affected by the noise covariance.

## Appendix B Stationary Covariance

The Ornstein-Uhlenbeck process has an analytic solution in terms of the stochastic integral (Gardiner et al., 1985),

 θ(t)=exp(−At)θ(0)+√ϵS∫t0exp[−A(t−t′)]BdW(t′) (38)

Following Gardiner’s book and using , we derive an algebraic relation for the stationary covariance of the multivariate Ornstein-Uhlenbeck process. Define . Using the formal solution for given in the main paper, we find

 AΣ+ΣA =ϵS∫t−∞Aexp[−A(t−t′)]BB⊤exp[−A(t−t′)]dt′ +ϵS∫t−∞exp[−A(t−t′)]BB⊤exp[−A(t−t′)]dt′A =ϵS∫t−∞ddt′(exp[−A(t−t′)]BB⊤exp[−A(t−t′)]) =ϵSBB⊤.

We used that the lower limit of the integral vanishes by the positivity of the eigenvalues of .

## Appendix C Reverse KL Divergence Setup

It is interesting to also consider the case of trying to minimize the reverse KL divergence, i.e.