Often, the distribution describes either a population distribution or an empirical distribution over a given dataset. At each iteration, the SGLD is updated by
Here is the stochastic gradient of the objective function, are i.i.d. samples from , is a standard $d$-dimensional Gaussian random vector, and is the step size parameter. Compared with stochastic gradient descent (SGD), SGLD imposes a larger step size for the noise term (i.e., instead of for SGD), which allows SGLD to aptly navigate a landscape containing multiple critical points. SGLD obtains its name because it is a discrete approximation to the continuous Langevin diffusion process, which can be described by the following stochastic differential equation (SDE)
Theoretically, SGLD has been studied from various perspectives. Statistically, SGLD has been shown to generalize better than the simple stochastic gradient descent (SGD) algorithm (Mou et al., 2018; Tzen et al., 2018). From the optimization point of view, it is well known that SGLD traverses all stationary points asymptotically. More recently, quantitative characterizations of the mixing time have been derived (Raginsky et al., 2017; Xu et al., 2018a). However, the bounds in these papers often depend on a quantity called the spectral gap of the Langevin diffusion process (Equation (1.2)), which in general has an exponential dependence on the dimension. We refer readers to Section 2 for more discussion.
While these bounds are pessimistic, in many machine learning applications finding a local minimum is already useful. In other words, we only need a bound on the hitting time of critical points rather than a mixing time bound (see Section 1.1 for the precise definition). To the best of our knowledge, Zhang et al. (2017) is the first work studying the hitting time property of SGLD. The analysis of Zhang et al. (2017) consists of two parts. First, they defined a geometric quantity called the Cheeger’s constant of the target regions, and showed that the Cheeger’s constant of certain regions (e.g., the union of all approximate local minima) can be estimated. Next, they derived a generic bound that relates the hitting time of SGLD to this Cheeger’s constant. Through these two steps, they showed that the hitting time of critical points has only a polynomial dependence on the dimension.
However, due to this two-step analysis framework, the resulting hitting time bound is often not tight. Technically, it is very challenging to accurately estimate the Cheeger’s constant of the region of interest. In particular, in many machine learning problems, useful structures such as low rank and sparsity are available, which can potentially be exploited by SGLD to achieve faster convergence. Therefore, a natural research question is: instead of using a two-stage approach, is there a direct method to obtain tighter hitting time bounds for SGLD that can incorporate underlying structural assumptions?
In this paper we consider the hitting time of SGLD to approximate first order and second order stationary points. For both types of stationary points, we provide a simple analysis of the hitting time of SGLD, which is motivated by a succinct continuous-time analysis. Notably, our analysis relies only on basic real analysis, linear algebra, and probability tools. In contrast to the indirect approach adopted by Zhang et al. (2017), we directly estimate the hitting time of SGLD and thus obtain tighter bounds in terms of the dimension, error metric, and other problem-dependent quantities such as smoothness. Compared with Zhang et al. (2017), our results have two main advantages. First, our results are applicable to decreasing step sizes, the setting widely used in practice, whereas previous analyses (including Zhang et al. (2017)) mainly consider the constant step size setting, which limits their potential applications. Second, in certain scenarios (see Section 5), we can obtain dimension-independent hitting time bounds, whereas the bounds in Zhang et al. (2017) all require at least a polynomial dependence on the dimension.
1.1 Preliminaries and Problem Setup
We use to denote the Euclidean norm of a finite-dimensional vector. We also use to denote the inner product of two vectors. For a real symmetric matrix , we use to denote its largest eigenvalue and its smallest eigenvalue. Let denote standard Big-O notation, only hiding absolute constants.
In this paper we use either symbol , or to denote problem dependent parameters. The difference is that the -constants can often be picked independently of the dimension, while the and -constants usually increase with the dimension. As a typical example, the spectral norm of the $d$-dimensional identity matrix remains $1$ for any $d$, but its trace increases linearly with $d$. On the other hand, the constants are in practice controlled by the batch sizes. Writing the two types of constants differently helps us interpret the performance of SGLD in high dimensional settings. Note, however, that our results hold even if the -constants increase with , which is possible in certain scenarios. We denote , , . Throughout the paper, we use to denote an absolute constant, which may change from line to line.
In this paper we focus on SGLD defined in Equation (1.1). Note that the stochastic gradient can be decomposed into two parts: one part is its mean , the other part is the difference between the stochastic gradient and its mean:
We note that is a martingale difference sequence. With this notation, we can write the SGLD iterates as
Note that the step size is not a constant. This is crucial, since we know for SGD to converge, the step size needs to converge to zero. In practice, it is often picked as an inverse of polynomial,
where leads to a constant step size.
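As a minimal sketch of the iteration described above, the following Python snippet runs SGLD with a polynomially decaying step size. The names `step_size`, `sgld`, `grad_fn`, `eta0`, `alpha`, and `delta0`, the schedule `eta0 * n**(-alpha)`, and the noise scaling `delta0 * sqrt(2 * eta)` are assumptions made for illustration; they are not taken from the equations of this paper.

```python
import numpy as np

def step_size(n, eta0=0.1, alpha=0.5):
    """Assumed schedule eta_n = eta0 * n**(-alpha); alpha = 0 gives a constant step."""
    return eta0 * n ** (-alpha)

def sgld(grad_fn, x0, n_iters, eta0=0.1, alpha=0.5, delta0=0.1, rng=None):
    """Run SGLD and return all iterates, including the starting point.

    grad_fn(x, rng) returns a stochastic gradient at x (hypothetical interface).
    The injected noise is delta0 * sqrt(2 * eta) * N(0, I), an assumed scaling.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for n in range(1, n_iters + 1):
        eta = step_size(n, eta0, alpha)
        noise = rng.standard_normal(x.shape)
        # Gradient step plus Gaussian perturbation of order sqrt(eta).
        x = x - eta * grad_fn(x, rng) + delta0 * np.sqrt(2.0 * eta) * noise
        path.append(x.copy())
    return np.array(path)
```

Setting `delta0=0` recovers plain SGD under these assumptions, which makes it easy to compare the two methods on the same objective.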
We study the hitting time of SGLD. For a region of interest, the hitting time to is defined as the first time that the SGLD sequence falls into the region :
In this paper we are interested in finding approximate first order stationary points (FOSP) and second order stationary points (SOSP) of the objective function. Note that finding an FOSP or SOSP is sufficient for many machine learning applications. For convex problems, an FOSP is already a global minimum, and for certain problems like matrix completion, an SOSP is a global minimum and enjoys good statistical properties as well (Ge et al., 2017). Please see Section 2 for further discussion.
Definition 1.1 (Approximate First Order Stationary Points (FOSP)).
Given , we define approximate first order stationary points as
and denote the corresponding hitting time.
Definition 1.2 (Approximate Second Order Stationary Points (SOSP)).
Given , we define approximate second order stationary points as
and denote the corresponding hitting time.
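The two definitions and the notion of hitting time can be sketched in code as follows. This is an illustration under assumptions: the function names are hypothetical, and the curvature threshold `curv_tol` in the SOSP test is a free parameter introduced here, since the exact threshold used in Definition 1.2 is not recoverable from the text.

```python
import numpy as np

def is_fosp(grad, eps):
    """Approximate FOSP test: gradient norm at most eps."""
    return np.linalg.norm(grad) <= eps

def is_sosp(grad, hess, eps, curv_tol):
    """Approximate SOSP test (assumed form): small gradient and
    smallest Hessian eigenvalue not below -curv_tol."""
    lam_min = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns ascending eigenvalues
    return is_fosp(grad, eps) and lam_min >= -curv_tol

def hitting_time(path, in_region):
    """First index n with path[n] in the region, or None if it is never reached."""
    for n, x in enumerate(path):
        if in_region(x):
            return n
    return None
```

For example, at the origin of the function 0.5 * (x1**2 - x2**2) the gradient vanishes, so the point passes `is_fosp`, but the Hessian has eigenvalue -1, so it fails `is_sosp` for any `curv_tol` below 1.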
This paper is organized as follows. In Section 2, we review related works. In Section 3, we present our hitting time analysis of SGLD to the first order stationary points. In Section 4, we extend our analysis to second order stationary points. In Section 5, we provide three examples to illustrate the convergence of SGLD for machine learning applications. We conclude and discuss future directions in Section 6. Proofs for technical lemmas and the results in Section 5 are deferred to the appendix.
2 Related Works
From the optimization point of view, Raginsky et al. (2017) gave a non-asymptotic bound showing that SGLD finds an approximate global minimizer in under certain conditions. In particular, this bound is a mixing-time bound and it depends on the inverse of the uniform spectral gap parameter of the Langevin diffusion dynamics (Equation (1.2)), which is in general . More recently, Xu et al. (2018a) improved the dimension dependency of the bound and analyzed the finite-sum setting and the variance reduction variant of SGLD. All these bounds depend on the spectral gap parameter, which scales exponentially with the dimension. Tzen et al. (2018) gave a finer analysis of the recurrence time and escaping time of SGLD through empirical metastability. While this result does not depend on the spectral gap, there is not much discussion of how SGLD escapes from saddle points. Therefore, finding a global minimum for a general non-convex objective with a good dimension dependence might be too ambitious.
On the other hand, in many machine learning problems, finding an FOSP or an SOSP is sufficient. Recently, a line of work has shown that an FOSP or SOSP already provides good statistical properties for achieving desirable prediction performance. Examples include matrix factorization, neural networks, dictionary learning, etc. (Hardt and Ma, 2016; Ge et al., 2015; Sun et al., 2017; Ge et al., 2017; Park et al., 2017; Bhojanapalli et al., 2016; Du and Lee, 2018; Du et al., 2018a; Ge et al., 2018; Du et al., 2018b; Mei et al., 2017).
These findings motivate research on designing provable algorithms to find FOSP and SOSP. For FOSP, it is well known that stochastic gradient descent finds an FOSP in polynomial time (Ghadimi and Lan, 2016), and this result was recently improved by Allen-Zhu and Hazan (2016); Reddi et al. (2016) in the finite-sum setting. For finding SOSP, Lee et al. (2016) showed that if all SOSPs are local minima, randomly initialized gradient descent with a fixed step size also converges to local minima almost surely. The classical cubic-regularization (Nesterov and Polyak, 2006) and trust region (Curtis et al., 2014) algorithms find SOSP in polynomial time if full Hessian matrix information is available. Later, Carmon et al. (2018); Agarwal et al. (2017); Carmon and Duchi (2016) showed that the requirement of full Hessian access can be relaxed to Hessian-vector products. When only gradient information is available, a line of work shows that noise injection helps escape from saddle points and find an SOSP (Jin et al., 2017; Du et al., 2017; Allen-Zhu, 2018; Jin et al., 2018; Levy, 2016). If we only have access to stochastic gradients, Ge et al. (2015) show that adding perturbation in each iteration suffices to escape saddle points in polynomial time. The convergence rates were later improved by Allen-Zhu and Li (2018); Allen-Zhu (2018); Xu et al. (2018b); Yu et al. (2018); Daneshmand et al. (2018).
Theoretically, however, there is a significant difference between SGD-based algorithms and SGLD. In SGD, the squared norm of the noise scales quadratically with the step size, which has a smaller order than the true gradient. On the other hand, in SGLD, the squared norm of the noise scales linearly with the step size, which has the same order as the true gradient. In this case, the noise in SGLD is bounded away from zero, which enables SGLD to escape saddle points. Nevertheless, this escape mechanism is subtle, and it requires a careful balance of hyperparameters and sophisticated analysis. To our knowledge, Zhang et al. (2017) is the only work that has studied the hitting time of stationary points. As we discussed in Section 1, their analysis is indirect, which leads to loose bounds. In this paper, we directly analyze the hitting time of SGLD and obtain tighter bounds.
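One way to read the scaling comparison above is through a short calculation. The notation here is assumed for illustration and need not match the paper's: $\eta$ is the step size, $\xi$ is the stochastic gradient noise with $\mathbb{E}\|\xi\|^2 \le \sigma^2$, $\zeta$ is a standard $d$-dimensional Gaussian vector, and $\delta$ is a hypothetical noise level for the injected perturbation.

```latex
\begin{align*}
\text{first order decrease from the gradient step:}\quad
  & \langle \nabla F(x), -\eta\,\nabla F(x)\rangle = -\eta\,\|\nabla F(x)\|^2
  && \text{(order } \eta\text{)}, \\
\text{SGD noise:}\quad
  & \mathbb{E}\,\|\eta\,\xi\|^2 = \eta^2\,\mathbb{E}\,\|\xi\|^2 \le \eta^2\sigma^2
  && \text{(order } \eta^2\text{)}, \\
\text{SGLD injected noise:}\quad
  & \mathbb{E}\,\|\delta\sqrt{2\eta}\,\zeta\|^2 = 2\,\delta^2\,\eta\,\mathbb{E}\,\|\zeta\|^2 = 2\,\delta^2\,\eta\,d
  && \text{(order } \eta\text{)}.
\end{align*}
```

Under these assumptions the injected noise enters a second order Taylor expansion of the objective at the same order in $\eta$ as the gradient step, while the SGD noise is one order smaller.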
Finally, beyond improving the training process for non-convex learning problems, SGLD and its extensions have also been widely used in Bayesian learning (Welling and Teh, 2011; Chen et al., 2015; Ma et al., 2015; Dubey et al., 2016; Ahn et al., 2012) and approximate sampling (Brosse et al., 2017; Bubeck et al., 2018; Durmus et al., 2017; Dalalyan, 2017a, b; Dalalyan and Karagulyan, 2019). These directions are orthogonal to ours because their primary goal is to characterize the probability distribution induced by SGLD.
3 Hitting Time to First Order Stationary Points
In this section we analyze the hitting time to FOSP by SGLD. We first use a continuous time analysis to illustrate the main proof idea.
3.1 Warm Up: A Continuous Time Analysis
Recall that if we let the step size , the dynamics of SGLD can be characterized by an SDE
Using Ito’s formula, we obtain the dynamics of :
Now given and recall is the hitting time of the first order stationary points (FOSP). By Dynkin’s formula, for any , we have
Note that before , the gradient satisfies . If we assume the Hessian satisfies then . Using the assumption that is non-negative, we can obtain the following estimate
Therefore, re-arranging terms, we have
Applying Markov’s inequality, we have
The above derivation shows that if we pick a small and is large enough, SGLD hits an approximate first order stationary point in time less than with high probability. In the next section, we use this insight from the continuous time analysis to derive a hitting time bound for the discrete time SGLD algorithm.
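The chain of steps in this warm-up (Ito's formula, Dynkin's formula, the gradient and Hessian bounds, and Markov's inequality) can be sketched as below. All symbols are assumptions introduced for this sketch and may differ from the paper's own notation and constants: $\delta$ is the noise level of the diffusion, $C$ is an upper bound with $\nabla^2 F \preceq C I$, $d$ is the dimension, $\epsilon$ is the gradient threshold, $\tau$ is the hitting time of the set $\{\|\nabla F\| \le \epsilon\}$, $T$ is a time horizon, and $x_0$ is the starting point.

```latex
\begin{align*}
dX_t &= -\nabla F(X_t)\,dt + \sqrt{2}\,\delta\,dW_t, \\
dF(X_t) &= \bigl(-\|\nabla F(X_t)\|^2 + \delta^2\,\mathrm{tr}\,\nabla^2 F(X_t)\bigr)\,dt
          + \sqrt{2}\,\delta\,\langle \nabla F(X_t), dW_t\rangle, \\
\mathbb{E}\,F(X_{\tau\wedge T}) - F(x_0)
  &= \mathbb{E}\int_0^{\tau\wedge T}\bigl(-\|\nabla F(X_s)\|^2 + \delta^2\,\mathrm{tr}\,\nabla^2 F(X_s)\bigr)\,ds
   \le -(\epsilon^2 - \delta^2 C d)\,\mathbb{E}[\tau\wedge T], \\
\mathbb{E}[\tau\wedge T] &\le \frac{F(x_0)}{\epsilon^2 - \delta^2 C d}
   \quad\text{(using } F \ge 0 \text{ and assuming } \delta^2 C d < \epsilon^2\text{)}, \\
\mathbb{P}(\tau \ge T) &\le \frac{\mathbb{E}[\tau\wedge T]}{T}
   \le \frac{F(x_0)}{T\,(\epsilon^2 - \delta^2 C d)} .
\end{align*}
```

In this form the two requirements in the text are visible: $\delta$ must be small enough that $\delta^2 C d < \epsilon^2$, and $T$ must be large enough that the final ratio falls below the desired failure probability.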
3.2 Discrete Time SGLD Analysis
We first list technical assumptions for bounding the hitting time. The first assumption is on the objective function.
There exists such that for all , the objective function satisfies .
This condition assumes the spectral norm of the Hessian is bounded. It is a standard smoothness assumption, which guarantees that the gradient descent algorithm can hit an approximate first order stationary point.
Our second assumption is on the noise from the stochastic gradient.
There exists such that for all and any , the gradient noise defined in (1.3) satisfies,
In the sequel, we consider the natural filtration that describes the information up to the -th iteration,
while and denote the conditional probability and expectation with respect to .
This assumption states that the noise has bounded moments. Such an assumption is necessary for guaranteeing convergence even for SGD. Note that here also implies by the Cauchy-Schwarz inequality. Furthermore, using the property of the spectral norm, we know also implies . Here we explicitly assume for some in order to exploit certain finer properties of the problem. Now we are ready to state our first main theorem. Please recall the definition of from (1.4), and and from (1.5).
Let be the desired accuracy and be the failure probability. Suppose we set . Then there is an absolute constant that depends only on such that
for , if , or,
for , if , or,
for , if and ,
This theorem states that SGLD can easily hit a first order stationary point. As compared with Zhang et al. (2017), we provide an explicit hitting time estimate since we use a more direct analysis. Note for different , we have different hitting time estimates. The reason will be clear in the following proof. Also note that the number of iterations can be independent of the dimension, as long as and are dimension independent, and the diffusion parameter uses the correct scaling.
Proof of Theorem 3.3.
Our proof closely follows the continuous time analysis in the previous section. Denote . This quantity corresponds to the quantity in the continuous time analysis. To proceed, we expand one iteration
where is some number in . The last term can be bounded by
and furthermore by Hölder’s inequality, i.e. ,
Summing this bound over all and taking total expectation, we obtain
Rearranging terms and recalling that is nonnegative, we have
Combining this bound and Markov’s inequality, we have
Lastly, note that
Plugging in our choice and , we have that the right hand side is smaller than . ∎
4 Hitting Time Analysis to Second Order Stationary Points
In this section we analyze the hitting time of second order stationary points. The key insight here is that because we add Gaussian noise at each iteration, the accumulated noise together with the negative eigenvalue of the Hessian will decrease the function value. Again, we first use a continuous time analysis on a simple example to illustrate the main idea.
4.1 Warm Up: A Continuous Time Analysis for Escaping Saddle Points
To motivate our analysis, we demonstrate how the Langevin dynamics escapes a strict saddle point. For this purpose, we assume , and with being a symmetric matrix with . This example characterizes the situation when SGLD starts at a saddle point. The resulting Langevin diffusion is actually an Ornstein-Uhlenbeck (OU) process,
Knowing it is an OU process, it can be written through an explicit formula,
Thus by Ito’s isometry, if has eigenvalues ,
Since is a strict saddle point, i.e., , we can pick to make , which indicates that has escaped the saddle point.
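The escape mechanism in this warm-up can be checked numerically. The sketch below assumes the OU process has the form dX = -H X dt + sqrt(2) * delta dW started at the origin, where `delta` is a hypothetical noise level and the exact scaling in the paper may differ. Under that assumption each eigen-direction of H with eigenvalue lam contributes delta^2 * (1 - exp(-2 * lam * t)) / lam to the second moment, which grows exponentially in t when lam is negative.

```python
import numpy as np

def ou_second_moment(eigs, t, delta=1.0):
    """E||X_t||^2 for dX = -H X dt + sqrt(2)*delta dW with X_0 = 0.

    eigs are the eigenvalues of the symmetric matrix H (assumed nonzero here).
    Each eigen-direction contributes delta^2 * (1 - exp(-2*lam*t)) / lam.
    """
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(delta**2 * (-np.expm1(-2.0 * eigs * t)) / eigs))

def simulate_second_moment(eigs, t, delta=1.0, n_steps=200, n_paths=20000, seed=0):
    """Euler-Maruyama estimate of E||X_t||^2, simulated in the eigenbasis of H."""
    rng = np.random.default_rng(seed)
    eigs = np.asarray(eigs, dtype=float)
    h = t / n_steps
    x = np.zeros((n_paths, eigs.size))
    for _ in range(n_steps):
        x = x - h * eigs * x + delta * np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return float(np.mean(np.sum(x**2, axis=1)))
```

Comparing `ou_second_moment([1.0, -1.0], t)` for increasing `t` shows the contribution of the negative eigenvalue dominating, which is the sense in which the process leaves a neighborhood of the saddle point.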
4.2 Escaping Saddle Points
In this section, we provide theoretical justification for why SGLD is able to escape strict saddle points. In addition to Assumptions 3.1 and 3.2 made in Section 3, we also need some additional regularity conditions to guarantee that SGLD escapes from strict saddle points.
We assume the following hold for the objective function .
There exist such that for all pairs of , . Note that is defined in Assumption 3.1.
There exists such that for all .
There exists such that for all , we assume the .
The first assumption states that the Hessian is Lipschitz and the second condition states that the function value is bounded. These two conditions are widely adopted in papers analyzing how first order methods can escape saddle points (Ge et al., 2015; Jin et al., 2017). The third condition states that the sum of positive eigenvalues is bounded. Note there is a naïve upper bound . However, in many cases can be much smaller than . We provide several examples in Section 5. So we take as a separate parameter in order to exploit more refined properties of the problem.
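The gap between the sum of positive eigenvalues and a dimension-based bound can be illustrated with a small computation. The function names below are hypothetical, and the comparison quantity (dimension times spectral norm) is one natural reading of the naïve bound mentioned above, stated here as an assumption.

```python
import numpy as np

def positive_eig_sum(hess):
    """Sum of the positive eigenvalues of a symmetric matrix."""
    eigs = np.linalg.eigvalsh(hess)
    return float(np.sum(eigs[eigs > 0]))

def naive_bound(hess):
    """Dimension times spectral norm, which always dominates positive_eig_sum."""
    d = hess.shape[0]
    return d * float(np.linalg.norm(hess, 2))

# Example: a rank-2 Hessian in dimension 100 with one positive
# and one negative eigenvalue.
d = 100
u = np.zeros(d)
u[0] = 1.0
v = np.zeros(d)
v[1] = 1.0
hess = 3.0 * np.outer(u, u) - 1.0 * np.outer(v, v)
print(positive_eig_sum(hess), naive_bound(hess))
```

For a low-rank Hessian like this one, the first quantity does not grow with the ambient dimension while the second grows linearly, which is why treating the eigenvalue sum as its own parameter can give sharper bounds.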
The following lemma characterizes the behavior of SGLD around a strict saddle point. It can be viewed as a discrete version of the discussions in Section 4.1.
Lemma 4.2 (Escaping saddle point).
This lemma states that SGLD is able to escape strict saddle points in polynomial time. Its full proof can be found in Section A.1.
4.3 Hitting time of SOSP
This theorem states that SGLD is able to hit an SOSP in polynomial time, thus verifying that adding noise is helpful in non-convex optimization. This result has been established by Zhang et al. (2017) using an indirect approach, as we discussed in Section 1. Our proof relies on a direct analysis. The proof intuition is simple and similar to what is demonstrated in Section 4.1. Yet, there are three layers of technicalities. First, SGLD is a discretized version of the Langevin diffusion process (4.1). Second, the loss function is only an approximation of considered in Section 4.1. Third, in order to apply the approximation , needs to be close to the critical point. We relegate the entire proof to Appendix A.
5 Applications in Online Estimation Problems
In the previous sections, we have shown that, as long as the objective function satisfies some smoothness conditions and the noise satisfies certain moment conditions, then SGLD hits a first/second order stationary point in polynomial time in terms of the following parameters:
In particular, the first order stationary point hitting time bound relies only on the first two constants. In this section we provide three concrete example problems, linear regression, online matrix factorization, and online PCA, to illustrate the usage of our analysis of SGLD. We will calculate the specific problem dependent constants in Assumption 3.1 and Assumption 4.1. Note that once we know these constants are bounded, Theorem 3.3 and Theorem 4.3 directly imply a polynomial hitting time. For all examples, we investigate the stochastic optimization setting, i.e., each random sample is only used once. All the proofs are deferred to Appendix B.
The constants defined above often have a positive dependence on the norm of the location where they are evaluated. Therefore we will assume below that the iterate is bounded. This assumption is sometimes made even in general theoretical analysis for simplicity (Zhang et al., 2017). We remark that this is not a restrictive assumption, since in practice, if the SGLD iterates diverge to infinity for a particular application, it is a clear indication that the algorithm is not fit for this application.
5.1 Linear regression
Our first example is the classical linear regression problem. Let the -th sample be , where the input is a sequence of random vectors independently drawn from the same distribution , and the response follows a linear model,
Here represents the true parameters of the linear model, and are independently and identically distributed (i.i.d.) random variables, which are uncorrelated with . For simplicity, we assume and have all moments being finite. We consider the quadratic loss, i.e., given , the loss function is
and the population loss function is defined as
When , the constants for the population loss function in (5.2) are given by
Here, is a constant independent of other parameters.
In other words, the number of iterations for SGLD to hit a first order stationary point depends only on , but not directly on the problem dimension . When is fixed, depends on the rank of , which can be much smaller than . Moreover, when the spectrum of is a summable sequence (e.g., where and is the -th largest eigenvalue of ), then is independent of . Such settings arise frequently in Bayesian problems related to partial differential equations, where the dimension can in theory be infinite (Stuart, 2010).
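The online linear regression setting can be sketched as follows. Several choices here are assumptions made for illustration rather than details from the paper: the per-sample loss is taken to be 0.5 * (<a, x> - b)**2, the inputs are drawn as zero-mean Gaussians, and the names `sample`, `stochastic_grad`, and `population_grad` are hypothetical.

```python
import numpy as np

def sample(x_star, cov_sqrt, noise_std, rng):
    """Draw one (a, b) pair with a ~ N(0, cov_sqrt @ cov_sqrt.T)
    and b = <a, x_star> + independent zero-mean noise."""
    a = cov_sqrt @ rng.standard_normal(cov_sqrt.shape[1])
    b = a @ x_star + noise_std * rng.standard_normal()
    return a, b

def stochastic_grad(x, a, b):
    """Gradient of the assumed per-sample loss 0.5 * (<a, x> - b)**2."""
    return a * (a @ x - b)

def population_grad(x, x_star, cov):
    """Gradient of the population loss under the assumptions above."""
    return cov @ (x - x_star)
```

Under these assumptions the population Hessian is the input covariance itself, so its trace, rather than the ambient dimension, is what enters the dimension-dependent constants. This matches the remark above that a low-rank or rapidly decaying spectrum keeps those constants small.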
5.2 Matrix Factorization
In the online matrix factorization problem, we want to minimize the following function
where has rank and denotes the Frobenius norm. The stochastic version is given by
Suppose for some constant , the constants for the population loss function (5.3) are bounded by
Here is a constant independent of other parameters.
Similar to the linear regression case, if is low rank or its spectrum decays rapidly to zero, the number of iterations for the SGLD to hit a first order stationary point is independent of the problem dimension.
5.3 Online PCA
In the online principal component analysis (PCA) problem, we consider the scenario where we conduct PCA for data samples . The population loss function is given by
where . The stochastic version is given by,
Suppose for some constant , the constants for the population loss function (5.4) are bounded by
Since the hitting time to a first order stationary point only depends on and , Proposition 5.3 shows that the SGLD hits a first order stationary point with the number of iterations independent of the problem dimension.
6 Conclusion and Discussions
In this paper we present a direct analysis of the hitting time of SGLD for first and second order stationary points. Our proof relies only on basic linear algebra and probability theory. Through this direct analysis, we show how different factors, such as smoothness of the objective function, noise strength, and step size, affect the final hitting time of SGLD. We also present three examples, online linear regression, online matrix factorization, and online PCA, which demonstrate the usefulness of our theoretical results in understanding SGLD for stochastic optimization tasks. An interesting future direction is to extend our proof techniques to analyze SGLD for optimizing deep neural networks. We believe that, combined with recent progress on the landscape of deep learning (Yun et al., 2018; Kawaguchi, 2016; Du and Lee, 2018; Hardt and Ma, 2016), this direction is promising.
The research of Xi Chen is supported by NSF Award (IIS-1845444), Alibaba Innovation Research Award, and Bloomberg Data Science Research Grant. The research of Xin T. Tong is supported by the National University of Singapore grant R-146-000-226-133 and Singapore MOE grant R-146-000-258-114.
Appendix A Technical Proofs of SOSP hitting analysis
In this section, we provide technical verifications of our claims made in Section 4.
A.1 Taylor expansions and preliminary lemmas
Our analysis relies on Taylor expansions near critical points. To this end, for any given iterate , denote
Assumption 4.1 leads to the following expansion
where , and are remainder terms of the Taylor expansion.
In order to apply the Taylor expansion, it is necessary for the subsequent iterates to be close to . To this end, we set up the following a-priori upper bound.
Denote , it follows the recursion below,
Note that if , , so
Taking the square of this bound and noting that , the following estimate holds by Young’s inequality