We consider the stochastic optimization problem:
is a real-valued, smooth, possibly non-convex objective function with two inputs, the decision vectorand a random vector defined on a set . A standard approach for solving stochastic optimization problems is to approximate the expectation as an average over independent observations and to solve the problem:
Such problems with finite-sum structure arise frequently in many applications including data analysis and machine learning. For example, in the context of stochastic learning problems, if
is an input-output pair of data sampled from an unknown underlying joint distribution, the function
corresponds to the loss functionof using the decision variable and input to predict . This formulation encompasses a number of regression and classification problems where
can be both convex and non-convex with respect to its first argument. For example, for linear and logistic regressions,is convex whereas corresponds to the choice of the parameters of the neural network we want to fit to the dataset (see e.g. [GBCB16]). In this work, our primary focus will be non-convex objectives.
First-order methods such as gradient descent and stochastic gradient descent and their variants with momentum have been popular for solving large-scale optimization problems due to their empirical performance, their scalability and cheaper iteration and storage cost compared to second-order methods in high dimensions[Ber15, Bub14]. These first-order methods admit some theoretical guarantees to locate a local minimizer, however their convergence depends strongly on the initialization (a different initial point can result in convergence to a different stationary point) and they do not have guarantees to visit a global minimum. The (gradient descent) Langevin dynamics (LD) is a variant of gradient descent where a properly scaled Gaussian noise is added to the gradients:
where is the stepsize, is a -dimensional anistropic Gaussian noise with distribution where for every , the noise is independent of the (filtration) past up to time and is called the inverse temperature parameter. With proper choice of the parameters and under mild assumptions, LD algorithm converges to a stationary distribution that concentrates around a global minimum (see e.g. [RRT17, BM99, GM91]) from an arbitrary initial point. Therefore, LD algorithm has a milder dependency on the initialization, visiting a global minimum eventually.444In the worst case, the number of iterations required to converge to a global minimum can be exponential in the dimension for an objective with multiple minima, in particular finding a global minimum of a non-convex objective is hard in general. However, when the objective has further structure such as a growth condition, the dependency can also be polynomial in dimension.
The analysis of the convergence behavior of LD is often based on viewing LD as a discretization of the associated stochastic differential equation (SDE), known as the overdamped Langevin diffusion or the first-order Langevin diffusion,
where is a normalizing constant. Without the noise term, overdamped Langevin SDE reduces to the gradient descent dynamics
which is an ordinary differential equation (ODE) that arises naturally in the study of LD (see e.g.[GM91], [FGQ97, Sec. 4]). This ODE is the continuum limit of the gradient descent algorithm as the stepsize goes to zero (see e.g. [SRBd17]).
Langevin dynamics has a long history, and has also been studied under simulated annealing algorithms in the optimization, physics and statistics literature and its asymptotic convergence guarantees are well known (see e.g. [Gid85, Haj85, GM91, KGV83, BT93, BLNR15, BM99]). However, finite-time performance guarantees for LD have not been studied until more recently. In particular, Raginsky et al. [RRT17] showed that Langevin dynamics with stochastic gradients converges to an -neighborhood of a global minimizer of (1.2) in iterations where is a spectral gap parameter related to the overdamped Langevin SDE. Zhang et al. [ZLC17] showed that LD with stochastic gradients is able to escape shallow local minima, hitting an -neighborhood of a local minimizer in time polynomial in the variables and . More recently, Tzen et al. [TLR18] showed that for a given local optimum , with high probability and arbitrary initialization, either LD iterates arrive at a point outside an -neighborhood of this local minimum within a recurrence time where is smallest eigenvalue of the Hessian at the local minimum or they enter this -neighborhood by the recurrence time and stay there until a potentially exponentially long escape time . The escape time measures how quickly the LD algorithm can get away from a given neighborhood around a local minimum, therefore it can be viewed as a measure of the effectiveness of LD for the search of a global minimizer, whereas the recurrence time can be viewed as the order of the time-scale for which search for local minima in the basin of attraction of that minimum happens [TLR18]. Typically, we have
, both time-scales are associated to the discrete-time LD dynamics, however they are often analyzed and estimated by considering the recurrence time and the exit time of the continuous-time overdamped Langevin diffusion[RRT17, GM91]. The simplest non-convex function studied frequently in the literature that sheds light into the exit time behavior of overdamped Langevin dynamics is the double-well example demonstrated in Figure 1, where the objective has two local minima separated by a saddle point in between.
For the overdamped Langevin diffusion, it is known that the expected time of the process starting from and hitting a small neighborhood of is given by
Here, as stands for the determinant of the Hessian of at , and is the unique negative eigenvalue of the Hessian of at the saddle point . This formula is known as the Eyring-Kramers formula for reversible diffusions. Its rigorous proof was first obtained by [BGK04] by a potential analysis approach, and then by [HKN04] through Witten Laplacian analysis. We refer to [Ber13] for a survey on mathematical approaches to the Eyring-Kramers formula.
We note that in many practical applications, for instance in the training of neural networks, the eigenvalues of the Hessian at local extrema concentrate around zero and the magnitude of the eigenvalues and can often be very small [SBL16, CCS17]. In this case, the recurrence time and the expected hitting time , being quantities that scale inversely with and respectively, can be quite large. This results in slow convergence of the overdamped process to its equilibrium [BGK05]. It is also known that the overdamped Langevin diffusion is a reversible555We note that overdamped Langevin SDE is reversible in the sense that if is distributed according to the stationary measure , then and have the same law. Mathematically, this is equivalent to the infinitesimal generator of process being (self-adjoint) symmetric in [LNP13, RW00]. Roughly speaking, reversibility of a Markov process says that statistical properties of the process are preserved if the process is run backwards in time. Markov process, and reversible processes can be much slower than their non-reversible variants that admit the same equilibrium distribution in terms of their convergence rate to the equilibrium (see e.g. [S09, LNP13, DHN00, HHMS93, HHMS05, EGZ17, GO07]).
There are two popular non-reversible variants of overdamped Langevin that can improve its performance in practice for a variety of applications [LNP13, Sec. 4]. The first variant is based on the underdamped Langevin diffusion, also known as the second-order Langevin diffusion,
where for each , and is known as the friction coefficient, and is a standard -dimensional Brownian motion. This diffusion goes back to Kramers [Kra40] and was derived in the physics literature to model particles moving in a potential subject to random noise. It is known that under mild assumption on , the Markov process is ergodic and have a unique stationary distribution
where is a normalizing constant. Hence, the marginal distribution in of the Gibbs distribution is exactly the same as the invariant distribution (1.4) of the overdamped Langevin dynamics (1.3). We refer the reader to [Pav14] for more on underdamped Langevin diffusions. The discretized dynamics are called the underdamped Langevin dynamics (ULD):666This algorithm is also known as the inertial Langevin dynamics or the Hamiltonian Langevin Monte Carlo algorithm.
where is a sequence of i.i.d standard Gaussian random vectors in . Recent work [GGZ18] showed that ULD admits better non-asymptotic performance guarantees compared to known guarantees for LD in the context of non-convex optimization when the objective satisfies a dissipativity condition. Recent work also showed that ULD or alternative discretizations of the underdamped diffusion can sample from the Gibbs distribution more efficiently than LD when is globally strongly convex [CCBJ17, DRD18, MS17] or strongly convex outside a compact set [CCA18].
The second variant of overdamped Langevin involves adding a drift term to make the dynamics non-reversible:
where is a anti-symmetric matrix, i.e. and is the identity matrix, and is a standard -dimensional Brownian motion. It can be shown that such a drift preserves the stationary distribution (1.4) (Gibbs distribution) of the overdamped Langevin dynamics [HHMS05, LNP13, Pav14, GM16]. Note that the non-reversible Langevin diffusion reduces to the overdamped Langevin diffusion (1.3) when . When is quadratic, the process in (1.12) is Gaussian. Using the rate of convergence of the covariance of as the criterion, [HHMS93] showed that is the worst choice, and improvement is possible if and only if the eigenvalues of the matrix associated with the linear drift term are not identical. [LNP13] proved the existence of the optimal such that the rate of convergence to equilibrium is maximized, and provided an easily implementable algorithm for constructing them. [WHC14] proposed two approaches to obtain the optimal rate of Gaussian diffusion and they also made the comparison with [LNP13]. [GM16] studied optimal linear drift for the speed of convergence in the hypoelliptic case. For more general non-quadratic , [HHMS05] showed by comparing the spectral gaps that by adding , the convergence to the Gibbs distribution is at least as fast as the overdamped Langevin diffusion (), and is strictly faster except for some rare situations. [DLP16]
showed that the asymptotic variance can also be reduced by using the non-reversible Langevin diffusion.
Finally, the discretization of the non-reversible Langevin diffusion (1.12) leads to
which we refer to as the non-reversible Langevin dynamics (NLD).
In this paper, we study two non-reversible variants of the gradient Langevin dynamics, the underdamped Langevin dynamics (ULD) and the non-reversible Langevin dynamics (NLD).
First, we consider the special case of quadratic objectives and derive an explicit characterization of the rate of convergence to the stationary distribution in the 2-Wasserstein metric. Our exponential rate is optimal and unimprovable in the sense that it is achieved for some quadratic functions and initialization. Our results show that both non-reversible diffusions mix faster than the reversible Langevin diffusion. Our mixing rate results improve upon the existing results by Cheng et al. [CCBJ17] and Dalalyan and Riou-Durand [DRD18] developed earlier for the more general case when is a strongly convex function with Lipschitz gradients.
Second, we investigate the metastability behavior of non-reversible Langevin algorithms for non-convex objectives. We extend the results of [TLR18] obtained for Langevin dynamics to non-reversible Langevin dynamics and show that for a given local minimum that is within an arbitrary distance from the initialization, with high probability, either ULD trajectory ends up somewhere outside an -neighborhood of this local minimum within a recurrence time or they enter this neighborhood by the recurrence time and stay there for a potentially exponentially long escape time. The analogous result shown in [TLR18] for reversible LD requires a recurrence time of . This shows that underdamped dynamics requires a smaller recurrence time by a square root factor in (ignoring a factor). Since the recurrence time can be viewed as a measure of the efficiency of the search of a local minimum [TLR18], our results suggest that underdamped Langevin dynamics operate on a faster time-scale to locate a local minimum. This is also consistent with our results for the quadratic objectives. We also show analogous results for NLD proving that its recurrence time improves upon that of LD under some assumptions.
Third, we consider the mean exit time from the basin of attraction of a local minimum for non-reversible algorithms. We focus on the double-well example (illustrated in Figure 1) which has been the recent focus of the literature [BR16, LMS17] as it is the simplest non-convex function that gives intuition about the more general case and for which mean exit time has been studied in continuous time.777 To our knowledge, a rigorous mathematical theory that characterizes the mean escape time for non-reversible Langevin algorithms beyond the double well example for general non-convex functions is non-existent at the moment.
To our knowledge, a rigorous mathematical theory that characterizes the mean escape time for non-reversible Langevin algorithms beyond the double well example for general non-convex functions is non-existent at the moment.We show that these results can be translated into discrete time if the stepsize parameter is small enough and the inverse temperature is large enough. Our analysis shows that non-reversible dynamics can exit the basin of attraction of a local minimum faster under some conditions and characterizes the improvement for both ULD and NLD compared to LD when the parameters of these algorithms are chosen appropriately. These results support the numerical evidence that non-reversible algorithms can explore the state space more efficiently (see e.g. [CDC15, CFG14, GM16]) and bridges a gap between the theory and practice of Langevin algorithms.
Throughout the paper, for any two real numbers , we use the notation to denote and to denote . For any matrix , we use , , to denote the eigenvalues of . We also assume that is the Hessian matrix evaluated at the local minimum , and is positive definite. The norm denotes the 2-norm of a vector and the (spectral norm) 2-norm of a matrix.
In our analysis, we will use the following 2-Wasserstein distance. For any two Borel probability measures and with finite second moments, the 2-Wasserstein distance is defined as
where the infimum is taken over all the random couples taking values in with marginals and . We refer the reader to [Vil09] for more on Wasserstein distances.
3 Special case: Quadratic objective
We first consider the case when is a strongly convex quadratic function
where is a symmetric positive definite matrix with eigenvalues in increasing order, i.e. , and is a -dimensional vector and is a scalar. Let
be the lower and upper bounds on the eigenvalues, where is the strong convexity constant and is the Lipschitz constant of the gradient
In this case, the Langevin diffusion corresponds to a particular Gaussian process known as the Ornstein-Uhlenbeck process. If the objective gradient is linearized around a point, such dynamics would naturally arise. Understanding the behavior of Langevin diffusion for this simpler special case have proven to be useful for analyzing the more general case when can be non-convex [RRT17], [Pav14, Sec. 6.3], [HN04] as well as shedding light into the convergence behavior of Langevin algorithms, see e.g. [DNP17, BG02]. Our analysis for characterizing the recurrence time of ULD and NLD will also build on our results in this section developed for quadratic objectives.
It is known that the overdamped Langevin diffusion mixes at the exponential rate with respect to the 2-Wasserstein metric [BGG12]. We next show that underdamped Langevin and non-reversible Langevin diffusions mix with a faster exponential rate.
3.1 Underdamped Langevin diffusion
By strong convexity, the quadratic function given in (3.1) has a unique minimum at . We write
Thus, we have the decomposition
Then, we can write the underdamped diffusion in matrix form as
where is a -dimensional standard Brownian motion and
This process is an Ornstein-Uhlenbeck process with a solution
-dimensional Gaussian distribution for allwith mean and covariance given by
It follows from (1.9) that the stationary distribution of is
which is a Gaussian distribution with mean and covariance given by
and the marginal stationary distribution of the process is given by
Natural questions to ask are the following:
Let be the probability law of the underdamped process at time . How fast does the underdamped process converge to its equilibrium in the metric?
How does the convergence depend on the constants and ?
These questions are naturally related to how fast and converges to its equilibrium values and . Dalalyan and Riou-Durand [DRD18] showed recently that
for some explicit constant that depends on . In [DRD18], the bound in (3.6) holds for , and the result in (3.6) can be extended to general by applying Lemma 1 in [DRD18]. For general , we should replace by and and by and to apply (3.6), and the exponent in (3.6) remains invariant. This improves upon the mixing rate obtained in [CCBJ17] for strongly convex for and in the underdamped SDE (see Theorem 5 888In our notation, our is the in [CCBJ17], but the rate is preserved. in [CCBJ17]). It is known that overdamped Langevin diffusion mixes at rate in the 2-Wasserstein metric [DRD18], the rate in (3.6) improves upon this rate when . Next, in Theorem 3, we show a stronger result that shows that the underdamped diffusion mixes at rate when is a strongly convex quadratic.
Let denote the probability law of the underdamped process at time . The 2-Wasserstein distance between the two Gaussian distributions and admits an explicit formula (see e.g. [Gel90])
Inserting formulas (3.4)–(3.5) into (3.7) leads to an explicit expression for the Wasserstein distance. In particular we see from (3.4) that is controlled by 999E.g. and ., where denotes the spectral norm (2-norm) of a matrix. In the next lemma, we provide an estimate on , in particular achieving the fastest mixing rate requires tuning the parameter .
(i) If ), then
(ii) If , then,
where is a constant that depends only on the eigenvalues of and is defined as
In Lemma 1, when , the matrix has double eigenvalues that are not simple (the eigenvalues have a Jordan block of size 2). As a consequence, in this case, there is a factor in front of the exponential that grows with and it is not possible to replace this factor with a universal constant that does not depend on . This behavior is also the reason why the constant in part (i) of Lemma 1 as .
Recall from (1.9) that for the underdamped diffusion in (1.7)-(1.8), we have the stationary distribution of the process . Based on the previous lemma and a standard coupling approach, we obtain the following convergence result in the 2-Wasserstein distance.
Consider the underdamped Langevin diffusion with parameter for a quadratic objective given by (3.1). Let denote the law of at time for the underdamped Langevin diffusion in (1.8). We have the following:
Following the proof technique of Theorem 3, it can be shown that if , then decays with an exponential rate which is a slower decay than the rate achieved for . In this sense, the choice of optimizes the convergence rate to the stationary distribution. This is illustrated in Figure 2. A similar result was known in dimension one (see [Ris89, Sec. 10.2], [Pav14, Section 6.3], [EGZ17, Sec. 1.6]), our result generalizes this result to arbitrary dimension and gives explicit constants.
Theorem 3 shows that a convergence rate of is achievable by tuning the parameter and specifies all the constants and dependence to initialization explicitly. We note that the exponent in in our analysis is optimal and is achieved for some quadratic functions.
3.2 Non-reversible Langevin diffusion
We will show that the non-reversible Langevin diffusion given by (1.12) can converge to its equilibrium with a faster exponential rate compared to the rate of overdamped Langevin diffusion. With a change of variable , the non-reversible Langevin dynamics given in (1.12) become
which admits the solution
We observe that, the convergence behavior of the process to its equilibrium is controlled by the decay of in time . It is expected that the decay of in time is related to the real part of the eigenvalues
indexed with increasing order and their multiplicity. We note that when , we have and which are the eigenvalues of . Hwang et al. [HHMS93] showed that for any anti-symmetric , we have the following result.
Lemma 5 ([Hhms93, Theorem 3.3]).
For any anti-symmetric matrix , we have
and if and only if the following condition holds:
There exists non-zero vectors and in and such that
is an eigenvector ofwith eigenvalue and is an eigenvector of with eigenvalue and , .
, there exists an invertible matrixsuch that
where the latter is a block diagonal matrix with diagonals , , with being the Jordan block corresponding to the -th eigenvalue. This leads to the immediate bound
where is a universal constant that depends on and is the size of the Jordan block corresponding to the eigenvalue (with the convention that a simple eigenvalue is corresponding to a Jordan block of size one). This is summarized in the following lemma:
There exists a positive constant that depends on such that
where is the maximal size of a Jordan block of corresponding to the eigenvalue . It follows that for any , there exist some constant that depends on and such that for every ,
One could also ask what is the choice of the matrix that can maximize the exponent that appears in Lemma 6, i.e., let
A formula for and an algorithm to compute it is known (see [LNP13, Fig. 1]), however this is not practical to compute for optimization purposes as it requires the knowledge of the eigenvectors and eigenvalues of the matrix which is unknown. Nevertheless, gives information about the extent of acceleration that can be obtained. It is known that
The acceleration is not possible (the ratio above is ) if and only if all the eigenvalues of are the same and are equal to ; i.e. when and . Otherwise, can accelerate by a factor of which is on the order of the condition number up to a constant which is close to one for large.
We recall that denotes the stationary distribution of the Langevin diffusion, that is, . Let denote the law of at time defined in (1.12). Using Lemma 6, we can show the following convergence rate in Wasserstein distances using a proof technique similar to the proof of Theorem 3.
For every ,
where , , are defined in Lemma 6.
In the next section, we will discuss non-convex objectives. The results from this section will play a crucial role to understand the behavior of Langevin algorithms near a local minimum which can be approximated by an Ornstein-Uhlenbeck process.
4 Recurrence and escape times for non-reversible Langevin dynamics
It is known that the reversible Langevin algorithm converges to a local minimum in time polynomial with respect to parameters and , the intuition being that the expectation of the iterates follows the gradient descent dynamics which converges to a local minimum [ZLC17, FGQ97]. It is also known that once Langevin algorithms arrive to a neighborhood of a local optimum, they can spend exponentially many iterations in dimension to escape from the basin of attraction of this local minimum. This behavior is known as “metastability” and has been studied well [BGK05, BGK04, Ber13, TLR18].
Recently, Tzen et al. [TLR18] provided a finer characterization of this metastability phenomenon and showed that for a particular local minimum, if the stepsize is small enough and the inverse temperature is large enough, with an arbitrary initialization, at least one of the following two events will occur with high probability: (1) The Langevin trajectory will end up being outside of the -neighborhood of this particular optimum within a short recurrence time where is the smallest eigenvalue of the Hessian at the local minimum. (2) The Langevin dynamics’ trajectory enters this -neighborhood by the recurrence time and stays there for a potentially exponentially long escape time . In Section 3, we observed that both underdamped Langevin and reversible Langevin concentrates around the local (global) minimum faster in continuous-time for the quadratic objectives. When the stepsize is small, we also expect similar results to hold in discrete time. In this section, we will study the recurrence time and escape time and of underdamped Langevin dynamics (ULD) and the corresponding time-scales and for non-reversible Langevin dynamics (NLD). We will show that recurrence time of underdamped and non-reversible Langevin algorithms will improve upon that of reversible Langevin algorithms in terms of its dependency to the smallest eigenvalue of the Hessian at a local minimum. Our analysis is based on linearizing the gradient of the objective around a local minimum and is based on the results derived in the previous section on quadratic objectives. We will also show in Section 5 that for the double-well potential, the mean exit times from the basin of attraction of a local minimum101010Formally, we define the basin of attraction of a local minimum as the set of all initial points such that the gradient flow ODE given in (1.5) with initial point converges to as time goes to infinity. for non-reversible Langevin dynamics will improve upon that of reversible Langevin dynamics in terms of dependency to the curvature at the saddle point.
Throughout the rest of the paper, we impose the following assumptions.
We impose the following assumptions.
The functions are twice continuously differentiable, non-negative valued, and
uniformly in for some .
have Lipschitz-continuous gradients and Hessians, uniformly in , there exist constants so that for all ,
The empirical risk is -dissipative 111111In terms of notations, the dissipativity constant here is taken to be the same as the minimum eigenvalue of the Hessian matrix in this section, as well as for the quadratic case in Section 3. With abuse of notations, we use both for non-convex and convex cases.:
The initialization satisfies .
The first assumption and the second assumption on gradient Lipschitzness is standard (see e.g. [MSH02, TLR18, EGZ17, CCBJ17]). The second assumption on the Lipschitzness of the Hessian is also frequently made in the literature [TLR18, CFM18, DRD18], this allows a more accurate local approximation of the Langevin dynamics as an Ornstein-Uhlenbeck process. The third assumption on dissipativity is also standard in the literature to ensure convergence of Langevin diffusions to the stationary measure. It is not hard to see from dissipativity condition (4.2) in Assumption 9 (iii) above that for any local minimum of the Hessian matrix , which is positive definite and the minimum eigenvalue of is , then we have . This indicates that we can choose an initialization to satisfy as in part (iv) of Assumption 9.
4.1 Underdamped Langevin dynamics
In this section, we investigate the behavior around local minima for the underdamped Langevin dynamics (1.10)-(1.11) by studying recurrence and escape times with the choice of the friction coefficient which is optimal for the convergence rate as discussed in Section 3. We also give an analogous theorem for the case in Section E of the Appendix; the remaining case can also be treated similarly.
Before we state the main result, we summarize the technical constants that will be used in Table 1 121212In Table 1, , give the uniform bounds for the continuous-time processes and defined in (1.8) and (1.7) respectively, and , give the uniform bounds for the discrete-time processes and defined in (1.11) and (1.10) respectively, and and are used in the upper bound on the stepsize under which the uniform bounds for the discrete-time processes are valid, see Lemma 24 for details.. In the following result and the rest of the paper unless we specify otherwise, the notation hides dependency to the constants , , . We give explicit expressions for all the constants in the proofs.