A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics

02/18/2017 ∙ by Yuchen Zhang et al. ∙ Stanford University

We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. The algorithm performs stochastic gradient descent, where in each step it injects appropriately scaled Gaussian noise into the update. We analyze the algorithm's hitting time to an arbitrary subset of the parameter space. Two results follow from our general theory: First, we prove that for empirical risk minimization, if the empirical risk is pointwise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that exist only in the empirical risk. Second, we show that SGLD improves on one of the best known learnability results for learning linear classifiers under the zero-one loss.


1 Introduction

A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although escaping all local minima is NP-hard in general [e.g. 7], one might expect that it should be possible to escape "appropriately shallow" local minima, whose basins of attraction have relatively low barriers. As an illustrative example, consider minimizing the empirical risk function shown in Figure 1. As the figure shows, although the empirical risk is uniformly close to the population risk, it contains many poor local minima that don't exist in the population risk. Gradient descent is unable to escape such local minima.

A natural workaround is to inject random noise into the gradient. Empirically, adding gradient noise has been found to improve learning for deep neural networks and other non-convex models [23, 24, 18, 17, 35]. However, theoretical understanding of the value of gradient noise is still incomplete. For example, Ge et al. [14] show that by adding isotropic noise w and by choosing a sufficiently small stepsize η, the iterative update:

(1)   x_{t+1} = x_t − η (∇f(x_t) + w)

is able to escape strict saddle points. Unfortunately, this approach, as well as the subsequent line of work on escaping saddle points [20, 2, 1], doesn’t guarantee escaping even shallow local minima.

Another line of work in Bayesian statistics studies the Langevin Monte Carlo (LMC) method [28], which employs an alternative noise term. Given a function f, LMC performs the iterative update:

(2)   x_{t+1} = x_t − η ∇f(x_t) + √(2η/ξ) · w,   w ∼ N(0, I_{d×d}),

where ξ > 0 is a "temperature" hyperparameter. Unlike the bounded noise added in formula (1), LMC adds a large noise term that scales with √η. With a small enough η, the noise dominates the gradient, enabling the algorithm to escape any local minimum. For empirical risk minimization, one might substitute the exact gradient with a stochastic gradient, which gives the Stochastic Gradient Langevin Dynamics (SGLD) algorithm [34]. It can be shown that both LMC and SGLD asymptotically converge to a stationary distribution μ(x) ∝ e^{−ξ f(x)} [28, 30]. As ξ → ∞, the probability mass of μ concentrates on the global minimum of the function f, and the algorithm asymptotically converges to a neighborhood of the global minimum.
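To see why the noise dominates, compare the per-iteration magnitudes of the two terms (a back-of-the-envelope sketch in LaTeX; it only uses the fact that E‖w‖_2 = Θ(√d) for w ∼ N(0, I_{d×d})):

    \underbrace{\|\eta\,\nabla f(x_t)\|_2}_{\text{gradient term}} = O(\eta),
    \qquad
    \underbrace{\bigl\|\sqrt{2\eta/\xi}\,w\bigr\|_2}_{\text{noise term}}
      = \Theta\!\left(\sqrt{\eta d/\xi}\right).

Since √η ≫ η as η → 0, the injected noise eventually dominates any bounded gradient.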

Figure 1: Empirical risk (computed on a finite sample) versus population risk (the infinite-sample limit) on one-dimensional zero-one losses. The two functions are uniformly close, but the empirical risk contains local minima that are far worse than the population local minima.

Despite asymptotic consistency, there is no theoretical guarantee that LMC is able to find the global minimum of a general non-convex function, or even a local minimum of it, in polynomial time. Recent works focus on bounding the mixing time (i.e. the time for converging to μ) of LMC and SGLD. Bubeck et al. [10], Dalalyan [12] and Bonis [8] prove that on convex functions, LMC converges to the stationary distribution in polynomial time. On non-convex functions, however, an exponentially long mixing time is unavoidable in general. According to Bovier et al. [9], it takes the Langevin diffusion at least e^{ξh} time to escape a depth-h basin of attraction. Thus, if the function contains multiple "deep" basins with h = Ω(1), then the mixing time is lower bounded by e^{Ω(ξ)}.

In parallel work to this paper, Raginsky et al. [27] upper bound the time of SGLD converging to an approximate global minimum of non-convex functions. They show that the upper bound is polynomial in the inverse of a quantity they call the uniform spectral gap. Similar to the mixing time bound, in the presence of multiple local minima, the convergence time to an approximate global minimum can be exponential in the dimension d and in the temperature parameter ξ.

Contributions

In this paper, we present an alternative analysis of the SGLD algorithm (the theory holds for the standard LMC algorithm as well). Instead of bounding its mixing time, we bound the algorithm's hitting time to an arbitrary set U ⊆ K on a general non-convex function. The hitting time captures the algorithm's optimization efficiency, and more importantly, it enjoys polynomial rates for hitting appropriately chosen sets regardless of the mixing time, which could be exponential. We highlight two consequences of the generic bound: First, under suitable conditions, SGLD hits an approximate local minimum of f, with a hitting time that is polynomial in dimension and all hyperparameters; this extends the polynomial-time guarantees proved for convex functions [10, 12, 8]. Second, the time complexity bound is stable, in the sense that any O(1/ξ) perturbation in ℓ∞-norm of the function f doesn't significantly change the hitting time. This second property is the main strength of SGLD: For any function F, if there exists another function f such that ‖F − f‖_∞ ≤ O(1)/ξ, then we define the set U to be the approximate local minima of f. The two properties together imply that even if we execute SGLD on the function F, it hits an approximate local minimum of f in polynomial time. In other words, SGLD is able to escape "shallow" local minima of F that can be eliminated by slightly perturbing the function.

This stability property is useful in studying empirical risk minimization (ERM) in situations where the empirical risk f̂ is pointwise close to the population risk f, but has poor local minima that don't exist in the population risk. This phenomenon has been observed in statistical estimation with non-convex penalty functions [33, 21], as well as in minimizing the zero-one loss (see Figure 1). Under this setting, our result implies that SGLD achieves an approximate local minimum of the (smooth) population risk in polynomial time, ruling out local minima that only exist in the empirical risk. It improves over recent results on non-convex optimization [14, 20, 2, 1], which compute approximate local minima only for the empirical risk.

As a concrete application, we prove a stronger learnability result for the problem of learning linear classifiers under the zero-one loss [3], which involves non-convex and non-smooth empirical risk minimization. Our result improves over the recent result of Awasthi et al. [4]: the method of Awasthi et al. [4] handles noisy data corrupted by a very small Massart noise level (bounded by a tiny numerical constant), while our algorithm handles Massart noise up to any constant less than 1/2. As a Massart noise of 1/2 represents completely random observations, we see that SGLD is capable of learning from very noisy data.

Techniques

The key step of our analysis is to define a positive quantity called the restricted Cheeger constant. This quantity connects the hitting time of SGLD, the geometric properties of the objective function, and the stability of the time complexity bound. For an arbitrary function f and an arbitrary set V ⊆ K, the restricted Cheeger constant C_f(V) is defined as the minimal ratio between the surface measure of a subset A ⊆ V and its volume, both taken with respect to a probability measure μ_f induced by f. We prove that the hitting time is polynomial in the inverse of the restricted Cheeger constant (Section 2.3). The stability of the time complexity bound follows as a natural consequence of the definition of this quantity (Section 2.2). We then develop techniques to lower bound the restricted Cheeger constant based on geometric properties of the objective function (Section 2.4).

Notation

For any positive integer n, we use [n] as a shorthand for the discrete set {1, 2, …, n}. For a rectangular matrix A, let ‖A‖_* be its nuclear norm (i.e., the sum of singular values), and ‖A‖_sp be its spectral norm (i.e., the maximal singular value). For any point x and an arbitrary set V, we denote their Euclidean distance by d(x, V) := inf_{y∈V} ‖x − y‖_2. We use B(x, r) to denote the Euclidean ball of radius r that centers at the point x.

2 Algorithm and main results

In this section, we define the algorithm and the basic concepts, then present the main theoretical results of this paper.

2.1 The SGLD algorithm

Input: Objective function f; hyperparameters (ξ, η, kmax, D).

  1. Initialize x_0 by uniformly sampling from the parameter space K.

  2. For each k = 1, 2, …, kmax: Sample w ∼ N(0, I_{d×d}). Compute a stochastic gradient g(x_{k−1}) such that E[g(x_{k−1})] = ∇f(x_{k−1}). Then update:

    (3a)   y_k = x_{k−1} − η g(x_{k−1}) + √(2η/ξ) · w;
    (3b)   x_k = y_k if y_k ∈ K ∩ B(x_{k−1}, D), otherwise x_k = x_{k−1}.

Output: x̂ := argmin_{k∈[kmax]} f(x_k).

Algorithm 1 Stochastic Gradient Langevin Dynamics

Our goal is to minimize a function f in a compact parameter space K ⊆ R^d. The SGLD algorithm [34] is summarized in Algorithm 1. In step (3a), the algorithm performs SGD on the function f, then adds Gaussian noise to the update. Step (3b) ensures that the vector x_k always belongs to the parameter space, and is not too far from the iterate x_{k−1} of the previous iteration. (The hyperparameter D can be chosen large enough so that the constraint y_k ∈ B(x_{k−1}, D) is satisfied with high probability; see Theorem 1.) After kmax iterations, the algorithm returns the vector x̂. Although standard SGLD returns the last iterate, we study a variant of the algorithm which returns the best vector across all iterations. This choice is important for our analysis of hitting time. We note that evaluating f(x_k) can be computationally more expensive than computing the stochastic gradient g(x_k), because the objective function f is defined on the entire dataset, while the stochastic gradient can be computed via a single instance. Returning the best iterate merely facilitates the theoretical analysis and might not be necessary in practice.

Because of the noisy update, the sequence (x_0, x_1, x_2, …) asymptotically converges to a stationary distribution rather than a stationary point [30]. Although this fact introduces challenges to the analysis, we show that the algorithm's non-asymptotic efficiency can be characterized by a positive quantity called the restricted Cheeger constant.

2.2 Restricted Cheeger constant

For any measurable function f, we define a probability measure μ_f whose density function is:

(4)   μ_f(x) := e^{−f(x)} / ∫_K e^{−f(y)} dy,   for all x ∈ K.

For any function f and any subset V ⊆ K, we define the restricted Cheeger constant as:

(5)   C_f(V) := lim inf_{ε→0⁺} inf_{A⊆V} (μ_f(A_ε) − μ_f(A)) / (ε · μ_f(A)),   where A_ε := {x ∈ K : d(x, A) ≤ ε}.

The restricted Cheeger constant generalizes the notion of the Cheeger isoperimetric constant [11], quantifying how poorly connected a subset of V can be made to the rest of the parameter space. The connectivity is measured by the ratio of the surface measure lim inf_{ε→0} (μ_f(A_ε) − μ_f(A))/ε to the set measure μ_f(A). Intuitively, this ratio quantifies the chance of escaping the set A under the probability measure μ_f.
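As a toy numerical illustration (a hypothetical example, not from the paper), the following Python sketch approximates the ratio in (5) for one fixed candidate set A, replacing the lim inf with a small finite ε:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: (x ** 2).sum(axis=1)                 # example function on K = [-1, 1]^2

    x = rng.uniform(-1.0, 1.0, size=(500_000, 2))      # uniform proposals on K
    w = np.exp(-f(x))                                  # importance weights for mu_f

    r, eps = 0.5, 1e-2                                 # A = ball of radius r; expansion eps
    dist = np.linalg.norm(x, axis=1)
    mu_A = w[dist <= r].sum() / w.sum()
    mu_shell = w[(dist > r) & (dist <= r + eps)].sum() / w.sum()
    print("Cheeger ratio estimate for A:", mu_shell / (eps * mu_A))

The restricted Cheeger constant takes the infimum of this ratio over all subsets A ⊆ V, which is what makes it a conservative measure of escapability.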

Stability of restricted Cheeger constant

A property that will be important in the sequel is that the restricted Cheeger constant is stable under perturbations: if we perturb f by a small amount, then the values of μ_f won't change much, so that the variation of C_f(V) will also be small. More precisely, for functions f and g satisfying sup_{x∈K} |f(x) − g(x)| ≤ δ, we have

(6)   C_f(V) ≥ e^{−2δ} · C_g(V),

and similarly C_g(V) ≥ e^{−2δ} · C_f(V). As a result, if two functions f and g are uniformly close with δ = O(1), then we have C_f(V) ≥ c · C_g(V) for a constant c = e^{−2δ} > 0. This property enables us to lower bound C_f(V) by lower bounding the restricted Cheeger constant of an alternative function g, which might be easier to analyze.
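The inequality (6) can be verified in two lines (a sketch in LaTeX, assuming sup_x |f(x) − g(x)| ≤ δ): the normalizer cancels inside the Cheeger ratio, and each remaining integral changes by a factor of at most e^δ:

    \frac{\mu_f(A_\varepsilon)-\mu_f(A)}{\mu_f(A)}
      = \frac{\int_{A_\varepsilon\setminus A} e^{-f(x)}\,dx}{\int_A e^{-f(x)}\,dx}
      \;\ge\; \frac{e^{-\delta}\int_{A_\varepsilon\setminus A} e^{-g(x)}\,dx}
                   {e^{\delta}\int_A e^{-g(x)}\,dx}
      = e^{-2\delta}\,\frac{\mu_g(A_\varepsilon)-\mu_g(A)}{\mu_g(A)}.

Taking the infimum over A ⊆ V and letting ε → 0 yields C_f(V) ≥ e^{−2δ} · C_g(V).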

2.3 Generic non-asymptotic bounds

We make several assumptions on the parameter space and on the objective function.

Assumption A (parameter space and objective function).
  • The parameter space K satisfies: there exists h_max > 0, such that for any x ∈ K and any h ≤ h_max, the random variable y := x + √(2h) · w with w ∼ N(0, I_{d×d}) satisfies P(y ∈ K) ≥ 1/3.

  • The function f is bounded, differentiable and L-smooth in K, meaning that for any x, y ∈ K, we have ‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2.

  • The stochastic gradient vector g(x) has sub-exponential tails: there exist b_max > 0 and G > 0, such that given any x ∈ K and any vector u satisfying ‖u‖_2 ≤ b_max, the vector g(x) satisfies E[exp(⟨g(x), u⟩)] ≤ exp(G² ‖u‖_2² / 2).

The first assumption states that the parameter space doesn't contain sharp corners, so that the update (3b) won't be stuck at the same point for too many iterations. It can be satisfied, for example, by defining the parameter space to be a Euclidean ball and choosing a small enough h_max. The probability 1/3 is arbitrary and can be replaced by any constant in (0, 1). The second assumption requires the function f to be smooth. We show how to handle non-smooth functions in Section 3 by appealing to the stability property of the restricted Cheeger constant discussed earlier. The third assumption requires the stochastic gradient to have sub-exponential tails, which is a standard assumption in stochastic optimization.

Theorem 1.

Assume that Assumption A holds. For any subset U ⊆ K, any ξ > 0 and any δ ∈ (0, 1), there exist η_0 > 0 and kmax ∈ N, such that if we choose any stepsize η ∈ (0, η_0] and a sufficiently large hyperparameter D, then with probability at least 1 − δ, SGLD after kmax iterations returns a solution x̂ satisfying:

(7)   f(x̂) ≤ sup_{x∈U} f(x).

The iteration number is bounded by

(8)   kmax ≤ M / (C_{ξf}(K\U))²,

where the numerator M is polynomial in (d, ξ, 1/η, log(1/δ)) and in the parameters of Assumption A. See Appendix B.2 for the explicit polynomial dependence.

Theorem 1 is a generic result that applies to all optimization problems satisfying Assumption A. The right-hand side of the bound (7) is determined by the choice of U. If we choose U to be the set of (approximate) local minima, and let the approximation tolerance be sufficiently small, then f(x̂) will roughly be bounded by the worst local minimum. The theorem permits ξ to be arbitrary provided the stepsize η is small enough. Choosing a larger ξ means adding less noise to the SGLD update, which means that the algorithm will be more efficient at finding a stationary point, but less efficient at escaping local minima. Such a trade-off is captured by the restricted Cheeger constant in inequality (8) and will be rigorously studied in the next subsection.

The iteration complexity bound is governed by the restricted Cheeger constant. For any function f and any target set U with a positive Borel measure, the restricted Cheeger constant is strictly positive (see Appendix A), so that with a small enough η, the algorithm always converges to the global minimum asymptotically. We remark that SGD doesn't enjoy the same asymptotic optimality guarantee, because it uses a Θ(η) gradient noise in contrast to SGLD's Θ(√η) one. Since the convergence theory requires a small enough η, we often have η ≪ √η: the SGD noise is too conservative to allow the algorithm to escape local minima.

The proof of Theorem 1 is fairly technical. We defer the full proof to Appendix B, only sketching the basic proof ideas here. At a high level, we establish the theorem by bounding the hitting time of the Markov chain (x_0, x_1, x_2, …) to the set U. Indeed, if some x_k hits the set, then:

f(x̂) ≤ f(x_k) ≤ sup_{x∈U} f(x),

which establishes the risk bound (7).

In order to bound the hitting time, we construct a time-reversible Markov chain, and prove that its hitting time to U is on a par with the original hitting time. To analyze this second Markov chain, we define a notion called the restricted conductance, which measures how easily the Markov chain can transition between states within K\U. This quantity is related to the notion of conductance in the analysis of time-reversible Markov processes [22], but the ratio between these two quantities can be exponentially large for non-convex f. We prove that the hitting time of the second Markov chain depends inversely on the restricted conductance, so that the problem reduces to lower bounding the restricted conductance.

Finally, we lower bound the restricted conductance by the restricted Cheeger constant. The former quantity characterizes the Markov chain, while the latter captures the geometric properties of the function f. Thus, we must analyze the SGLD algorithm in depth to establish a connection between them. Once we prove this lower bound, putting all pieces together completes the proof.

2.4 Lower bounding the restricted Cheeger constant

In this subsection, we prove lower bounds on the restricted Cheeger constant in order to flesh out the iteration complexity bound of Theorem 1. We start with a lower bound for the class of convex functions:

Proposition 1.

Let K be a d-dimensional unit ball. For any convex G-Lipschitz continuous function f and any ε > 0, let the set of ε-optimal solutions be defined by:

U := { x ∈ K : f(x) ≤ min_{y∈K} f(y) + ε }.

Then for any ξ ≥ ξ_0, where the threshold ξ_0 is polynomial in (d, G, 1/ε), we have C_{ξf}(K\U) = Ω(1).

The proposition shows that if we choose a big enough ξ, then C_{ξf}(K\U) will be lower bounded by a universal constant. The lower bound is proved based on an isoperimetric inequality for log-concave distributions; see Appendix C for the proof.

For non-convex functions, directly proving the lower bound is difficult, because the definition of C_f(V) involves verifying the properties of all subsets A ⊆ V. We start with a generic lemma that reduces the problem to checking properties of all points in V.

Lemma 1.

Consider an arbitrary continuously differentiable vector field φ: K → R^d and a positive number ε_0 > 0 such that:

(9)   ‖φ(x)‖_2 ≤ 1  and  x − εφ(x) ∈ K,   for any x ∈ K and any ε ∈ (0, ε_0).

For any continuously differentiable function f and any subset V ⊆ K, the restricted Cheeger constant C_f(V) is lower bounded by

C_f(V) ≥ inf_{x∈V} { ⟨∇f(x), φ(x)⟩ − div φ(x) }.

Figure 2: Consider a mapping π_ε(x) := x − εφ(x). If the conditions of Lemma 1 hold, then we have π_ε(A) ⊆ A_ε and consequently μ_f(π_ε(A)) ≤ μ_f(A_ε). We use inequality (10) to lower bound the restricted Cheeger constant.

Lemma 1 reduces the problem of lower bounding C_f(V) to the problem of finding a proper vector field φ and verifying its properties for all points x ∈ V. Informally, the quantity ⟨∇f(x), φ(x)⟩ − div φ(x) measures the chance of escaping the set V at the point x. The lemma shows that if we can construct an "oracle" vector field φ such that at every point it gives a correct direction to escape V (i.e., makes this quantity uniformly positive) while never leading outside of K, then we obtain a strong lower bound on C_f(V). This construction is merely for the theoretical analysis and doesn't affect the execution of the algorithm.

The proof idea is illustrated in Figure 2: by constructing a mapping π_ε that satisfies the conditions of the lemma, we obtain μ_f(A_ε) ≥ μ_f(π_ε(A)) for all A ⊆ V, and consequently we are able to lower bound the restricted Cheeger constant by:

(10)   C_f(V) ≥ lim inf_{ε→0⁺} inf_{A⊆V} (μ_f(π_ε(A)) − μ_f(A)) / (ε μ_f(A)),

where π_ε(A) := { x − εφ(x) : x ∈ A } is an infinitesimal perturbation of the set A. It can be shown that the right-hand side of inequality (10) is equal to inf_{x∈V} { ⟨∇f(x), φ(x)⟩ − div φ(x) }, which establishes the lemma. See Appendix D for a rigorous proof.
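The computation behind this last claim is a first-order change of variables under π_ε (a sketch in LaTeX, writing Z := ∫_K e^{−f(y)} dy for the normalizer in (4)):

    \mu_f(\pi_\varepsilon(A))
      = \frac{1}{Z}\int_A e^{-f(x-\varepsilon\phi(x))}
        \,\bigl|\det\bigl(I-\varepsilon\,\partial\phi(x)\bigr)\bigr|\,dx
      = \frac{1}{Z}\int_A e^{-f(x)}\Bigl(1+\varepsilon\bigl[\langle\nabla f(x),\phi(x)\rangle
        -\operatorname{div}\phi(x)\bigr]\Bigr)\,dx \;+\; o(\varepsilon).

Substituting this expansion into (10) and letting ε → 0 recovers the pointwise expression ⟨∇f(x), φ(x)⟩ − div φ(x).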

Before demonstrating the applications of Lemma 1, we make several additional mild assumptions on the parameter space and on the function f.

Assumption B (boundary condition and smoothness).
  • The parameter space K is a d-dimensional ball of radius r > 0 centered at the origin. There exists h > 0 such that for every point x satisfying ‖x‖_2 = r, we have ⟨∇f(x), x⟩ ≥ h; that is, the gradient points outward on the boundary of K.

  • For some G, L, ρ > 0, the function f is third-order differentiable with ‖∇f(x)‖_2 ≤ G, ‖∇²f(x)‖_sp ≤ L, and ‖∇²f(x) − ∇²f(y)‖_sp ≤ ρ ‖x − y‖_2 for any x, y ∈ K.

The first assumption requires the parameter space to be a Euclidean ball and imposes a gradient condition on its boundary. This is made mainly for the convenience of theoretical analysis. We remark that for any function f, the condition on the boundary can be satisfied by adding a smooth barrier function β(‖x‖_2) to it, where β(t) = 0 for any t ≤ r − r_0 (for some small r_0 > 0), but sharply increases on the interval (r − r_0, r] to produce large enough boundary gradients. The second assumption requires the function to be third-order differentiable. We shall relax the second assumption in Section 3.

The following proposition describes a lower bound on the restricted Cheeger constant when f is a smooth function and the set U consists of approximate stationary points. Although we shall prove a stronger result in Proposition 3, the proof of this proposition is a good example for demonstrating the power of Lemma 1.

Proposition 2.

Assume that Assumption B holds. For any ε > 0, define the set of ε-approximate stationary points U := { x ∈ K : ‖∇f(x)‖_2 ≤ ε }. For any ξ ≥ 2dL/ε², we have C_{ξf}(K\U) ≥ ξε²/(2G).

Proof.

Recall that G is the Lipschitz constant of the function f, so that ‖∇f(x)‖_2 ≤ G. Let the vector field be defined by φ(x) := ∇f(x)/G; then we have ‖φ(x)‖_2 ≤ 1. By Assumption B, it is easy to verify that the conditions of Lemma 1 hold. For any x ∈ K\U, the fact that ‖∇f(x)‖_2 > ε implies:

⟨∇(ξf)(x), φ(x)⟩ = ξ ‖∇f(x)‖_2² / G > ξε²/G.

Recall that L is the smoothness parameter, so that ‖∇²f(x)‖_sp ≤ L. By Assumption B, the divergence of φ is upper bounded by div φ(x) = tr(∇²f(x))/G ≤ dL/G. Consequently, if we choose ξ ≥ 2dL/ε² as assumed, then we have:

⟨∇(ξf)(x), φ(x)⟩ − div φ(x) ≥ ξε²/G − dL/G ≥ ξε²/(2G).

Lemma 1 then establishes the claimed lower bound. ∎

Next, we consider approximate local minima [25, 1], a notion which rules out local maxima and strict saddle points. For an arbitrary ε > 0, the set of ε-approximate local minima is defined by:

(11)   U := { x ∈ K : ‖∇f(x)‖_2 ≤ ε and ∇²f(x) ⪰ −√(ρε) · I }.

We note that an approximate local minimum is not necessarily close to any local minimum of f. However, if we assume in addition that the function satisfies the (robust) strict-saddle property [14, 20], then any point x ∈ U is guaranteed to be close to a local minimum. Based on definition (11), we prove a lower bound for the set of approximate local minima.
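In code, the two conditions of definition (11) as reconstructed above amount to a norm check and an eigenvalue check; grad and hess are assumed user-supplied callables:

    import numpy as np

    def is_approx_local_min(grad, hess, x, eps, rho):
        g = grad(x)
        lam_min = np.linalg.eigvalsh(hess(x))[0]   # smallest eigenvalue of the Hessian
        # gradient nearly zero, and no strongly negative curvature direction
        return np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps)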

Proposition 3.

Assume that Assumption B holds. For any ε > 0, let U be the set of ε-approximate local minima defined by (11). For any ξ satisfying

(12)   ξ ≥ ξ_0,   where the threshold ξ_0 = Õ(poly(d, G, L, ρ, 1/ε, r, 1/h)) is given explicitly in Appendix E,

we have a lower bound on C_{ξf}(K\U) that is inverse-polynomial in the same parameters. The Õ notation hides a poly-logarithmic function of d.

Proving Proposition 3 is significantly more challenging than proving Proposition 2. From a high-level point of view, we still construct a vector field φ, then lower bound the expression ⟨∇f(x), φ(x)⟩ − div φ(x) for every point x ∈ K\U in order to apply Lemma 1. However, there exist saddle points in the set K\U where the inner product ⟨∇f(x), φ(x)⟩ can be very close to zero. For these points, we need to carefully design the vector field so that the divergence div φ(x) is strictly negative and bounded away from zero. To this end, we define φ to be the sum of two components. The first component aligns with the gradient ∇f(x). The second component lies in the linear subspace spanned by the eigenvectors of the Hessian ∇²f(x) with negative eigenvalues. It can be shown that the second component produces a strictly negative divergence in the neighborhood of strict saddle points. See Appendix E for the complete proof.

2.5 Polynomial-time bound for finding an approximate local minimum

Combining Proposition 3 with Theorem 1, we conclude that SGLD finds an approximate local minimum of the function f in polynomial time, assuming that f is smooth enough to satisfy Assumption B.

Corollary 1.

Assume that Assumptions A and B hold. For an arbitrary ε > 0, let U be the set of ε-approximate local minima. For any δ ∈ (0, 1), there exist a large enough ξ and hyperparameters (η, kmax, D) such that with probability at least 1 − δ, SGLD returns a solution x̂ satisfying

f(x̂) ≤ sup_{x∈U} f(x).

The iteration number kmax is bounded by a polynomial function of all hyperparameters in the assumptions as well as 1/ε and log(1/δ).

Similarly, we can combine Proposition 1 or Proposition 2 with Theorem 1 to obtain complexity bounds for finding the global minimum of a convex function, or an approximate stationary point of a smooth function.

Corollary 1 doesn’t specify any upper limit on the temperature parameter ξ. As a result, SGLD can be stuck at the worst approximate local minima. It is important to note that the algorithm’s capability of escaping certain local minima relies on a more delicate choice of ξ. Given an objective function F, we consider an arbitrary smooth function f such that ‖F − f‖_∞ ≤ O(1)/ξ. By Theorem 1, for any target subset U, the hitting time of SGLD can be controlled by lower bounding the restricted Cheeger constant C_{ξF}(K\U). By the stability property (6), it is equivalent to lower bounding C_{ξf}(K\U), because ξF and ξf are uniformly close. If ξ is chosen large enough (w.r.t. the smoothness parameters of f), then the lower bound established by Proposition 3 guarantees a polynomial hitting time to the set U of approximate local minima of f. Thus, SGLD can efficiently escape all local minima of F that lie outside of U. Since the function f is arbitrary, it can be thought of as a favorable perturbation of F such that the set U eliminates as many local minima of F as possible. The power of such perturbations is determined by their maximum scale, namely the quantity O(1)/ξ. Therefore, it motivates choosing the smallest possible ξ whenever it satisfies the lower bound in Proposition 3.

The above analysis doesn’t specify any concrete form of the function f. In Section 3, we present a concrete analysis where f is chosen to be the population risk of empirical risk minimization (ERM). We establish sufficient conditions under which SGLD efficiently finds an approximate local minimum of the population risk.

3 Applications to empirical risk minimization

In this section, we apply SGLD to a specific family of functions, taking the form:

f̂(x) := (1/n) Σ_{i=1}^n ℓ(x; z_i).

These functions are generally referred to as the empirical risk in the statistical learning literature. Here, every instance z_i is i.i.d. sampled from a distribution D, and the function ℓ(·; z) measures the loss on the individual sample z. We define the population risk to be the function f(x) := E_{z∼D}[ℓ(x; z)].

We shall prove that under certain conditions, SGLD finds an approximate local minimum of the (presumably smooth) population risk in polynomial time, even if it is executed on a non-smooth empirical risk. More concretely, we run SGLD on a smoothed approximation of the empirical risk that satisfies Assumption A. With a large enough sample size, the empirical risk and its smoothed approximation will be close enough to the population risk f, so that combining the stability property with Theorem 1 and Proposition 3 establishes the hitting time bound. First, let’s formalize the assumptions.

Assumption C (parameter space, loss function and population risk).
  • The parameter space K satisfies: there exists h_max > 0, such that for any x ∈ K and any h ≤ h_max, the random variable y := x + √(2h) · w with w ∼ N(0, I_{d×d}) satisfies P(y ∈ K) ≥ 1/3.

  • There exist G, ν > 0 such that in the set K, the population risk f is G-Lipschitz continuous, and sup_{x∈K} |f̂(x) − f(x)| ≤ ν.

  • For some B > 0, the loss is uniformly bounded: |ℓ(x; z)| ≤ B for any x ∈ K and any z.

The first assumption is identical to that of Assumption A. The second assumption requires the population risk to be Lipschitz continuous, and it bounds the ℓ∞-norm distance between f̂ and f. The third assumption requires the loss ℓ to be uniformly bounded. Note that Assumption C allows the empirical risk to be non-smooth or even discontinuous.

Since the function f̂ can be non-differentiable, the stochastic gradient may not be well defined. We consider a smooth approximation of it following the idea of Duchi et al. [13]:

(13)   f̂_σ(x) := E_u[ f̂(x + u) ],   u ∼ Uniform(B(0, σ)),

where σ > 0 is a smoothing parameter. We can easily compute a stochastic gradient of f̂_σ as follows:

(14)   g(x) := (d/σ) · ℓ(x + σv; z) · v.

Here, z is uniformly sampled from the dataset {z_1, …, z_n} and v is uniformly sampled from the unit sphere. This stochastic gradient formulation is useful when the loss function ℓ is non-differentiable, or when its gradient norms are unbounded. The former happens for minimizing the zero-one loss, and the latter can arise in training deep neural networks [26, 6]. Since the loss function is uniformly bounded by B, formula (14) guarantees that the squared norm ‖g(x)‖_2² ≤ (dB/σ)² is bounded, hence sub-exponential.
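A compact Python sketch of the estimator form reconstructed in (14); loss(x, z) is an assumed per-example loss callable, possibly discontinuous:

    import numpy as np

    rng = np.random.default_rng(0)

    def smoothed_grad(loss, x, data, sigma):
        z = data[rng.integers(len(data))]        # one uniformly sampled instance
        v = rng.normal(size=x.shape)
        v /= np.linalg.norm(v)                   # v uniform on the unit sphere
        d = x.size
        return (d / sigma) * loss(x + sigma * v, z) * v   # unbiased for the gradient of (13)

Unbiasedness follows from the sphere-to-ball identity ∇ E_{u∼B(0,σ)}[f̂(x+u)] = (d/σ) · E_v[f̂(x+σv) · v].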

We run SGLD on the function f̂_σ. Theorem 1 implies that the time complexity inversely depends on the restricted Cheeger constant C_{ξf̂_σ}(K\U). We can lower bound this quantity using C_{ξf}(K\U) — the restricted Cheeger constant of the population risk. Indeed, by choosing a small enough σ ≤ ν/G, it can be shown that ‖f̂_σ − f‖_∞ ≤ 2ν. The stability property (6) then implies

(15)   C_{ξf̂_σ}(K\U) ≥ e^{−4ξν} · C_{ξf}(K\U).

For any ξ ≤ 1/ν, we have e^{−4ξν} ≥ e^{−4}, thus the term C_{ξf̂_σ}(K\U) is lower bounded by e^{−4} · C_{ξf}(K\U). As a consequence, we obtain the following special case of Theorem 1.
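The distance bound used above follows from a one-line triangle inequality (a sketch in LaTeX; it tacitly uses the G-Lipschitz continuity of f near K):

    |\hat f_\sigma(x) - f(x)|
      \le \mathbb{E}_u\bigl[\,|\hat f(x+u) - f(x+u)|\,\bigr]
        + \bigl|\mathbb{E}_u[f(x+u)] - f(x)\bigr|
      \le \nu + G\sigma,
    \qquad u \sim \mathrm{Uniform}(\mathbb{B}(0,\sigma)),

so choosing σ ≤ ν/G gives ‖f̂_σ − f‖_∞ ≤ 2ν.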

Theorem 2.

Assume that Assumption C holds. For any subset U ⊆ K, any δ ∈ (0, 1) and any ξ ≤ 1/ν, there exist hyperparameters (σ, η, kmax, D) such that with probability at least 1 − δ, running SGLD on f̂_σ returns a solution x̂ satisfying:

(16)   f(x̂) ≤ sup_{x∈U} f(x) + 4ν.

The iteration number kmax is polynomial in (d, B, ξ, log(1/δ), 1/C_{ξf}(K\U)).

See Appendix F for the proof.

In order to lower bound the restricted Cheeger constant C_{ξf}(K\U), we resort to the general lower bounds in Section 2.4. Consider population risks that satisfy the conditions of Assumption B. By combining Theorem 2 with Proposition 3, we conclude that SGLD finds an approximate local minimum of the population risk in polynomial time.

Corollary 2.

Assume that Assumption C holds. Also assume that Assumption B holds for the population risk f with smoothness parameters (G, L, ρ). For any ε > 0 and δ ∈ (0, 1), let U be the set of ε-approximate local minima of f. If the ℓ∞-distance ν between the empirical risk and the population risk satisfies

(17)   ν ≤ Õ(1)/ξ_0,

where ξ_0 is the lower bound on ξ required by condition (12), then there exist hyperparameters (ξ, σ, η, kmax, D) such that with probability at least 1 − δ, running SGLD on f̂_σ returns a solution x̂ satisfying f(x̂) ≤ sup_{x∈U} f(x) + 4ν. The time complexity will be bounded by a polynomial function of all hyperparameters in the assumptions as well as 1/ε and log(1/δ). The Õ notation hides a poly-logarithmic function of d.

Assumption B requires the population risk to be sufficiently smooth. Nonetheless, assuming smoothness of the population risk is relatively mild, because even if the loss function is discontinuous, the population risk can be smooth given that the data is drawn from a smooth density. The generalization bound (17) is a necessary condition, because the constraint ξ ≤ 1/ν for Theorem 2 and the constraint (12) for Proposition 3 must simultaneously hold. With a large sample size n, the empirical risk can usually be made sufficiently close to the population risk. There are multiple ways to bound the ℓ∞-distance between the empirical risk and the population risk, either by bounding the VC-dimension [32], or by bounding the metric entropy [15] or the Rademacher complexity [5] of the function class. We note that for many problems, the gap sup_{x∈K} |f̂(x) − f(x)| uniformly converges to zero at a rate of O(n^{−c}) for some constant c > 0. For such problems, the condition (17) can be satisfied with a polynomial sample complexity.

4 Learning linear classifiers with zero-one loss

As a concrete application, we study the problem of learning linear classifiers with zero-one loss. The learner observes i.i.d. training instances (x_1, y_1), …, (x_n, y_n), where (x_i, y_i) ∈ R^d × {−1, +1} are feature-label pairs. The goal is to learn a linear classifier x ↦ sign(⟨w, x⟩) in order to minimize the zero-one loss:

f(w) := E[ 1( y ⟨w, x⟩ ≤ 0 ) ].

For a finite dataset {(x_i, y_i)}_{i=1}^n, the empirical risk is f̂(w) := (1/n) Σ_{i=1}^n 1( y_i ⟨w, x_i⟩ ≤ 0 ). Clearly, the function f̂ is non-convex and discontinuous, and has zero gradient almost everywhere. Thus the optimization cannot be accomplished by gradient descent.

For a general data distribution, finding a global minimizer of the population risk is NP-hard [3]. We follow Awasthi et al. [4] in assuming that the feature vectors are drawn uniformly from the unit sphere, and that the observed labels are corrupted by Massart noise. More precisely, we assume that there is an unknown unit vector w* such that for every feature x, the observed label y satisfies:

(18)   y = sign(⟨w*, x⟩) with probability (1 + q(x))/2;   y = −sign(⟨w*, x⟩) with probability (1 − q(x))/2,

where (1 − q(x))/2 ∈ [0, 1/2] is the Massart noise level. We assume that the noise level is strictly smaller than 1/2 when the feature vector is separated apart from the decision boundary. Formally, there is a constant c_0 > 0 such that

(19)   q(x) ≥ c_0 · |⟨w*, x⟩|.

The value of q(x) can be adversarially perturbed as long as it satisfies the constraint (19). Awasthi et al. [4] studied the same Massart noise model, but they impose a stronger constraint on the noise level, requiring it to be bounded by a tiny constant for all x, so that almost all observed labels are accurate. In contrast, our model (19) captures arbitrary Massart noises (because c_0 can be arbitrarily small), and allows for completely random observations at the decision boundary. Our model is thus more general than that of Awasthi et al. [4].
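The following Python sketch generates data from this model for experimentation; it is hypothetical, and it instantiates the adversarial noise with the particular choice q(x) = min{1, c_0 · |⟨w*, x⟩|}, which satisfies (19):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_massart(n, d, w_star, c0):
        x = rng.normal(size=(n, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)   # features uniform on the unit sphere
        margin = x @ w_star
        q = np.minimum(1.0, c0 * np.abs(margin))        # per-example noise parameter q(x)
        y = np.sign(margin)
        flip = rng.random(n) > (1 + q) / 2              # flip label with probability (1 - q)/2
        y[flip] *= -1
        return x, y

Note that examples on the decision boundary (margin near zero) receive labels that are nearly uniformly random, as the model allows.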

Given the function f̂, we use SGLD to optimize its smoothed approximation (13) in a compact parameter space K. The following theorem shows that the algorithm finds an approximate global optimum in polynomial time, with a polynomial sample complexity.

Theorem 3.

Assume that the features are drawn uniformly from the unit sphere and the labels are generated by the Massart noise model (18)-(19). For any ε > 0 and δ ∈ (0, 1), if the sample size n is larger than a polynomial of (d, 1/c_0, 1/ε), up to poly-logarithmic factors (see Appendix G for the explicit bound), then there exist hyperparameters (ξ, σ, η, kmax, D) such that SGLD on the smoothed function (13) returns a solution ŵ satisfying f(ŵ) ≤ f(w*) + ε with probability at least 1 − δ. The time complexity of the algorithm is polynomial in (d, 1/c_0, 1/ε, log(1/δ)).

The proof consists of two parts. For the first part, we prove that the population risk f is Lipschitz continuous and that the empirical risk uniformly converges to the population risk, so that Assumption C holds. For the second part, we lower bound the restricted Cheeger constant via Lemma 1. The proof is spiritually similar to that of Proposition 2 or Proposition 3: we define U to be the set of approximately optimal solutions, and construct a suitable vector field φ adapted to the geometry of the problem. By lower bounding the expression ⟨∇f(w), φ(w)⟩ − div φ(w) for all w ∈ K\U, Lemma 1 establishes a lower bound on the restricted Cheeger constant. The theorem is established by combining the two parts with Theorem 2. We defer the full proof to Appendix G.
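Putting the earlier sketches together, a hypothetical end-to-end experiment looks as follows (it reuses sample_massart, smoothed_grad and sgld defined above; all hyperparameter values are illustrative):

    import numpy as np

    d = 10
    w_star = np.zeros(d); w_star[0] = 1.0
    x, y = sample_massart(5000, d, w_star, c0=0.5)
    data = list(zip(x, y))

    loss = lambda w, zy: float(zy[1] * (zy[0] @ w) <= 0)    # zero-one loss on one example
    f_hat = lambda w: float(np.mean(y * (x @ w) <= 0))      # empirical zero-one risk
    grad = lambda w: smoothed_grad(loss, w, data, sigma=0.05)  # estimator (14)

    w_hat = sgld(f_hat, grad, d=d, R=1.0, eta=1e-3, xi=50.0, D=0.5, k_max=500)
    print("zero-one risk of SGLD solution:", f_hat(w_hat))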

5 Conclusion

In this paper, we analyzed the hitting time of the SGLD algorithm on non-convex functions. Our approach differs from existing analyses of Langevin dynamics [10, 12, 8, 30, 27], which connect LMC to a continuous-time Langevin diffusion process and then study the mixing time of the latter process. In contrast, we are able to establish polynomial-time guarantees for hitting certain optimality sets, regardless of the potentially exponential mixing time.

For future work, we hope to establish stronger results on non-convex optimization using the techniques developed in this paper. Our current analysis doesn’t apply to training over-specified models. For these models, the empirical risk can be minimized far below the population risk [29], thus the assumption of Corollary 2 is violated. In practice, over-specification often makes the optimization easier, thus it could be interesting to show that this heuristic actually improves the restricted Cheeger constant. Another open problem is avoiding poor population local minima. Jin et al. [16] proved that there are many poor population local minima in training Gaussian mixture models. It would be interesting to investigate whether a careful initialization could prevent SGLD from hitting such bad solutions.

References

Appendix A Restricted Cheeger constant is strictly positive

In this appendix, we prove that under mild conditions, the restricted Cheeger constant for a convex parameter space is always strictly positive. Let K be an arbitrary convex parameter space with finite diameter D. Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: for any subset A ⊆ K and any ε > 0, the following lower bound holds:

(20)   vol(A_ε) − vol(A) ≥ (2ε/D) · min{ vol(A), vol(K) − vol(A_ε) },

where vol(·) represents the Borel measure of a set. Let f_0 ≡ 0 be the constant zero function. By the definition of the function-induced probability measure, we have

(21)   μ_{f_0}(A) = vol(A)/vol(K)   for any measurable A ⊆ K.

Combining the inequality (20) with equation (21), we obtain:

μ_{f_0}(A_ε) − μ_{f_0}(A) ≥ (2ε/D) · min{ μ_{f_0}(A), 1 − μ_{f_0}(A_ε) }.

If the set A_ε satisfies μ_{f_0}(A_ε) ≤ 1/2, then the minimum on the right-hand side equals μ_{f_0}(A). Combining it with the above inequality, we obtain:

(μ_{f_0}(A_ε) − μ_{f_0}(A)) / (ε μ_{f_0}(A)) ≥ 2/D.

According to the definition of the restricted Cheeger constant, the above lower bound implies:

(22)   C_{f_0}(V) ≥ 2/D   for any V ⊆ K satisfying vol(V) ≤ vol(K)/2.

Consider an arbitrary bounded function f satisfying sup_{x∈K} |f(x)| ≤ B. Combining the stability property (6) and inequality (22), we obtain:

C_f(V) ≥ e^{−2B} · C_{f_0}(V) ≥ 2e^{−2B}/D > 0.

We summarize the result as the following proposition.

Proposition 4.

Assume that K is a convex parameter space with finite diameter. Also assume that V ⊆ K is a measurable set satisfying vol(V) ≤ vol(K)/2. For any bounded function f: K → R, the restricted Cheeger constant C_f(V) is strictly positive.

Appendix B Proof of Theorem 1

The proof consists of two parts. We first establish a general bound on the hitting time of Markov chains to a certain subset U ⊆ K, based on the notion of restricted conductance. Then we prove that the hitting time of SGLD can be bounded by the hitting time of a carefully constructed time-reversible Markov chain. This Markov chain runs a Metropolis-Hastings algorithm that converges to the stationary distribution μ_{ξf}. We prove that this Markov chain has a bounded restricted conductance, whose value is characterized by the restricted Cheeger constant that we introduced in Section 2.2. Combining the two parts establishes the general theorem.

B.1 Hitting time of Markov chains

For an arbitrary Markov chain defined on the parameter space K, we represent the Markov chain by its transition kernel p(x, A), which gives the conditional probability that the next state belongs to the set A ⊆ K given the current state x. Similarly, we use p(x, x′) to represent the conditional probability density of transitioning from x to x′. If the chain has a stationary distribution, then we denote it by Q.

A Markov chain is called lazy if p(x, {x}) ≥ 1/2 for every x ∈ K, and is called time-reversible if it satisfies

Q(dx) p(x, dy) = Q(dy) p(y, dx)   for all x, y ∈ K.

If (x_0, x_1, x_2, …) is a realization of the Markov chain, then the hitting time to some set U ⊆ K is denoted by:

τ_U := min{ k : x_k ∈ U }.

For an arbitrary subset V ⊆ K, we define the restricted conductance, denoted by Φ(V), to be the following infimum ratio:

(23)   Φ(V) := inf_{A⊆V} ( ∫_A p(x, K\A) dQ(x) ) / Q(A).

Based on the notion of restricted conductance, we present a general upper bound on the hitting time. For an arbitrary subset U ⊆ K, suppose that M is an arbitrary Markov chain whose transition kernel is stationary inside U, namely it satisfies p(x, {x}) = 1 for any x ∈ U. Let (x_0, x_1, x_2, …) be a realization of the Markov chain M. We denote by Q_k the probability distribution of x_k at iteration k. In addition, we define a measure of closeness between any two Markov chains.

Definition.

For two Markov chains M and M̃ with transition kernels p and p̃, we say that M is ε-close to M̃ w.r.t. a set U if the following condition holds for any x ∈ K\U and any measurable set A ⊆ K\{x}:

(24)   (1 − ε) p̃(x, A) ≤ p(x, A) ≤ (1 + ε) p̃(x, A).

Then we are able to prove the following lemma.

Lemma 2.

Let M̃ be a time-reversible lazy Markov chain with atom-free stationary distribution Q̃, and let Φ̃ denote its restricted conductance. Assume that M is ε-close to M̃ w.r.t. U, where ε ≤ Φ̃(K\U)/4. If there is a constant M_0 such that the initial distribution Q_0 satisfies Q_0(A) ≤ M_0 · Q̃(A) for any A ⊆ K\U, then for any δ > 0, the hitting time of the Markov chain M is bounded by:

(25)   τ_U ≤ (4 log(M_0/δ)) / Φ̃(K\U)²,

with probability at least 1 − δ.

See Appendix B.3.1 for the proof of Lemma 2. The lemma shows that if the two chains M and M̃ are sufficiently close, then the hitting time of the Markov chain M will be inversely proportional to the square of the restricted conductance of the Markov chain M̃, namely Φ̃(K\U). Note that if the density function of the distribution Q̃ is bounded, then by choosing Q_0 to be the uniform distribution over K, there exists a finite constant M_0 such that Q_0(A) ≤ M_0 · Q̃(A), satisfying the last condition of Lemma 2.

B.2 Proof of the theorem

The SGLD algorithm initializes x_0 by sampling from the uniform distribution over K. Then at each iteration k ≥ 1, it performs the following update:

(26)   y_k = x_{k−1} − η g(x_{k−1}) + √(2η/ξ) · w_k;   x_k = y_k if y_k ∈ K ∩ B(x_{k−1}, D), otherwise x_k = x_{k−1}.

We refer to the particular setting ξ = 1 as the “standard setting”. For the “non-standard” setting of ξ ≠ 1, we rewrite the first equation as:

y_k = x_{k−1} − (η/ξ) · (ξ g(x_{k−1})) + √(2(η/ξ)) · w_k.

This reformulation reduces the problem to the standard setting, with stepsize η/ξ and objective function ξf (whose stochastic gradient is ξg). Thus it suffices to prove the theorem in the standard setting, then plug in the stepsize η/ξ and the objective function ξf to obtain the general theorem. Therefore, we assume ξ = 1 and consider the sequence of points generated by:

(27)   y_k = x_{k−1} − η g(x_{k−1}) + √(2η) · w_k;   x_k = y_k if y_k ∈ K ∩ B(x_{k−1}, D), otherwise x_k = x_{k−1}.

We introduce two additional notations: for arbitrary functions, we denote the maximal gap