First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

06/21/2019 ∙ by Thanh Huy Nguyen, et al. ∙ 12

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using α-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.

1 Introduction

Stochastic gradient descent (SGD) is one of the most popular algorithms in machine learning due to its scalability to large dimensional problems as well as favorable generalization properties. SGD algorithms are applicable to a broad set of convex and non-convex optimization problems arising in machine learning [1, 2], including deep learning where they have been particularly successful [3, 4, 5]. In deep learning, many key tasks can be formulated as the following non-convex optimization problem:

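$$\min_{w \in \mathbb{R}^d} \Big\{ f(w) := \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(w) \Big\}, \tag{1}$$

where $w \in \mathbb{R}^d$ contains the weights for the deep network to estimate, $f^{(i)}$ is the typically non-convex loss function corresponding to the $i$-th data point, and $n$ is the number of data points [6, 7, 5]. SGD iterations consist of

$$W^{k+1} = W^k - \eta \nabla \tilde{f}_k(W^k), \tag{2}$$

where $\eta$ is the step-size, $k$ denotes the iterations, $W^0$ is the initial point, and $\nabla \tilde{f}_k$ is an unbiased estimator of the actual gradient $\nabla f$, estimated from a subset of the component functions $\{f^{(i)}\}_{i=1}^{n}$. In particular, the gradients of the objective are estimated as averages of the form

$$\nabla \tilde{f}_k(W^k) := \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(W^k), \tag{3}$$

where $\Omega_k \subset \{1, \dots, n\}$ is a random subset that is drawn with or without replacement at iteration $k$, and $b = |\Omega_k|$ denotes the number of elements in $\Omega_k$ [1].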
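The mini-batch estimator described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy losses $f^{(i)}(w) = \tfrac{1}{2}(w - c_i)^2$, the helper names `minibatch_gradient` and `sgd`, and all numerical values are hypothetical and do not come from the paper.

```python
import random

def minibatch_gradient(grads, w, batch_size, rng):
    """Average the per-example gradients over a random subset drawn without replacement."""
    omega = rng.sample(range(len(grads)), batch_size)
    return sum(grads[i](w) for i in omega) / batch_size

def sgd(grads, w0, eta, batch_size, num_iters, seed=0):
    """Run the plain SGD recursion with a constant step-size eta."""
    rng = random.Random(seed)
    w = w0
    for _ in range(num_iters):
        w = w - eta * minibatch_gradient(grads, w, batch_size, rng)
    return w

# Hypothetical toy problem: f_i(w) = 0.5 * (w - c_i)^2, so grad f_i(w) = w - c_i.
centers = [1.0, 2.0, 3.0, 4.0]
grads = [lambda w, c=c: w - c for c in centers]
w_final = sgd(grads, w0=0.0, eta=0.1, batch_size=2, num_iters=2000)
```

With a constant step-size the iterate does not converge to the minimizer of the average loss (here 2.5) but keeps fluctuating around it; this residual fluctuation is the gradient noise whose distribution the rest of the paper models.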

The popularity and success of SGD in practice have motivated researchers to investigate the reasons behind it, which has been an active research area [6, 4]. One well-known hypothesis [8] that has gained recent popularity (see e.g. [4, 9]) is that among all the local minima lying on the non-convex energy landscape defined by the loss function (1), local minima that lie in wider valleys generalize better than those in sharp valleys, and that SGD is able to converge to the “right local minimum” that generalizes better. This is visualized in Figure 1(right), where the local minimum on the right lies in a wide valley, whereas the local minimum on the left lies in a narrow, sharp valley. Interpreting this hypothesis and the structure of the local minima found by SGD clearly requires a deeper understanding of the statistical properties of the gradient noise and its implications on the dynamics of SGD. A number of papers in the literature argue that the noise has Gaussian structure [10, 7, 11, 12, 13, 3]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD:

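$$\mathrm{d}W(t) = -\nabla f(W(t))\,\mathrm{d}t + \sqrt{\eta \sigma^2}\,\mathrm{d}B(t), \tag{4}$$

where $B(t)$ is the standard Brownian motion, $\sigma^2$ is the noise variance, and $\eta$ is the step-size.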

Figure 1: Illustration of $\mathcal{S}\alpha\mathcal{S}$ distributions (left), $\alpha$-stable Lévy motion (middle), and wide-narrow minima (right).

The Gaussianity of the gradient noise implicitly assumes that the gradient noise has a finite variance with light tails. In a recent study, [6] empirically illustrated that in various deep learning settings, the gradient noise admits a heavy-tailed behavior, which suggests that the Gaussian-based approximation is not always appropriate, and furthermore, that the heavy-tailed noise could be modeled by a symmetric $\alpha$-stable distribution ($\mathcal{S}\alpha\mathcal{S}(\sigma)$). Here, $\alpha \in (0, 2]$ is called the tail-index and characterizes the heavy-tailedness of the distribution, and $\sigma$ is a scale parameter that will be formally defined in Section 2. This $\alpha$-stable model generalizes the Gaussian model in the sense that $\alpha = 2$ reduces to the Gaussian model, whereas smaller values of $\alpha$ quantify the heavy-tailedness of the gradient noise (see Figure 1(left)). Under this noise model, the resulting continuous-time limit of SGD becomes [6]:

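$$\mathrm{d}W(t) = -\nabla f(W(t-))\,\mathrm{d}t + \eta^{(\alpha-1)/\alpha} \sigma\,\mathrm{d}L^\alpha(t), \tag{5}$$

where $L^\alpha(t)$ is the $d$-dimensional $\alpha$-stable Lévy motion with independent components (which will be formally defined in Section 2), and $W(t-)$ denotes the left limit of the process at time $t$. This process has also been investigated for global non-convex optimization in a recent study [14].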

The sample paths of the Lévy-driven SDE (5) behave fundamentally differently from those of the Brownian-motion-driven dynamics (4). This difference mainly originates from the fact that, unlike the Brownian motion, which has almost surely continuous sample paths, the Lévy motion can have discontinuities, which are also called ‘jumps’ [15] (cf. Figure 1(middle)). This fundamental difference becomes more prominent in the metastability properties of the SDE (5). The metastability studies consider the case where $W(0)$ is initialized in a basin and analyze the minimum time $t$ such that $W(t)$ exits that basin. It has been shown that when $\alpha < 2$ (i.e. the noise has a heavy-tailed component), this so-called first exit time only depends on the width of the basin and the value of $\alpha$, and it does not depend on the height of the basin [16, 17, 18]. The empirical results in [6] showed that, in various deep learning settings, the estimated tail index is significantly smaller than 2, suggesting that the metastability results can be used as a proxy for understanding the dynamics of SGD in discrete time, especially to shed more light on the hypothesis that SGD prefers wide minima.

While this approach brings a new perspective for analyzing SGD, approximating SGD by a continuous-time process might not be accurate for any step-size $\eta$, and some theoretical concerns have already been raised about the validity of such approximations [19]. Intuitively, one can expect that the metastable behavior of SGD would be similar to the behavior of its continuous-time limit only when the discretization step-size is small enough. Even though some theoretical results have been recently established for the discretizations of SDEs driven by Brownian motion [20], it is not clear how the discretized Lévy SDEs behave in terms of metastability.

In this study, we provide formal theoretical analyses where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system (7) is guaranteed to be close to its continuous-time limit (6). More precisely, we consider a stochastic differential equation with both a Brownian term and a Lévy term, and its Euler discretization as follows [21]:

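$$\mathrm{d}W(t) = -\nabla f(W(t-))\,\mathrm{d}t + \varepsilon \sigma\,\mathrm{d}B(t) + \varepsilon\,\mathrm{d}L^\alpha(t), \tag{6}$$

$$W^{k+1} = W^k - \eta \nabla f(W^k) + \varepsilon \sigma \eta^{1/2} \xi_k + \varepsilon \eta^{1/\alpha} \zeta_k, \tag{7}$$

with independent and identically distributed (i.i.d.) variables $\xi_k \sim \mathcal{N}(0, I)$, where $I$ is the identity matrix, the components of $\zeta_k$ are i.i.d. with $\mathcal{S}\alpha\mathcal{S}(1)$ distribution, and $\varepsilon$ is the amplitude of the noise. These dynamics include (4) and (5) as special cases. Here, we choose $\sigma$ as a scalar for convenience; however, our analyses could be easily extended to the case where $\sigma$ is a function of $W(t)$.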
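As a rough illustration of the Euler discretization, the sketch below performs a single update of an SDE with drift $-\nabla f$, a Brownian term of amplitude $\varepsilon\sigma$ and an $\alpha$-stable term of amplitude $\varepsilon$. It assumes the standard self-similarity scalings (a Brownian increment over a step $\eta$ scales as $\eta^{1/2}$, an $\alpha$-stable increment as $\eta^{1/\alpha}$). The function name `euler_step` and the convention of passing the noise draws `xi` and `zeta` as arguments are hypothetical choices made here so that the step is deterministic and easy to check.

```python
import math

def euler_step(w, grad_f, eta, eps, sigma, alpha, xi, zeta):
    """One coordinate-wise update:
    w - eta * grad_f(w) + eps * sigma * sqrt(eta) * xi + eps * eta**(1/alpha) * zeta."""
    g = grad_f(w)
    return [
        w_i - eta * g_i
        + eps * sigma * math.sqrt(eta) * xi_i      # Gaussian part
        + eps * eta ** (1.0 / alpha) * zeta_i      # heavy-tailed part
        for w_i, g_i, xi_i, zeta_i in zip(w, g, xi, zeta)
    ]

# Hypothetical example: f(w) = 0.5 * ||w||^2, whose gradient is w itself.
# With the noise switched off, the step reduces to plain gradient descent.
w_next = euler_step([1.0, -2.0], lambda w: w, eta=0.25, eps=0.1, sigma=1.0,
                    alpha=2.0, xi=[0.0, 0.0], zeta=[0.0, 0.0])
```

Because $1/\alpha > 1/2$ whenever $\alpha < 2$, the heavy-tailed term is multiplied by a smaller power of $\eta$ than the Gaussian term for small steps; its influence comes from rare large draws of `zeta` rather than from its typical size.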

Understanding the metastability behavior of SGD modeled by these dynamics requires understanding how long it takes for the continuous-time process given by (6) and its discretization (7) to exit a neighborhood of a local minimum , if it is started in that neighborhood. For this purpose, for any given local minimum of and , we define the following set

(8)

which is the set of points in , each at a distance of at most from the local minimum . We formally define the first exit times, respectively for and as follows:

(9)
(10)
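For the discrete-time process, a first exit time can be read off a simulated trajectory directly. The sketch below assumes the simplest reading of the definition, namely the first iteration index at which the iterate lies outside a closed ball around the local minimum; the helper name `first_exit_index` and the example path are hypothetical.

```python
import math

def first_exit_index(path, center, radius):
    """Return the smallest k with ||path[k] - center|| > radius, or None if the path never leaves."""
    for k, w in enumerate(path):
        if math.dist(w, center) > radius:
            return k
    return None

# Hypothetical 2-D trajectory: the third point lies outside the unit ball around the origin.
path = [[0.0, 0.0], [0.5, 0.0], [0.9, 0.9], [0.1, 0.0]]
k_exit = first_exit_index(path, center=[0.0, 0.0], radius=1.0)
```

To compare with a continuous-time exit time, the index is converted to physical time by multiplying it with the step-size.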

Our main result (Theorem 2) shows that with a sufficiently small discretization step $\eta$, the probability that the discretized process exits a given neighborhood of the local optimum at a fixed time approximates that of the continuous process. This result also provides an explicit condition for the step-size, which explains certain impacts of the other parameters of the problem, such as the dimension $d$, the noise amplitude $\varepsilon$, and the variance of the Gaussian noise $\sigma$, on the similarity of the discretized and continuous processes. We validate our theory on a synthetic model and neural networks.

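Notations. For $z > 0$, the gamma function is defined as $\Gamma(z) := \int_0^\infty x^{z-1} e^{-x}\,\mathrm{d}x$. For any Borel probability measures $\mu$ and $\nu$ with domain $\Omega$, the total variation (TV) distance is defined as follows: $\|\mu - \nu\|_{TV} := 2 \sup_{A \in \mathcal{B}(\Omega)} |\mu(A) - \nu(A)|$, where $\mathcal{B}(\Omega)$ denotes the Borel subsets of $\Omega$.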

2 Technical Background

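Symmetric $\alpha$-stable distributions. The $\mathcal{S}\alpha\mathcal{S}$ distribution is a generalization of a centered Gaussian distribution where $\alpha \in (0, 2]$ is called the tail index, a parameter that determines the amount of heavy-tailedness. We say that $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ if its characteristic function is $\mathbb{E}[e^{i \omega X}] = \exp(-|\sigma \omega|^\alpha)$, where $\sigma \in (0, \infty)$ is called the scale parameter. In the special case $\alpha = 2$, $\mathcal{S}\alpha\mathcal{S}(\sigma)$ reduces to the normal distribution $\mathcal{N}(0, 2\sigma^2)$. A crucial property of the $\alpha$-stable distributions is that, when $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ with $\alpha < 2$, the moment $\mathbb{E}[|X|^p]$ is finite if and only if $p < \alpha$, which implies that $X$ has infinite variance as soon as $\alpha < 2$. While the probability density function does not have a closed-form analytical expression except for a few special cases of $\alpha$ (e.g. $\alpha = 2$: Gaussian, $\alpha = 1$: Cauchy), it is computationally easy to draw random samples from it by using the method proposed in [22].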
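To make the sampling remark concrete, here is a minimal sketch of one standard sampler for symmetric stable laws, the Chambers-Mallows-Stuck transformation of a uniform angle and an exponential variable. Whether this coincides in every detail with the method of [22] is an assumption; the function name `sample_sas` and its signature are hypothetical.

```python
import math
import random

def sample_sas(alpha, scale=1.0, rng=random):
    """Draw one X with characteristic function exp(-|scale * omega|**alpha), for 0 < alpha <= 2."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)   # uniform angle
    w = rng.expovariate(1.0)                     # unit-rate exponential
    if alpha == 1.0:
        return scale * math.tan(u)               # Cauchy special case
    x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return scale * x

# A few draws with a heavy tail (alpha = 1.5); occasional very large values are expected.
rng = random.Random(0)
samples = [sample_sas(1.5, rng=rng) for _ in range(5)]
```

For $\alpha = 2$ the expression simplifies to $2 \sin(U) \sqrt{W}$, which is Gaussian with variance 2, in agreement with the statement that the unit-scale symmetric stable law reduces to $\mathcal{N}(0, 2)$.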

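Lévy processes and SDEs driven by Lévy motions. The standard $\alpha$-stable Lévy motion $L^\alpha(t)$ on the real line is the unique process satisfying the following properties [21]:

  1. For any $t_0 < t_1 < \dots < t_N$, its increments $L^\alpha(t_n) - L^\alpha(t_{n-1})$ are independent for $n = 1, \dots, N$, and $L^\alpha(0) = 0$ almost surely.

  2. $L^\alpha(t) - L^\alpha(s)$ and $L^\alpha(t - s)$ have the same distribution $\mathcal{S}\alpha\mathcal{S}((t-s)^{1/\alpha})$ for any $s < t$.

  3. $L^\alpha(t)$ is continuous in probability: for all $\delta > 0$ and $s \geq 0$, $\mathbb{P}(|L^\alpha(t) - L^\alpha(s)| > \delta) \to 0$ as $t \to s$.

When $\alpha = 2$, $L^\alpha(t)$ reduces to a scaled version of the standard Brownian motion, $\sqrt{2} B(t)$. Since $L^\alpha(t)$ for $\alpha < 2$ is only continuous in probability, it can incur a countable number of discontinuities at random times, which makes it fundamentally different from the Brownian motion that has almost surely continuous paths.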
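The qualitative difference between the two kinds of paths can be seen on a discretized time grid. The sketch below is an illustration under assumptions, not part of the paper: it compares Gaussian increments with Cauchy increments (the $\alpha = 1$ case, chosen because it can be sampled with a single tangent) and measures how much of the total movement is due to the single largest increment. The grid size, step length and helper name are hypothetical.

```python
import math
import random

def largest_increment_share(increments):
    """Fraction of the total absolute movement contributed by the single largest increment."""
    sizes = [abs(x) for x in increments]
    return max(sizes) / sum(sizes)

rng = random.Random(1)
n, dt = 10_000, 1e-3   # hypothetical grid: n steps of length dt

# Brownian increments over a step of length dt scale as dt**(1/2).
brownian = [math.sqrt(dt) * rng.gauss(0.0, 1.0) for _ in range(n)]
# Cauchy (alpha = 1) increments over a step of length dt scale as dt**(1/alpha) = dt.
cauchy = [dt * math.tan(rng.uniform(-math.pi / 2, math.pi / 2)) for _ in range(n)]

share_brownian = largest_increment_share(brownian)
share_cauchy = largest_increment_share(cauchy)
```

For the Gaussian increments no single step matters, whereas for the heavy-tailed increments one step typically accounts for a visible fraction of the whole path, which is the discrete counterpart of a jump.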

The $d$-dimensional Lévy motion with independent components is a stochastic process on $\mathbb{R}^d$ where each coordinate corresponds to an independent scalar Lévy motion. Stochastic processes based on Lévy motion such as (5) and their mathematical properties have also been studied in the literature; we refer the reader to [23, 15] for details.

First Exit Times of Continuous-Time Lévy Stable SDEs. Due to the discontinuities of the Lévy-driven SDEs, their metastability behaviors also differ significantly from their Brownian counterparts. In this section, we will briefly mention important theoretical results about the SDE given in (6).

For simplicity, let us consider the SDE (6) in dimension one, i.e. $d = 1$. In a relatively recent study [16], the authors considered this SDE, where the potential function is required to have a non-degenerate global minimum at the origin, and they proved the following theorem.

Theorem 1 ([16]).

Consider the SDE (6) in dimension and assume that it has a unique strong solution. Assume further that the objective has a global minimum at zero, satisfying the conditions , , if and only if , and . Then, there exist positive constants , , , and such that for , the following holds:

(11)

uniformly for all and , where . Consequently,

(12)

This result indicates that the first exit time of the Lévy-driven SDE needs only polynomial time with respect to the width of the basin and does not depend on the depth of the basin, whereas Brownian systems need exponential time in the height of the basin in order to exit from the basin [24, 17]. This difference is mainly due to the discontinuities of the Lévy motion, which enable it to ‘jump out’ of the basin, whereas the Brownian SDEs need to ‘climb’ the basin due to their continuity. Consequently, given that the gradient noise exhibits similar heavy-tailed behavior to an $\alpha$-stable distributed random variable, this result can be considered as a proxy to understand the wide-minima behavior of SGD.

We note that this result has already been extended to the multi-dimensional case in [18]. An extension to state-dependent noise has also been obtained in [25]. We also note that the metastability phenomenon is closely related to the spectral gap of the forward operator corresponding to the SDE dynamics (see e.g. [24]), and it is known that this quantity scales like $\varepsilon^\alpha$ for small $\varepsilon$, which determines the dependence on $\varepsilon$ in the first term of the exit time (12) due to Kramers' law [26, 27]. Burghoff and Pavlyukevich [27] showed that a similar scaling in $\varepsilon$ for the spectral gap would hold if we were to restrict the SDE dynamics to a discrete grid with a small enough grid size.

3 Assumptions and the Main Result

In this study, our main goal is to obtain an explicit condition on the step-size, such that the first exit time of the continuous-time process (9) would be similar to the first exit time of its Euler discretization (10).

We first state our assumptions.

A 1.

The SDE (6) admits a unique strong solution.

A 2.

The process satisfies

A 3.

The gradient of is -Hölder continuous with :

A 4.

The gradient of satisfies the following assumption: .

A 5.

For some and , is -dissipative: , .

We note that, as opposed to the theory of SDEs driven by Brownian motion, the theory of Lévy-driven SDEs is still an active research field where even the existence of solutions with general drift functions is not well-established and the main contributions have appeared in the last decade [28, 29]. Therefore, A1 has been a common assumption in stochastic analysis, e.g. [16, 18, 30]. Nevertheless, existence and uniqueness results have been very recently established in [29] for SDEs with bounded Hölder drifts. Therefore A1 and A2 directly hold for bounded gradients and extending this result to Hölder and dissipative drifts is out of the scope of this study. On the other hand, the assumptions A3-A5 are standard conditions, which are often considered in non-convex optimization algorithms that are based on discretization of diffusions [31, 32, 33, 34, 35].

Now, we identify an explicit condition for the step-size, which is one of our main contributions.

A 6.

For a given , , and for some , the step-size satisfies the following condition:

where is as in (7), the constants are defined by A3–A5 and

with

We now present our main result; its proof can be found in the supplementary material.

Theorem 2.

Under assumptions A1–A6, the following inequality holds:

where,

for some constants and that do not depend on or , is given by A3 and is as in (6)–(7).

Exit time versus problem parameters. In Theorem 2, if we let go to zero for any fixed, the constant will also go to zero, and since can be chosen arbitrarily small, this implies that the probability of the first exit time for the discrete process and the continuous process will approach each other when the step-size gets smaller, as expected. If instead, we decrease or , the quantity also decreases monotonically, but it does not go to zero due to the first term in the expression of .

Exit time versus width of local minima.

Popular activation functions used in deep learning such as ReLU functions are almost everywhere differentiable and therefore the cost function has a well-defined Hessian almost everywhere (see e.g. [36]). The eigenvalues of the Hessian of the objective near local minima have also been studied in the literature (see e.g. [37, 38]). If the Hessian around a local minimum is positive definite, the conditions for the multi-dimensional version of Theorem 1 in [18] are satisfied locally around a local minimum. For local minima lying in wider valleys, the width parameter can be taken to be larger, in which case the expected exit time will be larger by the formula (12). In other words, the SDE (5) takes more time to exit wider valleys. Theorem 2 shows that SGD modeled by the discretization of this SDE will also inherit a similar behavior if the step-size satisfies the conditions we provide.

4 Proof Overview

Relating the first exit times of the continuous-time process and its discretization often requires obtaining bounds on the distance between the two processes. In particular, if this distance is small with high probability, then we expect that their first exit times from the neighborhood will be close to each other as well with high probability.

For objective functions with bounded gradients, in order to relate to , one can attempt to use the strong convergence of the Euler scheme (cf. [39] Proposition 1): . By using Markov’s inequality, this result implies convergence in probability: for any and , there exists such that . Then, if then one of the following events must happen:

  1. ,

  2. and (with probability less than ),

  3. and distance from to is at most (with probability less than ).

By using this observation, we obtain: . Even though we could use this result in order to relate to , this approach would not yield a meaningful condition for the step-size, since the bounds for the strong error often grow exponentially in general with , which means the step-size should be chosen exponentially small for a given . Therefore, in our strategy, we choose a different path where we do not use the strong convergence of the Euler scheme.

Our proof strategy is inspired by the recent study [20], where the authors analyze the empirical metastability of the Langevin equation which is driven by a Brownian motion. However, unlike the Brownian case that [20] was based on, some of the tools for analyzing Brownian SDEs do not exist for Lévy-driven SDEs, which increases the difficulty of our task.

We first define a linearly interpolated version of the discrete-time process (7), which will be useful in our analysis, given as follows:

(13)

where denotes the whole process and the drift function is chosen as follows:

Here, denotes the indicator function, i.e.  if and if . It is easy to verify that for all [40, 31].

In our approach, we start by developing a Girsanov-like change of measures [23] to express the Kullback-Leibler (KL) divergence between and , which is defined as follows:

where denotes the law of , denotes the law of , and is the Radon–Nikodym derivative of with respect to . Here, we require A2 for the existence of a Girsanov transform between and and for establishing an explicit formula for the transform. In the supplementary document, we show that the KL divergence between and can be written as:

(14)

While this result has been known for SDEs driven by Brownian motion [15], none of the references we are aware of expresses the KL divergence as in (14). We also note that one of the key reasons that allows us to obtain (14) is the presence of the Brownian motion in (6), i.e. $\sigma > 0$. For $\sigma = 0$, such a measure transformation cannot be performed [41].

In the next result, we show that if the step-size is chosen sufficiently small, the KL divergence between and is bounded.

Theorem 3.

Assume that the conditions A1-A6 hold. Then the following inequality holds:

The proof technique is similar to the approach of [40, 31, 14]: the idea is to divide the integral in (14) into smaller pieces and bound each piece separately. Once we obtain a bound on the KL divergence, by using an optimal coupling argument, the data processing inequality, and Pinsker’s inequality, we obtain a bound for the total variation (TV) distance between and as follows:

where the TV distance is defined in Section 1. Besides, denotes the optimal coupling between and , i.e., the joint probability measure of and , which satisfies the following identity [42]:

Combined with Theorem 3, this inequality implies the following useful result:

(15)

where we used the fact that the event is equivalent to the event . The remaining task is to relate the probability to . The event ensures that the process does not leave the set when ; however, it does not indicate that the process remains in when . In order to have a control over the whole process, we introduce the following event:

such that the event ensures that the process stays close to for the whole time. By using this event, we can obtain the following inequalities:

By using the same approach, we can obtain a lower bound on as well. Hence, our final task reduces to bounding the term , which we perform by using the weak reflection principles of Lévy processes [43]. This finally yields Theorem 2.
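The step from a KL bound to a TV bound rests on Pinsker's inequality, which can be checked numerically on a small discrete example. The sketch below uses the convention $\sup_A |\mu(A) - \nu(A)|$ for the TV distance, under which Pinsker's inequality reads $\mathrm{TV} \le \sqrt{\mathrm{KL}/2}$; if the TV distance is defined with an extra factor of 2, both sides double and the bound becomes $\sqrt{2\,\mathrm{KL}}$. The two distributions and the helper names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists (q must be positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv_sup(p, q):
    """sup_A |p(A) - q(A)|, which equals half the L1 distance for discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv = tv_sup(p, q)
pinsker_bound = math.sqrt(0.5 * kl_divergence(p, q))
```

Under an optimal coupling of the two laws, the probability that the coupled variables differ equals this sup-form TV distance, which is how a bound on the KL divergence turns into a bound on the probability that the two processes disagree.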

5 Numerical Illustration

To illustrate our results, we first conduct experiments on a synthetic problem, where the cost function is set to . This corresponds to an Ornstein-Uhlenbeck-type process, which is commonly considered in metastability analyses [21]. This process locally satisfies the conditions A1-A5.

Since we cannot directly simulate the continuous-time process, we consider the stochastic process sampled from (7) with sufficiently small step-size as an approximation of the continuous scheme. Thus, we organize the experiments as follows. We first choose a very small step-size, i.e. . Starting from an initial point satisfying , we iterate (7) until we find the first such that . We repeat this experiment times, then we take the average as the ‘ground-truth’ first exit time. We continue the experiments by calculating the first exit times for larger step-sizes (each repeated times), and compute their distances to the ground truth.
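The experimental protocol described above can be sketched as follows. Every concrete choice here is an assumption made for illustration and is not taken from the paper: a one-dimensional quadratic $f(w) = \tfrac{1}{2} w^2$, tail index $\alpha = 1$ (so that the stable increments are Cauchy and can be sampled with a tangent), the parameter values, the number of runs and the step-sizes.

```python
import math
import random

def exit_time(eta, eps, sigma, radius, rng, max_steps=200_000):
    """Physical time k * eta at which the 1-D Euler iterate first leaves [-radius, radius]."""
    w = 0.0
    for k in range(1, max_steps + 1):
        xi = rng.gauss(0.0, 1.0)                                  # Gaussian draw
        zeta = math.tan(rng.uniform(-math.pi / 2, math.pi / 2))   # Cauchy draw (alpha = 1)
        # Assumed f(w) = 0.5 * w**2, so grad f(w) = w; eta**(1/alpha) = eta for alpha = 1.
        w = w - eta * w + eps * sigma * math.sqrt(eta) * xi + eps * eta * zeta
        if abs(w) > radius:
            return k * eta
    return max_steps * eta  # censored: no exit within the step budget

def mean_exit_time(eta, runs, seed, **params):
    """Average exit time over independent runs with a fixed seed."""
    rng = random.Random(seed)
    return sum(exit_time(eta, rng=rng, **params) for _ in range(runs)) / runs

params = dict(eps=0.5, sigma=1.0, radius=1.0)
# A very small step-size serves as the 'ground truth' proxy for the continuous-time process.
reference = mean_exit_time(eta=0.001, runs=200, seed=0, **params)
errors = {eta: abs(mean_exit_time(eta, runs=200, seed=1, **params) - reference)
          for eta in (0.01, 0.05, 0.1)}
```

With only a few hundred runs the estimated errors are noisy, so this sketch should be read as a template for the procedure rather than as a reproduction of the reported curves.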

Figure 2: Results of the synthetic experiments.

The results for this experiment are shown in Figure 2. By Theorem 2, the distance between the first exit times of the discretization and the continuous processes depends on two terms and , which are used for explaining our experimental results.

We observe from Figure 2(a) that the error to the ground-truth first exit time is an increasing function of , which directly matches our theoretical result. Figure 2(b) shows that, with small noise limit (e.g., in our settings, versus ), the error decreases with the parameter . By A6, with increased , we have the term to be reduced. On the other hand, increases with . However, at small noise limit, this effect is dominated by the decrease of , that makes the error decrease overall. The decreasing speed then decelerates with larger , since, the product becomes so large that the increase of starts to dominate the decrease of . Thus, it suggests that for a large , a very small step-size would be required for reducing the distance between the first exit times of the processes. In Figure 2(c), the error decreases when the variance increases. The reason for the performance is the same as in (b), and can be explained by considering the expression of and in the conclusion of Theorem 2.

In Figure 2(d), for small dimension, with the same exit time interval, when we increase , both processes escape the interval earlier, with smaller exit times. Hence, the distance between their exit times becomes smaller. With larger , the increasing effect of and starts to dominate the above ‘early-escape’ effect; thus, the decreasing speed of the error diminishes. We observe that the error even slightly increases when and grows from to .

Figure 3: Results of the neural network experiments.

In our second set of experiments, we consider the real data setting used in [6]: a multi-layer fully connected neural network with ReLU activations on the MNIST dataset. We adapted the code provided in [6]. For this model, we followed a similar methodology: we monitored the first exit time by varying the step-size $\eta$, the number of layers (depth), and the number of neurons per layer (width). Since a local minimum is not analytically available, we first trained the networks with SGD until a vicinity of a local minimum is reached with at least 90% accuracy, then we measured the first exit times with and . In order to have a prominent level of gradient noise, we set the mini-batch size and we did not add explicit Gaussian or Lévy noise. The results are given in Figure 3. We observe that, even with pure gradient noise, the error in the exit time behaves very similarly to the one that we observed in Figure 2(a), hence supporting our theory. We further observe that the error has a better dependency when the width and depth are relatively small, whereas the slope of the error increases for larger width and depth. This result shows that, to inherit the metastability properties of the continuous-time SDE, we need to use a smaller $\eta$ as we increase the size of the network. Note that this result does not conflict with Figure 2(d), since changing the width and depth does not simply change $d$; it also changes the landscape of the problem.

6 Conclusion

We studied SGD under a heavy-tailed gradient noise model, which has been empirically justified for a variety of deep learning tasks. While a continuous-time limit of SGD can be used as a proxy for investigating the metastability of SGD under this model, the system might behave differently once discretized. Addressing this issue, we derived explicit conditions for the step-size such that the discrete-time system can inherit the metastability behavior of its continuous-time limit. We illustrated our results on a synthetic model and neural networks.

Acknowledgments

We are grateful to Peter Tankov for providing us the derivations for the Girsanov-like change of measures. This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data science & Artificial Intelligence from Télécom Paris. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.

References

7 Appendix

7.1 Proof of Theorem 2

Proof.

Note that is equivalent to . Hence, from Lemma 4, the remaining task is to upper-bound :

and to lower-bound it:

By Lemma 1, the final result follows. ∎

Lemma 1.

There exist constants , and such that:

Proof.

We have for ,

For , using that , we get:

Then the Gronwall lemma gives:

Hence,

By Lemma 7.1 in [44], Lemma S4 in [14] and Markov’s inequality, for any , we have:

where is a constant independent of and . By Lemma 3, we have:

and

Finally, we get:

Now we prove the following lemma.

Lemma 2.

There exist constants and such that: