1 Introduction
Stochastic gradient descent (SGD) is one of the most popular algorithms in machine learning, owing to its scalability to high-dimensional problems and its favorable generalization properties. SGD is applicable to a broad class of convex and nonconvex optimization problems arising in machine learning [1, 2], including deep learning, where it has been particularly successful [3, 4, 5]. In deep learning, many key tasks can be formulated as the following nonconvex optimization problem:
(1)  $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(w),$

where $w \in \mathbb{R}^d$ contains the weights for the deep network to estimate, $f^{(i)}$ is the (typically nonconvex) loss function corresponding to the $i$-th data point, and $n$ is the number of data points [6, 7, 5]. SGD iterations consist of

(2)  $w_{k+1} = w_k - \eta \nabla \tilde{f}_k(w_k),$

where $\eta$ is the stepsize, $k$ denotes the iterations, $w_0$ is the initial point, and $\nabla \tilde{f}_k$ is an unbiased estimator of the actual gradient $\nabla f$, estimated from a subset of the component functions $f^{(i)}$. In particular, the gradients of the objective are estimated as averages of the form

(3)  $\nabla \tilde{f}_k(w) := \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(w),$

where $\Omega_k \subset \{1, \dots, n\}$ is a random subset that is drawn with or without replacement at iteration $k$, and $b = |\Omega_k|$ denotes the number of elements in $\Omega_k$ [1].
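To make the recursion concrete, the following sketch implements plain minibatch SGD for a toy finite-sum objective. The quadratic losses, stepsize, and batch size are illustrative choices introduced here, not taken from the paper.

```python
import numpy as np

def sgd(grad_fns, w0, eta=0.1, batch_size=2, n_iters=500, seed=0):
    """Minibatch SGD for f(w) = (1/n) * sum_i f_i(w).

    grad_fns : list of per-example gradient functions.
    At each iteration a random subset Omega_k of size b is drawn and the
    full gradient is estimated by the average over Omega_k.
    """
    rng = np.random.default_rng(seed)
    w = float(w0)
    n = len(grad_fns)
    for _ in range(n_iters):
        omega = rng.choice(n, size=batch_size, replace=False)  # random subset
        g = sum(grad_fns[i](w) for i in omega) / batch_size    # unbiased gradient estimate
        w -= eta * g                                           # SGD step
    return w

# Toy example: f_i(w) = (w - c_i)^2 / 2, so f is minimized at mean(c_i) = 1.5.
centers = [0.0, 1.0, 2.0, 3.0]
grads = [lambda w, c=c: w - c for c in centers]
w_final = sgd(grads, w0=10.0)
```

The iterate hovers around the minimizer with a fluctuation whose size depends on the stepsize and batch size, which is exactly the stochasticity the continuous-time models below try to capture.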
The popularity and success of SGD in practice have motivated researchers to investigate and analyze the reasons behind this success, a topic which has been an active research area [6, 4]. One well-known hypothesis [8] that has gained recent popularity (see e.g. [4, 9]) is that, among all the local minima lying on the nonconvex energy landscape defined by the loss function (1), local minima that lie in wide valleys generalize better than those in sharp valleys, and that SGD is able to converge to the 'right' local minimum that generalizes better. This is visualized in Figure 1(right), where the local minimum on the right lies in a wide valley, whereas the local minimum on the left lies in a sharp, deep valley. Interpreting this hypothesis and the structure of the local minima found by SGD clearly requires a deeper understanding of the statistical properties of the gradient noise and its implications for the dynamics of SGD. A number of papers in the literature argue that the noise has Gaussian structure [10, 7, 11, 12, 13, 3]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD:
(4)  $\mathrm{d}W_t = -\nabla f(W_t)\, \mathrm{d}t + \sqrt{\eta \sigma}\, \mathrm{d}B_t,$

where $B_t$ is the standard Brownian motion, $\sigma$ is the noise variance, and $\eta$ is the stepsize. The Gaussianity of the gradient noise implicitly assumes that the gradient noise has a finite variance with light tails. In a recent study, [6] empirically illustrated that in various deep learning settings the gradient noise exhibits a heavy-tail behavior, which suggests that the Gaussian-based approximation is not always appropriate, and furthermore, that the heavy-tailed noise can be modeled by a symmetric $\alpha$-stable distribution ($\mathcal{S}\alpha\mathcal{S}$). Here, $\alpha \in (0, 2]$ is called the tail index and characterizes the heavy-tailedness of the distribution, and $\sigma$ is a scale parameter that will be formally defined in Section 2. This stable model generalizes the Gaussian model in the sense that $\alpha = 2$ reduces to the Gaussian model, whereas smaller values of $\alpha$ quantify the heavy-tailedness of the gradient noise (see Figure 1(left)). Under this noise model, the resulting continuous-time limit of SGD becomes [6]:

(5)  $\mathrm{d}W_t = -\nabla f(W_t)\, \mathrm{d}t + \eta^{(\alpha-1)/\alpha} \sigma\, \mathrm{d}L_t^{\alpha},$

where $L_t^{\alpha}$ is the $d$-dimensional $\alpha$-stable Lévy motion with independent components (which will be formally defined in Section 2). This process has also been investigated for global nonconvex optimization in a recent study [14].
The sample paths of the Lévy-driven SDE (5) behave fundamentally differently from those of the Brownian-motion-driven dynamics (4). This difference mainly originates from the fact that, unlike the Brownian motion, which has almost surely continuous sample paths, the Lévy motion can have discontinuities, also called 'jumps' [15] (cf. Figure 1(middle)). This fundamental difference becomes more prominent in the metastability properties of the SDE (5). Metastability studies consider the case where $W_0$ is initialized in a basin and analyze the minimum time such that $W_t$ exits that basin. It has been shown that when $\alpha < 2$ (i.e. the noise has a heavy-tailed component), this so-called first exit time depends only on the width of the basin and the value of $\alpha$; it does not depend on the height of the basin [16, 17, 18]. The empirical results in [6] showed that, in various deep learning settings, the estimated tail index $\alpha$ is significantly smaller than 2, suggesting that the metastability results can be used as a proxy for understanding the dynamics of SGD in discrete time, especially to shed more light on the hypothesis that SGD prefers wide minima.
While this approach brings a new perspective for analyzing SGD, approximating SGD by a continuous-time process might not be accurate for every stepsize $\eta$, and theoretical concerns have already been raised about the validity of such approximations [19]. Intuitively, one can expect that the metastable behavior of SGD would be similar to that of its continuous-time limit only when the discretization stepsize is small enough. Even though some theoretical results have recently been established for discretizations of SDEs driven by Brownian motion [20], it is not clear how discretized Lévy-driven SDEs behave in terms of metastability.
In this study, we provide a formal theoretical analysis in which we derive explicit conditions on the stepsize $\eta$ such that the metastability behavior of the discrete-time system (7) is guaranteed to be close to that of its continuous-time limit (6). More precisely, we consider a stochastic differential equation with both a Brownian term and a Lévy term, and its Euler discretization, as follows [21]:
(6)  $\mathrm{d}X_t = -\nabla f(X_t)\, \mathrm{d}t + \sigma\, \mathrm{d}B_t + \varepsilon\, \mathrm{d}L_t^{\alpha},$

(7)  $\bar{X}_{k+1} = \bar{X}_k - \eta \nabla f(\bar{X}_k) + \sigma \sqrt{\eta}\, Z_{k+1} + \varepsilon \eta^{1/\alpha} S_{k+1},$

with independent and identically distributed (i.i.d.) variables $Z_k \sim \mathcal{N}(0, I_d)$, where $I_d$ is the $d \times d$ identity matrix, the components of $S_k$ are i.i.d. with $\mathcal{S}\alpha\mathcal{S}(1)$ distribution, and $\varepsilon \geq 0$ is the amplitude of the noise. This dynamics includes (4) and (5) as special cases. Here, we choose $\varepsilon$ as a scalar for convenience; however, our analyses could easily be extended to the case where $\varepsilon$ is a function of the stepsize. Understanding the metastability behavior of SGD modeled by these dynamics requires understanding how long it takes for the continuous-time process given by (6) and its discretization (7) to exit a neighborhood of a local minimum $\bar{x}$, if started in that neighborhood. For this purpose, for any given local minimum $\bar{x}$ of $f$ and any $a > 0$, we define the set

(8)  $A := \{ x \in \mathbb{R}^d : \| x - \bar{x} \| \leq a \},$

which is the set of points in $\mathbb{R}^d$ at distance at most $a$ from the local minimum $\bar{x}$. We formally define the first exit times for $X_t$ and $\bar{X}_k$, respectively, as follows:
(9)  $\tau_a(\varepsilon) := \inf\{ t \geq 0 : X_t \notin A \},$

(10)  $\bar{\tau}_a(\varepsilon) := \inf\{ k\eta \geq 0 : \bar{X}_k \notin A \}.$
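The exit times in (9)-(10) are straightforward to estimate by simulation. The sketch below runs the Euler scheme and records the first time the iterate leaves a ball of radius $a$; for illustration it uses the one-dimensional quadratic $f(x) = x^2/2$ and takes $\alpha = 1$ (the Cauchy case), for which NumPy has a direct sampler. All numerical values are illustrative choices, not taken from the paper.

```python
import numpy as np

def first_exit_time(grad_f, x0, xbar, a, eta=1e-3, sigma=0.2, eps=0.2,
                    max_steps=200_000, seed=0):
    """First exit time of the Euler scheme from A = {x : |x - xbar| <= a}.

    The Gaussian increment is scaled by sqrt(eta) and the alpha-stable
    increment by eta**(1/alpha); here alpha = 1, so eta**(1/alpha) = eta.
    Returns k * eta for the first k with X_k outside A (censored at the
    step budget if no exit is observed).
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(max_steps):
        if abs(x - xbar) > a:
            return k * eta
        z = rng.standard_normal()    # Gaussian part
        s = rng.standard_cauchy()    # symmetric stable part with alpha = 1
        x = x - eta * grad_f(x) + sigma * np.sqrt(eta) * z + eps * eta * s
    return max_steps * eta

# Quadratic basin f(x) = x^2 / 2 around xbar = 0, ball of radius 1.
tau_bar = first_exit_time(lambda x: x, x0=0.0, xbar=0.0, a=1.0)
```

In such a run the exit is typically triggered by a single large stable jump rather than by the accumulated Gaussian fluctuations, illustrating the qualitative difference discussed below.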
Our main result (Theorem 2) shows that, with a sufficiently small discretization step $\eta$, the probability that the discretized process exits a given neighborhood of the local optimum by a fixed time approximates that of the continuous process. This result also provides an explicit condition on the stepsize, which explains how the other parameters of the problem, such as the dimension $d$, the noise amplitude $\varepsilon$, and the scale $\sigma$ of the Gaussian noise, affect the similarity between the discretized and continuous processes. We validate our theory on a synthetic model and on neural networks.

Notations. For $a > 0$, the gamma function is defined as $\Gamma(a) := \int_0^{\infty} x^{a-1} e^{-x}\, \mathrm{d}x$. For any Borel probability measures $\mu$ and $\nu$ with domain $\mathbb{R}^d$, the total variation (TV) distance is defined as follows: $\| \mu - \nu \|_{TV} := \sup_{B \in \mathcal{B}(\mathbb{R}^d)} | \mu(B) - \nu(B) |$, where $\mathcal{B}(\mathbb{R}^d)$ denotes the Borel subsets of $\mathbb{R}^d$.
2 Technical Background
Symmetric stable distributions. The $\mathcal{S}\alpha\mathcal{S}(\sigma)$ distribution is a generalization of a centered Gaussian distribution, where $\alpha \in (0, 2]$ is called the tail index, a parameter that determines the amount of heavy-tailedness. We say that $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ if its characteristic function has the form $\mathbb{E}[\exp(i \omega X)] = \exp(-|\sigma \omega|^{\alpha})$, where $\sigma > 0$ is called the scale parameter. In the special case $\alpha = 2$, $\mathcal{S}\alpha\mathcal{S}(\sigma)$ reduces to the normal distribution $\mathcal{N}(0, 2\sigma^2)$. A crucial property of the stable distributions is that, when $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$ with $\alpha < 2$, the moment $\mathbb{E}[|X|^p]$ is finite if and only if $p < \alpha$, which implies that $X$ has infinite variance as soon as $\alpha < 2$. While the probability density function does not have a closed-form analytical expression except for a few special cases of $\alpha$ (e.g. $\alpha = 2$: Gaussian, $\alpha = 1$: Cauchy), it is computationally easy to draw random samples from it by using the method proposed in [22].

Lévy processes and SDEs driven by Lévy motions. The standard $\alpha$-stable Lévy motion $L_t^{\alpha}$ on the real line is the unique process satisfying the following properties [21]:

- For any $0 \leq t_0 < t_1 < \dots < t_N$, the increments $L_{t_i}^{\alpha} - L_{t_{i-1}}^{\alpha}$ are independent for $i = 1, \dots, N$, and $L_0^{\alpha} = 0$ almost surely.
- $L_t^{\alpha} - L_s^{\alpha}$ and $L_{t-s}^{\alpha}$ have the same distribution, namely $\mathcal{S}\alpha\mathcal{S}((t-s)^{1/\alpha})$, for any $s < t$.
- $L_t^{\alpha}$ is continuous in probability: for all $\delta > 0$ and $s \geq 0$, $\mathbb{P}(|L_t^{\alpha} - L_s^{\alpha}| > \delta) \to 0$ as $t \to s$.
When $\alpha = 2$, $L_t^{\alpha}$ reduces to a scaled version of the standard Brownian motion, $\sqrt{2} B_t$. Since $L_t^{\alpha}$ for $\alpha < 2$ is only continuous in probability, it can incur a countable number of discontinuities at random times, which makes it fundamentally different from the Brownian motion, whose sample paths are almost surely continuous.

The $d$-dimensional Lévy motion $L_t^{\alpha}$ with independent components is a stochastic process on $\mathbb{R}^d$ where each coordinate corresponds to an independent scalar Lévy motion. Stochastic processes based on Lévy motion, such as (5), and their mathematical properties have been studied in the literature; we refer the reader to [23, 15] for details.
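Drawing $\mathcal{S}\alpha\mathcal{S}$ samples via the method of [22] (Chambers-Mallows-Stuck) takes only a few lines. The sketch below implements the symmetric ($\beta = 0$) case; the sample sizes and thresholds are illustrative.

```python
import numpy as np

def sas_sample(alpha, size, rng):
    """Standard symmetric alpha-stable samples (scale sigma = 1) via the
    Chambers-Mallows-Stuck method, symmetric case."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit-mean exponential
    if alpha == 1.0:
        return np.tan(u)                           # alpha = 1: Cauchy
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(0)
gauss = sas_sample(2.0, 200_000, rng)   # alpha = 2: N(0, 2), finite variance
heavy = sas_sample(1.5, 200_000, rng)   # alpha = 1.5: infinite variance
frac_extreme = np.mean(np.abs(heavy) > 10.0)  # heavy tails produce outliers
```

For $\alpha = 2$ the formula reduces to $2 \sin(U) \sqrt{W}$, which is exactly $\mathcal{N}(0, 2)$; for $\alpha < 2$ a nonnegligible fraction of the samples is extreme, consistent with $\mathbb{E}[|X|^p] < \infty$ if and only if $p < \alpha$.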
First Exit Times of Continuous-Time Lévy Stable SDEs. Due to the discontinuities of Lévy-driven SDEs, their metastability behavior also differs significantly from that of their Brownian counterparts. In this section, we briefly review important theoretical results about the SDE given in (6).

For simplicity, let us consider the SDE (6) in dimension one, i.e. $d = 1$. In a relatively recent study [16], the authors considered this SDE, where the potential function $f$ is required to have a non-degenerate global minimum at the origin, and they proved the following theorem.
Theorem 1 ([16]).
Consider the SDE (6) in dimension $d = 1$ with $\sigma = 0$, and assume that it has a unique strong solution. Assume further that the objective $f$ has a global minimum at zero, satisfying the conditions $f(0) = 0$, $f'(0) = 0$, $f'(x) \neq 0$ if and only if $x \neq 0$, and $f''(0) > 0$. Then, there exist positive constants $\delta$, $\gamma$, $C$, and $\varepsilon_0$ such that for $0 < \varepsilon \leq \varepsilon_0$, the following holds:

(11)  $\big| \mathbb{P}_x\big( \lambda(\varepsilon)\, \tau_a(\varepsilon) > u \big) - e^{-u} \big| \leq C \varepsilon^{\gamma},$

uniformly for all $x \in [-a + \delta, a - \delta]$ and $u \geq 0$, where $\lambda(\varepsilon) := \frac{2}{\alpha} a^{-\alpha} \varepsilon^{\alpha}$ is the rate at which $\varepsilon L_t^{\alpha}$ makes a jump of size larger than $a$. Consequently,

(12)  $\mathbb{E}_x[\tau_a(\varepsilon)] \sim \frac{\alpha\, a^{\alpha}}{2}\, \varepsilon^{-\alpha} \quad \text{as } \varepsilon \to 0.$
This result indicates that the first exit time of (6) needs only polynomial time with respect to the width $a$ of the basin, and it does not depend on the depth of the basin, whereas Brownian systems need time exponential in the height of the basin in order to exit [24, 17]. This difference is mainly due to the discontinuities of the Lévy motion, which enable it to 'jump out' of the basin, whereas Brownian SDEs need to 'climb' the basin due to their continuity. Consequently, given that the gradient noise exhibits heavy-tailed behavior similar to an $\mathcal{S}\alpha\mathcal{S}$-distributed random variable, this result can be considered a proxy for understanding the wide-minima behavior of SGD.

We note that this result has already been extended to $\mathbb{R}^d$ in [18]. An extension to state-dependent noise has also been obtained in [25]. We also note that the metastability phenomenon is closely related to the spectral gap of the forward operator corresponding to the SDE dynamics (see e.g. [24]), and it is known that this quantity scales like $\varepsilon^{\alpha}$ for small $\varepsilon$, which determines the dependency on $\varepsilon$ in the first term of the exit time (12) due to Kramers' law [26, 27]. Burghoff and Pavlyukevich [27] showed that a similar scaling in $\varepsilon$ for the spectral gap holds if the SDE dynamics is restricted to a discrete grid with a small enough grid size.
3 Assumptions and the Main Result
In this study, our main goal is to obtain an explicit condition on the stepsize such that the first exit time (9) of the continuous-time process is similar to the first exit time (10) of its Euler discretization.
We first state our assumptions.
A 1.
The SDE (6) admits a unique strong solution.
A 2.
The process (6) satisfies a Novikov-type condition: $\mathbb{E}\big[ \exp\big( \frac{1}{2\sigma^2} \int_0^T \| \nabla f(X_s) \|^2\, \mathrm{d}s \big) \big] < \infty$.
A 3.
The gradient of $f$ is Hölder continuous with exponent $\gamma \in (0, 1]$: $\| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|^{\gamma}$ for all $x, y \in \mathbb{R}^d$.
A 4.
The gradient of $f$ satisfies the following assumption at the origin: $\| \nabla f(0) \| \leq B$.
A 5.
For some $m > 0$ and $b \geq 0$, $f$ is $(m, b)$-dissipative: $\langle x, \nabla f(x) \rangle \geq m \| x \|^2 - b$ for all $x \in \mathbb{R}^d$.
We note that, as opposed to the theory of SDEs driven by Brownian motion, the theory of Lévy-driven SDEs is still an active research field, where even the existence of solutions for general drift functions is not well established and the main contributions have appeared in the last decade [28, 29]. Therefore, A1 has been a common assumption in stochastic analysis, e.g. [16, 18, 30]. Nevertheless, existence and uniqueness results have very recently been established in [29] for SDEs with bounded Hölder drifts; hence A1 and A2 directly hold for bounded gradients, and extending this result to Hölder and dissipative drifts is out of the scope of this study. On the other hand, the assumptions A3-A5 are standard conditions, which are often considered for nonconvex optimization algorithms based on discretizations of diffusions [31, 32, 33, 34, 35].
Now, we identify an explicit condition for the stepsize, which is one of our main contributions.
A 6.
We now present our main result, whose proof can be found in the supplementary material.
Theorem 2.
Exit time versus problem parameters. In Theorem 2, if we let the stepsize $\eta$ go to zero with the other parameters fixed, the error constant in the bound also goes to zero, and since the tolerance can be chosen arbitrarily small, the probabilities of the first exit times of the discrete and continuous processes approach each other as the stepsize gets smaller, as expected. If instead we decrease $\varepsilon$ or $\sigma$, the bound also decreases monotonically, but it does not go to zero due to the first term in its expression.
Exit time versus width of local minima. Popular activation functions used in deep learning, such as ReLU, are almost everywhere differentiable, and therefore the cost function has a well-defined Hessian almost everywhere (see e.g. [36]). The eigenvalues of the Hessian of the objective near local minima have also been studied in the literature (see e.g. [37, 38]). If the Hessian around a local minimum is positive definite, the conditions for the multidimensional version of Theorem 1 in [18] are satisfied locally around that minimum. For local minima lying in wider valleys, the parameter $a$ can be taken larger, in which case the expected exit time will be larger by formula (12). In other words, the SDE (5) spends more time to exit wider valleys. Theorem 2 shows that SGD, modeled by the discretization of this SDE, inherits a similar behavior if the stepsize satisfies the conditions we provide.

4 Proof Overview
Relating the first exit times of $X_t$ and $\bar{X}_k$ often requires bounds on the distance between the two processes. In particular, if $\sup_{0 \leq t \leq T} \| X_t - \bar{X}_{\lfloor t/\eta \rfloor} \|$ is small with high probability, then we expect their first exit times from the set $A$ to be close to each other with high probability as well.

For objective functions with bounded gradients, in order to relate $\tau_a(\varepsilon)$ to $\bar{\tau}_a(\varepsilon)$, one can attempt to use the strong convergence of the Euler scheme (cf. [39], Proposition 1), which bounds the expected uniform distance between the two processes on $[0, T]$. By Markov's inequality, this result implies convergence in probability: for any $\delta > 0$ and $\rho \in (0, 1)$, there exists a small enough $\eta$ such that $\mathbb{P}(\sup_{0 \leq t \leq T} \| X_t - \bar{X}_{\lfloor t/\eta \rfloor} \| > \delta) \leq \rho$. Then, if the discrete process exits $A$ by time $T$, one of the following events must happen:

- the continuous process $X_t$ also exits $A$ by time $T$,
- $X_t$ stays in $A$ and the two processes are more than $\delta$ apart (with probability less than $\rho$),
- $X_t$ stays in $A$ and its distance to the boundary of $A$ is at most $\delta$ (with probability less than $\rho$).
Using this observation, we can bound the difference between the two exit probabilities. Even though we could use this result to relate $\tau_a(\varepsilon)$ to $\bar{\tau}_a(\varepsilon)$, this approach would not yield a meaningful condition on $\eta$, since the bounds for the strong error in general grow exponentially with the time horizon $T$, which means $\eta$ would have to be chosen exponentially small for a given $T$. Therefore, in our strategy, we choose a different path, where we do not use the strong convergence of the Euler scheme.
Our proof strategy is inspired by the recent study [20], where the authors analyze the empirical metastability of the Langevin equation driven by a Brownian motion. However, unlike the Brownian case considered in [20], some of the tools available for analyzing Brownian SDEs do not exist for Lévy-driven SDEs, which increases the difficulty of our task.
We first define a linearly interpolated version of the discrete-time process $\bar{X}_k$, which will be useful in our analysis, given as follows:

(13)  $Y_t = Y_0 + \int_0^t \tilde{b}(\mathbf{Y}, s)\, \mathrm{d}s + \sigma B_t + \varepsilon L_t^{\alpha},$

where $\mathbf{Y}$ denotes the whole process and the drift function is chosen as follows: $\tilde{b}(\mathbf{Y}, s) := -\sum_{k=0}^{\infty} \nabla f(Y_{k\eta})\, \mathbb{1}_{[k\eta, (k+1)\eta)}(s)$. Here, $\mathbb{1}_S$ denotes the indicator function of the set $S$, i.e. $\mathbb{1}_S(s) = 1$ if $s \in S$ and $\mathbb{1}_S(s) = 0$ otherwise. It is easy to verify that $Y_{k\eta} = \bar{X}_k$ for all $k$ [40, 31].
In our approach, we start by developing a Girsanov-like change of measures [23] to express the Kullback-Leibler (KL) divergence between the processes on $[0, T]$, which is defined as follows: $\mathrm{KL}(\mu \| \nu) := \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}\, \mathrm{d}\mu$, where $\mu$ denotes the law of $\{X_t\}_{t \in [0, T]}$, $\nu$ denotes the law of $\{Y_t\}_{t \in [0, T]}$, and $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ is the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Here, we require A2 for the existence of a Girsanov transform between $\mu$ and $\nu$ and for establishing an explicit formula for the transform. In the supplementary document, we show that the KL divergence between $\mu$ and $\nu$ can be written as:

(14)  $\mathrm{KL}(\mu \| \nu) = \frac{1}{2\sigma^2}\, \mathbb{E} \int_0^T \big\| \nabla f(Y_s) - \nabla f(Y_{\eta \lfloor s/\eta \rfloor}) \big\|^2\, \mathrm{d}s.$

While such a result has been known for SDEs driven by Brownian motion [15], none of the references we are aware of expressed the KL divergence as in (14). We also note that one of the key reasons that allows us to obtain (14) is the presence of the Brownian motion in (6), i.e. $\sigma > 0$. For $\sigma = 0$, such a measure transformation cannot be performed [41].
In the next result (Theorem 3), we show that if the stepsize $\eta$ is chosen sufficiently small, the KL divergence between $\mu$ and $\nu$ is bounded. The proof technique is similar to the approach of [40, 31, 14]: the idea is to divide the integral in (14) into smaller pieces and to bound each piece separately. Once we obtain a bound on the KL divergence, by using an optimal coupling argument, the data processing inequality, and Pinsker's inequality, we obtain a bound on the total variation (TV) distance between $\mu$ and $\nu$ as follows:

$\| \mu - \nu \|_{TV} \leq \sqrt{\mathrm{KL}(\mu \| \nu) / 2},$

where the TV distance is defined in Section 1. Here, the optimal coupling between $\mu$ and $\nu$ is a joint probability measure of $\{X_t\}$ and $\{Y_t\}$ which satisfies the following identity [42]: $\mathbb{P}(\exists t \in [0, T] : X_t \neq Y_t) = \| \mu - \nu \|_{TV}$.
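Pinsker's inequality, $\| \mu - \nu \|_{TV} \leq \sqrt{\mathrm{KL}(\mu \| \nu)/2}$, can be checked numerically on small discrete distributions; the two distributions below are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance: half the l1 distance for discrete measures."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv_dist = tv(p, q)                      # 0.1
pinsker_bound = np.sqrt(kl(p, q) / 2)   # about 0.112
```

The bound is tight only up to a constant, but it converts the KL control obtained from the Girsanov computation into the TV control needed for the coupling argument.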
Combined with Theorem 3, this inequality implies the following useful result:

(15)  $\big| \mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T]) - \mathbb{P}(\bar{\tau}_a(\varepsilon) > T) \big| \leq \| \mu - \nu \|_{TV},$

where we used the fact that the event $\{ Y_{k\eta} \in A,\ \forall k\eta \in [0, T] \}$ is equivalent to the event $\{ \bar{\tau}_a(\varepsilon) > T \}$. The remaining task is to relate the probability $\mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T])$ to $\mathbb{P}(\tau_a(\varepsilon) > T)$. The former event ensures that the process does not leave the set $A$ at the grid times $t = k\eta$; however, it does not indicate that the process remains in $A$ when $t \in (k\eta, (k+1)\eta)$. In order to have control over the whole process, we introduce an event ensuring that the process stays close to the set $A$ for the whole time horizon $[0, T]$. By using this event, we can upper-bound $\mathbb{P}(X_{k\eta} \in A,\ \forall k\eta \in [0, T])$ in terms of $\mathbb{P}(\tau_a(\varepsilon) > T)$; by the same approach, we can obtain a lower bound as well. Hence, our final task reduces to bounding the probability that the process leaves $A$ and returns in between two grid points, which we perform by using the weak reflection principles of Lévy processes [43]. This finally yields Theorem 2.
5 Numerical Illustration
To illustrate our results, we first conduct experiments on a synthetic problem, where the cost function is set to $f(x) = \| x \|^2 / 2$. This corresponds to an Ornstein-Uhlenbeck-type process, which is commonly considered in metastability analyses [21]. This process locally satisfies the conditions A1-A5.

Since we cannot directly simulate the continuous-time process, we consider the stochastic process sampled from (7) with a sufficiently small stepsize as an approximation of the continuous scheme. Thus, we organize the experiments as follows. We first choose a very small stepsize. Starting from an initial point inside the set $A$, we iterate (7) until we find the first $k$ such that $\bar{X}_k \notin A$. We repeat this experiment many times, then take the average exit time as the 'ground truth'. We continue the experiments by calculating the first exit times for larger stepsizes (each similarly averaged over repetitions), and compute their distances to the ground truth.
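The protocol above can be sketched as follows for the one-dimensional quadratic, again with $\alpha = 1$ (Cauchy) so that plain NumPy suffices; stepsizes, noise levels, and run counts are illustrative choices, not those used in the paper.

```python
import numpy as np

def mean_exit_time(eta, n_runs=40, a=1.0, sigma=0.5, eps=0.5, seed=0):
    """Average first exit time of the Euler scheme for f(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    times = []
    for _ in range(n_runs):
        x, k = 0.0, 0
        while abs(x) <= a and k < 100_000:
            x += (-eta * x + sigma * np.sqrt(eta) * rng.standard_normal()
                  + eps * eta * rng.standard_cauchy())
            k += 1
        times.append(k * eta)   # exit time on the physical time scale k * eta
    return float(np.mean(times))

tau_ref = mean_exit_time(eta=1e-3)  # small stepsize: proxy for the ground truth
errors = {eta: abs(mean_exit_time(eta) - tau_ref) for eta in (2e-3, 5e-3, 1e-2)}
```

With enough repetitions the error should grow with $\eta$, mirroring the trend in Figure 2(a); with only a few dozen runs the estimates remain noisy, so the trend is qualitative.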
The results for this experiment are shown in Figure 2. By Theorem 2, the distance between the first exit times of the discretized and continuous processes depends on two error terms, which we use to explain our experimental results.
We observe from Figure 2(a) that the error with respect to the ground-truth first exit time is an increasing function of $\eta$, which directly matches our theoretical result. Figure 2(b) shows that, in the small-noise limit, the error decreases with the tail index $\alpha$. By A6, with increased $\alpha$, one of the two error terms is reduced; the other term, on the contrary, increases with $\alpha$. In the small-noise limit, this effect is dominated by the decrease of the first term, which makes the error decrease overall. The speed of decrease then decelerates for larger $\alpha$, since the increasing term starts to dominate the decreasing one. This suggests that for a large $\alpha$, a very small stepsize would be required to reduce the distance between the first exit times of the two processes. In Figure 2(c), the error decreases when the Gaussian noise scale $\sigma$ increases. The reason is the same as in (b) and can be explained from the expressions of the two error terms in the conclusion of Theorem 2.

In Figure 2(d), for small dimension $d$, with the same exit-time interval, increasing $d$ makes both processes escape the interval earlier, with smaller exit times; hence the distance between their exit times becomes smaller. For larger $d$, the growth of the error terms with $d$ starts to dominate this 'early-escape' effect, and the speed of decrease of the error diminishes: we observe that the error even increases slightly for the largest dimensions considered.
In our second set of experiments, we consider the real-data setting used in [6]: a multilayer fully connected neural network with ReLU activations on the MNIST dataset. We adapted the code provided in [6]. For this model, we followed a similar methodology: we monitored the first exit time while varying the stepsize $\eta$, the number of layers (depth), and the number of neurons per layer (width). Since a local minimum is not analytically available, we first trained the networks with SGD until the vicinity of a local minimum was reached with at least 90% accuracy, and then measured the first exit times. In order to have a prominent level of gradient noise, we used a small minibatch size and did not add explicit Gaussian or Lévy noise. The results are given in Figure 3. We observe that, even with pure gradient noise, the error in the exit time behaves very similarly to what we observed in Figure 2(a), hence supporting our theory. We further observe that the error has a milder dependence on $\eta$ when the width and depth are relatively small, whereas the slope of the error increases for larger width and depth. This result shows that, in order to inherit the metastability properties of the continuous-time SDE, we need to use a smaller $\eta$ as we increase the size of the network. Note that this result does not conflict with Figure 2(d), since changing the width and depth does not simply change $d$; it also changes the landscape of the problem.

6 Conclusion
We studied SGD under a heavy-tailed gradient noise model, which has been empirically justified for a variety of deep learning tasks. While a continuous-time limit of SGD can be used as a proxy for investigating the metastability of SGD under this model, the system might behave differently once discretized. Addressing this issue, we derived explicit conditions on the stepsize such that the discrete-time system can inherit the metastability behavior of its continuous-time limit. We illustrated our results on a synthetic model and on neural networks.
Acknowledgments
We are grateful to Peter Tankov for providing us the derivations for the Girsanov-like change of measures. This work is partly supported by the French National Research Agency (ANR) as part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data Science & Artificial Intelligence from Télécom Paris. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.
References
 [1] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
 [2] Léon Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [3] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations, 2018.
 [4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 [5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
 [6] U. Şimşekli, L. Sagun, and M. Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. In ICML, 2019.
 [7] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
 [8] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 [9] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.
 [10] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.
 [11] Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017.
 [12] W. Hu, C. J. Li, L. Li, and J.G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
 [13] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects. arXiv preprint arXiv:1803.00195, 2018.

 [14] Thanh Huy Nguyen, Umut Şimşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. In International Conference on Machine Learning, 2019.
 [15] Bernt Karsten Øksendal and Agnes Sulem. Applied stochastic control of jump diffusions, volume 498. Springer, 2005.
 [16] Peter Imkeller and Ilya Pavlyukevich. First exit times of SDEs driven by stable Lévy processes. Stochastic Processes and their Applications, 116(4):611-642, 2006.
 [17] P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211-222, 2010.
 [18] Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in $\mathbb{R}^d$ perturbed by multifractal Lévy noise. Journal of Statistical Physics, 141(1):94-119, 2010.
 [19] S. Yaida. Fluctuationdissipation relations for stochastic gradient descent. In International Conference on Learning Representations, 2019.
 [20] B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. In Proceedings of the 2018 Conference on Learning Theory, 2018.
 [21] J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.
 [22] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976.
 [23] Peter Tankov. Financial modelling with jump processes. Chapman and Hall/CRC, 2003.
 [24] Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399–424, 2004.
 [25] Ilya Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails. Stochastics and Dynamics, 11(02n03):495–519, 2011.
 [26] Nils Berglund. Kramers’ law: Validity, derivations and generalisations. arXiv preprint arXiv:1106.5799, 2011.
 [27] Toralf Burghoff and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights. Journal of Statistical Physics, 161(1):171–196, 2015.
 [28] Enrico Priola et al. Pathwise uniqueness for singular SDEs driven by stable processes. Osaka Journal of Mathematics, 49(2):421-447, 2012.
 [29] Alexei M. Kulik. On weak uniqueness and distributional properties of a solution to an SDE with $\alpha$-stable noise. Stochastic Processes and their Applications, 129(2):473-506, 2019.
 [30] Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling. arXiv preprint arXiv:1801.05936, 2018.
 [31] M. Raginsky, A. Rakhlin, and M. Telgarsky. Nonconvex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017.
 [32] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3125–3136, 2018.
 [33] M. A. Erdogdu, L. Mackey, and O. Shamir. Global Nonconvex Optimization with Discretized Diffusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018.
 [34] Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global NonConvex Optimization. arXiv eprints, page arXiv:1812.07725, Dec 2018.
 [35] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for NonConvex Stochastic Optimization: NonAsymptotic Performance Bounds and MomentumBased Acceleration. arXiv eprints, page arXiv:1809.04618, Sep 2018.
 [36] Yuanzhi Li and Yang Yuan. Convergence Analysis of Twolayer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017.
 [37] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
 [38] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018.
 [39] R. Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by Lévy processes. Stochastics, 90(4):569-604, 2018.
 [40] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
 [41] Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven SDEs with Hölder continuous coefficients. Journal of Functional Analysis, 264(8):1757-1778, 2013.
 [42] Torgny Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
 [43] Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reflection principle for Lévy processes. The Annals of Applied Probability, 25(6):3251–3294, 2015.
 [44] Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. arXiv preprint arXiv:1705.07402, 2017.
 [45] Andreas Winkelbauer. Moments and absolute moments of the normal distribution. arXiv preprint arXiv:1209.4340, 2012.
7 Appendix
7.1 Proof of Theorem 2
Proof.
Lemma 1.
There exist constants , and such that:
Proof.
We have for ,
For , using that , we get:
Then the Gronwall lemma gives:
Hence,
By Lemma 7.1 in [44], Lemma S4 in [14] and Markov’s inequality, for any , we have:
where is a constant independent of and . By Lemma 3, we have:
and
Finally, we get:
∎
Now we prove the following lemma.
Lemma 2.
There exist constants and such that: