Gradient-based optimization algorithms have been the de facto choice in deep learning for solving optimization problems of the form:
$$ \min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(x), \qquad (1) $$
where $f$ denotes the non-convex loss function, $f^{(i)}$ denotes the loss contributed by the individual data point $i$, and $x \in \mathbb{R}^d$ denotes the collection of all the parameters of the neural network. Among others, stochastic gradient descent with momentum (SGDm) is one of the most popular algorithms for solving such optimization tasks (see e.g., Sutskever et al. (2013); Smith et al. (2018)), and is based on the following iterative scheme:
$$ v_{k+1} = \gamma v_k - \eta \nabla \tilde{f}_{k+1}(x_k), \qquad x_{k+1} = x_k + v_{k+1}, \qquad (2) $$
where $k$ denotes the iteration number, $\eta > 0$ is the step-size, $\gamma \in (0,1)$ is the friction, and $v_k$ denotes the velocity (also referred to as momentum). Here, $\nabla \tilde{f}_k$ denotes the stochastic gradients, defined as follows:
$$ \nabla \tilde{f}_k(x) \triangleq \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(x), \qquad (3) $$
where $\Omega_k \subset \{1, \ldots, n\}$ denotes a random subset drawn from the set of data points with $|\Omega_k| = b$ for all $k$.
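The recursion (2)-(3) can be sketched in a few lines. This is a minimal illustration on a toy quadratic loss with a full-batch gradient (the loss and parameter values are illustrative, not those used in the paper):

```python
import numpy as np

def sgdm(grad, x0, eta=0.1, gamma=0.9, n_iters=200):
    """SGD with momentum: v_{k+1} = gamma*v_k - eta*g(x_k); x_{k+1} = x_k + v_{k+1}."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = gamma * v - eta * grad(x)  # accumulate the (stochastic) gradient
        x = x + v                      # move along the velocity
    return x

# Toy quadratic loss f(x) = 0.5*||x||^2, so grad f(x) = x; the minimum is at 0.
x_star = sgdm(lambda x: x, x0=np.array([5.0, -3.0]))
```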
When the gradients are computed on all the data points (i.e., $\Omega_k = \{1, \ldots, n\}$), SGDm becomes deterministic and can be viewed as a discretization of the following continuous-time system (Gao et al., 2018a; Maddison et al., 2018):
$$ \mathrm{d}v_t = -\big(\bar{\gamma} v_t + \nabla f(x_t)\big)\,\mathrm{d}t, \qquad \mathrm{d}x_t = v_t\,\mathrm{d}t, \qquad (4) $$
where $v_t$ is still called the velocity and $\bar{\gamma} > 0$ is the friction coefficient. The connection between this system and (2) becomes clearer if we discretize this system by using the Euler scheme with step-size $h$:
$$ v_{k+1} = (1 - h\bar{\gamma})\, v_k - h \nabla f(x_k), \qquad x_{k+1} = x_k + h v_{k+1}, \qquad (5) $$
and make the change of variables $\tilde{v}_k \triangleq h v_k$, $\eta \triangleq h^2$, and $\gamma \triangleq 1 - h\bar{\gamma}$, which recovers (2) with the full gradient in place of $\nabla \tilde{f}$. However, due to the presence of the stochastic gradient noise $\nabla \tilde{f}_k - \nabla f$, the sequence $(x_k, v_k)$ will be a stochastic process, and the deterministic system (4) would not be an appropriate proxy.
Understanding the statistical properties of $(x_k, v_k)$ would be of crucial importance, as it might reveal the peculiar properties that lie behind the performance of SGDm for learning with neural networks. A popular approach for understanding the dynamics of stochastic optimization algorithms in deep learning is to impose some structure on the noise and relate the process (2) to a stochastic differential equation (SDE) (Mandt et al., 2016; Jastrzebski et al., 2017; Hu et al., 2017; Chaudhari and Soatto, 2018; Zhu et al., 2019; Simsekli et al., 2019). For instance, by assuming that the second-order moments of the stochastic gradient noise are bounded (i.e., $\mathbb{E}\|\nabla \tilde{f}_k(x) - \nabla f(x)\|^2 < \infty$ for all admissible $x$ and $k$), one might argue by the central limit theorem (CLT) that the noise is approximately Gaussian (Fischer, 2010). Under this assumption, we might view (2) as a discretization of the following SDE, which is also known as the underdamped or kinetic Langevin dynamics:
$$ \mathrm{d}v_t = -\big(\gamma v_t + \nabla f(x_t)\big)\,\mathrm{d}t + \sqrt{2\gamma\beta^{-1}}\,\mathrm{d}B_t, \qquad \mathrm{d}x_t = v_t\,\mathrm{d}t, \qquad (6) $$
where $B_t$ denotes the $d$-dimensional Brownian motion and $\beta > 0$ is called the inverse temperature variable, measuring the noise intensity along with $\gamma$. It is easy to check that, under very mild assumptions, the solution process $(x_t, v_t)$ admits an invariant distribution whose density is proportional to $\exp(-\beta(f(x) + \|v\|^2/2))$, where the function $K(v) = \|v\|^2/2$ is often called the Gaussian kinetic energy (see e.g. (Betancourt et al., 2017)) and the distribution itself is called the Boltzmann-Gibbs measure (Pavliotis, 2014; Gao et al., 2018a; Hérau and Nier, 2004; Dalalyan and Riou-Durand, 2018). We then observe that the marginal distribution of $x$ in stationarity has a density proportional to $\exp(-\beta f(x))$, which indicates that any local minimum of $f$ appears as a local maximum of this density. This is a desirable property since it implies that, when the gradient noise has light tails, the process will spend more time near the local minima of $f$. Furthermore, it has been shown that as $\beta$ goes to infinity, the marginal distribution of $x_t$ concentrates around the global optimum $x^\star = \operatorname{argmin}_x f(x)$. This observation has yielded interesting results for understanding the dynamics of SGDm in the contexts of both sampling and optimization with convex and non-convex potentials (Gao et al., 2018a, b; Zou et al., 2018; Lu et al., 2016).
While the Gaussianity assumption can be accurate in certain settings such as small networks or ResNets (Martin and Mahoney, 2019; Panigrahi et al., 2019), recently it has been empirically demonstrated that in several deep learning setups the stochastic gradient noise can exhibit a heavy-tailed behavior (Şimşekli et al., 2019; Zhang et al., 2019b). While the Gaussianity assumption would not be appropriate in this case, since the conventional CLT no longer holds, we can nevertheless invoke the generalized CLT, which states that the asymptotic distribution of the gradient noise will be a symmetric $\alpha$-stable ($\mathcal{S}\alpha\mathcal{S}$) distribution: a class of distributions that are commonly used in the statistical physics literature as an approximation to heavy-tailed random variables (Sliusarenko et al., 2013; Dubkov et al., 2008). As we will define in more detail in the next section, at the core of $\mathcal{S}\alpha\mathcal{S}$ lies the tail-index parameter $\alpha \in (0, 2]$, which determines the heaviness of the tail of the distribution. The tails get heavier as $\alpha$ gets smaller, and the case $\alpha = 2$ reduces to Gaussian random variables. This is illustrated in Figure 1.
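The effect of the tail-index can be checked numerically. The sketch below draws symmetric $\alpha$-stable samples with the standard Chambers-Mallows-Stuck method (for $\alpha = 2$ it produces $\sqrt{2}$ times a standard Gaussian, matching the convention used here); smaller $\alpha$ yields far more extreme samples:

```python
import numpy as np

def sample_sas(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    if alpha == 1.0:  # Cauchy special case
        return np.tan(u)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(0)
cauchy = sample_sas(1.0, 100_000, rng)  # alpha = 1: heavy tails
gauss = sample_sas(2.0, 100_000, rng)   # alpha = 2: Gaussian (variance 2)
```

For the Cauchy samples a visible fraction exceeds 10 in absolute value, while this essentially never happens for the Gaussian samples.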
With the assumption that the gradient noise is $\mathcal{S}\alpha\mathcal{S}$-distributed, the choice of Brownian motion is no longer appropriate and should be replaced with an $\alpha$-stable Lévy motion, which motivates the following Lévy-driven SDE:
$$ \mathrm{d}v_t = -\big(\gamma v_{t-} + \nabla f(x_{t-})\big)\,\mathrm{d}t + \big(\gamma\beta^{-1}\big)^{1/\alpha}\,\mathrm{d}\mathrm{L}_t^{\alpha}, \qquad \mathrm{d}x_t = v_{t-}\,\mathrm{d}t, \qquad (7) $$
where $v_{t-}$ denotes the left limit of $v$ at time $t$ and $\mathrm{L}_t^{\alpha}$ denotes the $\alpha$-stable Lévy process with independent components, which coincides with $\sqrt{2} B_t$ when $\alpha = 2$. Unfortunately, when $\alpha < 2$, as opposed to its Brownian counterpart, the invariant measures of such SDEs do not admit an analytical form in general; yet, one can still show that the invariant measure cannot be in the form of the Boltzmann-Gibbs measure (Eliazar and Klafter, 2003).
A more striking property of (7) was very recently revealed in a statistical physics study (Capała and Dybiec, 2019), where the authors numerically illustrated that, even when $f$ has a single minimum, the invariant measure of (7) can exhibit multiple maxima, none of which coincides with the minimum of $f$. A similar property has been formally proven for the overdamped dynamics (i.e., without the velocity variable) with Cauchy noise (i.e., $\alpha = 1$) by Sliusarenko et al. (2013). Since the process (7) would spend more time around the modes of its invariant measure (i.e., the high-probability region), in an optimization context (i.e., for larger $\beta$) the sample paths would concentrate around these modes, which might be arbitrarily distant from the optima of $f$. In other words, the heavy tails of the gradient noise could result in an undesirable bias, which would still be present even when the step-size is taken to be arbitrarily small. As we will detail in Section 3, informally, this phenomenon stems from the fact that the heavy-tailed noise leads to aggressive updates on $v$, which are then directly transmitted to $x$ due to the dynamics. Unless 'tamed', these updates create a hurling effect on $x$ and drift it away from the modes of the 'potential' $f$ that is sought to be minimized.
Contributions: In this study, we develop a fractional underdamped Langevin dynamics whose invariant distribution is guaranteed to be in the form of the Boltzmann-Gibbs measure; hence its modes exactly match the optima of $f$. We first prove a general theorem that holds for any kinetic energy function, not necessarily the Gaussian kinetic energy. However, it turns out that some components of the dynamics might not admit an analytical form for an arbitrary choice of the kinetic energy. We then identify two choices of kinetic energy for which all the terms in the dynamics can be written in an analytical form or accurately computed. We also analyze the Euler discretization of (14) and identify sufficient conditions for ensuring weak convergence of the ergodic averages computed over the iterates.
We observe that the discretization of the proposed dynamics has interesting algorithmic similarities with natural gradient descent (Amari, 1998) and gradient clipping (Pascanu et al., 2013), which we believe bring further theoretical understanding for their role in deep learning. Finally, we support our theory with experiments conducted on both synthetic settings and neural networks.
2 Technical Background & Related Work
The $\alpha$-stable distributions are heavy-tailed distributions that appear as the limiting distribution in the generalized CLT for sums of i.i.d. random variables with infinite variance (Lévy, 1937). In this paper, we are interested in the centered symmetric $\alpha$-stable distribution. A scalar random variable $X$ follows a symmetric $\alpha$-stable distribution, denoted as $X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)$, if its characteristic function takes the form $\mathbb{E}[\exp(i\omega X)] = \exp(-|\sigma\omega|^{\alpha})$, $\omega \in \mathbb{R}$, where $\alpha \in (0, 2]$ and $\sigma > 0$. Here, $\alpha$ is known as the tail-index, which determines the tail thickness of the distribution; $\mathcal{S}\alpha\mathcal{S}$ becomes heavier-tailed as $\alpha$ gets smaller. $\sigma$ is known as the scale parameter, which measures the spread of $X$ around $0$. The probability density function of a symmetric $\alpha$-stable distribution, denoted by $p_\alpha$, does not admit a closed-form expression in general, except for a few special cases: when $\alpha = 1$ and $\alpha = 2$, $\mathcal{S}\alpha\mathcal{S}$ reduces to the Cauchy and the Gaussian distributions, respectively. When $\alpha < 2$, $\alpha$-stable distributions have heavy tails, so that their moments are finite only up to order $\alpha$, in the sense that $\mathbb{E}[|X|^r] < \infty$ if and only if $r < \alpha$; in particular, the variance is infinite.
Lévy motions are stochastic processes with independent and stationary increments: their successive displacements are random and independent, and statistically identical over different time intervals of the same length. They can be viewed as the continuous-time analogue of random walks. The best-known and most important examples are the Poisson process, Brownian motion, the Cauchy process and, more generally, stable processes. Lévy motions are prototypes of Markov processes and of semimartingales, and concern many aspects of probability theory. We refer to (Bertoin, 1996) for a survey on the theory of Lévy motions.
In general, Lévy motions are heavy-tailed, which makes them appropriate for modeling natural phenomena with possibly large variations, as often occur in statistical physics (Eliazar and Klafter, 2003), signal processing (Kuruoglu, 1999), and finance (Mandelbrot, 2013).
We define $\mathrm{L}_t^{\alpha}$, a $d$-dimensional symmetric $\alpha$-stable Lévy motion with independent components, as follows. Each component of $\mathrm{L}_t^{\alpha}$ is an independent scalar $\alpha$-stable Lévy motion $\mathrm{L}_t$, which is defined as follows (cf. Figure 1):
$\mathrm{L}_0 = 0$ almost surely, and for any $0 \le t_0 < t_1 < \cdots < t_N$, the increments $\mathrm{L}_{t_n} - \mathrm{L}_{t_{n-1}}$ are independent.
The difference $\mathrm{L}_t - \mathrm{L}_s$ and $\mathrm{L}_{t-s}$ have the same distribution: $\mathcal{S}\alpha\mathcal{S}((t-s)^{1/\alpha})$ for $s < t$.
$\mathrm{L}_t$ has stochastically continuous sample paths, i.e. for any $\delta > 0$ and $s \ge 0$, $\mathbb{P}(|\mathrm{L}_t - \mathrm{L}_s| > \delta) \to 0$ as $t \to s$.
When $\alpha = 2$, we obtain a scaled Brownian motion $\sqrt{2} B_t$ as a special case, so that the difference $\mathrm{L}_t - \mathrm{L}_s$ follows a Gaussian distribution and $\mathrm{L}_t$ is almost surely continuous. When $\alpha < 2$, despite the stochastic continuity property, symmetric $\alpha$-stable Lévy motions can have a countable number of discontinuities, which are often known as jumps. The sample paths are continuous from the right and have left limits, a property known as càdlàg (Duan, 2015).
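A discretized sample path of $\mathrm{L}_t^{\alpha}$ can be simulated by summing $\mathcal{S}\alpha\mathcal{S}$ increments of scale $h^{1/\alpha}$; for $\alpha < 2$ the paths exhibit the large jumps discussed above. A sketch using the Chambers-Mallows-Stuck sampler:

```python
import numpy as np

def sas_increments(alpha, n, h, rng):
    """Increments L_{t+h} - L_t of a scalar alpha-stable Levy motion on a grid
    of mesh h: SaS with scale h**(1/alpha) (Chambers-Mallows-Stuck sampler)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    if alpha == 1.0:
        z = np.tan(u)
    else:
        z = (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
             * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))
    return h ** (1.0 / alpha) * z

rng = np.random.default_rng(1)
path_cauchy = np.cumsum(sas_increments(1.0, 5000, 1e-3, rng))  # jumpy path
path_brown = np.cumsum(sas_increments(2.0, 5000, 1e-3, rng))   # Brownian-like path
```

The largest single increment of the Cauchy path dwarfs that of the Brownian-like path, reflecting the jumps.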
Recently, Şimşekli (2017) extended the overdamped Langevin dynamics to an SDE driven by $\mathrm{L}_t^{\alpha}$, given as follows (in Şimşekli (2017), (8) does not contain an inverse temperature $\beta$, which was later introduced in Nguyen et al. (2019)):
where the drift $b(\cdot, \alpha)$ is defined as follows:
Here, $\mathcal{D}^{\alpha}$ denotes the fractional Riesz derivative (Riesz, 1949), defined through the Fourier transform $\mathcal{F}$ as $\mathcal{D}^{\alpha} u(x) \triangleq \mathcal{F}^{-1}\big(|\omega|^{\alpha}\,\mathcal{F}(u)(\omega)\big)(x)$. Briefly, $\mathcal{D}^{\alpha}$ extends usual differentiation to fractional orders, and when $\alpha = 2$ it coincides (up to a sign difference) with the usual second-order derivative $\mathrm{d}^2/\mathrm{d}x^2$.
The important property of the process (8) is that it admits an invariant distribution whose density is proportional to $\exp(-\beta f(x))$ (Nguyen et al., 2019). It is easy to show that, when $\alpha = 2$, the drift reduces to $b(x, 2) = -\nabla f(x)$, hence we recover the classical overdamped dynamics:
$$ \mathrm{d}x_t = -\nabla f(x_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t. $$
A common approximation, given in (12), simply replaces the drift $b(x, \alpha)$ with $-\nabla f(x)$ in a rather straightforward manner. While avoiding the computational issues originating from the Riesz derivatives, as shown in (Nguyen et al., 2019), this approximation can induce an arbitrary bias in a non-convex optimization context. Besides, the stationary distribution of this approximated dynamics was analytically derived in (Sliusarenko et al., 2013) for particular choices of the potential and the tail-index. These results show that, in the presence of heavy-tailed perturbations, the drift should be modified; otherwise, an inaccurate approximation of the Riesz derivatives can result in an explicit bias, which moves the modes of the distribution away from the minima of $f$.
From a pure Monte Carlo perspective, Ye and Zhu (2018) extended the fractional overdamped dynamics (8) to higher-order dynamics and proposed the so-called fractional Hamiltonian dynamics (FHD), given as follows:
They showed that the invariant measure of this process has a density proportional to $\exp(-\beta(f(x) + \|v\|^2/2))$, i.e., the Boltzmann-Gibbs measure. Similar to the overdamped case (8), the Riesz derivatives do not admit an analytical form in general; hence, they approximated them by using the same approximation given in (12), which yields the SDE given in (7) (up to a scaling factor). This observation also confirms that the heavy-tailed noise requires an adjustment in the dynamics, otherwise the induced bias might drive the dynamics away from the minima of $f$ (Capała and Dybiec, 2019).
3 Fractional Underdamped Langevin Dynamics
In this section, we develop the fractional underdamped Langevin dynamics (FULD), which is expressed by the following SDE:
$$ \mathrm{d}x_t = \nabla K(v_{t-})\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\big(\nabla f(x_{t-}) + \gamma\,\varphi(v_{t-})\big)\,\mathrm{d}t + \big(\gamma\beta^{-1}\big)^{1/\alpha}\,\mathrm{d}\mathrm{L}_t^{\alpha}, \qquad (14) $$
where $\varphi$ is the drift function for the velocity and $K$ denotes a general notion of kinetic energy. In the next theorem, which is the main theoretical result of this paper, we identify the relation between these two functions such that the solution process $(x_t, v_t)$ leaves the generalized Boltzmann-Gibbs measure $\pi(\mathrm{d}x, \mathrm{d}v) \propto \exp(-\beta(f(x) + K(v)))\,\mathrm{d}x\,\mathrm{d}v$ invariant. All the proofs are given in the appendix. Let $\varphi$ have the form given in (15), expressed in terms of the fractional Riesz derivatives of $\exp(-\beta K)$. Then the measure $\pi$ on $\mathbb{R}^{2d}$ is an invariant probability measure for the Markov process $(x_t, v_t)$. One of the main features of FULD is that the fractional Riesz derivatives appear only in the drift $\varphi$, which depends only on $v$. This is in sharp contrast with FHD (13), where the Riesz derivatives are taken over both $x$ and $v$, which is the source of intractability. Moreover, FULD enjoys the freedom to choose different kinetic energy functions $K$. In the sequel, we will investigate two options for $K$ such that the drift $\varphi$ can be obtained analytically.
3.1 Gaussian kinetic energy
In classical overdamped Langevin dynamics and Hamiltonian dynamics, the default choice of kinetic energy is the Gaussian kinetic energy, which corresponds to taking $K(v) = \|v\|^2/2$ (Neal, 2010; Livingstone et al., 2019; Dalalyan and Riou-Durand, 2018). With this choice, the fractional dynamics becomes:
$$ \mathrm{d}x_t = v_{t-}\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\big(\nabla f(x_{t-}) + \gamma\,\varphi(v_{t-})\big)\,\mathrm{d}t + \big(\gamma\beta^{-1}\big)^{1/\alpha}\,\mathrm{d}\mathrm{L}_t^{\alpha}. \qquad (16) $$
In the next result, we show that in this case the drift $\varphi$ admits an analytical expression. Let $K(v) = \|v\|^2/2$. Then, for any $v \in \mathbb{R}^d$, each component of $\varphi(v)$ can be written analytically in terms of the gamma function $\Gamma$ and the Kummer confluent hypergeometric function ${}_1F_1$. In particular, when $\alpha = 2$, we have $\varphi(v) = v$. We thus observe that the fractional dynamics (16) strictly extends the underdamped Langevin dynamics (6), which is recovered as $\alpha \to 2$.
Let us now investigate the form of the new drift and its implications. In Figure 2, we illustrate $\varphi$ for the one-dimensional case (note that for $d > 1$, each component of $\varphi$ still behaves as in Figure 2). We observe that, due to the hypergeometric function ${}_1F_1$, the drift grows exponentially fast in $|v|$ whenever $\alpha < 2$. Semantically, this means that, in order to compensate for the large jumps incurred by $\mathrm{L}_t^{\alpha}$, the drift has to react very strongly and hence prevent $v$ from taking large values. To illustrate this behavior, we provide more visual illustrations in the appendix.
Even though this aggressive behavior of $\varphi$ can be beneficial for the continuous-time system, it is unfortunately clear that its Euler-Maruyama discretization will not yield a practical algorithm, for the very same reason. Indeed, we would need the function $\varphi$ to be Lipschitz continuous in order to guarantee the algorithmic stability of its discretization (Kloeden and Platen, 2013); however, if we consider the integral form of ${}_1F_1$ (cf. (Abramowitz and Stegun, 1972)), we observe that the resulting function $\varphi$ is clearly not Lipschitz continuous in $v$. Therefore, we conclude that FULD with the Gaussian kinetic energy is mostly of theoretical interest.
3.2 Alpha-stable kinetic energy
The dynamics with the Gaussian kinetic energy requires a very strong drift mainly because we force the invariant distribution of $v_t$ to be a Gaussian. Since the Gaussian distribution has light tails, it cannot tolerate samples with large magnitudes, and hence requires a large dissipation to make sure $v_t$ does not take large values. In order to avoid such an explosive drift, which potentially degrades practicality, we next explore heavy-tailed kinetic energies, which allow the components of $v$ to take large values while still making sure that the drift in (15) admits an analytical form.
In our next result, we show that, when we choose an $\mathcal{S}\alpha\mathcal{S}$ kinetic energy whose tail-index matches that of the driving process $\mathrm{L}_t^{\alpha}$, the drift $\varphi$ simplifies and becomes the identity function. Let $p_\alpha$ be the probability density function of $\mathcal{S}\alpha\mathcal{S}(1)$. Choose $K(v) = \sum_{i=1}^{d} K_\alpha(v_i)$ in (15), where $K_\alpha(z) \triangleq -\log p_\alpha(z)$ for any $z \in \mathbb{R}$. Then, $\varphi(v) = v$. This result hints that $\mathcal{S}\alpha\mathcal{S}$ is perhaps the natural choice of kinetic energy for systems driven by $\mathrm{L}_t^{\alpha}$.
It now follows from Theorem 3.2 that FULD with the $\alpha$-stable kinetic energy reduces to the following SDE:
$$ \mathrm{d}x_t = \nabla K(v_{t-})\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\big(\nabla f(x_{t-}) + \gamma v_{t-}\big)\,\mathrm{d}t + \big(\gamma\beta^{-1}\big)^{1/\alpha}\,\mathrm{d}\mathrm{L}_t^{\alpha}. \qquad (19) $$
While this choice of $K$ results in an analytically available $\varphi$, unfortunately the function $\nabla K$ itself admits a closed-form analytical formula only when $\alpha = 1$ or $\alpha = 2$, due to the properties of the $\alpha$-stable densities. Nevertheless, as $\nabla K$ is based on one-dimensional densities, it can be computed very accurately by using the recent methods developed in (Ament and O'Neil, 2018). We visually inspect the behavior of $\nabla K$ in Figure 2 for dimension one. We observe that, as soon as $\alpha < 2$, $\nabla K$ takes a very smooth form: for small $|v|$ the function behaves like a linear function, and as $|v|$ goes to infinity, it vanishes. This behavior can be interpreted as follows: since $v$ can take larger values due to the heavy tails of the kinetic energy, in order to be able to target the correct distribution, the dynamics compensates for the potential bursts in $v$ by passing it through the asymptotically vanishing $\nabla K$.
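For the Cauchy case $\alpha = 1$, the density $p_1$ is available in closed form and the shape described above can be verified directly: $K_1'(z) = 2z/(1+z^2)$ is linear near the origin, bounded by $1$ in absolute value, vanishes at infinity, and has slope at most $2$:

```python
import numpy as np

def grad_K_cauchy(v):
    """Derivative of the Cauchy kinetic energy K_1(z) = -log p_1(z)
    = log(pi) + log(1 + z^2), i.e. K_1'(z) = 2z / (1 + z^2), componentwise."""
    v = np.asarray(v, dtype=float)
    return 2.0 * v / (1.0 + v ** 2)

z = np.linspace(-50.0, 50.0, 10_001)
g = grad_K_cauchy(z)
# Linear (slope ~2) near 0, maximal value 1 at z = 1, decaying like 2/z afterwards.
```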
3.3 Euler discretization and weak convergence analysis
As visually hinted in Figure 2, the function $\nabla K$ has strong regularity, which makes (19) potentially beneficial for practical implementations. Indeed, it is easy to verify that $\nabla K$ is Lipschitz continuous for $\alpha = 1$ and $\alpha = 2$, and in our next result we show that this observation holds for any admissible $\alpha$, which is a desired property when discretizing continuous-time dynamics. For any admissible $\alpha$, the map $z \mapsto K_\alpha'(z)$ is Lipschitz continuous, hence $\nabla K$ is also Lipschitz continuous. Accordingly, we consider the following Euler-Maruyama discretization of (19):
$$ v_{k+1} = v_k - \eta_{k+1}\big(\nabla f(x_k) + \gamma v_k\big) + \big(\gamma\beta^{-1}\eta_{k+1}\big)^{1/\alpha}\,\Delta\mathrm{L}_{k+1}^{\alpha}, \qquad x_{k+1} = x_k + \eta_{k+1}\,\nabla K(v_{k+1}), \qquad (20) $$
where $k$ denotes the iteration, $\Delta\mathrm{L}_k^{\alpha}$ is a random vector whose components are independently $\mathcal{S}\alpha\mathcal{S}(1)$-distributed, and $(\eta_k)_{k \geq 1}$ is a sequence of step-sizes.
In this section, we analyze the weak convergence of the ergodic averages computed by using (20). Given a test function $g$, consider its expectation with respect to the target measure $\pi$, i.e. $\pi(g) \triangleq \int g \,\mathrm{d}\pi$, with $\pi(\mathrm{d}x, \mathrm{d}v) \propto \exp(-\beta(f(x) + K(v)))\,\mathrm{d}x\,\mathrm{d}v$. We will discuss next how this expectation can be approximated through the sample averages
$$ \bar{g}_K \triangleq \frac{1}{H_K} \sum_{k=1}^{K} \eta_k \, g(x_k, v_k), \qquad H_K \triangleq \sum_{k=1}^{K} \eta_k, \qquad (21) $$
where $H_K$ is the cumulative sum of the step-size sequence. We note that this notion of convergence is stronger than the convergence of (20) near a local minimum, since it requires the convergence of the measure itself, and our analysis can be extended to a global optimization context by using the techniques presented in (Raginsky et al., 2017).
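To illustrate the weighted ergodic averages in (21), the sketch below runs the Brownian special case ($\alpha = 2$, overdamped for simplicity) on a one-dimensional quadratic potential with a decreasing step-size sequence; the weighted average of $g(x) = x^2$ approaches the stationary second moment $1/\beta$ (all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
beta, x = 1.0, 0.0
num, den = 0.0, 0.0  # running weighted sums for the ergodic average (21)

# Overdamped Langevin on f(x) = x^2/2, whose invariant law is N(0, 1/beta).
for k in range(1, 20_001):
    eta = 0.1 * k ** (-1.0 / 3.0)  # non-increasing, vanishing, diverging sum
    x = x - eta * x + np.sqrt(2.0 * eta / beta) * rng.standard_normal()
    num += eta * x ** 2            # test function g(x) = x^2
    den += eta                     # H_K, the cumulative step-size

second_moment = num / den          # approaches E[x^2] = 1/beta as K grows
```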
We now present the assumptions that imply our results.
The step-size sequence $(\eta_k)_{k \geq 1}$ is non-increasing and satisfies $\lim_{k \to \infty} \eta_k = 0$ and $\lim_{K \to \infty} H_K = \infty$.
Let $V: \mathbb{R}^{2d} \to (0, \infty)$ be a twice continuously differentiable (Lyapunov) function satisfying $\lim_{\|(x,v)\| \to \infty} V(x, v) = \infty$, $\|\nabla V\|^2 \leq C V$ for some $C > 0$, and having a bounded Hessian $\nabla^2 V$. Given $\alpha$, there exist constants $a > 0$ and $c \geq 0$ such that $\langle \nabla V, b \rangle \leq -a V + c$, where $b$ is the drift of the process defined in (19).
These are common assumptions ensuring that the SDE can be simulated over an infinite time-horizon and that the process is not explosive (Panloup, 2008; Şimşekli, 2017). We can now establish the weak convergence of (21) and present it as a corollary to Theorem 3, Proposition 3.3, and Theorem 2 of (Panloup, 2008). Assume that the gradient $\nabla f$ is Lipschitz continuous and has linear growth, i.e., there exists $C > 0$ such that $\|\nabla f(x)\| \leq C(1 + \|x\|)$ for all $x \in \mathbb{R}^d$. Furthermore, assume that Assumptions 1 and 2 hold for some $V$. If the test function $g$ is sufficiently regular, then $\bar{g}_K \to \pi(g)$ as $K \to \infty$.
3.4 Connections to existing approaches
We now point out interesting algorithmic connections between (20) and two methods that are commonly used in practice. We first recall our initial hypothesis that the gradient noise is $\mathcal{S}\alpha\mathcal{S}$-distributed and modify (20) accordingly, replacing the explicit noise term with the stochastic gradients:
$$ v_{k+1} = \gamma v_k - \eta \, \nabla \tilde{f}_{k+1}(x_k), \qquad x_{k+1} = x_k + \nabla K(v_{k+1}). \qquad (22) $$
As a special case, when $\gamma = 0$, we obtain a stochastic gradient descent-type recursion:
$$ x_{k+1} = x_k - \nabla K\big(\eta \, \nabla \tilde{f}_{k+1}(x_k)\big). \qquad (23) $$
Let us now consider gradient clipping, a heuristic approach for eliminating the problem of 'exploding gradients', which often appears in training neural networks (Pascanu et al., 2013; Zhang et al., 2019a). Very recently, Zhang et al. (2019b) empirically illustrated that such explosions stem from heavy-tailed gradients, and formally proved that gradient clipping indeed improves convergence rates under heavy-tailed perturbations. We notice that the behavior of (22) is reminiscent of gradient clipping: due to the vanishing behavior of $\nabla K$, as the components of $v$ get larger in magnitude, the update applied on $x$ gets smaller. This behavior becomes more prominent in (23). On the other hand, (22) is more aggressive in the sense that the updates can get arbitrarily small as the magnitude of $v$ grows, as opposed to being 'clipped' at a fixed threshold.
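The contrast with clipping can be made concrete: clipping caps the update norm at a threshold, whereas the Cauchy-induced map $g \mapsto 2g/(1+g^2)$ shrinks large components toward zero. A small illustration (the threshold and the test gradient are arbitrary):

```python
import numpy as np

def clip_update(g, tau):
    """Classical gradient clipping: rescale g so its norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def cauchy_damped_update(g):
    """Componentwise damping 2*g_i/(1 + g_i^2) arising from the Cauchy kinetic
    energy: large components are shrunk toward zero instead of merely capped."""
    return 2.0 * g / (1.0 + g ** 2)

g = np.array([0.1, 1.0, 100.0])
clipped = clip_update(g, tau=1.0)   # norm capped at 1, direction preserved
damped = cauchy_damped_update(g)    # the huge component shrinks toward 0
```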
The second connection is with the natural gradient descent algorithm, where the stochastic gradients are pre-conditioned with the inverse Fisher information matrix (FIM) (Amari, 1998). Here, the FIM is defined as the expected outer product of the gradients, where the expectation is taken over the data. Notice that when $\alpha = 1$ (i.e., the Cauchy distribution), we have the following form: $K_1'(z) = 2z/(1 + z^2)$. Therefore, we observe that, in (23), the update can be equivalently written as $\Lambda_k \nabla \tilde{f}_{k+1}(x_k)$, where $\Lambda_k$ is a diagonal matrix whose $i$-th entry is $2\eta / (1 + \eta^2 [\nabla \tilde{f}_{k+1}(x_k)]_i^2)$. Therefore, we can see $\Lambda_k^{-1}$ as an estimator of the diagonal part of the FIM, as they will be of the same order when the gradient magnitude is large. Besides, (22) then appears as its momentum extension. However, $\Lambda_k^{-1}$ will be biased, mainly due to the fact that the FIM is the average of the squared gradients, whereas $\Lambda_k^{-1}$ is based on the square of the averaged gradients. This connection is rather surprising, since a seemingly unrelated, differential-geometric approach turns out to have strong algorithmic similarities with a method that naturally arises when the gradient noise is Cauchy distributed.
4 Numerical Study
In this section, we illustrate our theory with several experiments, conducted in both synthetic and real-data settings. We note that, as expected, FULD with the Gaussian kinetic energy did not yield a numerically stable discretization, due to the explosive behavior of $\varphi$. Hence, in this section we focus only on FULD with the $\mathcal{S}\alpha\mathcal{S}$ kinetic energy in order to avoid obscuring the results, and from now on we will simply refer to it as FULD.
4.1 Synthetic setting
We first consider a one-dimensional synthetic setting, similar to the one considered in (Capała and Dybiec, 2019): a quartic potential function with a quadratic component. We then simulate the 'uncorrected dynamics' (UD) given in (7) and FULD (19) by using the Euler-Maruyama discretization, in order to compare their behavior for different $\alpha$. For $\alpha < 2$, we used the software accompanying (Ament and O'Neil, 2018) for computing $\nabla K$.
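This comparison can be reproduced in a few lines. The sketch below uses the Cauchy case $\alpha = 1$ (so that $\nabla K$ is available in closed form) on a stand-in quartic potential $f(x) = x^4/4 + x^2/2$; the exact potential, constants, and discretization used for the figures may differ, and the FULD update follows the form assumed in (19)-(20):

```python
import numpy as np

def simulate(corrected, n_steps=5000, eta=1e-2, gamma=1.0, beta=1.0, seed=3):
    """Euler-Maruyama under Cauchy (alpha=1) noise for a quartic potential.
    corrected=True: FULD-style position update x += eta * K'(v), K'(v)=2v/(1+v^2).
    corrected=False: 'uncorrected dynamics' (UD), position update x += eta * v."""
    rng = np.random.default_rng(seed)
    grad_f = lambda x: x ** 3 + x          # f(x) = x^4/4 + x^2/2
    x, v = np.float64(0.0), np.float64(0.0)
    xs = np.empty(n_steps)
    for k in range(n_steps):
        noise = np.tan(rng.uniform(-np.pi / 2, np.pi / 2))  # standard Cauchy
        v = v - eta * (grad_f(x) + gamma * v) + (gamma / beta) * eta * noise
        x = x + eta * (2.0 * v / (1.0 + v ** 2) if corrected else v)
        xs[k] = x
    return xs

xs_fuld = simulate(corrected=True)   # stays in a reasonable range
xs_ud = simulate(corrected=False)    # prone to large excursions under jumps
```

Since $|K_1'| \le 1$, each FULD position update is at most $\eta$ in magnitude, which is exactly the clipping-like stabilization discussed in Section 3.4.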
Figure 3 illustrates the distribution of the samples generated by simulating the two dynamics; the step-size, the inverse temperature, and the number of iterations are fixed across both methods. We observe that, when $\alpha$ is close to $2$, FULD very accurately captures the form of the distribution, whereas UD exhibits a visible bias and the shape of its resulting distribution is slightly distorted. Nevertheless, since the perturbations are close to a Gaussian in this case, the difference is not substantial and can be tolerable in an optimization context. However, this behavior becomes much more emphasized when we use a heavier-tailed driving process: when $\alpha = 1$, we observe that the target distribution of UD becomes distant from the Gibbs measure $\exp(-\beta f)$, and more importantly, its modes no longer match the minima of $f$, agreeing with the observations presented in (Capała and Dybiec, 2019). On the other hand, thanks to the correction brought by $\nabla K$, FULD still captures the target distribution very accurately, even when the driving force is Cauchy.
On the other hand, in our experiments we observed that, for small values of $\alpha$, UD can quickly become numerically unstable and even diverge for slightly larger step-sizes, whereas this problem never occurred for FULD. This outcome also stems from the fact that UD does not have any mechanism to compensate for the potentially large updates originating from the heavy-tailed perturbations. To illustrate this observation more clearly, in Figure 4 we show the iterates that were used for producing Figure 3. We observe that, while the iterates of UD are well-behaved for $\alpha$ close to $2$, the magnitude range of the iterates gets quite large when $\alpha$ is set to $1$. On the other hand, for both values of $\alpha$, the FULD iterates always stay within a reasonable range, thanks to the clipping-like effect of $\nabla K$.
4.2 Neural networks
In our next set of experiments, we evaluate our theory on neural networks. In particular, we apply the iterative scheme given in (22) as an optimization algorithm for training neural networks, and compare its behavior with classical SGDm defined in (2). In this setting, we do not add any explicit noise, all the stochasticity comes from the potentially heavy-tailed stochastic gradients (3).
We consider a fully-connected network for a classification task on the MNIST and CIFAR10 datasets, with different depths (i.e. number of layers) and widths (i.e. number of neurons per layer). For each depth-width pair, we train two neural networks by using SGDm (2) and our modified version (22), and compare their final train/test accuracies and loss values. We use the conventional train-test split of the datasets: for MNIST we have 60K training and 10K test samples, and for CIFAR10 these numbers are 50K and 10K, respectively. We use the cross-entropy loss (also referred to as the 'negative log-likelihood').
We note that the modified scheme (22) reduces to (2) when $\alpha = 2$, since $\nabla K(v) = v$ in that case. Hence, in this section, we will refer to the special case of (22) with $\alpha = 2$ as SGDm. On the other hand, in these experiments, computing $\nabla K$ becomes impractical for $1 < \alpha < 2$, since the algorithms given in (Ament and O'Neil, 2018) become prohibitively slow with increasing dimension $d$. Hence, we will focus on the analytically available Cauchy case, i.e., $\alpha = 1$, which can be efficiently implemented (cf. its definition in Section 3.4). We expect that, if the stochastic gradient noise can be well-approximated by a Cauchy distribution, then the modified dynamics should exhibit an improved performance, since it would eliminate the potential bias brought by the heavy-tailed noise.
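For concreteness, the Cauchy-case recursion can be sketched as follows. This is a simplified stand-in for (22), with synthetic Gaussian noise added to the gradients of a toy quadratic rather than an actual neural network, and with hypothetical constants:

```python
import numpy as np

def cauchy_sgdm(stoch_grad, x0, eta=0.05, gamma=0.9, n_iters=500):
    """Momentum recursion where the position update passes the velocity through
    K_1'(v) = 2v/(1+v^2) (Cauchy kinetic energy, alpha = 1); a sketch of (22)."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = gamma * v - eta * stoch_grad(x)
        x = x + 2.0 * v / (1.0 + v ** 2)  # damped, clipping-like update
    return x

rng = np.random.default_rng(7)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)  # toy quadratic
x_out = cauchy_sgdm(noisy_grad, np.array([4.0, -4.0]))
```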
In these experiments, we fix the step-size $\eta$ and the friction $\gamma$, and run the algorithms for a fixed number of iterations. (Since the scale of the gradient noise is proportional to the step-size (see (20)), in this setup a fixed $\eta$ implicitly determines $\beta$.) We measure the accuracy and the loss at every 100th iteration and report the average of the last two measurements. Figure 5 shows the results obtained on the MNIST dataset. We observe that, in most of the cases, setting $\alpha = 1$ yields a better performance in terms of both training and testing accuracies/losses, and the difference becomes more visible for certain widths. We obtain a similar result on the CIFAR10 dataset, as illustrated in Figure 6: in most of the cases $\alpha = 1$ performs better, implying that the gradient noise can be approximated by a Cauchy random variable.
We observed a similar behavior for some of the other width settings; however, for certain widths we did not perceive a significant difference in the performance of the algorithms. We suspect that in those cases the gradient noise is well-approximated by neither a Gaussian nor a Cauchy distribution. On the other hand, for the largest width, SGDm (i.e., $\alpha = 2$) resulted in a slightly better performance, which would be an indication that the Gaussian approximation is closer. The corresponding figures are provided in the appendix.
5 Conclusion and Future Directions
We considered the continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and developed theory for the case where the gradient noise can be well-approximated by a heavy-tailed $\alpha$-stable random vector. As opposed to naively replacing the driving stochastic force in ULD, which corresponds to running SGDm with heavy-tailed gradient noise, the dynamics that we developed exactly targets the Boltzmann-Gibbs distribution and hence does not introduce an implicit bias. We further established the weak convergence of the Euler-Maruyama discretization and illustrated interesting connections between the discretized algorithm and existing approaches commonly used in practice. We supported our theory with experiments on a synthetic setting and on fully connected neural networks.
Our framework opens up interesting future directions. Our current modeling strategy requires a state-independent, isotropic noise assumption, which would not accurately reflect reality. While anisotropic noise can be incorporated into our framework by using the approach of Ye and Zhu (2018), state-dependent noise introduces challenging technical difficulties. Similarly, it has been illustrated that the tail-index $\alpha$ can depend on the state and different components of the noise can have a different $\alpha$ (Şimşekli et al., 2019). Incorporating such state dependencies would be an important direction of future research.
We thank Jingzhao Zhang for fruitful discussions. The contribution of Umut Şimşekli to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data science & Artificial Intelligence from Télécom Paris. Lingjiong Zhu is grateful to the support from Simons Foundation Collaboration Grant. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.
- Abramowitz and Stegun (1972) M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. Dover, New York, 1972.
- Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- Ament and O’Neil (2018) Sebastian Ament and Michael O’Neil. Accurate and efficient numerical calculation of stable densities via optimized quadrature and asymptotics. Statistics and Computing, 28:171–185, 2018.
- Bertoin (1996) Jean Bertoin. Lévy Processes. Cambridge University Press, 1996.
- Betancourt et al. (2017) Michael Betancourt, Simon Byrne, Sam Livingstone, Mark Girolami, et al. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257–2298, 2017.
- Brosse et al. (2019) Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–3663, 2019.
- Capała and Dybiec (2019) Karol Capała and Bartłomiej Dybiec. Stationary states for underdamped anharmonic oscillators driven by Cauchy noise. arXiv preprint arXiv:1905.12078, 2019.
- Chaudhari and Soatto (2018) P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations, 2018.
- Şimşekli (2017) Umut Şimşekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In International Conference on Machine Learning, pages 3200–3209, 2017.
- Dalalyan and Riou-Durand (2018) Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.
- Duan (2015) J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.
- Dubkov et al. (2008) Alexander A Dubkov, Bernardo Spagnolo, and Vladimir V Uchaikin. Lévy flight superdiffusion: An introduction. International Journal of Bifurcation and Chaos, 18(09):2649–2672, 2008.
- Eliazar and Klafter (2003) Iddo Eliazar and Joseph Klafter. Lévy-driven Langevin systems: Targeted stochasticity. Journal of Statistical Physics, 111(3-4):739–768, 2003.
- Fischer (2010) Hans Fischer. A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer Science & Business Media, 2010.
- Gao et al. (2018a) Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of Stochastic Gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv:1809.04618, 2018a.
- Gao et al. (2018b) Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv:1812.07725, 2018b.
- Hérau and Nier (2004) Frédéric Hérau and Francis Nier. Isotropic hypoellipticity and trend to equilibrium for the Fokker-Planck equation with a high-degree potential. Archive for Rational Mechanics and Analysis, 171(2):151–218, 2004.
- Hu et al. (2017) W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
- Jastrzebski et al. (2017) S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
- Kloeden and Platen (2013) Peter E Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations, volume 23. Springer Science & Business Media, 2013.
- Kuruoglu (1999) E. E. Kuruoglu. Signal Processing in α-Stable Noise Environments: A Least ℓp-Norm Approach. PhD Thesis, University of Cambridge, 1999.
- Lévy (1937) P. Lévy. Théorie de l’addition des variables aléatoires. Gauthiers-Villars, Paris, 1937.
- Livingstone et al. (2019) Samuel Livingstone, Michael F Faulkner, and Gareth O Roberts. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106(2):303–319, 2019.
- Lu et al. (2016) Xiaoyu Lu, Valerio Perrone, Leonard Hasenclever, Yee Whye Teh, and Sebastian J Vollmer. Relativistic Monte Carlo. arXiv preprint arXiv:1609.04388, 2016.
- Maddison et al. (2018) Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.
- Mandelbrot (2013) B. B. Mandelbrot. Fractals and Scaling in Finance: Discontinuity, Concentration, Risk. Springer Science & Business Media, 2013.
- Mandt et al. (2016) S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.
- Martin and Mahoney (2019) Charles H Martin and Michael W Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. arXiv preprint arXiv:1901.08278, 2019.
- Montroll and Bendler (1984) Elliott W. Montroll and John T. Bendler. On Lévy (or stable) distributions and the Williams-Watts model of dielectric relaxation. Journal of Statistical Physics, 34(1):129–162, Jan 1984. ISSN 1572-9613.
- Montroll and West (1979) Elliott W. Montroll and Bruce J. West. Chapter 2 - On an enriched collection of stochastic processes. In E.W. Montroll and J.L. Lebowitz, editors, Fluctuation Phenomena, pages 61–175. Elsevier, 1979. ISBN 978-0-444-85248-9.
- Neal (2010) RM Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo (S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, eds.), 2010.
- Nguyen et al. (2019) Thanh Huy Nguyen, Umut Şimşekli, and Gaël Richard. Non-asymptotic analysis of Fractional Langevin Monte Carlo for non-convex optimization. In International Conference on Machine Learning, pages 4810–4819, 2019.
- Ortigueira (2006) M. D. Ortigueira. Riesz potential operators and inverses via fractional centred derivatives. International Journal of Mathematics and Mathematical Sciences, 2006, 2006.
- Panigrahi et al. (2019) Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. Non-Gaussianity of stochastic gradient noise. arXiv preprint arXiv:1910.09626, 2019.
- Panloup (2008) Fabien Panloup. Recursive computation of the invariant measure of a stochastic differential equation driven by a Lévy process. Annals of Applied Probability, 18(2):379–426, 2008.
- Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
- Pavliotis (2014) Grigorios A Pavliotis. Stochastic Processes and Applications: Diffusion processes, the Fokker-Planck and Langevin Equations, volume 60. Springer, 2014.
- Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.
- Riesz (1949) M. Riesz. L’intégrale de Riemann-Liouville et le problème de Cauchy. Acta Mathematica, 81(1):1–222, 1949.
- Schertzer et al. (2001) D. Schertzer, M. Larchevêque, J. Duan, V.V. Yanovsky, and S. Lovejoy. Fractional Fokker-Planck equation for nonlinear stochastic differential equations driven by non-Gaussian Lévy stable noises. Journal of Mathematical Physics, 42(1):200–212, 2001.
- Şimşekli et al. (2019) Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
- Simsekli et al. (2019) Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837, 2019.
- Sliusarenko et al. (2013) O Yu Sliusarenko, DA Surkov, V Yu Gonchar, and Aleksei V Chechkin. Stationary states in bistable system driven by Lévy noise. The European Physical Journal Special Topics, 216(1):133–138, 2013.
- Smith et al. (2018) Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018.
- Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
- Wintner (1941) Aurel Wintner. The singularities of Cauchy’s distributions. Duke Math. J., 8(4):678–681, 12 1941. doi: 10.1215/S0012-7094-41-00857-8.
- Ye and Zhu (2018) Nanyang Ye and Zhanxing Zhu. Stochastic fractional Hamiltonian Monte Carlo. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018.
- Zhang et al. (2019a) Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Analysis of gradient clipping and adaptive scaling with a relaxed smoothness condition. arXiv preprint arXiv:1905.11881, 2019a.
- Zhang et al. (2019b) Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why ADAM beats SGD for attention models. arXiv preprint arXiv:1912.03194, 2019b.
- Zhu et al. (2019) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In Proc. Int. Conf. Mach. Learn., pages 7654–7663, 2019.
- Zou et al. (2018) Difan Zou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced Hamilton Monte Carlo methods. In International Conference on Machine Learning, pages 6028–6037, 2018.
Appendix A Proof of Theorem 3
Let denote the probability density of . Then it satisfies the fractional Fokker-Planck equation (see Proposition 1 and Section 7 in (Schertzer et al., 2001)):
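A sketch of this equation, assuming the underdamped Lévy-driven dynamics of the main text with friction γ, potential f, noise intensity σ, and tail index α (these symbols are borrowed from the surrounding notation; the exact form is not confirmed by the source), following the generic fractional Fokker-Planck form of Schertzer et al. (2001):

```latex
\frac{\partial p_t(x,v)}{\partial t}
  = - v \cdot \nabla_x p_t(x,v)
    + \nabla_v \cdot \bigl( (\gamma v + \nabla f(x))\, p_t(x,v) \bigr)
    - \sigma^{\alpha} (-\Delta_v)^{\alpha/2} p_t(x,v),
```

where the fractional Laplacian (negative Riesz derivative) acts in the velocity variable; for α = 2 this would reduce to the classical kinetic Fokker-Planck equation.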
We can compute that
where we used the property (Proposition 1 in (Şimşekli, 2017)) and the semi-group property of the Riesz derivative .
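As a numerical illustration of this semi-group property, the sketch below approximates the one-dimensional Riesz derivative spectrally via its Fourier multiplier -|ξ|^α (the sign convention, the periodic grid, and the helper `riesz_derivative` are illustrative choices, not taken from the paper):

```python
import numpy as np

def riesz_derivative(u, alpha, L=20.0):
    """Spectral Riesz derivative on a periodic grid of length L,
    defined via the Fourier multiplier -|xi|^alpha."""
    n = len(u)
    xi = 2 * np.pi * np.fft.fftfreq(n, d=L / n)  # angular frequencies
    return np.fft.ifft(-np.abs(xi) ** alpha * np.fft.fft(u)).real

# Verify the semi-group property numerically: applying D^a then D^b
# multiplies by (-|xi|^a)(-|xi|^b) = +|xi|^(a+b), i.e. D^a D^b u = -D^(a+b) u.
n, L = 512, 20.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
u = np.exp(-x ** 2)  # smooth, rapidly decaying test function
a, b = 0.7, 0.9
lhs = riesz_derivative(riesz_derivative(u, a, L), b, L)
rhs = -riesz_derivative(u, a + b, L)
assert np.max(np.abs(lhs - rhs)) < 1e-8
```

The extra minus sign in the composition is an artifact of the -|ξ|^α convention; conventions without the leading minus compose without it.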
We can also compute that
Hence, is an invariant probability measure. The proof is complete.
Appendix B Proof of Theorem 3.1
We can compute that
for every .
Recall the definition of Fourier transform and its inverse:
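One common choice is the unitary convention (an assumption here, since conventions differ in the placement of the 2π factor, but it is consistent with the self-transform remark that follows):

```latex
\hat{u}(\xi) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-i x \xi}\, u(x)\, dx,
\qquad
u(x) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{i x \xi}\, \hat{u}(\xi)\, d\xi.
```

Under this convention the Gaussian $e^{-x^2/2}$ is its own Fourier transform.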
Notice that the Fourier transform of is itself, i.e. , and moreover, , and therefore,
Furthermore, we can compute that
By the Taylor expansion of sine function, we get
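For reference, the expansion invoked here is the standard Maclaurin series of the sine function:

```latex
\sin z = z - \frac{z^{3}}{3!} + \frac{z^{5}}{5!} - \cdots
       = \sum_{k=0}^{\infty} \frac{(-1)^{k}}{(2k+1)!}\, z^{2k+1}.
```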