1.1 Context and motivation
Deep neural networks have revolutionized machine learning and have ubiquitous use in many application domains(LeCun et al., 2015; Krizhevsky et al., 2012; Hinton et al., 2012). In full generality, many key tasks in deep learning reduce to solving the following optimization problem:
where denotes the weights of the neural network, denotes the loss function that is typically non-convex in , each denotes the (instantaneous) loss function that is contributed by the data point , and denotes the total number of data points. Stochastic gradient descent (SGD) is one the most popular approaches for attacking this problem in practice and is based on the following iterative updates:
where denotes the iteration number, is the step-size (or the learning rate), and denotes the stochastic gradient at iteration , that is defined as follows:
Here, is a random subset that is drawn with or without replacement at iteration , and denotes the number of elements in .
SGD is widely used in deep learning with a great success in its computational efficiency (Bottou, 2010; Bottou and Bousquet, 2008; Daneshmand et al., 2018). Beyond efficiency, understanding how SGD performs better than its full batch counterpart in terms of test accuracy remains a major challenge. Even though SGD seems to find perfect training performance at (near-) zero loss solutions on the training landscape (at least in certain regimes (Zhang et al., 2017a; Sagun et al., 2015; Keskar et al., 2016; Geiger et al., 2018)), it appears that the algorithm finds solutions with different properties depending on how it is tuned (Sutskever et al., 2013; Keskar et al., 2016; Jastrzebski et al., 2017; Hoffer et al., 2017; Masters and Luschi, 2018; Smith et al., 2017). Despite the fact that the impact of SGD on generalization has been studied (Advani and Saxe, 2017; Wu et al., 2018a; Neyshabur et al., 2017), a satisfactory theory that can explain its success in a way that encompasses such peculiar empirical properties is still lacking.
A popular approach for investigating the behavior of SGD is based on considering SGD as a discretization of a continuous-time process (Mandt et al., 2016; Jastrzebski et al., 2017; Li et al., 2017a; Hu et al., 2017; Zhu et al., 2018; Chaudhari and Soatto, 2018)
. This approach models the stochastic gradient noise as a Gaussian distribution, i.e.satisfies
denotes the multivariate (Gaussian) normal distribution and
denotes the identity matrix of appropriate size.111We note that more sophisticated assumptions than (4) have been made in terms of the covariance matrix of the Gaussian distribution (e.g. state dependent, anisotropic). However, in all these cases, the resulting distribution is still a Gaussian, therefore the same criticism holds. The rationale behind this assumption is that, if the size of the minibatch is large enough, then we can invoke the Central Limit Theorem (CLT) and assume that the distribution of is approximately Gaussian. Then, under this assumption, (2) can be written as follows:
where denotes a standard normal random vector in . If we further assume that is small enough, then the continuous-time analogue of the discrete-time process (5) is the following stochastic differential equation (SDE):
where denotes the standard Brownian motion. This SDE is a variant of the well-known Langevin diffusion and under mild regularity assumptions on , one can show that the Markov process is ergodic with its unique invariant measure, whose density is proportional to for any (Roberts and Stramer, 2002). From this perspective, the SGD recursion in (5) can be seen as a first-order Euler-Maruyama discretization of the Langevin dynamics (see also (Li et al., 2017a; Jastrzebski et al., 2017; Hu et al., 2017)), which is often referred to as the Unadjusted Langevin Algorithm (ULA) (Roberts and Stramer, 2002; Lamberton and Pages, 2003; Durmus and Moulines, 2015; Durmus et al., 2016).
Based on this observation, Jastrzebski et al. (2017) focused on the relation between this invariant measure and the algorithm parameters, namely the step-size and mini-batch size, as a function of . They concluded that the ratio of step-size divided by the batch size is the control parameter that determines the width of the minima found by SGD. Furthermore, they revisit the famous wide minima folklore (Hochreiter and Schmidhuber, 1997): Among the minima found by SGD, the wider it is, the better it performs on the test set. However, there are several fundamental issues with this approach, which we will explain below.
We first illustrate a typical mismatch between the Gaussianity assumption and the empirical behavior of the stochastic gradient noise in terms of the long term behavior. In Figure 1
, we plot the histogram of the norms of the stochastic gradient noise at the first and the last iterations that are computed using a convolutional neural network (AlexNet) in an image classification problem on the CIFAR10 dataset and compare it to the histogram of the norms of Gaussian random vectors222In our conference proceeding (Şimşekli et al., 2019), we have identified an error in the corresponding figure: the histogram of the norms was computed over all the gradients noises that were obtained throughout training. This issue is fixed in this article.. It can be clearly observed that, even though the shape of the histogram corresponding to gradients resembles the one of the Gaussian vectors at the first iteration, throughout training, it drifts apart from the Gaussian and exhibits a heavy-tailed behavior.
In addition to the empirical observations, the Gaussianity assumption also yields some theoretical issues. The first issue with this assumption is that the current SDE analyses of SGD are based on the invariant measure of the SDE, which implicitly assumes that sufficiently many iterations have been taken to converge to that measure. Recent results on ULA (Raginsky et al., 2017; Xu et al., 2018) have shown that, the required number of iterations to achieve the invariant measure often grows exponentially with the dimension . This result contradicts with the current practice: considering the large size of the neural networks and limited computational budget, only a limited number of iterations – which is much smaller than – can be taken. This conflict becomes clearer in the light of the recent works that studied the local behavior of ULA (Tzen et al., 2018; Zhang et al., 2017b). These studies showed that ULA will get close to the nearest local optimum in polynomial time; however, the required amount of time for escaping from that local optimum increases exponentially with the dimension. Therefore, the phenomenon that SGD prefers wide minima within a considerably small number of iterations cannot be explained using the asymptotic distribution of the SDE given in (6).
The second issue is related to the local behavior of the process and becomes clear when we consider the metastability analysis of Brownian motion-driven SDEs. These studies (Freidlin and Wentzell, 1998; Bovier et al., 2004; Imkeller et al., 2010b) consider the case where is initialized in a quadratic basin and then analyze the minimum time such that is outside that basin.
They show that this so-called first exit time depends exponentially on the height of the basin; however, this dependency is only polynomial with the width of the basin. These theoretical results directly contradict with the wide minima phenomenon: even if the height of a basin is slightly larger, the exit-time from this basin will be dominated by its height, which implies that the process would stay longer in (or in other words, ‘prefer’) deeper minima as opposed to wider minima. The reason why the exit-time is dominated by the height is due to the continuity of the Brownian motion, which is in fact a direct consequence of the Gaussian noise assumption.
A final remark on the issues of this approach is the observation that landscape is flat at the bottom regardless of the batch size used in SGD (Sagun et al., 2017)
. In particular, the spectrum of the Hessian at a near critical point with close to zero loss value has many near zero eigenvalues. Therefore, local curvature measures that are used as a proxy for measuring the width of a basin can be misleading. Such measures usually correlate with the magnitudes of large eigenvalues of the Hessian which are few(Keskar et al., 2016; Jastrzebski et al., 2017). Besides, during the dynamics of SGD it has been observed that the algorithm does not cross barriers except perhaps at the very initial phase (Xing et al., 2018; Baity-Jesi et al., 2018). Such dependence of width on an essentially-flat landscape combined with the lack of explicit barrier crossing during the SGD descent forces us to rethink the analysis of basin hopping under a noisy dynamics.
1.2 Proposed framework
In this study, we aim at addressing these contradictions and come up with an arguably better-suited hypothesis for the stochastic gradient noise that has more pertinent theoretical implications for the phenomena associated with SGD. In particular, we go back to (3) and (4) and reconsider the application of CLT. This classical CLT assumes that is a sum of many independent and identically distributed (i.i.d.) random vectors, whose covariance matrix exists and is invertible, and then it states that the law of converges to a Gaussian distribution, which then paves the way for (5
). Even though the finite-variance assumption seems natural and intuitive at the first sight, it turns out that in many domains, such as turbulent motions (Weeks et al. (1995)), oceanic fluid flows (Woyczyński (2001)), finance (Mandelbrot (2013)), biological evolution (Jourdain et al. (2012)), audio signals (Liutkus and Badeau (2015); Şimşekli et al. (2015); Leglaive et al. (2017); Şimşekli et al. (2018)), brain signals (Jas et al. (2017)), the assumption might fail to hold (see (Duan, 2015) for more examples). In such cases, the classical CLT along with the Gaussian approximation will no longer hold. While this might seem daunting, fortunately, one can prove a generalized CLT and show that the law of the sum of these i.i.d. variables with infinite variance still converges to a family of heavy-tailed distributions that is called the -stable distribution (Lévy, 1937). As we will detail in Section 2, these distributions are parametrized by their tail-index and they coincide with the Gaussian distribution when .
In this study, we relax the finite-variance assumption on the stochastic gradient noise and by invoking the generalized CLT, we assume that follows an -stable distribution, as hinted in Figure 1(d). By following a similar rationale to (5) and (6), we reformulate SGD with this new assumption and consider its continuous-time limit for small step-sizes. Since the noise might not be Gaussian anymore (i.e. when ), the use of the Brownian motion would not be appropriate in this case and we need to replace it with the -stable Lévy motion, whose increments have an -stable distribution (Yanovsky et al. (2000)). Due to the heavy-tailed nature of -stable distribution, the Lévy motion might incur large discontinuous jumps and therefore exhibits a fundamentally different behavior than the Brownian motion, whose paths are on the contrary almost surely continuous. As we will describe in detail in Section 2, the discontinuities also reflect in the metastability properties of Lévy-driven SDEs, which indicate that, as soon as , the first exit time from a basin does not depend on its height; on the contrary, it directly depends on its width and the tail-index . Informally, this implies that the process will escape from narrow minima – no matter how deep they are – and stay longer in wide minima. Besides, as
gets smaller, the probability for the dynamic to jump into a wide basin will increase. Therefore, if the-stable assumption on the stochastic gradient noise holds, then the existing metastability results automatically provide strong theoretical insights for illuminating the behavior of SGD.
The main contributions of this paper are twofold: (i) we perform an extensive empirical analysis of the tail-index of the stochastic gradient noise in deep neural networks and (ii) based on these empirical results, we bring an alternative perspective to the existing approaches for analyzing SGD and shed more light on the folklore that SGD prefers wide minima by establishing a bridge between SGD and the related theoretical results from statistical physics and stochastic analysis.
We conduct experiments on the most common deep learning architectures. In particular, we investigate the tail behavior under fully-connected and convolutional models using negative log likelihood (NLL) and linear hinge loss functions on MNIST, CIFAR10, and CIFAR100 datasets. For each configuration, we scale the size of the network and batch size used in SGD and monitor the effect of each of these settings on the tail index .
Our experiments reveal several remarkable results:
In all our configurations, the stochastic gradient noise turns out to be highly non-Gaussian and possesses a heavy-tailed behavior.
Increasing the size of the minibatch has a very little impact on the tail-index, and as opposed to the common belief that larger minibatches result in Gaussian gradient noise, the noise is still far from being Gaussian.
There is a strong interaction between the network architecture, network size, dataset, and the tail-index, which ultimately determine the dynamics of SGD on the training surface. This observation supports the view that, the geometry of the problem and the dynamics induced by the algorithm cannot be separated from each other.
In almost all configurations, we observe two distinct phases of SGD throughout iterations. During the first phase, the tail-index rapidly decreases and SGD possesses a clear jump when the tail-index is at its lowest value and causes a sudden jump in the accuracy. This behavior strengthens the view that SGD crosses barriers at the very initial phase.
The paper is organized as follows. In Section 2 we provide the technical background required for the -stable distributions and SDEs driven by -stable Lévy motions. We also formalize the framework in which we analyze SGD by using such SDEs as a proxy. We then describe in Section 3 the metastability and first exit time properties of such SDEs and their discretizations, and discuss their connection with the wide minima phenomenon. In Section 4, we provide formal theoretical analysis for the convergence behavior of SGD to a local optimum under heavy-tailed gradient noise and discuss the implications of this result. In Section 5 we describe our experimental methodology and in Section 6 we provide our empirical results which validate our theory. Our approach also opens up several interesting future directions and open questions, as we discuss in Section 7.
We note that an initial version of this paper was presented at the 36th International Conference on Machine Learning (Şimşekli et al., 2019). In this article, we have significantly extended our conference proceeding. These extensions can be summarized as follows:
We have updated Section 3 and added more detailed explanations for clarity.
In addition to the metastability results, we have added a discussion of the first exit time properties of the Lévy-driven SDEs, which form the basis of the metastability results (Theorem 3.1). Furthermore, we have summarized our recent theoretical findings (Nguyen et al., 2019a) on the first exit time properties of discretized processes and translated them to the context of this article (Section 3.2).
We have added a summary of the first exit time behavior of the Lévy-driven systems in (end of Section 3.1).
In Section 4, under certain assumptions, we have proved a new local convergence result for SGD with an explicit rate and made a connection between the convergence rate of SGD and the tail-index of the gradient noise. At the end of Section 4, we have also added a short discussion about the global convergence properties of the discretized process by summarizing the results that we proved in (Nguyen et al., 2019b).
Finally, we have added several new experimental results (Section 6). In particular, we have presented new results on stability tests (Section 6.1), finer-grained layer-wise tail-index estimation (Sections 6.2 and 6.4), and an investigation of the relation between the tail-index and the generalization properties of the network (6.5).
2 Stable distributions and SGD as a Lévy-Driven SDEs
The CLT states that the sum of i.i.d. random variables with a finite second moment converges to a normal distribution if the number of summands grow. However, if the variables have heavy-tails, the second moment may not exist. For instance, if their densityhas a power-law tail decreasing as where ; only -th moment exists with . In this case, generalized central limit theorem (GCLT) says that the sum of such variables will converge to a distribution called the -stable distribution instead as the number of summands grows (see e.g. (Fischer, 2010). In this work, we focus on the centered symmetric -stable () distribution, which is a special case of -stable distributions that are symmetric around the origin.
We can view the distribution as a heavy-tailed generalization of a centered Gaussian distribution. The
distributions are defined through their characteristic function via
Even though their probability density function does not admit a closed-form formula in general except in special cases, their density decays with a power law tail likewhere is called the tail-index which determines the behavior of the distribution: as gets smaller; the distribution has a heavier tail. In fact, the parameter also determines the moments: when , if and only if ; implying has infinite variance when . The parameter is the scale parameter and controls the spread of around . We recover the Gaussian distribution as a special case when
and the Cauchy distribution when.
Following the above argument, a more general assumption on the stochastic gradient noise can be given by:
where denotes the ’th component of a vector , and distributed with . Clearly, this assumption is way too general to offer reasonable theoretical treatment. We will resort to several simplifications: (1) We assume that each coordinate of is distributed with the same which depends on the state . Here, this dependency is not crucial since we are mainly interested in the tail-index , which can be estimated independently from the scale parameter. Therefore, we will simply denote as for clarity. (2) We further assume that each coordinate of is distributed with the same independent of the state . We will demonstrate the state independence at later stages of SGD experimentally in Section 6.4, however, imposing the coordinate dependence is a much harder challenge which will be addressed in the section devoted for open problems (Section 7).
where is a random vector such that . If the step-size is small enough, then we can consider the continuous-time limit of this discrete-time process, which is expressed in the following SDE driven by an -stable Lévy process:
where denotes the -dimensional -stable Lévy motion with independent components. In other words, each component of is an independent -stable Lévy motion in . For the scalar case it is defined as follows for (Duan (2015)):
For , the increments are independent ().
The difference and have the same distribution: for .
is continuous in probability (i.e. it has stochastically continuous sample paths): for all and , as .
The SDE in (10) exhibits a fundamentally different behavior than the one in (6) does. This is mostly due to the stochastic continuity property of , which enables to have a countable number of discontinuities, which are sometimes called ‘jumps’. In the rest of this section, we will recall important theoretical results about this SDE and discuss their implications on SGD.
3 First Exit Time and Metastability Analysis
We start by reviewing known metastability properties of the -stable Lévy process (10) from the literature. We will also focus on the first exit time which is, roughly speaking, the average time it takes for the process to exit a neighborhood of a local minima (a quantity we define formally later in (16)). Then, we will summarize the theoretical results proven in our recent work (Nguyen et al., 2019b) on the metastability properties of the discrete-time processes obtained by an Euler discretization of such processes.
3.1 The continuous-time process
For clarity of the presentation and notational simplicity we focus on the scalar case and consider the SDE (10) in (i.e. ). Multidimensional generalizations of the metastability results presented in this paper can be found in (Imkeller et al., 2010a) and will be summarized at the end of this section. We rewrite (10) as follows:
for , started from the initial point , where is the -stable Lévy process, is the noise level and is a non-convex objective with local minima. We denote the derivative of by . When , we recover the gradient descent dynamics in continuous time: , where the local minima are the stable points of this differential equation. However, as soon as , these states become ‘metastable’, meaning that there is a positive probability for to transition from one basin to another. However, the time required for transitioning to another basin strongly depends on the characteristics of the injected noise. The two most important cases are and . When , (i.e. the Gaussianity assumption) the process is continuous, which requires it to ‘climb’ the basin all the way up, in order to be able to transition to another basin. This fact makes the transition-time depend on the height of the basin. On the contrary, when , the process can incur discontinuities and does not need to cross the boundaries of the basin in order to transition to another one, since it can directly jump. This property is called the ‘transition phenomenon’ (Duan, 2015) and makes the transition-time mostly depend on the width of the basin. In the rest of the section, we will formalize these explanations.
Gradient-like flows driven by Brownian motion and weak error for their discretization are well studied from a theoretical standpoint (see e.g. (Li et al., 2017b; Mertikopoulos and Staudigl, 2018)), however their Lévy-driven analogue (11) and the discrete-time versions (Burghoff and Pavlyukevich, 2015) are relatively less studied. Under some assumptions on the objective , it is known that the process (11) admits a stationary density (Samorodnitsky and Grigoriu, 2003). For a general , an explicit formula for the equilibrium distribution is not known, however when the noise level is small enough, finer characterizations of the structure of the equilibrium density in dimension one is known. We next summarize known results in this area, which show that Lévy-driven dynamics spend more time in ‘wide valleys’ in the sense of (Chaudhari et al., 2017) when goes to zero.
Assume that is smooth with local minima separated by local maxima , i.e.
Furthermore, assume that the local minima and maxima are not degenerate, i.e. and for every . We also assume the objective gradient has a growth condition for some constant and when is large enough. Each local minima lies in the (interval) valley of (width) length . Consider also a -neighborhood around the local minimum with small enough so that the neighborhood is contained in the valley for every . We are interested in the first exit time from starting from a point and the transition time to a neighborhood of another local minimum, we will remove the dependency to
of the transition time in our discussions as it is clear from the context. The following result shows that the transition times are asymptotically exponentially distributed in the limit of small noise and scales likewith . [Pavlyukevich (2007)] For an initial point , in the limit , the following statements hold regarding the transition time:
If the SDE (11) would be driven by the Brownian motion instead, then an analogous theorem to Theorem 3.1 holds saying that the transition times are still exponentially distributed but the scaling needs to be replaced by where is the maximal depth of the basins to be traversed between the two local minima (Day, 1983; Bovier et al., 2005). This means that in the small noise limit, Brownian-motion driven gradient descent dynamics need exponential time to transit to another minimum whereas Levy-driven gradient descent dynamics need only polynomial time. We also note from Theorem 3.1 that the mean transition time between valleys for Lévy SDE does not depend on the depth of the valleys they reside in which is an advantage over Brownian motion driven SDE in the existence of deep valleys. Informally, this difference is due to the fact that Brownian motion driven SDE has to typically climb up a valley to exit it, whereas Lévy-driven SDE could jump out.
The following theorem says that as , up to a normalization in time, the process behaves like a finite state-space Markov process that has support over the set of local minima admitting a stationary density with an infinitesimal generator . The process jumps between the valleys , spending time proportional to probability amount of time in each valley in the equilibrium where the probabilities are given by the solution to the linear system . [Pavlyukevich (2007)] Let , for some . For , , as , in the sense of finite-dimensional distributions, where
is a continuous-time Markov chain on a state spacewith the infinitesimal generator with
This process admits a density satisfying .
A consequence of this theorem is that equilibrium probabilities are typically larger for “wide valleys”. To see this consider the special case illustrated in Figure 3 with local minima separated by a local maximum at . For this example, , and the second local minimum lies in a wider valley. A simple computation reveals
We see that , that is in the equilibrium the process spends more time on the wider valley. In particular, the ratio grows with an exponent when the ratio of the width of the valleys grows. Consequently, if the gradient noise is indeed -stable distributed, these results directly provide theoretical evidence for the wide-minima behavior of SGD assuming the loss landscape is not degenerate.
In addition to the transition time between the basins of attraction of two local minima, understanding how long it takes for the continuous-time process given by (10) to exit a neighborhood of a local minimum (given that it is started in that neighborhood) is also relevant. We formally define the first exit time of the stochastic process (10) as follows:
The following result characterizes the first exit time in dimension one. [Imkeller and Pavlyukevich (2006)] Consider the SDE (10) in dimension and assume that it has a unique strong solution. Assume further that the objective has a global minimum at zero, satisfying the conditions for every , , if and only if , and . Then, there exist positive constants , , , and such that for , the following holds:
uniformly for all initialization and , where . Consequently,
The case of . The exit behavior of the SDE (10) from an arbitrary domain in has also been studied in the literature. Imkeller et al. (2010a) generalizes Theorem 3.1 from dimension to arbitrary dimensions and showed that in the small noise limit the exit time from a domain is exponentially distributed with a parameter that depends on the tail-index . In case the components of the Lévy motion in (10) is replaced by a process that consists of the sum of finitely many one-dimensional Lévy processes with different tail-indices , it is also shown that the first exit time from a domain is determined by the smallest when the noise level is small enough.
3.2 Relating the discretization to the continuous-time process
While the metastability and first exit time results of Lévy-driven SDEs can be used as a proxy for analyzing SGD, approximating SGD as a continuous-time approach might not be accurate for any step-size , and some theoretical concerns have already been raised for the validity of such approximations (Yaida (2019)). Intuitively, one can expect that the metastable behavior of SGD would be similar to the behavior of its continuous-time limit only when the discretization step-size is small enough. Even though some theoretical results have been recently established for the discretizations of SDEs driven by Brownian motion (Tzen et al. (2018)), it is not clear how the discretized Lévy SDEs behave in terms of metastability.
In this section, we summarize the theoretical results that we proved in our recent work (Nguyen et al., 2019a). In particular, we will now present explicit conditions for the step-size such that the metastability behavior of the discrete-time system (20) is guaranteed to be close to its continuous-time limit (19). More precisely, we consider the following stochastic differential equation with both a Brownian term and a Lévy term, and its Euler discretization as follows (Duan (2015)):
with independent and identically distributed (i.i.d.) variables where is the identity matrix, the components of are i.i.d with distribution, and is the amplitude of the noise. This dynamics includes (6) and (10) as special cases. Here, is chosen as a scalar for convenience; however, we believe that this analysis can be extended to the case where is a function of .
Understanding the metastability behavior of SGD modeled by these dynamics requires understanding the first exit times for the continuous-time process given by (19) and its discretization (20). For this purpose, for any given local minimum of and , we define the following set
which is the set of points in , each at a distance of at most from the local minimum . Similar to (16), we will study the first exit times defined by
Note that in the special case , we recover introduced previously in (16).
Our result (Theorem 6) shows that with sufficiently small discretization step , the probability to exit a given neighborhood of the local optimum at a fixed time of the discretization process approximates that of the continuous process. This result also provides an explicit condition for the step-size, which explains certain impacts of the other parameters of the problem, such as dimension , noise amplitude , variance of Gaussian noise , towards the similarity of the discretization and continuous processes.
Let us now state the assumptions which will imply our result.
The SDE (19) admits a unique strong solution.
Consider the process , where denotes the whole process and the drift is defined as follows333It is easy to verify that for all (Raginsky et al., 2017).:
Here, denotes the indicator function, i.e. if and if . Then, the process satisfies:
The gradient of is -Hölder continuous: There exists a constant such that
The gradient of satisfies the following assumption: .
For some and , is -dissipative:
We note that, A1 has been a common assumption in stochastic analysis, e.g. (Imkeller and Pavlyukevich, 2006; Imkeller et al., 2010a; Liang and Wang, 2018) and A1 and A2 directly hold for bounded gradients. On the other hand, the assumptions A3-A5 are standard conditions, which are often considered in non-convex optimization algorithms that are based on discretization of diffusions (Raginsky et al., 2017; Xu et al., 2018; Erdogdu et al., 2018; Gao et al., 2018b, a).
The next assumption identifies an explicit condition for the step-size, which is required to make sure that the discrete process well-approximates the continuous one.
More explicit forms of the constants are provided in (Nguyen et al., 2019a). We then have the following theorem, associating the first exit times of the continuous and the discretized processes. [Nguyen et al. (2019a)] Under assumptions A1- A6, the following inequality holds:
Exit time versus problem parameters. In Theorem 6, if we let go to zero for any fixed, the constant will also go to zero, and since can be chosen arbitrarily small, this implies that the probability of the first exit time for the discrete process and the continuous process will approach each other when the step-size gets smaller, as expected. If instead, we decrease or , the quantity also decreases monotonically, but it does not go to zero due to the first term in the expression of .
Exit time versus width of local minima.2017)). The eigenvalues of the Hessian of the objective near local minima have also been studied in the literature (see e.g. (Sagun et al., 2016; Papyan, 2018)). If the Hessian around a local minimum is positive definite, the conditions for the multi-dimensional version of Theorem 3.1 in (Imkeller et al., 2010a)) are satisfied locally around a local minimum. For local minima lying in wider valleys, the parameter can be taken to be larger; in which case the expected exit time will be larger by the formula (18). In other words, the SDE (19) spends more time to exit wider valleys. Theorem 6 shows that SGD modeled by the discretization of this SDE will also inherit a similar behavior if the step-size satisfies the conditions we provide.
4 Convergence Analysis
The convergence of SGD iterates to a local minimum in the context of deep learning has been studied in the literature (see e.g. (Reddi et al., 2016; Wu et al., 2018b)). However, these results apply only when the gradient noise has a finite second moment. A natural question that arises is how fast SGD iterates converge to a local minimum when the noise has tail index . In this case, the second moment exists only when . For , the second moment does not exist but the -th moment exist for any . Recall that the stochastic gradient iterations with constant stepsize are of the form
where is the stepsize, is the stochastic gradient and is the initialization. Consider the random gradient noise at step . Following the literature, we assume that is measurable with respect to an increasing family of Borel fields defined on a probability space . We also assume that the noise is unbiased and the stochastic gradients have a finite -th moment, i.e.
for some and .
When (which requires ), we recover the standard setting studied in the literature (see e.g. (Reddi et al., 2016; Polyak and Juditsky, 1992)). Here we consider to account for potential heavy-tailedness in the gradient noise.
Under this assumption, we have the following convergence result for SGD. The proof is given in the appendix and is inspired by the proof technique of (Reddi et al., 2016) for the and case and extends it to the heavy-tailed case when and . Assume A 3-A 7 hold with and the objective is bounded below admitting a minimum . Consider the SGD iterations (24) with initial point and constant stepsize . Then, we have
where the constant is defined by A 3. In particular, if we let for some constant , then
Furthermore, if we choose , then
where . We note that there is no clear ‘best choice’ for the learning rate in deep learning practice, various decaying stepsize rules including the choices of and are proposed and are commonly used (see. e.g. (Reddi et al., 2016),(Wu et al., 2018b, Section 1)). In fact, when gradients have finite variance, the choice of will lead to the fastest decay (with respect to ) of the right-hand side of (26) at a rate . However, when the gradients are heavy-tailed, Theorem 7 shows that this choice will lead to a slower rate of . Instead, if we choose the stepsize , the size of the gradients will shrink at a faster rate . In particular, Theorem 7 indicates that if the tail index is smaller, the upper bound for gets smaller and the convergence gets slower. We note that some observations supporting our result have already been reported in the literature, e.g., in (Ge et al., 2018) it has been empirically shown that the decay rate performed better that in a deep learning setup, which we believe is a consequence of the heavy-tailed nature of the problem.
Convergence to global optima. Besides convergence to local minima, we have also provided finite-time guarantees for discrete-time dynamics (9) in terms of suboptimality with respect to the global minimum in (Nguyen et al., 2019b, Section 1.1) as a function of the stepsize and the scale parameter. In particular, it was shown that the heavy-tailed system has a worse dependency on both and as compared to the Gaussian case, which is in line with Theorem 7. Besides, it is known that if the scale parameter gets smaller, the dynamics admits a stationary distribution that will concentrate more and more on the global minimizer although reaching out to stationary would require an exponential number of steps in the dimension in the worst case.
5 Experimental Methodology
Before presenting our numerical results, we describe our experimental methodology regarding how we estimate the heavy-tailedness of the stochastic gradients. First, we discuss how we can compute the tail-index based on a recent estimator proposed in Mohammadi et al. (2015). Second, we describe the procedure proposed in (Brcich et al., 2005) for testing whether stochastic gradients follow a symmetric -stable distribution.
5.1 Tail index estimation
Estimating the tail-index of an extreme-value distribution is a long-standing topic. Some of the well-known estimators for this task are (Hill, 1975; Pickands, 1975; Dekkers et al., 1989; De Haan and Peng, 1998). Despite their popularity, these methods are not specifically developed for -stable distributions and it has been shown that they might fail for estimating the tail-index for -stable distributions (Mittnik and Rachev, 1996; Paulauskas and Vaičiulis, 2011).
In this section, we use a relatively recent estimator proposed in (Mohammadi et al., 2015) for -stable distributions. It is given in the following theorem. [Mohammadi et al. (2015)] Let be a collection of random variables with and . Define for . Then, the estimator
converges to almost surely, as . As shown in Theorem 2.3 of (Mohammadi et al., 2015), this estimator admits a faster convergence rate and smaller asymptotic variance than all the aforementioned methods.
In order to verify the accuracy of this estimator, we conduct a preliminary experiment, where we first generate many distributed random variables with , for different values of . Then, we estimate by using . We repeat this experiment times for each . As shown in Figure 4, the estimator is very accurate for a large range of . Due to its favorable theoretical properties such as independence of the scale parameter , combined with its empirical stability, we choose this estimator in our experiments.
In order to estimate the tail-index at iteration , we first partition the set of data points into many disjoint sets of size , such that the union of these subsets give all the data points. Formally, for all , , , and for . This approach is similar to sampling without replacement. We then compute the full gradient and the stochastic gradients for each minibatch . We finally compute the stochastic gradient noises , vectorize each and concatenate them to obtain a single vector, and compute the reciprocal of (28). In this case, we have and we set to the divisor of that is the closest to .
5.2 Stability test
Besides estimating the tail-index of a random process, it is also important to verify whether the process is symmetric -stable. In this section, we describe a procedure (Brcich et al. (2005)) for obtaining a confidence level for the stability of a random process, based on the following property: [Brcich et al. (2005)] A necessary and sufficient condition for a random variable to have an distribution is
where and and are independent copies of . Here we adopt the stability test presented in (Brcich et al., 2005). To obtain a statistical test from (29), we first separate the observations into three equal-size subsets , and , which are considered as independent copies of the observations. We then assign the first subset to the right side of (29) and estimate the tail index of this subset using the idea of the previous section. For the left side of (29), we sum and term by term, and estimate of the resulting sum. Similarly, by separating the observations into four equal-size subsets , , and , then repeating these above steps, we get from and from , for a statistical test of (30). In the end, the process is considered to be -stable if the tail indices estimated from the left and the right sides of (29) (as well as of (30)) are relatively close to each other, i.e. if the ‘condition number’ is smaller than some threshold.
We investigate the tail behavior of the stochastic gradient noise in a variety of scenarios. We first consider a fully-connected network (FCN) on the MNIST and CIFAR10 datasets. For this model, we vary the depth (i.e. the total number of layers) in the set
, the width (i.e. the number of neurons per hidden layer) in the set, and the minibatch size ranging from to full batch.
We then consider a convolutional neural network (CNN) architecture (AlexNet) on the CIFAR10 and CIFAR100 datasets. We scale the number of filters in each convolutional layer in range . We use the existing random split of the MNIST dataset into train and test parts of sizes K and K, and CIFAR10 and CIFAR100 datasets into train and test parts of sizes K and K, respectively. The order of the total number of parameters range from several thousands to tens of millions.
For both FCN and CNN, we run each configuration with the negative-log-likelihood (i.e. cross entropy) and with the linear hinge loss, and we repeat each experiment with three different random seeds (see (Geiger et al., 2018) for details on the choice of the hinge loss). The training algorithm is SGD with no explicit modification such as momentum or weight decay. The training runs for a fixed number of iterations unless it hits 100% training accuracy first. At every
th iteration, we log the full training and test accuracies, and the tail estimate of the gradients that are sampled using the corresponding mini-batch size. The codebase is implemented in python using pytorch444The codebase can be found at https://github.com/umutsimsekli/sgd_tail_index..
Below, we present the most relevant and representative results. We have observed that, in all configurations, the three different initializations yielded no significant difference. Therefore, the effects of the randomness in initialization (under a given scheme) do not appear to affect the gradient noise. Similarly, the choice of the loss function do not yield different behaviours in terms of the tail index. Even though the heavy tailed nature remains the same, the choice of the loss function results in a different way of dependence to the hyperparameters of the system, which we discuss in Section6.5 and leave the investigation to a further study.
6.1 Stability test results
We first start by investigating the stability of the stochastic gradient noises under the datasets that we use for our estimation experiments. We will first focus on the later iterations of SGD, where the tail-index becomes stationary. Using an FCN on the MNIST and CIFAR10 datasets, we estimate the condition number (as described in Section 5.2) at every th iteration of the training stage, then take its average over the last K iterations to get the final result. Here, we consider to be an acceptable level for the test since it is a quite small number with respect to estimated in our experiments.
The results using the MNIST dataset are illustrated by Figure 5(a), in which layer index at corresponds to the whole network while the indices represent the hidden layers of the network. Our experiments show that the condition number for the whole network are always smaller than the threshold , which means the gradient noise of the network satisfies our required stability criterion, even when we change the number of layers (depths) and the number of neurons per layer (widths). The same conclusion on the stability test is true when we investigate each of the hidden layers of the network.
Figure 5(b) shows the results of the stability test for CIFAR10. As can be seen from the figure, the condition of the network fails to be smaller than our required criterion in some cases. However, the gap from this number to the criterion is quite small that we can consider that it does not violate the -stable assumption on the gradient noise of the network. Unlike the MNIST dataset, we observe that for the networks with neurons per layer, even though the overall gradient noise strongly exhibits an -stable behavior, some of the hidden layers are very far from being -stable, suggesting that the characteristics of the first layer is dominating the overall structure. In contrast, the gradient noise with respect to the parameters of the hidden layers becomes more -stable with a very high number () of neurons per layer.
By these experiments, we observe that the structure of the dataset has a strong impact on the statistical properties of the gradient noise, especially for the layers with smaller number of parameters. When this number of parameters is large (which is usually the case in practice), the gradient noise corresponding to these parameters becomes more -stable. In short, this means increasing the size of the network (the number of the network parameters) tends to make the gradient noise behave similarly to an -stable noise.
6.2 Effect of varying network size
We measure the tail-index for varying the widths and depths for the FCN, and varying widths (i.e. the number of filters) for the CNN. For very small sizes, the networks perform poorly; therefore, we only illustrate sufficiently large network sizes, which yield similar accuracies. For these experiments, we compute the average of the tail-index measurements for the last K iterations (i.e. when becomes stationary) to focus on the late stage dynamics.
Figure 6 shows the results for the FCN. The first striking observation is that in all the cases, the estimated tail-index is far from , meaning that the distribution of the gradient noise is highly non-Gaussian. For the MNIST dataset, we observe that systematically decreases for increasing network size, where this behavior becomes more prominent with the depth. This result shows that, for MNIST, increasing the dimension of the network results in a gradient noise with heavier tails and therefore increases the probability of ending up in a wider basin. For the CIFAR10 dataset, we still observe that is far from ; however, in this case, increasing the network size does not have a clear effect on . In all cases, we observe that is in the range –.
In Figure 7, we plot estimated for each layer of FCNs, using MNIST dataset where the minibatch is of size . The resulting is obtained by averaging over the last K iterations. The layer index ‘’ corresponds to the estimated of the whole network. In this experiment, we observe that becomes smaller (heavier-tailed) for the deeper layers. In addition, the value of the tail-index for the whole network has a strong connection with the first layers: the for the whole network is closer to that of the first layers than of the last layers.
Figure 8 shows the results for the CNN. In this figure, we also depict the train and test accuracy, as well as the tail-index that is estimated on the test set. These results show that, for both CIFAR10 and CIFAR100, the tail-index is extremely low for the under-parametrized regime (e.g. the case when the width is , , or for CIFAR10). As we increase the size of the network the value of increases until the network performs reasonably well and stabilizes in the range –. We also observe that behaves similarly for both train and test sets555We observed a similar behavior in under-parametrized FCN; however, did not plot those results to avoid clutter..
These results show that there is strong interplay between the network architecture, dataset, and the algorithm dynamics: (i) we see that the size of the network can strongly influence , (ii) for the exact same network architecture, the choice of the dataset has a significant impact on not only the landscape of the problem, but also the noise characteristics, hence on the algorithm dynamics.
6.3 Effect of the minibatch size
In our second set of experiments, we investigate the effect of the size of the minibatch on . We focus on the FCN and monitor the behavior of for different network and minibatch sizes . Figure 9 illustrates the results. This result contradicts with the Gaussian assumption of GN for large , as the tail-index does not increase at all with the increasing batch size. We observe that stays almost the same when the depth is and it moves in a small interval when the depth is set to . We note that we obtained the same train and test accuracies for different minibatch sizes.
6.4 Tail behavior throughout the iterations
So far, we have focused on the late stages of SGD, where is in a rather stationary regime. In this set of experiments, we shift our focus on the first iterations and report an intriguing behavior that we observed in almost all our experiments. As a representative, in Figure 10, we show the temporal evolution of SGD for the FCN with layers and neurons/layer.
The results clearly show that there are two distinct phases of SGD (in this configuration before and after iteration ). In the first phase, the loss decreases very slowly, the accuracy slightly increases, and more interestingly rapidly decreases. When reaches its lowest level, the process possesses a jump, which causes a sudden decrease in the accuracy. After this point the process recovers again and we see a stationary behavior in and an increasing behavior in the accuracy.
We also investigate this behavior for each layer of an FCN with depth and width in Figure 11. The estimated tail-index for each layer has a clear phase change at earlier iterations, where we observe that this jump is more prominent in the deeper layers where the tail-index is smaller. On the other hand, unlike the whole network, the tail-index of each layer undergoes a fluctuation period before becoming stationary at the last iterations. However, this observation might be due to the measurement error since the size of the sample that is used in the estimator (28) gets smaller when we make layer-wise measurements.
The fact that the process has a jump when is at its smallest value provides a strong support to our assumptions and the metastability theory that we discussed in the previous section. Furthermore, these results also strengthen the view that SGD crosses barriers at the very initial phase and continues searching until it reaches a “wide and flat enough” region of a local optimum. On the other hand, our current analysis is not able to determine whether the process jumps in a different basin or a ‘better’ part of the same basin and we leave it as a future work.
6.5 A note on generalization
In this section, we investigate the connection between the tail-index and the generalization performance. In particular, we consider the relation between and the ratio of the step-size to the batch-size which is proportional to the noise scale of SGD when there is no momentum (Park et al., 2019). It has been empirically demonstrated that this ratio correlates with performance of the model (Jastrzebski et al., 2017), hence the higher the noise scale, the better the generalization performance until a certain level. Clearly, when the noise is too high, training may diverge, however, proper level of noise leads to better solutions.
In this section, we will investigate how the tail index of the gradient noise is affected for different noise scales. We reproduce and follow the initialization convention and the hyper-parameter scale that is studied in (Park et al., 2019, Appendix G): A fully connected model with 3 hidden layers, each hidden layer has 512 nodes. Weights are initialized , bias terms are set to zero at the initial point. Each layer, is then passed through ReLU non-linearity, and multiplied by the inverse of the width of the previous layer. As usual, the network is trained with SGD without momentum; the dataset is the standard MNIST. Minibatch size ranges in the set and step-size ranged from the set 666Note that this particular scaling is introduced in Jacot-Guillarmod et al. (2018) and it admits slightly larger values of learning rates compared to standard initialization schemes.
Figure 12(a) and Figure 12(b) visualize the results. The estimated and the test error are averaged among the candidates with the same , at the last iteration of the training process. We ignore the particular values of test error but rather focus on the way certain choices affect the trends in the system and the behaviour of SGD dynamics.
In both choices of loss functions, hinge and NLL, the behaviour of the test error with respect to the noise scale is consistent with previous observations. Similarly, in both cases, the estimated remains within a narrow band of 1, indicating the heavy tail behaviour. However, the trends in estimated are different depending on the choice of the loss. Therefore, we cannot attribute the improvement in performance to lower when increasing the noise scale. To better emphasize this point, we plot the correlation of estimated and test error in Figure 13 where the positive and negative correlations are clearly visible depending on the choice of the loss function. This contrasting behaviour is another hint that there exists a connection between and the test performance (since they are correlated in both cases) and suggests us to examine this connection in order to understand when exactly the dynamics falls into basins with better performance.
7 Conclusion and Open Problems
We investigated the tail behavior of the gradient noise in deep neural networks and empirically showed that the gradient noise is highly non-Gaussian. This outcome enabled us to analyze SGD as an SDE driven by a Lévy motion and establish a bridge between SGD and existing theoretical results, which provides more insights on the behavior of SGD, especially in terms of choosing wide minima. We also proved a new result on the local convergence of SGD, which provide