Emergent properties of the local geometry of neural loss landscapes

10/14/2019, by Stanislav Fort, et al.

The local geometry of high dimensional neural network loss landscapes can both challenge our cherished theoretical intuitions and dramatically impact the practical success of neural network training. Indeed, recent works have observed 4 striking local properties of neural loss landscapes on classification tasks: (1) the landscape exhibits exactly C directions of high positive curvature, where C is the number of classes; (2) gradient directions are largely confined to this extremely low dimensional subspace of positive Hessian curvature, leaving the vast majority of directions in weight space unexplored; (3) gradient descent transiently explores intermediate regions of higher positive curvature before eventually finding flatter minima; (4) training can be successful even when confined to low dimensional random affine hyperplanes, as long as these hyperplanes intersect a Goldilocks zone of higher than average curvature. We develop a simple theoretical model of gradients and Hessians, justified by numerical experiments on architectures and datasets used in practice, that simultaneously accounts for all 4 of these surprising and seemingly unrelated properties. Our unified model provides conceptual insights into the emergence of these properties and makes connections with diverse topics in neural networks, random matrix theory, and spin glasses, including the neural tangent kernel, BBP phase transitions, and Derrida's random energy model.



1 Introduction

The geometry of neural network loss landscapes, and the implications of this geometry for both optimization and generalization, have been subjects of intense interest in many works, ranging from studies on the lack of local minima at significantly higher loss than that of the global minimum [1, 2] to studies debating relations between the curvature of local minima and their generalization properties [3, 4, 5, 6]. Fundamentally, the neural network loss landscape is a scalar loss function over a very high dimensional parameter space that could depend, a priori, in highly nontrivial ways on the very structure of real-world data itself, as well as on intricate properties of the neural network architecture. Moreover, the regions of this loss landscape explored by gradient descent could themselves have highly atypical geometric properties relative to randomly chosen points in the landscape. Thus understanding the shape of loss functions over high dimensional spaces with potentially intricate dependencies on both data and architecture, as well as biases in regions explored by gradient descent, remains a significant challenge in deep learning. Indeed, many recent studies explore extremely intriguing properties of the local geometry of these loss landscapes, as characterized by the gradient and Hessian of the loss landscape, both at minima found by gradient descent, and along the journey to these minima.

In this work we focus on providing a simple, unified explanation of 4 seemingly unrelated yet highly intriguing local properties of the loss landscape on classification tasks that have appeared in the recent literature:

(1) The Hessian eigenspectrum is composed of a bulk plus $C$ outlier eigenvalues, where $C$ is the number of classes.

Recent works have observed this phenomenon in small networks [7, 8, 9], as well as large networks [10, 11]. This implies that locally the loss landscape has only $C$ highly curved directions, while it is much flatter in the vastly larger number of remaining directions in weight space.

(2) Gradient aligns with this tiny Hessian subspace.

Recent work [12] demonstrated that the gradient over training time lies primarily in the subspace spanned by the eigenvectors of the Hessian with the top few largest eigenvalues (a number equal to the number of classes $C$). This implies that most of the descent direction lies along extremely low dimensional subspaces of high local positive curvature; exploration in the vastly larger number of available directions in parameter space over training utilizes only a small portion of the gradient.

(3) The maximal Hessian eigenvalue grows, peaks and then declines during training.

Given widespread interest in arriving at flat minima (e.g. [6]) due to their presumed superior generalization properties, it is interesting to understand how the local geometry, and especially curvature, of the loss landscape varies over training time. Interestingly, a recent study [13] found that the spectral norm of the Hessian, as measured by the top Hessian eigenvalue, displays a non-monotonic dependence on training time. This non-monotonicity implies that gradient descent trajectories tend to enter higher positive curvature regions of the loss landscape before eventually finding the desired flatter regions. Related effects were observed in [14, 15].

(4) Initializing in a Goldilocks zone of higher convexity enables training in very low dimensional weight subspaces.

Recent work [16] showed, surprisingly, that one need not train all parameters independently to achieve good training and test error; instead one can choose to train only within a random low dimensional affine hyperplane of parameters. Indeed, the dimension of this hyperplane can be far less than the total number of parameters $D$. More recent work [17] explored how the success of training depends on the position of this hyperplane in parameter space. This work found a Goldilocks zone as a function of the initial weight variance, such that the intersection of the hyperplane with this Goldilocks zone correlated with training success. Furthermore, this Goldilocks zone was characterized as a region of higher than usual positive curvature, as measured by the Hessian statistic $\mathrm{Tr}(H)/\|H\|$, where $\|H\|$ denotes the Frobenius norm. This statistic takes larger positive values when typical randomly chosen directions in parameter space exhibit more positive curvature [17, 18]. Thus, overall, the ease of optimizing over low dimensional hyperplanes correlates with intersections of this hyperplane with regions of higher positive curvature.

Taken together, these somewhat surprising and seemingly unrelated local geometric properties fundamentally challenge our conceptual understanding of the shape of neural network loss landscapes. It is a priori unclear how these properties may emerge naturally from the very structure of real-world data, complex neural architectures, and potentially biased explorations of the loss landscape through the dynamics of gradient descent starting at a random initialization. Moreover, it is unclear what specific aspects of data, architecture and descent trajectory are important for generating these properties, and what myriad aspects are not relevant.

Our main contribution is to provide an extremely simple, unified model that simultaneously accounts for all 4 of these local geometric properties. Our model yields conceptual insights into why these properties might arise quite generically in neural networks solving classification tasks, and makes connections to diverse topics in neural networks, random matrix theory, and spin glasses, including the neural tangent kernel [19, 20], BBP phase transitions [21, 22], and the random energy model [23].

The outline of this paper is as follows. We set up the basic notation and the questions we ask about gradients and Hessians in detail in Sec. 2. In Sec. 3 we introduce a sequence of simplifying assumptions about the structure of gradients, Hessians, logit gradients and logit curvatures that enable us to obtain, in the end, an extremely simple random model of Hessians and gradients and how they evolve both over training time and weight scale. We then demonstrate in Sec. 4 that all 4 striking properties of the local geometry of the loss landscape emerge naturally from our simple random model. Finally, in Sec. 5 we give direct evidence that the simplifying theoretical assumptions leading to our random model in Sec. 3 are indeed valid in practice, by performing numerical experiments on realistic architectures and datasets.

2 Overall framework

Here we describe the local shape of neural loss landscapes, as quantified by their gradient and Hessian, and formulate the main problem we aim to solve: conceptually understanding how the 4 striking properties described above emerge from these two fundamental objects.

2.1 Notation and general setup

We consider a classification task with $C$ classes. Let $\mathcal{D} = \{x^n, y^n\}_{n=1}^{N}$ denote a dataset of $N$ input-output vector pairs, where the outputs $y^n$ are one-hot vectors, with all components equal to $0$ except a single component $y^n_k = 1$ if and only if $k$ is the correct class label for input $x^n$. We assume a neural network transforms each input $x$ into a logit vector $\mathbf{z}$ through the function $\mathbf{z}(x, \theta)$, where $\theta$ denotes a $D$-dimensional vector of trainable neural network parameters. We aim to obtain the aforementioned 4 local properties of the loss landscape as a consequence of a set of simple properties, and therefore we do not assume the function $\mathbf{z}(x, \theta)$ corresponds to any particular architecture such as a ResNet [24], deep convolutional neural network [25], or a fully-connected neural network. We do assume, though, that the predicted class probabilities $p_k$ are obtained from the logits $z_k$ via the softmax function as

$$p_k = \frac{e^{z_k}}{\sum_{l=1}^{C} e^{z_l}}. \qquad (1)$$

We assume network training proceeds by minimizing the widely used cross-entropy loss, which on a particular input-output pair $(x, y)$ is given by

$$\mathcal{L}(x, y, \theta) = -\sum_{k=1}^{C} y_k \log p_k(x, \theta). \qquad (2)$$

The average loss over the dataset then yields a loss landscape over the trainable parameter vector $\theta$:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(x^n, y^n, \theta). \qquad (3)$$
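As a concrete sketch of the setup in Eqs. (1)-(3), the softmax and per-example cross-entropy loss can be computed as follows (a minimal NumPy illustration with hypothetical logit values of our own choosing, not code from the paper):

```python
import numpy as np

def softmax(z):
    """Eq. (1): p_k = exp(z_k) / sum_l exp(z_l)."""
    z = z - np.max(z)          # shift for numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    """Eq. (2): loss on one example, -sum_k y_k log p_k."""
    return -np.sum(y * np.log(softmax(z)))

# Hypothetical example: C = 3 classes, correct class 0 as a one-hot target.
z = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])
loss = cross_entropy(z, y)     # the dataset loss of Eq. (3) averages this over examples
```

Since $y$ is one-hot, the loss reduces to $-\log p_k$ for the correct class $k$.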

2.2 The gradient and the Hessian

In this work we are interested in two fundamental objects that characterize the local shape of the loss landscape $\mathcal{L}(\theta)$, namely its slope, or gradient $\mathbf{g}$, with components given by

$$g_i = \frac{\partial \mathcal{L}}{\partial \theta_i}, \qquad (4)$$

and its local curvature, defined by the Hessian matrix $H$, with matrix elements given by

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}. \qquad (5)$$

Here, $\theta_i$ is the $i$'th trainable parameter, or weight, with $i = 1, \dots, D$. Both the gradient and Hessian can vary over weight space $\theta$, and therefore over training time, in nontrivial ways.

In general, the loss in (2) depends on the logit vector $\mathbf{z}$, which in turn depends on the weights as $\mathbf{z}(x, \theta)$. We can thus obtain explicit expressions for the gradient and Hessian with respect to weights in (4) and (5) by first computing the gradient and Hessian with respect to the logits, and then applying the chain rule. Due to the particular form of the softmax function in (1) and cross-entropy loss in (2), the gradient of the loss with respect to the logits is

$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k, \qquad (6)$$

and the Hessian of the loss with respect to logits is

$$\frac{\partial^2 \mathcal{L}}{\partial z_k \partial z_l} = p_k \delta_{kl} - p_k p_l. \qquad (7)$$

Then applying the chain rule yields the gradient of $\mathcal{L}(\theta)$ w.r.t. the weights as

$$g_i = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{C} \left( p^n_k - y^n_k \right) \frac{\partial z^n_k}{\partial \theta_i}. \qquad (8)$$

The chain rule also yields the Hessian in (5):

$$H_{ij} = \frac{1}{N} \sum_{n=1}^{N} \left[ \sum_{k,l=1}^{C} \left( p^n_k \delta_{kl} - p^n_k p^n_l \right) \frac{\partial z^n_k}{\partial \theta_i} \frac{\partial z^n_l}{\partial \theta_j} + \sum_{k=1}^{C} \left( p^n_k - y^n_k \right) \frac{\partial^2 z^n_k}{\partial \theta_i \partial \theta_j} \right]. \qquad (9)$$

The Hessian consists of a sum of two terms: the first, involving only logit gradients, and the second, involving logit curvatures, have been previously referred to as the G-term and H-term respectively [10], and we adopt this nomenclature here.
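To make the logit-space quantities concrete, the following sketch (our own illustration, not the authors' code) evaluates the logit gradient of Eq. (6), the logit Hessian of Eq. (7), and the G-term contribution of a single example, given a hypothetical Jacobian $J$ of logits with respect to weights:

```python
import numpy as np

def logit_grad(p, y):
    """Eq. (6): gradient of the cross-entropy loss w.r.t. the logits is p - y."""
    return p - y

def logit_hessian(p):
    """Eq. (7): Hessian of the loss w.r.t. the logits is diag(p) - p p^T."""
    return np.diag(p) - np.outer(p, p)

def g_term(p, J):
    """G-term of Eq. (9) for one example: J^T (diag(p) - p p^T) J,
    where J[k, i] = dz_k / dtheta_i is the logit-gradient (Jacobian) matrix."""
    return J.T @ logit_hessian(p) @ J

# Small worked example with hypothetical numbers: 3 classes, 4 weights.
p = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(0)
J = rng.standard_normal((3, 4))
G = g_term(p, J)
```

The omitted H-term would add $\sum_k (p_k - y_k)$ times the Hessian of $z_k$, which vanishes identically whenever the logits are linear in the weights, as in the NTK regime discussed in Sec. 3.1.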

The basic equations (6), (7), (8) and (9) constitute the starting point of our analysis. They describe the explicit dependence of the gradient and Hessian on a host of quantities: the correct class labels $y^n_k$, the predicted class probabilities $p^n_k$, the logit gradients $\partial z^n_k / \partial \theta_i$, and the logit curvatures $\partial^2 z^n_k / \partial \theta_i \partial \theta_j$. It is conceptually unclear how the striking properties of the local shape of neural network loss landscapes described in Sec. 1 all emerge naturally from the explicit expressions in equations (6), (7), (8) and (9), and moreover, which specific properties of class labels, probabilities and logits play a key role in their emergence.

3 Analysis of the gradient and Hessian

In the following subsections, through a sequence of approximations, motivated both by theoretical considerations and empirical observations, we isolate three key features that are sufficient to explain the 4 striking properties: (1) weak logit curvature, (2) clustering of logit gradients, and (3) freezing of class probabilities, both over training time and weight scale. We discuss each of these features in the next three subsections.

3.1 Weakness of logit curvature

We first present a combined set of empirical and theoretical considerations which suggest that the G-term dominates the H-term in (9) in determining the structure of the top eigenvalues and eigenvectors of neural network Hessians. First, empirical studies [11, 26, 8] demonstrate that large neural networks trained on real data with gradient descent have Hessian eigenspectra consisting of a continuous bulk eigenvalue distribution plus a small number of large outlier eigenvalues. Moreover, some of these studies have shown that the spectrum of the H-term alone is similar to the bulk spectrum of the total Hessian, while the spectrum of the G-term alone is similar to the outlier eigenvalues of the total Hessian.

This bulk plus outlier spectral structure is extremely well understood in a wide array of simpler random matrix models [21, 22]. Without delving into mathematical details, a common observation underlying these models is that if $M = A + B$, where $A$ is a low rank matrix with a small number of large nonzero eigenvalues, while $B$ is a full rank random matrix with a bulk eigenvalue spectrum, then as long as the eigenvalues of $A$ are large relative to the scale of the eigenvalues of $B$, the spectrum of $M$ will have a bulk plus outlier structure. In particular, the bulk spectrum of $M$ will look similar to that of $B$, while the outlier eigenvalues of $M$ will be close to the eigenvalues of $A$. However, as the scale of $B$ increases, the bulk of $M$ will expand to swallow the outlier eigenvalues of $A$. An early example of this sudden loss of outlier eigenvalues is known as the BBP phase transition [21].
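This bulk-plus-outlier mechanism is easy to reproduce numerically. The sketch below (an illustration under assumed scales, not taken from the paper) plants a rank-1 matrix with a large eigenvalue inside a Wigner-like random symmetric matrix and recovers an outlier near the planted value:

```python
import numpy as np

# Numerical illustration of the bulk + outlier structure: a low-rank planted
# matrix added to a random symmetric "noise" matrix.
rng = np.random.default_rng(0)
Dim = 400
u = rng.standard_normal(Dim)
u /= np.linalg.norm(u)
theta = 5.0                        # planted eigenvalue of the rank-1 part
A = theta * np.outer(u, u)         # low rank, one large eigenvalue

G = rng.standard_normal((Dim, Dim))
B = (G + G.T) / np.sqrt(2 * Dim)   # Wigner-like bulk, spectrum roughly in [-2, 2]

eigs = np.linalg.eigvalsh(A + B)
# The top eigenvalue sits near theta, well outside the bulk edge near 2;
# shrinking theta toward the bulk scale makes the outlier disappear (BBP).
```

Shrinking `theta` below the bulk edge in this sketch illustrates the sudden absorption of the outlier described above.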

In this analogy, the low rank matrix plays the role of the G-term, while the full rank random matrix plays the role of the H-term in (9). However, what plausible training limits might diminish the scale of the H-term compared to the G-term to ensure the existence of outliers in the full Hessian? Indeed, recent work exploring the Neural Tangent Kernel [19, 20] assumes that the logits depend only linearly on the weights $\theta$, which implies that the logit curvatures $\partial^2 z^n_k / \partial \theta_i \partial \theta_j$, and therefore the H-term, are identically zero. More generally, if these logit curvatures are weak over the course of training (which one might expect if the NTK training regime is similar to training regimes used in practice), then one would expect, based on analogies to simpler random matrix models, that the outliers of the Hessian in (9) are well modelled by the G-term alone, as empirically observed previously [10].

Based on these combined empirical and theoretical arguments, we model the Hessian using the G-term only:

$$H_{ij} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{k,l=1}^{C} \left( p^n_k \delta_{kl} - p^n_k p^n_l \right) \frac{\partial z^n_k}{\partial \theta_i} \frac{\partial z^n_l}{\partial \theta_j}, \qquad (10)$$

where we used (7) for the Hessian w.r.t. logits.

Figure 1: Experimental results on clustering of logit gradients for different datasets, architectures, non-linearities, and stages of training. The green bars correspond to $c_{\mathrm{SLSC}}$ in Eq. (11), the red bars to $c_{\mathrm{SL}}$ in Eq. (12), and the blue bars to $c_{\mathrm{DL}}$ in Eq. (13). In general, the gradients with respect to weights of the logits cluster well regardless of the class of the datapoint at which they were evaluated. For datapoints of true class $k$, the gradients of logit $k$ cluster slightly better, while gradients of two different logits are nearly orthogonal. This is visualized in Fig. 2.

3.2 Clustering of logit gradients

We next examine the logit gradients $\partial z^n_k / \partial \theta$, which play a prominent role in both the Hessian (after neglecting logit curvature) in (10) and the gradient in (8). Previous work [27] noted that gradients of the loss cluster based on the correct class memberships of input examples. While the loss gradients are not exactly the logit gradients, they are composed of them. Based on our own numerical experiments, we found strong logit gradient clustering on a range of networks, architectures, non-linearities, and datasets, as demonstrated in Figure 1 and discussed in detail in Sec. 5.1. In particular, we examined three measures of logit gradient similarity. First, consider

$$c_{\mathrm{SLSC}} = \frac{1}{C} \sum_{k=1}^{C} \frac{1}{N_k^2} \sum_{n, n' : \, y^n = y^{n'} = k} \cos\!\left( \frac{\partial z^n_k}{\partial \theta}, \frac{\partial z^{n'}_k}{\partial \theta} \right). \qquad (11)$$

Here SLSC is short for Same-Logit-Same-Class, and $c_{\mathrm{SLSC}}$ measures the average cosine similarity over all pairs of logit gradients corresponding to the same logit component $k$, and all pairs of examples $x^n$ and $x^{n'}$ belonging to the same desired class label $k$. $N_k$ denotes the number of examples with correct class label $k$. Alternatively, one could consider

$$c_{\mathrm{SL}} = \frac{1}{C} \sum_{k=1}^{C} \frac{1}{N^2} \sum_{n, n'} \cos\!\left( \frac{\partial z^n_k}{\partial \theta}, \frac{\partial z^{n'}_k}{\partial \theta} \right). \qquad (12)$$

Here SL is short for Same-Logit, and $c_{\mathrm{SL}}$ measures the average cosine similarity over all pairs of logit gradients corresponding to the same logit component $k$, and all pairs of examples $x^n$, $x^{n'}$, regardless of whether the correct class label of examples $x^n$ and $x^{n'}$ is also $k$. Thus $c_{\mathrm{SLSC}}$ averages over a more restricted set of example pairs than does $c_{\mathrm{SL}}$. Finally, consider the null control

$$c_{\mathrm{DL}} = \frac{1}{C(C-1)} \sum_{k \neq l} \frac{1}{N^2} \sum_{n, n'} \cos\!\left( \frac{\partial z^n_k}{\partial \theta}, \frac{\partial z^{n'}_l}{\partial \theta} \right). \qquad (13)$$

Here DL is short for Different-Logits, and $c_{\mathrm{DL}}$ measures the average cosine similarity over all pairs of different logit components $k \neq l$ and all pairs of examples $x^n$, $x^{n'}$.
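A minimal estimator of these three statistics, under the simplifying assumption that all example pairs (including an example paired with itself) are averaged, might look like the following sketch; the array shapes and synthetic data are our own illustrative choices, not the paper's experiments:

```python
import numpy as np

def clustering_stats(G, labels):
    """Estimate c_SLSC, c_SL, c_DL of Eqs. (11)-(13).
    G: array of shape (N, C, D), G[n, k] = gradient of logit k w.r.t. the D
    weights at example n. labels: length-N array of correct class indices.
    For simplicity all pairs are averaged, including an example with itself."""
    N, C, D = G.shape
    U = G / np.linalg.norm(G, axis=2, keepdims=True)          # unit-normalized gradients
    cos = np.einsum('nkd,mld->nkml', U, U)                    # cos[n, k, m, l]
    same = np.stack([cos[:, k, :, k] for k in range(C)])      # (C, N, N), same logit
    c_sl = same.mean()
    c_slsc = np.mean([same[k][np.ix_(labels == k, labels == k)].mean()
                      for k in range(C)])
    diff = cos.transpose(1, 3, 0, 2)[~np.eye(C, dtype=bool)]  # pairs with k != l
    c_dl = diff.mean()
    return c_slsc, c_sl, c_dl

# Synthetic check: per-logit mean direction plus noise, with smaller noise for
# examples whose true class matches the logit (all scales are arbitrary choices).
rng = np.random.default_rng(0)
N, C, D = 60, 4, 200
labels = rng.integers(0, C, size=N)
M = rng.standard_normal((C, D))         # near-orthogonal mean directions at large D
G = M[None, :, :] + 0.6 * rng.standard_normal((N, C, D))
G[np.arange(N), labels] = M[labels] + 0.3 * rng.standard_normal((N, D))
c_slsc, c_sl, c_dl = clustering_stats(G, labels)
```

On such synthetic data the estimator recovers the ordering $c_{\mathrm{SLSC}} > c_{\mathrm{SL}} \gg c_{\mathrm{DL}} \approx 0$ reported in Fig. 1.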

Extensive numerical experiments detailed in Figure 1 and in Sec. 5.1 demonstrate that both $c_{\mathrm{SLSC}}$ and $c_{\mathrm{SL}}$ are large relative to $c_{\mathrm{DL}}$, implying: (1) logit gradients of logit $k$ cluster together for inputs whose ground truth class is $k$; (2) logit gradients of logit $k$ also cluster together regardless of the class label of each example, although slightly less strongly than when the class label is restricted to $k$; (3) logit gradients of two different logits $k$ and $l$ are essentially orthogonal; (4) such clustering occurs not only at initialization but also after training.

Figure 2: A diagram of logit gradient clustering. The logit gradients $\partial z^n_k / \partial \theta$ cluster based on the logit index $k$, regardless of the input datapoint $x^n$. The gradients coming from examples of the true class $k$ cluster more tightly, while gradients of different logits $k$ and $l$ are nearly orthogonal.

Overall, these results can be viewed schematically as in Figure 2. Indeed, one can decompose the logit gradients as

$$\frac{\partial z^n_k}{\partial \theta_i} = \bar{g}_{ki} + \delta g^n_{ki}, \qquad (14)$$

where the $C$ mean logit gradient vectors $\bar{\mathbf{g}}_k$ have components

$$\bar{g}_{ki} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial z^n_k}{\partial \theta_i}, \qquad (15)$$

and $\delta g^n_{ki}$ denotes the example-specific residuals. Clustering, in the sense of large $c_{\mathrm{SL}}$, implies the mean logit gradient components $\bar{g}_{ki}$ are significantly larger than the residual components $\delta g^n_{ki}$. In turn, the observation of small $c_{\mathrm{DL}}$ implies that the mean logit gradient vectors $\bar{\mathbf{g}}_k$ and $\bar{\mathbf{g}}_l$ for $k \neq l$ are essentially orthogonal. Both effects are depicted schematically in Fig. 2. Overall, this observation of logit gradient clustering is similar to that noted in [28], though our explicit numerical modeling and focus on the properties in Sec. 1 go beyond it.

Equations (10) and (14) suggest a random matrix approach to modelling the Hessian, as well as a random model for the gradient in (8). The basic idea is to model the mean logit gradients $\bar{g}_{ki}$, the residuals $\delta g^n_{ki}$, and the logits $z^n_k$ themselves (which give rise to the class probabilities $p^n_k$ through (1)) as independent random variables. Such a modelling approach neglects correlations between logit gradients and logit values across both examples and weights. However, we will see that this simple modelling assumption is sufficient to produce the 4 striking properties of the local shape of neural loss landscapes described in Sec. 1.

In this random matrix modelling approach, we simply choose the components $\bar{g}_{ki}$ to be i.i.d. zero mean Gaussian variables with variance $\sigma^2_{\bar{g}}$, while we choose the residuals $\delta g^n_{ki}$ to be i.i.d. zero mean Gaussians with variance $\sigma^2_{\delta g}$. With this choice, for high dimensional weight spaces with large $D$, we can realize the logit gradient geometry depicted in Fig. 2. Indeed, the mean logit gradient vectors $\bar{\mathbf{g}}_k$ are approximately orthogonal, and logit gradients cluster at high $c_{\mathrm{SL}}$, with a clustering value given approximately by $\sigma^2_{\bar{g}} / (\sigma^2_{\bar{g}} + \sigma^2_{\delta g})$. Finally, inserting the decomposition (14) into (10), and neglecting cross-terms whose average would be negligible at large $N$ due to the assumed independence of the logit gradient residuals and logits in our model, yields

$$H_{ij} \approx \sum_{k,l=1}^{C} \left[ \frac{1}{N} \sum_{n=1}^{N} \left( p^n_k \delta_{kl} - p^n_k p^n_l \right) \right] \bar{g}_{ki} \, \bar{g}_{lj} \;+\; \frac{1}{N} \sum_{n=1}^{N} \sum_{k,l=1}^{C} \left( p^n_k \delta_{kl} - p^n_k p^n_l \right) \delta g^n_{ki} \, \delta g^n_{lj}. \qquad (16)$$

This is the sum of a rank $C$ term and a high rank noise term, and the larger the logit clustering $c_{\mathrm{SL}}$, the larger the eigenvalues of the former relative to the latter, yielding outlier eigenvalues plus a bulk spectrum through the BBP analogy described in Sec. 3.1.

While these choices constitute our random model of logit gradients, to complete the random model of both the Hessian in (16) and the gradient in (8), we need to provide a random model for the logits $z^n_k$, or equivalently the class probabilities $p^n_k$, which we turn to next.

3.3 Freezing of class probabilities both over training time and weight scale

Figure 3: The motion of probabilities in the probability simplex a) during training in a real network, and b) as a function of logit variance in our random model. (a) The distribution of softmax probabilities in the probability simplex for a 3-class subset of CIFAR-10 during an early, middle, and late stage of training a SmallCNN. (b) The motion of probabilities induced by increasing the logit variance (blue to red) in our random model and the corresponding decrease in the entropy of the resulting distributions.

A common observation is that over training time, the predicted softmax class probabilities evolve from hot, or high entropy, distributions near the center of the $(C-1)$-dimensional probability simplex, to colder, or lower entropy, distributions near the corners of the same simplex, where the one-hot vectors reside. An example is shown in Fig. 3 for the case of 3 classes of CIFAR-10. We can develop a simple random model of this freezing dynamics over training by assuming the logits themselves are i.i.d. zero mean Gaussian variables with variance $\sigma^2_z$, and further assuming that this variance increases over training time. Direct evidence for the increase in logit variance over training time is presented in Fig. 4 and in Sec. 5.2.

The random Gaussian distribution of logits $z^n_k$ with variance $\sigma^2_z$ in turn yields a random probability vector $\mathbf{p}$ for each example through the softmax function in (1). This random probability vector is none other than that found in Derrida's famous random energy model [23], which is a prototypical toy example of a spin glass in physics. Here the negative logits play the role of an energy function over $C$ physical states, the logit scale $\sigma_z$ plays the role of an inverse temperature, and $\mathbf{p}$ is thought of as a Boltzmann distribution over the $C$ states. At high temperature (small $\sigma_z$), the Boltzmann distribution explores all $C$ states roughly equally, yielding an entropy near the maximal value $\log C$. Conversely, as the temperature decreases ($\sigma_z$ increases), the entropy decreases, approaching $0$, signifying a frozen state in which $\mathbf{p}$ concentrates on one of the $C$ physical states (with the particular chosen state depending on the particular realization of the random logits). Thus this simple i.i.d. Gaussian random model of logits mimics the behavior seen in training simply by increasing $\sigma_z$ over training time, yielding the observed freezing of predicted class probabilities (Fig. 3).

Such a growth in the scale of logits over training is indeed demonstrated in Fig. 4 and in Sec. 5.2, and it could arise naturally as a consequence of an increase in the scale of the weights over training, which has been previously reported [18, 29]. We note also that the same freezing of predicted softmax class probabilities could also occur at initialization as one moves radially out in weight space, which would then increase the logit variance as well. Below in Sec. 4 we will make use of the assumed feature of freezing of class probabilities both over increasing training times and over increasing weight scales at initialization.
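The freezing transition itself is easy to verify numerically. The following sketch (our own illustration, with arbitrary choices of $C$ and sample count) estimates the mean softmax entropy of i.i.d. Gaussian logits, which falls from near $\log C$ toward $0$ as $\sigma_z$ grows:

```python
import numpy as np

def mean_softmax_entropy(sigma_z, C=10, n_samples=2000, seed=0):
    """Average entropy of p = softmax(z) for i.i.d. logits z ~ N(0, sigma_z^2),
    mirroring the random energy model analogy: sigma_z acts as an inverse temperature."""
    rng = np.random.default_rng(seed)
    z = sigma_z * rng.standard_normal((n_samples, C))
    z -= z.max(axis=1, keepdims=True)       # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return float(np.mean(-np.sum(p * np.log(p + 1e-300), axis=1)))

# Hot (small sigma_z): entropy near log C. Cold (large sigma_z): entropy near 0,
# i.e. the softmax distribution freezes onto a single class.
entropies = [mean_softmax_entropy(s) for s in (0.1, 2.0, 8.0)]
```

Increasing `sigma_z` in this sketch reproduces the hot-to-cold motion across the simplex shown in Fig. 3.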

4 Deriving loss landscape properties

We are now in a position to exploit the features and simplifying assumptions made in Sec. 3 to provide an exceedingly simple unifying model for the gradient and the Hessian that simultaneously accounts for all 4 striking properties of the neural loss landscape described in Sec. 1. We first review the essential simplifying assumptions. First, to understand the top eigenvalues and associated eigenvectors of the Hessian, we assume the logit curvature term in (9) is weak enough to neglect, yielding the model Hessian in (10), which is composed of logit gradients and predicted class probabilities . In turn, these quantities could a priori have complex, interlocking dependencies both over weight space and over training time, leading to potentially complex behavior of the Hessian and its relation to the gradient in (8).

We instead model these quantities by simply employing a set of independent zero mean Gaussian variables with specific variances that can change over either training time or weight scale. We assume the logit gradients decompose as in (14), with mean logit gradient components distributed as $\bar{g}_{ki} \sim \mathcal{N}(0, \sigma^2_{\bar{g}})$ and residuals distributed as $\delta g^n_{ki} \sim \mathcal{N}(0, \sigma^2_{\delta g})$. Additionally, we assume the logits themselves are distributed as $z^n_k \sim \mathcal{N}(0, \sigma^2_z)$. As we see next, inserting this collection of i.i.d. Gaussian random variables into the expressions for the softmax in (1), the gradient in (8), and the Hessian model in (16), for various choices of $\sigma_{\bar{g}}$, $\sigma_{\delta g}$ and $\sigma_z$, yields a simple unified model sufficient to account for the 4 striking observations in Sec. 1.
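Putting the ingredients together, a minimal simulation of this random model (with illustrative sizes and scales of our own choosing, not the values used for the paper's figures) already reproduces the bulk-plus-outliers spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, N = 5, 300, 200                # classes, weights, examples (illustrative sizes)
sig_m, sig_r, sig_z = 1.0, 0.3, 2.0  # mean logit-gradient, residual, and logit scales

M = sig_m * rng.standard_normal((C, D))  # mean logit gradients, near-orthogonal at large D
H = np.zeros((D, D))
for n in range(N):
    J = M + sig_r * rng.standard_normal((C, D))   # per-example logit gradients, Eq. (14)
    z = sig_z * rng.standard_normal(C)            # i.i.d. Gaussian logits
    p = np.exp(z - z.max()); p /= p.sum()         # softmax probabilities, Eq. (1)
    H += J.T @ (np.diag(p) - np.outer(p, p)) @ J  # G-term model of Eq. (10)
H /= N

eigs = np.linalg.eigvalsh(H)
# The spectrum splits into a small bulk plus large outliers. Note that
# diag(p) - p p^T annihilates the all-ones logit direction, so one of the C
# planted directions is suppressed and roughly C - 1 clear outliers appear.
```

With high clustering (`sig_m` well above `sig_r`), the planted low rank term dominates and the outliers separate cleanly from the bulk, as in Fig. 5.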

The results of our model, shown in Figs. 5, 6, 7 and 8, were obtained using fixed illustrative values of the number of classes $C$, the number of examples $N$, and the number of parameters $D$. The logit standard deviation $\sigma_z$ was chosen to set the average highest predicted probability, and the mean logit gradient scale $\sigma_{\bar{g}}$ and logit gradient noise scale $\sigma_{\delta g}$ were chosen to yield a same-logit clustering value $c_{\mathrm{SL}}$ that matches our observations in Fig. 1. We assigned class labels to random probability vectors so as to obtain a fixed simulated accuracy. For the experiments in Figures 7 and 8, we swept through a range of logit standard deviations $\sigma_z$, while also monotonically increasing $\sigma_{\bar{g}}$, as observed in real neural networks in Fig. 4. We now demonstrate that the 4 properties emerge from our model.

Figure 4: The evolution of logit variance, logit gradient length, and weight space radius with training time. The top left panel shows that the logit variance across classes, averaged over examples, grows with training time. The top right panel shows that logit gradient lengths grow with training time. The bottom left panel shows the weight norm grows with training time. All 3 experiments were conducted with a SmallCNN on CIFAR-10. The bottom right panel shows the logit variance grows as one moves radially out in weight space, at random initialization, with no training involved, again in a SmallCNN.
(1) Hessian spectrum = bulk + outliers.

Our random model in Fig. 5 clearly exhibits this property, consistent with that observed in much more complex neural networks (e.g. [11]). The outlier emergence is a direct consequence of high logit clustering (large $c_{\mathrm{SL}}$), which ensures that the rank $C$ term dominates the high rank noise term in (16). This dominance yields the outliers through the BBP phase transition mechanism described in Sec. 3.1.

Figure 5: The Hessian eigenspectrum in our random model. Due to logit clustering, it exhibits a bulk plus outliers. To obtain $C$ clearly distinct outliers, one can use mean logit gradients whose lengths vary with the class index $k$ (data not shown).
(2) Gradient confinement to principal Hessian subspace.

Figure 6 shows the cosine angle between the gradient and the Hessian eigenvectors in our random model. The majority of the gradient power lies within the subspace spanned by the top few eigenvectors of the Hessian, consistent with observations in real neural networks [12]. This occurs because the large mean logit gradients $\bar{\mathbf{g}}_k$ contribute both to the gradient in (8) and to the principal Hessian eigenspace in (16).

Figure 6: The overlap between Hessian eigenvectors and the gradient in our random model. Blue dots denote cosine angles between the gradient and the sorted eigenvectors of the Hessian. The bulk (71% in this particular case) of the total gradient power lies in the top 10 eigenvectors (out of $D$) of the Hessian.
(3) Non-monotonic evolution of the top Hessian eigenvalue with training time.

Equating training time with a growth of the logit variance $\sigma^2_z$ and a simultaneous growth of $\sigma_{\bar{g}}$ while keeping $\sigma_{\delta g}$ constant, our random model exhibits eigenvalue growth then shrinkage, as shown in Figure 7, consistent with observations on large CNNs trained on realistic datasets [13]. This non-monotonicity arises from a competition between the shrinkage, due to freezing probabilities with increasing $\sigma_z$, of the eigenvalues of the $C \times C$ matrix with components $\frac{1}{N} \sum_n \left( p^n_k \delta_{kl} - p^n_k p^n_l \right)$ in (16), and the growth of the mean logit gradients $\bar{g}_{ki}$ in (16) due to increasing $\sigma_{\bar{g}}$.

Figure 7: The top eigenvalue of the Hessian in our random model as a function of the logit standard deviation $\sigma_z$ ($\propto$ training time, as demonstrated in Fig. 4). We also model logit gradient growth over training by monotonically increasing $\sigma_{\bar{g}}$ while keeping $\sigma_{\delta g}$ constant.
(4) The Goldilocks zone: $\mathrm{Tr}(H)/\|H\|$ is large (small) for small (large) weight scales.

Equating increasing weight scale with increasing logit scale $\sigma_z$, our random model exhibits this property, as shown in Fig. 8, consistent with observations in CNNs [17]. To replicate the experiments in [17], we projected our Hessian to a random low dimensional hyperplane (data not shown) and verified that the behavior we observe persists there numerically. This decrease in $\mathrm{Tr}(H)/\|H\|$ (which is approximately invariant to the overall mean logit gradient scale $\sigma_{\bar{g}}$) is primarily a consequence of the freezing of probabilities with increasing $\sigma_z$.

Figure 8: The Hessian statistic $\mathrm{Tr}(H)/\|H\|$ as a function of the logit standard deviation $\sigma_z$ ($\propto$ training time, as shown in Fig. 4). This transition is equivalent to what was seen for CNNs in [17].

5 Justifying modelling assumptions

Our derivation of the 4 properties of the local shape of the neural loss landscape in Sec. 4 relied on several modelling assumptions in a simple, unified random model detailed in Sec. 3. These assumptions include: (1) neglecting logit curvature (introduced and justified in Sec. 3.1), (2) logit gradient clustering (introduced in Sec. 3.2 and justified in Sec. 5.1 below), and (3) increases in logit variance both over training time and weight scale, to yield freezing of class probabilities (introduced in Sec. 3.3 and justified in Sec. 5.2 below).

5.1 Logit gradient clustering

Fig. 1 demonstrates, as hypothesized in Fig. 2, that logit gradients do indeed cluster together within the same logit, and that they are essentially orthogonal between different logits. We observed this with fully-connected and convolutional networks, with different non-linearities, at different stages of training (including initialization), and on different datasets. We note that these measurements are related, but complementary, to the concept of stiffness in [27].

5.2 Logit variance dependence

Fig. 4 demonstrates empirical facts observed in actual CNNs trained on CIFAR-10 or at initialization: (1) logit variance across classes, averaged over examples, grows with training time; (2) logit gradient lengths grow with training time; (3) the weight norm grows with training time; (4) logit variance grows with weight scale at random initialization. These four facts justify modelling assumptions used in our random model of Hessians and gradients: (1) we can model training time by increasing $\sigma_z$, corresponding to increasing logit variances in the model, while simultaneously (2) also increasing $\sigma_{\bar{g}}$, corresponding to increasing mean logit gradient lengths in the model; and (3) we can model increases in weight scale at random initialization by increasing $\sigma_z$. We note the connection between training epoch and weight scale has also been established in [18, 29].

6 Discussion

Overall, we have shown that four non-intuitive, surprising, and seemingly unrelated properties of the local geometry of the neural loss landscape can all arise naturally in an exceedingly simple random model of Hessians and gradients and how they vary both over training time and weight scale. Remarkably, we do not need to make any explicit reference to highly specialized structure in either the data, the neural architecture, or potential biases in regions explored by gradient descent. Instead, the key general properties we required were: (1) weakness of logit curvature; (2) clustering of logit gradients, as depicted schematically in Fig. 2 and justified in Fig. 1; (3) growth of logit variances with training time and weight scale (justified in Fig. 4), which leads to freezing of softmax output distributions as shown in Fig. 3. Overall, the isolation of these key features provides a simple, unified random model which explains how the surprising properties described in Sec. 1 might naturally emerge in a wide range of classification problems.

Acknowledgments

We would like to thank Yasaman Bahri and Ben Adlam from Google Brain and Stanislaw Jastrzebski from NYU for useful discussions.

References

  • Saxe et al. [2014] A Saxe, J McClelland, and S Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.
  • Dauphin et al. [2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
  • Hochreiter and Schmidhuber [1997] S Hochreiter and J Schmidhuber. Flat minima. Neural Comput., 9(1):1–42, January 1997.
  • Keskar et al. [2016] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch training for deep learning: Generalization gap and sharp minima. September 2016.
  • Dinh et al. [2017] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1019–1028, Sydney, NSW, Australia, 2017. JMLR.org.
  • Chaudhari et al. [2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. November 2016.
  • Sagun et al. [2016] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond, 2016.
  • Sagun et al. [2017a] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks, 2017a.
  • Yao et al. [2018] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries, 2018.
  • Papyan [2019a] Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians, 2019a.
  • Ghorbani et al. [2019] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density, 2019.
  • Gur-Ari et al. [2018] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace, 2018.
  • Jastrzebski et al. [2018] Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of dnn loss and the sgd step length. arXiv preprint arXiv:1807.05031, 2018.
  • Keskar et al. [2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=H1oyRlYgg.
  • LeCun et al. [1993] Yann LeCun, Patrice Y. Simard, and Barak Pearlmutter. Automatic learning rate maximization by on-line estimation of the hessian eigenvectors. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 156–163. Morgan-Kaufmann, 1993.
  • Li et al. [2018] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes, 2018.
  • Fort and Scherlis [2018] Stanislav Fort and Adam Scherlis. The goldilocks zone: Towards better understanding of neural network loss landscapes, 2018.
  • Fort and Jastrzebski [2019] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes, 2019.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2018.
  • Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Baik et al. [2005] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, 2005.
  • Benaych-Georges and Nadakuditi [2011] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math., 227(1):494–521, May 2011.
  • Derrida [1980] B Derrida. The random energy model. Physics Reports, 67(1):29–35, 1980.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • LeCun et al. [1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • Sagun et al. [2017b] Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. 2017b.
  • Fort et al. [2019] Stanislav Fort, Paweł Krzysztof Nowak, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks, 2019.
  • Papyan [2019b] Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5012–5021, Long Beach, California, USA, 09–15 Jun 2019b. PMLR. URL http://proceedings.mlr.press/v97/papyan19a.html.
  • Anonymous [2020] Anonymous. Deep ensembles: A loss landscape perspective. In Submitted to International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xZAkrFPr. under review.