Taxonomizing local versus global structure in neural network loss landscapes

07/23/2021
by   Yaoqing Yang, et al.

Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper. Among other things, local metrics (such as the smoothness of the loss landscape) have been shown to correlate with global properties of the model (such as good generalization). Here, we perform a detailed empirical analysis of the loss landscape structure of thousands of neural network models, systematically varying learning tasks, model architectures, and/or quantity/quality of data. By considering a range of metrics that attempt to capture different aspects of the loss landscape, we demonstrate that the best test accuracy is obtained when: the loss landscape is globally well-connected; ensembles of trained models are more similar to each other; and models converge to locally smooth regions. We also show that globally poorly-connected landscapes can arise when models are small or when they are trained to lower quality data; and that, if the loss landscape is globally poorly-connected, then training to zero loss can actually lead to worse test accuracy. Based on these results, we develop a simple one-dimensional model with load-like and temperature-like parameters, we introduce the notion of an effective loss landscape depending on these parameters, and we interpret our results in terms of a rugged convexity of the loss landscape. When viewed through this lens, our detailed empirical results shed light on phases of learning (and consequent double descent behavior), fundamental versus incidental determinants of good generalization, the role of load-like and temperature-like parameters in the learning process, different influences on the loss landscape from model and data, and the relationships between local and global metrics, all topics of recent interest.



1 Introduction

Among the many approaches to understanding the behavior of neural network (NN) models, the study of their loss landscapes [1, 2] has proven to be particularly fruitful. Indeed, analyzing loss landscapes has helped shed light on the workings of many popular techniques, including large-batch training [3, 4], adversarial training [5], residual connections [6], and BatchNorm [7]. One particular concept of recent interest is the so-called sharpness of local minima [3, 8, 9, 5]. While sharpness can be measured by first-order sensitivity measures, such as the Jacobian or Lipschitz constant, it is more appropriately measured by second-order sensitivity measures, typically via the Hessian spectrum [10]. It has been observed that in some cases NNs generalize well when they converge to a relatively flat, i.e., non-sharp, local minimum [3].

While such local sharpness measures can provide insight, their focus on the local geometry of the loss landscape neglects its global structure (namely, precisely the sort of structure that statistical mechanics approaches to learning aim to quantify [11, 12]). Indeed, it is well-known that existing sharpness-based metrics can be altered (trivially) by reparameterization tricks or (more interestingly) by taking algorithmic steps which have the effect of changing the local structures on the loss landscape [8, 5, 13]. For example, [5] shows that adversarial training can decrease the magnitude of Hessian eigenvalues and bias the model towards a locally smooth area, even though adversarial training can reduce clean test accuracy [14]. Similarly, [13] shows that Hessian eigenvalues become smaller with reduced regularization, even though increased regularization is known to reduce overfitting and improve training, if used properly. More general considerations would suggest (and indeed our own empirical results, e.g., as reported in Figure 8, demonstrate) that by training to data with noisy labels, one can find models that generalize poorly and yet simultaneously lie in very "flat" regions of the loss landscape, with small Hessian eigenvalues, and vice versa. These observations (and other observations we describe below) indicate that the previously-observed empirical correlation between very local metrics like sharpness and more global properties like generalization performance may be correlative and not causative, i.e., it may be due to the confounding factor that results in the published literature are typically reported for reasonably-good models trained to reasonably-good data, rather than due to some fundamental properties of deep NNs. They also raise the question of how to capture more global properties of the loss landscape.

[Figure 1 shows a 2-by-2 grid: columns "Globally poorly-connected" and "Globally well-connected"; rows "Locally sharp" and "Locally flat"; cells Phase I, Phase II, Phase III, and Phase IV-A / Phase IV-B.]
Figure 1: (Caricature of different types of loss landscapes). Globally well-connected versus globally poorly-connected loss landscapes; and locally sharp versus locally flat loss landscapes. Globally well-connected loss landscapes can be interpreted in terms of a global “rugged convexity”; and globally well-connected and locally flat loss landscapes can be further divided into two sub-cases, based on the similarity of trained models.

Motivated by these considerations, we are interested in understanding local properties/structure versus global properties/structure of the loss landscape of realistic NN models. While similar ideas underlie work that adopts a statistical mechanics perspective [11, 2, 12, 15], here we are interested in adopting an operational machine learning (ML) perspective, where we employ metrics that have been used within ML as “experimental probes” to gain insight into local versus global properties. To do so, we employ the following metrics.


  • First, we consider Hessian-based metrics, including the largest eigenvalue of the Hessian and the trace of the Hessian. These metrics try to capture local curvature properties of the loss landscape [10].

  • Second, we use mode connectivity [16, 17]—in particular, the connectivity between trained models. This metric tries to capture how well-connected different local minima are to each other in the loss landscape.

  • Third, we use CKA similarity [18] to try to capture a correlation-like similarity between the outputs of different trained models. Averaging the CKA over several pairs of models can be thought of as an approximation to so-called overlap integrals frequently appearing in statistical mechanics [11, 19, 12].

We have considered many other metrics, but these three seem to be particularly useful for identifying global structure versus local structure in loss landscapes. Informally, mode connectivity, as its name suggests, captures connectivity, where well-connected models exhibit a single "rugged basin" with low-energy / low-loss, potentially non-linear, paths through the loss landscape (i.e., continuous chains of models), all achieving a small loss value. We expect this property to be important since the connectivity of local minima indicates the efficiency with which the training dynamics can explore the loss landscape, without becoming stuck at saddle points or in a "bad" local minimum. Similarly, CKA similarity captures similarity, where an ensemble of good models will produce roughly similar outputs. These two types of metrics are different and complementary; and both of them are very different from Hessian-based metrics, which clearly capture much more local information.

Here we briefly summarize our main contributions.


  • We design an experimental setup based on two control parameters, a temperature-like parameter that correlates with the magnitude of SGD noise during training, e.g., batch size (in most figures), learning rate, or weight decay, and a load-like parameter that measures the relationship between model size and data quantity and/or quality, e.g., the amount of data, size of intermediate layers, amount of exogenously-introduced label noise, etc. By training thousands of models, under a variety of settings of these parameters, and by measuring local and global metrics of the loss landscape, we identify four distinct phases in temperature-load space, with relatively sharp transitions between them.

  • Using global connectivity (measured by mode connectivity) and local flatness (measured by the Hessian), we taxonomize loss landscapes into four categories, which are pictorially represented in Figure 1, labeled Phase I through Phase IV. For reasons observed in our empirical results in Section 3, it is often convenient to further divide Phase IV into two subcategories, depending on whether the trained models produce similar representations (as measured by CKA similarity). If the loss landscape satisfies the first property, we say it is globally well-connected; and if the loss landscape also satisfies the second property, we say it is globally nice. (Our empirical results show that a loss landscape can generate dissimilar models while being globally well-connected; but we do not observe a loss landscape that generates similar models but is globally poorly-connected.) Depending on whether the Hessian eigenvalues are large or small, we say the loss landscape is locally sharp or locally flat.

  • Based on these results, as well as measured model quality, e.g., test accuracy, we empirically demonstrate that the global (but not necessarily local) structure of a loss landscape is well-correlated with good generalization performance, and that the best generalization occurs in the phase associated with a locally flat, globally nice loss landscape. We demonstrate these results on a range of computer vision and natural language processing benchmarks (CIFAR-10, CIFAR-100, SVHN, and IWSLT 2016 De-En) and various models (ResNet, VGG, and Transformers). We also vary the amount of data, the number of noisy labels, etc., to study both the effect of the quantity of data and the quality of data on changing the loss landscape.

  • To understand better our main empirical results, we introduce a toy bifrequency quadratic model; and we use this model to illustrate the main effects of varying temperature-like and load-like parameters on an effective loss landscape. (See Figure 9.) This provides a simple demonstration of the shapes of Phases I–IV seen in Figure 1.

  • We observe the well-known double descent phenomenon [20, 21] in our experiments, which exhibits itself as a "bad fluctuation" between the different phases (e.g., see the transition that separates Phases I and II from Phases III and IV in Figure 4(a)). Our empirical observations on double descent corroborate recent theoretical analyses [22, 23], which view the phenomenon as a consequence of a transition between qualitatively different phases of learning [12].

Computing connectivity and similarity requires comparing multiple distinct models. Compared to Hessian computations, this could be computationally expensive, especially if model training is expensive (this is also more expensive than recent work on NN weight matrix analysis [15, 24, 25]). For many reasonably-sized models, however, the metrics we consider are sufficiently tractable so as to be useful, e.g., during model training (although a full analysis of that is outside the scope of this paper). Moreover, both connectivity and similarity can be computed only from the training data and trained networks, without access to any testing data, thus providing non-trivial predictors of generalization performance. We should also note that the use of connectivity and similarity for studying the global structure of NN loss landscapes has motivations in classical spin-glass theory (which has been widely applied in studying NNs [26, 11, 27, 28], as well as neuroscience [29, 30, 31]); and was inspired by [12], whose results suggest a rugged convexity in the NN loss landscape, as well as the concept of a folding funnel in the (statistical mechanics of) protein folding literature [32, 33, 34, 35]. Also motivating our approach is a large body of work related to energy landscapes [36, 37]. We discuss related work further in Section 5.

2 Setup

In the sequel, we consider training a NN $f(\cdot\,;\theta)$, with trainable parameters $\theta$, to a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ consisting of datapoint/label pairs. Our nominal training objective is to minimize a loss function of the form

$$\mathcal{L}(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big) \;+\; \frac{\gamma}{2}\,\|\theta\|_2^2. \qquad (1)$$

Here $\ell$ is a loss function, typically chosen to be the cross-entropy loss. The parameter $\gamma$ is the weight decay parameter, which controls the level of regularization. We consider optimizing NN models using standard minibatch SGD, with iterates of the form

$$\theta_{t+1} \;=\; \theta_t \;-\; \frac{\eta}{|B_t|} \sum_{i \in B_t} \nabla_\theta\, \ell\big(f(x_i;\theta_t),\, y_i\big) \;-\; \eta\gamma\,\theta_t, \qquad (2)$$

where $\eta$ is the learning rate, $|B_t|$ is the batch size, and the indices of each minibatch $B_t$ are sampled without replacement from $\{1,\dots,n\}$. For classification tasks, we also consider the training/testing accuracy, which is simply the fraction of correctly classified points on the training set and, similarly, on a given test set.
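To make the setup concrete, here is a minimal PyTorch sketch of the objective (1) and the iterates (2). It is an illustrative stand-in rather than the training code used in the paper, and the model, dataset, and hyperparameter values are placeholders for the configurations swept in Section 3.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train_sgd(model, train_set, eta=0.1, batch_size=128, gamma=5e-4, epochs=1):
    """Minimize Eq. (1) with the minibatch SGD iterates of Eq. (2).

    Passing gamma as `weight_decay` to the optimizer is equivalent to adding
    the (gamma/2)||theta||^2 term of Eq. (1) to every minibatch gradient.
    """
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=eta, weight_decay=gamma)
    for _ in range(epochs):
        for x, y in loader:                      # minibatch sampled without replacement
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)  # batch average of ell(f(x_i; theta), y_i)
            loss.backward()
            optimizer.step()                     # theta <- theta - eta * (stochastic gradient)
    return model
```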

We now briefly introduce the main metrics and control parameters which we will consider. Due to space constraints, we defer further details to Appendix A.

Temperature and load. In the sequel, a load-like parameter of a loss landscape refers to some quantity related to the amount and/or quality of data, relative to the size of the model. Specifically, we vary either i) model size (e.g., width, which captures the size of an internal representation of the data), for fixed training set size $n$, ii) training set size $n$, for fixed model size, or iii) the "quality" of training data, which is varied by randomizing a fraction of the training labels. Each of these control parameters directly induces a different loss landscape by changing the data and/or architecture for which the loss is being computed. For example, we expect that increasing width will result in a smoother loss landscape [38]; we shall see this effect with CKA similarity in the transition from Phase IV-A to IV-B.

The second control parameter we vary in our experiments is a temperature-like parameter, representing the amount of noise introduced in the SGD iterates (2). Most commonly, we take this to be the batch size $|B_t|$, although we will also use the learning rate $\eta$ and the weight decay parameter $\gamma$. Increasing the temperature corresponds to a smaller batch size and a larger learning rate or weight decay. Varying the temperature does not directly define a different loss function $\mathcal{L}$, but rather it indirectly induces a different effective loss function. This is because, at different temperatures, the iterates of SGD concentrate on different regions of the loss landscape. Due to the noise in the stochastic optimization algorithm, the training dynamics may not be able to "see" certain features of the loss landscape. Later, in Section 4, we formalize this notion by considering the stationary distributions of a simple Langevin model for SGD with different levels of noise.

CKA similarity. To measure the similarity of two NN representations, we use the centered kernel alignment (CKA) metric, proposed in [18]. For a NN with parameters $\theta$, let $F(\theta) \in \mathbb{R}^{m \times c}$ denote the concatenation of the outputs (for this work, we focus on the similarity of representations at the output layer, i.e., after the softmax is applied, although the CKA similarity can be used to compare the representations at any layer) of the network over a set of $m$ randomly sampled datapoints. Then the (linear) CKA similarity between two parameter configurations $\theta_A$ and $\theta_B$ is given by

$$\mathrm{CKA}(\theta_A, \theta_B) \;=\; \frac{\operatorname{tr}(K_A H K_B H)}{\sqrt{\operatorname{tr}(K_A H K_A H)\,\operatorname{tr}(K_B H K_B H)}}, \qquad (3)$$

where, for $j \in \{A, B\}$, we define $K_j = F(\theta_j) F(\theta_j)^\top$, and $H = I_m - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. The CKA similarity is known to be an effective way to compare the overall representations learned by two different trained NNs [18]. Rather than computing the similarity directly on the original training points, we measure CKA on a perturbed training set comprised of Mixup samples [39]; this can reduce the trivial similarity that occurs when the models are trained to exactly or approximately zero training error. See also Appendix A.4.1 for the ablation study on different perturbed training sets.
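As a concrete reference point, the following is a small NumPy sketch of the linear CKA in Eq. (3). It follows the standard formulation of [18]; the choice of probe points (e.g., the Mixup-perturbed training set described above) is left to the caller.

```python
import numpy as np

def linear_cka(F_a: np.ndarray, F_b: np.ndarray) -> float:
    """Linear CKA between two representation matrices of shape (m, c).

    Rows are the network outputs on the same m probe points (e.g., Mixup
    samples); columns are output dimensions. Implements Eq. (3) with
    K_j = F_j F_j^T and centering matrix H = I - (1/m) 11^T.
    """
    m = F_a.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K_a = F_a @ F_a.T
    K_b = F_b @ F_b.T
    hsic_ab = np.trace(K_a @ H @ K_b @ H)
    hsic_aa = np.trace(K_a @ H @ K_a @ H)
    hsic_bb = np.trace(K_b @ H @ K_b @ H)
    return hsic_ab / np.sqrt(hsic_aa * hsic_bb)
```

Since $H$ is idempotent, the same value can also be obtained by centering the feature matrices first; the direct form above is kept only to mirror Eq. (3).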

Mode connectivity. For two parameter configurations $\theta_A$ and $\theta_B$, computing mode connectivity involves finding a low-energy curve $\phi(t)$, $t \in [0,1]$, for which $\phi(0) = \theta_A$ and $\phi(1) = \theta_B$, such that the loss along the curve, $\int_0^1 \mathcal{L}(\phi(t))\,dt$, is minimized [17, 16]. A number of techniques have been proposed to find such curves $\phi$. In this work, we use the technique proposed in [16], which parameterizes $\phi$ as a Bezier curve with $k$ bends, given by

$$\phi(t) \;=\; \sum_{j=0}^{k+1} \binom{k+1}{j} (1-t)^{\,k+1-j}\, t^{\,j}\, \theta_j, \qquad t \in [0,1],$$

where $\theta_0 = \theta_A$ and $\theta_{k+1} = \theta_B$ are fixed, and $\theta_1, \dots, \theta_k$ are trainable parameters of additional models, defining "bends" on the curve $\phi$. We use Bezier curves with three bends ($k = 3$). Given the curve $\phi$, we define the mode connectivity of the models $\theta_A$ and $\theta_B$ to be

$$\mathrm{mc}(\theta_A, \theta_B) \;=\; \tfrac{1}{2}\big(\mathcal{L}(\theta_A) + \mathcal{L}(\theta_B)\big) - \mathcal{L}\big(\phi(t^*)\big), \qquad (4)$$

where $t^* \in [0,1]$ maximizes the deviation $\big|\tfrac{1}{2}\big(\mathcal{L}(\theta_A) + \mathcal{L}(\theta_B)\big) - \mathcal{L}(\phi(t))\big|$. There are three possibilities for mode connectivity. If $\mathrm{mc}(\theta_A, \theta_B) < 0$, then $\mathcal{L}(\phi(t^*)) > \tfrac{1}{2}\big(\mathcal{L}(\theta_A) + \mathcal{L}(\theta_B)\big)$, which means there is a "barrier" of high loss between $\theta_A$ and $\theta_B$; in this case, we will say that the loss landscape is poorly-connected, or simply that mode connectivity is poor. If $\mathrm{mc}(\theta_A, \theta_B) > 0$, then there is a curve of low loss connecting $\theta_A$ and $\theta_B$, but it also implies that the training failed to locate a reasonable optimum, i.e., $\mathcal{L}(\theta_A)$ and $\mathcal{L}(\theta_B)$ are large. If $\mathrm{mc}(\theta_A, \theta_B) \approx 0$, then we will say that the loss landscape is well-connected, or simply that mode connectivity is good. Note that, for all the experiments except neural machine translation, we use the training error (0-1 loss) when computing mode connectivity, so that mode connectivity is always normalized to the range $[-1, 1]$. We provide additional details on this procedure, as well as an ablation study on different mode connectivity hyperparameters, in Appendix A.4.2.
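As a sketch of how the scalar in Eq. (4) can be evaluated once a low-loss curve has been trained, the helper below assumes a callable that returns the training error at an arbitrary point on the curve (for instance, obtained from the Bezier parameterization of [16]) and scans a grid of t values. It is illustrative rather than the paper's implementation.

```python
import numpy as np

def mode_connectivity(err_a, err_b, err_on_curve, n_points=61):
    """Mode connectivity mc(theta_A, theta_B) of Eq. (4).

    err_a, err_b : training error (0-1 loss) of the two endpoint models.
    err_on_curve : callable t -> training error of phi(t) for t in [0, 1],
                   where phi is a trained low-loss curve (e.g., a Bezier
                   curve with three bends).
    """
    ts = np.linspace(0.0, 1.0, n_points)
    errs = np.array([err_on_curve(t) for t in ts])
    midpoint = 0.5 * (err_a + err_b)
    deviations = midpoint - errs
    mc = deviations[np.argmax(np.abs(deviations))]   # signed deviation at t*
    # mc < 0  -> barrier of high error between the endpoints (poorly-connected)
    # mc ~ 0  -> low-error connecting curve and comparable endpoints (well-connected)
    # mc > 0  -> low-error curve, but the endpoint models themselves are poor optima
    return mc
```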

Hessian. The Hessian at a given point $\theta$ in parameter space is represented by the matrix $H(\theta) = \nabla_\theta^2 \mathcal{L}(\theta)$. To summarize the Hessian in a single scalar value, we report the dominant eigenvalue $\lambda_{\max}(H)$ and/or the trace $\operatorname{tr}(H)$, calculated using the PyHessian software [10].
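The paper computes these summaries with PyHessian [10]. As an independent, self-contained sketch of the same two quantities, the snippet below uses PyTorch Hessian-vector products, with power iteration for the dominant eigenvalue and a Hutchinson estimator for the trace; both are stochastic estimates, so exact PyHessian values may differ slightly.

```python
import torch

def hessian_summaries(model, loss, n_power_iter=50, n_hutchinson=50):
    """Estimate the largest-magnitude Hessian eigenvalue (power iteration)
    and the Hessian trace (Hutchinson estimator) of a scalar `loss` with
    respect to the model parameters, using only Hessian-vector products."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(vecs):
        # Hessian-vector product via a second backward pass through `grads`.
        return torch.autograd.grad(grads, params, grad_outputs=vecs, retain_graph=True)

    # Power iteration: v <- H v / ||H v||; after convergence, ||H v|| is the
    # magnitude of the dominant eigenvalue.
    vs = [torch.randn_like(p) for p in params]
    for _ in range(n_power_iter):
        norm = torch.sqrt(sum((v * v).sum() for v in vs))
        vs = [v / norm for v in vs]
        vs = [h.detach() for h in hvp(vs)]
    lambda_max = torch.sqrt(sum((v * v).sum() for v in vs)).item()

    # Hutchinson estimator: E[v^T H v] = tr(H) for random sign (Rademacher) vectors v.
    trace_estimates = []
    for _ in range(n_hutchinson):
        rs = [torch.sign(torch.randn_like(p)) for p in params]
        hr = hvp(rs)
        trace_estimates.append(sum((r * h).sum() for r, h in zip(rs, hr)).item())
    return lambda_max, sum(trace_estimates) / len(trace_estimates)
```

In practice, `loss` here would be the training loss of Eq. (1) evaluated on a large batch, built from differentiable operations so that the double backward pass is available.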

$\ell_2$ distance. We will also occasionally report the $\ell_2$ distance $\|\theta_A - \theta_B\|_2$ between two parameter configurations as a measure of similarity between models, although we typically find that the CKA similarity is a more informative measure.

3 Empirical results on taxonomizing local versus global structure

In this section, we present our main empirical results. Among other things, our results will highlight the presence of globally nice, globally well-connected/poorly-connected, and locally flat/sharp loss landscapes, and the phase transitions which separate them. In addition to test accuracy, results on six other metrics are presented, including training loss, leading Hessian eigenvalue, trace of the Hessian, CKA similarity, mode connectivity, and $\ell_2$ distance measured between model weights. For each metric, the results are presented in a 2D diagram, in which the horizontal dimension is the load (with increasing load to the right), and the vertical dimension is the temperature (with increasing temperature to the top).

We will illustrate our main results in a simple setting, and then consider several variants of this setting to illustrate how these results do or do not change when various parameters and design decisions are modified. To start, we will consider ResNets [40] trained on CIFAR-10 [41] as the standard setting to demonstrate different loss landscapes. We will scale the network width to change the size of the network. For ResNet18, which contains four major blocks with channel widths $(w, 2w, 4w, 8w)$ for a base width $w$, we select different values of $w$ to obtain ResNet models with different widths. In the standard setting, batch size, learning rate, and weight decay are kept constant throughout training to study interactions between temperature-like parameters, load-like parameters, and the loss landscape. Below, we will apply learning rate decay and consider other variations of this standard setting, in separate experiments. More details on the experimental setup can be found in Appendix B.

3.1 Types of loss landscapes and phase transitions

(a) Test accuracy
(b) Training loss
(c) Hessian eigenvalue
(d) Hessian trace
(e) Mode connectivity
(f) CKA similarity
(g) $\ell_2$ distance
Figure 2: (Standard setting). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

In this subsection, we discuss our standard setting, in which we vary model width as the load-like parameter and batch size as the temperature-like parameter. A summary of the results is displayed in Figure 2. Each pixel represents a specific training configuration tuple (width, batch size), averaged over five independent runs.
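As a concrete illustration of the sweep, a hypothetical enumeration of this grid could look as follows; the specific width and batch-size values are placeholders, not the exact values used in the paper.

```python
import itertools

# Hypothetical load-temperature grid: the load-like axis is the ResNet18 base
# width and the temperature-like axis is the SGD batch size. Each (width,
# batch size) cell is trained with five independent seeds, and the metrics of
# Section 2 are averaged over the runs.
widths = [2, 4, 8, 16, 32, 64]             # illustrative values only
batch_sizes = [16, 32, 64, 128, 256, 512]  # illustrative values only
configs = [
    {"width": w, "batch_size": b, "seed": s}
    for w, b in itertools.product(widths, batch_sizes)
    for s in range(5)
]
```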

Observe that there are two phase transitions (identified by different metrics) that separate each plot into four primary regions (corresponding to those shown in Figure 1).


  • Hessian distinguishes locally sharp versus locally flat loss landscapes. The first phase transition is displayed in Figures 2(c) and 2(d), separating Phase I/II from Phase III/IV. A larger Hessian eigenvalue or Hessian trace (darker color) represents a sharper local loss landscape [10, 5]. In Figure 2(b), we find this transition coincides with a significant decrease in the training loss. Indeed, the training loss experiences a more than tenfold reduction when transitioning from the upper side to the lower side on the right of the figure. Comparing Figure 2(a) with Figures 2(c)-2(d), we see that categorizing loss landscapes based solely on the Hessian (or, from other results, other local flatness metrics) is insufficient to predict test accuracy, e.g., the test accuracy in Phase III is lower than in Phase IV-A but the Hessian eigenvalues are almost the same.

  • Mode connectivity distinguishes globally well-connected versus globally poorly-connected loss landscapes. The second phase transition is shown in Figure 2(e). The white region represents near-zero mode connectivity which, according to our definition, implies a flat curve in the loss landscape between freshly-trained weights; the blue region represents negative mode connectivity, which implies a high barrier between weights; and the red region represents positive mode connectivity, which implies a low-loss curve between weights, although the weights are not trained to a reasonable optimum. The loss along individual mode connectivity curves is presented in Appendix A.5. In contrast to training loss, test accuracy only appears to show significant improvements after this transition. In particular, for well-connected loss landscapes, one can improve the test accuracy with a suitable choice of temperature. This phase transition forms a roughly vertical line separating Phase I from II, and separates Phase III from IV.

Based on the two transitions, we now classify the loss landscapes into the following phases.


  • Phase I: Globally poorly-connected and locally sharp: Training loss is high; Hessian eigenvalue and trace are large; and mode connectivity is poor.

  • Phase II: Globally well-connected and locally sharp: Training loss is high; Hessian eigenvalue and trace are large; and mode connectivity is positive, because the trained weights fail to locate a reasonable minimum.

  • Phase III: Globally poorly-connected and locally flat: Training loss is small; Hessian eigenvalue and trace are small; yet mode connectivity still remains poor.

  • Phase IV: Globally well-connected and locally flat: Training loss is small; Hessian eigenvalue and trace are small; and mode connectivity is good (near-zero).

We remark that in Figure 2 (and the subsequent figures below) the load-like and temperature-like parameters are on the X and Y axes, respectively, and we have, to the extent possible, kept other control parameters (in particular, those which are also load-like or temperature-like) fixed, so as to isolate the effect of load-like and temperature-like behavior on trained models. One might still wonder what the effect would be of varying the learning rate (another temperature-like parameter) during the training process; our setup holds it fixed precisely to isolate the effects of the chosen load-like and temperature-like parameters. Thus, we include a setting with a decaying learning rate during training in Section 3.2.

Here are two additional observations we can make from Figure 2.


  • CKA further distinguishes two subcategories in Phase IV. From Figure 2(f), CKA can be used to further divide Phase IV into Phase IV-A and Phase IV-B, with the latter exhibiting larger CKA similarity.

  • Simple $\ell_2$ distance is not enough. A challenge in measuring similarity between models is that the same model can be realized using different weights [8]. To account for this effect, the distance between two models is commonly defined in terms of their predictions instead of their weights. Indeed, the representation-based CKA similarity is seen to be preferable to the weight-based $\ell_2$ distance. For example, from Figure 2(g), the $\ell_2$ distance provides some limited information, but it is not as informative as CKA similarity.

Based on these results, we assert the following central claim of this work: optimal test accuracy is obtained when the loss landscape is globally nice and the trained model converges to a locally flat region; and we can diagnose these different phases in the load-like–temperature-like phase diagram with Hessian, mode connectivity, and CKA metrics. Importantly, both similarity and connectivity metrics are required for a globally nice loss landscape. Phase IV-B is precisely the region with globally nice landscapes, exhibiting the highest test accuracies.

3.2 Corroborating results

In this subsection, we consider initial corroborating results, modifying the setup of Section 3.1 to train with learning rate decay, or to data with exogenously-introduced noisy labels, etc. Still more results can be found in Section 3.3 and in the Appendix.

Training with learning rate decay. Next, we consider a similar experimental setup and the same phase diagram, except that a learning rate decay schedule (the same for all configurations) is applied in the middle of training, rather than keeping the learning rate fixed throughout. We still vary batch size to change temperature. The results are presented in Figure 3. Comparing Figure 3 with Figure 2, we see that the four phases are still present, and the test accuracy is maximized when the loss landscape is globally nice and locally flat. Therefore, our central claim is unaffected by the learning rate decay schedule. In Figures 3(c) and 3(d), smaller temperatures (i.e., larger batch sizes) in Phase IV-A appear to increase the size of the Hessian. This is a well-known issue with large-batch training [3]. Finally, note that the optimal test accuracy achieved improves in the presence of learning rate decay. See Appendix C for further discussion of learning rate decay.

(a) Test accuracy
(b) Training loss
(c) Hessian eigenvalue
(d) Hessian trace
(e) Mode connectivity
(f) CKA similarity
(g) $\ell_2$ distance
Figure 3: (Learning rate decay). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, varying batch size to change temperature and varying model width to change load. Learning rate decay is applied during training. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

Training to noisy labels and double descent. Next, we consider a similar experimental setup and the same phase diagram, except that we randomize 10% of the training labels (similar to [42]). The results are presented in Figure 4. Comparing with Figure 2, we see that our main conclusion still holds, i.e., the loss landscape which is both globally nice and locally flat achieves the best test accuracy, shown in Phase IV-B. However, an additional observation can be made: if we compare Figure 4(a) with Figure 2(a), a "dark band" arises between different learning phases. In particular, from Figure 4(a), we see that the test accuracy exhibits both width-wise and temperature-wise double descent [20, 21, 42, 23, 22], for certain parameter choices. Moreover, the shape of the dark band matches that of the transitions shown in Figures 4(c) and 4(d).

Double descent and phases of learning. The significance of this "dark band" is the following. A central prediction when viewing different phases of optimization landscapes from a statistical mechanics perspective [11, 12] is that there should be "bad fluctuations" between qualitatively different phases of learning (e.g., see the transition that separates Phases I and II from Phases III and IV in Figure 4(a)). The connection between phases and fluctuations in the popular double descent phenomenon [20, 21] was made precise theoretically in analyzable settings [22, 23]. Here, we complement [22, 23] by exhibiting the same type of transitions empirically between different phases in our taxonomy, and by demonstrating that empirical double descent is a consequence of qualitatively different phases of learning.

Training to zero loss. Next, we use Figure 4 to discuss whether to train to (approximately) zero loss, which is popular in recent work. From Figure 4(b), we observe that Phase III and Phase IV achieve almost exactly zero loss, while Phase I and Phase II do not. Once again, the loss experiences a more than tenfold reduction when transitioning from Phase I/II to Phase III/IV. However, if we restrict to globally poorly-connected regions (Phases I and III in Figure 4(a)) and we restrict to a particular width value, i.e., selecting one column slice in the diagram that cuts through Phase III, such as the red block shown in Figure 4(a), we see that the best test accuracy is obtained in Phase I, instead of Phase III. Note that Phase I not only does not achieve zero loss, but it also has locally sharp minima (observed from Figures 4(c) and 4(d)). This means that, for globally poorly-connected loss landscapes, it is possible that converging to a locally flat region achieves lower accuracy than a locally sharp region. More interestingly, this locally sharp region does not even converge to close-to-zero training loss.

We should note that this observation is obtained only for a constant learning rate, without the learning rate decay that was studied in Figure 3. In other words, for Phase III, training is done with a low temperature throughout. This can restrict the ability of SGD to "explore" the loss landscape to find a better minimum. Thus, even though Phase III achieves even lower accuracy than the non-converged training configurations in Phase I, we attribute this to insufficient exploration. Also, we note that Phase III has both poor mode connectivity and small CKA similarity, as shown in Figures 4(e) and 4(f), respectively. However, one would wrongly predict that Phase III outperforms Phase I if one only looked at local sharpness, e.g., the Hessian plots shown in Figures 4(c)-4(d), because both the Hessian eigenvalue and the Hessian trace are smaller in Phase III than in Phase I.

(a) Test accuracy
(b) Training loss
(c) Hessian eigenvalue
(d) Hessian trace
(e) Mode connectivity
(f) CKA similarity
(g) $\ell_2$ distance
Figure 4: (Training to noisy labels and double descent). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. 10% of labels are randomized, and double descent is observed between different phases. For an arbitrary column slice that cuts through Phase III (e.g., the red block), optimal accuracy is achieved in Phase I with locally sharp minima. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

3.3 Ablation study

Different temperature parameters. First, we study weight decay as an alternative temperature parameter, in addition to batch size. We change the temperature parameter from the batch size used in Figure 2 to weight decay, and we report the results in Figure 5. The results shown in Figure 5 are similar to those seen in Figure 2. One observation is that, once again, the best test accuracy is obtained when the loss landscape is both globally nice and locally flat. Another observation, from Figure 5(b), is that when training a wide model with small weight decay (shown at the bottom of the figure), the Hessian trace becomes extremely small. This matches the observations in [13] that decreasing weight decay reduces the size of the Hessian. Since weight decay is known to improve generalization, this also demonstrates that local metrics alone are insufficient to predict test performance.

(a) Test accuracy
(b) Hessian trace
(c) Mode connectivity
(d) CKA similarity
Figure 5: (Weight decay as temperature). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using weight decay as the temperature and varying model width to change load. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

Different amount of training data. Next, we vary the amount of training data (as another way of changing load) and see how that affects our results. We vary the number of training samples in CIFAR-10 by a factor of ten. Results are shown in Figure 6. Again, the optimal test accuracy is achieved when the Hessian eigenvalue and trace are small, mode connectivity is near-zero, and CKA similarity is large. Perhaps unsurprisingly, better test accuracy is achieved with more data. Here, CKA provides useful complementary information to the Hessian and mode connectivity for explaining the utility of larger data. The Hessian alone cannot predict the correct trend, as it increases in magnitude with data. Mode connectivity alone cannot predict the correct trend either, becoming increasingly poor with larger data (see the shrinking white region). Indeed, it appears that larger models are required to keep the loss landscape well-connected with increasing data. In contrast, CKA precisely captures the relationship of increasing test accuracy with additional data.

To make these trends more visible, we rearrange the plots in Figure 6 by replacing the X-axis of each figure with the amount of data, while having separate plots for different model widths. We also include results on training both with and without noisy labels. See the rearranged results in Figure 7. Now, we can clearly see that CKA is the only metric that precisely captures the relationship of increasing test accuracy with additional data. Interestingly, we observe double descent in the test-accuracy plots, both with and without noisy labels, consistent with our previous results.

These observations also imply that the utilities of extra data and larger models are different: larger models can increase connectivity in the loss landscape (e.g., Figure 2(e)); while increasing data boosts signal in the landscape, enabling trained models to become more similar to each other. Clearly, researchers have been increasing both the size of data and the size of models in recent years; our methodology suggests obvious directions for doing this in more principled ways.

[Figure 6 panels: columns correspond to 5000, 10000, 20000, 30000, 40000, and 50000 training samples; rows show test accuracy, Hessian trace, mode connectivity, and CKA similarity.]

Figure 6: (Varying amount of training data). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. We vary quantities of training data from CIFAR-10 in different columns. All plots are on the same set of axes.
[Figure 7 panels: columns correspond to model widths 4, 8, and 16 trained on clean labels and widths 8, 16, and 32 trained with noisy labels; rows show test accuracy, Hessian trace, mode connectivity, and CKA similarity.]
Figure 7: (Amount of training data as load). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using batch size as the temperature and varying the amount of training data to change load. All plots are on the same set of axes. (Left three columns). Original CIFAR-10 data. Models are ResNet18 with different width. (Right three columns). Randomizing 10% of training labels in CIFAR-10, and still training with ResNet18 of different width.

Different quality of data by changing the amount of noisy labels. Next, we vary the proportion of randomized labels to simulate a change in the quality of data, as another way to change load. To generate randomized labels, a percentage of the training data is randomly selected and altered to an incorrect target class. Results are shown in Figure 8. Once again, local information alone fails to measure the quality of training data. We can see that training with a large amount of noise does not significantly affect the Hessian — see Figure 8(b). In particular, as the temperature decreases (towards the bottom of Figure 8(b)), the Hessian becomes smaller, independent of the quantity of noisy labels. This is especially evident in Figure 8(e), where we plot Hessian trace against batch size. However, looking instead at mode connectivity in Figure 8(c) and CKA in Figure 8(d), one can easily deduce that training with more noisy labels leads to more poorly-connected loss landscapes.

(a) Test accuracy
(b) Hessian trace
(c) Mode connectivity
(d) CKA similarity
(e) Hessian trace versus batch size
Figure 8: (Proportion of randomized labels as load). Partitioning the 2D load-like—temperature-like diagram into different phases of learning, using batch size as the temperature and varying proportion of randomized training labels to change load. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes. (e) shows that the Hessian trace changes slowly with the proportion of noisy labels when training loss is small.

Different datasets, architectures, load/temperature parameters, and training schemes. We have performed a wide range of other experiments, only a subset of which we report here. In Appendix D, we cover additional datasets, including SVHN, CIFAR-100, and IWSLT 2016 German to English (De-En) (a machine translation dataset), as well as additional NN architectures, including VGG11 and Transformers. While there are many subtleties in such a detailed analysis (several of which point to future research directions), all experiments support our main conclusions. Here, we briefly summarize these results.

In Appendix D.3, we study a plot analogous to Figure 4, training with 10% noisy labels but using learning rate instead of batch size as the temperature-like parameter. Again, we observe the double descent phenomenon. Using this experiment, we infer that the decision to train to zero loss (traditionally a rule-of-thumb in computer vision tasks, although note that recent work has highlighted how the difference between exactly zero versus approximately zero can matter [25]) should depend on the global connectivity of the loss landscape. Indeed, for small models with poor connectivity, we find that training to zero loss can harm test accuracy. This suggests that the common wisdom to fit training data to zero loss is derived from experiments involving relatively high-quality data and models, and is not a principle of learning more generally.

In Appendix D.4, we show that in the setting of machine translation, the loss landscape remains poorly-connected (i.e., the mode connectivity remains negative) even for a reasonably large embedding dimension up to 512 (see Figure 19). In this case, generalization can be quite poor when training to zero loss. This conclusion matches (with hindsight) the observations in practice, e.g., dropout and early stopping can improve test loss [43, 44]. It also suggests that an embedding size of dimension 512 (for six-layer Transformers with eight attention heads used in our experiments) is still not large enough for baseline machine translation, and that certain (different) training schemes should be designed to improve the optimization on these loss landscapes.

In Appendix D.5, we study learning rate as an alternative temperature parameter, which produces analogous results to Figure 2. In Appendix D.7.1, we study large-batch training and show that it increases local sharpness. Note that for most experiments, we intentionally keep a constant learning rate when varying the batch size to study the change in the landscape with a changing temperature; thus, in Appendix D.7.2, we provide additional results on tuning learning rate with changing batch size, including the commonly used “linear scaling rule” [45].

4 Temperature and the effective loss landscape

Thus far, we have seen that varying temperature-like parameters (e.g., batch size) and load-like parameters (e.g., model width or amount of data) induces loss landscapes with qualitatively or quantitatively different properties. This was illustrated pictorially in Figure 1; and Figure 2 (and others) presented results on local and global similarity and connectivity metrics to support this. However, the loss function is defined independently of these quantities, rendering the precise relationship unclear. In this section, using a simple toy model, we illustrate how varying the temperature and load concentrates the optimizer on different regions of the loss landscape, inducing an effective loss landscape, exhibiting the properties seen in Figure 1.

To begin, consider the typical Langevin diffusion model for the SGD dynamics (2):

$$d\theta_t \;=\; -\nabla_\theta \mathcal{L}(\theta_t)\,dt \;+\; \sqrt{\eta\,\sigma^2}\,dW_t, \qquad (5)$$

where $W_t$ denotes Brownian motion, $\sigma^2$ represents the variance in the stochastic gradient (decreasing with batch size), and $\eta$ is the step-size from (2). The unique stationary distribution of this process is a Gibbs distribution of the form $\pi_T(\theta) \propto \exp\!\big(-\mathcal{L}(\theta)/T\big)$ (when it exists), where $T = \eta\sigma^2/2$ is referred to as the temperature. As expected, the temperature increases proportionally to the learning rate and inversely to the batch size. As $T \to 0$, the Gibbs distribution concentrates on the set of global minimizers of $\mathcal{L}$. On the other hand, as $T \to \infty$, $\pi_T$ becomes increasingly uniform. Therefore, our choice of temperature should influence the regions of the loss landscape explored. To formalize this concept, recall that the quantile function $Q_X$ (the pseudoinverse of the distribution function) of a random variable $X$ is the unique transformation sending uniform random variables on $[0,1]$ to $X$. In particular, for any function $g$, $\mathbb{E}[g(X)] = \int_0^1 g(Q_X(u))\,du$. Letting $Q_T$ denote the quantile functions of the marginal distributions of $\pi_T$, we define the effective loss landscape as the function $\mathcal{L}_T = \mathcal{L} \circ Q_T$, with $\mathcal{L}_T(u) = \mathcal{L}(Q_T(u))$. In the case where the marginal distributions are replaced by the joint distribution, we recover the copula associated with $\pi_T$. In general, this formulation will fail to capture details about the correlations in $\pi_T$, so we restrict our attention to a one-dimensional model, which will suffice for basic qualitative purposes.

Figure 9: (Bifrequency quadratic model). Illustration of the effective landscapes for the bifrequency quadratic model (6). Here we fix the two cosine frequencies and vary the temperature and the load. Note that as we vary temperature and load, the effective loss landscapes behave in an analogous fashion to the landscapes depicted in Figure 1.

Our one-dimensional model of choice is the following bifrequency quadratic model:

$$\mathcal{L}_\ell(x) \;=\; \frac{\ell}{2}\,x^2 \;+\; \cos(\omega_1 x) \;+\; \cos(\omega_2 x), \qquad (6)$$

where $\omega_1 \ll \omega_2$ are two fixed frequencies. As usual, $\pi_T(x) \propto \exp\!\big(-\mathcal{L}_\ell(x)/T\big)$ denotes the associated Gibbs distribution. The two cosine waves here describe noise on both global and local scales: the low-frequency term $\cos(\omega_1 x)$ produces slowly varying, global ruggedness, and the high-frequency term $\cos(\omega_2 x)$ produces rapidly varying, local ruggedness. The parameter $\ell$ is used to describe load, quantifying the amount of signal-to-noise in the landscape. Indeed, as the load-like parameter $\ell$ becomes large, the quadratic term dominates; while for small values of $\ell$, $\mathcal{L}_\ell$ resembles a two-frequency cosine wave. For additional details on the bifrequency quadratic model, see Appendix E. See Figure 9, which illustrates the model for different parameter settings. In particular, we see that as the temperature is decreased, the corresponding effective loss landscape becomes locally smoother, analogous to the transitions from Phase I to Phase III and/or Phase II to Phase IV in Figure 1; and as the load parameter is decreased, the corresponding effective loss landscape becomes globally less well-connected, analogous to the transitions from Phase II to Phase I and/or Phase IV to Phase III in Figure 1.
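The construction above is straightforward to reproduce numerically; the sketch below discretizes the toy loss (6) on a grid, forms the Gibbs density at a given temperature, and composes the loss with the resulting quantile function to obtain the effective landscape. The particular frequencies, grid, and parameter values are illustrative choices, not necessarily those used for Figure 9.

```python
import numpy as np

def bifrequency_loss(x, load, w1=1.0, w2=15.0):
    """Toy loss of Eq. (6): a quadratic 'signal' term plus global (w1) and local (w2) cosine noise."""
    return 0.5 * load * x**2 + np.cos(w1 * x) + np.cos(w2 * x)

def effective_landscape(load, temperature, x_min=-10.0, x_max=10.0, n=4001):
    """Effective loss L(Q(u)) induced by the Gibbs distribution at the given temperature."""
    x = np.linspace(x_min, x_max, n)
    dx = x[1] - x[0]
    L = bifrequency_loss(x, load)
    # Gibbs density pi_T(x) ~ exp(-L(x)/T), normalized on the grid.
    logp = -L / temperature
    p = np.exp(logp - logp.max())
    p /= p.sum() * dx
    # Quantile function Q_T of pi_T via the (monotone) discrete CDF, then compose L with Q_T.
    cdf = np.cumsum(p) * dx
    cdf /= cdf[-1]
    u = np.linspace(0.0, 1.0, 512)
    Q = np.interp(u, cdf, x)
    return u, bifrequency_loss(Q, load)

# Lower temperature suppresses the high-frequency (local) ruggedness of the effective
# landscape, while lower load leaves the low-frequency (global) ruggedness visible.
u, L_eff = effective_landscape(load=0.05, temperature=0.5)
```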

5 Related work

In this section, we review prior work related to the optimization of loss landscapes, including both local and global approaches, as well as the double descent phenomenon.

Loss landscape. Loss landscapes and their connections to SGD training have been important topics in ML research for years. Many of these ideas have roots in statistical mechanics and chemical physics [37, 36, 2, 35]. More recently, within ML, [27] shows that for large-size NNs, the local minima of the loss function often stay within a band close to the global minima; [46] observes that the loss interpolating between trained weights before and after each training iteration is convex, thus deducing that SGD moves in valley-like regions on the loss landscape; and [47] observes that SGD training can be viewed as a way to smooth the loss function, providing a way to rigorously analyze the effect of stochastic noise. Further, [15, 48, 49, 24, 25] use ideas from heavy-tailed random matrix theory to measure the energy landscape through the empirical spectral densities of learned weight matrices, without even using the training data; they show that heavy-tailed distributions in these spectral densities can predict generalization. Subsequently, [50] questions the common Brownian motion-based analyses and shows that a heavy-tailed random variable can better capture the "jump" phenomena in SGD exploration; and [51] shows that the multiplicative noise that commonly arises in SGD can improve the exploration of SGD dynamics on the non-convex loss landscape through hopping between basins. A paper that is closely related to ours is [52], which also studies both local and global geometry around trained networks; it uses a method of parent-child spawning to study training dynamics, and it shows that the transition from chaotic to stabilized training happens during the first few epochs. For more discussion of the global properties of loss landscapes, see [53] for a survey of recent theoretical and empirical results. There are also a large number of papers that theoretically analyze the convergence properties of gradient-descent-based methods [54, 55, 56, 57, 58, 59, 60]. Another line of related papers considers the properties of NNs measured in the input space, such as the sensitivity measure characterized by the input-output Jacobian norm [61], and the similarity of gradients among different samples, which is called the generalized signal-to-noise ratio [62].

Sharpness and Hessian spectrum. Sharpness-based analysis is an important building block of current research on loss landscapes. [3] proposes a sharpness-based metric and shows that large-batch training can bias trained NNs towards sharp local minima, thus making generalization worse. Several papers [63, 9, 64] use a PAC-Bayesian approach to bound generalization, which can also yield sharpness-based bounds [9]. Although researchers have found counter-arguments to the belief that flat minima generalize better [8, 13], it has been shown recently [65] that sharpness-based metrics perform well relative to other complexity metrics that aim to predict generalization; see also [25] for a discussion of the connection between sharpness and weight analysis. One way to measure sharpness is to look at the Hessian spectrum using (randomized) numerical linear algebra approaches. [5] measures the Hessian spectrum and shows that large-batch training leads to larger Hessian eigenvalues; the paper also shows that robust training can allow convergence to flat regions. [66] shows that there is a connection between the overparameterization of deep NNs and the "jamming transition" of repulsive ellipses; moreover, when the training loss is exactly zero, the Hessian spectrum of both systems can exhibit a sharp phase transition. Several papers [67, 68, 5, 69, 70] show that the Hessian spectrum is sparse and contains outliers that are dictated by a class and "cross-class" structure; this is connected to the perhaps surprising observation that the training dynamics seem to take place in a low-dimensional structure in weight space. Beyond sharp and flat minima, [71] proposes the concept of an "asymmetric valley," which contains asymmetric directions on the loss landscape, along which the loss changes sharply on one side and flatly on the other, and the paper shows that solutions biased towards the flat side generalize better. Sharpness-based and spectrum-based analyses also lead to many practical ways to improve the training and use of NNs, such as in [72, 73, 74, 75].

Connectivity of loss landscape. [76] shows that, contrary to the common belief regarding the difficulties of non-convex optimization, the linear path connecting the initialization and the minima found by SGD often shows smooth and monotonic loss changes. [17] and [16] propose the concept of mode connectivity to explicitly construct nonlinear curves connecting trained solutions in weight space, on which the loss remains small. This finding is significant in that it suggests the whole concept of "local minima" may be flawed, because there might (effectively) exist only one large, complex, connected minimum (akin to the rugged convexity we discuss). This is consistent with our results suggesting a global rugged convexity for a well-trained NN loss landscape. Another practical motivation for studying mode connectivity is to find better optima on the curve or through some ensembling technique. On the theory side, [77] proves that the locus of global minima of an overparameterized NN is a "connected submanifold"; [78] studies a more general property, the connectivity of "sublevel sets," for deep linear NNs and one-hidden-layer ReLU networks; and [79] proves that the sublevel sets of deep NNs are connected if one of the hidden layers has more neurons than the number of training samples. In [80], mode connectivity of multilayer ReLU networks is proved by assuming properties such as dropout stability; the paper also constructs an interesting two-layer network for which overparameterization does not lead to connections between all local minima. Then, [81] shows that in the mean-field regime, dropout stability holds for deep and wide NNs, and uses this to justify the empirical observation in [17] that mode connectivity improves with the size of the NN. [82] shows that linear low-loss paths can be found between two networks if they originate from a shared trained initialization. [83] extends the one-dimensional curves studied in previous literature to high-dimensional "tunnels" between a set of optima, and it shows that many regularization hyperparameters can increase the "angular width" of the tunnels that connect different local minima.

Double descent. It has recently been observed that the classical U-shaped generalization curve can be extended to a double descent curve [20, 21, 42, 84, 85, 86]. This curve accounts for the fact that large models which completely overfit the training data can still generalize [87]. [88, 89, 90] show that even double descent might not be the complete picture, and that more complex non-monotonicity in the overparameterized regime may exist; see also [91] for similar results. [92] derives the bias-variance tradeoff and shows that the peak seen in double descent arises from increased variance. That the fluctuational properties around phase transitions would lead to a double descent phenomenon in values of volume/entropy measures (such as generalization) has long been known [11, 12]; and the most relevant results to our paper are those which explicitly show that double descent can indeed arise from transitions between different learning phases [22, 23] (which is a natural prediction from the existence of these learning phases [12]).

6 Conclusions

Motivated by recent work in the statistical mechanics of learning and loss landscape analysis of NN models, we have performed a detailed empirical analysis of the loss landscape of realistic models, with particular attention to how properties vary as load-like and temperature-like control parameters are varied. In particular, local properties (such as those based on Hessian eigenvalues) are relatively easy to measure; and while more global properties of a loss landscape are more challenging to measure, we have found success with a combination of similarity metrics and connectivity metrics. This complements recent work that uses tools from statistical mechanics and heavy-tailed random matrix theory, as we can perform large-scale empirical evaluations using metrics (CKA, mode connectivity, Hessian eigenvalues, etc.) that are more familiar to the ML community. We interpreted these metrics in terms of connectivity and similarity, and we used them to obtain insight into the local versus global properties of NN loss landscapes.

Here, we summarize a few observations from our connectivity and similarity plots (that we expect will be increasingly relevant as larger data sets and models are considered).


  • A larger width improves connectivity: From Figures 2(e), 3(e), 4(e), 5(c), 14(c), 16(c), and 20(c) (see the Appendices), we see that increasing model width improves connectivity.

  • More data improves similarity: From the CKA-similarity row in Figure 6, we see that increasing the quantity of data can improve similarity.

  • Better data quality improves connectivity: From Figure 8(c), we see that increasing the quality of data by reducing the amount of randomized labels can improve connectivity.

  • A larger width and a higher temperature in Phase IV improve similarity: From Figures 2(f), 3(f), 4(f), 5(d), 14(d), 16(d), and 20(d), we see that i) a larger width increases the similarity of trained models; and ii) for Phase IV (globally well-connected and locally flat minima), using a relatively large temperature can improve similarity. From our analysis in Section 4, changing temperature does not directly change the loss landscape, but it implicitly changes the stationary distribution of SGD training, which "effectively" changes the loss landscape according to the shapes in Figure 1.

In addition to our empirical evaluations, which cover various data, architectures, and training schemes, we present a simple model to provide a theoretical explanation of the interaction between temperature and the stationary behavior of SGD, and of how these interact with training schemes, the amount of data, and the loss landscape. In future work, we aim to look at phase diagrams outside of the load/temperature form, especially in the low-connectivity regime, which is the most challenging according to our taxonomy.

Acknowledgements. We want to thank Charles Martin, Rajiv Khanna, Zhewei Yao, and Amir Gholami for helpful discussions. Michael W. Mahoney would like to acknowledge the UC Berkeley CLTC, ARO, IARPA (contract W911NF20C0035), NSF, and ONR for providing partial support of this work. Kannan Ramchandran would like to acknowledge support from NSF CIF-1703678 and CIF-2002821. Joseph E. Gonzalez would like to acknowledge supports from NSF CISE Expeditions Award CCF-1730628 and gifts from Alibaba Group, Amazon Web Services, Ant Group, CapitalOne, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk and VMware. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

References

  • [1] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Conference on Neural Information Processing Systems, pages 6389–6399, 2018.
  • [2] Andrew J Ballard, Ritankar Das, Stefano Martiniani, Dhagash Mehta, Levent Sagun, Jacob D Stevenson, and David J Wales. Energy landscapes for machine learning. Physical Chemistry Chemical Physics, 19(20):12585–12603, 2017.
  • [3] Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
  • [4] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Large batch size training of neural networks with adversarial training and second-order information. Technical Report Preprint: arXiv:1810.01021, 2018.
  • [5] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Conference on Neural Information Processing Systems, volume 31, pages 4949–4959, 2018.
  • [6] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Conference on Neural Information Processing Systems, pages 597–607, 2017.
  • [7] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Mądry. How does batch normalization help optimization? In Conference on Neural Information Processing Systems, pages 2488–2498, 2018.
  • [8] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028, 2017.
  • [9] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
  • [10] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. PyHessian: Neural networks through the lens of the hessian. In IEEE International Conference on Big Data (Big Data), pages 581–590, 2020.
  • [11] Andreas Engel and Christian P. L. Van den Broeck. Statistical mechanics of learning. Cambridge University Press, New York, NY, USA, 2001.
  • [12] Charles H Martin and Michael W Mahoney. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. Technical Report Preprint: arXiv:1710.09553, 2017.
  • [13] Diego Granziol. Flatness is a false friend. Technical Report Preprint: arXiv:2006.09091, 2020.
  • [14] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
  • [15] Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 00(00):000–000, 2021.
  • [16] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Conference on Neural Information Processing Systems, pages 8803–8812, 2018.
  • [17] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International Conference on Machine Learning, pages 1309–1318, 2018.
  • [18] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529, 2019.
  • [19] Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.
  • [20] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • [21] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.
  • [22] Zhenyu Liao, Romain Couillet, and Michael W Mahoney. A random matrix analysis of random fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. In Conference on Neural Information Processing Systems, 2020.
  • [23] Michał Dereziński, Feynman Liang, and Michael W Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. In Conference on Neural Information Processing Systems, volume 33, 2020.
  • [24] Charles H Martin, Tongsu (Serena) Peng, and Michael W Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 00(00):000–000, 2021.
  • [25] Charles H Martin and Michael W Mahoney. Post-mortem on a deep learning contest: a Simpson’s paradox and the complementary roles of scale metrics versus shape metrics. Technical Report Preprint: arXiv:2106.00734, 2021.
  • [26] Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530, 1985.
  • [27] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
  • [28] Elena Agliari, Adriano Barra, Andrea Galluzzi, Daniele Tantari, and Flavia Tavani. A walk in the statistical mechanical formulation of neural networks. In International Joint Conference on Computational Intelligence, pages 210–217, 2014.
  • [29] Mark C Fuhs and David S Touretzky. A spin glass model of path integration in rat medial entorhinal cortex. The Journal of Neuroscience, 26(16):4266–4276, 2006.
  • [30] Anthony G Hudetz, Colin J Humphries, and Jeffrey R Binder. Spin-glass model predicts metastable brain states that diminish in anesthesia. Frontiers in Systems Neuroscience, 8:234, 2014.
  • [31] Ibon Recio and Joaquín J Torres. Emergence of low noise frustrated states in E/I balanced neural networks. Neural Networks, 84:91–101, 2016.
  • [32] Joseph D Bryngelson and Peter G Wolynes. Spin glasses and the statistical mechanics of protein folding. Proceedings of the National Academy of Sciences, 84(21):7524–7528, 1987.
  • [33] Piotr Garstecki, Trinh Xuan Hoang, and Marek Cieplak. Energy landscapes, supergraphs, and “folding funnels” in spin systems. Physical Review E, 60(3):3219, 1999.
  • [34] Konstantin Klemm, Christoph Flamm, and Peter F Stadler. Funnels in energy landscapes. The European Physical Journal B, 63(3):387–391, 2008.
  • [35] Charles L Brooks, José N Onuchic, and David J Wales. Taking a walk on a landscape. Science, 293(5530):612–613, 2001.
  • [36] D. J. Wales. Energy Landscapes: Applications to Clusters, Biomolecules and Glasses. Cambridge University Press, 2003.
  • [37] F. H. Stillinger. Energy Landscapes, Inherent Structures, and Condensed-Matter Phenomena. Princeton University Press, 2016.
  • [38] Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. Technical Report Preprint: arXiv:1810.08591, 2018.
  • [39] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  • [40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [41] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [42] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2019.
  • [43] Antonio Valerio Miceli-Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. Regularization techniques for fine-tuning in neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1489–1494, 2017.
  • [44] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Conference on Neural Information Processing Systems, volume 29, pages 1019–1027, 2016.
  • [45] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. Technical Report Preprint: arXiv:1706.02677, 2017.
  • [46] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. Technical Report Preprint: arXiv:1802.08770, 2018.
  • [47] Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698–2707, 2018.
  • [48] Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pages 4284–4293, 2019.
  • [49] Charles H Martin and Michael W Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In SIAM International Conference on Data Mining, pages 505–513. SIAM, 2020.
  • [50] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837, 2019.
  • [51] Liam Hodgkinson and Michael W Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In International Conference on Machine Learning, 2021.
  • [52] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Conference on Neural Information Processing Systems, 2020.
  • [53] Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020.
  • [54] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Conference on Neural Information Processing Systems, pages 2933–2941, 2014.
  • [55] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
  • [56] SS Du, C Jin, MI Jordan, B Póczos, A Singh, and JD Lee. Gradient descent can take exponential time to escape saddle points. In Conference on Neural Information Processing Systems, pages 1068–1078, 2017.
  • [57] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.
  • [58] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4433–4441, 2018.
  • [59] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
  • [60] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019.
  • [61] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.
  • [62] Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, and Huayan Wang. Understanding why neural networks generalize well through GSNR of parameters. In International Conference on Learning Representations, 2019.
  • [63] David A McAllester. PAC-Bayesian model averaging. In Annual Conference on Computational Learning Theory, pages 164–170, 1999.
  • [64] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
  • [65] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2019.
  • [66] Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115, 2019.
  • [67] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. Technical Report Preprint: arXiv:1611.07476, 2016.
  • [68] Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. Technical Report Preprint: arXiv:1812.04754, 2018.
  • [69] Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research, 21(252):1–64, 2020.
  • [70] Stanislav Fort and Surya Ganguli. Emergent properties of the local geometry of neural loss landscapes. Technical Report Preprint: arXiv:1910.05929, 2019.
  • [71] Haowei He, Gao Huang, and Yang Yuan. Asymmetric valleys: Beyond sharp and flat local minima. In Conference on Neural Information Processing Systems, pages 2553–2564, 2019.
  • [72] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 12(12):124018, 2019.
  • [73] P Izmailov, AG Wilson, D Podoprikhin, D Vetrov, and T Garipov. Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence, pages 876–885, 2018.
  • [74] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In IEEE/CVF International Conference on Computer Vision, pages 293–302, 2019.
  • [75] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821, 2020.
  • [76] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. Technical Report Preprint: arXiv:1412.6544, 2014.
  • [77] Yaim Cooper. The loss landscape of overparameterized neural networks. Technical Report Preprint: arXiv:1804.10200, 2018.
  • [78] C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. In International Conference on Learning Representations, 2017.
  • [79] Quynh Nguyen. On connected sublevel sets in deep learning. In International Conference on Machine Learning, pages 4790–4799, 2019.
  • [80] Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Rong Ge, and Sanjeev Arora. Explaining landscape connectivity of low-cost solutions for multilayer nets. Advances in Neural Information Processing Systems, 32:14601–14610, 2019.
  • [81] Alexander Shevchenko and Marco Mondelli. Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks. In International Conference on Machine Learning, pages 8773–8784, 2020.
  • [82] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269, 2020.
  • [83] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32:6709–6717, 2019.
  • [84] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020.
  • [85] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics.
  • [86] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • [87] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
  • [88] Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi. Multiple descent: Design your own generalization curve. Technical Report Preprint: arXiv:2008.01036, 2020.
  • [89] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84, 2020.
  • [90] Stéphane d’Ascoli, Levent Sagun, and Giulio Biroli. Triple descent and the two kinds of overfitting: Where & why do they appear? In Conference on Neural Information Processing Systems, 2020.
  • [91] M. Dereziński, R. Khanna, and M. W. Mahoney. Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nystrom method. Technical Report Preprint: arXiv:2002.09073, 2020.
  • [92] Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pages 10767–10777, 2020.
  • [93] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • [94] Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261–268, 2012.
  • [95] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Appendix A Formal definitions of connectivity and similarity

In this section, we more formally define what we mean by globally nice loss landscapes, characterized by connectivity and similarity. Recall that we consider the nominal empirical risk minimization formulation, defined in Eqn. (1), and solved using the SGD iterations described in Eqn. (2). We denote by $\mathcal{D}$ the underlying distribution from which we draw training/test samples.

In many cases, such as for deep NNs, a single input-output mapping can be realized by an infinite set of different weights, where two weights $w_1$ and $w_2$ give the same mapping if $f_{w_1}(x) = f_{w_2}(x)$ for every input $x$. For example, for networks using the ReLU activation, if we multiply the weights of a particular layer by a constant and divide the weights of the next layer by the same constant, we get the same mapping, as the short check below illustrates. Thus, we view different weights $w_1$ and $w_2$ that give the same input-output mapping as being essentially the same. The definitions in the following subsections ensure that the similarity between $w_1$ and $w_2$ equals 1 (the largest similarity in our definition) if they give the same input-output mapping.
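To make the rescaling example concrete, the following toy PyTorch check (our own illustration, not code from our experiments) verifies that scaling one ReLU layer by a positive constant and dividing the next layer's weights by the same constant leaves the input-output mapping unchanged.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3))
x = torch.randn(5, 4)
y_before = net(x)

c = 3.7  # any positive constant
with torch.no_grad():
    net[0].weight *= c   # scale the first layer (weights and bias) by c ...
    net[0].bias *= c
    net[2].weight /= c   # ... and the next layer's weights by 1/c

# Same mapping: ReLU(c z) = c ReLU(z) for c > 0, and the 1/c cancels the scaling.
print(torch.allclose(y_before, net(x), atol=1e-5))  # True
```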

A.1 Definition of similarity

In this subsection, we quantify the closeness between weights trained from different initializations. We use $S(Y_1, Y_2)$ to denote a similarity metric between two matrices $Y_1$ and $Y_2$, i.e., it equals $1$ if $Y_1 = Y_2$ and is close to $0$ when $Y_1$ and $Y_2$ are dissimilar.

Definition 1 (Similarity between weights).

Given the similarity metric $S$ and the concatenations of the outputs of $f_{w_1}$ and $f_{w_2}$ on $m$ i.i.d. samples $x_1, \ldots, x_m$ from a perturbed distribution $\mathcal{D}'$, written as

$Y_{w_1} = \left[ f_{w_1}(x_1), \ldots, f_{w_1}(x_m) \right]^\top \qquad (7)$

and

$Y_{w_2} = \left[ f_{w_2}(x_1), \ldots, f_{w_2}(x_m) \right]^\top, \qquad (8)$

we define the similarity between the two weights $w_1$ and $w_2$ as

$\mathrm{sim}(w_1, w_2) = \mathbb{E}_{x_1, \ldots, x_m \sim \mathcal{D}'} \left[ S(Y_{w_1}, Y_{w_2}) \right]. \qquad (9)$

We say that a loss landscape defined by the training data $\mathcal{T}$ and the space $\mathcal{W}$ over which $w$ is optimized has a similarity level $\alpha$ for a particular random training scheme $\mathcal{A}$, if, for two weights $w_1$ and $w_2$ trained using $\mathcal{A}$ with different random initializations,

$\mathbb{E}_{\mathcal{A}} \left[ \mathrm{sim}(w_1, w_2) \right] = \alpha, \qquad (10)$

where the expectation is taken over the randomness of the training scheme $\mathcal{A}$, including the random initializations and the random shuffling of data during training. A similarity level close to 1 means that the trained models concentrate around the same mapping in terms of making predictions on $\mathcal{D}'$.

Measuring similarity. In this paper, we use the CKA similarity defined in (3) to measure the similarity between two weights. The definition in (3) can be obtained from (9) by setting

$S(Y_1, Y_2) = \frac{\left\| Y_2^\top H Y_1 \right\|_F^2}{\left\| Y_1^\top H Y_1 \right\|_F \left\| Y_2^\top H Y_2 \right\|_F}, \qquad (11)$

where $\| \cdot \|_F$ denotes the Frobenius norm and $H = I_m - \frac{1}{m} \mathbf{1} \mathbf{1}^\top$ is the centering matrix. We define the perturbed distribution $\mathcal{D}'$ as the distribution of interpolations between training samples. Thus, we can sample from $\mathcal{D}'$ by linearly combining samples from the training data [39]. To measure CKA similarity, we compute 640 interpolated samples from the training set, obtained using linear coefficients that follow the Beta(16,16) distribution. In Appendix A.4.1, we test different configurations of the perturbed training set.
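To make the measurement procedure concrete, the following is a minimal PyTorch sketch of computing the CKA similarity between two trained models on mixup-perturbed samples. The function and variable names (mixup_samples, linear_cka, model_1, model_2, train_images) are illustrative rather than the exact code used in our experiments.

```python
import torch

def mixup_samples(train_images, a=16.0, n=640):
    """Draw n perturbed inputs from D' by linearly interpolating random pairs
    of training images with coefficients lam ~ Beta(a, a)."""
    i = torch.randint(0, train_images.size(0), (n,))
    j = torch.randint(0, train_images.size(0), (n,))
    lam = torch.distributions.Beta(a, a).sample((n,)).to(train_images.device)
    lam = lam.view(-1, *([1] * (train_images.dim() - 1)))  # broadcast over C, H, W
    return lam * train_images[i] + (1.0 - lam) * train_images[j]

def linear_cka(y1, y2):
    """Linear CKA between two (n x d) output matrices, matching Eqn. (11)."""
    n = y1.size(0)
    h = torch.eye(n, device=y1.device) - torch.ones(n, n, device=y1.device) / n
    y1c, y2c = h @ y1, h @ y2  # center the outputs
    return (y2c.t() @ y1c).norm() ** 2 / ((y1c.t() @ y1c).norm() * (y2c.t() @ y2c).norm())

@torch.no_grad()
def cka_similarity(model_1, model_2, train_images):
    x = mixup_samples(train_images)  # 640 samples from the perturbed distribution
    return linear_cka(model_1(x), model_2(x)).item()
```

In our setting, model_1 and model_2 would be two networks with the same architecture trained from different random initializations.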

Remark 2.

The main reason that we use perturbed samples instead of the original training data is that measuring similarity on the training data can lead to trivially high similarity, especially if training aims to completely fit the training data. To explore this, we provide an ablation study on different perturbed distributions in Appendix A.4.1.

A.2 Definition of connectivity

In this subsection, we quantify the connectivity between trained models.

Definition 3 (A curve between weights).

Suppose $w_1$ and $w_2$ are two weights. Then, a curve between the two weights is a continuous mapping $\phi: [0,1] \to \mathcal{W}$ such that $\phi(0) = w_1$ and $\phi(1) = w_2$. We say that the curve $\phi$ is $\varepsilon$-low-energy if

(12)

In this paper, we use the mode connectivity $\mathrm{mc}(w_1, w_2)$, defined in (4), to measure the connectivity between two weights.

Similar to the definition of the similarity level, we say that a loss landscape defined by the training data $\mathcal{T}$ and the function class has a connectivity level $\beta$ for a particular random training scheme $\mathcal{A}$ if

$\mathbb{E}_{\mathcal{A}} \left[ \mathrm{mc}(w_1, w_2) \right] = \beta, \qquad (13)$

where the expectation is taken over the randomness of the training scheme $\mathcal{A}$, including the random initializations, the shuffling of data, and the randomness of the procedure used to find the curve $\phi$.

Intuitively speaking, if models trained from random initializations on a loss landscape have a negative connectivity level, the loss landscape is hard to explore. However, a large positive connectivity level may simply indicate that the training loss achieved by the two fixed end points is high. Thus, for our evaluation, a "good" connectivity level is one with $\beta$ close to 0.

Measuring connectivity. To find a low-energy curve between two weights $w_1$ and $w_2$, we follow the procedure in [16] and use the Bezier curve given by

$\phi_{\theta}(t) = \sum_{i=0}^{k} \binom{k}{i} (1-t)^{k-i} t^{i} \, \theta_i, \quad \theta_0 = w_1, \; \theta_k = w_2, \qquad (14)$

where $t \in [0,1]$, and the interior control points $\theta_1, \ldots, \theta_{k-1}$ are trainable parameters of additional models, defining "bends" on the curve $\phi_\theta$. The minimization of the loss on the curve is realized through repeated sampling of $t \sim \mathrm{Uniform}[0,1]$ and minimizing the loss $\mathcal{L}(\phi_\theta(t))$ with respect to the trainable bend parameters.

For most experiments, we follow [16] and use three bends (including the two fixed end points, i.e., $k = 2$) to parameterize the trainable curve. To train the curve, we use SGD for 50 epochs with an initial learning rate of 0.01, decayed linearly to 1% of the initial value from epoch 25 to epoch 45. The learning rate schedule follows [16], and we provide additional ablation results in Appendix A.4.2 showing that the mode connectivity results are not significantly affected by hyperparameter settings. We evaluate five values of $t$ on the curve when calculating the maximum point in (4).
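For concreteness, here is a schematic and simplified version of this procedure, written as if the network weights were a single detached flat parameter vector and the training loss were available as a differentiable function of that vector. The helper names (bezier_point, train_bend, mode_connectivity) are ours, and the mc computation follows the sign convention discussed above: a barrier in the middle gives a negative value, while a curve that dips below the endpoint average gives a positive value.

```python
import torch

def bezier_point(w1, theta, w2, t):
    """Quadratic Bezier curve with fixed endpoints w1, w2 and one trainable bend
    theta (Eqn. (14) with k = 2); all arguments are flat parameter vectors."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

def train_bend(w1, w2, loss_fn, epochs=50, steps_per_epoch=100, lr=0.01):
    """Minimize E_t[L(phi_theta(t))], t ~ Uniform(0, 1), over the bend theta."""
    theta = torch.nn.Parameter((0.5 * (w1 + w2)).detach())  # start on the straight segment
    opt = torch.optim.SGD([theta], lr=lr)
    for _ in range(epochs * steps_per_epoch):
        t = torch.rand(())                                  # one sampled t per step
        opt.zero_grad()
        loss_fn(bezier_point(w1, theta, w2, t)).backward()
        opt.step()
    return theta.detach()

@torch.no_grad()
def mode_connectivity(w1, w2, theta, loss_fn, n_points=5):
    """mc < 0: a barrier separates the two trained models;
    mc > 0: the curve dips below the endpoint average (endpoints poorly trained)."""
    ts = torch.linspace(0, 1, n_points)
    curve = torch.stack([loss_fn(bezier_point(w1, theta, w2, t)) for t in ts])
    end_avg = 0.5 * (loss_fn(w1) + loss_fn(w2))
    t_star = torch.argmax((curve - end_avg).abs())          # largest deviation from the average
    return (end_avg - curve[t_star]).item()
```

In practice, following [16], each bend is itself a full copy of the network parameters, the expectation over t is estimated with one sample per SGD step, and, for networks with batch normalization, the running statistics are typically recomputed at each evaluated point on the curve.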

A.3 Globally nice loss landscapes

In this subsection, using the similarity level $\alpha$ and the connectivity level $\beta$, we define what we mean by globally nice loss landscapes.

Definition 4.

For a learning problem with training data $\mathcal{T}$, a loss function $\mathcal{L}$, a randomized training scheme $\mathcal{A}$, and a perturbed distribution $\mathcal{D}'$ derived from the ground-truth distribution $\mathcal{D}$, we say that the problem has an $(\alpha, \beta)$-nice loss landscape if the following conditions hold:

  • The similarity level is equal to $\alpha$.

  • The connectivity level is equal to $\beta$.

If $\alpha$ is close to 1 and $\beta$ is larger than 0, we say that the loss landscape is globally nice. We note that the best connectivity is achieved when the connectivity level is close to 0.

A.4 Ablation study on different metrics

In this subsection, we study different configurations when measuring the CKA similarity and the mode connectivity.

A.4.1 Ablation study on measuring CKA similarity

Here, we provide the details of the perturbed distribution $\mathcal{D}'$ used when measuring the CKA similarity. We study two ways of creating the perturbed distribution. The first samples each perturbed image from two randomly sampled training images $x_1$ and $x_2$, i.e., $x' = \lambda x_1 + (1-\lambda) x_2$, where $\lambda$ follows the distribution Beta($a$, $a$) [39]. In Figure 10, we reproduce Figure 1(f) with different values of $a$.

The second way of creating the perturbed distribution samples each image by adding uniformly distributed noise to each pixel value of a randomly sampled training image. In Figure 11, we reproduce Figure 1(f) with different magnitudes of pixel noise added to the training images.

From the ablation results, we draw the following conclusions. First, for perturbed distributions generated with pixel noise, i.e., Figure 11, if the noise perturbation is too small, the CKA similarity becomes less informative, especially for Phase III and Phase IV, i.e., the two phases that achieve an almost exactly zero training loss (see Figure 1(b)). This is exactly the reason we use a perturbed distribution instead of the original training data distribution. Second, if we choose linear combinations, i.e., Figure 10, the CKA measurement becomes more informative and is insensitive to the specific choice of $a$. We note that Beta($a$, $a$) with a large $a$ concentrates around $\lambda = 1/2$. Thus, we effectively create linearly combined samples close to the midpoint $(x_1 + x_2)/2$. In the main paper, we choose $a = 16$.

Figure 10: (Ablation study on CKA). Ablation study on different Beta($a$, $a$) distributions when plotting the CKA results in Figure 1(f) using linear combinations of samples. From left to right: five different values of $a$.
Figure 11: (Ablation study on CKA). Ablation study on different magnitudes of pixel noise when plotting the CKA results in Figure 1(f). From left to right: noise magnitude = 5, 10, 20, 40, 80.

A.4.2 Ablation study on measuring mode connectivity

Figure 12: (Ablation study on mode connectivity). Ablation study on different ways of measuring mode connectivity. From left to right: i) Small learning rate 0.003 with three bends; ii) Medium learning rate 0.01 with three bends; iii) Large learning rate 0.03 with three bends; and iv) Medium learning rate 0.01 with four bends.

Here, we study the best configuration for producing the mode connectivity plot shown in Figure 1(e). We test four configurations. In the first three, we use learning rate = 0.003, 0.01, and 0.03, respectively, with three bends (including the two end checkpoints) that form a quadratic Bezier curve. In the fourth configuration, we use learning rate = 0.01 with four bends that form a cubic Bezier curve. See Figure 12. Note that we scale the training time to match the learning rates: when training with learning rate 0.01, we train for 50 epochs; with learning rate 0.003, for 150 epochs; and with learning rate 0.03, for 40 epochs. We use learning rate decay in all these experiments. From Figure 12, the mode connectivity results are robust to the choice of specific configuration. We note that training with the medium learning rate 0.01 and three bends is the standard setting that we use in all the other experiments.

A.5 Visualizing individual low-energy curves in the mode connectivity plots

In this subsection, we present the low-energy curves calculated from mode connectivity, which we have used to draw mode connectivity plots such as Figure 1(e). The individual curves are shown in Figure 13. Each subfigure in Figure 13 represents the low-energy loss curve found using the curve searching algorithm described in Appendix A.2 for a specific (batch size, width) configuration. The loss curve represents the loss evaluated at the interpolation points between two trained models for that (batch size, width) configuration; the two trained models are fixed during the curve searching process. We then summarize each loss curve using the mc value defined in (4). When the low-energy curve is convex, e.g., in the top-right corner of the figure, mc is positive. This shape implies that the two end points (two models with the same architecture trained from different random initializations) are not able to find a good local minimum with a large temperature. When the low-energy curve has a high "barrier" at the middle point, e.g., in the left three columns of the figure, mc is negative. This shape implies that the loss landscape is poorly-connected, because it is hard to find a low-energy path between two models trained from different initializations. When the entire curve remains at 0, e.g., in the bottom-right corner of the figure, the mc value is zero, which corresponds to the white regions in the bottom-right corner of Figure 1(e). This shape implies that the loss landscape is well-connected.

Figure 13: (Individual mode connectivity curves). The individual low-energy curves found by mode connectivity, which are used to draw Figure 1(e). Subplots in the left three columns show barriers in the middle, which imply poorly-connected loss landscapes. Subplots in the top-right corner show convex-like curves, which suggest that the two models at the two ends are not sufficiently well trained. Subplots in the bottom-right corner show flat curves with error close to 0, which suggest reasonably well-connected loss landscapes and reasonably well-trained models.

Appendix B Implementation details

In this section, we provide some implementation details for our empirical evaluations.

B.1 Datasets

For most of our experiments, we use CIFAR-10 [41]. We provide additional results on SVHN [93] and CIFAR-100 in Appendix D.1. For natural language processing, we use IWSLT16 German-to-English (De-En), a common machine translation dataset [94]; the results on IWSLT16 De-En are given in Appendix D.4. Following [42], we randomly sample 4K sentence pairs for training. In the experiments on training with random labels, e.g., Figures 4 and 8, we randomly pick a certain percentage of training samples and change each of their labels to a random target different from the original label, as sketched below.
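As an illustration, label randomization of this form can be implemented as follows; the function name and NumPy-based bookkeeping are illustrative, and only the rule that the new label differs from the original one is essential.

```python
import numpy as np

def randomize_labels(labels, fraction, num_classes=10, seed=0):
    """Replace `fraction` of the labels with a uniformly random *different* class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    shift = rng.integers(1, num_classes, size=len(idx))   # shift by 1, ..., num_classes - 1
    labels[idx] = (labels[idx] + shift) % num_classes     # guaranteed to differ from the original
    return labels
```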

B.2 Architectures

We use ResNets [40] in the standard setting, and we scale the network width to change the size of the network. For ResNet18, which contains four major blocks whose channel widths are proportional to a base width $k$, we select different values of $k$ to obtain ResNets of different widths. Similarly, we use VGG11, whose block channel widths are proportional to a base width $k$, and we vary $k$; see Appendix D.2. For Transformers, following [95, 42], we use a six-layer Transformer with eight attention heads, and we vary the embedding dimension to change the model width. The experiments on Transformers are reported in Appendix D.4.

B.3 Training procedures

In the standard setting, we purposely use a constant batch size, learning rate, and weight decay throughout training, in order to study the interactions between temperature-like parameters, load-like parameters, and the loss landscape. We also provide results when training with learning rate decay in Figure 3. Note that this also means that, when training with different batch sizes, we do not change the learning rate accordingly, which differs from the commonly used "linear scaling rule" [45]. We provide additional results on tuning the batch size with the linear scaling rule in Appendix D.7.2.

For training on CIFAR-10, following [15], we use SGD and stop training if the change in training loss is smaller than 0.0001 for 5 consecutive epochs; if this criterion is never met, we train for 150 epochs and save the model with the best training loss (a sketch of this stopping rule is given below). For the standard setting, we train with learning rate 0.05, batch size 128, and weight decay 5e-4. For training on IWSLT16, following [42], we use Adam and train for 80K gradient updates, with 4K steps of linear warmup, 10% label smoothing, and no dropout. We repeat each experiment for five individual runs with random initializations and average the results. We observe that using data augmentation (such as random flipping and cropping) makes training with noisy labels harder to converge. Thus, again to avoid confounding factors, we do not apply data augmentation in this paper. It should be noted, however, that data augmentation can improve test accuracy if it is used properly.
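A minimal sketch of the stopping rule described above (our own illustration; the exact bookkeeping in our training scripts may differ):

```python
def should_stop(train_losses, tol=1e-4, patience=5, max_epochs=150):
    """Stop when the training loss has changed by less than `tol` for `patience`
    consecutive epochs, or when `max_epochs` epochs have been reached."""
    if len(train_losses) >= max_epochs:
        return True
    if len(train_losses) <= patience:
        return False
    recent = train_losses[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol for i in range(patience))
```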

B.4 Hyperparameters for different metrics

To compute Hessian information, we use the PyHessian software [10] to measure the Hessian trace and the leading eigenvalues. We find that using one batch of 200 random samples already gives stable results, and so we use that in all of our experiments. PyHessian uses the power iteration method to measure the leading eigenvalues, and it uses Hutchinson's method to measure the Hessian trace. The maximum number of iterations used in these methods is set to 100, and a relative tolerance of 1e-3 is used to early-stop the computation.
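For reference, the measurement roughly follows the usage pattern below; `model`, `criterion`, and `train_set` are assumed to be defined already, and the keyword arguments reflect the PyHessian interface as we understand it (consult the PyHessian documentation for the authoritative API).

```python
import torch
from pyhessian import hessian  # PyHessian [10]

# One batch of 200 random training samples already gives stable estimates.
loader = torch.utils.data.DataLoader(train_set, batch_size=200, shuffle=True)
inputs, targets = next(iter(loader))

hess = hessian(model, criterion, data=(inputs, targets), cuda=torch.cuda.is_available())
top_eigenvalues, _ = hess.eigenvalues(maxIter=100, tol=1e-3, top_n=1)  # power iteration
trace_samples = hess.trace(maxIter=100, tol=1e-3)                      # Hutchinson's method
print(top_eigenvalues[0], sum(trace_samples) / len(trace_samples))
```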

To compute CKA and mode connectivity, see Appendix A.1 and A.2, respectively.

B.5 Computing infrastructure

All experiments use NVIDIA GPU servers as computing nodes. Each server contains 8 Tesla V100 GPUs. The experiments are implemented in PyTorch. Each test accuracy plot in the main paper requires several days on one server to reproduce. The exact time depends on the granularity of the plot and the data/model configurations.

Appendix C Learning rate decay helps the most when the loss landscape is close to being well-connected

In this section, we continue to study the improvement in test accuracy provided by learning rate decay. If we find the optimal test accuracy for each width value in Figure 2(a) (i.e., the optimal test accuracy in each column slice of Figure 2(a)) and compare it with the optimal test accuracy for each width value in Figure 1(a), we obtain the accuracy improvement that comes purely from using learning rate decay. The improved accuracy is shown in Figure 14. Observe that the peak of the improvement occurs almost exactly at the transition between the globally well-connected and poorly-connected phases. We conjecture that: i) training with a large temperature at the beginning, in order to explore the loss landscape, helps when the loss landscape is poorly-connected; and ii) this exploration helps the most when the loss landscape is close to being well-connected. When the loss landscape is very well-connected, e.g., when the width is large, the improvement shrinks.
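If the two test-accuracy grids are available as arrays, the improvement curve in Figure 14 amounts to a per-width difference of column-wise maxima. A minimal sketch, assuming rows index the temperature values and columns index the width values (names are ours):

```python
import numpy as np

def lr_decay_improvement(acc_with_decay, acc_without_decay):
    """Per-width improvement from learning rate decay.

    Both inputs are accuracy grids of shape (num_temperatures, num_widths);
    for each width (column slice), we take the best accuracy over temperatures.
    """
    return np.asarray(acc_with_decay).max(axis=0) - np.asarray(acc_without_decay).max(axis=0)
```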

Figure 14: (Improvement from learning rate decay). Left. Best test accuracy with or without learning rate decay. Right. Improved accuracy due to learning rate decay. Transition between globally well-connected and poorly-connected regions coincides with the peak of improved accuracy.

Appendix D Additional results

In this section, we provide additional empirical results supporting those presented in the main paper.

D.1 Additional datasets

In this subsection, we provide results on additional datasets.

First, we show results on the SVHN dataset. See Figure 15, which shows similar plots, but at a lower resolution. Again, we reach the same conclusion: the best test accuracy is achieved when mode connectivity is close to zero, CKA similarity is large, and the Hessian eigenvalue and trace are small. Interestingly, for SVHN the smallest Hessian trace is achieved when the batch size is small, and in this case the test accuracy is not optimal. Thus, one has to use mode connectivity and CKA to find the optimal test accuracy. Next, we show results on CIFAR-100; see Figure 16. The results are quite similar to Figure 2.

Figure 15: (SVHN). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. Models are trained with ResNet18 on SVHN. All plots are on the same set of axes.
Figure 16: (CIFAR-100). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. Models are trained with ResNet18 on CIFAR-100. All plots are on the same set of axes.

D.2 Additional network architectures

In this subsection, we show results on VGG networks, in addition to the ResNets studied in the main paper. See the results in Figure 17.

Figure 17: (VGG11). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. Models are trained with VGG11 on CIFAR-10. All plots are on the same set of axes.

D.3 Additional results on double descent and noisy labels

In this subsection, we provide an additional experiment on training with noisy labels and the double descent phenomenon. In Figure 18, we show a result analogous to Figure 4, but with learning rate as the temperature. The results in Figure 18 are almost identical to those in Figure 4. In particular, the optimal test accuracy for a column slice on the left part can be achieved in Phase I instead of Phase III.

From the results shown in Figure 18 and Figure 4, an operational way to decide whether we should train to zero loss is to use mode connectivity. From the mode connectivity plot in Figure 4(e), for a specific width value, we first train with low temperature and check whether the mode connectivity mc (defined in Eqn. (4)) is close to zero, i.e., whether it falls into the bottom-right white region. If mc is indeed close to zero (when trained with low temperature), we can safely train to zero training loss. However, if mc has a large negative value, the loss landscape is still relatively poorly-connected, and training to zero loss may harm test accuracy. In this case, one should first explore the loss landscape to find better minima before reducing the training loss to zero. Note that it would be an incorrect interpretation of this observation to say that we should always passively avoid reducing the training loss to zero as long as mc is negative (i.e., as long as the loss landscape is poorly-connected): one can instead increase the model width to improve mode connectivity, or design exploration schemes before training to zero loss.

Figure 18: (Training to noisy labels, with learning rate as temperature). Panels: (a) test accuracy, (b) training loss, (c) Hessian eigenvalue, (d) Hessian trace, (e) mode connectivity, (f) CKA similarity, (g) distance. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using learning rate as the temperature and varying model width to change load. 10% of labels are randomized, and double descent is observed between phases. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

A side note is that, apart from distinguishing the two subcategories in Phase IV, CKA can also help categorize the other phases. In particular, for Figure 2, CKA(Phase I and II) < CKA(Phase III) < CKA(Phase IV). However, as seen in Figure 4, the CKA in Phase III (globally poorly-connected and locally flat) can become worse than in Phases I and II, suggesting that both global and local structural deficiencies can drive the large dissimilarity between models. (Clearly, this suggests the need for improved global metrics to compare models.) What remains a consistent trend, however, is that Phase IV-B always has the largest CKA similarity.

D.4 Results on machine translation

In this subsection, we show results on the neural machine translation task IWSLT 2016 De-En. In this experiment, we still define load as the width of the Transformer model, i.e., the dimension of the embedding vectors. See the results in Figure 19. Similar to Figure 4, the results exhibit both width-wise and temperature-wise double descent. The first type of double descent matches previous work [42].

More importantly, the mode connectivity shown in Figure 19(c) remains poor even if we increase the width of the model to a large value. We also note that we only use 4K samples, which, according to the results on subsampled CIFAR-10 shown in Figure 6, means that the mode connectivity should approach zero more easily than if we used the whole dataset. Thus, the mode connectivity result here suggests that the loss landscape in this machine translation task is significantly worse than that of image classification on CIFAR-10 (shown in Figure 2 and Figure 4). This suggests that, even for the large-width Transformers, we still have not transitioned to globally well-connected loss landscapes; in particular, the entire Figure 19(a) only covers the "top-left corner" of the 2D phase diagram shown in Figure 2, i.e., Phase I. From the discussion in Appendix D.3, one should be careful about training to zero loss in globally poorly-connected loss landscapes. Indeed, we show below that early stopping can significantly reduce the test cross-entropy loss. It is worth noting that training in the top-right corner of each subplot in Figure 19 has difficulty converging, due to the large temperature and the large size of the Transformer, so we explicitly mark that region with "NC" (meaning "not converged").

Figure 19: (Machine translation). Panels: (a) test CE loss, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width (token embedding dimension) to change load. Models are trained with Transformers on IWSLT 2016 De-En with 4K subsamples. All plots are on the same set of axes. Mode connectivity shows that the loss landscape is poorly-connected even for a large embedding dimension. We find training in the upper-right corner of each subplot hard to converge.

Optimal early stopping helps when the global connectivity is low. We now provide additional results for the machine translation task. In particular, we report results on training with optimal early stopping and on training with an inverse square-root learning rate. Training with optimal early stopping means that we choose the best test accuracy over the entire training process [42], which serves to show the theoretically optimal accuracy improvement obtainable from early stopping. See Figure 20.

Figure 20: (Early stopping). Panels: (a) constant learning rate with 80K gradient updates; (b) inverse square-root learning rate with 80K gradient updates; (c) constant learning rate with optimal early stopping; (d) inverse square-root learning rate with optimal early stopping. The first row trains for 80K gradient updates and the second row uses optimal early stopping; the first column uses a constant learning rate and the second column an inverse square-root learning rate. The main conclusion is that training with optimal early stopping significantly improves test accuracy in this case.

Comparing the left and right columns in Figure 20, we see that the inverse square-root learning rate does not significantly change the results.

However, comparing the first and second rows in Figure 20, we see that optimal early stopping significantly improves the test accuracy. This is expected: it has been observed in [42] that optimal early stopping can significantly mitigate double descent. Since the global connectivity in this task is low even for large widths (shown in Figure 19(c)), the observation here further supports our conclusion from Appendix D.3 that one should not completely fit the training data when the global connectivity is low.

D.5 Additional temperature parameters

In this subsection, we reproduce the results in Figure 2, but with the temperature parameter changed from batch size to learning rate. See the results shown in Figure 21; they are very similar to those in Figure 2.

Figure 21: (Learning rate as temperature). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using learning rate as the temperature and varying model width to change load. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

D.6 Additional ways to change load

In this subsection, we present results analogous to Figure 8, but we change the load parameter by varying the amount of additive noise on each image, instead of the amount of randomized labels. More specifically, we add random noise, uniformly distributed with magnitude $c$, to each pixel of the image, and we vary $c$. See the results shown in Figure 22.

When we compare Figure 22 to Figure 8, we find that the mode connectivity in Figure 22(c) indicates a well-connected loss landscape regardless of the amount of noise. We conjecture that this is because noise added to each pixel does not degrade the quality of the data as substantially; in particular, this type of noise is not as damaging as randomizing the labels. However, CKA still captures the degradation of data quality: if we move from right to left (i.e., adding more noise) along the bottom of Figure 22(d), we see a gradually darker color.

Figure 22: (Varying additive noise on images to change load). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying the amount of additive noise on each image to change load. Models are trained with ResNet18 on CIFAR-10. Mode connectivity remains close to zero even for a large amount of additive noise on each image. All plots are on the same set of axes.

D.7 Additional training schemes

In this subsection, we provide results for additional training schemes.

D.7.1 Large-batch training

Figure 23: (Large-batch training). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. Large-batch training is used. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

Here, we present results analogous to our main results in Figure 2, but with a large batch size. We increase the maximum batch size to 8192 in this subsection. See results in Figure 23.

The results show a trend similar to Figure 2, in that the best accuracy is achieved when the Hessian is small, the mode connectivity is close to zero, and the CKA similarity is large. The only slight difference is that the Hessian becomes much larger than in Figure 2 when training with a large batch size. We note that this observation matches the common belief that local minima become sharper in large-batch training [3].

D.7.2 Training with the linear scaling rule

Here, we present results analogous to Figure 2, but train with the linear scaling rule, i.e., when the batch size is multiplied by a constant $k$, the learning rate is multiplied by the same constant [45]. We choose the "standard setting" to be training with learning rate 0.05 and batch size 128, which matches the settings in the other experiments, and we change the learning rate to $0.05k$ when we change the batch size to $128k$. The results are shown in Figure 24.
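Concretely, with our standard setting as the anchor point, the rule amounts to the following (a hypothetical helper for illustration):

```python
BASE_LR, BASE_BATCH_SIZE = 0.05, 128   # the "standard setting"

def linearly_scaled_lr(batch_size):
    """Linear scaling rule [45]: learning rate 0.05k for batch size 128k."""
    return BASE_LR * batch_size / BASE_BATCH_SIZE

# Examples: batch size 256 -> learning rate 0.1; batch size 1024 -> learning rate 0.4.
```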

First, if we compare the test accuracy in Figure 24(a) to that obtained without the linear scaling rule, shown in Figure 2(a), we see that the test accuracy changes much more slowly with batch size (along the Y-axis) in the former case. Moreover, the mode connectivity and the Hessian almost do not change at all along the Y-axis. This phenomenon is expected, because the linear scaling rule aims to maintain a constant noise variance when scaling up the batch size.

Figure 24: (Tuning learning rate with batch size). Panels: (a) test accuracy, (b) Hessian trace, (c) mode connectivity, (d) CKA similarity. Partitioning the 2D load-like versus temperature-like diagram into different phases of learning, using batch size as the temperature and varying model width to change load. The linear scaling rule is used to tune the learning rate with the batch size. Models are trained with ResNet18 on CIFAR-10. All plots are on the same set of axes.

Appendix E More on the bifrequency quadratic model

Recall the noisy quadratic model used to introduce the effective loss landscape,

(15)

and the associated Gibbs distribution at temperature $T$, $p_T(w) \propto e^{-L(w)/T}$, where $L$ denotes the loss in (15). Given our study of Hessians in the main text, it is natural to ask whether similar properties can be observed in this simple model as well. In particular, a quantity one might consider is the expected Hessian under the Gibbs distribution,

$\mathbb{E}_{w \sim p_T} \left[ L''(w) \right] = \int L''(w) \, p_T(w) \, \mathrm{d}w. \qquad (16)$

Directly computing these integrals numerically is difficult, in particular because computing the normalizing constant leads to numerical instability. Instead, we use a Laplace approximation around each of the minima of $L$. Specifically, we consider the approximate distribution

$\tilde{p}_T(w) = \sum_i \pi_i \, \mathcal{N}\!\left( w; \, w_i, \, \tfrac{T}{L''(w_i)} \right), \qquad (17)$

where the $w_i$ are the local minima of $L$, $\pi_i \propto e^{-L(w_i)/T} \sqrt{2 \pi T / L''(w_i)}$ (normalized so that $\sum_i \pi_i = 1$), and $\mathcal{N}(w; w_i, T/L''(w_i))$ is the PDF of the Gaussian distribution with mean $w_i$ and variance $T / L''(w_i)$. Using this approximate distribution, the expected Hessian can be computed in closed form:

(18)

In Figure 25, we plot the expected Hessians of the bifrequency quadratic model for varying values of the load-like parameter and the temperature $T$. We observe that varying these two parameters can produce a large variety of local/Hessian behavior, analogous to the behavior observed in our empirical results throughout this paper. We expect that quantitatively reproducing the Hessian properties of real NNs will require going beyond a simple 1D model.
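This kind of computation can be reproduced in a few lines for any smooth 1D loss. Since Eqn. (15) is not repeated here, the sketch below uses a stand-in loss (a quadratic envelope plus ruggedness on two scales) and, for simplicity, replaces the Gaussian expectation implicit in Eqn. (18) by the curvature at each minimum; it is meant only to illustrate the Laplace-mixture recipe, not to reproduce our exact model or figure.

```python
import numpy as np

def loss(w, load=1.0):
    # Stand-in rugged 1D loss: quadratic envelope plus ruggedness on two scales.
    return 0.5 * load * w ** 2 + 0.5 * np.cos(2 * w) + 0.1 * np.cos(10 * w)

def curvature(w, load=1.0, eps=1e-4):
    # Second derivative by central finite differences.
    return (loss(w + eps, load) - 2 * loss(w, load) + loss(w - eps, load)) / eps ** 2

def expected_hessian(load, T, grid=np.linspace(-4, 4, 4001)):
    L = loss(grid, load)
    is_min = (L[1:-1] < L[:-2]) & (L[1:-1] < L[2:])          # grid-level local minima
    w_min = grid[1:-1][is_min]
    h = np.array([curvature(w, load) for w in w_min])
    w_min, h = w_min[h > 0], h[h > 0]
    # Laplace weight of each minimum: exp(-L(w_i)/T) * sqrt(2*pi*T / L''(w_i)), normalized.
    logw = -loss(w_min, load) / T + 0.5 * np.log(2 * np.pi * T / h)
    pi = np.exp(logw - logw.max())
    pi /= pi.sum()
    return float(np.sum(pi * h))                              # ~ E_{p_T}[L''(w)]

for T in (0.01, 0.1, 1.0):
    print(T, expected_hessian(load=1.0, T=T))
```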

Figure 25: Expected Hessians of the bifrequency quadratic model.