DNN's Sharpest Directions Along the SGD Trajectory

07/13/2018 ∙ by Stanisław Jastrzębski, et al.

Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.


1 Introduction

Deep Neural Networks (DNNs) are often massively over-parameterized (Zhang et al., 2016), yet show state-of-the-art generalization performance on a wide variety of tasks when trained with Stochastic Gradient Descent (SGD). While understanding the generalization capability of DNNs remains an open challenge, it has been hypothesized that SGD acts as an implicit regularizer, limiting the complexity of the found solution (Poggio et al., 2017; Advani and Saxe, 2017; Wilson et al., 2017; Jastrzębski et al., 2017).

Various links between the curvature of the final minima reached by SGD and generalization have been studied (Murata et al., 1994; Neyshabur et al., 2017). In particular, it is a popular view that models corresponding to wide minima of the loss in the parameter space generalize better than those corresponding to sharp minima (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Jastrzębski et al., 2017; Wang et al., 2018). The existence of this empirical correlation between the curvature of the final minima and generalization motivates our study.

Our work aims at understanding the interaction between SGD and the sharpest directions of the loss surface, i.e. those corresponding to the largest eigenvalues of the Hessian. In contrast to studies such as those by Keskar et al. (2016) and Jastrzębski et al. (2017) our analysis focuses on the whole training trajectory of SGD rather than just on the endpoint. We will show in Sec. 3.1 that the evolution of the largest eigenvalues of the Hessian follows a consistent pattern for the different networks and datasets that we explore. Initially, SGD is in a region of broad curvature, and as the loss decreases, SGD visits regions in which the top eigenvalues of the Hessian are increasingly large, reaching a peak value with a magnitude influenced by both learning rate and batch size. After that point in training, we typically observe a decrease or stabilization of the largest eigenvalues.

To further understand this phenomenon, we study the dynamics of SGD in relation to the sharpest directions in Sec. 3.2 and Sec. 3.3. Projecting to the sharpest directions (that is, considering $\langle g, e_i \rangle e_i$ for different $i$, where $g$ is the gradient and $e_i$ is the normalized eigenvector corresponding to the $i$-th largest eigenvalue of the Hessian), we see that the regions visited in the beginning resemble bowls with curvatures such that an SGD step is typically too large, in the sense that an SGD step cannot get near the minimum of this bowl-like subspace; rather it steps from one side of the bowl to the other (see Fig. 1 for an illustration).
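The bowl picture can be made concrete with a one-dimensional quadratic model of the loss along a single sharpest direction. This is an illustrative simplification rather than a derivation taken from the paper; the symbols $x$ (coordinate along the direction), $\lambda$ (curvature) and $\eta$ (learning rate) are introduced here only for the sketch, and the update is a noiseless gradient step.

```latex
\begin{align*}
L(x) &= \tfrac{1}{2}\lambda x^2, \qquad
x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta\lambda)\, x_t, \\
L(x_{t+1}) &= (1 - \eta\lambda)^2 \, L(x_t).
\end{align*}
```

Under these assumptions, a step with $0 < \eta\lambda < 1$ stays on the same side of the bowl, a step with $1 < \eta\lambda < 2$ crosses the minimum while still lowering the loss, and a step with $\eta\lambda > 2$ crosses the minimum and raises the loss.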

Finally in Sec. 4 we study further practical consequences of our observations and investigate an SGD variant which uses a reduced and fixed learning rate along the sharpest directions. In most cases we find this variant optimizes faster and leads to a sharper region, which generalizes the same or better compared to vanilla SGD with the same (small) learning rate. While we are not proposing a practical optimizer, these results may open a new avenue for constructing effective optimizers tailored to the DNNs’ loss surface in the future.

On the whole this paper exposes and analyses SGD dynamics in the subspace of the sharpest directions. In particular, we argue that the SGD dynamics along the sharpest directions influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the training speed, and the final generalization capability.

2 Experimental setup and notation

Figure 1: Left: Outline of the phenomena discussed in the paper. Curvature along the sharpest direction(s) initially grows (A to C). In most iterations, we find that SGD crosses the minimum if restricted to the subspace of the sharpest direction(s) by taking a too large step (B and C). Finally, curvature stabilizes or decays with a peak value determined by learning rate and batch size (C, see also right). Right two: Representative example of the evolution of the top (decreasing, red to blue) eigenvalues of the Hessian for a SimpleCNN model during training (with , note that is close to ).

We perform experiments mainly on Resnet-32 and a simple convolutional neural network, which we refer to as SimpleCNN (details in Appendix D), and the CIFAR-10 dataset (Krizhevsky et al., ). In Resnet-32 we omit Batch-Normalization layers due to their interaction with the loss surface curvature (Bjorck et al., 2018) and use initialization scaled by the depth of the network (Taki, 2017); additional results on Batch-Normalization are presented in the Appendix. SimpleCNN is a 4 layer CNN, achieving roughly test accuracy on the CIFAR-10 dataset. For training both of the models we use standard data augmentation on CIFAR-10 and for Resnet-32 L2 regularization with coefficient . We additionally investigate the training of VGG-11 (Simonyan and Zisserman, 2014) on the CIFAR-10 dataset (we adapted the final classification layers for 10 classes) and of a bidirectional Long Short Term Memory (LSTM) model (following the “small” architecture employed by Zaremba et al. (2014), with added dropout regularization of ) on the Penn Tree Bank (PTB) dataset. All models are trained using SGD, without using momentum, unless stated otherwise.

The notation and terminology we use in this paper are now described. We will use $t$ (time) to refer to epoch or iteration, depending on the context. By $\eta$ and $S$ we denote the SGD learning rate and batch size, respectively. $H$ is the Hessian of the empirical loss at the current $D$-dimensional parameter value $\theta(t)$ evaluated on the training set, and its eigenvalues are denoted as $\lambda_i$, $i = 1, \ldots, D$ (ordered by decreasing absolute magnitudes). $\lambda_{\max} = \lambda_1$ is the maximum eigenvalue, which is equivalent to the spectral norm of $H$. The top $K$ eigenvectors of $H$ are denoted by $e_i$, for $i \in \{1, \ldots, K\}$, and referred to in short as the sharpest directions. We will refer to the mini-batch gradient calculated based on a batch of size $S$ as $g^{(S)}(t)$ and to $-\eta\, g^{(S)}(t)$ as the SGD step. We will often consider the projection of this gradient onto one of the top eigenvectors, given by $\tilde{g}_i(t)\, e_i(t)$, where $\tilde{g}_i(t) \equiv \langle g^{(S)}(t), e_i(t) \rangle$. Computing the full spectrum of the Hessian for reasonably large models is computationally infeasible. Therefore, we approximate the top $K$ (up to ) eigenvalues using the Lanczos algorithm (Lanczos, 1950; Dauphin et al., 2014), an extension of the power method, on approximately of the training data (using more data was not beneficial). When regularization was applied during training (such as dropout, L2 or data augmentation), we apply the same regularization when estimating the Hessian. This ensures that the Hessian reflects the loss surface accessible to SGD. The code for the project is made available at https://github.com/kudkudak/dnn_sharpest_directions.
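As a rough illustration of how a top eigenvalue can be estimated without ever forming the Hessian, the sketch below uses plain power iteration, a simpler relative of the Lanczos algorithm named above and not the method used in the paper. It relies only on Hessian-vector products, approximated here by finite differences of the gradient. The toy quadratic loss and all function names (`toy_gradient`, `hessian_vector_product`, `top_eigenpair`) are assumptions made for this example.

```python
import math


def toy_gradient(theta):
    """Gradient of the toy loss L = 0.5 * theta^T A theta, with A = [[4, 1], [1, 1]]."""
    x, y = theta
    return [4.0 * x + 1.0 * y, 1.0 * x + 1.0 * y]


def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H v with a central finite difference of the gradient."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def normalize(v):
    """Rescale v to unit Euclidean norm."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def top_eigenpair(grad_fn, theta, num_iters=100):
    """Power iteration: returns (largest-magnitude eigenvalue, unit eigenvector)."""
    # Fixed starting vector; it must not be orthogonal to the top eigenvector.
    v = normalize([1.0] * len(theta))
    for _ in range(num_iters):
        v = normalize(hessian_vector_product(grad_fn, theta, v))
    hv = hessian_vector_product(grad_fn, theta, v)
    eigenvalue = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
    return eigenvalue, v


lam_max, e1 = top_eigenpair(toy_gradient, theta=[0.3, -0.2])
```

For a real network the gradient function would be a mini-batch gradient evaluated on a fixed data subset, and a Lanczos-type solver would usually be preferred because it recovers several top eigenpairs at once.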

3 A study of the Hessian along the SGD path

In this section, we study the eigenvalues of the Hessian of the training loss along the SGD optimization trajectory, and the SGD dynamics in the subspace corresponding to the largest eigenvalues. We highlight that SGD steers from the beginning towards increasingly sharp regions until some maximum is reached; at this peak the SGD step length is large compared to the curvature along the sharpest directions (see Fig. 1 for an illustration). Moreover, SGD visits flatter regions for a larger learning rate or a smaller batch-size.

3.1 Largest eigenvalues of the Hessian along the SGD path

(a) SimpleCNN
(b) Resnet-32
(c) SimpleCNN (zoom)
(d) Resnet-32 (zoom)
Figure 2: Top: Evolution of the top eigenvalues of the Hessian for SimpleCNN and Resnet-32 trained on the CIFAR-10 dataset with and . Bottom: Zoom on the evolution of the top eigenvalues in the beginning of training. A sharp initial growth of the largest eigenvalues followed by an oscillatory-like evolution is visible. Training and test accuracy of the corresponding models are provided for reference.

We first investigate the training loss curvature in the sharpest directions, along the training trajectory for both the SimpleCNN and Resnet-32 models.

Largest eigenvalues of the Hessian grow initially.

In the first experiment we train SimpleCNN and Resnet-32 using SGD with and and estimate the largest eigenvalues of the Hessian, throughout training. As shown in Fig. 2 (top) the spectral norm (which corresponds to the largest eigenvalue), as well as the other tracked eigenvalues, grows in the first epochs up to a maximum value. After reaching this maximum value, we observe a relatively steady decrease of the largest eigenvalues in the following epochs.

To investigate the evolution of the curvature in the first epochs more closely, we track the eigenvalues at each iteration during the beginning of training, see Fig. 2 (bottom). We observe that initially the magnitudes of the largest eigenvalues grow rapidly. After this initial growth, the eigenvalues alternate between decreasing and increasing; this behaviour is also reflected in the evolution of the accuracy. This suggests SGD is initially driven to regions that are difficult to optimize due to large curvature.

To study this further we look at full-batch gradient descent training of Resnet-32 (to avoid memory limitations, we preselected the first examples of CIFAR-10 to simulate full-batch gradient training for this experiment). This experiment is also motivated by the instability of large-batch size training reported in the literature, for example by Goyal et al. (2017). In the case of Resnet-32 (without Batch-Normalization) we can clearly see that the magnitude of the largest eigenvalues of the Hessian grows initially, which is followed by a sharp drop in accuracy suggesting instability of the optimization, see Fig. 3. We also observed that the instability is partially solved through use of Batch-Normalization layers, consistent with the findings of Bjorck et al. (2018), see Fig. 9 in the Appendix. Finally, we report some additional results on the late phase of training, e.g. the impact of learning rate schedules, in Fig. 12 in the Appendix.

(a) Resnet-32,
(b) Resnet-32,
Figure 3: Full batch-size training of Resnet-32 for and (left) (right) on CIFAR-10. Evolution of the top eigenvalues of the Hessian and accuracy are shown in each case. The training is unstable; an initial growth of curvature scale is followed by a sudden drop in accuracy. The CIFAR-10 dataset was subsampled to points.
(a) Effect of learning rate on curvature along SGD trajectory
(b) Effect of batch size on curvature along SGD trajectory
Figure 4: Evolution of the two largest eigenvalues (solid and dashed line) of the Hessian for Resnet-32, SimpleCNN, and LSTM (trained on the PTB dataset) models on a log-scale for different learning rates (top) and batch-sizes (bottom). Blue/green/red correspond to increasing learning rate $\eta$ and decreasing batch size $S$ in each figure. The left side of the vertical blue bar in each plot corresponds to the early phase of training. A larger learning rate or a smaller batch-size correlates with a smaller and earlier peak of the spectral norm and the next largest eigenvalue.

Learning rate and batch-size limit the maximum spectral norm.

Next, we investigate how the choice of learning rate and batch size impacts the SGD path in terms of its curvatures. Fig. 4 shows the evolution of the two largest eigenvalues of the Hessian during training of the SimpleCNN and Resnet-32 on CIFAR-10, and an LSTM on PTB, for different values of the learning rate $\eta$ and batch size $S$. We observe in this figure that a larger learning rate or a smaller batch-size correlates with a smaller and earlier peak of the spectral norm and the next largest eigenvalue. Note that the phase in which curvature grows for low learning rates or large batch sizes can take many epochs. Additionally, momentum has an analogous effect: using a larger momentum leads to a smaller peak of the spectral norm, see Fig. 13 in the Appendix. Similar observations hold for VGG-11 and Resnet-32 using Batch-Normalization, see Appendix A.1.

Summary.

These results show that the learning rate and batch size not only influence the SGD endpoint maximum curvature, but also impact the whole SGD trajectory. A high learning rate or a small batch size limits the maximum spectral norm along the path found by SGD from the beginning of training. While this behavior was observed in all settings examined (see also the Appendix), future work could focus on a theoretical analysis, helping to establish the generality of these results.

3.2 Sharpest direction and SGD step

The training dynamics (which, as we will see later, affect the speed and generalization capability of learning) are significantly affected by the evolution of the largest eigenvalues discussed in Section 3.1. To demonstrate this we study the relationship between the SGD step and the loss surface shape in the sharpest directions. As we will show, SGD dynamics are largely coupled with the shape of the loss surface in the sharpest direction, in the sense that when projected onto the sharpest direction, the typical step taken by SGD is too large compared to curvature to enable it to reduce loss. We study the same SimpleCNN and Resnet-32 models as in the previous experiment in the first epochs of training with SGD with $\eta = 0.01$ and .

The sharpest direction and the SGD step.

First, we investigate the relation between the SGD step and the sharpest direction by looking at how the loss value changes on average when moving from the current parameters taking a step only along the sharpest direction - see Fig. 6 left. For all training iterations, we compute , for ; the expectation is approximated by an average over different mini-batch gradients. We find that increases relative to for , and decreases for . More specifically, for SimpleCNN we find that and lead to a and increase in loss, while and both lead to a decrease of approximately . For Resnet-32 we observe a and increase for and , and approximately a decrease for and , respectively. The observation that SGD step does not minimize the loss along the sharpest direction suggests that optimization is ineffective along this direction. This is also consistent with the observation that learning rate and batch-size limit the maximum spectral norm of the Hessian (as both impact the SGD step length).
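The measurement described above can be sketched on a toy problem: average, over several noisy mini-batch gradients, the change in loss caused by a step taken only along the sharpest direction and scaled by a factor `alpha`. Everything in this sketch is an assumption made for illustration: the quadratic loss, the Gaussian noise standing in for mini-batch sampling, the learning rate, and the `alpha` values, which are not the ones used in the paper.

```python
import random

LAMBDAS = [10.0, 1.0]  # eigenvalues of the toy Hessian (diagonal, so e_1 is the first axis)
E1 = [1.0, 0.0]        # sharpest direction of the toy loss


def loss(theta):
    """Toy quadratic loss 0.5 * sum_i lambda_i * theta_i^2."""
    return 0.5 * sum(lam * t * t for lam, t in zip(LAMBDAS, theta))


def minibatch_gradient(theta, rng, noise_std=5.0):
    """Exact gradient plus Gaussian noise, a stand-in for mini-batch sampling."""
    return [lam * t + rng.gauss(0.0, noise_std) for lam, t in zip(LAMBDAS, theta)]


def mean_loss_change(theta, eta, alpha, num_batches=1000, seed=0):
    """Average of L(theta - alpha * eta * <g, e_1> e_1) - L(theta) over noisy gradients."""
    rng = random.Random(seed)
    base = loss(theta)
    total = 0.0
    for _ in range(num_batches):
        g = minibatch_gradient(theta, rng)
        g1 = sum(gi * ei for gi, ei in zip(g, E1))  # projection coefficient <g, e_1>
        stepped = [t - alpha * eta * g1 * ei for t, ei in zip(theta, E1)]
        total += loss(stepped) - base
    return total / num_batches


theta = [1.0, 1.0]
changes = {alpha: mean_loss_change(theta, eta=0.15, alpha=alpha)
           for alpha in (0.5, 1.0, 2.0)}
```

In this toy setting the average change is negative for the shortened step (`alpha = 0.5`) and positive for the doubled step (`alpha = 2.0`), which mirrors the qualitative pattern reported in the text without reproducing its numbers.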

These dynamics are important for the overall training due to a high alignment of SGD step with the sharpest directions. We compute the average cosine between the mini-batch gradient and the top sharpest directions . We find the gradient to be highly aligned with the sharpest direction, that is, depending on and model the maximum average cosine is roughly between and . Full results are presented in Fig. 5.
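A minimal sketch of the alignment measurement follows, assuming the top eigenvectors are already available as unit vectors. Taking the absolute value of the cosine is an assumption of this example (the sign of an eigenvector is arbitrary), and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_alignment(gradient, directions):
    """Average |cos| between one gradient and each of the given directions."""
    return sum(abs(cosine(gradient, e)) for e in directions) / len(directions)


# Toy example: a 3-dimensional gradient and two "sharpest directions".
g = [3.0, 4.0, 0.0]
top_directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
alignment = mean_abs_alignment(g, top_directions)
```

In an actual experiment this quantity would additionally be averaged over many mini-batch gradients and compared against the alignment with a random vector, as done in Fig. 5.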

Figure 5: Average cosine between mini-batch gradient (y axis) and top sharpest directions (averaged over top ) for different (color) evaluated at different levels of accuracy, during training (x axis). For comparison, the horizontal purple line is alignment with a random vector in the parameter space. Curves were smoothed for clarity.

Qualitatively, SGD step crosses the minimum along the sharpest direction.

Next, we qualitatively visualize the loss surface along the sharpest direction in the first few epochs of training, see Fig. 6 (right). To better reflect the relation between the sharpest direction and the SGD step we scaled the visualization using the expected norm of the SGD step where the expectation is over mini-batch gradients. Specifically, we evaluate , where is the current parameter vector, and is an interpolation parameter (we use ). For both SimpleCNN and Resnet-32 models, we observe that the loss on the scale of starts to show a bowl-like structure in the largest eigenvalue direction after six epochs. This further corroborates the previous result that SGD step length is large compared to curvature in the sharpest direction.

Training and validation accuracy are reported in Appendix B. Furthermore, in Appendix B we demonstrate that a similar phenomenon occurs along the lower eigenvectors, for different , and in the later phase of training.

Summary.

We infer that SGD steers toward a region in which the SGD step is highly aligned with the sharpest directions and would on average increase the loss along the sharpest directions, if restricted to them. This in particular suggests that optimization is ineffective along the sharpest direction, which we will further study in Sec. 4.

Figure 6: Early on in training, SGD finds a region such that in the sharpest direction, the SGD step length is often too large compared to curvature. Experiments on SimpleCNN and Resnet-32, trained with , ; learning curves are provided in the Appendix. Left two: Average change in loss , for corresponding to red, green, and blue, respectively. On average, the SGD step length in the direction of the sharpest direction does not minimize the loss. The red points further show that increasing the step size by a factor of two leads to increasing loss (on average). Right two: Qualitative visualization of the surface along the top eigenvectors for SimpleCNN and Resnet-32 supports that SGD step length is large compared to the curvature along the top eigenvector. At iteration we plot the loss , around the current parameters , where is the expected norm of the SGD step along the top eigenvector . The x-axis represents the interpolation parameter , the y-axis the epoch, and the z-axis the loss value; the color indicates the spectral norm in the given epoch (increasing from blue to red).

3.3 How SGD steers to sharp regions in the beginning

Here we discuss the dynamics around the initial growth of the spectral norm of the Hessian. We will look at some variants of SGD which change how the sharpest directions get optimized.

Experiment.

We used the same SimpleCNN model initialized with the parameters reached by SGD in the previous experiment at the end of epoch . The parameter updates used by the three SGD variants, which are compared to vanilla SGD (blue), are based on the mini-batch gradient $g(t)$ and the top eigenvector $e_1(t)$ of the Hessian as follows: variant 1 (SGD top, orange) only updates the parameters based on the projection of the gradient on the top eigenvector direction, i.e. it follows $\langle g(t), e_1(t) \rangle e_1(t)$; variant 2 (SGD constant top, green) performs updates along the constant direction of the top eigenvector of the Hessian in the first iteration, i.e. it follows $\langle g(t), e_1(0) \rangle e_1(0)$; variant 3 (SGD no top, red) removes the gradient information in the direction of the top eigenvector, i.e. it follows $g(t) - \langle g(t), e_1(t) \rangle e_1(t)$. We show results in the left two plots of Fig. 7. We observe that if we only follow the top eigenvector, we get to wider regions but do not reach lower loss values, and conversely, if we ignore the top eigenvector we reach lower loss values but sharper regions.
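The three update rules described above can be written down in a few lines. This is a sketch of the verbal description only, not the authors' implementation; the function names are hypothetical, and the top eigenvector is assumed to be given as a unit vector.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project(g, e):
    """Component of g along the unit vector e."""
    c = dot(g, e)
    return [c * ei for ei in e]


def sgd_top(g, e1_current):
    """Variant 1: keep only the component along the current top eigenvector."""
    return project(g, e1_current)


def sgd_constant_top(g, e1_initial):
    """Variant 2: keep only the component along the top eigenvector
    computed once, at the first iteration."""
    return project(g, e1_initial)


def sgd_no_top(g, e1_current):
    """Variant 3: remove the component along the current top eigenvector."""
    p = project(g, e1_current)
    return [gi - pi for gi, pi in zip(g, p)]


def apply_update(theta, direction, eta):
    """Plain gradient step theta - eta * direction."""
    return [t - eta * d for t, d in zip(theta, direction)]


g = [2.0, 3.0]
e1 = [1.0, 0.0]
top_part = sgd_top(g, e1)         # component along e1
rest_part = sgd_no_top(g, e1)     # everything orthogonal to e1
new_theta = apply_update([1.0, 1.0], top_part, eta=0.1)
```

Variants 1 and 3 split the gradient into two orthogonal pieces that sum back to the full gradient, so vanilla SGD can be seen as applying both pieces with the same learning rate.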

Summary.

The take-home message is that SGD updates in the top eigenvector direction strongly influence the overall path taken in the early phase of training. Based on these results, we will study a related variant of SGD throughout the training in the next section.

Figure 7: Evolution of the loss and the top eigenvalue during training with variations of SGD with parameter updates restricted to certain subspaces. All variants are initialized at the last iteration in Fig. 1. An SGD variant (orange) that follows at each iteration only the projection of the gradient on the top eigenvector of the Hessian $H$ effectively finds a region with a lower top eigenvalue $\lambda_{\max}$ but increases the loss, while a variant subtracting this projection from the gradient (red) finds a sharper region while achieving a similar loss level as vanilla SGD (blue).

4 Optimizing faster while finding a good sharp region

Figure 8: Nudged-SGD for an increasing number of affected sharpest directions ($K$) optimizes significantly faster, whilst generally finding increasingly sharp regions. The experiment is run using Resnet-32 and the CIFAR-10 dataset. Left and center: Validation accuracy, and the spectral norm $\lambda_{\max}$ and Frobenius norm of the Hessian (y axis, solid and dashed), using increasing $K$ (blue to red) compared against an SGD baseline using the same learning rate $\eta$ (black), during training (x axis). Rightmost: Test accuracy (red) and Frobenius norm (blue) achieved using NSGD with an increasing $K$ (x axis) compared to an SGD baseline using the same $\eta$ (blue and red horizontal lines).

In this final section we study how the convergence speed of SGD and the generalization of the resulting model depend on the dynamics along the sharpest directions.

Our starting point is the set of results presented in Sec. 3, which show that while the SGD step can be highly aligned with the sharpest directions, on average it fails to minimize the loss if restricted to these directions. This suggests that reducing the alignment of the SGD update direction with the sharpest directions might be beneficial, which we investigate here via a variant of SGD that we call Nudged-SGD (NSGD). Our aim here is not to build a practical optimizer, but instead to see if our insights from the previous section can be utilized in an optimizer. NSGD is implemented as follows: instead of using the standard SGD update, $\Delta\theta(t) = -\eta\, g^{(S)}(t)$ (with $g^{(S)}(t)$ the mini-batch gradient and $\eta$ the learning rate), NSGD uses a different learning rate, $\eta' = \gamma \eta$, along just the top $K$ eigenvectors, while following the normal SGD gradient along all the other directions. (While NSGD can be seen as a second order method, NSGD in contrast to typical second order methods does not adapt the learning rate to be in some sense optimal given the curvature; to make this more precise, we include in Appendix E a discussion of the differences between NSGD and the Newton method.) In particular we will study NSGD with a low base learning rate , which will allow us to capture any implicit regularization effects NSGD might have. We ran experiments with Resnet-32 and SimpleCNN on CIFAR-10. Note that these are not state-of-the-art models; we leave those for future investigation.
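The NSGD update described above can be sketched as follows, assuming the top eigenvectors are supplied as an orthonormal list. The name `gamma` for the rescaling factor and the function `nsgd_update` are labels chosen for this example, and the numbers are toy values rather than settings from the experiments.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def nsgd_update(theta, g, top_eigenvectors, eta, gamma):
    """One Nudged-SGD step (sketch).

    The part of g inside the span of `top_eigenvectors` (assumed orthonormal)
    is applied with learning rate gamma * eta; the remainder with eta.
    """
    inside = [0.0] * len(g)
    for e in top_eigenvectors:
        c = dot(g, e)
        inside = [p + c * ei for p, ei in zip(inside, e)]
    outside = [gi - pi for gi, pi in zip(g, inside)]
    return [t - eta * o - gamma * eta * p
            for t, o, p in zip(theta, outside, inside)]


theta = [1.0, 1.0]
g = [10.0, 1.0]                 # large component along the sharp first axis
top = [[1.0, 0.0]]              # single "sharpest direction"

nudged = nsgd_update(theta, g, top, eta=0.1, gamma=0.01)
plain = nsgd_update(theta, g, top, eta=0.1, gamma=1.0)   # reduces to vanilla SGD
```

With `gamma = 1.0` the two learning rates coincide and the step is an ordinary SGD step, while a small `gamma` shrinks only the movement along the listed directions.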

We investigated NSGD with a different number of sharpest eigenvectors $K$, in the range between and ; and with the rescaling factor $\gamma$ . The top eigenvectors are recomputed at the beginning of each epoch. (In these experiments each epoch of NSGD takes approximately - longer compared to vanilla SGD. This overhead depends on the number of iterations needed to reach convergence in the Lanczos algorithm used for computing the top eigenvectors.) We compare the sharpness of the reached endpoint by computing both the Frobenius norm (approximated by the top eigenvectors) and the spectral norm of the Hessian. The learning rate is decayed by a factor of when validation loss has not improved for epochs. Experiments are averaged over two random seeds. When talking about generalization we will refer to the test accuracy at the best validation accuracy epoch. Results for Resnet-32 are summarized in Fig. 8 and Tab. 1; for full results on SimpleCNN we refer the reader to the Appendix, Tab. 2. In the following we will highlight the two main conclusions we can draw from the experimental data.
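The two sharpness measures can be computed from a list of estimated top eigenvalues. The sketch below uses the fact that the Frobenius norm of a symmetric matrix equals the square root of the sum of its squared eigenvalues, so truncating the sum to the top eigenvalues gives a lower bound. Whether the paper computes its approximation in exactly this way is an assumption, as are the function names.

```python
import math


def frobenius_norm_from_top_eigenvalues(eigenvalues):
    """Truncated estimate sqrt(sum_i lambda_i^2) over the available eigenvalues."""
    return math.sqrt(sum(lam * lam for lam in eigenvalues))


def spectral_norm_from_eigenvalues(eigenvalues):
    """Largest absolute eigenvalue among those provided."""
    return max(abs(lam) for lam in eigenvalues)


top_eigenvalues = [4.0, -3.0]   # toy values
frob = frobenius_norm_from_top_eigenvalues(top_eigenvalues)
spec = spectral_norm_from_eigenvalues(top_eigenvalues)
```

The truncated estimate is tight when the remaining eigenvalues are small compared to the leading ones.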

Test acc. Val. acc. (50) Loss Dist.
/
/
/
/
SGD(0.005) /
SGD(0.1) /
Table 1: Nudged-SGD optimizes faster and finds a sharper final endpoint with a slightly better generalization performance. Experiments on the Resnet-32 model on the CIFAR-10 dataset. Columns: the Frobenius norm of the Hessian at the best validation point and the final epoch; test accuracy; validation accuracy at epoch 50; cross-entropy loss in the final epoch; distance of the parameters from the initialization to the best validation epoch parameters. Experiments were performed with and . In the last rows we report SGD using .

NSGD optimizes faster, whilst traversing a sharper region.

First, we observe that in the early phase of training NSGD optimizes significantly faster than the baseline, whilst traversing a region which is an order of magnitude sharper. We start by looking at the impact of $K$, which controls the number of eigenvectors with an adapted learning rate; we test $K$ in with a fixed . On the whole, increasing $K$ correlates with a significantly improved training speed and with visiting much sharper regions (see Fig. 8). We highlight that NSGD with reaches a maximum of approximately compared to baseline SGD reaching approximately . Further, NSGD retains an advantage of over ( for SimpleCNN) validation accuracy, even after epochs of training (see Tab. 1).

NSGD can improve the final generalization performance, while finding a sharper final region.

Next, we turn our attention to the results on the final generalization and sharpness. We observe from Tab. 1 that using a reduced learning rate along the sharpest directions ($\gamma < 1$) can result in finding a significantly sharper endpoint exhibiting a slightly improved generalization performance compared to baseline SGD using the same base learning rate. On the contrary, using an increased one ($\gamma > 1$) led to a wider endpoint and worse generalization, perhaps due to the added instability. Finally, using a larger $K$ generally correlates with an improved generalization performance (see Fig. 8, right).

More specifically, baseline SGD using the same learning rate reached test accuracy with the Frobenius norm of the Hessian ( with on SimpleCNN). In comparison, NSGD using found endpoint corresponding to test accuracy and ( and on SimpleCNN). Finally, note that in the case of Resnet-32 leads to test accuracy and which closes the generalization gap to SGD using . We note that runs corresponding to generally converge at final cross-entropy loss around and over training accuracy.

As discussed in Sagun et al. (2017), the structure of the Hessian can be highly dataset dependent, thus the demonstrated behavior of NSGD could be dataset dependent as well. In particular, the impact of NSGD on the final generalization can be dataset dependent. In Appendix C and Appendix F we include results on the CIFAR-100, Fashion MNIST (Xiao et al., 2017) and IMDB (Maas et al., 2011) datasets, but studies on more diverse datasets are left for future work. In these cases we observed similar behavior regarding faster optimization and steering towards a sharper region, while generalization of the final region was not always improved. Finally, we relegate to Appendix C additional studies using a high base learning rate and momentum.

Summary.

We have investigated what happens if SGD uses a reduced learning rate along the sharpest directions. We show that this variant of SGD, i.e. NSGD, steers towards sharper regions in the beginning. Furthermore, NSGD is capable of optimizing faster and finding good generalizing sharp minima, i.e. regions of the loss surface at the convergence which are sharper compared to those found by vanilla SGD using the same low learning rate, while exhibiting better generalization performance. Note that in contrast to Dinh et al. (2017) the sharp regions that we investigate here are the endpoints of an optimization procedure, rather than a result of a mathematical reparametrization.

5 Related work

Tracking the Hessian: The largest eigenvalues of the Hessian of the loss of DNNs were investigated previously, but mostly in the late phase of training. A notable exception is LeCun et al. (1998), who first track the Hessian spectral norm and report its initial growth (though without commenting on it). Sagun et al. (2016) report that the spectral norm of the Hessian reduces towards the end of training. Keskar et al. (2016) observe that a sharpness metric grows initially for large batch-size, but only decays for small batch-size. Our observations concern the eigenvalues and eigenvectors of the Hessian, which follow a consistent pattern, as discussed in Sec. 3.1. Finally, Yao et al. (2018) study the relation between the Hessian and adversarial robustness at the endpoint of training.

Wider minima generalize better: Hochreiter and Schmidhuber (1997) argued that wide minima should generalize well. Keskar et al. (2016) provided empirical evidence that the width of the endpoint minima found by SGD relates to generalization and to the batch-size used. Jastrzębski et al. (2017) extended this by finding a correlation between the width and the ratio of learning rate to batch-size. Dinh et al. (2017) demonstrated the existence of reparametrizations of networks which keep the loss value and generalization performance constant while increasing the sharpness of the associated minimum, implying that it is not just the width of a minimum which determines generalization. Recent work further explored the importance of curvature for generalization (Wen et al., 2018; Wang et al., 2018).

Stochastic gradient descent dynamics: Our work is related to studies on SGD dynamics such as Goodfellow et al. (2014); Chaudhari and Soatto (2017); Xing et al. (2018); Zhu et al. (2018). In particular, Zhu et al. (2018) recently investigated the importance of noise along the top eigenvector for escaping sharp minima by comparing SGD with other optimizer variants at the final minima. In contrast, we show that from the beginning of training SGD visits regions in which the SGD step is too large compared to curvature. Concurrently with this work, Xing et al. (2018) show that the loss interpolated between parameter values at consecutive iterations is roughly convex, whereas we show a related phenomenon by investigating the loss in the subspace of the sharpest directions of the Hessian.

6 Conclusions

The somewhat puzzling empirical correlation between the endpoint curvature and its generalization properties reached in the training of DNNs motivated our study. Our main contribution is exposing the relation between SGD dynamics and the sharpest directions, and investigating its importance for training. SGD steers from the beginning towards increasingly sharp regions of the loss surface, up to a level dependent on the learning rate and the batch-size. Furthermore, the SGD step is large compared to the curvature along the sharpest directions, and highly aligned with them.

Our experiments suggest that understanding the behavior of optimization along the sharpest directions is a promising avenue for studying generalization properties of neural networks. Additionally, results such as those showing the impact of the SGD step length on the regions visited (as characterized by their curvature) may help design novel optimizers tailor-fit to neural networks.

Acknowledgements

SJ was supported by Grant No. DI 2014/016644 from Ministry of Science and Higher Education, Poland and No. 2017/25/B/ST6/01271 from National Science Center, Poland. Work at Mila was funded by NSERC, CIFAR, and Canada Research Chairs. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159.

References

Appendix A Additional results for Sec. 3.1

A.1 Additional results on the evolution of the largest eigenvalues of the Hessian

First, we show that the instability in the early phase of full-batch training is partially solved through use of Batch-Normalization layers, consistent with the results reported by Bjorck et al. [2018]; results are shown in Fig. 9.

(a) Resnet-32, , with BN
(b) Resnet-32, , with BN
Figure 9: Full-batch training with Batch-Normalization is more stable. Evolution of the top eigenvalues of the Hessian, and accuracy, for Resnet-32 trained on the CIFAR-10 dataset with (left) and (right).

Next, we extend the results of Sec. 3.1 to VGG-11 and Batch-Normalized Resnet-32 models, see Fig. 10 and Fig. 11, respectively. Importantly, we evaluated the Hessian in the inference mode, which resulted in large absolute magnitudes of the eigenvalues on Resnet-32.
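For readers who want to reproduce this kind of measurement on a small problem, the sketch below estimates the largest Hessian eigenvalue from gradients alone. This excerpt does not specify the numerical method used in the paper; power iteration with finite-difference Hessian-vector products is swapped in here as a simple stand-in, and all function names and the toy quadratic are illustrative assumptions. Power iteration converges to the eigenvalue of largest magnitude, which coincides with the largest eigenvalue only when that eigenvalue is positive and dominant.

```python
import math
import random


def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H v by a central finite difference of the gradient."""
    plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def top_eigenvalue(grad_fn, theta, iters=200, seed=0):
    """Power iteration on the Hessian at theta; returns (eigenvalue, eigenvector)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, theta, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eig, v


def toy_grad(theta):
    # Gradient of 0.5 * (4 * theta[0]**2 + theta[1]**2); Hessian is diag(4, 1).
    return [4.0 * theta[0], 1.0 * theta[1]]


eig, vec = top_eigenvalue(toy_grad, [0.3, -0.7])
```

In a real network `grad_fn` would be a mini-batch or full-batch gradient, and an exact Hessian-vector product from automatic differentiation would normally replace the finite difference.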

Figure 10: Same as Fig. 4, but for the VGG-11 architecture.
Figure 11: Same as Fig. 4, but for Resnet-32 architecture using Batch-Normalization layers.

A.2 Impact of learning rate schedule

In the paper we focused mostly on SGD using a constant learning rate and batch-size. We report here the evolution of the spectral and Frobenius norm of the Hessian when using a simple learning rate schedule in which we vary the length of the first stage ; we use for epochs and drop it afterwards to . We test in . Results are reported in Fig. 12. The main conclusion is that, depending on the learning rate schedule, the curvature along the sharpest directions (measured either by the spectral norm or by the Frobenius norm) can either decay or grow in the later stages of training. Training for a shorter time (a lower ) led to a growth of curvature (in terms of both the Frobenius norm and the spectral norm) after the learning rate drop, and a lower final validation accuracy.
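The two-stage schedule and the two curvature measures above can be sketched as follows. The function and argument names are hypothetical, and the example values are placeholders rather than the settings used in the experiments. The norm helpers rely on the fact that, for a symmetric matrix, the spectral norm is the largest absolute eigenvalue and the Frobenius norm is the square root of the sum of squared eigenvalues; if only a subset of eigenvalues is supplied, the second quantity is a lower bound.

```python
import math


def step_schedule(epoch, first_stage_epochs, initial_lr, final_lr):
    """Use initial_lr during the first stage, then drop to final_lr."""
    return initial_lr if epoch < first_stage_epochs else final_lr


def spectral_norm(eigenvalues):
    """Spectral norm of a symmetric matrix from its eigenvalues."""
    return max(abs(e) for e in eigenvalues)


def frobenius_norm(eigenvalues):
    """Frobenius norm of a symmetric matrix from its eigenvalues."""
    return math.sqrt(sum(e * e for e in eigenvalues))


# Placeholder settings, not the values from the experiments.
learning_rates = [step_schedule(e, 3, 0.1, 0.01) for e in range(6)]
```

Here `learning_rates` is 0.1 for epochs 0 to 2 and 0.01 from epoch 3 onwards.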

Figure 12: Depending on the length of the first stage of a learning rate schedule, spectral norm of the Hessian and generalization performance evolve differently. Left: Spectral norm of the Hessian (solid) and Frobenius norm (dashed), during training (x axis). Center: Accuracy and validation accuracy, during training (x axis). Right: Learning rate, during training (x axis).

A.3 Impact of using momentum

In the paper we focused mostly on experiments using plain SGD, without momentum. In this section we report that a large momentum, similarly to a large learning rate, leads to a reduction of the spectral norm of the Hessian; see Fig. 13 for results on the VGG11 network on CIFAR10 and CIFAR100.

Figure 13: Momentum limits the maximum spectral norm of the Hessian throughout training. Training at and on VGG11 and CIFAR10 (left) and CIFAR100 (right).

Appendix B Additional results for Sec. 3.2

In Fig. 14 we report the corresponding training and validation curves for the experiments depicted in Fig. 6. Next, we plot an analogous visualization as in Fig. 6, but for the 3rd and 5th eigenvector, see Fig. 15 and Fig. 16, respectively. To ensure that the results do not depend on the learning rate, we rerun the Resnet-32 and SimpleCNN experiments with , see Fig. 17.

Finally, we run the same experiment as in Sec. 3.2, but instead of focusing on the early phase we replot Fig. 6 for the first epochs, see Fig. 18.

Figure 14: Training and validation accuracy for experiments in Fig 6. Left corresponds to SimpleCNN. Right corresponds to Resnet-32.
Figure 15: Similar to Fig 6 but computed for the third eigenvector of the Hessian. We see that the loss surface is much flatter in this direction. To facilitate a direct comparison, scale of each axis is kept the same as in Fig. 6.
Figure 16: Similar to Fig 6 but for the fourth eigenvector. We see that the loss surface is much flatter in this direction. To facilitate a direct comparison, scale of each axis is kept the same as in Fig. 6.
Figure 17: Similar to Fig 6 but for Resnet-32 (top) and SimpleCNN (bottom) with .
Figure 18: Similar to Fig 6 but for Resnet-32 (top) and SimpleCNN (bottom) run for epochs.

Appendix C Additional results for Sec. 4

Here we report additional results for Sec. 4. Most importantly, in Tab. 2 we report full results for the SimpleCNN model. Next, we rerun the same experiment, using the Resnet-32 model, on the CIFAR-100 dataset, see Tab. 6, and on the Fashion-MNIST dataset, see Tab. 5. In the case of CIFAR-100 we observe that the conclusions carry over fully. In the case of Fashion-MNIST we observe that the final generalization for the case of and is similar. Therefore, as discussed in the main text, the behavior seems to be indeed dataset dependent.

In Sec. 4 we purposely explored NSGD in the context of a suboptimally picked learning rate, which allowed us to test whether NSGD has any implicit regularization effect. In the next two experiments we explored how NSGD performs when using either a large base learning rate , or when using momentum. Results are reported in Tab. 3 (learning rate ) and Tab. 4 (momentum ). In both cases we observe that NSGD improves training speed and reaches a significantly sharper region initially, see Fig. 19. However, the final region curvature in both cases, and the final generalization when using momentum, is not significantly affected. Further study of NSGD using a high learning rate or momentum is left for future work.

Figure 19: Evolution of the spectral norm and Frobenius norm of the Hessian (y axis, solid and dashed line, respectively) for different (color) for the experiment using a larger learning rate (left) and momentum (right), see text for details.
Test acc. Val. acc. (50) Loss Dist.
/
/
/
/
SGD(0.005) /
SGD(0.05) /
Table 2: Same as Tab. 1, but for Simple-CNN model. NSGD with diverged and was excluded from the Table.
Test acc. Val. acc. (50) Loss Dist.
/
/
/
Table 3: Same as Tab. 1, but for Resnet-32 and NSGD using base learning rate . NSGD with diverged and was excluded from the Table.
Test acc. Val. acc. (50) Loss Dist.
/
/
/
/
Table 4: Same as Tab. 1, but for Resnet-32 and NSGD using momentum coefficient .
name Test acc. Val. acc. (50) Loss Dist.
/
/
/
/
Table 5: Same as Tab. 1, but for Simple-CNN and NSGD on the Fashion-MNIST dataset.
name Test acc. Val. acc. (50) Loss Dist.
/
/
/
/
Table 6: Same as Tab. 1, but for Resnet-32 and NSGD on the CIFAR-100 dataset.

Appendix D SimpleCNN Model

The SimpleCNN used in this paper has four convolutional layers. The first two have filters, while the third and fourth have filters. In all convolutional layers, the convolutional kernel window size used is (, ) and ‘same’ padding is used. Each convolutional layer is followed by a ReLU activation function. Max-pooling is used after the second and fourth convolutional layer, with a pool-size of (, ). After the convolutional layers there are two linear layers with output size and respectively. After the first linear layer ReLU activation is used. After the final linear layer a softmax is applied. Please also see the provided code.
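The layer ordering described above can be traced with a small shape calculator. This is only a sketch of the stated structure, not the provided code: the filter counts, pool size, hidden width and number of classes are function arguments because their values did not survive in this text, and the numbers in the example call are placeholders chosen for illustration. It further assumes stride-1 convolutions (so that 'same' padding preserves the spatial size regardless of kernel size) and a pooling stride equal to the pool size.

```python
def simple_cnn_shapes(input_hw, filters_first, filters_second, pool, hidden, classes):
    """Return (layer name, output shape) pairs for the four-conv-layer layout."""
    h, w = input_hw
    shapes = []
    channels = None
    layer_filters = [filters_first, filters_first, filters_second, filters_second]
    for i, f in enumerate(layer_filters, start=1):
        channels = f  # 'same' padding with stride 1 keeps h and w unchanged
        shapes.append((f"conv{i}+relu", (h, w, channels)))
        if i in (2, 4):
            # Assumes the pooling stride equals the pool size.
            h, w = h // pool, w // pool
            shapes.append((f"maxpool{i // 2}", (h, w, channels)))
    shapes.append(("flatten", (h * w * channels,)))
    shapes.append(("linear1+relu", (hidden,)))
    shapes.append(("linear2+softmax", (classes,)))
    return shapes


# Placeholder hyperparameters, not taken from the text.
example = simple_cnn_shapes((32, 32), filters_first=32, filters_second=64,
                            pool=2, hidden=128, classes=10)
```

With these placeholder values the flattened feature vector has 8 * 8 * 64 = 4096 entries before the first linear layer.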

Appendix E Comparing NSGD to Newton Method

Nudged-SGD is a second order method, in the sense that it leverages the curvature of the loss surface. In this section we argue that it is significantly different from the Newton method, a representative second order method.

The key reason for that is that, similarly to SGD, NSGD is driven to a region in which curvature is too large compared to its typical step. In other words NSGD does not use an optimal learning rate for the curvature, which is the key design principle for second order methods. This is visible in Fig. 20, where we report results of a similar study as in Sec. 3.2, but for NSGD (, ). The loss surface appears sharper in this plot, because reducing gradients along the top sharpest directions allows optimizing over significantly sharper regions.

As discussed, the key difference stems from the early bias of SGD to reach maximally sharp regions. It is therefore expected that in the case of a quadratic loss surface the Newton and NSGD optimizers are very similar. In the following we construct such an example. First, recall that the update in the Newton method is typically computed as:

$$\theta_{t+1} = \theta_t - \eta H^{-1} g(\theta_t) \qquad (1)$$

where is a scalar. Now, if we assume that is diagonal and put , and finally let , it can be seen that the update of NSGD with and is equivalent to that of Newton method.
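The equivalence on a quadratic can be checked numerically with a toy example. The sketch below is one possible construction and not necessarily the one intended by the authors: it assumes a diagonal Hessian with two positive eigenvalues and models the nudged update as plain gradient descent whose learning rate is multiplied by a factor `gamma` along the `k` sharpest coordinates. The names `newton_step`, `nudged_step`, `base_lr`, `gamma` and `k`, and all numeric values, are illustrative assumptions.

```python
def newton_step(theta, eigs, eta):
    """Damped Newton step for the loss 0.5 * sum(eigs[i] * theta[i]**2)."""
    # With a diagonal Hessian, H^{-1} g is (eigs[i] * theta[i]) / eigs[i].
    return [th - eta * (lam * th) / lam for th, lam in zip(theta, eigs)]


def nudged_step(theta, eigs, base_lr, gamma, k):
    """Gradient step whose learning rate is scaled by gamma on the k sharpest coordinates."""
    order = sorted(range(len(eigs)), key=lambda i: eigs[i], reverse=True)
    top = set(order[:k])
    return [
        th - (gamma * base_lr if i in top else base_lr) * eigs[i] * th
        for i, th in enumerate(theta)
    ]


eigs = [8.0, 2.0]            # illustrative curvatures; coordinate 0 is the sharpest
theta = [1.0, -3.0]
eta = 0.5                    # scalar step size of the Newton update

base_lr = eta / eigs[1]      # matches Newton along the flat coordinate
gamma = eigs[1] / eigs[0]    # shrinks the step along the sharp coordinate

newton = newton_step(theta, eigs, eta)
nudged = nudged_step(theta, eigs, base_lr, gamma, k=1)
```

Both updates move each coordinate by the same fraction `eta` towards the minimum. The match depends on choosing `base_lr` and `gamma` from the eigenvalues themselves, which is exactly the curvature-aware tuning that the text argues NSGD does not perform in general.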

Figure 20: Same as Fig. 6, but for NSGD with and .

Appendix F Experiments on sentiment analysis dataset

Most of the experiments in the paper focused on image classification datasets (except for the language modeling experiments in Sec. 3.1). The goal of this section is to extend some of the experiments to the text domain. We experiment with the IMDB [Maas et al., 2011] binary sentiment classification dataset and use the simple CNN model from the Keras [Chollet et al., 2015] example repository (accessible at https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py).

First, we examine the impact of learning rate and batch-size on the Hessian along the training trajectory as in Sec. 3.1. We test and . As in Sec. 3.1 we observe that the learning rate and the batch-size limit the maximum curvature along the training trajectory. In this experiment the phase in which the curvature grows took many epochs, in contrast to the CIFAR-10 experiments. The results are summarized in Fig. 21.

Next, we tested Nudged SGD with , and . We test . We increased the number of parameters of the base model by increasing by a factor of the number of filters in the first convolutional layer and the number of neurons in the dense layer to encourage overfitting. Experiments were repeated over seeds.

We observe that in this setting NSGD for optimizes significantly faster and finds a sharper region initially. At the same time using does not result in finding a better generalizing region. The results are summarized in Tab. 7 and Fig. 22.

Figure 21: Same as Fig. 4, but for CNN model on the IMDB dataset.
Figure 22: NSGD experiments on IMDB dataset. Evolution of the validation accuracy (left) and spectral norm of the Hessian (right) for different .
Test acc. Val. acc. (10) Loss Dist.
/ 86.79
/ 84.52
/ 78.32
Table 7: Same as Tab. 1, but for CNN on IMDB dataset.