On the Maximum Hessian Eigenvalue and Generalization

06/21/2022
by Simran Kaur, et al.

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better to unseen data than "sharper" solutions, motivating several metrics for measuring flatness (particularly λ_max, the largest eigenvalue of the Hessian of the loss), as well as algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between λ_max and generalization. In this paper, we present findings that call λ_max's influence on generalization further into question. We show that: (1) while larger learning rates reduce λ_max for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change λ_max without affecting generalization; (3) while SAM produces smaller λ_max for all batch sizes, its generalization benefits (also) vanish at larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller λ_max; and (5) while batch normalization does not consistently produce smaller λ_max, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the discrepancy between full-batch gradient descent (GD) and SGD demonstrates the limits of λ_max's ability to explain generalization in neural networks.
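For context, λ_max is typically not obtained by forming the full Hessian (infeasible for deep networks) but estimated with power iteration on Hessian-vector products. The sketch below illustrates this standard technique under assumed placeholders (a PyTorch `model`, a `loss_fn`, and a batch `inputs`, `targets`); it is not the authors' implementation.

```python
# Minimal sketch: estimate lambda_max of the loss Hessian via power iteration
# on Hessian-vector products. Power iteration converges to the dominant
# (largest-magnitude) eigenvalue, which near a minimum is typically lambda_max.
import torch


def hessian_max_eigenvalue(model, loss_fn, inputs, targets, iters=20, tol=1e-4):
    """Return an estimate of lambda_max of the loss Hessian w.r.t. model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    # First-order gradients, kept in the graph so we can differentiate them again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random direction of unit norm.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: d/dtheta (grad . v)
        hv = torch.autograd.grad(
            sum((g * x).sum() for g, x in zip(grads, v)),
            params,
            retain_graph=True,
        )
        # Rayleigh quotient v^T H v (v has unit norm).
        new_eig = sum((x * h).sum() for x, h in zip(v, hv)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
        if abs(new_eig - eig) < tol * (abs(eig) + 1e-12):
            return new_eig
        eig = new_eig
    return eig
```

In practice the estimate is usually averaged over several minibatches (or computed on the full training loss); libraries such as PyHessian package similar Hessian-vector-product routines.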


