Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

07/24/2019
by Xinyan Li, et al.

While stochastic gradient descent (SGD) and its variants have been surprisingly successful for training deep nets, several aspects of the optimization dynamics and generalization are still not well understood. In this paper, we present new empirical observations and theoretical results on both the optimization dynamics and the generalization behavior of SGD for deep nets, based on the Hessian of the training loss and associated quantities. We consider three specific research questions: (1) What is the relationship between the Hessian of the loss and the second moment of stochastic gradients (SGs)? (2) How can we characterize the stochastic optimization dynamics of SGD, with fixed and adaptive step sizes and diagonal pre-conditioning, based on the first and second moments of SGs? (3) How can we characterize a scale-invariant generalization bound for deep nets based on the Hessian of the loss, which is itself not scale invariant? We shed light on these questions with theoretical results supported by extensive empirical observations, using experiments on synthetic data, MNIST, and CIFAR-10 across different batch sizes and difficulty levels, the latter varied by synthetically adding random labels.
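
To make question (1) concrete, here is a minimal numpy sketch, not taken from the paper: the linear model, Gaussian data, and the noise level sigma are illustrative assumptions. It works through a well-known special case: for least-squares regression with roughly homogeneous residual noise, the second moment of per-sample stochastic gradients at the minimizer is approximately sigma^2 times the Hessian, so the two matrices share their eigenstructure.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 5000, 10, 0.3        # samples, dimension, noise level (assumed)

    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + sigma * rng.normal(size=n)

    # Loss f(w) = (1/n) sum_i (x_i'w - y_i)^2 / 2 has the constant Hessian:
    H = X.T @ X / n

    # Per-sample stochastic gradients at the minimizer: g_i = x_i (x_i'w - y_i)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    G = X * (X @ w_hat - y)[:, None]   # row i holds g_i
    M = G.T @ G / n                    # second moment of SGs: (1/n) sum_i g_i g_i'

    # With residual variance ~ sigma^2, M ~ sigma^2 H; compare leading eigenvalues.
    print(np.linalg.eigvalsh(M)[-3:])
    print(sigma ** 2 * np.linalg.eigvalsh(H)[-3:])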

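For question (2), the following is a minimal sketch of diagonally pre-conditioned SGD on the same kind of toy problem, assuming the standard RMSProp-style recursion as the pre-conditioner (the paper's exact scheme may differ, and the hyper-parameter values are illustrative). The running vector v tracks the diagonal of the second moment of SGs, and the step size adapts coordinate-wise:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 2000, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

    def sg(w, idx):
        # Mini-batch stochastic gradient of f(w) = (1/n) sum_i (x_i'w - y_i)^2 / 2
        return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

    w, v = np.zeros(d), np.zeros(d)
    eta, beta, eps, B = 0.05, 0.99, 1e-8, 32   # hyper-parameters (assumed values)

    for t in range(2000):
        g = sg(w, rng.choice(n, size=B, replace=False))
        v = beta * v + (1 - beta) * g ** 2     # diagonal second-moment estimate
        w -= eta * g / (np.sqrt(v) + eps)      # diagonally pre-conditioned step

    print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))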
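
For question (3), Hessian quantities of deep-net losses are in practice accessed through Hessian-vector products. Below is a minimal PyTorch sketch of Hutchinson's trace estimator; the model, data, and sample count are placeholders, and the estimator is a standard tool rather than the paper's construction. The resulting trace is exactly the kind of Hessian measure that is not scale invariant on its own: rescaling adjacent ReLU layers changes it while leaving the network function unchanged.

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
    X, y = torch.randn(256, 10), torch.randn(256, 1)   # placeholder data
    loss = torch.nn.functional.mse_loss(model(X), y)

    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Hutchinson's estimator: E[v' H v] = tr(H) for Rademacher vectors v.
    est, n_samples = 0.0, 50
    for _ in range(n_samples):
        vs = [torch.randint_like(p, high=2) * 2 - 1 for p in params]  # +/-1 entries
        Hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        est += sum((v * hv).sum().item() for v, hv in zip(vs, Hv))

    print("estimated tr(H):", est / n_samples)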

Related research

10/01/2019  How noise affects the Hessian spectrum in overparameterized neural networks
Stochastic gradient descent (SGD) forms the core optimization method for...

06/09/2021  Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms
Understanding generalization in deep learning has been one of the major ...

10/11/2022  SGD with large step sizes learns sparse features
We showcase important features of the dynamics of the Stochastic Gradien...

08/18/2019  Towards Better Generalization: BP-SVRG in Training Deep Neural Networks
Stochastic variance-reduced gradient (SVRG) is a classical optimization ...

06/16/2020  Flatness is a False Friend
Hessian based measures of flatness, such as the trace, Frobenius and spe...

06/07/2023  Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
In this work, we reveal a strong implicit bias of stochastic gradient de...

11/16/2018  The Full Spectrum of Deep Net Hessians At Scale: Dynamics with Sample Size
Previous works observed the spectrum of the Hessian of the training loss...
