Flatness is a False Friend

06/16/2020
by Diego Granziol, et al.

Hessian-based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large weights should be expected to have small Hessian-based measures of flatness. This implies that solutions obtained with L2 regularisation should in principle be sharper than those without, despite generalising better. We show this to be true for logistic regression, multi-layer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-100 datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-16 network and CIFAR-100 dataset, achieve superior generalisation to SGD but are 30× sharper. This theoretical finding, along with experimental results, raises serious questions about the validity of Hessian-based sharpness measures in the discussion of generalisation. We further show that the Hessian rank can be bounded by a constant times the number of neurons multiplied by the number of classes, which in practice is often a small fraction of the number of network parameters. This explains the curious observation, reported in the literature, that many Hessian eigenvalues are either zero or very near zero.
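To make the mechanism behind this claim concrete, the following minimal sketch (not taken from the paper; the toy data, seed and weight scales are illustrative assumptions) uses binary logistic regression under the cross-entropy loss. Scaling up a weight vector that already separates the data pushes the predicted probabilities towards 0 or 1, so the per-example curvature terms p(1-p) vanish and the Hessian-based sharpness measures (trace and spectral norm) shrink, even though the decision boundary, and hence generalisation, is unchanged.

import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data (illustrative assumption, not the paper's setup).
n, d = 200, 2
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    # Numerically stable logistic function.
    ez = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + ez), ez / (1.0 + ez))

def cross_entropy(w):
    # Mean binary cross-entropy loss of a linear classifier with weights w.
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def hessian(w):
    # Hessian of the mean cross-entropy: (1/n) X^T diag(p(1-p)) X.
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    return (X * s[:, None]).T @ X / n

w = np.array([1.0, 1.0])  # a direction that separates the toy data perfectly
for scale in [1.0, 5.0, 25.0]:
    H = hessian(scale * w)
    eigvals = np.linalg.eigvalsh(H)
    print(f"scale={scale:5.1f}  loss={cross_entropy(scale * w):.4f}  "
          f"trace(H)={np.trace(H):.2e}  spectral norm={eigvals.max():.2e}")

Running the loop typically shows the loss, Hessian trace and spectral norm all falling by orders of magnitude as the scale grows, which is exactly the sense in which large-weight, low-loss solutions look "flat" under these measures while a weight-decayed (smaller-weight) solution looks "sharp".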


Related research

10/01/2019 - How noise affects the Hessian spectrum in overparameterized neural networks
Stochastic gradient descent (SGD) forms the core optimization method for...

02/22/2018 - Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
Large batch size training of Neural Networks has been shown to incur acc...

11/22/2016 - Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond
We look at the eigenvalues of the Hessian of a loss function before and ...

07/24/2019 - Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization
While stochastic gradient descent (SGD) and variants have been surprisin...

11/06/2016 - Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper proposes a new optimization algorithm called Entropy-SGD for ...

01/31/2022 - On the Power-Law Spectrum in Deep Learning: A Bridge to Protein Science
It is well-known that the Hessian matters to optimization, generalizatio...

05/16/2023 - The Hessian perspective into the Nature of Convolutional Neural Networks
While Convolutional Neural Networks (CNNs) have long been investigated a...
