A Scale Invariant Flatness Measure for Deep Network Minima

02/06/2019
by Akshay Rangamani, et al.

It has been empirically observed that the flatness of minima obtained from training deep networks correlates with better generalization. However, for deep networks with positively homogeneous activations, most measures of sharpness/flatness are not invariant to rescalings of the network parameters that correspond to the same function. This means the measure of flatness/sharpness can be made arbitrarily small or large through rescaling, rendering such quantitative measures meaningless. In this paper we show that for deep networks with positively homogeneous activations, these rescalings constitute equivalence relations, and that these equivalence relations induce a quotient manifold structure on the parameter space. Using this manifold structure and an appropriate metric, we propose a Hessian-based measure of flatness that is invariant to rescaling. We use this new measure to confirm the proposition that Large-Batch SGD minima are indeed sharper than Small-Batch SGD minima.
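The rescaling problem the abstract describes is easy to see concretely. The sketch below is an illustration of that problem only, not the measure proposed in the paper: it builds a tiny two-layer ReLU network, applies the rescaling (W1, W2) -> (alpha*W1, W2/alpha), which leaves the network function unchanged because ReLU is positively homogeneous, and estimates the trace of the loss Hessian by finite differences as a naive flatness proxy. The network sizes, data, and the trace-based proxy are all illustrative choices of this sketch, not constructions from the paper.

```python
# Minimal sketch (assumptions: tiny 2-layer ReLU net, Hessian-trace proxy).
# Shows that rescaling (W1, W2) -> (alpha*W1, W2/alpha) leaves the loss
# unchanged but changes a naive Hessian-based sharpness measure.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def loss(params, X, y, shapes):
    # Unpack a flat parameter vector into the two weight matrices.
    n1 = shapes[0][0] * shapes[0][1]
    W1 = params[:n1].reshape(shapes[0])
    W2 = params[n1:].reshape(shapes[1])
    preds = relu(X @ W1.T) @ W2.T
    return 0.5 * np.mean((preds.ravel() - y) ** 2)

def hessian_trace(params, X, y, shapes, eps=1e-4):
    # Finite-difference estimate of the trace of the loss Hessian,
    # a common but rescaling-sensitive flatness proxy.
    f0 = loss(params, X, y, shapes)
    tr = 0.0
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = eps
        tr += (loss(params + e, X, y, shapes)
               - 2.0 * f0
               + loss(params - e, X, y, shapes)) / eps**2
    return tr

# Tiny 2-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
d_in, d_hidden = 4, 6
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(1, d_hidden))
X = rng.normal(size=(32, d_in))
y = rng.normal(size=32)
shapes = [(d_hidden, d_in), (1, d_hidden)]

for alpha in [0.1, 1.0, 10.0]:
    # relu(alpha * z) = alpha * relu(z) for alpha > 0, so this rescaling
    # represents the same function; only the parameterization changes.
    params = np.concatenate([(alpha * W1).ravel(), (W2 / alpha).ravel()])
    print(f"alpha={alpha:5.1f}  loss={loss(params, X, y, shapes):.6f}  "
          f"Hessian-trace proxy={hessian_trace(params, X, y, shapes):.3f}")
```

Running this prints the same loss for every alpha, while the trace estimate changes with alpha (the second-layer block of the Hessian scales roughly like alpha^2 and the first-layer block like 1/alpha^2), which is exactly why the paper seeks a rescaling-invariant alternative.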


