Explaining the Adaptive Generalisation Gap

11/15/2020
by Diego Granziol, et al.

We conjecture that the difference in generalisation between adaptive and non-adaptive gradient methods stems from the failure of adaptive methods to account for the greater levels of noise associated with flatter directions in their estimates of local curvature. This conjecture, motivated by results in random matrix theory, has implications for optimisation in both simple convex settings and deep neural networks. We demonstrate that the schedules typically used for adaptive methods (with low numerical stability or damping constants) bias movement towards flat directions relative to sharp ones, effectively amplifying the noise-to-signal ratio and harming generalisation. We show that the numerical stability/damping constant used in these methods can be decomposed into a learning rate reduction and a linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements from increasing the shrinkage coefficient, closing the generalisation gap entirely in our neural network experiments. Finally, we show that other popular modifications to adaptive methods, such as decoupled weight decay and partial adaptivity, serve to calibrate parameter updates to make better use of sharper, more reliable directions.
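
The decomposition of the damping constant described above follows from a short algebraic identity. Below is a minimal numerical sketch (not code from the paper; the names alpha, delta, b, and grad are illustrative) showing that, for a diagonal curvature estimate of the kind used by Adam-style methods, the damped update coincides with a reduced learning rate applied to a linearly shrunk curvature estimate.

```python
import numpy as np

# Damped adaptive step: dw = -alpha * grad / (b + delta).
# Identity:             dw = -(alpha / (1 + delta)) * grad / b_shrunk,
# where b_shrunk = (1 - beta) * b + beta with beta = delta / (1 + delta),
# i.e. a linear shrinkage of the curvature estimate b towards the identity.

alpha, delta = 1e-3, 1e-2                  # learning rate and damping constant
b = np.array([2.0, 0.5, 0.05])             # estimated per-parameter curvature
grad = np.array([0.3, -0.1, 0.02])         # stochastic gradient

step_damped = -alpha * grad / (b + delta)  # the usual damped update

beta = delta / (1.0 + delta)               # shrinkage coefficient
b_shrunk = (1.0 - beta) * b + beta         # shrink b towards the identity
step_shrunk = -(alpha / (1.0 + delta)) * grad / b_shrunk

assert np.allclose(step_damped, step_shrunk)
```

In this view, raising the shrinkage coefficient beta above the value implied by a small delta pulls the effective curvature towards the identity, reducing relative movement along the flat, noise-dominated directions that the abstract identifies as harmful to generalisation.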

