Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

10/12/2020
by Pan Zhou, et al.

It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite their faster training speed. This work aims to explain this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin; (b) the exponential gradient average in ADAM smooths its gradients and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima, which here refer to minima lying in flat or asymmetric basins/valleys, often generalize better than sharp ones <cit.>, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and corroborate our theoretical results.
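The core mechanism — heavier-tailed noise escaping a sharp basin much faster than light-tailed noise — can be illustrated with a toy simulation. This is our own sketch, not the paper's experiment: gradient descent on a 1D double-well loss is driven once by heavy-tailed Cauchy noise (a stand-in for SGD's gradient noise) and once by Gaussian noise (a stand-in for the lighter tails produced by ADAM's exponential gradient averaging). All parameter values are illustrative.

```python
import numpy as np

def escape_time(noise, eta=0.01, sigma=1.0, max_steps=50_000, seed=0):
    """Steps until noisy GD on f(x) = (x^2 - 1)^2 leaves the basin at x = -1.

    `noise` draws one noise sample from a NumPy Generator. Updates are
    clipped for numerical stability; escape means crossing the barrier at 0.
    """
    rng = np.random.default_rng(seed)
    x = -1.0
    for t in range(max_steps):
        grad = 4 * x * (x**2 - 1)               # f'(x) of the double well
        x -= eta * np.clip(grad + sigma * noise(rng), -200, 200)
        if x > 0:                               # crossed the barrier at x = 0
            return t
    return max_steps                            # no escape within the budget

# Median escape time over 10 runs for each noise type.
heavy = np.median([escape_time(lambda r: r.standard_cauchy(), seed=s) for s in range(10)])
light = np.median([escape_time(lambda r: r.standard_normal(), seed=s) for s in range(10)])
print(heavy, light)  # heavy-tailed noise escapes the sharp basin far sooner
```

With these settings the Cauchy-driven iterate typically jumps out of the basin within a few hundred steps, while the Gaussian-driven iterate exhausts the step budget, mirroring the escaping-time gap the abstract describes.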


