On the Generalization Mystery in Deep Learning

03/18/2022
by Satrajit Chatterjee, et al.

The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable and hence to generalize well. We formalize this argument with an easy-to-compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue on this basis that the proposed approach is the most viable one.
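For concreteness, here is a small numerical sketch of the kind of per-example gradient alignment the abstract refers to. The toy logistic-regression setup and the specific alignment measure used below (the squared norm of the mean gradient divided by the mean squared per-example gradient norm) are illustrative assumptions, not the exact metric or experiments from the paper; the point is only that such a quantity is cheap to compute and comes out higher when the labels have structure than when they are random.

```python
# Illustrative sketch: how aligned are per-example gradients during GD?
# NOTE: the model, data, and "coherence_proxy" below are assumptions for
# illustration, not the metric defined in the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: structured labels vs. random labels.
n, d = 256, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_real = (X @ w_true > 0).astype(float)            # labels consistent with a linear rule
y_rand = rng.integers(0, 2, size=n).astype(float)  # labels with no structure

def per_example_grads(w, X, y):
    """Gradient of the logistic loss for each example, one row per example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
    return (p - y)[:, None] * X          # shape (n, d)

def coherence_proxy(G):
    """Squared norm of the mean gradient over the mean squared gradient norm.
    Near 1 when per-example gradients point the same way; near 1/n when they
    are mutually uncorrelated."""
    mean_g = G.mean(axis=0)
    return float(mean_g @ mean_g) / float((G * G).sum(axis=1).mean())

for name, y in [("real labels", y_real), ("random labels", y_rand)]:
    w = np.zeros(d)
    scores = []
    for step in range(200):              # plain gradient descent on the mean loss
        G = per_example_grads(w, X, y)
        scores.append(coherence_proxy(G))
        w -= 0.1 * G.mean(axis=0)
    print(f"{name}: mean coherence proxy over training = {np.mean(scores):.3f}")
```

In this toy setup the proxy typically comes out several times larger for the structured labels than for the random ones, mirroring at miniature scale the real-versus-random gap described in the abstract.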
