Explaining Memorization and Generalization: A Large-Scale Study with Coherent Gradients

03/16/2020
by Piotr Zielinski, et al.

Coherent Gradients is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. Inspired by random forests, Coherent Gradients proposes that (Stochastic) Gradient Descent (SGD) finds common patterns amongst examples (if such common patterns exist), since descent directions that are common to many examples add up in the overall gradient, and thus the biggest changes to the network parameters are those that simultaneously help many examples. The original Coherent Gradients paper validated the theory through causal intervention experiments on shallow, fully connected networks on MNIST. In this work, we perform similar intervention experiments on more complex architectures (such as VGG, Inception, and ResNet) and more complex datasets (such as CIFAR-10 and ImageNet). Our results are in good agreement with the small-scale study in the original paper, thus providing the first validation of Coherent Gradients in more practically relevant settings. We also confirm in these settings that suppressing incoherent updates through natural modifications to SGD can significantly reduce overfitting, lending credence to the hypothesis that memorization occurs when few examples are responsible for most of the gradient used in an update. Furthermore, we use the Coherent Gradients theory to explore a new characterization of why some examples are learned earlier than others, i.e., of "easy" and "hard" examples.
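
To make the mechanism concrete, below is a minimal sketch of one way to "suppress incoherent updates": compute per-example gradients, average them within a few groups, and keep only the component-wise median of the group averages, so that a parameter direction contributes to the step only if many examples agree on it. This is an illustration on a toy NumPy linear least-squares model, not the intervention or architectures used in the paper; the names per_example_grads and coherent_update, the group count k, and the median-of-group-means rule are assumptions made for the example.

```python
import numpy as np

# Toy problem: fit y = X @ w with squared loss so that per-example
# gradients have a simple closed form. Purely illustrative.
rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def per_example_grads(w, X, y):
    # Gradient of 0.5 * (x.w - y)^2 for each example: (x.w - y) * x
    residuals = X @ w - y          # shape (n,)
    return residuals[:, None] * X  # shape (n, d)

def coherent_update(grads, k=3):
    # Hypothetical suppression of incoherent components: split the batch
    # into k groups and take a component-wise median of the group means.
    # A coordinate survives only if most groups push it the same way,
    # i.e., if it is supported by many examples rather than a few.
    groups = np.array_split(grads, k)
    group_means = np.stack([g.mean(axis=0) for g in groups])  # (k, d)
    return np.median(group_means, axis=0)

w = np.zeros(d)
lr = 0.1
for step in range(500):
    w -= lr * coherent_update(per_example_grads(w, X, y))

print("distance to true weights:", np.linalg.norm(w - w_true))
```

The median-of-group-means rule is only meant to show the property at stake: the update favors directions on which many examples agree and discounts those driven by a few. The modifications to SGD actually studied in the paper operate on real networks in the same spirit and are described in the full text.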

Related research

02/25/2020  Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
An open question in the Deep Learning community is why neural networks t...

03/18/2022  On the Generalization Mystery in Deep Learning
The generalization mystery in deep learning is the following: Why do ove...

11/25/2019  Empirical Study of Easy and Hard Examples in CNN Training
Deep Neural Networks (DNNs) generalize well despite their massive size a...

08/03/2020  Making Coherence Out of Nothing At All: Measuring the Evolution of Gradient Alignment
We propose a new metric (m-coherence) to experimentally study the alignm...

05/21/2020  rTop-k: A Statistical Estimation Approach to Distributed SGD
The large communication cost for exchanging gradients between different ...

07/03/2019  Circuit-Based Intrinsic Methods to Detect Overfitting
The focus of this paper is on intrinsic methods to detect overfitting. T...

12/18/2017  On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent
Because stochastic gradient descent (SGD) has shown promise optimizing n...
