
Explaining Memorization and Generalization: A Large-Scale Study with Coherent Gradients

by Piotr Zielinski et al.

Coherent Gradients is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. Inspired by random forests, Coherent Gradients proposes that (Stochastic) Gradient Descent (SGD) finds common patterns amongst examples (if such common patterns exist), since descent directions that are common to many examples add up in the overall gradient; thus the biggest changes to the network parameters are those that simultaneously help many examples. The original Coherent Gradients paper validated the theory through causal intervention experiments on shallow, fully connected networks on MNIST. In this work, we perform similar intervention experiments on more complex architectures (such as VGG, Inception, and ResNet) on more complex datasets (such as CIFAR-10 and ImageNet). Our results are in good agreement with the small-scale study in the original paper, thus providing the first validation of Coherent Gradients in more practically relevant settings. We also confirm in these settings that suppressing incoherent updates through natural modifications to SGD can significantly reduce overfitting, lending credence to the hypothesis that memorization occurs when a few examples are responsible for most of the gradient used in an update. Furthermore, we use the Coherent Gradients theory to explore a new characterization of why some examples are learned earlier than others, i.e., what makes examples "easy" or "hard".
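The abstract does not specify which "natural modification to SGD" is used to suppress incoherent updates, so the sketch below is only an illustration of the general idea using a median-of-means gradient: per-example gradients are split into groups, and the component-wise median of the group means keeps descent directions shared by many examples while damping directions driven by a few outliers. All names (`coherent_update`, the toy gradient data) are hypothetical and not from the paper.

```python
import numpy as np

def coherent_update(per_example_grads, num_groups=3):
    """Median-of-means aggregation of per-example gradients.

    Hypothetical sketch: directions common to many examples survive
    the median across group means, while a direction driven by a few
    examples (incoherent, potentially memorization) is suppressed.
    """
    groups = np.array_split(per_example_grads, num_groups)
    group_means = np.stack([g.mean(axis=0) for g in groups])
    return np.median(group_means, axis=0)

# Toy data: 9 examples, 4 parameters. Eight examples roughly agree on
# a common descent direction; one outlier example contributes a huge,
# idiosyncratic gradient (the kind of update the hypothesis links to
# memorization).
rng = np.random.default_rng(0)
common = np.array([1.0, -1.0, 0.5, 0.0])
grads = common + 0.1 * rng.standard_normal((9, 4))
grads[0] = np.full(4, 50.0)  # incoherent outlier

plain_mean = grads.mean(axis=0)  # dominated by the single outlier
robust = coherent_update(grads)  # stays close to the common direction
```

The plain average is pulled far from the shared direction by the single outlier, while the median-of-means update tracks the direction most examples agree on, which is the qualitative behavior the intervention experiments probe.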



