On the distance between two neural networks and the stability of learning

02/09/2020
by Jeremy Bernstein, et al.

How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound distills a clear dependence on depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent—or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on stability of both the spectra and Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained compared to Adam and SGD, and is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over 10^8 parameters. Please find code and reproducibility instructions at https://github.com/jxbz/fromage.
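
As a rough illustration of the gradient rescaling described above, the sketch below shows a per-layer rescaled update of that kind in PyTorch. It is written from the abstract's description, including a shrinkage step of the form 1/sqrt(1 + lr^2) intended to keep the weight norms stable, and is not the official implementation; the linked repository is authoritative and also handles edge cases such as zero-norm parameters. The function name fromage_style_step and its defaults are illustrative, not part of the released code.

import torch

@torch.no_grad()
def fromage_style_step(params, lr=0.01, eps=1e-12):
    # One layer-wise step in the spirit of Fromage (illustrative sketch only).
    for p in params:
        if p.grad is None:
            continue
        w_norm, g_norm = p.norm(), p.grad.norm()
        # Rescale the gradient so each layer moves by roughly a fraction lr
        # of its own weight norm, regardless of the raw gradient scale.
        p.add_(p.grad, alpha=-lr * (w_norm / (g_norm + eps)).item())
        # Shrink the weights to counteract norm growth across steps.
        p.div_((1 + lr ** 2) ** 0.5)

# Hypothetical usage inside a training loop, after loss.backward():
#   fromage_style_step(model.parameters(), lr=0.01)
#   model.zero_grad()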

Related Research

06/07/2022
Generalization Error Bounds for Deep Neural Networks Trained by SGD
Generalization error bounds for deep neural networks trained by stochast...

10/15/2020
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients
Most popular optimizers for deep learning can be broadly categorized as ...

03/11/2022
Enhancing Adversarial Training with Second-Order Statistics of Weights
Adversarial training has been shown to be one of the most effective appr...

09/23/2018
Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
In this note, we study the dynamics of gradient descent on objective fun...

01/26/2020
Reproducibility Challenge NeurIPS 2019 Report on "Competitive Gradient Descent"
This is a report for the reproducibility challenge of NeurIPS 2019 on the p...

06/11/2020
AdaS: Adaptive Scheduling of Stochastic Gradients
The choice of step-size used in Stochastic Gradient Descent (SGD) optimi...

Code Repositories

fromage

🧀 PyTorch code for the Fromage optimiser.

