
On the distance between two neural networks and the stability of learning

by Jeremy Bernstein, et al.

How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound distills a clear dependence on the depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent, or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on the stability of both the spectra and the Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained, as compared to Adam and SGD, and is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over 10^8 parameters. Please find code and reproducibility instructions at:




Generalization Error Bounds for Deep Neural Networks Trained by SGD

Generalization error bounds for deep neural networks trained by stochast...

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Most popular optimizers for deep learning can be broadly categorized as ...

Enhancing Adversarial Training with Second-Order Statistics of Weights

Adversarial training has been shown to be one of the most effective appr...

Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

In this note, we study the dynamics of gradient descent on objective fun...

Reproducibility Challenge NeurIPS 2019 Report on "Competitive Gradient Descent"

This is a report for reproducibility challenge of NeurIPS 2019 on the p...

AdaS: Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimi...

Code Repositories


🧀 PyTorch code for the Fromage optimiser.
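
The abstract describes Fromage as gradient rescaling that keeps the Frobenius norms of the weights stable, but it does not spell out the update rule. The sketch below assumes the per-layer form W ← (W − η (‖W‖_F / ‖g‖_F) g) / √(1 + η²), which is our reading of the published algorithm and should be checked against the repository. It is written in plain Python rather than PyTorch; the names `frobenius_norm` and `fromage_step`, and the fallback to an unscaled step when either norm is zero, are illustrative assumptions.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def fromage_step(weights, grads, lr=0.01):
    """Apply one Fromage-style update to a list of per-layer matrices.

    Assumed rule (not stated in the abstract):
        W <- (W - lr * (||W||_F / ||g||_F) * g) / sqrt(1 + lr**2)
    """
    # Shrink factor that offsets the norm growth caused by the step itself.
    shrink = 1.0 / math.sqrt(1.0 + lr ** 2)
    new_weights = []
    for w, g in zip(weights, grads):
        w_norm = frobenius_norm(w)
        g_norm = frobenius_norm(g)
        # Rescale the gradient to the weight's Frobenius norm.
        # Assumption: use the raw gradient if either norm is zero.
        scale = w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        new_weights.append([
            [shrink * (wi - lr * scale * gi) for wi, gi in zip(w_row, g_row)]
            for w_row, g_row in zip(w, g)
        ])
    return new_weights
```

Under this assumed rule, a gradient orthogonal to the weights leaves the layer's Frobenius norm unchanged, because the step grows the squared norm by a factor of 1 + η² and the final division removes exactly that factor. In general the norm can change by at most a factor of (1 + η) / √(1 + η²) per step.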
