On the distance between two neural networks and the stability of learning

02/09/2020
by Jeremy Bernstein, et al.

How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound distills a clear dependence on the depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent—or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on stability of both the spectra and the Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained as compared to Adam and SGD, and is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over 10^8 parameters. Please find code and reproducibility instructions at: https://github.com/jxbz/fromage.
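To make the gradient-rescaling idea concrete, below is a minimal PyTorch-style sketch of a per-layer update in the spirit of Fromage: each layer's gradient is rescaled so the step is a fixed fraction of that layer's Frobenius norm, and the weights are then damped so their norm stays stable over many steps. This is an illustrative sketch, not the authors' reference implementation; the function name `fromage_style_step`, the `eps` safeguard and the exact form of the damping factor are assumptions here, and the official code is in the repository linked above.

```python
import torch

def fromage_style_step(params, lr=0.01, eps=1e-12):
    """Illustrative per-layer update in the spirit of Fromage (not the
    authors' reference code): take a step whose Frobenius norm is a fixed
    fraction `lr` of the layer's weight norm, then damp the weights so
    their norm does not drift upward over repeated steps."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            w_norm = p.norm()       # Frobenius norm of the weights
            g_norm = p.grad.norm()  # Frobenius norm of the gradient
            # Gradient rescaling: step of size lr * ||W||_F in the gradient direction.
            scale = (lr * w_norm / (g_norm + eps)).item()
            p.add_(p.grad, alpha=-scale)
            # Norm control (assumed form): shrink weights to counteract norm growth.
            p.mul_(1.0 / (1.0 + lr ** 2) ** 0.5)

# Example usage on a small multilayer perceptron.
model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
fromage_style_step(model.parameters(), lr=0.01)
```

Note that the step size is relative to each layer's weight norm, which is what ties the update to the depth-dependent trust region described in the abstract.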


