Rotational Optimizers: Simple Robust DNN Training

05/26/2023
by Atli Kosson et al.

The training dynamics of modern deep neural networks depend on complex interactions between the learning rate, weight decay, initialization, and other hyperparameters. These interactions can give rise to Spherical Motion Dynamics in scale-invariant layers (e.g., normalized layers), which converge to an equilibrium state in which the weight norm and the expected rotational update size are fixed. Our analysis of this equilibrium in AdamW, SGD with momentum, and Lion provides new insights into the effects of different hyperparameters and their interactions on the training process. We propose rotational variants (RVs) of these optimizers that force the expected angular update size to match the equilibrium value throughout training. This simplifies the training dynamics by removing the transient phase in which the weights converge to equilibrium. Our rotational optimizers can match the performance of the original variants, often with minimal or no tuning of the baseline hyperparameters, showing that these transient phases are not needed. Furthermore, we find that the rotational optimizers reduce the need for learning rate warmup and improve the optimization of poorly normalized networks.
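To make the idea concrete, the sketch below shows a minimal PyTorch-style rotational SGD step (no momentum). It is an illustrative assumption, not the paper's implementation: the function name rotational_sgd_step is hypothetical, every parameter is treated as a single scale-invariant vector (the paper works per neuron/filter and handles gains and biases conventionally), and the target angle sqrt(2 * lr * weight_decay) is the Spherical Motion Dynamics equilibrium value for plain SGD; momentum and Adam-style variants modify this constant. The key steps are projecting out the radial gradient component, rotating by a fixed expected angle, and restoring the weight norm.

```python
import math
import torch

def rotational_sgd_step(params, lr, weight_decay, eps=1e-8):
    """One step of a rotational SGD variant (illustrative sketch).

    Rather than letting weight decay and gradient noise drive the weight
    norm toward equilibrium over time, each update is (1) projected to be
    orthogonal to the weight vector (a pure rotation), (2) rescaled so the
    angular update matches the equilibrium value, and (3) the weight norm
    is kept fixed.
    """
    # Equilibrium angular update for plain SGD from the SMD analysis
    # (assumption: sqrt(2 * lr * weight_decay); momentum changes the constant).
    eta_r = math.sqrt(2.0 * lr * weight_decay)
    for p in params:
        if p.grad is None:
            continue
        w, g = p.data, p.grad
        norm_w = w.norm() + eps
        # Remove the radial component: for a scale-invariant weight, only
        # the rotational part of the gradient changes the network function.
        g_perp = g - (g.flatten() @ w.flatten()) / norm_w**2 * w
        # Step tangentially by angle eta_r (arc length eta_r * ||w||) ...
        update = g_perp / (g_perp.norm() + eps) * eta_r * norm_w
        p.data = w - update
        # ... then project back onto the sphere of radius ||w||.
        p.data *= norm_w / (p.data.norm() + eps)
```

A hypothetical usage would replace `optimizer.step()` in a training loop with `rotational_sgd_step(model.parameters(), lr=0.1, weight_decay=5e-4)` after `loss.backward()`. Because the angular update is fixed by construction, there is no transient phase of norm growth or decay at the start of training, which is the mechanism behind the reduced need for warmup described above.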


