Flatter, faster: scaling momentum for optimal speedup of SGD

10/28/2022
by Aditya Cowsik, et al.

Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study implicit bias arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter 1-β with the learning rate to the power of 2/3 maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We perform experiments, including matrix sensing and ResNet on CIFAR10, which provide evidence for the robustness of these results.
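As an illustration of the proposed scaling rule, the sketch below sets the SGD momentum so that 1 − β = c·η^(2/3), where η is the learning rate. This is a minimal sketch, not the authors' code: the constant c, the label-noise level, and the toy model are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a small overparametrized regression model (an assumption).
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))

lr = 1e-2   # learning rate eta
c = 1.0     # proportionality constant (assumed; tune per problem)

# Proposed scaling: 1 - beta proportional to eta^(2/3), i.e. beta = 1 - c * eta^(2/3).
beta = 1.0 - c * lr ** (2.0 / 3.0)

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)

def training_step(x, y, noise_std=0.1):
    """One step of SGD with label noise: targets are perturbed with fresh Gaussian noise."""
    optimizer.zero_grad()
    y_noisy = y + noise_std * torch.randn_like(y)   # label noise (assumed Gaussian)
    loss = nn.functional.mse_loss(model(x), y_noisy)
    loss.backward()
    optimizer.step()
    return loss.item()
```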


research 12/03/2020
Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum
Momentum plays a crucial role in stochastic gradient-based optimization ...

research 07/20/2021
Edge of chaos as a guiding principle for modern neural network training
The success of deep neural networks in real-world problems has prompted ...

research 08/20/2019
Automatic and Simultaneous Adjustment of Learning Rate and Momentum for Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) methods are prominent for training mac...

research 10/08/2021
Momentum Doesn't Change the Implicit Bias
The momentum acceleration technique is widely adopted in many optimizati...

research 07/25/2019
DEAM: Accumulated Momentum with Discriminative Weight for Stochastic Optimization
Optimization algorithms with momentum, e.g., Nesterov Accelerated Gradie...

research 02/22/2018
Characterizing Implicit Bias in Terms of Optimization Geometry
We study the bias of generic optimization methods, including Mirror Desc...

research 02/04/2020
Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform
Strictly enforcing orthonormality constraints on parameter matrices has ...
