1 Introduction
Training deep neural networks (DNNs) requires large amounts of computational resources, often using many devices operating over days and weeks. Furthermore, with ever-increasing model size and available data, the amount of compute used in training has been noted to increase exponentially over the past few years (Amodei & Hernandez, 2018). Fortunately, the operations performed in DNN training are highly suitable for parallelization over many devices. Most commonly, the parallelization is done via "data parallelism", in which the data is divided into separate batches that are distributed between different devices during training. By using many devices, with each device highly utilized, one can train on a large number of samples without paying in runtime. However, this training is done synchronously, with synchronization between the different devices on every iteration (synchronous SGD, or SSGD). As the number of devices participating in the training grows, this synchronization becomes a prominent bottleneck.
To overcome the synchronization bottleneck, asynchronous training (ASGD) has been proposed (Dean et al., 2012; Recht et al., 2011). The basic idea behind such training is that when a device finishes calculating its batch gradients, an update to the DNN parameters is immediately performed, without waiting for other devices. This approach is known as asynchronous centralized training with a parameter server (PS) that updates the parameters (Li, 2014). The main problem in such training is that, since the PS updates the parameters whenever a device communicates with it, the parameters being used in the other devices' calculations are no longer up to date. This phenomenon is called gradient staleness, or delay as we will refer to it, and it causes a deterioration in generalization performance when using ASGD.
The generalization deterioration can be seen in Fig. 1. Unless the learning rate is significantly decreased, we can see a generalization gap: a deterioration in the generalization of ASGD (with delay $\tau > 0$) from the baseline (SSGD, $\tau = 0$) value near the steady state, i.e., after we are near full convergence, following many training epochs (2000). Due to the generalization gap demonstrated in the figure, ASGD with a parameter server and large delays is not commonly used, despite being relatively simple to implement and despite its vast potential to accelerate training. Therefore, our main goal in this work is to shed some theoretical light on the generalization gap problem and use it to improve ASGD training.
Previous theoretical analysis of ASGD (Liu et al., 2018; Lian et al., 2016, 2015; Dai et al., 2018; Dutta et al., 2018; Arjevani et al., 2018) focused on analyzing the convergence rate, i.e., the time it takes to reach a steady state, and not on the properties of the obtained solution. In contrast, in this paper we focus on understanding how the delay affects the selection of the solution we converge to, and how changes in this selection can impact generalization.
We tackle these questions from the perspective of dynamical stability. The dynamical stability approach was used in (Nar & Sastry, 2018; Wu et al., 2018) to study which minima are accessible under a specific choice of optimization algorithm and hyperparameters. In (Nar & Sastry, 2018), the authors analyzed the gradient descent (GD) algorithm as a discrete-time nonlinear dynamical system. By analyzing the Lyapunov stability of this system for different minima, they showed a relation between the learning rate choice and the accessible minima set, i.e., the subset of the local optima that GD can converge to. In (Wu et al., 2018), the authors focused on stochastic gradient descent (SGD), defined a criterion to evaluate the stability of a specific minimum, and used this criterion to show how the learning rate and batch size play a role in SGD's minima selection process.
Here, we use dynamical stability to analyze the dynamics of ASGD. Using this approach, the main question we try to tackle is:
How do learning rate, delay and momentum interact and affect the minima selection process?
Contributions
We start by modelling ASGD as a dynamical system with delay. By analyzing the stability properties of that system, we find:

- There exists an inverse linear relation between the delay $\tau$ of ASGD and the threshold learning rate, i.e., the critical learning rate at which a minimum loses stability.

- This implies that, in order to preserve the stability of a specific minimum as we change the delay, we need to scale the learning rate inversely proportionally to the delay.

- Momentum has a crucial role in determining the stability properties of ASGD. Depending on the specific training algorithm, it can either hurt or improve stability. We show how to modify momentum to stabilize ASGD training.
With our theoretical findings, we derive closed-form rules for adjusting the hyperparameters when training asynchronously. We suggest scaling the learning rate inversely with the delay in order to reach better generalization performance. In section 3 we provide experiments on popular classification tasks to support our theoretical analysis. It is interesting to contrast this simple linear relation with the more complicated picture observed for the scaling of the learning rate with the batch size. Specifically, Shallue et al. (2018) observed that the optimal learning rate scales differently with the batch size across models and datasets, i.e., there is no general "rule of thumb" that applies in all cases.
A similar scaling rule for the learning rate was proposed before (Zhang et al., 2015); however, it lacked a theoretical basis and a demonstration of its effectiveness with large effective batch sizes (i.e., the sum of the mini-batch sizes of all the workers), as we provide here.
2 Theoretical Analysis
2.1 Preliminaries and Problem Setup
Consider the problem of minimizing the empirical loss
$$\mathcal{L}(\theta) \triangleq \frac{1}{N}\sum_{n=1}^{N} f_n(\theta)\,, \qquad (1)$$
where $N$ is the number of samples, and each $f_n : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function. The update rule for minimizing $\mathcal{L}(\theta)$ in eq. 1 using asynchronous stochastic gradient descent (ASGD) is given by
$$\theta_{t+1} = \theta_t - \eta\, \nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad (2)$$
where $\eta$ is the learning rate, $n(t)$ is a random selection process of a sample from $\{1,\dots,N\}$ at iteration $t$, and $\tau$ is some delay due to the asynchronous nature of our training. For simplicity, we focus on a fixed delay $\tau$. The extension to stochastic delay is discussed in section 2.2.1.
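To make the effect of the delay concrete, here is a minimal simulation sketch (our own illustration, not code from the paper) of the delayed update on a one-dimensional quadratic with sharpness $h$, i.e., $x_{t+1} = x_t - \eta h\, x_{t-\tau}$:

```python
import numpy as np

def run_delayed_gd(eta, h, tau, steps=800, x0=1.0):
    """Iterate x_{t+1} = x_t - eta*h*x_{t-tau} and return the final |x|."""
    x = [x0] * (tau + 1)          # constant history: x_{-tau} = ... = x_0
    for _ in range(steps):
        x.append(x[-1] - eta * h * x[-(tau + 1)])
    return abs(x[-1])

h, tau = 1.0, 5
# Threshold on eta*h for this delay (derived later in the paper).
thr = 2 * np.sin(np.pi / (2 * (2 * tau + 1)))
print(run_delayed_gd(0.5 * thr, h, tau))  # far below threshold: decays toward 0
print(run_delayed_gd(2.0 * thr, h, tau))  # above threshold: blows up
```

Even with a fixed quadratic, the same learning rate that is stable without delay can diverge once a delay is introduced, which is the phenomenon analyzed next.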
Our main goal is understanding how different factors, such as the gradient staleness and the learning rate, affect the minima selection process in neural networks. In other words, we would like to understand how the different hyperparameters interact to divide the minima of the loss into two sets: those we can converge to and those we cannot converge to.
To analyze this, we suppose we are in the vicinity of some minimum point $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$. Given some values of $\eta$ and $\tau$, we ask whether or not we can converge to $\theta^*$, using the theory of dynamical stability. As we shall see, if the learning rate or delay are too high, then this minimum loses stability, i.e., ASGD cannot converge to $\theta^*$, in expectation.
Specifically, we examine the expectation of the linearized dynamics around $\theta^*$. We show in appendix A that, to ensure stability, it is sufficient that the following one-dimensional equation is stable:
$$x_{t+1} = x_t - m\, x_{t-\tau}\,, \qquad (3)$$
where we defined $m \triangleq \eta h$ and the sharpness term $h$, the maximal singular value of the Hessian $\nabla^2\mathcal{L}(\theta^*)$. Additionally, we show in appendix A that the characteristic equation of eq. 3 is
$$z^{\tau+1} - z^{\tau} + m = 0\,. \qquad (4)$$
2.2 Stability Analysis
For given $\eta$ and $\tau$, we would like to determine if the minimum point $\theta^*$ is asymptotically stable (Oppenheim, 1999) in expectation. This will enable us to divide the minima into two sets: those we might converge to, since they are stable, and those we cannot possibly converge to (except from a measure-zero set of initializations), since they are unstable. We use the characteristic equation (eq. 4) to define the stability criterion:
Definition 1 (Stability criterion)
We would like to find conditions describing when and how we can change the hyperparameters while maintaining the same stability criterion. This, for example, will enable us to understand how to compensate for the delay introduced by asynchrony using the learning rate. In order to analyze stability we examine the characteristic equation in eq. 4 and ask: when does the maximal root of this polynomial have unit magnitude?
2.2.1 The Interaction Between Learning Rate and Delay
Recall the characteristic equation
$$z^{\tau+1} - z^{\tau} + m = 0\,. \qquad (5)$$
This polynomial has $\tau+1$ roots. In order to ensure stability we require that the root with the maximal magnitude, i.e., the maximal root, lies inside the unit circle. We want to find the threshold learning rate, i.e., the learning rate at which the maximal root is exactly on the unit circle, so we are at the threshold of stability. In appendix B, we show that this requires that
$$\eta h = 2\sin\left(\frac{\pi}{2(2\tau+1)}\right)\,. \qquad (6)$$
Using a Taylor approximation, we get
$$\eta h \approx \frac{\pi}{2\tau+1}\,. \qquad (7)$$
We observe numerically that the error between the exact solution (eq. 6) and the linear approximation (eq. 7) is small (a few percent or less) for $\tau \geq 1$. Additionally, in the appendix Fig. 6 we demonstrate the high accuracy of the analytic approximation of the relation between $\eta$ and $\tau$, compared to a numerical evaluation of the value of $m = \eta h$ for which the maximal root of eq. 5 is on the unit circle.
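The threshold and its linear approximation can be checked directly against the roots of the characteristic polynomial; a small numerical sketch (our own check, using `numpy.roots` on eq. 5):

```python
import numpy as np

def max_root_magnitude(m, tau):
    """Largest root magnitude of the characteristic polynomial z^(tau+1) - z^tau + m."""
    coeffs = [1.0, -1.0] + [0.0] * (tau - 1) + [m]
    return max(abs(r) for r in np.roots(coeffs))

for tau in [1, 5, 10, 25]:
    exact = 2 * np.sin(np.pi / (2 * (2 * tau + 1)))   # threshold of eq. 6
    approx = np.pi / (2 * tau + 1)                    # linear approximation of eq. 7
    assert max_root_magnitude(0.99 * exact, tau) < 1.0   # just below threshold: stable
    assert max_root_magnitude(1.01 * exact, tau) > 1.0   # just above threshold: unstable
    print(tau, exact, approx, abs(approx - exact) / exact)
```

The printed relative error of the approximation shrinks as the delay grows, consistent with the observation above.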
Implications: We can see that to maintain stability for a given minimum point (or a given sharpness value) the learning rate should be kept inversely proportional to the delay. This implies that given some minimum which is stable for $\tau = 0$ with a corresponding learning rate $\eta_0$, we can evaluate the learning rate which will ensure this minimum remains stable for larger delay values. To do this, we first estimate $h$ using eq. 6: $h = 2\sin(\pi/2)/\eta_0 = 2/\eta_0$. Note that we do not use eq. 7 to evaluate $h$ since we are interested in $\tau = 0$, while the relation in eq. 7 is only a good approximation for $\tau \geq 1$. Next, given some delay $\tau$, we substitute $h$ into eq. 7 in order to evaluate a learning rate value for which the stability of all minima is maintained, as in
$$\eta(\tau) = \frac{\pi}{(2\tau+1)\,h} = \frac{\pi\,\eta_0}{2(2\tau+1)}\,. \qquad (8)$$
Related empirical results: In section 3.1 we show empirically the existence of a stability threshold for different values of delay and compare it with our theoretical threshold.
Stochastic delay: In appendix C we extend our theoretical analysis to derive the characteristic equation for ASGD with stochastic delay, i.e., when $\tau$ is drawn from some discrete distribution. We analyze two cases:


- Discrete uniform distribution.
- Gaussian distribution discrete approximation. In Zhang et al. (2015), the authors empirically observed a delay distribution that resembles a Gaussian distribution.
For both cases, we show numerically that the inverse relation between the learning rate and the expected delay still applies.
2.2.2 The Effects of Momentum on Stability
The optimization step of ASGD with momentum is defined as follows:
$$v_{t+1} = \gamma v_t + (1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}\,, \qquad (9)$$
where $\theta_t$ are the weights, $v_t$ and $\nabla f_{n(t)}(\theta_{t-\tau})$ are the velocity and (delayed) gradients at time $t$, respectively, $\gamma$ is the momentum parameter, $\eta$ is the step size and $\tau$ is the delay. The constant term $(1-\gamma)$ which multiplies the gradients is called "dampening".
In appendix D we extend our theoretical analysis to derive the characteristic equation for ASGD with momentum. In addition, in section D.2.1 we show numerically that an inverse relation between the learning rate and the delay also applies when using momentum. Particularly,


- We need to keep the learning rate inversely proportional to the delay.

- For a given delay, larger momentum values require a smaller learning rate for stability.
Therefore, the range of stable learning rates shrinks as the momentum parameter $\gamma$ increases, which can make tuning the learning rate difficult, especially with high delays. These results suggest that, to ensure stability when training with high delay, it is recommended to work without momentum.
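This destabilizing effect can be illustrated numerically. The sketch below is our own illustration: it assumes the linearized delayed heavy-ball recursion with dampening, $x_{t+1} = x_t + \gamma(x_t - x_{t-1}) - \eta h (1-\gamma)\, x_{t-\tau}$ (our reading of eq. 9), and bisects for the stability threshold on $m = \eta h$:

```python
import numpy as np

def momentum_threshold(gamma, tau, hi=4.0, iters=60):
    """Bisect for the largest stable m = eta*h of the delayed heavy-ball recursion
    x_{t+1} = (1+gamma) x_t - gamma x_{t-1} - m*(1-gamma) x_{t-tau}.
    Assumes tau >= 2 and that the stable set is an interval starting at m = 0."""
    def stable(m):
        coeffs = [1.0, -(1.0 + gamma), gamma] + [0.0] * (tau - 2) + [m * (1.0 - gamma)]
        return max(abs(r) for r in np.roots(coeffs)) < 1.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stable(mid):
            lo = mid
        else:
            hi = mid
    return lo

tau = 5
print(momentum_threshold(0.0, tau))  # ~0.285, matches 2*sin(pi/(2*(2*tau+1)))
print(momentum_threshold(0.9, tau))  # smaller: with delay, momentum shrinks the stable range
```

At $\gamma = 0$ the bisection recovers the closed-form threshold of eq. 6, while at $\gamma = 0.9$ the stable range of $m$ is noticeably smaller, consistent with the bullet points above.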
Relation to previous results: The conclusion that it is recommended to turn off momentum when using ASGD with high delay values is consistent with the results of Mitliagkas et al. (2017) and Liu et al. (2018). In Liu et al. (2018), for streaming PCA, the authors showed it is necessary to reduce momentum in order to ensure convergence and acceleration through asynchrony. In Mitliagkas et al. (2017), the authors suggested that asynchrony adds implicit momentum to the training process and therefore it is necessary to reduce momentum when working asynchronously.
Next, we discuss a new method for introducing momentum into asynchronous training. This method enables asynchronous training where increasing the momentum parameter improves the stability.
Shifted momentum. As discussed, training asynchronously is less stable when momentum is used. We propose a new method, called shifted momentum, for training asynchronously, where the momentum benefits the training process and particularly its stability. We observe this method tends to improve the convergence stability compared to training without momentum (or with the original momentum), as demonstrated in appendix Fig. 11. We suggest that instead of exchanging the gradient terms, as in eq. 9, we exchange the entire velocity term. This way, each worker calculates the velocity term based on its own gradients and momentum buffer. The formulation of this method is given in the following equation:
$$v_{t+1} = \gamma v_{t-\tau} + (1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}\,,$$
or
$$\theta_{t+1} = \theta_t + \gamma\left(\theta_{t-\tau} - \theta_{t-\tau-1}\right) - \eta(1-\gamma)\nabla f_{n(t)}\left(\theta_{t-\tau}\right)\,. \qquad (10)$$
In appendix D we show that the characteristic equation of eq. 10 is
$$z^{\tau+2} - z^{\tau+1} + \left(m(1-\gamma) - \gamma\right) z + \gamma = 0\,, \qquad (11)$$
where we use the sharpness term $m = \eta h$ as before.
We would like to characterize how the learning rate, delay and momentum affect minima stability. Thus, for each momentum value, we numerically evaluate the value of $m = \eta h$ for which the maximal root of eq. 11 is on the unit circle, i.e., when we are on the threshold of stability. In Fig. 4 we see that the inverse relation between the learning rate and the delay still applies when using shifted momentum. Moreover, we can see that, in contrast to regular momentum, with shifted momentum a larger momentum value enables us to work with a larger step size. Therefore, increasing the momentum improves the stability.
Relation to previous results: The analysis of delayed gradients and velocity might also shed light on the stability of recent asynchronous training approaches in which the workers communicate through weight averaging (Lian et al., 2017; Assran et al., 2019). These methods are similar to ours in the sense that each worker holds its own momentum buffer.
3 Experiments
In this section, we provide experiments to support our theoretical findings. In the following subsections we: (1) demonstrate the relationship between hyperparameters and minima selection, and (2) show our findings can help improve asynchronous training. Details about the experiments and the implementation can be found in appendix E.¹

¹Code will be available at https://github.com/papersubmissions/delay_stability
3.1 How Hyperparameters Affect Minima Selection
In this section, we demonstrate how the delay and learning rate affect the accessible minima set. We train with a fixed learning rate.
Minima stability. To demonstrate how delay and learning rate change the stability of a minimum, we start from a model that converged to some minimum and train with different values of delay and learning rate to see if the algorithm leaves that minimum, i.e., the minimum becomes unstable. We do so by training a VGG11 on CIFAR10 for 10,000 epochs, until we reach a steady state. This training is done without delay, momentum, or weight decay. Next, we introduce a delay $\tau$, change the learning rate, and continue to train the model. Fig. 2 shows the number of epochs it takes to leave the steady state for a given delay $\tau$ and learning rate $\eta$. We observe that for certain $(\eta, \tau)$ pairs the algorithm stays in the minimum (below the black circle), while for others it leaves that minimum after some number of epochs (above the black circle). Importantly, we can see in the right panel of Fig. 2 that, as predicted by the theory, the inverse learning rate $1/\eta_{\max}$, where $\eta_{\max}$ is the maximal learning rate at which we did not diverge, scales linearly with the delay $\tau$. Additional details are given in appendix F.
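Comparing such empirical thresholds against the theory requires the sharpness $h$ at the minimum. A minimal sketch of estimating it with power iteration (our own illustration, here with an explicit Hessian-vector product oracle on a toy quadratic; in practice the oracle would be computed with automatic differentiation):

```python
import numpy as np

def sharpness_power_iteration(hvp, dim, iters=100, seed=0):
    """Estimate the top Hessian eigenvalue given a Hessian-vector product oracle."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return float(v @ hvp(v))   # Rayleigh quotient at the converged direction

# Toy quadratic loss 0.5 * x^T H x with known top eigenvalue 3.0.
H = np.diag([3.0, 1.0, 0.5])
h_est = sharpness_power_iteration(lambda v: H @ v, dim=3)
print(h_est)  # ~3.0
```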
Generalization with delay. With delay, we need to adjust the hyperparameters in order to maintain the accessible minima set. However, even if a certain $(\eta, \tau)$ pair changes the accessible set, it is possible that there are still accessible minima that generalize well. We perform experiments to investigate what type of generalization we can expect from training with different $(\eta, \tau)$ pairs. We examine this by training a VGG11 on CIFAR10 with different $(\eta, \tau)$ pairs until we reach a steady state (around 6,000 epochs). We use plain SGD without momentum or weight decay. We compare the generalization performance achieved by each pair of learning rate and delay. We repeat the experiment for each pair four times and report mean and standard deviation values. The results are presented in Fig. 3. We can see a linear scaling between the delay values and the empirical learning-rate stability threshold, i.e., the learning rate at which the error diverges. In addition, we see that the smallest error (with small variance) is obtained at a learning rate which is somewhat smaller than the learning rate at the empirical threshold of stability, as expected
(LeCun et al., 2012).

Shifted momentum stability. In subsection 2.2.2 we introduced shifted momentum as a modification of the standard momentum update, in which increasing the momentum value ($\gamma$) improves stability. We validate the stability properties of this method by training a fully connected model on MNIST. The model has 3 layers, 1024 neurons in each layer, with ReLU activations. Again, training is done synchronously with a large batch and no weight decay until it reaches a steady state. Then, after reaching a minimum, we continued to train the model, each time with a different triplet $(\eta, \tau, \gamma)$. We found the maximum learning rate for a given $\tau$ and $\gamma$ for which training does not diverge, i.e., this maximum learning rate is at the edge of stability. Fig. 4 depicts the empirical results. The value of $h$ is estimated at the minimum point using power iteration. Two important results emerge from the graph. First, in order to maintain the stability threshold, the learning rate has to be inversely proportional to the delay, as described in subsection 2.2. Second, with increased momentum values, training becomes more stable; that is, we can use larger learning rates with larger momentum values. This stands in contrast to the analysis of the original momentum: as we show numerically in appendix D, increasing the momentum value in the standard momentum algorithm decreases stability, and hence we need to decrease the learning rate as the momentum value grows larger.

3.2 Improving Asynchronous Training
So far we focused on the steady-state behavior of ASGD, i.e., after many iterations. However, one of the prominent motivations for training asynchronously is to speed up training. In this subsection, we examine whether our findings are also relevant for improving the accuracy of models trained asynchronously with little to no excess budget of epochs, compared to the equivalent large-batch regime. In large-batch training, the learning rate is scaled as a function of the ratio between the large batch size and the small one. As Shallue et al. (2018) pointed out, we can expect a different learning rate scaling for different models and datasets. We start with a set of models and datasets that follow a square-root scaling of the learning rate with the batch size (Hoffer et al., 2017). Recall that our inversely linear scaling rule (i.e., $\eta \propto 1/\tau$) applies to the large-batch learning rate. We demonstrate the validity of our inversely linear learning rate scaling by comparing our method with one that does not account for the delay $\tau$. The results are presented in Table 1. We can see that using the small-batch regime with delay (ASGD) does not converge at all. With our inversely linear learning rate scaling (+LR), there is a small gap from the equivalent large batch. If we also increase the budget of epochs by 30%, our optimization regime generalizes better than the large batch (+30%). Because ASGD has better runtime performance compared to SSGD (Dutta et al., 2018; Assran et al., 2019), such an excess in epochs may still result in improved overall wall-clock time (this depends on the hardware details).
Network | Dataset | SB | LB | ASGD | +LR | +30%
Resnet44 (He et al., 2016) | Cifar10 | 92.87% | 90.42% | 10.0% | 88.57% | 91.65%
VGG11 (Simonyan & Zisserman, 2014) | Cifar10 | 89.8% | 84.55% | 10.0% | 61.41% | 84.74%
C3 (Keskar et al., 2016) | Cifar100 | 60.0% | 57.89% | 1.0% | 54.02% | 58.94%
WResnet164 (Zagoruyko & Komodakis, 2016) | Cifar100 | 76.78% | 72.85% | 17.8% | 70.35% | 73.81%
ImageNet.
We next experiment with ResNet50 and ImageNet, which is a popular benchmark for large-batch and asynchronous training because of its complexity and scale. Our focus is asynchronous centralized training with high delay. To the best of our knowledge, the generalization gap has not been closed yet in this setting. This is an important setting because of its simplicity and its scalability potential. Optimization details can be found in appendix E.1. In Fig. 5, we compare three ASGD optimizations trained on ImageNet. As can be seen in the figure, the generalization gap decreases substantially just by turning off momentum. This corresponds with our analysis and previous results about the relation between momentum and delay. Shifted momentum similarly improves generalization, as can be seen in the right panel. This validates our analysis of the beneficial role the momentum parameter has when using shifted momentum. The learning rate is first scaled linearly with the total batch size of all the workers, as in Goyal et al. (2017). Our inverse linear scaling with the delay cancels this increase, and we get a learning rate similar to regular small-batch training. In contrast, when we kept the linearly scaled learning rate without the inverse scaling with the delay, training did not converge at all.
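The learning-rate bookkeeping described above amounts to a one-line rule; the sketch below is our own illustration, and assumes the average delay in centralized ASGD is roughly the number of workers:

```python
def asgd_learning_rate(eta_small, workers):
    """Linear batch-size scaling (Goyal et al., 2017) followed by the inverse
    scaling with the delay; assumes the average delay with a parameter server
    is roughly the number of workers."""
    eta_large = eta_small * workers   # linear scaling with the total batch size
    tau = workers                     # approximate delay in centralized ASGD
    return eta_large / tau            # inverse linear scaling with the delay

print(asgd_learning_rate(0.1, 8))  # back to the small-batch rate, 0.1
```

When the delay matches the number of workers, the two scalings cancel and we recover the small-batch learning rate, as described in the text.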
References
 Amodei & Hernandez (2018) Dario Amodei and Danny Hernandez. Ai and compute. https://openai.com/blog/aiandcompute/, 2018. Accessed: 20190523.
 Arjevani et al. (2018) Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates. jun 2018.

 Assran et al. (2019) Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic Gradient Push for Distributed Deep Learning. 2019.
 Dai et al. (2018) Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric Xing. Toward Understanding the Impact of Staleness in Distributed Machine Learning, sep 2018.
 Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. Large scale distributed deep networks. Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Ieee, 2009.
 Dutta et al. (2018) Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and Stale Gradients Can Win the Race: ErrorRuntime Tradeoffs in Distributed SGD. mar 2018.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pp. 1731–1741, 2017.
 Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. sep 2016.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Lecun et al. (1998) Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 LeCun et al. (2012) Yann A LeCun, Léon Bottou, Genevieve B Orr, and KlausRobert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.

 Li (2014) Mu Li. Scaling Distributed Machine Learning with the Parameter Server. Proceedings of the 2014 International Conference on Big Data Science and Computing (BigDataScience '14), pp. 1–1, 2014.
 Lian et al. (2015) Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.
 Lian et al. (2016) Xiangru Lian, Huan Zhang, ChoJui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zerothorder to firstorder. In Advances in Neural Information Processing Systems, pp. 3054–3062, 2016.
 Lian et al. (2017) Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
 Liu et al. (2018) Tianyi Liu, Shiyang Li, Jianping Shi, Enlu Zhou, and Tuo Zhao. Towards understanding acceleration tradeoff between momentum and asynchrony in nonconvex stochastic optimization. In Advances in Neural Information Processing Systems, pp. 3686–3696, 2018.
 Mitliagkas et al. (2017) Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Re. Asynchrony begets momentum, with an application to deep learning. 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016, pp. 997–1004, 2017.
 Nar & Sastry (2018) Kamil Nar and S Shankar Sastry. Step Size Matters in Deep Learning. In NIPS, 2018.
 Oppenheim (1999) Alan V Oppenheim. Discretetime signal processing. Pearson Education India, 1999.
 Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.
 Shallue et al. (2018) Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha SohlDickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Wu et al. (2018) Lei Wu, Chao Ma, and Weinan E. How SGD Selects the Global Minima in Overparameterized Learning: A Dynamical Stability Perspective. In NIPS, 2018.
 Yan et al. (2018) Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning. arXiv preprint arXiv:1808.10396, 2018.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Zhang et al. (2015) Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Stalenessaware asyncsgd for distributed deep learning. arXiv preprint arXiv:1511.05950, 2015.
Appendix A Stability Analysis for Eq. 2
In this section, we would like to analyze the stability of eq. 2. In particular, we would like to examine if a given minimum point $\theta^*$ is stable. To analyze this we suppose we are in the vicinity of $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around $\theta^*$:
$$\theta_{t+1} = \theta_t - \eta\left[\nabla f_{n(t)}(\theta^*) + \nabla^2 f_{n(t)}(\theta^*)\left(\theta_{t-\tau} - \theta^*\right)\right]\,,$$
where we denoted by $\nabla^2 f_{n(t)}(\theta^*)$ the Hessian of the sampled loss at $\theta^*$ and changed the notation of the time index in $\theta(t)$ to the subscript, i.e., $\theta_t$.
Examining the expectation of this equation we obtain
$$\mathbb{E}\left[\theta_{t+1}\right] = \mathbb{E}\left[\theta_t\right] - \eta H\, \mathbb{E}\left[\theta_{t-\tau} - \theta^*\right]\,,$$
where we used $\mathbb{E}_n\left[\nabla f_n(\theta^*)\right] = \nabla\mathcal{L}(\theta^*)$, the fact that $\theta^*$ is a minimum point and thus $\nabla\mathcal{L}(\theta^*) = 0$, and denoted $H \triangleq \mathbb{E}_n\left[\nabla^2 f_n(\theta^*)\right] = \nabla^2\mathcal{L}(\theta^*)$.
Using the variable change $x_t \triangleq \mathbb{E}\left[\theta_t\right] - \theta^*$ we obtain
$$x_{t+1} = x_t - \eta H\, x_{t-\tau}\,. \qquad (12)$$
From the Spectral Factorization Theorem we can write $H = Q \Lambda Q^{\top}$, where $Q$ is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ is a diagonal matrix. Substituting this into eq. 12 we obtain:
$$x_{t+1} = x_t - \eta\, Q \Lambda Q^{\top} x_{t-\tau}\,.$$
Multiplying the last equation by $Q^{\top}$ from the left, denoting $y_t \triangleq Q^{\top} x_t$, and using the fact that $Q$ is an orthogonal matrix, i.e., $Q^{\top} Q = I$, we get:
$$y_{t+1} = y_t - \eta \Lambda\, y_{t-\tau}\,.$$
Recall that $\Lambda$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$$y_{t+1}^{(i)} = y_t^{(i)} - \eta \lambda_i\, y_{t-\tau}^{(i)}\,, \quad i = 1, \dots, d\,,$$
where $y_t^{(i)}$ denotes the $i$-th component of the vector $y_t$. In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that
$$y_{t+1}^{(1)} = y_t^{(1)} - \eta \lambda_1\, y_{t-\tau}^{(1)} \qquad (13)$$
is stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$. This is true since the dynamics for $\lambda_1$ are the first to lose stability.
To simplify notations we define $m \triangleq \eta h$ and the sharpness term $h \triangleq \lambda_1$, the maximal singular value of $\nabla^2\mathcal{L}(\theta^*)$. Using these notations, we can rewrite eq. 13 as:
$$x_{t+1} = x_t - m\, x_{t-\tau}\,. \qquad (14)$$
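The reduction above, from the $d$-dimensional delayed system to independent per-eigenvalue scalar systems, can be sanity-checked numerically; a small sketch (our own check) with a random symmetric $H$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, tau, eta, steps = 4, 3, 0.005, 200

# Random positive-definite "Hessian" H = Q diag(lam) Q^T.
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)
lam, Q = np.linalg.eigh(H)

# Full d-dimensional delayed dynamics: x_{t+1} = x_t - eta * H x_{t-tau}.
x = [rng.standard_normal(d)] * (tau + 1)
for _ in range(steps):
    x.append(x[-1] - eta * H @ x[-(tau + 1)])

# Decoupled scalar dynamics on y = Q^T x, one recursion per eigenvalue.
y = [Q.T @ x[0]] * (tau + 1)
for _ in range(steps):
    y.append(y[-1] - eta * lam * y[-(tau + 1)])

# Rotating back recovers the full trajectory, up to round-off.
print(np.max(np.abs(Q @ y[-1] - x[-1])))
```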
Appendix B Finding the Condition for the Maximal Root of the Characteristic Equation (Eq. 4) to be on the unit circle
To simplify the notation, we denote $m \triangleq \eta h$. Recall the characteristic equation (eq. 5)
$$z^{\tau+1} - z^{\tau} + m = 0\,.$$
Since we are looking for a solution on the unit circle, we substitute $z = e^{i\omega}$ into the equation, where $i$ is the unit imaginary number and $\omega$ is some angle. We obtain:
$$e^{i(\tau+1)\omega} - e^{i\tau\omega} + m = 0\,.$$
Using $e^{i\omega} = \cos\omega + i\sin\omega$ we get
$$\cos\left((\tau+1)\omega\right) - \cos\left(\tau\omega\right) + m = 0\,, \qquad \sin\left((\tau+1)\omega\right) - \sin\left(\tau\omega\right) = 0\,.$$
Using trigonometric product-sum identities we obtain
$$m = 2\sin\left(\frac{(2\tau+1)\omega}{2}\right)\sin\left(\frac{\omega}{2}\right)\,, \qquad 2\cos\left(\frac{(2\tau+1)\omega}{2}\right)\sin\left(\frac{\omega}{2}\right) = 0\,. \qquad (15)$$
If $\sin(\omega/2) = 0$ then we get from the first equation $m = 0$. This is a contradiction since we assume that $m > 0$. Thus, $\sin(\omega/2) \neq 0$. This implies from the second equation in eq. 15 that
$$\cos\left(\frac{(2\tau+1)\omega}{2}\right) = 0 \;\Rightarrow\; \omega = \frac{(2k+1)\pi}{2\tau+1}\,, \quad k \in \mathbb{Z}\,.$$
Substituting this result into the first equation in eq. 15 and using $m > 0$ we obtain that the possible solutions must satisfy
$$m = 2\sin\left(\frac{\omega}{2}\right)\,, \qquad \omega = \frac{(4k+1)\pi}{2\tau+1}\,.$$
Also, by symmetry we can eliminate symmetric solutions and restrict to $\omega \in (0, \pi]$. Note that for $\omega \in (0, \pi]$, $\sin(\omega/2)$ is monotonically increasing with $\omega$. Since we are interested in the smallest value of $m$ for which a root first reaches the unit circle, we are looking for the minimal $\omega$ value. This value corresponds to $k = 0$, i.e., $\omega = \pi/(2\tau+1)$. Thus, the maximal root satisfies:
$$m = \eta h = 2\sin\left(\frac{\pi}{2(2\tau+1)}\right)\,.$$
Appendix C Stability Analysis for Eq. 2 with Stochastic Delay
We assume that $\tau$ is drawn from some discrete distribution, independently at each iteration. In this section, we would like to analyze the stability of eq. 2 in this setting. In particular, we would like to examine if a given minimum point $\theta^*$ is stable. To analyze this we suppose we are in the vicinity of $\theta^*$, which implies $\nabla\mathcal{L}(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around $\theta^*$:
$$\theta_{t+1} = \theta_t - \eta\left[\nabla f_{n(t)}(\theta^*) + \nabla^2 f_{n(t)}(\theta^*)\left(\theta_{t-\tau} - \theta^*\right)\right]\,,$$
where we changed the notation of the time index in $\theta(t)$ to the subscript, i.e., $\theta_t$. Examining the expectation of this equation (over both the sampling and the delay) we obtain
$$\mathbb{E}\left[\theta_{t+1}\right] = \mathbb{E}\left[\theta_t\right] - \eta H \sum_{k} p_k\, \mathbb{E}\left[\theta_{t-k} - \theta^*\right]\,,$$
where $p_k \triangleq \Pr(\tau = k)$, we used $\mathbb{E}_n\left[\nabla f_n(\theta^*)\right] = \nabla\mathcal{L}(\theta^*)$ together with the independence of the delay, the fact that $\theta^*$ is a minimum point and thus $\nabla\mathcal{L}(\theta^*) = 0$, and denoted $H \triangleq \nabla^2\mathcal{L}(\theta^*)$.
Using the variable change $x_t \triangleq \mathbb{E}\left[\theta_t\right] - \theta^*$ we obtain
$$x_{t+1} = x_t - \eta H \sum_{k} p_k\, x_{t-k}\,. \qquad (16)$$
From the Spectral Factorization Theorem we can write $H = Q \Lambda Q^{\top}$, where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Substituting this into eq. 16, multiplying by $Q^{\top}$ from the left, denoting $y_t \triangleq Q^{\top} x_t$, and using the fact that $Q$ is an orthogonal matrix, i.e., $Q^{\top} Q = I$, we get:
$$y_{t+1} = y_t - \eta \Lambda \sum_{k} p_k\, y_{t-k}\,.$$
Recall that $\Lambda$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$$y_{t+1}^{(i)} = y_t^{(i)} - \eta \lambda_i \sum_{k} p_k\, y_{t-k}^{(i)}\,,$$
where $y_t^{(i)}$ denotes the $i$-th component of the vector $y_t$.
In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that
$$y_{t+1}^{(1)} = y_t^{(1)} - \eta \lambda_1 \sum_{k} p_k\, y_{t-k}^{(1)} \qquad (17)$$
is stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$. This is true since the dynamics for $\lambda_1$ are the first to lose stability. To simplify notations we define $m \triangleq \eta h$ and the sharpness term $h \triangleq \lambda_1$, the maximal singular value of $\nabla^2\mathcal{L}(\theta^*)$. Using these notations, we can rewrite eq. 17 as:
$$x_{t+1} = x_t - m \sum_{k=0}^{K} p_k\, x_{t-k}\,, \qquad (18)$$
where $K$ is the maximal delay. The characteristic equation of eq. 18 is
$$z^{K+1} - z^{K} + m \sum_{k=0}^{K} p_k\, z^{K-k} = 0\,. \qquad (19)$$
C.1 Stability Analysis
We would like to find conditions describing when and how we can change the hyperparameters while maintaining the same stability criterion (definition 1). In order to analyze stability we examine the characteristic equation in eq. 19 and ask: when does the maximal root of this polynomial have unit magnitude?

- Discrete uniform distribution, $p_k = \frac{1}{K+1}$ for $k = 0, \dots, K$:
$$z^{K+1} - z^{K} + \frac{m}{K+1} \sum_{k=0}^{K} z^{K-k} = 0\,. \qquad (20)$$
- Gaussian distribution discrete approximation: $p_k \propto \exp\left(-\frac{(k-\mu)^2}{2\sigma^2}\right)$, normalized so that $\sum_{k=0}^{K} p_k = 1$, where $\mu$ is the mean delay:
$$z^{K+1} - z^{K} + m \sum_{k=0}^{K} p_k\, z^{K-k} = 0\,.$$
Appendix D Theoretical Analysis with Momentum
In this section, we repeat the steps of the analysis in section 2 for ASGD with momentum.
d.1 Preliminaries and Problem Setup
Consider the problem of minimizing the empirical loss
$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(\theta; x_i)$ (21)
where $n$ is the number of samples, and $f(\cdot\,; x_i): \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function. The update rule for minimizing eq. 21 using asynchronous stochastic gradient descent with momentum (AMSGD) is given by
$u_{t+1} = \mu u_t - \eta \nabla f(\theta_{t-\tau}; x_{i_t}), \qquad \theta_{t+1} = \theta_t + u_{t+1},$
or
$\theta_{t+1} = \theta_t + \mu(\theta_t - \theta_{t-1}) - \eta \nabla f(\theta_{t-\tau}; x_{i_t}),$ (22)
where $\mu$ is the momentum, $\eta$ is the learning rate, $i_t$ is a random selection of a sample from $\{1, \ldots, n\}$ at iteration $t$, and $\tau$ is some delay due to the asynchronous nature of our training.
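To make the delayed update of eq. 22 concrete, the following toy example (an illustrative sketch using an arbitrary one-dimensional quadratic $f(\theta) = \tfrac{1}{2}\lambda\theta^2$, not an experiment from the paper) runs the AMSGD recursion with a stale gradient and checks whether it converges to the minimum at zero:

```python
def amsgd_quadratic(lr, mu, tau, lam=1.0, steps=3000):
    """Run theta_{t+1} = theta_t + mu*(theta_t - theta_{t-1}) - lr*grad(theta_{t-tau})
    for f(theta) = 0.5 * lam * theta^2 (so grad(theta) = lam * theta).
    Returns the final distance from the minimum (inf if diverged)."""
    th = [1.0] * (max(tau, 1) + 1)     # th[-1] = theta_t, th[-2] = theta_{t-1}
    for _ in range(steps):
        grad = lam * th[-1 - tau]      # stale gradient, evaluated tau steps back
        new = th[-1] + mu * (th[-1] - th[-2]) - lr * grad
        th.append(new)
        th.pop(0)
        if abs(new) > 1e12:            # clearly diverging; stop early
            return float('inf')
    return abs(th[-1])

print(amsgd_quadratic(lr=0.1, mu=0.5, tau=2) < 1e-6)           # True: converges
print(amsgd_quadratic(lr=1.0, mu=0.5, tau=2) == float('inf'))  # True: diverges
```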
Our main goal is understanding how different factors, such as the gradient staleness $\tau$, and hyperparameters, such as the momentum $\mu$ and the learning rate $\eta$, affect the minima selection process in neural networks. In other words, we would like to understand how the different hyperparameters interact to divide the global minima of our network into two sets: those we can converge to and those we cannot converge to.
To analyze this we suppose we are in the vicinity of a minimum point $\theta^*$, which implies $\nabla L(\theta^*) = 0$, and ask whether or not we can converge to $\theta^*$ using the theory of dynamical stability. Formally, consider the linearized dynamics around some minimum point $\theta^*$:
$\theta_{t+1} = \theta_t + \mu(\theta_t - \theta_{t-1}) - \eta H(\theta_{t-\tau} - \theta^*)$
Examining the first moment of this equation we obtain
$\mathbb{E}[\theta_{t+1}] = \mathbb{E}[\theta_t] + \mu\left(\mathbb{E}[\theta_t] - \mathbb{E}[\theta_{t-1}]\right) - \eta H\left(\mathbb{E}[\theta_{t-\tau}] - \theta^*\right),$
where $H = \nabla^2 L(\theta^*)$, since $\theta^*$ is a minimum point and $\nabla L(\theta^*) = 0$.
Using the variable change $v_t = \mathbb{E}[\theta_t] - \theta^*$ we obtain
$v_{t+1} = v_t + \mu(v_t - v_{t-1}) - \eta H v_{t-\tau}$ (23)
From the Spectral Factorization Theorem we can write $H = U \Lambda U^{\top}$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Substituting this into eq. 23 we obtain:
$v_{t+1} = v_t + \mu(v_t - v_{t-1}) - \eta U \Lambda U^{\top} v_{t-\tau}$
Multiplying the last equation with $U^{\top}$ from the left, denoting $w_t = U^{\top} v_t$, and using the fact that $U$ is an orthogonal matrix, i.e., $U^{\top} U = I$, we get:
$w_{t+1} = w_t + \mu(w_t - w_{t-1}) - \eta \Lambda w_{t-\tau}$
Recall that $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$ is a diagonal matrix. Therefore, analyzing the dynamics of the last equation is equivalent to analyzing the dynamics of the following one-dimensional equations:
$w_{t+1}^{(i)} = w_t^{(i)} + \mu\left(w_t^{(i)} - w_{t-1}^{(i)}\right) - \eta \lambda_i w_{t-\tau}^{(i)}, \quad i = 1, \ldots, d,$
where $w^{(i)}$ denotes the $i$-th component of the vector $w$.
In order to ensure stability of the $d$-dimensional dynamical system, it is sufficient to require that the dynamics of
$w_{t+1}^{(1)} = w_t^{(1)} + \mu\left(w_t^{(1)} - w_{t-1}^{(1)}\right) - \eta \lambda_1 w_{t-\tau}^{(1)}$ (24)
are stable, where we assume without loss of generality $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. This is true since the dynamics for $\lambda_1$ is the first one that loses stability.
To simplify notations we define $m \triangleq \eta \lambda_1$, with the sharpness term $\lambda_1$ being the maximal singular value of $H$. Using these notations and dropping the superscript for brevity, we can rewrite eq. 24 as:
$w_{t+1} = (1+\mu) w_t - \mu w_{t-1} - m w_{t-\tau}$ (25)
The characteristic equation of this difference equation is
$z^{\tau+1} - (1+\mu) z^{\tau} + \mu z^{\tau-1} + m = 0, \quad \tau \geq 1$ (26)
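As a consistency check (a sketch, not material from the paper, assuming the characteristic equation takes the form $z^{\tau+1} - (1+\mu)z^{\tau} + \mu z^{\tau-1} + m = 0$, which follows from the update rule above): setting $\mu = 0$ recovers the no-momentum equation $z^{\tau+1} - z^{\tau} + m = 0$, whose threshold root lies on the unit circle at angle $\pi/(2\tau+1)$ when $m = 2\cos(\tau\pi/(2\tau+1))$:

```python
import cmath
import math

def char_poly(z, m, mu, tau):
    """Left-hand side of the characteristic equation
    z^(tau+1) - (1+mu) z^tau + mu z^(tau-1) + m, for tau >= 1."""
    return z ** (tau + 1) - (1 + mu) * z ** tau + mu * z ** (tau - 1) + m

tau = 3
theta = math.pi / (2 * tau + 1)                       # angle of the threshold root
m_crit = 2 * math.cos(tau * math.pi / (2 * tau + 1))  # no-momentum threshold
z = cmath.exp(1j * theta)                             # candidate root on the unit circle
print(abs(char_poly(z, m_crit, 0.0, tau)))            # ~0: z is indeed a root
```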
d.2 Stability Analysis
We would like to find conditions on how we can change the hyperparameters while maintaining the same stability criterion (definition 1). In order to analyze stability we examine the characteristic equation in eq. 26 and ask: when does the maximal root of this polynomial have unit magnitude?
d.2.1 The Interaction Between Learning Rate, Delay, and Momentum
For the general case and different values of momentum, we numerically evaluate the value of $m$ for which the maximal root of the general characteristic equation (eq. 26) is on the unit circle, i.e., when we are on the threshold of stability. We show the results in Fig. 10. We observe that there is still an inverse relation between the learning rate and the delay when using momentum. Specifically, we observe that:

We need to decrease the learning rate as the delay increases.

The lines become steeper for larger values of momentum. This implies that, for a given delay, maintaining stability with a larger momentum requires a smaller learning rate. Therefore, it is easier to maintain stability with $\mu = 0$.
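A numerical evaluation of this kind can be sketched as follows (an illustrative reimplementation, not the paper's code): instead of computing polynomial roots, the recurrence $w_{t+1} = (1+\mu)w_t - \mu w_{t-1} - m w_{t-\tau}$ is simulated directly, and the critical $m$ is located by bisection, so the reported values are approximate:

```python
def stable(m, mu, tau, steps=10000):
    """Simulate w_{t+1} = (1+mu) w_t - mu w_{t-1} - m w_{t-tau}."""
    w = [1.0] * (max(tau, 1) + 1)
    for _ in range(steps):
        new = (1 + mu) * w[-1] - mu * w[-2] - m * w[-1 - tau]
        w.append(new)
        w.pop(0)
        if abs(new) > 1e12:          # clearly diverging; stop early
            return False
    return abs(w[-1]) < 1e-2         # decayed from the initial amplitude of 1

def critical_m(mu, tau):
    """Bisect for the largest m that the simulation still classifies as stable."""
    lo, hi = 0.0, 2.5
    for _ in range(30):
        mid = (lo + hi) / 2
        if stable(mid, mu, tau):
            lo = mid
        else:
            hi = mid
    return lo

for mu in (0.0, 0.5, 0.9):
    # the critical m shrinks as the delay grows, and faster for larger momentum
    print(mu, [round(critical_m(mu, tau), 3) for tau in (1, 2, 4)])
```

The printed rows reproduce the qualitative trend discussed above: for each momentum the admissible $m$ decreases with $\tau$, and the decrease is steeper for larger $\mu$.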
d.3 Shifted Momentum
In this section, we repeat the steps of the analysis in section D.1 for ASGD with shifted momentum. The update rule for minimizing eq. 21 using asynchronous stochastic gradient descent with shifted momentum is given by
or
Following the same steps as in section D.1, we obtain the following one-dimensional equation:
The characteristic equation of the last equation is:
Appendix E Experiments Details
We experimented with a set of popular classification tasks: MNIST (Lecun et al., 1998), CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). In order to incorporate delay in the experiments, we keep $\tau$ replicas of the model parameters and perform an SGD step with a different replica at a time, according to a round-robin schedule. The implementation is done with the PyTorch framework. Section C in the appendix shows that constant delay, i.e., round robin, acts similarly in terms of stability to several other delay distributions. In addition, Zhang et al. (2015) show empirically with a realistic distribution that a constant delay is a good model for the delay distribution. Fig. 9 shows empirically that round-robin scheduling works in a similar way to a discrete Gaussian distribution.
Although stochastic gradient descent with momentum was originally introduced with a dampening term, many drop the latter when using momentum SGD. This is true for most DNN frameworks, such as Caffe, PyTorch, and TensorFlow, where the dampening term is set to zero by default. As Shallue et al. (2018); Yan et al. (2018) pointed out, without dampening, the momentum scales the learning rate so that the effective learning rate is equal to $\eta/(1-\mu)$ after sufficiently many iterations. Because we alter the momentum value in our experiments, we use dampening and scale the learning rate accordingly to keep the effective learning rate the same as without dampening.
e.1 ImageNet Optimization Details
Our baseline is the large-batch training of Goyal et al. (2017). They trained ResNet50 for 90 epochs with a large batch of 8192 samples and reached the same 23.8% top-1 error as small-batch training. They used linear scaling of the learning rate with the batch size (scaling the learning rate by the large batch size divided by the baseline batch size) and a warmup of the learning rate during the first 5 epochs. We use 32 workers, each with a minibatch size of 256.
Appendix F Minima Stability Experiments
In this section, we provide additional details about the minima stability experiment presented in section 3. As discussed in section 3, we are interested in examining for which $(\eta, \tau)$ pairs the minima stability remains the same.
In Fig. 12 we show the validation error of such pairs as a function of epochs. We note that these graphs are from the same experiment as in Fig. 2. As can be seen, for larger learning rates it takes fewer epochs to leave the minimum. It is interesting to see that after leaving the minimum, the ASGD algorithm converges again to a minimum with generalization as good as the baseline (at $\tau = 0$). This suggests that the minima selection process of ASGD is affected by the whole optimization path. In other words, suppose we start the optimization from a minimum with good generalization (since it was selected using optimization with $\tau = 0$), and then it becomes unstable due to a change in the values of $(\eta, \tau)$, as in this experiment. The results in Fig. 12 suggest we typically converge to a stable minimum with similar generalization properties, possibly near the original minimum. In contrast, if we start to train from scratch using the same $(\eta, \tau)$ pair which lost stability in our experiment, we typically get a generalization gap (as observed in our experiments). This suggests the optimization might have followed a very different path from the start, leading to regions with worse generalization than the original minimum.