YellowFin and the Art of Momentum Tuning

06/12/2017
by Jian Zhang, et al.

Hyperparameter tuning is one of the big costs of deep learning. State-of-the-art optimizers, such as Adagrad, RMSProp and Adam, make things easier by adaptively tuning an individual learning rate for each variable. This level of fine adaptation is understood to yield a more powerful method. However, our experiments, as well as recent theory by Wilson et al., show that hand-tuned stochastic gradient descent (SGD) achieves better results, at the same rate or faster. The hypothesis put forth is that adaptive methods converge to different minima (Wilson et al.). Here we point out another factor: none of these methods tune their momentum parameter, which is known to be very important for deep learning applications (Sutskever et al.). Tuning the momentum parameter becomes even more important in asynchronous-parallel systems: recent theory (Mitliagkas et al.) shows that asynchrony introduces momentum-like dynamics, and that tuning down algorithmic momentum is important for efficient parallelization. We revisit the simple momentum SGD algorithm and show that hand-tuning a single learning rate and momentum value makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for both momentum and learning rate in SGD. YellowFin optionally uses a novel momentum-sensing component along with a negative-feedback loop mechanism to compensate for the added dynamics of asynchrony on the fly. We empirically show that YellowFin converges in fewer iterations than Adam on large ResNet and LSTM models, with speedups of up to 2.8x in synchronous and 2.7x in asynchronous settings.
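For context, the momentum SGD update the abstract revisits exposes exactly two scalar hyperparameters, a single learning rate and a single momentum value. Below is a minimal NumPy sketch of that heavy-ball update; the function name, the toy quadratic objective, and the hyperparameter values are illustrative assumptions, not taken from the paper or from YellowFin's tuning rules.

    import numpy as np

    def momentum_sgd(grad_fn, x0, lr=0.05, momentum=0.9, steps=200):
        """Momentum SGD with one global learning rate and one momentum value."""
        x = np.asarray(x0, dtype=float).copy()
        v = np.zeros_like(x)            # velocity (momentum) buffer
        for _ in range(steps):
            g = grad_fn(x)              # (stochastic) gradient at the current point
            v = momentum * v - lr * g   # accumulate the heavy-ball term
            x = x + v                   # apply the update
        return x

    # Toy usage: minimize the quadratic f(x) = 0.5 * x^T A x, whose minimizer is the origin.
    A = np.diag([1.0, 10.0])
    x_min = momentum_sgd(lambda x: A @ x, x0=[5.0, 5.0], lr=0.05, momentum=0.9)

Hand-tuning the two scalars `lr` and `momentum` in a loop like this is the baseline the paper argues can match Adam; YellowFin's contribution is to set both of them automatically during training.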


