DeepAI AI Chat
Log In Sign Up

Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia

by   Zeke Xie, et al.

Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating training of deep neural networks. But Adam often generalizes significantly worse than Stochastic Gradient Descent (SGD). It is still mathematically unclear how Adaptive Learning Rate and Momentum affect saddle-point escaping and minima selection. Based on the diffusion theoretical framework, we separate the effects of Adaptive Learning Rate and Momentum on saddle-point escaping and minima selection. We find that SGD escapes saddle points very slowly along the directions of small-magnitude eigenvalues of the Hessian. We prove that Adaptive Learning Rate can make learning dynamics near saddle points approximately Hessian-independent, but cannot select flat minima as SGD does. In contrast, Momentum provides a momentum drift effect to help passing through saddle points, and almost does not affect flat minima selection. This mathematically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Motivated by the diffusion theoretical analysis, we design a novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD. Our real-world experiments demonstrate that Adai can converge similarly fast to Adam, but generalize significantly better. Adai even generalizes better than SGD.


Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information t...

On the Hyperparameters in Stochastic Gradient Descent with Momentum

Following the same routine as [SSJ20], we continue to present the theore...

Does Adam optimizer keep close to the optimal point?

The adaptive optimizer for training neural networks has continually evol...

Reverse engineering learned optimizers reveals known and novel mechanisms

Learned optimizers are algorithms that can themselves be trained to solv...

YellowFin and the Art of Momentum Tuning

Hyperparameter tuning is one of the big costs of deep learning. State-of...

Equilibrated adaptive learning rates for non-convex optimization

Parameter-specific adaptive learning rate methods are computationally ef...

Code Repositories


The PyTorch Implementation of Adaptive Inertia Methods. The algorithms are based on the paper: "Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia".

view repo