Non-Convergence and Limit Cycles in the Adam optimizer

10/05/2022
by Sebastian Bock, et al.

One of the most popular training algorithms for deep neural networks is Adaptive Moment Estimation (Adam), introduced by Kingma and Ba. It belongs to the class of adaptive gradient methods, which try to estimate a suitable learning rate and/or search direction from the training data in order to improve the learning process compared to plain gradient descent with a fixed learning rate. Despite its success in many applications, there is no satisfactory convergence analysis: only local convergence can be shown in batch mode under some restrictions on the hyperparameters, and counterexamples exist for incremental mode. Recent results show that for simple quadratic objective functions, limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters and only for the algorithm without bias correction. We extend the convergence analysis of Adam in batch mode with bias correction and show that even for quadratic objective functions, the simplest case of convex functions, 2-limit-cycles exist for all choices of the hyperparameters. We analyze the stability of these limit cycles and relate our analysis to other results in which approximate convergence was shown, but under the additional assumption of bounded gradients, which does not hold for quadratic functions. The investigation relies heavily on computer algebra due to the complexity of the equations.
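To make the setting concrete, here is a minimal sketch (not the authors' code) of full-batch Adam with bias correction applied to a one-dimensional quadratic f(w) = 0.5·lam·w², whose exact gradient is lam·w. The hyperparameter names follow Kingma and Ba's notation; the default values below are the usual ones and are an assumption, not taken from the paper. Inspecting the tail of the trajectory is one way to look for the period-2 behavior the abstract describes: a 2-limit-cycle would show up as the iterates alternating between two values ±w* instead of settling at the minimizer w = 0. Whether a given trajectory is actually attracted to such a cycle depends on the stability analysis in the paper.

```python
import math

def adam_on_quadratic(w0=1.0, lam=1.0, alpha=1e-2, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=200_000, tail=6):
    """Run full-batch Adam with bias correction on f(w) = 0.5 * lam * w**2
    and return the last `tail` iterates for inspection."""
    w, m, v = float(w0), 0.0, 0.0
    last = []
    for t in range(1, steps + 1):
        g = lam * w                            # exact (batch-mode) gradient
        m = beta1 * m + (1.0 - beta1) * g      # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g**2   # second-moment estimate
        m_hat = m / (1.0 - beta1**t)           # bias-corrected first moment
        v_hat = v / (1.0 - beta2**t)           # bias-corrected second moment
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        if t > steps - tail:
            last.append(w)
    return last

# If the iterates have been captured by a 2-limit-cycle, the printed tail
# alternates in sign with roughly constant magnitude rather than shrinking to 0.
print(adam_on_quadratic())
```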


