Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness

06/27/2022
by Hideaki Iiduka, et al.

Convergence and convergence-rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. These analyses assume that the expected or empirical average loss function is Lipschitz smooth (i.e., that its gradient is Lipschitz continuous) and that the learning rates depend on the Lipschitz constant of that gradient. Meanwhile, numerical evaluations of Adam and its variants have shown that using small constant learning rates that do not depend on the Lipschitz constant, together with hyperparameters (β_1 and β_2) close to one, is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, relying on the Lipschitz smoothness condition is unrealistic. This paper provides theoretical analyses of Adam without assuming Lipschitz smoothness, in order to bridge the gap between theory and practice. The main contribution is theoretical evidence that Adam performs well when it uses small learning rates and hyperparameters close to one, whereas previous theoretical results were all for hyperparameters close to zero. Our analysis also leads to the finding that Adam performs well with large batch sizes. Moreover, we show that Adam performs well when it uses diminishing learning rates and hyperparameters close to one.
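For reference, here is a minimal NumPy sketch of the Adam update in the regime the abstract describes: a small constant learning rate and hyperparameters β_1, β_2 close to one, applied with a large batch so the stochastic gradient noise is small. The toy objective, batch size, and specific hyperparameter values below are illustrative assumptions, not values taken from the paper.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.99, beta2=0.999, eps=1e-8):
    # One Adam update; beta1 and beta2 close to one, small constant alpha
    # (illustrative values, not the paper's prescriptions).
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy objective f(theta) = ||theta||^2 with noisy gradients;
# averaging over a large batch reduces the gradient noise.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    batch_noise = rng.normal(size=(1024, 5)).mean(axis=0)  # large batch
    grad = 2 * theta + batch_noise
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # parameters driven close to zero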


Related research

08/21/2022 · Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One
Practical results have shown that deep learning optimizers using small c...

03/07/2017 · Global optimization of Lipschitz functions
The goal of the paper is to design sequential strategies which lead to e...

03/02/2022 · A Quantitative Geometric Approach to Neural Network Smoothness
Fast and precise Lipschitz constant estimation of neural networks is an ...

02/26/2020 · Lipschitz standardization for robust multivariate learning
Current trends in machine learning rely on out-of-the-box gradient-based...

02/20/2020 · Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets
We study distributed optimization algorithms for minimizing the average ...

05/28/2019 · Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition
We provide a theoretical explanation for the fast convergence of gradien...

05/14/2019 · Plug-and-Play Methods Provably Converge with Properly Trained Denoisers
Plug-and-play (PnP) is a non-convex framework that integrates modern den...
