Extending AdamW by Leveraging Its Second Moment and Magnitude

12/09/2021
by Guoqiang Zhang, et al.

Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution of a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution, and the same convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, referred to as Aida, that extends AdamW in two aspects with the aim of relaxing the small-learning-rate requirement for local stability. Firstly, we track the 2nd moment r_t of the pth power of the gradient magnitudes; r_t reduces to v_t of AdamW when p=2. Secondly, letting m_t denote the first moment of AdamW, we note that the update direction m_{t+1}/(v_{t+1}+epsilon)^{1/2} of AdamW (or m_{t+1}/(v_{t+1}^{1/2}+epsilon) of Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of magnitudes |m_{t+1}|/(v_{t+1}+epsilon)^{1/2} (or |m_{t+1}|/(v_{t+1}^{1/2}+epsilon)). Aida instead computes the qth power of the magnitude in the form |m_{t+1}|^q/(r_{t+1}+epsilon)^{q/p} (or |m_{t+1}|^q/(r_{t+1}^{q/p}+epsilon)), which reduces to that of AdamW when (p,q)=(2,1). Suppose the origin 0 is a local optimal solution of a twice-differentiable function. It is shown theoretically that when q>1 and p>1 in Aida, the origin 0 is locally stable only when the weight decay is non-zero. Experiments are conducted on ten toy optimisation problems and on training a Transformer and a Swin-Transformer for two deep learning (DL) tasks. The empirical study demonstrates that in a number of scenarios (including the two DL tasks), Aida with particular setups of (p,q) not equal to (2,1) outperforms the AdamW setup (p,q)=(2,1).
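To make the update rule concrete, the following is a minimal NumPy sketch of a single Aida step as described above, with bias correction omitted; the function name aida_step and the default hyperparameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def aida_step(theta, grad, m, r, lr=1e-3, beta1=0.9, beta2=0.999,
              p=2.0, q=1.0, eps=1e-8, weight_decay=1e-2):
    """One Aida-style update step (illustrative sketch, no bias correction)."""
    # First moment of the gradient, as in Adam/AdamW.
    m = beta1 * m + (1.0 - beta1) * grad
    # Moment r_t of the p-th power of the gradient magnitudes;
    # reduces to v_t of AdamW when p = 2.
    r = beta2 * r + (1.0 - beta2) * np.abs(grad) ** p
    # Update direction sign(m) with magnitude |m|^q / (r + eps)^(q/p);
    # reduces to the AdamW direction m / (v + eps)^(1/2) when (p, q) = (2, 1).
    step = np.sign(m) * np.abs(m) ** q / (r + eps) ** (q / p)
    # Decoupled weight decay applied directly to the parameters, as in AdamW.
    theta = theta - lr * step - lr * weight_decay * theta
    return theta, m, r
```

Applying the decay term directly to theta (rather than adding it to the gradient) mirrors the decoupled weight decay of AdamW, which the stability result above depends on being non-zero when p, q > 1.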

