On Exploiting Layerwise Gradient Statistics for Effective Training of Deep Neural Networks

03/24/2022
by Guoqiang Zhang, et al.

Adam and AdaBelief compute and make use of elementwise adaptive stepsizes when training deep neural networks (DNNs) by tracking the exponential moving average (EMA) of the squared gradient g_t^2 and of the squared prediction error (m_t - g_t)^2, respectively, where m_t is the first momentum at iteration t and can be viewed as a prediction of g_t. In this work, we investigate whether layerwise gradient statistics can be exploited in Adam and AdaBelief to allow for more effective training of DNNs. We address this research question in two steps. First, we slightly modify Adam and AdaBelief by introducing layerwise adaptive stepsizes into their update procedures via either pre- or post-processing. Empirical study indicates that this slight modification produces comparable performance when training VGG and ResNet models on CIFAR10, suggesting that layerwise gradient statistics play an important role in the success of Adam and AdaBelief for at least certain DNN tasks. In the second step, instead of setting the layerwise stepsizes manually, we propose Aida, a new optimisation method designed so that the elementwise stepsizes within each layer have significantly smaller statistical variances. Motivated by the fact that (m_t - g_t)^2 in AdaBelief is conservative compared with g_t^2 in Adam in terms of layerwise statistical averages and variances, Aida tracks a function of m_t and g_t that is more conservative than (m_t - g_t)^2 in AdaBelief, obtained via layerwise orthogonal vector projections. Experimental results show that Aida yields either competitive or better performance than a number of existing methods, including Adam and AdaBelief, on a set of challenging DNN tasks.
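As a rough illustration of the quantities named in the abstract, the PyTorch sketch below shows the EMA trackers behind the elementwise stepsizes of Adam (EMA of g_t^2) and AdaBelief (EMA of (m_t - g_t)^2), together with one plausible layerwise post-processing rule that replaces a layer's second moments by their layerwise mean. The function names, the averaging rule, and the toy tensor shapes are assumptions for illustration only, not the paper's exact procedure or the Aida projection.

import torch

# Illustrative sketch (not the paper's code): the EMA trackers that drive the
# elementwise adaptive stepsizes of Adam and AdaBelief, plus one hypothetical
# layerwise post-processing rule of the kind described in the first step.

def update_second_moments(m_t, g_t, v_adam, v_belief, beta2=0.999):
    """One EMA step per optimiser.

    Adam tracks the EMA of the squared gradient g_t^2; AdaBelief tracks the
    EMA of the squared prediction error (m_t - g_t)^2, where m_t is the first
    momentum and acts as a prediction of g_t.
    """
    v_adam = beta2 * v_adam + (1 - beta2) * g_t.pow(2)
    v_belief = beta2 * v_belief + (1 - beta2) * (m_t - g_t).pow(2)
    return v_adam, v_belief

def layerwise_postprocess(v_layer):
    """Hypothetical post-processing: replace the elementwise second moments of
    a layer by their layerwise mean, so every element in the layer receives the
    same adaptive stepsize (zero within-layer variance)."""
    return torch.full_like(v_layer, v_layer.mean())

# Toy usage on a single layer's gradient tensor.
g_t = torch.randn(64, 128)          # current gradient of one layer
m_t = 0.9 * torch.randn(64, 128)    # first momentum (prediction of g_t)
v_adam = torch.zeros_like(g_t)
v_belief = torch.zeros_like(g_t)
v_adam, v_belief = update_second_moments(m_t, g_t, v_adam, v_belief)
v_layer = layerwise_postprocess(v_adam)  # shared stepsize within the layer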

