A Simple Asymmetric Momentum Make SGD Greatest Again

09/05/2023
by Gongyue Zhang et al.

We propose Loss-Controlled Asymmetric Momentum (LCAM), arguably the simplest enhancement to SGD yet, aimed directly at the saddle-point problem. Compared to traditional SGD with momentum, it adds no computational cost, yet it outperforms current optimizers. We use the concepts of weight conjugation and the traction effect to explain this behavior. We designed experiments that rapidly reduce the learning rate at specified epochs so that parameters are more easily trapped at saddle points. We selected WRN28-10 as the test network and CIFAR-10 and CIFAR-100 as test datasets, the same combination used in the original WRN paper and in Cosine Annealing Scheduling (CAS). We compared the ability of asymmetric momentum with different priorities to bypass saddle points. Finally, using WRN28-10 on CIFAR-100, we achieved a peak average test accuracy of 80.78% around epoch 120. For comparison, the original WRN paper reported 80.75% and CAS reported 80.42%, both at 200 epochs. This means that while potentially increasing accuracy, we need only about half the convergence time. Our demonstration code is available at https://github.com/hakumaicc/Asymmetric-Momentum-LCAM
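
The abstract does not spell out the exact LCAM update rule (the repository linked above does), but the general idea of a momentum coefficient switched asymmetrically based on the observed loss can be sketched as a small PyTorch optimizer. Everything below, including the class name AsymmetricMomentumSGD, the coefficients beta_fast and beta_slow, and the loss-trend switching rule, is an illustrative assumption rather than the authors' implementation.

```python
# Illustrative sketch only: SGD with a momentum coefficient switched between
# two values depending on whether the most recent training loss decreased or
# increased. The class name, the coefficients beta_fast / beta_slow, and the
# switching rule are assumptions for illustration; they are not the authors'
# LCAM implementation (see the linked repository for that).
import torch


class AsymmetricMomentumSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, beta_fast=0.95, beta_slow=0.85,
                 weight_decay=5e-4):
        defaults = dict(lr=lr, beta_fast=beta_fast, beta_slow=beta_slow,
                        weight_decay=weight_decay)
        super().__init__(params, defaults)
        self._prev_loss = None

    @torch.no_grad()
    def step(self, loss):
        """loss: current training loss as a Python float (e.g. loss.item())."""
        loss = float(loss)
        # Asymmetry: heavier momentum while the loss is falling, lighter
        # momentum once it rises, so the two regimes are treated differently.
        improving = self._prev_loss is None or loss <= self._prev_loss
        self._prev_loss = loss
        for group in self.param_groups:
            beta = group['beta_fast'] if improving else group['beta_slow']
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad
                if group['weight_decay'] != 0:
                    d_p = d_p.add(p, alpha=group['weight_decay'])
                buf = self.state[p].setdefault('momentum_buffer',
                                               torch.zeros_like(p))
                buf.mul_(beta).add_(d_p)         # v <- beta * v + g
                p.add_(buf, alpha=-group['lr'])  # w <- w - lr * v
```

In a standard training loop this would be used as opt.zero_grad(); loss.backward(); opt.step(loss.item()), so the optimizer can track the loss trend across iterations.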

Related research

08/09/2021 - On the Hyperparameters in Stochastic Gradient Descent with Momentum
Following the same routine as [SSJ20], we continue to present the theore...

08/02/2019 - Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance
Although adaptive gradient methods (AGMs) have fast speed in training de...

10/11/2019 - Decaying momentum helps neural network training
Momentum is a simple and popular technique in deep learning for gradient...

02/16/2021 - GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training
Changes in neural architectures have fostered significant breakthroughs ...

03/31/2021 - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization
It is well-known that stochastic gradient noise (SGN) acts as implicit r...

09/05/2023 - AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis
This paper proposes an efficient optimizer called AdaPlus which integrat...

08/17/2021 - Compressing gradients by exploiting temporal correlation in momentum-SGD
An increasing bottleneck in decentralized optimization is communication....
