Stable Weight Decay Regularization

11/23/2020
by Zeke Xie, et al.

Weight decay is a popular regularization technique for training deep neural networks. Modern deep learning libraries mainly use L_2 regularization as the default implementation of weight decay. Loshchilov and Hutter <cit.> demonstrated that L_2 regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we find that the popular implementations of weight decay in modern deep learning libraries, including both L_2 regularization and decoupled weight decay, often damage performance. First, L_2 regularization acts as an unstable form of weight decay for all optimizers that use Momentum, such as stochastic gradient descent (SGD) with Momentum. Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We therefore propose the Stable Weight Decay (SWD) method, which fixes the unstable weight decay problem from a dynamical perspective. In our experiments, SWD yields significant improvements over L_2 regularization and decoupled weight decay. Simply fixing weight decay in Adam with SWD, at no extra hyperparameter cost, can usually outperform complex Adam variants that introduce more hyperparameters.
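The three flavors of weight decay discussed in the abstract differ only in where the decay term enters the optimizer update. The sketch below illustrates this for Adam: the L_2 branch folds the decay into the gradient before the adaptive statistics are computed, and the decoupled branch applies it directly to the weights as in AdamW; both of these follow the standard, well-known update rules. The SWD branch is only an assumption of this write-up, rescaling the decoupled decay by the mean second-moment estimate to reflect the paper's dynamical argument; the function `adam_step`, its parameter names, and that exact normalization are hypothetical and should be checked against the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=1e-2, mode="l2"):
    """One Adam update on array parameters, with three ways to apply weight decay.

    mode="l2":        L_2 regularization (decay folded into the gradient).
    mode="decoupled": decoupled weight decay, as in AdamW.
    mode="swd":       hypothetical Stable Weight Decay sketch; here assumed to
                      normalize the decoupled decay by the mean second moment.
    """
    if mode == "l2":
        grad = grad + wd * theta            # decay enters the adaptive statistics

    m = beta1 * m + (1 - beta1) * grad      # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2 # second-moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)

    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    if mode == "decoupled":
        theta = theta - lr * wd * theta     # decay bypasses the adaptive statistics
    elif mode == "swd":
        # Assumed normalization (sketch only): divide the decay by the root of
        # the mean second-moment estimate so its effective strength tracks the
        # effective step size; see the paper for the exact rule.
        theta = theta - lr * wd * theta / np.sqrt(v_hat.mean())

    return theta, m, v

# Example: a single update on a toy parameter vector
theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
grad = np.array([0.1, -0.2, 0.3, -0.4])
theta, m, v = adam_step(theta, grad, m, v, t=1, mode="swd")
```

Note that only the placement and scaling of the `wd * theta` term changes between the three modes; the Adam machinery itself is untouched, which is why such a fix adds no new hyperparameters.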


Related research

03/25/2020  Volumization as a Natural Generalization of Weight Decay
We propose a novel regularization method, called volumization, for neura...

11/14/2017  Fixing Weight Decay Regularization in Adam
We note that common implementations of adaptive gradient algorithms, suc...

02/02/2022  Robust Training of Neural Networks using Scale Invariant Architectures
In contrast to SGD, adaptive gradient methods like Adam allow robust tra...

12/27/2020  Understanding Decoupled and Early Weight Decay
Weight decay (WD) is a traditional regularization technique in deep lear...

10/29/2018  Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox...

07/21/2019  Adaptive Weight Decay for Deep Neural Networks
Regularization in the optimization of deep neural networks is often crit...

08/25/2021  Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
Adaptive gradient methods such as Adam have gained increasing popularity...
