Understanding Decoupled and Early Weight Decay

12/27/2020
by Johan Bjorck et al.

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD matters only at the start of training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an l_2 penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this paper is to investigate these two recent empirical observations. We demonstrate that by applying WD only at the start, the norm of the network weights stays small throughout training. This has a regularizing effect, as the effective gradient updates become larger. However, traditional generalization metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric can. We also show how the growth of the network weights is heavily influenced by the dataset and its generalization properties. For decoupled WD, we perform experiments in NLP and RL, where adaptive optimizers are the norm. We demonstrate that the primary issue that decoupled WD alleviates is the mixing of gradients from the objective function and the l_2 penalty in the buffers of Adam (which store the estimates of the first-order moment). Adaptivity itself is not problematic, and decoupled WD ensures that the gradients from the l_2 term cannot "drown out" the true objective, facilitating easier hyperparameter tuning.
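As a minimal illustration of the distinction the abstract draws (a sketch, not the paper's own code), the NumPy snippet below contrasts a coupled l_2 penalty, whose gradient l2 * w is added to the objective gradient and therefore mixes into Adam's moment buffers, with decoupled WD, which shrinks the weights directly after the adaptive step. The function name, hyperparameter defaults, and toy objective are illustrative assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, wd=0.0):
    """One Adam step on parameters w.

    l2: coupled l_2 penalty -- its gradient (l2 * w) is added to `grad`
        and therefore enters the moment buffers m and v.
    wd: decoupled weight decay (AdamW-style) -- the weights are shrunk
        directly, bypassing the adaptive moment estimates.
    """
    g = grad + l2 * w                      # coupled penalty mixes into the buffers
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - lr * wd * w                    # decoupled decay: plain shrinkage of w
    return w, m, v

# Toy usage: minimize f(w) = ((w - 3)^2).sum() with decoupled decay only.
w = np.zeros(4)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1, wd=1e-2)
```

With l2 > 0 and wd = 0, a large penalty can dominate g and hence the normalized update; with wd > 0 and l2 = 0, the decay strength is independent of the adaptive rescaling, which is the separation the abstract credits for easier hyperparameter tuning.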


Related research

11/23/2020 · Stable Weight Decay Regularization
09/30/2022 · Adaptive Weight Decay: On The Fly Weight Decay Tuning for Improving Robustness
10/29/2018 · Three Mechanisms of Weight Decay Regularization
10/06/2022 · A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets
03/27/2022 · Long-Tailed Recognition via Weight Balancing
02/24/2020 · The Early Phase of Neural Network Training
06/10/2022 · The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon
