Adaptive Weight Decay: On The Fly Weight Decay Tuning for Improving Robustness

09/30/2022
by Amin Ghiasi, et al.

We introduce adaptive weight decay, which automatically tunes the weight decay hyper-parameter during each training iteration. For classification problems, we propose changing the value of the weight decay hyper-parameter on the fly based on the strength of the updates from the classification loss (i.e., the gradient of the cross-entropy) and from the regularization loss (i.e., the ℓ_2-norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness, an area which suffers from robust overfitting, without requiring extra data. Specifically, our reformulation yields a 20% relative robustness improvement on CIFAR-100 and a 10% relative robustness improvement on CIFAR-10 compared to the best-tuned hyper-parameters of traditional weight decay. In addition, this method has other desirable properties, such as less sensitivity to the learning rate and smaller weight norms; the latter contributes to robustness against overfitting to label noise and to pruning.
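Based on the description above, a minimal PyTorch-style sketch of one training step with an adaptively tuned weight decay coefficient might look like the following. The coefficient is rescaled each iteration by the ratio of the cross-entropy gradient norm to the ℓ_2-norm of the weights; the function and parameter names (e.g., adaptive_weight_decay_step, lambda_awd) are illustrative assumptions and not necessarily the paper's exact formulation.

import torch

def adaptive_weight_decay_step(model, loss_fn, inputs, targets, optimizer, lambda_awd=0.01):
    """One training step with an on-the-fly weight decay coefficient.

    The decay strength is scaled by ||grad of CE loss|| / ||weights|| so the
    regularization update keeps a roughly constant strength relative to the
    classification update. `lambda_awd` is a hypothetical name for the single
    tunable ratio.
    """
    optimizer.zero_grad()
    logits = model(inputs)
    ce_loss = loss_fn(logits, targets)
    ce_loss.backward()

    # Norm of the cross-entropy gradient and of the weights.
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None))
    weight_norm = torch.sqrt(sum((p ** 2).sum() for p in model.parameters()))

    # Adaptive coefficient: lambda_t = lambda_awd * ||grad|| / ||w||.
    lambda_t = lambda_awd * grad_norm / (weight_norm + 1e-12)

    # Add the weight-decay term to the gradients before the optimizer step.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(p, alpha=lambda_t.item())

    optimizer.step()
    return ce_loss.item(), lambda_t.item()

In this sketch the only hyper-parameter that needs tuning is the ratio lambda_awd; the effective per-iteration decay coefficient lambda_t then adapts to the scale of the classification gradients, which is the "on the fly" behavior the abstract describes.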


