
Fixing Weight Decay Regularization in Adam
We note that common implementations of adaptive gradient algorithms, suc...
11/14/2017 ∙ by Ilya Loshchilov, et al.

Adding noise to the input of a model trained with a regularized objective
Regularization is a well studied problem in the context of neural networ...
04/16/2011 ∙ by Salah Rifai, et al.

Predictors of short-term decay of cell phone contacts in a large scale communication network
Under what conditions is an edge present in a social network at time t l...
02/09/2011 ∙ by Troy Raeder, et al.

Do deep nets really need weight decay and dropout?
The impressive success of modern deep neural networks on computer vision...
02/20/2018 ∙ by Alex Hernández-García, et al.

Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence
Regularization is typically understood as improving generalization by al...
05/30/2019 ∙ by Aditya Golatkar, et al.

Feature Incay for Representation Regularization
Softmax loss is widely used in deep neural networks for multi-class clas...
05/29/2017 ∙ by Yuhui Yuan, et al.

Decoding Beta-Decay Systematics: A Global Statistical Model for Beta^- Half-lives
Statistical modeling of nuclear data provides a novel approach to nuclea...
06/17/2008 ∙ by N. J. Costiris, et al.

Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of L_2 regularization. Literal weight decay has been shown to outperform L_2 regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
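The distinction between literal weight decay and L_2 regularization mentioned in the abstract can be made concrete with a single adaptive update step. The sketch below is illustrative, not code from the paper: the function names and the toy values are assumptions, and the updates follow the standard Adam recurrences, with the decay term either folded into the gradient (L_2) or applied directly to the weights (decoupled, AdamW-style).

```python
# Minimal sketch contrasting L_2 regularization with literal (decoupled)
# weight decay in one Adam-style update on a scalar weight. All names and
# hyperparameter values here are illustrative assumptions.
import math

def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=1e-2,
                 b1=0.9, b2=0.999, eps=1e-8):
    """L_2 regularization: lambda*w is added to the gradient, so the decay
    is rescaled by Adam's adaptive denominator along with everything else."""
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def adam_decoupled_step(w, g, m, v, t, lr=1e-3, wd=1e-2,
                        b1=0.9, b2=0.999, eps=1e-8):
    """Literal weight decay: the decay term is applied directly to the
    weights and bypasses the adaptive rescaling entirely."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# With a large gradient, Adam's denominator damps the L_2 decay term but
# leaves the decoupled decay untouched, so the two updates diverge.
w_l2, _, _ = adam_l2_step(w=1.0, g=10.0, m=0.0, v=0.0, t=1)
w_dec, _, _ = adam_decoupled_step(w=1.0, g=10.0, m=0.0, v=0.0, t=1)
print(w_l2, w_dec)
```

For SGD the two formulations coincide up to a rescaling of the decay coefficient; they differ precisely for optimizers with per-parameter preconditioning, which is where the abstract notes literal weight decay has been observed to work better.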