Implicit Regularization of Normalization Methods

11/18/2019
by   Xiaoxia Wu, et al.

Normalization methods such as batch normalization are commonly used in overparametrized models like neural networks. Here, we study the weight normalization (WN) method (Salimans & Kingma, 2016) and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least squares regression and some more general loss functions. WN and rPGD reparametrize the weights with a scale g and a unit vector, so that the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge at an exponential rate to the minimum ℓ_2 norm solution (or close to it), even for initializations far from zero. This differs from the behavior of gradient descent, which converges to the min norm solution only when started at zero and is therefore more sensitive to initialization. Some of our proof techniques differ from those of related works; for instance, we find explicit invariants along the gradient flow paths. We verify our results experimentally and suggest that a similar phenomenon may hold for nonlinear problems such as matrix sensing.
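The reparametrization described above can be sketched numerically. The snippet below is a minimal illustration (not the paper's experiments): it runs gradient descent on the weight-normalized parametrization w = g·v/‖v‖ for an overparametrized least squares problem, using the chain rule to derive the gradients with respect to g and v, and compares the result against the minimum ℓ_2 norm interpolator. The problem sizes, learning rate, and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50  # overparametrized: more weights than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_min = np.linalg.pinv(X) @ y  # minimum l2-norm interpolating solution

# Weight normalization: w = g * v / ||v||, gradient descent on (g, v)
g = 1.0
v = rng.standard_normal(d)  # initialization far from zero
lr = 1e-3
for _ in range(50_000):
    v_norm = np.linalg.norm(v)
    w = g * v / v_norm
    grad_w = X.T @ (X @ w - y)        # gradient of 0.5*||Xw - y||^2 wrt w
    grad_g = grad_w @ v / v_norm      # chain rule: dw/dg = v/||v||
    # chain rule wrt v: (g/||v||) * (I - v v^T/||v||^2) grad_w
    grad_v = (g / v_norm) * (grad_w - (grad_w @ v) * v / v_norm**2)
    g -= lr * grad_g
    v -= lr * grad_v

w = g * v / np.linalg.norm(v)
print(np.linalg.norm(X @ w - y))  # training residual, should be near zero
print(np.linalg.norm(w), np.linalg.norm(w_min))  # compare to min-norm scale
```

Note that grad_v is orthogonal to v, so the updates change mostly the direction of v while g adapts the scale; how close the final ‖w‖ lands to ‖w_min‖ from a nonzero initialization is exactly the regularization effect the abstract describes.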


Related research:

- 03/12/2019: Theory III: Dynamics and Generalization in Deep Networks
- 05/09/2023: Robust Implicit Regularization via Weight Normalization
- 03/02/2021: Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization
- 07/09/2019: Scaling Limit of Neural Networks with the Xavier Initialization and Convergence to a Global Minimum
- 03/06/2019: Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization
- 11/03/2015: Understanding symmetries in deep networks
- 12/21/2021: More is Less: Inducing Sparsity via Overparameterization
