
Theory III: Dynamics and Generalization in Deep Networks
We review recent observations on the dynamical systems induced by gradie...

Gradient Descent Provably Optimizes Overparameterized Neural Networks
One of the mysteries in the success of neural networks is randomly initial...

Implicit Regularization in Matrix Factorization
We study implicit regularization when optimizing an underdetermined quad...

Scaling Limit of Neural Networks with the Xavier Initialization and Convergence to a Global Minimum
We analyze single-layer neural networks with the Xavier initialization i...

Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization
In this paper, we present some theoretical work to explain why simple gr...

Understanding symmetries in deep networks
Recent works have highlighted scale invariance or symmetry present in th...

Learning Sparse Neural Networks through L_0 Regularization
We propose a practical method for L_0 norm regularization for neural net...
Implicit Regularization of Normalization Methods
Normalization methods such as batch normalization are commonly used in overparametrized models like neural networks. Here, we study the weight normalization (WN) method (Salimans & Kingma, 2016) and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least squares regression and some more general loss functions. WN and rPGD reparametrize the weights with a scale g and a unit vector such that the objective function becomes nonconvex. We show that this nonconvex formulation has beneficial regularization effects compared to gradient descent on the original objective. In particular, these methods adaptively regularize the weights and converge at an exponential rate to the minimum ℓ_2 norm solution (or close to it), even for initializations far from zero. This differs from the behavior of gradient descent, which converges to the minimum-norm solution only when started at zero and is therefore more sensitive to initialization. Some of our proof techniques differ from those of many related works; for instance, we find explicit invariants along the gradient flow paths. We verify our results experimentally and suggest that there may be a similar phenomenon for nonlinear problems such as matrix sensing.
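The baseline behavior the abstract contrasts against can be checked numerically. The following is a minimal sketch, not the authors' experiment: it runs plain gradient descent on a randomly generated overparametrized least squares problem and compares the limit with the minimum ℓ_2 norm interpolant computed via the pseudoinverse. The problem sizes, step size, iteration count, and all variable names are assumptions chosen for illustration; WN and rPGD themselves are not implemented here.

```python
import numpy as np

# Hypothetical problem sizes: fewer samples (n) than parameters (d),
# so infinitely many weight vectors fit the data exactly.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum l2-norm solution of X w = y, via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y


def gradient_descent(w0, steps=5000):
    """Plain gradient descent on 0.5 * ||X w - y||^2, starting from w0."""
    # Step size 1 / sigma_max(X)^2 is a standard safe choice for this loss.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    w = w0.copy()
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y))
    return w


w_from_zero = gradient_descent(np.zeros(d))
w_from_far = gradient_descent(5.0 * rng.standard_normal(d))

# Both runs fit the data, but only the zero-initialized run
# lands on the minimum-norm solution.
print("residual (zero init):", np.linalg.norm(X @ w_from_zero - y))
print("residual (far init): ", np.linalg.norm(X @ w_from_far - y))
print("distance to min-norm (zero init):", np.linalg.norm(w_from_zero - w_min_norm))
print("distance to min-norm (far init): ", np.linalg.norm(w_from_far - w_min_norm))
```

Gradient descent updates always lie in the row space of X, so the component of the initialization in the null space of X is never changed. This is why the far-from-zero run interpolates the data yet stays away from the minimum-norm solution, which is the sensitivity to initialization that the abstract says WN and rPGD reduce.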