Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

06/04/2018
by   Simon S. Du, et al.

We study the implicit regularization imposed by gradient descent when learning multi-layer homogeneous functions, including feed-forward fully connected and convolutional deep neural networks with linear, ReLU, or Leaky ReLU activations. We rigorously prove that gradient flow (i.e., gradient descent with infinitesimal step size) keeps the differences between the squared norms of different layers invariant, without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes η_t = O(t^{-(1/2 + δ)}) for 0 < δ < 1/2 automatically balances the two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing that gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first-order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.
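As a rough numerical illustration of the discrete-time claim (a minimal sketch, not the authors' code), the Python snippet below runs gradient descent on the unregularized asymmetric factorization objective f(U, V) = 1/2 ||U V^T - M||_F^2 from a small random initialization with decaying step sizes η_t ∝ t^{-(1/2+δ)}, and tracks the balancing quantity ||U||_F^2 - ||V||_F^2. The dimensions, the step-size prefactor 0.05, and δ = 0.25 are illustrative choices, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 20, 15, 3

# Ground-truth rank-r matrix M.
M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))

# Small random initialization -- the regime in which balancing is claimed to kick in.
U = 1e-3 * rng.standard_normal((d1, r))
V = 1e-3 * rng.standard_normal((d2, r))

delta = 0.25  # illustrative choice of delta in (0, 1/2)
for t in range(1, 20001):
    R = U @ V.T - M                      # residual U V^T - M
    grad_U = R @ V                       # gradient of f with respect to U
    grad_V = R.T @ U                     # gradient of f with respect to V
    eta = 0.05 * t ** (-(0.5 + delta))   # step size eta_t = O(t^{-(1/2 + delta)})
    U = U - eta * grad_U
    V = V - eta * grad_V
    if t % 5000 == 0:
        gap = np.linalg.norm(U) ** 2 - np.linalg.norm(V) ** 2
        scale = np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2
        loss = 0.5 * np.linalg.norm(R) ** 2
        print(f"t={t:6d}  loss={loss:.3e}  ||U||^2-||V||^2={gap:+.3e}  ||U||^2+||V||^2={scale:.3e}")

A single gradient step changes the gap only at second order in the step size, since <U, ∇_U f> = <V, ∇_V f> = <U V^T - M, U V^T>, so with decaying step sizes the printed gap should remain small relative to ||U||_F^2 + ||V||_F^2 while the loss decreases.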


Related research

08/04/2021 · Convergence of gradient descent for learning linear neural networks
We study the convergence properties of gradient descent for training dee...

01/28/2022 · Training invariances and the low-rank phenomenon: beyond linear networks
The implicit bias induced by the training of neural networks has become ...

02/05/2018 · Learning Compact Neural Networks with Regularization
We study the impact of regularization for learning neural networks. Our ...

09/04/2023 · Asymmetric matrix sensing by gradient descent with small random initialization
We study matrix sensing, which is the problem of reconstructing a low-ra...

03/22/2022 · Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography
We propose a distributed Quantum State Tomography (QST) protocol, named ...

06/24/2015 · Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation
It has been observed in a variety of contexts that gradient descent meth...

11/22/2021 · Depth Without the Magic: Inductive Bias of Natural Gradient Descent
In gradient descent, changing how we parametrize the model can lead to d...
