SGD with large step sizes learns sparse features

10/11/2022
by Maksym Andriushchenko, et al.

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that with the commonly used large step sizes, (i) the iterates jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics, orthogonal to the bouncing directions, that implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD training dynamics as influenced by the step size schedule. These observations therefore unveil how, through the step size schedule, gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis sheds new light on some common practices and observed phenomena when training neural networks. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
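To make the described mechanism concrete, below is a minimal, self-contained PyTorch sketch: it trains a small ReLU network with plain SGD, first at a large step size (the regime where the loss stabilizes and the iterates bounce across the valley), then at a decayed one, and tracks a crude proxy for feature sparsity. The synthetic data, architecture, step size values, and sparsity metric are illustrative assumptions of this sketch, not taken from the paper or its repository.

```python
# Sketch only: "large step size first, then decay", with a rough feature-sparsity proxy.
import torch

torch.manual_seed(0)

# Synthetic regression task: the target depends on only two input coordinates,
# so a "simple" predictor can ignore most directions of the input.
n, d, hidden = 512, 30, 64
X = torch.randn(n, d)
y = (X[:, 0] - X[:, 1]).unsqueeze(1)

model = torch.nn.Sequential(
    torch.nn.Linear(d, hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, 1),
)
loss_fn = torch.nn.MSELoss()

def feature_sparsity(threshold=1e-3):
    # Fraction of hidden units that are (almost) never active on the data:
    # a crude proxy for how many learned features the network actually uses.
    with torch.no_grad():
        acts = torch.relu(model[0](X))
        return (acts.mean(dim=0) < threshold).float().mean().item()

def train(lr, steps, batch_size=32):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for t in range(steps):
        idx = torch.randint(0, n, (batch_size,))
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        if t % 500 == 0:
            with torch.no_grad():
                full_loss = loss_fn(model(X), y).item()
            print(f"lr={lr:<5} step={t:<5} loss={full_loss:.4f} "
                  f"sparsity={feature_sparsity():.2f}")

# Phase 1: large step size keeps the iterates bouncing high in the valley.
train(lr=0.2, steps=3000)
# Phase 2: decayed step size lets SGD descend toward a (hopefully sparser) solution.
train(lr=0.01, steps=3000)
```

The sketch only illustrates the schedule the abstract refers to; the paper studies the same qualitative effect on standard architectures and datasets.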

Related research:
- Barzilai-Borwein Step Size for Stochastic Gradient Descent (05/13/2016)
- Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit (07/18/2022)
- PSO-Convolutional Neural Networks with Heterogeneous Learning Rate (05/20/2022)
- Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization (07/24/2019)
- How to train your neural ODE (02/07/2020)
- Semi-Implicit Back Propagation (02/10/2020)
- On Avoiding Local Minima Using Gradient Descent With Large Learning Rates (05/30/2022)
