On Avoiding Local Minima Using Gradient Descent With Large Learning Rates

05/30/2022
by Amirkeivan Mohtashami, et al.

It has been widely observed in the training of neural networks that a large step size is essential for obtaining superior models when applying gradient descent (GD). However, the effect of large step sizes on the success of GD is not well understood theoretically. We argue that a complete understanding of the mechanics behind GD's success may indeed require considering the effects of using a large step size. To support this claim, we prove that, on a certain class of functions, GD with a large step size follows a different trajectory than GD with a small step size, leading to convergence to the global minimum. We also demonstrate the difference between the trajectories of small and large learning rates when GD is applied to a neural network, observing an escape from a local minimum with a large step size, which shows that this behavior is indeed relevant in practice. Finally, through a novel set of experiments, we show that even though stochastic noise is beneficial, it alone is not enough to explain the success of SGD, and a large learning rate remains essential for obtaining the best performance even in stochastic settings.
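As a toy illustration of this escape effect (a minimal sketch under assumed constants, not the function class or experiments analyzed in the paper), the following Python snippet runs plain GD on a hand-crafted one-dimensional objective with a sharp, shallow local minimum near x = 1.5 and a flat global basin around x = 0. The objective, step sizes, and starting point are all hypothetical choices made for illustration.

import numpy as np

# Hypothetical 1-D objective: a flat quadratic basin with its global minimum
# near x = 0, plus a narrow Gaussian well near x = 1.5 acting as a sharp,
# spurious local minimum. All constants are illustrative assumptions.
def f(x):
    return 0.025 * x**2 - 0.04 * np.exp(-500.0 * (x - 1.5) ** 2)

def grad(x):
    # Exact derivative of f.
    return 0.05 * x + 40.0 * (x - 1.5) * np.exp(-500.0 * (x - 1.5) ** 2)

def gradient_descent(x0, lr, steps=300):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x0 = 1.55  # start inside the basin of the sharp local minimum
print(gradient_descent(x0, lr=0.02))  # small step size: settles near the local minimum (~1.5)
print(gradient_descent(x0, lr=1.0))   # large step size: overshoots the narrow well, ends near the global minimum (~0)

The step sizes are chosen relative to the local curvatures: the small one lies below the stability threshold of the sharp well, so GD converges there, while the large one exceeds that threshold yet remains stable in the flat basin, so the iterates leave the sharp well and descend toward the global minimum. This mirrors, in a much simplified setting, the trajectory difference described in the abstract.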


