Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate

11/04/2020
by Jingfeng Wu, et al.

Understanding the algorithmic regularization effect of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most existing works, however, focus on the very small or even infinitesimal learning-rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning-rate regime by studying its behavior in optimizing an overparameterized linear regression problem. In this setting, SGD and GD are known to converge to the unique minimum-norm solution; however, with a moderate and annealing learning rate, we show that they exhibit different directional biases: SGD converges along the large-eigenvalue directions of the data matrix, while GD goes after the small-eigenvalue directions. Furthermore, we show that this directional bias does matter when early stopping is adopted: the SGD output is nearly optimal, while the GD output is suboptimal. Finally, our theory explains several folk heuristics used in practice for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with the batch size, and (2) overrunning SGD with a high learning rate even after the loss stops decreasing.
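As a concrete illustration of the setting the abstract describes, the following numpy sketch runs GD and SGD on a synthetic overparameterized linear regression problem and measures how far each output is from the minimum-norm solution along a large versus a small (nonzero) eigen-direction of the data matrix. It is a minimal probe under assumed settings (Gaussian data, a two-phase annealing schedule, illustrative step sizes and batch size), not the paper's experimental protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: more features (d) than samples (n),
# with realizable labels, so both GD and SGD can interpolate the data.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

# Minimum-norm interpolating solution: the common limit of GD and SGD
# when both are initialized at zero.
w_mn = X.T @ np.linalg.solve(X @ X.T, y)

# Eigen-directions of the data covariance X^T X / n; only the top n
# eigenvalues are nonzero because rank(X) = n.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
v_large = eigvecs[:, -1]   # largest-eigenvalue direction
v_small = eigvecs[:, -n]   # smallest *nonzero* eigenvalue direction

def grad(w, idx):
    """(Mini-batch) gradient of the averaged squared loss on rows `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def run(stochastic, steps=400, batch=10):
    # Moderate initial step size relative to 1 / lambda_max, annealed by 10x
    # halfway through; for SGD the initial rate is linearly scaled with the
    # batch size, as in heuristic (1) of the abstract. All constants here
    # are illustrative choices, not the paper's.
    eta0 = 1.0 / eigvals[-1]
    if stochastic:
        eta0 *= batch / n
    w = np.zeros(d)
    for t in range(steps):
        eta = eta0 if t < steps // 2 else eta0 / 10
        idx = rng.choice(n, size=batch, replace=False) if stochastic else np.arange(n)
        w -= eta * grad(w, idx)
    return w

for name, stochastic in [("GD ", False), ("SGD", True)]:
    err = run(stochastic) - w_mn
    print(f"{name} residual along v_large: {abs(err @ v_large):.3e}, "
          f"along v_small: {abs(err @ v_small):.3e}")
```

Under the paper's theory, one would expect the SGD residual to be relatively smaller along v_large and the GD residual relatively smaller along v_small at an early-stopping time; the precise separation depends on the spectrum and the annealing schedule, which the paper's theorems make precise.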


Related research

04/29/2022 · The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models
12/07/2020 · Stochastic Gradient Descent with Large Learning Rate
01/28/2021 · On the Origin of Implicit Regularization in Stochastic Gradient Descent
07/10/2019 · Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
06/14/2023 · Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning
06/09/2020 · Isotropic SGD: a Practical Approach to Bayesian Posterior Sampling
03/31/2021 · Empirically explaining SGD from a line search perspective
