Parameter Averaging for SGD Stabilizes the Implicit Bias towards Flat Regions

02/18/2023
by Atsushi Nitanda, et al.

Stochastic gradient descent (SGD) is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies have attributed this success to the implicit bias of the method, which prefers flat minima, and have developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that averaged SGD with a large step size can bring out this implicit bias more effectively and converge more stably to a flat minimum than vanilla SGD. In this work, we theoretically justify that observation by showing that the averaging scheme improves the bias-optimization tradeoff induced by stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that, under certain conditions, averaged SGD can get closer to a solution of a sharpness-penalized objective than vanilla SGD with the same step size. In experiments, we verify our theory and show that this learning scheme significantly improves performance.
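To make the averaging scheme concrete, below is a minimal sketch of constant-step-size SGD with tail iterate averaging (in the spirit of the weight averaging of Izmailov et al., 2018) on a toy noisy quadratic. The objective, noise scale, step size, and averaging window are illustrative assumptions, not values or code from the paper; the sketch only shows how the averaged iterate suppresses the fluctuation of the last iterate while the large step size is retained.

```python
# Minimal sketch (not the authors' code): constant-step-size SGD with
# tail (Polyak-Ruppert-style) iterate averaging on a toy noisy quadratic.
# The objective, noise scale, step size, and averaging window below are
# illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    """Gradient of f(w) = 0.5 * ||w||^2 plus Gaussian noise,
    mimicking a minibatch stochastic gradient."""
    return w + 0.1 * rng.standard_normal(w.shape)

def averaged_sgd(w0, step_size=0.5, n_steps=2000, avg_start=1000):
    """Run SGD with a fixed (large) step size and maintain the running
    average of the iterates from `avg_start` onwards."""
    w = w0.copy()
    w_avg = np.zeros_like(w0)
    n_avg = 0
    for t in range(n_steps):
        w -= step_size * noisy_grad(w)
        if t >= avg_start:
            n_avg += 1
            # incremental running mean of the tail iterates
            w_avg += (w - w_avg) / n_avg
    return w, w_avg

w_last, w_mean = averaged_sgd(np.ones(10))
print("last iterate distance to optimum:    ", np.linalg.norm(w_last))
print("averaged iterate distance to optimum:", np.linalg.norm(w_mean))
```

On this toy problem the last iterate keeps oscillating at a scale set by the step size and the gradient noise, whereas the averaged iterate lands much closer to the optimum; the paper's analysis concerns the analogous effect for deep networks, where the large step size is what drives the bias toward flat regions.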


Related research

07/20/2017
Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains
We consider the minimization of an objective function given access to un...

04/16/2018
Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling
Stochastic gradient methods enable learning probabilistic models from la...

04/29/2022
The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models
We study the Stochastic Gradient Descent (SGD) algorithm in nonparametri...

06/20/2023
Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent
Gaussian processes are a powerful framework for quantifying uncertainty ...

10/06/2022
Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias
Gradient regularization (GR) is a method that penalizes the gradient nor...

01/31/2019
Improving SGD convergence by tracing multiple promising directions and estimating distance to minimum
Deep neural networks are usually trained with stochastic gradient descen...

10/21/2017
Stochastic Backward Euler: An Implicit Gradient Descent Algorithm for k-means Clustering
In this paper, we propose an implicit gradient descent algorithm for the...
