Logarithmic landscape and power-law escape rate of SGD

by Takashi Mori, et al.

For the mean-square loss, stochastic gradient descent (SGD) is subject to complicated multiplicative noise. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In this SDE, the gradient of the loss is replaced by the gradient of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution P_ss(θ) of the network parameters θ follows a power law with respect to the loss function L(θ), i.e. P_ss(θ) ∝ L(θ)^{-ϕ}, where the exponent ϕ is specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We obtain a formula for the escape rate from a local minimum, which is determined not by the loss barrier height ΔL = L(θ^s) - L(θ^*) between a minimum θ^* and a saddle θ^s but by the logarithmized loss barrier height Δ log L = log[L(θ^s)/L(θ^*)]. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
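To make the difference between the two barrier heights concrete, here is a minimal Python sketch. It only evaluates the two definitions given in the abstract, ΔL and Δ log L. The function names and the loss values are hypothetical, chosen for illustration, and the sketch does not reproduce the paper's escape-rate formula or its prefactors.

```python
import math


def loss_barrier(loss_saddle, loss_min):
    """Plain barrier height: Delta L = L(theta^s) - L(theta^*)."""
    return loss_saddle - loss_min


def log_loss_barrier(loss_saddle, loss_min):
    """Logarithmized barrier height: Delta log L = log[L(theta^s) / L(theta^*)]."""
    if loss_saddle <= 0 or loss_min <= 0:
        raise ValueError("both losses must be positive to take the logarithm")
    return math.log(loss_saddle / loss_min)


# Hypothetical (saddle loss, minimum loss) pairs, not taken from the paper.
# Both minima sit behind the same plain barrier Delta L = 0.1.
high_loss_minimum = (0.60, 0.50)
low_loss_minimum = (0.11, 0.01)

for name, (l_saddle, l_min) in [
    ("high-loss minimum", high_loss_minimum),
    ("low-loss minimum", low_loss_minimum),
]:
    print(
        f"{name}: Delta L = {loss_barrier(l_saddle, l_min):.3f}, "
        f"Delta log L = {log_loss_barrier(l_saddle, l_min):.3f}"
    )
```

The two minima have the same ΔL, but the one with the lower loss has a much larger Δ log L. If the escape rate is governed by Δ log L, as the abstract states, then these two minima would differ in stability even though a criterion based on ΔL alone could not tell them apart.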


When does SGD favor flat minima? A quantitative characterization via linear stability

The observation that stochastic gradient descent (SGD) favors flat minim...

Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions

Generalization is one of the most important problems in deep learning (D...

The Sobolev Regularization Effect of Stochastic Gradient Descent

The multiplicative structure of parameters and input data in the first l...

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

Stochastic gradient descent (SGD) is one of the most popular algorithms ...

A Walk with SGD

Exploring why stochastic gradient descent (SGD) based optimization metho...

How Many Factors Influence Minima in SGD?

Stochastic gradient descent (SGD) is often applied to train Deep Neural ...

Training trajectories, mini-batch losses and the curious role of the learning rate

Stochastic gradient descent plays a fundamental role in nearly all appli...