Logarithmic landscape and power-law escape rate of SGD

05/20/2021
by Takashi Mori et al.

Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution P_ss(θ) of the network parameters θ follows a power law with respect to the loss function L(θ), i.e., P_ss(θ) ∝ L(θ)^{-ϕ}, with the exponent ϕ specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We obtain a formula for the rate of escape from a local minimum, which is determined not by the loss-barrier height ΔL = L(θ^s) - L(θ^*) between a minimum θ^* and a saddle θ^s, but by the logarithmized loss-barrier height Δlog L = log[L(θ^s)/L(θ^*)]. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
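The power-law form of the stationary distribution can be illustrated on a toy model. The sketch below is not the paper's derivation: it assumes a 1-d quadratic loss with a strictly positive minimum value, L(θ) = L0 + θ²/2, and SGD-like gradient noise whose variance is proportional to the loss value, as a crude stand-in for the multiplicative mini-batch noise of the mean-square loss. It then histograms the iterates and fits the exponent ϕ in P_ss(θ) ∝ L(θ)^{-ϕ}. The parameter values (L0, eta, c) are illustrative assumptions, and the closed-form exponent printed at the end is the continuous-time prediction for this particular toy, not the paper's formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed, not the paper's): 1-d loss with a positive minimum
# value and gradient noise whose variance is proportional to the loss,
# mimicking multiplicative SGD noise for the mean-square loss.
L0 = 1.0    # loss value at the minimum (assumed)
eta = 0.05  # learning rate (assumed)
c = 40.0    # noise-variance prefactor, loosely playing the role of 1/B (assumed)

def loss(theta):
    return L0 + 0.5 * theta ** 2

def grad(theta):
    return theta

steps = 2_000_000
burn_in = 100_000
theta = 0.0
thetas = np.empty(steps)
for t in range(steps):
    # Noisy gradient with standard deviation sqrt(c * L(theta)).
    noisy_grad = grad(theta) + np.sqrt(c * loss(theta)) * rng.standard_normal()
    theta -= eta * noisy_grad
    thetas[t] = theta

# Test the power-law form P_ss(theta) ∝ L(theta)^{-phi}: histogram theta
# and fit log P against log L(theta) at the bin centres.
hist, edges = np.histogram(thetas[burn_in:], bins=np.linspace(-10, 10, 201),
                           density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mask = hist > 0
slope, _ = np.polyfit(np.log(loss(centers[mask])), np.log(hist[mask]), 1)

# Continuous-time (SDE) prediction for this toy: phi = 1 + 2 / (eta * c).
# The discrete simulation should give a broadly similar value.
print(f"fitted phi      ~ {-slope:.2f}")
print(f"toy SDE phi     = {1 + 2 / (eta * c):.2f}")
```

In this toy, a larger learning rate or stronger noise (a smaller effective mini-batch) lowers ϕ and flattens the distribution over the loss surface, which is at least qualitatively consistent with the abstract's statement that ϕ is set by the mini-batch size, the learning rate, and the Hessian at the minimum.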
