Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

05/22/2017 ∙ by Chris Junchi Li, et al.

In this paper, we study stochastic gradient descent (SGD) for nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviations for random dynamical systems, we prove the following in the small-stepsize regime and in the presence of omnidirectional noise: starting from a local minimizer (resp. saddle point), the SGD iteration escapes in a number of iterations that depends exponentially (resp. linearly) on the inverse stepsize. We take the deep neural network as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiates the very recent empirical results of Keskar et al. (2016), which suggest that large batch sizes in synchronous training of deep learning models lead to poor generalization error.
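The escape-time claim can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it assumes a hypothetical one-dimensional double-well loss f(x) = (x^2 - 1)^2 / 4 with minimizers at x = ±1 and an unstable stationary point at x = 0 standing in for a saddle, and it models minibatch noise as additive Gaussian noise of scale `sigma` on the gradient. The function names, the exit thresholds, and the stepsizes are all illustrative choices.

```python
import numpy as np


def grad(x):
    """Gradient of the assumed toy loss f(x) = (x**2 - 1)**2 / 4."""
    return x**3 - x


def left_minimizer(x):
    """True once an iterate started at x = 1 has crossed into the other well."""
    return x < -0.5


def left_saddle(x):
    """True once an iterate started at x = 0 has moved away from it."""
    return np.abs(x) > 0.5


def mean_escape_steps(x0, eta, has_escaped, sigma=1.0,
                      n_trials=100, max_steps=50_000, seed=0):
    """Average number of noisy gradient steps before has_escaped(x) holds.

    Runs n_trials independent chains in parallel. A chain that never
    escapes is counted as max_steps, so the mean is censored from above.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_trials, float(x0))
    steps = np.full(n_trials, float(max_steps))
    alive = np.ones(n_trials, dtype=bool)
    for k in range(1, max_steps + 1):
        noise = rng.standard_normal(n_trials)
        # SGD step with additive gradient noise; escaped chains stay frozen.
        x = np.where(alive, x - eta * (grad(x) + sigma * noise), x)
        just_left = alive & has_escaped(x)
        steps[just_left] = k
        alive &= ~just_left
        if not alive.any():
            break
    return float(steps.mean())


if __name__ == "__main__":
    for eta in (0.25, 0.125):
        from_min = mean_escape_steps(1.0, eta, left_minimizer)
        from_saddle = mean_escape_steps(0.0, eta, left_saddle)
        print(f"eta={eta}: minimizer {from_min:.0f} steps, "
              f"saddle {from_saddle:.0f} steps")
```

In this toy setting, halving the stepsize should lengthen the escape from the minimizer by much more than a factor of two, while the escape from the unstable point stays short. The stepsizes here are deliberately large so the run finishes quickly; the theorem itself concerns the small-stepsize limit.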


