Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

05/22/2017 ∙ by Chris Junchi Li, et al. ∙ 0

In this paper, we study the stochastic gradient descent method in analyzing nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviation of random dynamical system, we prove in the small stepsize regime and the presence of omnidirectional noise the following: starting from a local minimizer (resp. saddle point) the SGD iteration escapes in a number of iteration that is exponentially (resp. linearly) dependent on the inverse stepsize. We take the deep neural network as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiate a very recent empirical results by keskar2016large, suggesting that large batch sizes in training deep learning for synchronous optimization leads to poor generalization error.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.