Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

by Chris Junchi Li, et al.

In this paper, we study the stochastic gradient descent (SGD) method for nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviations for random dynamical systems, we prove the following in the small-stepsize regime and in the presence of omnidirectional noise: starting from a local minimizer (resp. saddle point), the SGD iteration escapes in a number of iterations that is exponentially (resp. linearly) dependent on the inverse stepsize. We take the deep neural network as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiates very recent empirical results by Keskar et al. (2016), suggesting that large batch sizes in training deep learning models for synchronous optimization lead to poor generalization error.
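The escape-time claim for a local minimizer can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes a one-dimensional double-well objective f(x) = (x^2 - 1)^2 / 4, additive Gaussian gradient noise of scale `sigma`, and treats "escape" as the first crossing of the barrier at x = 0. The names `grad`, `escape_time`, and `mean_escape_time` are hypothetical helpers introduced here for illustration.

```python
import numpy as np


def grad(x):
    # Gradient of the assumed double-well f(x) = (x**2 - 1)**2 / 4,
    # which has local minimizers at x = -1 and x = +1 and a barrier at x = 0.
    return x * (x**2 - 1.0)


def escape_time(eta, sigma=2.0, x0=-1.0, max_iters=200_000, seed=0):
    """Number of noisy gradient steps until the iterate first crosses x = 0.

    Returns max_iters if no crossing happens within the iteration budget.
    """
    rng = np.random.default_rng(seed)
    x = x0
    for k in range(1, max_iters + 1):
        # Stochastic gradient = true gradient + Gaussian noise (an assumption).
        noisy_grad = grad(x) + sigma * rng.standard_normal()
        x -= eta * noisy_grad
        if x > 0.0:
            return k
    return max_iters


def mean_escape_time(eta, sigma=2.0, n_seeds=5):
    # Average over a few independent runs to smooth out run-to-run variation.
    times = [escape_time(eta, sigma=sigma, seed=s) for s in range(n_seeds)]
    return float(np.mean(times))


if __name__ == "__main__":
    for eta in (0.1, 0.05, 0.025):
        print(f"eta={eta}: mean escape time {mean_escape_time(eta):.0f} iterations")
```

Under these assumptions, halving the stepsize should increase the mean escape time by far more than a factor of two, which is the qualitative behavior the abstract describes for local minimizers. In the abstract's setting a larger batch size lowers the gradient noise, which corresponds loosely to a smaller `sigma` here.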



Linear Convergence of Accelerated Stochastic Gradient Descent for Nonconvex Nonsmooth Optimization

In this paper, we study the stochastic gradient descent (SGD) method for...

Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

Solving statistical learning problems often involves nonconvex optimizat...

Poor starting points in machine learning

Poor (even random) starting points for learning/training/optimization ar...

Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations

Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely ap...

Stochastic Item Descent Method for Large Scale Equal Circle Packing Problem

Stochastic gradient descent (SGD) is a powerful method for large-scale o...

Stability Analysis of Sharpness-Aware Minimization

Sharpness-aware minimization (SAM) is a recently proposed training metho...