An Alternative View: When Does SGD Escape Local Minima?

02/17/2018
by Robert Kleinberg, et al.

Stochastic gradient descent (SGD) is widely used in machine learning. Although it is commonly viewed as a fast but less accurate version of gradient descent (GD), it always finds better solutions than GD for modern neural networks. To understand this phenomenon, we take an alternative view: SGD works on a convolved (and thus smoothed) version of the loss function. We show that, even if the function f has many bad local minima or saddle points, as long as for every point x the weighted average of the gradients in its neighborhood is one-point convex with respect to the desired solution x^*, SGD will get close to, and then stay around, x^* with constant probability. In particular, SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information. The neighborhood size is controlled by the step size and the gradient noise. Our result identifies a set of functions on which SGD provably works that is much larger than the set of convex functions. Empirically, we observe that the loss surface of neural networks enjoys nice one-point convexity properties locally, so our theorem helps explain why SGD works so well for neural networks.
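
To make the condition described above concrete, here is a minimal formalization. It is a sketch rather than the paper's exact statement: the noise model, the smoothing kernel, and the constant c below are illustrative assumptions, and the paper's precise notation and constants may differ. Writing the SGD update with step size \eta and zero-mean gradient noise \xi_t, the "weighted average of the gradients in its neighborhood" can be read as the gradient of a smoothed (convolved) loss, and one-point convexity with respect to x^* as the requirement that this averaged negative gradient always points toward x^*:

\[
x_{t+1} \;=\; x_t - \eta\,\big(\nabla f(x_t) + \xi_t\big), \qquad \mathbb{E}[\xi_t] = 0,
\]
\[
\big\langle -\,\mathbb{E}_{\xi}\big[\nabla f(x - \eta\xi)\big],\; x^* - x \big\rangle \;\ge\; c\,\lVert x - x^*\rVert^2 \quad \text{for all } x.
\]

Under a condition of this form, the abstract's claim is that the iterates approach and then remain in a neighborhood of x^* whose radius is governed by the step size and the noise magnitude, which is why sharp local minima with small diameters do not trap SGD.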


