SGD Converges to Global Minimum in Deep Learning via Star-convex Path

01/02/2019
by   Yi Zhou, et al.
18

Stochastic gradient descent (SGD) has been found to be surprisingly effective in training a variety of deep neural networks. However, there is still a lack of understanding on how and why SGD can train these complex networks towards a global minimum. In this study, we establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. Our argument exploits the following two important properties: 1) the training loss can achieve zero value (approximately), which has been widely observed in deep learning; 2) SGD follows a star-convex path, which is verified by various experiments in this paper. In such a context, our analysis shows that SGD, although has long been considered as a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum.

READ FULL TEXT

page 7

page 8

research
11/20/2020

Convergence Analysis of Homotopy-SGD for non-convex optimization

First-order stochastic methods for solving large-scale non-convex optimi...
research
05/14/2019

Robust Neural Network Training using Periodic Sampling over Model Weights

Deep neural networks provide best-in-class performance for a number of c...
research
06/08/2015

Path-SGD: Path-Normalized Optimization in Deep Neural Networks

We revisit the choice of SGD for training deep neural networks by recons...
research
05/04/2021

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

Stochastic gradient descent (SGD) is one of the most popular algorithms ...
research
12/24/2022

Visualizing Information Bottleneck through Variational Inference

The Information Bottleneck theory provides a theoretical and computation...
research
02/28/2017

Learning What Data to Learn

Machine learning is essentially the sciences of playing with data. An ad...
research
05/31/2023

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

While stochastic gradient descent (SGD) is still the most popular optimi...

Please sign up or login with your details

Forgot password? Click here to reset