Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

11/21/2018
by   Difan Zou, et al.
18

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activiation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way to study the optimization dynamics of training modern deep neural networks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/30/2019

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

We study the training and generalization of deep neural networks (DNNs) ...
research
06/28/2019

Neural ODEs as the Deep Limit of ResNets with constant weights

In this paper we prove that, in the deep limit, the stochastic gradient ...
research
04/06/2023

Training a Two Layer ReLU Network Analytically

Neural networks are usually trained with different variants of gradient ...
research
01/28/2022

Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks

We study the overparametrization bounds required for the global converge...
research
11/28/2021

Generalization Performance of Empirical Risk Minimization on Over-parameterized Deep ReLU Nets

In this paper, we study the generalization performance of global minima ...
research
12/25/2018

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Many modern learning tasks involve fitting nonlinear models to data whic...
research
05/31/2021

Combining resampling and reweighting for faithful stochastic optimization

Many machine learning and data science tasks require solving non-convex ...

Please sign up or login with your details

Forgot password? Click here to reset