Gradient Descent Provably Optimizes Over-parameterized Neural Networks

by Simon S. Du, et al.

One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with m hidden nodes, ReLU activation, and n training data points, we show that as long as m is large enough and the data are non-degenerate, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis is based on the following observation: over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which allows us to exploit a strong-convexity-like property to show that gradient descent converges at a linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first-order methods.
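The phenomenon the abstract describes is easy to observe empirically. The sketch below (an illustration, not the paper's proof, with hyperparameters such as m = 1000 and the step size chosen here for demonstration) trains an over-parameterized two-layer ReLU network of the paper's form, f(x) = (1/√m) Σ_r a_r · ReLU(w_r · x), with the output weights a_r fixed at ±1 and only the first-layer weights W trained by plain gradient descent on the quadratic loss:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 5, 3, 1000                           # n data points, input dim d, m hidden units (m >> n)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm, generic inputs (non-degenerate data)
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                    # random Gaussian initialization of hidden weights
a = rng.choice([-1.0, 1.0], size=m)            # output weights fixed at +-1, never trained

def forward(W):
    # f(x_i) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x_i), vectorized over all n points
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def loss(W):
    return 0.5 * np.sum((forward(W) - y) ** 2)  # quadratic loss

eta = 0.5                                       # step size (an assumption chosen for this toy scale)
losses = [loss(W)]
for _ in range(500):
    pre = X @ W.T                               # (n, m) pre-activations
    err = forward(W) - y                        # (n,) residuals
    # dL/dw_r = sum_i err_i * a_r * 1{w_r . x_i > 0} * x_i / sqrt(m)
    grad = (err[:, None] * (pre > 0) * a / np.sqrt(m)).T @ X
    W -= eta * grad
    losses.append(loss(W))

print(losses[0], losses[-1])                    # training loss shrinks toward zero
```

On such small, generic data the training loss drops by several orders of magnitude, consistent with the linear convergence rate the paper proves; one can also check that each row of W stays close to its initialization, which is the key observation in the analysis.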




Provable Convergence of Nesterov Accelerated Method for Over-Parameterized Neural Networks

Despite the empirical success of deep learning, it still lacks theoretic...

On the Proof of Global Convergence of Gradient Descent for Deep ReLU Networks with Linear Widths

This paper studies the global convergence of gradient descent for deep R...

Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization

Although deep learning has shown its powerful performance in many applic...

Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

We revisit the problem of learning a single neuron with ReLU activation ...

Improved Convergence Guarantees for Shallow Neural Networks

We continue a long line of research aimed at proving convergence of dept...

Global Convergence Rate of Deep Equilibrium Models with General Activations

In a recent paper, Ling et al. investigated the over-parametrized Deep E...

A Convergence Theory for Deep Learning via Over-Parameterization

Deep neural networks (DNNs) have demonstrated dominating performance in ...
