Overparameterization of deep ResNet: zero loss and mean-field analysis

05/30/2021
by   Zhiyan Ding, et al.

Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, yet a basic first-order method (gradient descent) finds a globally optimal solution with perfect fit in many practical situations. We examine this phenomenon for Residual Neural Networks (ResNets) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of neurons in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that, in the large-NN limit, gradient descent on the parameters becomes a partial differential equation (PDE) that characterizes gradient flow for a probability distribution over the parameters. Next, we show that the solution to this PDE converges, as training time goes to infinity, to a zero-loss solution. Together, these results imply that training the ResNet also yields near-zero loss, provided the network is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
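To make the limiting objects concrete, the sketch below writes out one standard form of the mean-field ResNet and its gradient-flow PDE. The 1/(LM) residual scaling, the residual map f, the loss functional E, and the continuous depth variable s are illustrative assumptions matching the common setup in this literature, not quotations from the paper.

```latex
% Discrete ResNet of depth L and width M (the 1/(LM) scaling is assumed):
%   x_{l+1} = x_l + \frac{1}{LM} \sum_{m=1}^{M} f(x_l, \theta_{l,m})
% Large-(L, M) limit: the forward pass becomes an ODE in continuous depth
% s \in [0,1], driven by a parameter distribution \rho(\theta, s):
\begin{equation}
  \frac{\mathrm{d}x(s)}{\mathrm{d}s}
  = \int f\bigl(x(s), \theta\bigr)\,\rho(\theta, s)\,\mathrm{d}\theta .
\end{equation}
% Gradient descent on the parameters then corresponds, in the same limit,
% to a Wasserstein gradient flow of the loss E(\rho) in training time t:
\begin{equation}
  \partial_t \rho(\theta, s; t)
  = \nabla_\theta \cdot \Bigl( \rho(\theta, s; t)\,
    \nabla_\theta \frac{\delta E(\rho)}{\delta \rho}(\theta, s; t) \Bigr).
\end{equation}
```

In this picture, each parameter of the finite network plays the role of a particle, and the PDE describes the evolution of the empirical distribution of those particles under gradient descent.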

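A minimal particle-discretization sketch of this setup follows; it is not the paper's code. The 1/(L*M) scaling, the tanh residual map, the scalar readout, the toy data, and the step size are all assumptions chosen so that plain gradient descent on a depth-L, width-M ResNet mirrors the finite-particle system whose limit is the PDE above.

```python
# Toy depth-L, width-M ResNet with a smooth activation, trained by plain
# gradient descent. All modeling choices here are illustrative assumptions.
import jax
import jax.numpy as jnp

L, M, D = 32, 64, 4                          # depth, width, input dimension
kw, ka, kx = jax.random.split(jax.random.PRNGKey(0), 3)

# One parameter "particle" theta_{l,m} = (w, a); f(x, theta) = a * tanh(w.x)
W = 0.1 * jax.random.normal(kw, (L, M, D))   # inner weights
A = 0.1 * jax.random.normal(ka, (L, M, D))   # outer weights

def forward(params, x):
    W, A = params
    for l in range(L):
        h = jnp.tanh(W[l] @ x)               # (M,) neuron activations
        x = x + (A[l].T @ h) / (L * M)       # residual update, 1/(LM) scaling
    return x[0]                              # scalar readout (assumed)

def loss(params, xs, ys):
    preds = jax.vmap(lambda x: forward(params, x))(xs)
    return jnp.mean((preds - ys) ** 2)

# Toy regression data; gradient descent evolves the parameter "particles".
xs = jax.random.normal(kx, (16, D))
ys = jnp.sin(xs[:, 0])
params = (W, A)
grad_fn = jax.jit(jax.grad(loss))
for t in range(2001):
    if t % 500 == 0:
        print(t, float(loss(params, xs, ys)))
    gW, gA = grad_fn(params, xs, ys)
    params = (params[0] - 0.1 * gW, params[1] - 0.1 * gA)
```

Increasing L and M in this sketch corresponds to moving toward the mean-field regime, where the theory predicts the training loss can be driven toward zero.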

Related research

10/06/2021 · On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime
03/11/2020 · A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth
05/10/2018 · Scaling limit of the Stein variational gradient descent part I: the mean field regime
11/11/2019 · Stronger Convergence Results for Deep Residual Networks: Network Width Scales Linearly with Training Data Size
07/07/2020 · Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)
10/29/2021 · Limiting fluctuation and trajectorial stability of multilayer neural networks with mean field training
08/28/2020 · Predicting Training Time Without Training
