Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks

06/05/2022
by Mingze Wang, et al.

The convergence of GD and SGD when training mildly parameterized neural networks starting from random initialization is studied. For a broad range of models and loss functions, including the most commonly used square loss and cross entropy loss, we prove an “early stage convergence” result: the loss decreases by a significant amount early in training, and this decrease happens quickly. Furthermore, for exponential-type loss functions, and under some assumptions on the training data, we show global convergence of GD. Instead of relying on extreme over-parameterization, our analysis is based on a microscopic study of the activation patterns of the neurons, which yields stronger lower bounds on the gradient. The results on activation patterns, which we call the “neuron partition”, help build intuition for the behavior of neural network training dynamics and may be of independent interest.
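To make the notion of activation patterns concrete, here is a minimal illustrative sketch (not the paper's construction): for a randomly initialized two-layer ReLU network, it records which training samples each hidden neuron is active on, then groups neurons with identical patterns. The network size, data, and the grouping rule are all assumptions chosen for illustration.

```python
# Illustrative sketch, assuming a two-layer ReLU network f(x) = sum_k a_k * relu(w_k . x + b_k).
# It computes each neuron's activation pattern over the training set and groups neurons
# that share a pattern (a loose, illustrative version of a "neuron partition").
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 3, 16  # samples, input dimension, hidden width (modest, i.e. mildly parameterized)

X = rng.standard_normal((n, d))               # training inputs
W = rng.standard_normal((m, d)) / np.sqrt(d)  # random initialization of hidden weights
b = np.zeros(m)                               # hidden biases

# Activation pattern: one boolean row per neuron, marking the samples on which it is active.
patterns = (X @ W.T + b > 0).T                # shape (m, n)

# Partition neurons by identical activation patterns.
partition = {}
for k, row in enumerate(patterns):
    partition.setdefault(tuple(row.tolist()), []).append(k)

for pat, neurons in partition.items():
    print(f"active on samples {np.flatnonzero(pat).tolist()}: neurons {neurons}")
```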
