The Quenching-Activation Behavior of the Gradient Descent Dynamics for Two-layer Neural Network Models

by   Chao Ma, et al.

A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes when the target function can be accurately approximated by a relatively small number of neurons. It is found that for Xavier-like initialization, there are two distinctive phases in the dynamic behavior of GD in the under-parametrized regime: An early phase in which the GD dynamics follows closely that of the corresponding random feature model and the neurons are effectively quenched, followed by a late phase in which the neurons are divided into two groups: a group of a few "activated" neurons that dominate the dynamics and a group of background (or "quenched") neurons that support the continued activation and deactivation process. This neural network-like behavior is continued into the mildly over-parametrized regime, where it undergoes a transition to a random feature-like behavior. The quenching-activation process seems to provide a clear mechanism for "implicit regularization". This is qualitatively different from the dynamics associated with the "mean-field" scaling where all neurons participate equally and there does not appear to be qualitative changes when the network parameters are changed.


page 1

page 2

page 3

page 4


Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates

We consider optimizing two-layer neural networks in the mean-field regim...

Phase diagram for two-layer ReLU neural networks at infinite-width limit

How neural network behaves during the training over different choices of...

Analysis of feature learning in weight-tied autoencoders via the mean field lens

Autoencoders are among the earliest introduced nonlinear models for unsu...

Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks

Can multilayer neural networks -- typically constructed as highly comple...

Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

The behavior of the gradient descent (GD) algorithm is analyzed for a de...

The Slow Deterioration of the Generalization Error of the Random Feature Model

The random feature model exhibits a kind of resonance behavior when the ...