Over-parameterization as a Catalyst for Better Generalization of Deep ReLU Networks

09/30/2019
by Yuandong Tian, et al.

To analyze deep ReLU networks, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth via Stochastic Gradient Descent (SGD). Our contributions are two-fold. First, we prove that when the gradient is zero (or bounded above by a small constant) at every training data point, a situation called the interpolation setting, there exists a many-to-one alignment between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization to unseen data is achievable, even though the same condition often leads to zero training error. Second, an analysis of noisy recovery and training dynamics in 2-layer networks shows that strong teacher nodes (those with large fan-out weights) are learned first, while subtle teacher nodes are left unlearned until a late stage of training; as a result, it can take a long time to converge to these small-gradient critical points. Our analysis shows that over-parameterization plays two roles: (1) it is a necessary condition for alignment to happen at the critical points, and (2) during training, it helps student nodes cover more teacher nodes in fewer iterations. Both improve generalization. Experiments justify our findings.
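The student-teacher setting is easy to make concrete. Below is a minimal NumPy sketch of the 2-layer case, with illustrative sizes, scales, and learning rate that are assumptions, not values from the paper: a fixed random teacher generates the labels, an over-parameterized student (4x the teacher's hidden width here) is trained by plain SGD on one sample at a time, and at the end we check the many-to-one alignment between lowest-layer weight vectors via cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): the student hidden layer is
# 4x wider than the teacher's, i.e. the student is over-parameterized.
d, m_teacher, m_student, lr = 10, 5, 20, 0.01

# Fixed 2-layer ReLU teacher: x -> v_t @ relu(W_t @ x).
W_t = rng.normal(size=(m_teacher, d)) / np.sqrt(d)
v_t = rng.normal(size=m_teacher)

# Trainable student of the same depth, wider hidden layer.
W_s = rng.normal(size=(m_student, d)) * 0.1
v_s = rng.normal(size=m_student) * 0.1

def relu(z):
    return np.maximum(z, 0.0)

for step in range(50000):
    x = rng.normal(size=d)          # one SGD sample
    y = v_t @ relu(W_t @ x)         # the fixed teacher provides the label
    h = relu(W_s @ x)
    err = float(v_s @ h - y)        # residual of the squared loss 0.5 * err**2
    # Manual gradients w.r.t. the student parameters.
    g_v = err * h
    g_W = err * np.outer(v_s * (h > 0.0), x)
    v_s -= lr * g_v
    W_s -= lr * g_W

# Alignment check: cosine similarity between each teacher row of W_t and
# its best-matching student row of W_s (the many-to-one alignment).
def row_norm(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

cos = row_norm(W_t) @ row_norm(W_s).T
print("best student match per teacher node:", cos.max(axis=1).round(3))
```

In runs of this sketch, teacher rows with large fan-out weights in v_t tend to find a near-1 cosine match earlier than weak ones, mirroring the dynamics described above; the exact numbers depend on the seed and the chosen scales.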

Related research:

05/31/2019 · Luck Matters: Understanding Training Dynamics of Deep ReLU Networks
We analyze the dynamics of training deep ReLU networks and their implica...

06/11/2021 · On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting
Deep learning empirically achieves high performance in many applications...

06/18/2019 · Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
Deep neural networks achieve stellar generalisation even when they have ...

04/29/2021 · Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines
Over-parametrized deep neural networks trained by stochastic gradient de...

01/25/2019 · Generalisation dynamics of online learning in over-parameterised neural networks
Deep neural networks achieve stellar generalisation on a variety of prob...

02/04/2021 · A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
While over-parameterization is widely believed to be crucial for the suc...

10/04/2020 · Understanding How Over-Parametrization Leads to Acceleration: A case of learning a single teacher neuron
Over-parametrization has become a popular technique in deep learning. It...