The Benefits of Over-parameterization at Initialization in Deep ReLU Networks

01/11/2019
by   Devansh Arpit, et al.
17

It has been noted in existing literature that over-parameterization in ReLU networks generally leads to better performance. While there could be several reasons for this, we investigate desirable network properties at initialization which may be enjoyed by ReLU networks. Without making any assumption, we derive a lower bound on the layer width of deep ReLU networks whose weights are initialized from a certain distribution, such that with high probability, i) the norm of hidden activation of all layers are roughly equal to the norm of the input, and, ii) the norm of parameter gradient for all the layers are roughly the same. In this way, sufficiently wide deep ReLU nets with appropriate initialization can inherently preserve the forward flow of information and also avoid the gradient exploding/vanishing problem. We further show that these results hold for an infinite number of data samples, in which case the finite lower bound depends on the input dimensionality and the depth of the network. In the case of deep ReLU networks with weight vectors normalized by their norm, we derive an initialization required to tap the aforementioned benefits from over-parameterization without which network fails to learn for large depth.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/18/2022

On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias

We study the dynamics and implicit bias of gradient flow (GF) on univari...
research
01/24/2021

On the Proof of Global Convergence of Gradient Descent for Deep ReLU Networks with Linear Widths

This paper studies the global convergence of gradient descent for deep R...
research
03/15/2019

Dying ReLU and Initialization: Theory and Numerical Examples

The dying ReLU refers to the problem when ReLU neurons become inactive a...
research
10/31/2017

Approximating Continuous Functions by ReLU Nets of Minimal Width

This article concerns the expressive power of depth in deep feed-forward...
research
07/23/2019

Trainability and Data-dependent Initialization of Over-parameterized ReLU Neural Networks

A neural network is said to be over-specified if its representational po...
research
03/23/2021

Initializing ReLU networks in an expressive subspace of weights

Using a mean-field theory of signal propagation, we analyze the evolutio...
research
12/19/2014

Random Walk Initialization for Training Very Deep Feedforward Networks

Training very deep networks is an important open problem in machine lear...

Please sign up or login with your details

Forgot password? Click here to reset