How to Start Training: The Effect of Initialization and Architecture

03/05/2018
by Boris Hanin, et al.

We investigate the effects of initialization and architecture on the start of training in deep ReLU nets. We identify two common failure modes for early training in which the mean and variance of activations are poorly behaved. For each failure mode, we give a rigorous proof of when it occurs at initialization and how to avoid it. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in. The second failure mode, exponentially large variance of activation length, can be avoided by keeping the sum of the reciprocals of the layer widths bounded. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allow much deeper networks to be trained.
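As a rough illustration of the two criteria described in the abstract, here is a minimal NumPy sketch. The function names and example widths are ours, not from the paper: the first helper samples weights from a symmetric distribution with variance 2/fan-in, and the second computes the sum of reciprocal hidden-layer widths that the abstract identifies as controlling the variance of activation lengths.

```python
import numpy as np

def init_relu_layer(fan_in, fan_out, rng=None):
    """Sample a weight matrix from a symmetric (Gaussian) distribution
    with variance 2 / fan_in, the criterion the abstract gives for
    avoiding exploding/vanishing mean activation length in ReLU nets."""
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

def sum_reciprocal_widths(hidden_widths):
    """Architecture diagnostic from the abstract: the variance of the
    activation length grows with the sum of reciprocals of the hidden
    layer widths, so this sum should stay small as depth increases."""
    return sum(1.0 / w for w in hidden_widths)

# Illustrative comparison (hypothetical widths): a deep-but-wide net keeps
# the sum small, while an equally deep, narrow net does not.
print(sum_reciprocal_widths([500] * 50))  # 0.1 -> variance well controlled
print(sum_reciprocal_widths([10] * 50))   # 5.0 -> large activation-length variance
```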
