
How to Start Training: The Effect of Initialization and Architecture

by Boris Hanin, et al.
Texas A&M University

We investigate the effects of initialization and architecture on the start of training in deep ReLU nets. We identify two common failure modes for early training in which the mean and variance of activations are poorly behaved. For each failure mode, we give a rigorous proof of when it occurs at initialization and how to avoid it. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in. The second failure mode, exponentially large variance of activation length, can be avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allow much deeper networks to be trained.
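The two criteria in the abstract can be illustrated with a small numerical sketch. The code below is not taken from the paper. It assumes a bias-free fully connected ReLU net with Gaussian weights, and the names `sample_weights`, `sum_reciprocal_widths`, `length_ratio`, and `mean_length_ratio` are hypothetical helpers introduced here. It estimates how the normalized squared activation length changes from input to last hidden layer for a given weight variance `scale / fan_in`, and computes the sum of reciprocal hidden widths for an architecture.

```python
import numpy as np


def sample_weights(fan_in, fan_out, scale, rng):
    """Zero-mean Gaussian weights with variance scale / fan_in (assumed distribution)."""
    return rng.normal(0.0, np.sqrt(scale / fan_in), size=(fan_out, fan_in))


def sum_reciprocal_widths(hidden_widths):
    """Sum of 1/n over the hidden layer widths n."""
    return sum(1.0 / n for n in hidden_widths)


def length_ratio(input_dim, hidden_widths, scale, rng):
    """Ratio of the normalized squared length at the last hidden layer to that of the input."""
    x = rng.normal(size=input_dim)
    m0 = np.mean(x ** 2)  # normalized squared length of the input
    h = x
    for n in hidden_widths:
        w = sample_weights(h.shape[0], n, scale, rng)
        h = np.maximum(w @ h, 0.0)  # ReLU layer, no bias (simplifying assumption)
    return np.mean(h ** 2) / m0


def mean_length_ratio(input_dim, hidden_widths, scale, trials=200, seed=0):
    """Monte Carlo average of length_ratio over independent random initializations."""
    rng = np.random.default_rng(seed)
    ratios = [length_ratio(input_dim, hidden_widths, scale, rng) for _ in range(trials)]
    return float(np.mean(ratios))


if __name__ == "__main__":
    widths = [100] * 5
    # scale = 2 corresponds to variance 2/fan-in; scale = 1 to variance 1/fan-in.
    print("variance 2/fan-in:", mean_length_ratio(100, widths, scale=2.0))
    print("variance 1/fan-in:", mean_length_ratio(100, widths, scale=1.0))
    # Same depth, different widths: the narrower net has the larger sum.
    print("sum 1/n, width 100:", sum_reciprocal_widths([100] * 10))
    print("sum 1/n, width 10: ", sum_reciprocal_widths([10] * 10))
```

Under these assumptions, the average ratio should stay near 1 with variance 2/fan-in and shrink roughly like (1/2)^depth with variance 1/fan-in. The last two lines show that two nets of equal depth can differ by a factor of ten in the sum of reciprocal widths, the quantity the abstract ties to the variance of activation length.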




Fractional moment-preserving initialization schemes for training fully-connected neural networks

A common approach to initialization in deep neural networks is to sample...

Fibres of Failure: Classifying errors in predictive processes

We describe Fibres of Failure (FiFa), a method to classify failure modes...

Effect of Activation Functions on the Training of Overparametrized Neural Nets

It is well-known that overparametrized neural networks trained using gra...

Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

We give a rigorous analysis of the statistical behavior of gradients in ...

Collapse of Deep and Narrow Neural Nets

Recent theoretical work has demonstrated that deep neural networks have ...

Expected Gradients of Maxout Networks and Consequences to Parameter Initialization

We study the gradients of a maxout network with respect to inputs and pa...

Extreme Memorization via Scale of Initialization

We construct an experimental setup in which changing the scale of initia...