A Mean Field Theory of Batch Normalization

02/21/2019
by Greg Yang, et al.

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
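The central claim, that gradients in a vanilla batch-normalized feedforward network grow exponentially with depth at initialization, is straightforward to probe numerically. The sketch below is a minimal, hypothetical check (not the authors' code); the width, depth, batch size, and scalar loss are illustrative choices, and the network is a plain fully-connected ReLU stack with batch normalization and no skip connections.

```python
# Minimal sketch, not the authors' code: probe gradient growth with depth
# in a vanilla batch-normalized fully-connected ReLU network at initialization.
# Width, depth, batch size, and the scalar loss below are illustrative choices.
import torch
import torch.nn as nn

torch.manual_seed(0)
width, batch_size, depth = 256, 128, 50

blocks = []
for _ in range(depth):
    blocks += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
net = nn.Sequential(*blocks)

x = torch.randn(batch_size, width)
loss = net(x).pow(2).mean()   # arbitrary scalar loss at initialization
loss.backward()

# Weight-gradient norm of the Linear layer in every 10th block.
# Layers closer to the input (small block index) should show much larger
# norms, reflecting exponential growth of the gradient during backprop.
for block_idx in range(0, depth, 10):
    linear = net[3 * block_idx]
    print(f"block {block_idx:2d}  ||dL/dW|| = {linear.weight.grad.norm().item():.3e}")
```

A companion experiment suggested by the abstract would swap the ReLU for an activation operating close to its linear regime and confirm that the growth rate of the gradient norms shrinks, though, per the paper's analysis, it does not vanish.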

Related research

02/28/2017
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
A long-standing obstacle to progress in deep learning is the problem of ...

12/24/2017
Mean Field Residual Networks: On the Edge of Chaos
We study randomly initialized residual networks using mean field theory ...

06/01/2023
Spreads in Effective Learning Rates: The Perils of Batch Normalization During Early Training
Excursions in gradient magnitude pose a persistent challenge when traini...

12/19/2019
Mean field theory for deep dropout networks: digging up gradient backpropagation deeply
In recent years, the mean field theory has been applied to the study of ...

08/01/2022
Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization
Training deep neural networks is a very demanding task, especially chall...

11/07/2018
Characterizing Well-behaved vs. Pathological Deep Neural Network Architectures
We introduce a principled approach, requiring only mild assumptions, for...

06/14/2018
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
In recent years, state-of-the-art methods in computer vision have utiliz...
