Batch Normalization Biases Deep Residual Networks Towards Shallow Paths

02/24/2020
by Soham De, et al.

Batch normalization has multiple benefits. It improves the conditioning of the loss landscape, and is a surprisingly effective regularizer. However, the most important benefit of batch normalization arises in residual networks, where it dramatically increases the largest trainable depth. We identify the origin of this benefit: At initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor proportional to the square root of the network depth. This ensures that, early in training, the function computed by deep normalized residual networks is dominated by shallow paths with well-behaved gradients. We use this insight to develop a simple initialization scheme which can train very deep residual networks without normalization. We also clarify that, although batch normalization does enable stable training with larger learning rates, this benefit is only useful when one wishes to parallelize training over large batch sizes. Our results help isolate the distinct benefits of batch normalization in different architectures.
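To make the mechanism concrete, below is a minimal sketch of the kind of initialization scheme the abstract alludes to: batch normalization is removed, and each residual branch is instead multiplied by a learnable scalar initialized to zero, so that every block computes the identity at initialization and the network is dominated by shallow paths early in training. The class name, layer widths, and the choice of a pre-activation block are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SkipScaledResidualBlock(nn.Module):
    """Pre-activation residual block with no batch normalization.

    A single learnable scalar, initialized to zero, multiplies the
    residual branch, so every block computes the identity at
    initialization. This makes explicit the implicit downscaling
    (on the order of 1/sqrt(depth)) that batch normalization applies
    to the residual branch in a normalized network.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()
        # Scalar multiplier on the residual branch, initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(self.relu(self.conv1(self.relu(x))))
        # At initialization alpha == 0, so the block reduces to the skip
        # connection and gradients flow through shallow paths only.
        return x + self.alpha * residual


if __name__ == "__main__":
    block = SkipScaledResidualBlock(channels=16)
    x = torch.randn(2, 16, 8, 8)
    # Output equals the input at initialization (alpha == 0).
    print(torch.allclose(block(x), x))  # True
```

The zero-initialized scalar simply makes explicit the suppression of the residual branch that, according to the abstract, batch normalization provides implicitly by a factor proportional to the square root of the network depth.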


Related research

12/23/2021 - A Robust Initialization of Residual Blocks for Effective ResNet Training without Batch Normalization
Batch Normalization is an essential component of all state-of-the-art ne...

10/05/2022 - Dynamical Isometry for Residual Networks
The training success, training speed and generalization ability of neura...

11/08/2016 - The Loss Surface of Residual Networks: Ensembles and the Role of Batch Normalization
Deep Residual Networks present a premium in performance in comparison to...

12/15/2017 - Gradients explode - Deep Networks are shallow - ResNet explained
Whereas it is believed that techniques such as Adam, batch normalization...

06/01/2023 - Spreads in Effective Learning Rates: The Perils of Batch Normalization During Early Training
Excursions in gradient magnitude pose a persistent challenge when traini...

11/07/2018 - Characterizing Well-behaved vs. Pathological Deep Neural Network Architectures
We introduce a principled approach, requiring only mild assumptions, for...

05/15/2022 - Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks
L2 regularization for weights in neural networks is widely used as a sta...
