The Shattered Gradients Problem: If resnets are the answer, then what is the question?

02/28/2017
by David Balduzzi, et al.

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying only sublinearly with depth. Detailed empirical evidence is presented in support of the analysis on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering. Preliminary experiments show that the new initialization makes it possible to train very deep networks without the addition of skip-connections.
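
The "looks linear" idea mentioned in the abstract pairs concatenated ReLU (CReLU) activations with mirrored weight blocks, so that each layer computes an exactly linear map at initialization. The sketch below is a minimal NumPy illustration of that identity, not the paper's code: the names crelu and ll_layer_init are invented for this example, and the weights are drawn from a generic scaled Gaussian for brevity rather than whatever scheme the paper ultimately recommends.

```python
import numpy as np

def crelu(x):
    # Concatenated ReLU: doubles the width by stacking relu(x) and relu(-x).
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

def ll_layer_init(fan_in, fan_out, rng):
    # "Looks linear" style initialization (illustrative sketch): draw W and
    # mirror it as [W, -W], so that applied to crelu(x) the layer computes
    # W @ relu(x) - W @ relu(-x) = W @ x exactly at initialization.
    W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
    return np.concatenate([W, -W], axis=1)  # shape (fan_out, 2 * fan_in)

# Sanity check: at initialization the CReLU layer acts like a plain linear map.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W_mirror = ll_layer_init(fan_in=16, fan_out=16, rng=rng)
W = W_mirror[:, :16]           # the unmirrored half
out = W_mirror @ crelu(x)      # CReLU layer with mirrored weights
assert np.allclose(out, W @ x)  # identical to the linear map at init
```

Stacking such layers yields a network that behaves like a deep linear network at initialization, which is the property the abstract credits with preventing gradient shattering; once training perturbs the mirrored weights, the nonlinearity becomes active.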

Related research

research 02/21/2019
A Mean Field Theory of Batch Normalization
We develop a mean field theory for batch normalization in fully-connecte...

research 12/15/2017
Gradients explode - Deep Networks are shallow - ResNet explained
Whereas it is believed that techniques such as Adam, batch normalization...

research 11/07/2018
Characterizing Well-behaved vs. Pathological Deep Neural Network Architectures
We introduce a principled approach, requiring only mild assumptions, for...

research 08/18/2021
Generalizing MLPs With Dropouts, Batch Normalization, and Skip Connections
A multilayer perceptron (MLP) is typically made of multiple fully connec...

research 01/17/2023
Expected Gradients of Maxout Networks and Consequences to Parameter Initialization
We study the gradients of a maxout network with respect to inputs and pa...

research 01/31/2017
Skip Connections Eliminate Singularities
Skip connections made the training of very deep networks possible and ha...

research 06/01/2023
Spreads in Effective Learning Rates: The Perils of Batch Normalization During Early Training
Excursions in gradient magnitude pose a persistent challenge when traini...
