Demystifying ResNet

11/03/2016
by Sihan Li, et al.

The Residual Network (ResNet), proposed in He et al. (2015), utilized shortcut connections to significantly reduce the difficulty of training, which resulted in large gains in both training and generalization error. It was empirically observed in He et al. (2015) that stacking more residual blocks whose shortcuts skip two layers ("shortcut 2") yields smaller training error, while the same is not true for shortcuts of length 1 or 3. We provide a theoretical explanation for the uniqueness of shortcut 2. We show that, with or without nonlinearities, adding shortcuts of depth two makes the condition number of the Hessian of the loss function at the zero initial point depth-invariant, so that training very deep models is no more difficult than training shallow ones. Shortcuts of greater depth instead produce an extremely flat (high-order) stationary point at initialization, from which the optimization algorithm is hard to escape. A shortcut of length 1, however, is essentially equivalent to having no shortcuts at all, and its condition number explodes to infinity as the number of layers grows. We further argue that, as the number of layers tends to infinity, it suffices to study the loss function only at the zero initial point. Extensive experiments accompany our theoretical results. We show that initializing a shortcut-2 network with small weights achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of greater depth, from perspectives ranging from the final loss and the learning dynamics and stability to the behavior of the Hessian along the learning process.
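
The intuition behind these claims is easiest to see in code: when every skip connection bypasses exactly two weight layers and those layers start with small weights, each residual branch contributes almost nothing at initialization, so the stacked network begins near the identity map regardless of depth. The sketch below is a minimal illustration of such a "shortcut 2" block, not the authors' implementation; the PyTorch framework, the layer widths, the depth, and the 1e-2 initialization scale are assumptions made for the example.

```python
# A minimal sketch (not the authors' code) of the "shortcut 2" residual block
# discussed above: every skip connection bypasses exactly two weight layers,
# and the residual branches start with small weights. PyTorch, the width,
# the depth, and the 1e-2 scale are assumptions made for illustration.
import torch
import torch.nn as nn


class ShortcutTwoBlock(nn.Module):
    """Residual block whose shortcut skips two layers: x + W2 * relu(W1 * x)."""

    def __init__(self, width: int, init_scale: float = 1e-2):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.ReLU()
        # Small-weight initialization: the residual branch is nearly zero,
        # so the block starts close to the identity map.
        for layer in (self.fc1, self.fc2):
            nn.init.normal_(layer.weight, std=init_scale)
            nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(self.act(self.fc1(x)))


def make_shortcut2_stack(width: int, num_blocks: int) -> nn.Sequential:
    # Stacking more blocks barely changes the map at initialization, which is
    # the depth-invariance the abstract attributes to shortcuts of depth two.
    return nn.Sequential(*[ShortcutTwoBlock(width) for _ in range(num_blocks)])


if __name__ == "__main__":
    x = torch.randn(4, 16)
    deep = make_shortcut2_stack(width=16, num_blocks=50)
    rel_dev = ((deep(x) - x).norm() / x.norm()).item()
    print(f"relative deviation from identity at init: {rel_dev:.2e}")
```

Swapping this small-weight initialization for Xavier or orthogonal initialization on the same stack is the kind of comparison the experiments above refer to, tracking the final loss, the learning dynamics, and the Hessian over the course of training.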


