Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)

by   Yuqing Li, et al.

Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function. The behavior of a network trained by gradient descent in the infinite-width limit can be described by the Neural Tangent Kernel (NTK) introduced in <cit.>. In this paper, we study the dynamics of the NTK for a finite-width Deep Residual Network (ResNet) using the neural tangent hierarchy (NTH) proposed in <cit.>. For a ResNet with a smooth and Lipschitz activation function, we reduce the requirement on the layer width m, with respect to the number of training samples n, from quartic to cubic. Our analysis strongly suggests that the particular skip-connection structure of ResNet is the main reason for its advantage over fully-connected networks.
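To make the central object concrete, here is a minimal sketch of the empirical NTK for a toy residual block. This is an illustrative construction, not the paper's: a scalar network f(x) = x + m^{-1/2} * sum_i a_i tanh(w_i x), whose empirical NTK is the inner product of parameter gradients, Theta(x, x') = <grad_theta f(x), grad_theta f(x')>. The function names and the scalar setting are assumptions for illustration.

```python
import math
import random

def make_params(m, seed=0):
    # Hypothetical toy setup: m hidden units with Gaussian weights.
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return w, a

def param_grad(x, w, a):
    # Gradient of f(x) = x + m^{-1/2} * sum_i a_i * tanh(w_i * x)
    # with respect to all trainable parameters (a_i, w_i).
    # The skip connection x has no parameters, so it does not
    # contribute to the kernel.
    s = 1.0 / math.sqrt(len(w))
    g = []
    for wi, ai in zip(w, a):
        t = math.tanh(wi * x)
        g.append(s * t)                      # d f / d a_i
        g.append(s * ai * x * (1.0 - t * t)) # d f / d w_i
    return g

def empirical_ntk(x, xp, w, a):
    # Theta(x, x') = sum over parameters of (df(x)/dtheta) * (df(x')/dtheta).
    gx = param_grad(x, w, a)
    gxp = param_grad(xp, w, a)
    return sum(u * v for u, v in zip(gx, gxp))
```

In the infinite-width limit this kernel concentrates around a deterministic limit and stays fixed during training; the NTH studied in the paper tracks how, at finite width m, the kernel instead evolves along the gradient-descent trajectory.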




Related research:

- Dynamics of Deep Neural Networks and Neural Tangent Hierarchy
- Kinetic Theory for Residual Neural Networks
- Gradient Descent Finds Global Minima of Deep Neural Networks
- Towards Understanding the Importance of Shortcut Connections in Residual Networks
- Nonlinear Weighted Directed Acyclic Graph and A Priori Estimates for Neural Networks
- A Note on Connectivity of Sublevel Sets in Deep Learning
- Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime