Residual Tangent Kernels

by   Etai Littwin, et al.

A recent body of work has focused on the theoretical study of neural networks at the regime of large width. Specifically, it was shown that training infinitely-wide and properly scaled vanilla ReLU networks using the L2 loss is equivalent to kernel regression using the Neural Tangent Kernel, which is independent of the initialization instance, and remains constant during training. In this work, we derive the form of the limiting kernel for architectures incorporating bypass connections, namely residual networks (ResNets), as well as to densely connected networks (DenseNets). In addition, we derive finite width corrections for both cases. Our analysis reveals that deep practical residual architectures might operate much closer to the “kernel” regime than their vanilla counterparts: while in networks that do not use skip connections, convergence to the limiting kernels requires one to fix depth while increasing the layers' width, in both ResNets and DenseNets, convergence to the limiting kernel may occur for infinite deep and wide networks, provided proper initialization.


page 1

page 2

page 3

page 4


Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

The evolution of a deep neural network trained by the gradient descent c...

Finite Depth and Width Corrections to the Neural Tangent Kernel

We prove the precise scaling, at finite depth and width, for the mean an...

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

At initialization, artificial neural networks (ANNs) are equivalent to G...

Freeze and Chaos for DNNs: an NTK view of Batch Normalization, Checkerboard and Boundary Effects

In this paper, we analyze a number of architectural features of Deep Neu...

Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? – A Neural Tangent Kernel Perspective

Deep residual networks (ResNets) have demonstrated better generalization...

Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime

We provide quantitative bounds measuring the L^2 difference in function ...

Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

Training very deep neural networks is still an extremely challenging tas...