Weighted Neural Tangent Kernel: A Generalized and Improved Network-Induced Kernel

03/22/2021
by   Lei Tan, et al.
0

The Neural Tangent Kernel (NTK) has recently attracted intense study, as it describes the evolution of an over-parameterized Neural Network (NN) trained by gradient descent. However, it is now well-known that gradient descent is not always a good optimizer for NNs, which can partially explain the unsatisfactory practical performance of the NTK regression estimator. In this paper, we introduce the Weighted Neural Tangent Kernel (WNTK), a generalized and improved tool, which can capture an over-parameterized NN's training dynamics under different optimizers. Theoretically, in the infinite-width limit, we prove: i) the stability of the WNTK at initialization and during training, and ii) the equivalence between the WNTK regression estimator and the corresponding NN estimator with different learning rates on different parameters. With the proposed weight update algorithm, both empirical and analytical WNTKs outperform the corresponding NTKs in numerical experiments.

READ FULL TEXT

page 1

page 2

page 3

page 4

11/11/2021

On the Equivalence between Neural Network and Support Vector Machine

Recent research shows that the dynamics of an infinitely wide neural net...
09/18/2019

Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

The evolution of a deep neural network trained by the gradient descent c...
11/04/2020

Which Minimizer Does My Neural Network Converge To?

The loss surface of an overparameterized neural network (NN) possesses m...
05/29/2019

On the Inductive Bias of Neural Tangent Kernels

State-of-the-art neural networks are heavily over-parameterized, making ...
06/11/2021

Neural Optimization Kernel: Towards Robust Deep Learning

Recent studies show a close connection between neural networks (NN) and ...
10/19/2019

Neural Spectrum Alignment

Expressiveness of deep models was recently addressed via the connection ...
06/24/2020

When Do Neural Networks Outperform Kernel Methods?

For a certain scaling of the initialization of stochastic gradient desce...