On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization
In recent years, a critical initialization scheme based on orthogonal weights has been proposed for deep nonlinear networks. Orthogonal weights are crucial for achieving dynamical isometry in random networks, where the entire spectrum of singular values of the input-output Jacobian concentrates around one. Strong empirical evidence that orthogonal initialization speeds up training, compared to Gaussian initialization, in linear networks and in the linear regime of nonlinear networks has raised great interest. One recent work has proven the benefit of orthogonal initialization in linear networks; however, the dynamics behind it have not been revealed for nonlinear networks. In this work, we study the Neural Tangent Kernel (NTK), which describes the gradient-descent training of wide networks, for orthogonally initialized, wide, fully-connected, nonlinear networks. We prove that the NTKs of Gaussian- and orthogonally-initialized networks are equal when the network width is infinite, which implies that the training speed-up from orthogonal initialization is a finite-width effect in the small-learning-rate regime. We then find that during training, the NTK of an infinite-width network with orthogonal initialization stays constant in theory and, empirically, varies at a rate of the same order as that of Gaussian-initialized networks as the width tends to infinity. Finally, we conduct a thorough empirical investigation of training speed on the CIFAR-10 dataset and show that the benefit of orthogonal initialization lies in the large-learning-rate and large-depth regime, within the linear regime of nonlinear networks.
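The comparison the abstract describes can be probed numerically. Below is a minimal sketch (not the authors' code) that draws several wide, fully-connected tanh networks under Gaussian and orthogonal initialization and compares their empirical NTK values on a fixed pair of inputs; at large width the two should agree closely, consistent with the infinite-width equality claimed above. The width, depth, square hidden layers, and the 1/sqrt(fan-in) Gaussian scaling are illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch: empirical NTK of a wide tanh MLP under Gaussian vs. orthogonal initialization.
# All hyperparameters here are illustrative assumptions.
import math
import torch
import torch.nn as nn

WIDTH, DEPTH = 512, 4  # "wide" but finite; the paper's equality is an infinite-width statement


def make_mlp(orthogonal: bool) -> nn.Sequential:
    """Fully-connected tanh network with square hidden layers and a scalar output."""
    dims = [WIDTH] * (DEPTH + 1) + [1]
    layers = []
    for i in range(len(dims) - 1):
        lin = nn.Linear(dims[i], dims[i + 1], bias=False)
        if orthogonal:
            nn.init.orthogonal_(lin.weight)  # (semi-)orthogonal weight matrix
        else:
            nn.init.normal_(lin.weight, std=1.0 / math.sqrt(dims[i]))  # i.i.d. Gaussian, 1/sqrt(fan-in)
        layers.append(lin)
        if i < len(dims) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)


def param_grad(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar output f(x) with respect to all weights."""
    out = model(x).sum()
    grads = torch.autograd.grad(out, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])


def empirical_ntk(model: nn.Sequential, x1: torch.Tensor, x2: torch.Tensor) -> float:
    """Theta(x1, x2) = <df(x1)/dtheta, df(x2)/dtheta>, summed over all parameters."""
    return torch.dot(param_grad(model, x1), param_grad(model, x2)).item()


torch.manual_seed(0)
x1, x2 = torch.randn(1, WIDTH), torch.randn(1, WIDTH)

for name, orthogonal in [("gaussian", False), ("orthogonal", True)]:
    vals = torch.tensor([empirical_ntk(make_mlp(orthogonal), x1, x2) for _ in range(10)])
    print(f"{name:10s}  NTK mean={vals.mean().item():.4f}  std over inits={vals.std().item():.4f}")
```

At this width the two initialization schemes should give NTK values with nearly identical means and comparably small fluctuations across random draws; differences between the schemes, per the abstract, only surface as finite-width effects and in the large-learning-rate, large-depth regime during actual training.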