
Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel
Stochastic Gradient Descent (SGD) is widely used to train deep neural ne...

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
Several recent trends in machine learning theory and practice, from the ...

Limitations of Lazy Training of Two-layers Neural Networks
We study the supervised learning problem under either of the following t...

Stochastic Gradient Descent in Hilbert Scales: Smoothness, Preconditioning and Earlier Stopping
Stochastic Gradient Descent (SGD) has become the method of choice for so...

Neural networks as Interacting Particle Systems: Asymptotic convexity of the Loss Landscape and Universal Scaling of the Approximation Error
Neural networks, a central tool in machine learning, have demonstrated r...

Graph Drawing by Stochastic Gradient Descent
A popular method of force-directed graph drawing is multidimensional sca...

The Local Elasticity of Neural Networks
This paper presents a phenomenon in neural networks that we refer to as ...
When Do Neural Networks Outperform Kernel Methods?
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS, and we know of special examples for which SGD-trained NNs provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If feature vectors are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the feature vectors display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present a model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
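The setting described in the abstract can be explored with a small numerical sketch. The code below is an illustration under stated assumptions, not the model or experiments of the paper: it uses RBF kernel ridge regression as a stand-in for "RKHS methods", a hypothetical quadratic target that depends only on the first `d0` coordinates, and a hypothetical parameter `snr` that stretches the feature vectors along those same coordinates (`snr = 1` gives isotropic features). All function names, the bandwidth heuristic, and the parameter values are choices made here for illustration.

```python
import numpy as np


def make_data(n, d, d0, snr, rng):
    """Gaussian features whose first d0 coordinates have variance snr.

    The target depends only on those d0 coordinates (assumed toy target).
    """
    scales = np.ones(d)
    scales[:d0] = np.sqrt(snr)          # snr = 1.0 means isotropic features
    X = rng.standard_normal((n, d)) * scales
    z = X[:, :d0] / np.sqrt(snr)        # unit-variance low-dimensional part
    y = (z ** 2 - 1.0).sum(axis=1)      # same target law for every snr
    return X, y


def rbf_kernel(A, B, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))


def krr_test_error(X_tr, y_tr, X_te, y_te, ridge=1e-3):
    """Fit kernel ridge regression on the training set, return test MSE."""
    # Heuristic bandwidth (an assumption): the typical norm of a feature vector.
    bandwidth = np.sqrt(np.mean((X_tr ** 2).sum(axis=1)))
    K = rbf_kernel(X_tr, X_tr, bandwidth)
    alpha = np.linalg.solve(K + ridge * np.eye(len(y_tr)), y_tr)
    pred = rbf_kernel(X_te, X_tr, bandwidth) @ alpha
    return float(np.mean((pred - y_te) ** 2))


def run(snr, n=400, d=30, d0=2, seed=0):
    """Draw 2n samples, train on the first n, test on the remaining n."""
    rng = np.random.default_rng(seed)
    X, y = make_data(2 * n, d, d0, snr, rng)
    return krr_test_error(X[:n], y[:n], X[n:], y[n:])


# Test error of the kernel method for increasingly anisotropic features.
errors = {snr: run(snr) for snr in (1.0, 10.0, 100.0)}
for snr, err in errors.items():
    print(f"snr={snr:6.1f}  test MSE={err:.4f}")
```

If the claim in the abstract carries over to this toy setup, the kernel method's test error should tend to shrink as `snr` grows, since the features then share the low-dimensional structure of the target. The sketch does not establish this, and the numbers depend on the assumed bandwidth, ridge, and sample size.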