Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

by Eran Malach et al.

We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
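To make the comparison concrete, the tangent kernel of a differentiable model f(x; θ) at initialization θ₀ is K(x, x′) = ⟨∇_θ f(x; θ₀), ∇_θ f(x′; θ₀)⟩, i.e. the Gram matrix of parameter gradients. The following is a minimal illustrative sketch (not code from the paper): it computes this kernel analytically for a tiny one-hidden-layer ReLU network; the network shapes and random initialization are assumptions for the example.

```python
import numpy as np

# Sketch: the tangent kernel of f(x; theta) at theta_0 is
#   K(x, x') = <grad_theta f(x; theta_0), grad_theta f(x'; theta_0)>.
# Here f(x) = a . relu(W x) is a tiny one-hidden-layer ReLU network.

rng = np.random.default_rng(0)
d, m = 3, 16                                 # input dim, hidden width (arbitrary)
W = rng.normal(size=(m, d)) / np.sqrt(d)     # hidden-layer weights
a = rng.normal(size=m) / np.sqrt(m)          # output weights

def model_grad(x):
    """Gradient of f(x) = a . relu(W x) with respect to all parameters (a, W)."""
    pre = W @ x                               # pre-activations, shape (m,)
    act = np.maximum(pre, 0.0)                # ReLU activations
    grad_a = act                              # df/da_j = relu(w_j . x)
    grad_W = (a * (pre > 0))[:, None] * x[None, :]   # df/dW_jk = a_j 1[w_j.x>0] x_k
    return np.concatenate([grad_a, grad_W.ravel()])

def tangent_kernel(x1, x2):
    """Inner product of parameter gradients at the fixed initialization."""
    return model_grad(x1) @ model_grad(x2)

x, y = rng.normal(size=d), rng.normal(size=d)
K = np.array([[tangent_kernel(x, x), tangent_kernel(x, y)],
              [tangent_kernel(y, x), tangent_kernel(y, y)]])
print(K)  # symmetric, positive semi-definite 2x2 Gram matrix
```

A kernel method using K (e.g. kernel regression) keeps the features ∇_θ f fixed at θ₀, whereas gradient descent moves θ and can change those features; this gap is what the paper quantifies.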


Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Recent work has revealed that overparameterized networks trained by grad...

Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

We consider the problem of learning a one-hidden-layer neural network wi...

Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum

Wide neural networks are biased towards learning certain functions, infl...

An Exact Kernel Equivalence for Finite Classification Models

We explore the equivalence between neural networks and kernel methods by...

Diving into the shallows: a computational perspective on large-scale shallow learning

In this paper we first identify a basic limitation in gradient descent-b...

Kernel quadrature by applying a point-wise gradient descent method to discrete energies

We propose a method for generating nodes for kernel quadrature by a poin...
