To understand deep learning we need to understand kernel learning

by Mikhail Belkin, et al.

Generalization performance of classifiers in deep learning has recently become a subject of intense study. Heavily over-parametrized deep models tend to fit training data exactly. Despite this overfitting, they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using real-world and synthetic datasets, we establish that kernel classifiers trained to have zero classification error (overfitting) or even zero regression error (interpolation) perform very well on test data. We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with the data size. Since the available generalization bounds depend polynomially on the norm of the solution, this implies that the existing generalization bounds diverge as data increases. We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the test performance of overfitted Laplacian and Gaussian classifiers is quite similar suggests that generalization is tied to the properties of the kernel function rather than to the optimization process. We see that some key phenomena of deep learning are manifested similarly in kernel methods in the overfitted regime. We argue that progress on understanding deep learning will be difficult until more analytically tractable "shallow" kernel methods are better understood. The combination of the experimental and theoretical results presented in this paper indicates a need for a new theoretical basis for understanding classical kernel methods.


