Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

by Pedro Domingos

Deep learning's successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods. We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel. This improved understanding should lead to better learning algorithms.
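The claim that a gradient-descent model acts like a kernel machine can be checked numerically on a toy problem. The sketch below is an illustration only, not the paper's code. It assumes squared loss, full-batch gradient descent with a small learning rate, and a hypothetical one-hidden-layer tanh network. The names predict, param_grad and train_and_track are made up for this example. The "similarity" here is the dot product between the weight-gradient of the prediction at a query point and at a training point, accumulated over the whole training path. To first order in the learning rate, the final prediction should then equal the initial prediction plus a sum over training examples of residual-weighted similarities.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(W, v, x):
    """Toy model (an assumption for this sketch): y = sum_j v_j * tanh(w_j . x)."""
    return sum(vj * math.tanh(dot(wj, x)) for wj, vj in zip(W, v))


def param_grad(W, v, x):
    """Gradient of the prediction w.r.t. all weights, flattened as [W row by row, then v]."""
    h = [math.tanh(dot(wj, x)) for wj in W]
    grad_W = [vj * (1.0 - hj * hj) * xk for vj, hj in zip(v, h) for xk in x]
    return grad_W + h


def train_and_track(X, y, x_query, steps=2000, lr=1e-4, hidden=4, seed=0):
    """Run gradient descent on squared loss and, in parallel, accumulate the
    kernel-style sum that should reproduce the change in the query prediction."""
    rng = random.Random(seed)
    dim = len(x_query)
    W = [[0.5 * rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden)]
    v = [0.5 * rng.gauss(0, 1) for _ in range(hidden)]
    y_init = predict(W, v, x_query)
    kernel_sum = 0.0
    for _ in range(steps):
        g_query = param_grad(W, v, x_query)
        grads = [param_grad(W, v, xi) for xi in X]
        # Derivative of 0.5 * (prediction - target)^2 w.r.t. the prediction.
        residuals = [predict(W, v, xi) - yi for xi, yi in zip(X, y)]
        # Each training example contributes: -lr * residual * gradient similarity.
        kernel_sum -= lr * sum(r * dot(g, g_query) for r, g in zip(residuals, grads))
        # Ordinary gradient descent step on the summed loss.
        step = [-lr * sum(r * g[k] for r, g in zip(residuals, grads))
                for k in range(len(g_query))]
        W = [[W[j][k] + step[j * dim + k] for k in range(dim)] for j in range(hidden)]
        v = [v[j] + step[hidden * dim + j] for j in range(hidden)]
    return y_init, kernel_sum, predict(W, v, x_query)
```

Calling train_and_track on a handful of points returns the initial prediction, the accumulated kernel sum, and the final prediction. Under these assumptions the first two should add up to roughly the third. The gap comes from the finite step size and should shrink as the learning rate is reduced, which is consistent with the "approximately" in the title.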



Frank-Wolfe optimization for deep networks

Deep neural networks is today one of the most popular choices in classif...

Neural Networks can Learn Representations with Gradient Descent

Significant theoretical work has established that in specific regimes, n...

Sparse, guided feature connections in an Abstract Deep Network

We present a technique for developing a network of re-used features, whe...

A fast point solver for deep nonlinear function approximators

Deep kernel processes (DKPs) generalise Bayesian neural networks, but do...

Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

We study the relative power of learning with gradient descent on differe...

Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization

An open question in the Deep Learning community is why neural networks t...

Deep Context-Aware Kernel Networks

Context plays a crucial role in visual recognition as it provides comple...
