On the Power of Shallow Learning

by James B. Simon, et al.

A deluge of recent work has explored equivalences between wide neural networks and kernel methods. A central theme is that one can analytically find the kernel corresponding to a given wide network architecture, but despite major implications for architecture design, no work to date has asked the converse question: given a kernel, can one find a network that realizes it? We affirmatively answer this question for fully-connected architectures, completely characterizing the space of achievable kernels. Furthermore, we give a surprising constructive proof that any kernel of any wide, deep, fully-connected net can also be achieved with a network with just one hidden layer and a specially-designed pointwise activation function. We experimentally verify our construction and demonstrate that, by just choosing the activation function, we can design a wide shallow network that mimics the generalization performance of any wide, deep, fully-connected network.
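As background for the network-to-kernel direction the abstract describes, here is a minimal sketch (not from the paper; function names and conventions are mine) of the standard infinite-width NNGP kernel recursion for a deep ReLU network, checked against a Monte-Carlo estimate from a single wide random network. It assumes unit-norm inputs, N(0, 1) first-layer weights, and He-scaled (variance 2/fan-in) weights thereafter, so the kernel diagonal stays at 1; the per-layer map is the degree-1 arc-cosine kernel of Cho & Saul (2009).

```python
import numpy as np

def relu_nngp_kernel(cos_theta, n_layers):
    """Analytic infinite-width (NNGP) kernel of a deep ReLU network,
    as a function of the cosine similarity of two unit-norm inputs.

    Applies the degree-1 arc-cosine kernel map once per hidden layer.
    With the scaling assumed here, the diagonal is fixed at 1, so the
    kernel depends only on the input angle.
    """
    k = cos_theta
    for _ in range(n_layers):
        theta = np.arccos(np.clip(k, -1.0, 1.0))
        k = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    return k

def mc_relu_kernel(x, xp, n_layers, width=4096, seed=0):
    """Monte-Carlo estimate of the same kernel from one wide random net."""
    rng = np.random.default_rng(seed)
    # First hidden layer: N(0, 1) weights on unit-norm inputs.
    W = rng.standard_normal((width, x.shape[0]))
    hx, hxp = np.maximum(W @ x, 0.0), np.maximum(W @ xp, 0.0)
    # Remaining hidden layers: He-scaled (variance 2/width) weights.
    for _ in range(n_layers - 1):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        hx, hxp = np.maximum(W @ hx, 0.0), np.maximum(W @ hxp, 0.0)
    # Empirical output kernel: 2 * E[phi(z) phi(z')], averaged over units.
    return 2.0 * (hx @ hxp) / width
```

For two unit vectors at 60 degrees (cosine 0.5) and three hidden layers, the analytic recursion gives roughly 0.74, and at width 4096 the Monte-Carlo estimate should agree to within a few percent. The paper's converse question is then whether one can go the other way: start from a target kernel like this one and design a (shallow) network, via its activation function, that realizes it.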

