Disentangling trainability and generalization in deep learning

by Lechao Xiao, et al.

A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably. To do this, we leverage recent advances that have separately shown: (1) that in the wide network limit, random networks before training are Gaussian Processes governed by a kernel known as the Neural Network Gaussian Process (NNGP) kernel, (2) that at large depths the spectrum of the NNGP kernel simplifies considerably and becomes "weakly data-dependent", and (3) that gradient descent training of wide neural networks is described by a kernel called the Neural Tangent Kernel (NTK) that is related to the NNGP. Here we show that in the large depth limit the spectrum of the NTK simplifies in much the same way as that of the NNGP kernel. By analyzing this spectrum, we arrive at a precise characterization of trainability and a necessary condition for generalization across a range of architectures including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). In particular, we find that there are large regions of hyperparameter space where networks can only memorize the training set, in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set, in contrast with several recent results. By comparing CNNs with and without global average pooling, we show that CNNs without average pooling have very nearly identical learning dynamics to FCNs, while CNNs with pooling contain a correction that alters their generalization performance. We perform a thorough empirical investigation of these theoretical results and find excellent agreement on real datasets.
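To make points (1) to (3) concrete, the following is a minimal sketch of how the NNGP and NTK entries for a pair of inputs evolve with depth. It is not the paper's code. It assumes a bias-free fully connected ReLU network with weight variance 2 and two unit-norm inputs with correlation `c0`. None of these choices is stated in the abstract, and the function name `relu_kernels` is hypothetical. The recursion is the standard arc-cosine kernel for ReLU together with the usual layer-wise NTK update.

```python
import math


def relu_kernels(c0, depth):
    """Track NNGP and NTK entries for two unit-norm inputs through `depth` layers.

    Assumptions (illustrative, not taken from the abstract): ReLU activations,
    weight variance 2, no biases, so the diagonal NNGP entry stays equal to 1.
    Returns a list of (nngp_offdiag, ntk_offdiag, ntk_diag), one tuple per layer,
    starting with the input layer.
    """
    k = c0            # off-diagonal NNGP entry (correlation between the inputs)
    ntk_off = c0      # off-diagonal NTK entry, initialised to the NNGP entry
    ntk_diag = 1.0    # diagonal NTK entry, initialised to the NNGP diagonal
    history = [(k, ntk_off, ntk_diag)]
    for _ in range(depth):
        # Angle between the two inputs in the current layer's feature space.
        angle = math.acos(max(-1.0, min(1.0, k)))
        # Arc-cosine kernel: next-layer NNGP off-diagonal for ReLU.
        k_next = (math.sin(angle) + (math.pi - angle) * math.cos(angle)) / math.pi
        # Expected product of ReLU derivatives, scaled by the weight variance.
        dot_k = (math.pi - angle) / math.pi
        # NTK update: new NNGP term plus the previous NTK propagated one layer.
        ntk_off = k_next + ntk_off * dot_k
        ntk_diag = 1.0 + ntk_diag  # on the diagonal both factors equal 1
        k = k_next
        history.append((k, ntk_off, ntk_diag))
    return history


if __name__ == "__main__":
    for layer, (k, t_off, t_diag) in enumerate(relu_kernels(0.2, 50)):
        if layer % 10 == 0:
            print(f"layer {layer:2d}  nngp_off={k:.4f}  ntk_off/ntk_diag={t_off / t_diag:.4f}")
```

Under these assumptions the off-diagonal NNGP entry drifts toward the diagonal value as depth grows, so the kernel retains less and less information about how the two inputs differ. This is one simple way to picture the "weakly data-dependent" behaviour described above.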


