Compressing invariant manifolds in neural nets

07/22/2020
by Jonas Paccolat, et al.

We study how neural networks compress uninformative input space in models where data lie in d dimensions, but whose labels vary only within a linear manifold of dimension d_∥ < d. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the uninformative d_⊥ = d - d_∥ space is compressed by a factor λ ∼ √p, where p is the size of the training set. We quantify the benefit of such compression on the test error ϵ. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that ϵ ∼ p^(-β), with β_Lazy = d/(3d-2). Compression improves the learning curves, so that β_Feature = (2d-1)/(3d-2) if d_∥ = 1 and β_Feature = (d + d_⊥/2)/(3d-2) if d_∥ > 1. We test these predictions for a stripe model where boundaries are parallel interfaces (d_∥ = 1) as well as for a cylindrical boundary (d_∥ = 2). Next we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the NTK frozen at the end of training outperforms learning with the initial NTK. We confirm these predictions both for a one-hidden-layer FC network trained on the stripe model and for a 16-layer CNN trained on MNIST. The strong similarities found in these two cases support the view that compression is central to the training of MNIST, and put forward kernel-PCA on the evolving NTK as a useful diagnostic of compression in deep nets.
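As an illustration of the setup, the sketch below (not the authors' code) generates stripe-model data in which the label depends only on the first coordinate, trains a one-hidden-layer ReLU network from a small initialization with full-batch gradient descent, and reports the ratio of first-layer weight norms along the informative versus the uninformative directions as a rough proxy for the compression factor λ. The stripe width, network width, loss, learning rate and number of steps are arbitrary choices, and the paper's precise definition of λ may differ.

```python
# Hedged sketch of the stripe-model setup; hyperparameters and the compression
# proxy are illustrative choices, not the paper's exact protocol.
import torch

torch.manual_seed(0)

d, d_par, p = 10, 1, 1024            # ambient dim, informative dim d_par, training set size
x = torch.randn(p, d)                # Gaussian inputs in d dimensions

# Stripe model (d_par = 1): the label depends only on x_1, alternating between
# parallel interfaces placed at integer multiples of an (assumed) stripe width.
stripe_width = 1.0
y = 2.0 * ((torch.floor(x[:, 0] / stripe_width) % 2 == 0).float()) - 1.0

h, alpha = 512, 1e-2                 # hidden width, small init scale (feature learning)
W = torch.nn.Parameter(alpha * torch.randn(h, d))
a = torch.nn.Parameter(alpha * torch.randn(h))

def f(inputs):
    return torch.relu(inputs @ W.T) @ a

opt = torch.optim.SGD([W, a], lr=0.1)
for step in range(5000):             # full-batch gradient descent
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(f(x), y)
    loss.backward()
    opt.step()

# Rough compression proxy: per-direction first-layer weight norm along the
# informative coordinate(s) relative to the d - d_par uninformative ones.
w_par = W[:, :d_par].norm() / d_par ** 0.5
w_perp = W[:, d_par:].norm() / (d - d_par) ** 0.5
print("compression proxy:", (w_par / w_perp).item())
```

The kernel-PCA diagnostic mentioned in the abstract can be sketched along the same lines: compute the empirical NTK Gram matrix of the trained network and measure how much of the label vector is captured by its top eigenvectors. The number of samples and eigenvectors below are arbitrary.

```python
# Continuation of the previous sketch (reuses f, W, a, x, y): a rough
# kernel-PCA diagnostic on the empirical NTK of the trained network.
def ntk_gram(inputs):
    rows = []
    for xi in inputs:
        out = f(xi.unsqueeze(0)).squeeze()        # scalar network output
        grads = torch.autograd.grad(out, [W, a])  # gradient w.r.t. all parameters
        rows.append(torch.cat([g.flatten() for g in grads]))
    J = torch.stack(rows)                         # (n, #params) Jacobian
    return J @ J.T                                # empirical NTK Gram matrix

n = 256                                           # subsample for tractability
K = ntk_gram(x[:n])
evals, evecs = torch.linalg.eigh(K)               # eigenvalues in ascending order
top = evecs[:, -10:]                              # 10 leading eigenvectors
frac = (top.T @ y[:n]).norm() / y[:n].norm()
print("fraction of label norm in top-10 NTK modes:", frac.item())
```

Repeating the same measurement with the weights at initialization should, per the abstract's claim, give a smaller projection of the labels on the top NTK eigenvectors than after training in the feature learning regime.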

Related research

06/19/2019 · Disentangling feature and lazy learning in deep neural networks: an empirical study
Two distinct limits for deep learning as the net width h→∞ have been pro...

01/31/2023 · Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
Understanding when the noise in stochastic gradient descent (SGD) affect...

04/22/2022 · On Feature Learning in Neural Networks with Global Convergence Guarantees
We study the optimization of wide neural networks (NNs) via gradient flo...

06/17/2020 · How isotropic kernels learn simple invariants
We investigate how the training curve of isotropic kernel methods depend...

05/06/2021 · Relative stability toward diffeomorphisms in deep nets indicates performance
Understanding why deep nets can classify data in large dimensions remain...

05/21/2021 · Properties of the After Kernel
The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel de...

07/09/2021 · Model compression as constrained optimization, with application to neural nets. Part V: combining compressions
Model compression is generally performed by using quantization, low-rank...
