Compressing invariant manifolds in neural nets
We study how neural networks compress uninformative input space in models where data lie in d dimensions, but whose labels vary only within a linear manifold of dimension d_∥ < d. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the uninformative d_⊥ = d - d_∥ space is compressed by a factor λ ∼ √(p), where p is the size of the training set. We quantify the benefit of such compression on the test error ϵ. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that ϵ ∼ p^(-β), with β_Lazy = d / (3d-2). Compression improves the learning curves, so that β_Feature = (2d-1)/(3d-2) if d_∥ = 1 and β_Feature = (d + d_⊥/2)/(3d-2) if d_∥ > 1. We test these predictions for a stripe model where boundaries are parallel interfaces (d_∥ = 1) as well as for a cylindrical boundary (d_∥ = 2). Next, we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms kernel learning with the initial NTK. We confirm these predictions both for a one-hidden-layer fully connected (FC) network trained on the stripe model and for a 16-layer CNN trained on MNIST. The great similarities found in these two cases support the view that compression is central to training on MNIST, and put forward kernel-PCA on the evolving NTK as a useful diagnostic of compression in deep nets.
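As a worked illustration of the exponents quoted above, the following minimal Python sketch evaluates β_Lazy and β_Feature for given d and d_∥. The function names `beta_lazy` and `beta_feature` are hypothetical, chosen only for this example, and the code simply transcribes the stated formulas; it does not reproduce any derivation or experiment from the paper.

```python
from fractions import Fraction


def beta_lazy(d):
    """Lazy-regime exponent stated in the abstract: d / (3d - 2)."""
    return Fraction(d, 3 * d - 2)


def beta_feature(d, d_par):
    """Feature-regime exponent stated in the abstract.

    d_par is the dimension of the informative linear manifold (d_par < d);
    d_perp = d - d_par is the uninformative dimension.
    """
    if not 1 <= d_par < d:
        raise ValueError("expected 1 <= d_par < d")
    d_perp = d - d_par
    if d_par == 1:
        return Fraction(2 * d - 1, 3 * d - 2)
    return (Fraction(d) + Fraction(d_perp, 2)) / (3 * d - 2)


if __name__ == "__main__":
    # Exact rational values for a few illustrative dimensions.
    for d, d_par in [(5, 1), (5, 2), (10, 2)]:
        print(d, d_par, beta_lazy(d), beta_feature(d, d_par))
```

For example, with d = 5 the lazy exponent is 5/13, while the feature exponent is 9/13 for d_∥ = 1 and 1/2 for d_∥ = 2, so in both cases the predicted test error decays faster with p in the feature regime.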