
Scaling description of generalization with number of parameters in deep learning
We provide a description for the evolution of the generalization performance of fixed-depth fully-connected deep neural networks as a function of their number of parameters N. In the setup where the number of data points is larger than the input dimension, as N gets large, we observe that increasing N at fixed depth reduces the fluctuations of the output function f_N induced by initial conditions, with ‖f_N − f̅_N‖ ∼ N^{-1/4}, where f̅_N denotes an average over initial conditions. We explain this asymptotic behavior in terms of the fluctuations of the so-called Neural Tangent Kernel that controls the dynamics of the output function. For the task of classification, we predict these fluctuations to increase the true test error ϵ as ϵ_N − ϵ_∞ ∼ N^{-1/2} + O(N^{-3/4}). This prediction is consistent with our empirical results on the MNIST dataset, and it explains in a concrete case the puzzling observation that the predictive power of deep networks improves as the number of fitting parameters grows. This asymptotic description breaks down at a so-called jamming transition, which takes place at a critical N = N^*, below which the training error is non-zero. In the absence of regularization, we observe an apparent divergence ‖f_N‖ ∼ (N − N^*)^{-α} and provide a simple argument suggesting α = 1, consistent with empirical observations. This result leads to a plausible explanation for the cusp in test error known to occur at N^*. Overall, our analysis suggests that once models are averaged, the optimal model complexity is reached just beyond the point where the data can be perfectly fitted, a result of practical importance that needs to be tested in a wide range of architectures and data sets.
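The scaling of the fluctuations ‖f_N − f̅_N‖ across initial conditions can be illustrated with a toy experiment. The sketch below is not the paper's setup: it substitutes a random-feature regression model (a common stand-in for lazy/NTK-regime training) for a trained deep network, and all data, widths, and the ridge parameter are hypothetical choices. It simply checks that the spread of the learned function around its ensemble average shrinks as the width N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression task (stand-in for the paper's setup).
P = 40                                     # number of training points
x_tr = np.linspace(-1.0, 1.0, P)[:, None]
y_tr = np.sin(3.0 * x_tr[:, 0]) + 0.1 * rng.normal(size=P)
x_te = np.linspace(-1.0, 1.0, 200)[:, None]

def random_feature_predictor(N, seed, ridge=1e-6):
    """Fit a width-N random-feature model (lazy-training stand-in);
    return its predictions on the test grid."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(1, N))              # frozen random first layer
    b = r.uniform(-1.0, 1.0, size=N)
    phi = lambda X: np.maximum(X @ W + b, 0.0) / np.sqrt(N)  # ReLU features
    Phi = phi(x_tr)
    # Ridge-regularized least squares for the output weights.
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(N), Phi.T @ y_tr)
    return phi(x_te) @ w

ensemble = 20
for N in (80, 320, 1280):
    preds = np.stack([random_feature_predictor(N, s) for s in range(ensemble)])
    f_bar = preds.mean(axis=0)             # ensemble average over seeds
    fluct = np.sqrt(((preds - f_bar) ** 2).mean())  # spread around f_bar
    print(f"N={N:5d}  fluctuation={fluct:.4f}")
```

Under these assumptions the printed fluctuation decreases with N, mirroring (though not reproducing) the ‖f_N − f̅_N‖ ∼ N^{-1/4} behavior described in the abstract; the ensemble average f_bar plays the role of f̅_N.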