Theory of Deep Learning III: explaining the non-overfitting puzzle

12/30/2017
by Tomaso Poggio, et al.

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave, near stable zero minima of the empirical error, as gradient systems in a quadratic potential with a degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition extends to deep networks key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum-norm solution. This implies that there is usually an optimal early-stopping point that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum-norm solution implies convergence to the maximum-margin solution, which guarantees good classification error for "low-noise" datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
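A minimal numerical sketch of the minimum-norm claim (a toy example using NumPy, not code from the paper): plain gradient descent on an underdetermined linear least-squares problem, started at zero, drives the training error to zero and converges to the pseudoinverse (minimum-norm) solution, with the number of iterations playing the role of the regularization parameter, consistent with the early-stopping remark above.

```python
# Toy illustration (assumed setup, not from the paper): gradient descent
# on an overparametrized linear least-squares problem, initialized at zero,
# converges to the minimum-norm interpolating solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # fewer samples than parameters
X = rng.standard_normal((n, d))       # data matrix
y = rng.standard_normal(n)            # targets (even random labels are fit)

L = np.linalg.norm(X, 2) ** 2         # Lipschitz constant of the gradient
w = np.zeros(d)                       # zero initialization matters here
for _ in range(50_000):
    w -= (1.0 / L) * X.T @ (X @ w - y)   # gradient step on 0.5*||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y    # minimum-norm interpolator
print("training error:", np.linalg.norm(X @ w - y))             # ~0
print("distance to min-norm:", np.linalg.norm(w - w_min_norm))  # ~0
```

Because the iterates never leave the row space of X when started at zero, the only interpolating solution they can reach is the minimum-norm one; stopping earlier simply yields a smaller-norm, more heavily regularized iterate.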


Related research

- Theory III: Dynamics and Generalization in Deep Networks (03/12/2019)
- Theory IIIb: Generalization in Deep Networks (06/29/2018)
- The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima (10/04/2022)
- Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization (08/25/2019)
- Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression (02/01/2023)
- Deep learning: a statistical viewpoint (03/16/2021)
- Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology (02/18/2020)
