Shaping the learning landscape in neural networks around wide flat minima

by Carlo Baldassi et al.

Learning in deep neural networks (DNNs) takes place by minimizing a non-convex, high-dimensional loss function, typically via a stochastic gradient descent (SGD) strategy. The learning process is observed to find good minimizers without getting stuck in local critical points, and such minimizers are often satisfactory at avoiding overfitting. How these two features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic non-convex neural network models which learn random patterns, and we derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM), which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective, we derive entropy-driven greedy and message-passing algorithms which focus their search on wide flat regions of minimizers. In the case of SGD with cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate these results with a numerical study of the correlations between the volumes of the minimizers, their Hessians, and their generalization performance on real data.
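The abstract's last algorithmic observation, that slowly reducing the weight norm during cross-entropy SGD steers the dynamics toward wide flat minima, can be illustrated on a toy storage problem. The sketch below is an assumption-laden simplification, not the paper's protocol: the network is a single linear threshold unit, the norm schedule (`target` shrinking toward `sqrt(N)`) and the margin scale `gamma` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy storage problem: one threshold unit must classify P random
# binary patterns with random labels (an illustrative stand-in for
# the random-pattern models studied in the paper).
N, P = 101, 60
X = rng.choice([-1.0, 1.0], size=(P, N))
y = rng.choice([-1.0, 1.0], size=P)
w = rng.standard_normal(N)

def xent_grad(w, gamma=2.0):
    """Gradient of the mean cross-entropy loss log(1 + exp(-margin)),
    with margin = gamma * y * (x . w) / sqrt(N)."""
    margins = gamma * y * (X @ w) / np.sqrt(N)
    s = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-margin)
    return -(gamma / np.sqrt(N)) * (s * y) @ X / P

def train_errors(w):
    return int(np.sum(np.sign(X @ w) != y))

initial_errors = train_errors(w)
eta = 1.0
for t in range(1000):
    w -= eta * xent_grad(w)
    # Slowly shrink the allowed weight norm toward sqrt(N); this
    # hypothetical schedule mimics the gradual norm reduction that
    # the paper reports biases SGD toward wide flat minima.
    target = np.sqrt(N) * (1.0 + 2.0 * np.exp(-t / 300.0))
    norm = np.linalg.norm(w)
    if norm > target:
        w *= target / norm

final_errors = train_errors(w)
print(initial_errors, final_errors)
```

Because the cross-entropy loss keeps pushing margins up while the shrinking norm cap holds the weight scale down, the effective margin per unit norm grows, which is the mechanism the paper associates with reaching wide flat regions of the error landscape.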


Related research

Entropic gradient descent algorithms and wide flat minima

Anomalous diffusion dynamics of learning in deep neural networks

Wide flat minima and optimal generalization in classifying high-dimensional Gaussian mixtures

Asymmetric Valleys: Beyond Sharp and Flat Local Minima

Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses

Effect of the initial configuration of weights on the training and function of artificial neural networks

The Loss Surfaces of Multilayer Networks
