1 Introduction
Large neural networks are surprisingly easy to optimize [46], despite the substantial nonconvexity of the loss as a function of the parameters [19]. In particular, it is usually found that changing the random initialization has no effect on performance, even though it can change the model learned by gradient-based optimization methods [17]. Understanding the cause of trainability from random initial conditions is critical for the development of new architectures and optimization methods, which must otherwise simply hope to retain this favorable property on the basis of heuristics.
One possible explanation for this phenomenon is based on the stationary points of gradient-based optimization methods. These methods are stationary when the gradient of the loss function is 0, i.e., at the critical points of the loss. Critical points are classified by their Morse index, or the degree of local negative curvature (i.e., the relative number of dimensions in parameter space along which the curvature is negative). Since, among all critical points, gradient descent methods only converge to those with index 0 [29], which includes local minima, it has been argued that large neural networks are easy to train because, for many problems, their loss functions only have local minima at values of the loss close to or at the global optimum. This is known as the “no-bad-local-minima” property. Previous work [11, 39] reported numerical evidence for a convex relationship between index and loss that supports the hypothesis that neural network loss functions have the no-bad-local-minima property: for low values of the loss, only low values of the index were observed, whereas for high values of the loss, only high values of the index were observed. However, more recent theoretical work has indicated that there are in fact bad local minima on neural network losses in almost all cases [12]. The validity of the numerical results depends on the validity of the critical point-finding algorithms, and the second-order critical point-finding algorithms used in [11] and [39] are not in fact guaranteed to find critical points when the Hessian is singular. In that case, the second-order information used by these methods becomes unreliable.
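As a concrete illustration of the Morse index (our own sketch, not code from the cited works), the index of a critical point can be computed from the eigenvalues of the Hessian there as the fraction that are negative:

```python
import numpy as np

def morse_index(hessian, tol=1e-8):
    """Fraction of Hessian eigenvalues that are negative.

    Index 0 corresponds to a local minimum, index 1 to a local maximum.
    `tol` guards against counting tiny numerical eigenvalues as negative.
    """
    eigvals = np.linalg.eigvalsh(hessian)  # symmetric matrix: real spectrum
    return np.mean(eigvals < -tol)

# A saddle of f(x, y) = x^2 - y^2 at the origin has Hessian diag(2, -2):
# one of two curvature directions is negative, so the index is 0.5.
print(morse_index(np.diag([2.0, -2.0])))  # -> 0.5
```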
Neural network loss Hessians are typically highly singular [44], and poor behavior of Newton-type critical point-finding methods has been reported in the neural network case [10], casting doubt on the completeness and accuracy of the results in [11] and [39]. [16] verified that second-order methods can in fact find high-quality approximate critical points for linear neural networks, for which the analytical form of the critical points is known [2], providing ground truth. In particular, the two-phase convergence pattern predicted by the classical analysis of Newton methods [36] is evident: a linear phase followed by a short, local supralinear phase (Fig. 1A). The supralinear convergence is visible in the “cliffs” in the blue traces in Fig. 1A, where the convergence rate suddenly improves. With a sufficiently strict cutoff on the gradient norms, the correct loss-index relationship obtained analytically (Fig. 1B, gray points) is shared by the points obtained numerically (Fig. 1B, light blue points). With an insufficiently strict cutoff, the loss-index relationship implied by the observed points is far from the truth (Fig. 1B, dark red points).
Unfortunately, good performance on linear networks does not guarantee good performance on nonlinear networks. When applied to a nonlinear network, even with the same data, the behavior of these Newton methods changes dramatically for the worse (Fig. 1C). No runs exhibit supralinear convergence, and the gradient norms at termination are many orders of magnitude larger. These are not the signatures of a method converging to a critical point, even though the gradient norms are sometimes still under the thresholds reported in [39] and [16] (no threshold is reported in [11]). This makes it difficult to determine whether the putative loss-index relationship measured from these points (Fig. 1D) accurately reflects the loss-index relationship at the true critical points of the loss function.
In this paper, we identify a major cause of this failure for second-order critical point-finding methods: gradient-flat regions, where the gradient is approximately in the kernel of the Hessian. In these regions, the loss function is locally approximately linear along the direction of the gradient, whether or not the gradient is itself small, as it would be near a true critical point. We first define gradient-flatness (Sec. 2.1) and explain, with a low-dimensional example (Sec. 2.2), why it is problematic for second-order methods: gradient-flat points can be “bad local minima” for the problem of finding critical points. We then provide evidence that gradient-flat regions are encountered when applying the Newton-MR algorithm to a deep neural network loss (Sec. 3, Appendix A.5). Furthermore, we show that, though gradient-flat regions need not contain actual critical points, the loss-index relationship looks strikingly similar to that reported in [11] and [39], suggesting that these previous studies may have found gradient-flat regions, not critical points. Finally, we note the implications of gradient-flatness for the design of second-order methods for use in optimizing neural networks: in the presence of gradient-flatness, approximate second-order methods, like K-FAC [32] and Adam [26], may be preferable to exact second-order methods even without taking computational cost into account.
2 Gradient-Flat Points are Stationary Points for Second-Order Methods
In this section, we introduce and define gradient-flat points and explain why they are problematic for second-order critical point-finding methods, with the help of a low-dimensional example to build intuition. In numerical settings and in high dimensions, approximately gradient-flat points are also important, and so we define a quantitative index of gradient-flatness based on the residual norm of the Newton update. Connected sets of these numerically gradient-flat points are gradient-flat regions, which cause trouble for second-order critical point-finding methods.
2.1 At Gradient-Flat Points, the Gradient Lies in the Hessian's Kernel
Critical points are of interest because they are points where the first-order approximation of a function $f$ at a point $\theta + \delta$, based on the local information at $\theta$ (note that, for a neural network loss function, the variable we take the gradient with respect to, here $\theta$, is the vector of parameters, not the data, which is often denoted with an $x$),

$$f(\theta + \delta) \approx f(\theta) + \nabla f(\theta)^\top \delta \qquad (1)$$
is constant, indicating that they are the stationary points of first-order optimization algorithms like gradient descent and its accelerated variants. By a stationary point, we mean a point at which the proposed updates of an optimization algorithm are zero. Stationary points need not be points to which the algorithm converges. For example, gradient descent almost surely converges only to critical points with index 0 [29, 23], even though its stationary points include saddle points and maxima. However, the curvature properties of non-minimal stationary points show up in proofs of the convergence rates of first-order optimizers, e.g. [23, 25, 24], and so knowledge of their properties can guide algorithm design.
In searching for critical points, it is common to use a linear approximation to the behavior of the gradient at a point $\theta + \delta$, given the local information at a point $\theta$:

$$\nabla f(\theta + \delta) \approx \nabla f(\theta) + H(\theta)\,\delta \qquad (2)$$
Because these methods rely on a quadratic approximation of the original function $f$, represented by the Hessian matrix $H$ of second partial derivatives, we call them second-order critical point-finding methods.
The approximation on the right-hand side is constant whenever $\delta$ is an element of $\ker(H)$, where $\ker$ is notation for the kernel of a matrix, i.e., the subspace that it maps to 0. When $H$ is nonsingular, this is only satisfied when $\delta$ is 0, so if we can define an update rule such that $\delta = 0$ iff $\nabla f = 0$, then, for nonsingular Hessians, we can be sure that our method is stationary only at critical points.
In a Newton-type method, we achieve this by selecting our step $\delta$ by solving for the zeroes of this linear approximation, i.e., the Newton system

$$H(\theta)\,\delta = -\nabla f(\theta),$$

which has solution

$$\delta = -H(\theta)^{+}\,\nabla f(\theta),$$

where the matrix $H^{+}$ is the Moore-Penrose pseudoinverse of the matrix $H$, obtained by performing the singular value decomposition, inverting the nonzero singular values, and recomposing the SVD matrices in reverse order. The Newton update $\delta$ is zero iff $\nabla f$ is 0 for a nonsingular Hessian, for which the pseudoinverse is simply the inverse. For a singular Hessian, the update is zero iff $\nabla f$ is in the kernel of the pseudoinverse. Note that if the Hessian is constant as a function of $\theta$, the linear model of the gradient is exact and this algorithm converges in a single step.

Within the vicinity of a critical point, this algorithm converges extremely quickly [36], but the guarantee of convergence is strictly local. Practical Newton methods in both convex optimization [5] and nonlinear equation solving [36, 22] often compare multiple possible choices of $\delta$ and select the best one according to a “merit function” applied to the gradients, which has a global minimum at each critical point. Such algorithms have broader guarantees of global convergence. A common choice of merit function is the squared norm,

$$m(\theta) = \tfrac{1}{2}\,\|\nabla f(\theta)\|^{2}.$$
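The stationarity condition above is easy to see numerically. A minimal numpy sketch (an illustration, not the implementation used in the experiments) computes the pseudoinverse Newton step and shows that it vanishes for a nonzero gradient lying in the Hessian's kernel:

```python
import numpy as np

def newton_step(hessian, grad):
    """Newton update delta = -H^+ g, via the Moore-Penrose pseudoinverse."""
    return -np.linalg.pinv(hessian) @ grad

# Nonsingular case: the step is zero iff the gradient is zero.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
g = np.array([4.0, 1.0])
print(newton_step(H, g))  # a full Newton step, [-2., -1.]

# Singular case: a nonzero gradient lying entirely in ker(H) yields a zero
# step, so the method is stationary at this non-critical point.
H_sing = np.array([[2.0, 0.0], [0.0, 0.0]])   # kernel spanned by e_y
g_flat = np.array([0.0, 3.0])                  # entirely in the kernel
print(newton_step(H_sing, g_flat))             # zero step despite ||g|| > 0
```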
In gradient norm minimization [33], we optimize this merit function directly. The gradient of the merit function is

$$\nabla m(\theta) = H(\theta)\,\nabla f(\theta).$$
As with Newton methods, in the invertible case the updates are zero iff $\nabla f$ is 0. In the singular case, the updates are zero iff the gradient is in the Hessian's kernel. Because this method is framed as the minimization of a scalar function, it is compatible with first-order optimization methods, which are more commonly implemented and better supported in neural network libraries.
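The same stationarity can be checked for gradient norm minimization; a small numpy sketch with a stand-in Hessian and gradient (our illustration, not the paper's code):

```python
import numpy as np

def merit_grad(hessian, grad):
    """Gradient of m(theta) = 0.5 * ||grad f(theta)||^2, which equals H g
    for a symmetric Hessian H."""
    return hessian @ grad

H = np.array([[1.0, 0.0], [0.0, 0.0]])  # singular: kernel spanned by e_y

print(merit_grad(H, np.array([2.0, 0.0])))  # gradient outside the kernel:
                                            # nonzero update, method moves
print(merit_grad(H, np.array([0.0, 5.0])))  # nonzero gradient inside the
                                            # kernel: zero update, so the
                                            # method is stationary at a
                                            # point that is not critical
```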
However, neural network Hessians are generally singular, especially in the overparameterized case [44, 18], meaning the kernel is nontrivial, and so neither class of methods can guarantee convergence to critical points. Newton's method can diverge, oscillate, or behave chaotically [20]. The addition of merit-function-based upgrades can remove these behaviors, but it cannot guarantee convergence to critical points [41, 20]. The gradient norm minimization method described in [39] was previously proposed, and this flaw pointed out, twice in the field of chemical physics: once in the 1970s (proposed in [33], critiqued in [8]) and again in the 2000s (proposed simultaneously in [1, 6], critiqued in [13]).
What are the stationary points, besides critical points, of these two method classes in the case of singular Hessians? It would seem at first that they are different: for gradient norm minimization, points where the gradient is in the Hessian's kernel; for Newton-type methods, points where the gradient is in the kernel of the Hessian's pseudoinverse. In fact, however, these conditions are identical, due to the Hessian's symmetry (the kernel of the pseudoinverse is equal to the kernel of the transpose, as can be seen from the singular value decomposition, and the Hessian is equal to its transpose because it is symmetric; see [45]), and so both algorithms share a broad class of stationary points.
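This equivalence can be verified numerically; a small sketch with a hypothetical singular symmetric matrix:

```python
import numpy as np

# A symmetric singular matrix: eigenvalue 3 along u, eigenvalue 0 along v.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
v = np.array([1.0, -1.0]) / np.sqrt(2.0)
H = 3.0 * np.outer(u, u)

H_pinv = np.linalg.pinv(H)

# v lies in the kernel of both H and its pseudoinverse ...
assert np.allclose(H @ v, 0.0) and np.allclose(H_pinv @ v, 0.0)
# ... so the stationarity conditions of the two method classes coincide.
print("ker(H) and ker(H^+) agree on v")
```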
These stationary points have been identified previously, but the nomenclature is not standard: Doye and Wales, studying gradient norm minimization, call them non-stationary points [13], since they are non-stationary with respect to the function $f$, while Byrd et al., studying Newton methods, call them stationary points [7], since they are stationary with respect to the merit function $m$. To avoid confusion between these incommensurate conventions, or with the stationary points of the function $f$, we call a point where the gradient lies in the kernel of the Hessian a gradient-flat point. This name was chosen because a function is flat when its Hessian is 0, meaning every direction is in the kernel, and so it is locally flat around a point in a given direction whenever that direction is in the kernel of the Hessian at that point. Note that, because every matrix maps 0 to 0, every critical point is also a gradient-flat point, but the reverse is not true. When we wish to refer explicitly to gradient-flat points which are not critical points, we will call them strict gradient-flat points. At a strict gradient-flat point, the function is, along the direction of the gradient, locally linear up to second order.
There is an alternative view of gradient-flat points based on the squared gradient norm merit function. All gradient-flat points are stationary points of the gradient norm, which may in principle be local minima, maxima, or saddles, while the global minima of the gradient norm are critical points. When they are local minima of the gradient norm, they can be targets of convergence for methods that use first-order approximations of the gradient map, as in gradient norm minimization and in Newton-type methods. Strict gradient-flat points, then, can be “bad local minima” of the gradient norm, and can therefore prevent the convergence of second-order root-finding methods to critical points, just as bad local minima of the loss function can prevent the convergence of first-order optimization methods to global optima.
Note that Newton methods cannot be demonstrated to converge only to gradient-flat points [41]. Furthermore, Newton convergence can be substantially slowed when even a small fraction of the gradient is in the kernel [20]. Below we will see that, while a Newton method applied to a neural network loss sometimes converges to and almost always encounters strict gradient-flat points, the final iterate is not always either a strict gradient-flat point or a critical point.
2.2 Convergence to Gradient-Flat Points Occurs in a Low-Dimensional Quartic Example
The difficulties that gradient-flat points pose for Newton methods can be demonstrated with a polynomial example in two dimensions, plotted in Fig. 2A. Below, we characterize the strict gradient-flat (orange) and critical (blue) points of this function (Fig. 2A). Then, we observe the behavior of a practical Newton method applied to it (Fig. 2B, C) and note similarities to the results in Fig. 1. We will use this simple, low-dimensional example to demonstrate principles useful for understanding the results of applying second-order critical point-finding methods to more complex, higher-dimensional neural network losses.
As our model function, we choose
(3) 
It is plotted in Fig. 2A, middle panel. This quartic function has two affine subspaces of points with nontrivial Hessian kernel. The kernel is orthogonal to each of these affine subspaces at every point. Along one coordinate, the function is convex, with a one-dimensional set of minimizers. The strict gradient-flat points occur at the intersections of these two sets: one strict gradient-flat point that is a local minimum of the gradient norm, and one that is a saddle of the same (Fig. 2A, orange points, all panels). In the vicinity of these points, the gradient is, to first order, constant along the kernel direction, and so the function is locally linear, or flat. These points are gradient-flat, but neither is a critical point of the function. The only critical point is located at the minimum of the polynomial (Fig. 2A, blue point, all panels), which is also a global minimum of the gradient norm. One of these affine subspaces divides the space into two basins of attraction, loosely defined, for second-order methods: one for the critical point and the other for the strict gradient-flat point. Note that the vector field in the middle panel shows update directions for the pure Newton method, which can behave extremely poorly in the vicinity of singularities [41, 20], often oscillating and converging very slowly, or diverging.
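The stalling behavior at strict gradient-flat points can be reproduced on an even simpler function than Eqn. 3 (our own toy example, not the paper's): for $f(x, y) = x + y^2$, the Hessian kernel points along $x$ everywhere, and on the line $y = 0$ the gradient $(1, 0)$ lies entirely inside it.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([1.0, 2.0 * y])            # gradient of f(x, y) = x + y^2

def hess(p):
    return np.array([[0.0, 0.0], [0.0, 2.0]])  # singular: ker = span(e_x)

p = np.array([0.3, 1.7])
for _ in range(5):
    p = p - np.linalg.pinv(hess(p)) @ grad(p)  # pure pseudoinverse Newton step

# The first step lands on y = 0; after that the gradient (1, 0) is entirely
# in ker(H), the update is zero, and the iteration stalls at a strict
# gradient-flat point with gradient norm stuck at 1.
print(p, np.linalg.norm(grad(p)))
```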
Practical Newton methods use techniques like damping and line search to improve behavior [22]. To determine how a practical Newton method behaves on this function, we focus on the case of Newton-MR [43], which uses the MINRES-QLP [9] solver to compute the Newton update (MINRES-QLP is a Krylov subspace method akin to conjugate gradient, but specialized to the symmetric, indefinite, and ill-conditioned case, which makes it well-suited to this problem and to neural network losses) and backtracking line search with the squared gradient norm merit function to select the step size. Pseudocode for this algorithm is provided in Appendix A.3. This method was found in [16] to perform better than a damped Newton method and gradient norm minimization at finding the critical points of a linear autoencoder. Results are qualitatively similar for damped Newton methods with a squared gradient norm merit function.
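A heavily hedged schematic of one such iteration: numpy's minimum-norm least-squares solve stands in for the MINRES-QLP inner solver, and a plain decrease condition on the merit function stands in for the algorithm's actual acceptance rule (see Appendix A.3 for the real pseudocode):

```python
import numpy as np

def practical_newton_step(hess_fn, grad_fn, p, beta=0.5, max_backtracks=30):
    """One damped Newton iteration: minimum-norm least-squares solution of
    H s = -g (a dense stand-in for an iterative MINRES-QLP solve), then
    backtracking line search on the merit m = 0.5 * ||grad f||^2."""
    g = grad_fn(p)
    s, *_ = np.linalg.lstsq(hess_fn(p), -g, rcond=None)
    merit = 0.5 * g @ g
    alpha = 1.0
    for _ in range(max_backtracks):
        g_new = grad_fn(p + alpha * s)
        if 0.5 * g_new @ g_new < merit:  # accept on simple merit decrease
            break
        alpha *= beta
    return p + alpha * s

# On a well-behaved quadratic, one step solves the problem exactly:
p1 = practical_newton_step(lambda p: np.eye(2), lambda p: p,
                           np.array([3.0, 4.0]))
print(p1)  # -> [0. 0.]
```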
The results of applying Newton-MR to Eqn. 3 are shown in Fig. 2B. The gradient-flat point is attracting for some trajectories (orange), while the critical point is attracting for others (blue). For trajectories that approach the strict gradient-flat point, the gradient norm does not converge to 0, but instead converges to a nonzero value near 10 (orange trajectories; Fig. 2B, bottom panel). This value is typically several orders of magnitude lower than at the initial point, and so would appear to be close to 0 on a linear scale that includes the gradient norm of the initial point. Since log-scaling of loss functions is uncommon in machine learning, as losses do not always have minima at 0, second-order methods approaching gradient-flat points can appear to converge to critical points if typical methods for visually assessing convergence are used.
There are two interesting and atypical behaviors worth noting. First, the trajectories tend to oscillate in the vicinity of the gradient-flat point and converge more slowly (Fig. 2B, middle panel, orange lines). Updates from points close to the affine subspace where the Hessian has a kernel, and which therefore have an approximate kernel themselves, sometimes jump to points where the Hessian does not have an approximate kernel. This suggests that, when converging towards a gradient-flat point, the degree of flatness will change from iteration to iteration. Second, some trajectories begin in the nominal basin of attraction of the gradient-flat point but converge to the critical point (Fig. 2B, middle panel, blue points). This is because the combination of backtracking line search and large proposed step sizes means that, occasionally, very large steps can be taken, based on non-local features of the function. Indeed, backtracking line search is a limited form of global optimization, and the ability of line searches to change convergence behaviors predicted from local properties on nonconvex problems is known [36]. Since the backtracking line search is based on the gradient norm, the basin of attraction of the true critical point, which has a lower gradient norm than the gradient-flat point, is much enlarged relative to that of the gradient-flat point. This suggests that Newton methods using the gradient norm merit function will be biased towards finding gradient-flat points that also have low gradient norm.
2.3 Approximate Gradient-Flat Points and Gradient-Flat Regions
Analytical arguments focus on exactly gradient-flat points, where the Hessian has an exact kernel and the gradient lies entirely within it. In numerical settings, it is almost certain that no matrix will have an exact kernel, due to rounding error. For the same reason, the computed gradient vector will generically not lie entirely within the exact or approximate kernel. However, numerical implementations of second-order methods will struggle even when there is no exact kernel or when the gradient is only partly in it, and so a numerical index of flatness is required. This is analogous to the requirement to specify a tolerance for the norm of the gradient when deciding whether or not to consider a point an approximate critical point.
We quantify the degree of gradient-flatness of a point by means of the relative residual norm and the relative cokernel residual norm of the Newton update direction $\delta$. The vector $\delta$ is an inexact solution to the Newton system $H\delta = -\nabla f$, where $H$ and $\nabla f$ are the current iterate's Hessian and gradient. The residual is equal to $H\delta + \nabla f$, and the smaller its norm, the better $\delta$ is as a solution. The cokernel residual is equal to the Hessian times the residual, and so ignores any component of the residual in the kernel of the Hessian. Its norm quantifies the quality of an inexact Newton solution in the case where the gradient lies partly in the Hessian kernel, the unsatisfiable case in which $H\delta \neq -\nabla f$ for every $\delta$. When the residual is large but the cokernel residual is small (norms near 1 and 0, respectively, following suitable normalization), we are at a point where the gradient is almost entirely in the kernel of the Hessian: an approximate gradient-flat point. In the results below, we consider a point approximately gradient-flat when the relative cokernel residual norm is below 5e-4 while the relative residual norm is above its cutoff. See Appendix A.4 for definitions and details. We emphasize that numerical issues for second-order methods can arise even when the degree of gradient-flatness is small.
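A sketch of how these two quantities might be computed (the normalizations here are one plausible choice; the exact definitions live in Appendix A.4):

```python
import numpy as np

def flatness_residuals(H, g, delta):
    """Relative residual and cokernel-residual norms of an inexact Newton
    solution `delta` of H delta = -g."""
    r = H @ delta + g                 # residual of the Newton system
    r_co = H @ r                      # cokernel residual: removes any
                                      # component of r in ker(H)
    rel_r = np.linalg.norm(r) / np.linalg.norm(g)
    rel_r_co = np.linalg.norm(r_co) / np.linalg.norm(H @ g)
    return rel_r, rel_r_co

# A gradient lying mostly in ker(H): the Newton system is nearly
# unsatisfiable, so even the best solution leaves rel_r near 1 while the
# cokernel residual norm is near 0 -- an approximate gradient-flat point.
H = np.diag([2.0, 0.0])               # singular Hessian, ker = span(e_y)
g = np.array([0.1, 1.0])              # gradient mostly in the kernel
delta = -np.linalg.pinv(H) @ g        # minimum-norm inexact solution
print(flatness_residuals(H, g, delta))
```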
Under this relaxed definition of gradient-flatness, there will be a neighborhood of approximate gradient-flat points around any strict, exact gradient-flat point of a function with Lipschitz-smooth gradients and Hessians. Furthermore, there may be connected sets of non-null Lebesgue measure whose points all satisfy the approximate gradient-flatness condition but none of which satisfy the exact condition. We call both of these gradient-flat regions.
There are multiple reasonable numerical indices of flatness besides the definition above. For example, the Hessian-gradient regularity condition in [43], which is used to prove the convergence of Newton-MR, would suggest constructing a basis for the approximate kernel of the Hessian and projecting the gradient onto it. Alternatively, one could compute the Rayleigh quotient of the gradient with respect to the Hessian. Our index has the advantage of being computed as part of the Newton-MR algorithm, and it furthermore avoids diagonalizing the Hessian or specifying an arbitrary eigenvalue cutoff. The Rayleigh quotient can be computed with only one Hessian-vector product, plus several vector-vector products, so it might be a superior choice for larger problems where computing a high-quality inexact Newton step is computationally infeasible.
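The Rayleigh-quotient alternative needs only a single Hessian-vector product; a hypothetical sketch in which `hvp` would, in practice, be supplied by automatic differentiation:

```python
import numpy as np

def rayleigh_flatness(hvp, g):
    """Rayleigh quotient g^T H g / ||g||^2 of the gradient with respect to
    the Hessian, from one Hessian-vector product hvp(g). A value near 0,
    relative to the Hessian's overall scale, suggests the gradient lies
    mostly along directions of negligible curvature."""
    return g @ hvp(g) / (g @ g)

H = np.diag([2.0, 0.0])        # stand-in Hessian for illustration
hvp = lambda v: H @ v          # in practice: an autodiff Hessian-vector product

print(rayleigh_flatness(hvp, np.array([0.0, 1.0])))  # -> 0.0 (flat direction)
print(rayleigh_flatness(hvp, np.array([1.0, 0.0])))  # -> 2.0 (curved direction)
```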
3 Gradient-Flat Regions are Common on Deep Network Losses
To determine whether gradient-flat regions are responsible for the poor behavior of Newton methods on deep neural network (DNN) losses demonstrated in Fig. 1, we applied Newton-MR to the loss of a small, two-hidden-layer, fully-connected autoencoder trained on 10k MNIST images downsized to 4x4, similar to the downsized datasets in [11, 39]. We found similar results for a fully-connected classifier trained on the same MNIST images via the cross-entropy loss (see Appendix A.5) and for another classifier trained on a very small subset of 50 randomly-labeled MNIST images, as in [47] (see Appendix A.6). We focused on Newton-MR because we found that a damped Newton method like that in [11] performed poorly, as reported for the XOR problem in [10], and furthermore that there was insufficient detail to replicate [11] exactly. We denote the network losses by $f$ and the parameters by $\theta$. See Appendix A.1 for details on the networks and datasets and Appendix A.2 for details on the critical point-finding experiments.
Gradient norms for the first 100 iterations appear in Fig. 3A. As in the nonlinear autoencoder applied to the multivariate Gaussian data (Fig. 1C), we found that, after 500 iterations, all of the runs had squared gradient norms over 10 orders of magnitude greater than the typical values observed after convergence in the linear case (<1e-30, Fig. 1A). 14% of runs terminated with squared gradient norm below the cutoff in [16] and so found likely critical points (blue). Twice as many runs terminated above that cutoff but inside a gradient-flat region (28%, orange), while the remainder were above the cutoff and were not in a gradient-flat region at the final iterate (black).
The relative residual norm of the Newton solution is an index of gradient-flatness; see Sec. 2.3 and Appendix A.4 for details. Its value at every iteration of Newton-MR is shown for three representative traces in Fig. 3B. In the top trace (black), the relative residual norm stays close to 0, indicating that the iterates are not in a gradient-flat region. Newton methods can be substantially slowed when even a small fraction of the gradient is in the kernel [20] and can converge to points that are not gradient-flat [7]. By contrast, in the middle trace (orange), the relative residual norm approaches 1, indicating that almost the entirety of the gradient is in the kernel. This run terminated in a gradient-flat region, at effectively an exactly gradient-flat point. Further, the squared gradient norm at 500 iterations, 2e-5, is five orders of magnitude higher than the cutoff, 1e-10. This is smaller than the minimum observed during optimization of this loss (squared gradient norms between 1e-4 and 5e-1), indicating the presence of non-critical gradient-flat regions with very low gradient norm. Critical point-finding methods that disqualify points on the basis of their gradient norm will both converge to and accept these points, even though they need not be near true critical points. In the bottom trace (blue), the behavior of the relative residual norm is the same, while the gradient norm drops much lower, to 3e-13, suggesting convergence to a gradient-flat region around a critical point that has an approximately singular Hessian.
Not all traces exhibit such simple behavior. In many traces, the relative residual norm oscillates between values close to 1 and middling values, indicating that the algorithm is bouncing in and out of one or more gradient-flat regions (see Appendix A.5 for examples, on a classifier). This can occur even when the final target of convergence, given infinite iterations, is a gradient-flat point, as in the example in Sec. 2.2.
We found that 99 of 100 traces included a point where at least half of the gradient was in the kernel, according to our residual measure, while 89% of traces included a point with an even greater residual norm, and 50% included a point with a residual norm near its maximum (Fig. 3C, bottom). This demonstrates that there are many regions of substantive gradient-flatness, in which second-order critical point-finding methods can be substantively slowed.
The original purpose of applying these critical point-finding methods was to determine whether the no-bad-local-minima property held for this loss function and, more broadly, to characterize the relationship at the critical points between the loss and the local curvature, summarized via the Morse index. If we look at either the points found after 500 iterations (results not shown; see Appendix A.5 for an example on a classifier) or the iterates with the highest gradient-flatness (Fig. 3D), we find that the qualitative features of the loss-index relationship reported in [11] and [39] are recreated: convex shape, small spread at low index that increases at higher index, and no minima or near-minima at high values of the loss. However, our analysis suggests that the majority of these points are not critical points but either strict gradient-flat points (orange) or simply points of spurious or incomplete Newton convergence (black). The approximately critical points we do see (blue) have a very different loss-index relationship: their loss is equal to the loss of a network with all parameters set to 0, and their index is low, but not 0.
4 Discussion
The results above and in the appendices demonstrate that gradient-flat regions, where the gradient is nearly in the approximate kernel of the Hessian, are a prevalent feature of some prototypical neural network loss surfaces, and that many critical point-finding methods are attracted to them. The networks used in this paper are very small relative to practical networks for image recognition and natural language processing, which have several orders of magnitude more parameters. However, increasing parameter count tends to increase the singularity of loss Hessians [44], and so we expect there to be even greater gradient-flatness for larger networks.

The strategy of using gradient norm cutoffs to determine whether a point is near enough to a critical point for the loss and index to match the true values is natural, but in the absence of guarantees on the smoothness of the behavior of the Hessian (and its spectrum) around the critical point, the numerical value sufficient to guarantee correctness is unclear. Our observations of gradient-flat regions at extremely low gradient norm, and the separation of these values, in terms of loss-index relationship, from the bulk of the observations, suggest that there may be spurious targets of convergence for critical point-finding methods even at such low gradient norm. Alternatively, these points may in fact be near real critical points, and so indicate that the simple, convex picture of the loss-index relationship painted by the numerical results in [11] and [39] is incomplete. Indeed, recent analytical results have demonstrated that bad local minima do exist for almost all neural network architectures and datasets (see [12] for a helpful table of positive and negative theoretical results regarding local minima). Furthermore, our observation of singular Hessians at low gradient norm suggests that some approximate saddle points of neural network losses may be degenerate (as defined in [23]) and non-strict (as defined in [29]), which indicates that gradient descent may be attracted to these points, according to the analyses in [23] and [29]. These points need not be local minima. However, in two cases we observe the lowest-index saddles at low values of the loss (see Fig. 3, Fig. A.1), and so these analyses still predict that gradient descent will successfully reduce the loss, even if it does not find a local minimum. In the third case, an overparameterized network (Fig. A.2), we do observe a bad local minimum, as predicted in [12] for networks capable of achieving 0 training error.
Our results motivate a revisiting of the numerical results in [11] and [39]. Looking back at Figure 4 of [11], we see that their nonconvex Newton method, a second-order optimization algorithm designed to avoid saddle points by reversing the Newton update along directions of negative curvature, appears to terminate at a gradient norm of order 1. This is only a single order of magnitude lower than what was observed during training. It is likely that this point was either in a gradient-flat region or otherwise had sufficient gradient norm in the Hessian kernel to slow the progress of their algorithm. This observation demonstrates that second-order methods designed for optimization, which use the loss as a merit function rather than norms of the gradient, can also terminate in gradient-flat regions. In this case, the merit function encourages convergence to points where the loss, rather than the gradient norm, is small, but it still cannot guarantee convergence to a critical point. The authors of [11] do not report a gradient norm cutoff, among other details needed to recreate their critical point-finding experiments, so it is unclear to which kind of points they converged. If, however, the norms are as large as those of the targets of their nonconvex Newton method, in accordance with our experience with damped Newton methods and that of [10], then the loss-index relationships reported in their Figure 1 are likely to be for gradient-flat points, rather than critical points.
The authors of [39] do report a squared gradient norm cutoff, of 1e-6. This cutoff sits right in the middle of the bulk of values we observed, which we labeled gradient-flat regions and points of spurious convergence based on the cutoff in [16], which separates a small fraction of runs from this bulk. This suggests that some of their putative critical points were gradient-flat points. Their Figure 6 shows a disagreement between their predictions for the index, based on a loss-weighted mixture of a Wishart and a Wigner random matrix, and their observations. We speculate that some of this gap is due to their method recovering approximate gradient-flat points rather than critical points.
Even in the face of results indicating the existence of bad local minima [12], it remains possible that bad local minima of the loss are avoided by initialization and optimization strategies. For example, ReLU networks suffer from bad local minima when one layer's activations are all zero, or when the biases are initialized at too small of a value [21], but careful initialization and training can avoid the issue. Our results do not directly invalidate this hypothesis, but they do call the supporting numerical evidence into question. Our observation of gradient-flat regions on almost every single run suggests that, while critical points are hard to find and may even be rare, regions where the gradient norm is extremely small are neither. For non-smooth losses, e.g. those of ReLU networks or networks with max-pooling, whose loss gradients can have discontinuities, critical points need not exist, but gradient-flat regions may. Indeed, in some cases, the only differentiable minima in ReLU networks are also flat [27].

Other types of critical point-finding methods are not necessarily attracted to gradient-flat regions, in particular Newton homotopy methods (first used on neural networks in the 90s [10], then revived in the 2010s [3, 35]), which are popular in algebraic geometry [4]. However, singular Hessians still cause issues: for a Hessian with nullity k, the curve to be continued by the homotopy becomes a manifold with dimension 1 + k, and orientation becomes more difficult. This can be avoided by removing the singularity of the Hessian, e.g. by the randomly-weighted regularization method in [34]. However, while these techniques may make it possible to find critical points, they fundamentally alter the loss surface, limiting their utility in drawing conclusions about other features of the loss. In particular, in the time since the initial resurgence of interest in the curvature properties of neural network losses sparked by [11], the importance of overparameterization for optimization of and generalization by neural networks has been identified [30, 40]. Large overparameterized networks have more singular Hessians [44], and so the difference between the original loss and an altered version with an invertible Hessian is greater. Furthermore, the prevalence of gradient-flat regions should be greater, since the Hessian kernel covers an increasingly large subspace.
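The effect of regularizing away a singular Hessian can be seen in a toy calculation. The following sketch is not the method of [34]; it uses a simple ridge term eps*I (our choice) to illustrate why any such modification alters the problem: the resulting Newton step depends on the artificial parameter eps, and blows up as eps goes to 0 when the gradient has a kernel component.

```python
import numpy as np

# Toy illustration (not the method of [34]): a rank-deficient Hessian H
# makes the Newton system H p = -g unsolvable when g has a component in
# ker(H); adding a small ridge eps * I restores invertibility, but the
# resulting step depends strongly on the artificial parameter eps.
H = np.diag([2.0, 1.0, 0.0])          # singular Hessian: nullity 1
g = np.array([1.0, 1.0, 1.0])         # gradient with a component in ker(H)

for eps in (1e-2, 1e-4, 1e-6):
    p = np.linalg.solve(H + eps * np.eye(3), -g)
    print(eps, np.linalg.norm(p))     # step norm grows like 1/eps
```

The step norm here grows without bound as the regularization is removed, which is one way of seeing that conclusions drawn from the regularized surface need not transfer to the original loss.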
The authors of [44] emphasize that when the Hessian is singular everywhere, the notion of a basin of attraction is misleading, since targets of convergence form connected manifolds and some assumptions in theorems guaranteeing first-order convergence become invalid [23], though with sufficient, if unrealistic, overparameterization convergence can be proven [14]. They speculate that a better approach to understanding the behavior of optimizers focuses on their exploration of the sublevel sets of the loss. Our results corroborate that speculation and further indicate that this flatness means using second-order methods to try to accelerate exploration of these regions in search of minimizers is likely to fail: the alignment of the gradient with the Hessian's approximate kernel will tend to produce extremely large steps, for some methods, or no acceleration and even convergence to non-minimizers, for others.
Our observation of ubiquitous gradient-flatness further provides an alternative explanation for the success and popularity of approximate second-order optimizers for neural networks, like K-FAC [32], which uses a layer-wise approximation to the Hessian. These methods are typically motivated by appeals to the computational cost of even Hessian-free exact second-order methods and their brittleness in the stochastic (non-batch) setting. However, exact second-order methods are only justified when the second-order model is good, and at an exact gradient-flat point, the second-order model can be infinitely bad, in a sense, along the direction of the gradient. Approximations need not share this property. Even more extreme approximations, like the diagonal approximations in the adaptive gradient family (e.g. AdaGrad [15], Adam [26]), behave very reasonably in gradient-flat regions: they smoothly scale up the gradient in the directions in which it is small and changing slowly, without making a quadratic model that is optimal in a local sense but poor in a global sense.
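A minimal sketch of the diagonal rescaling just described, using the standard Adam update rule [26] (the specific gradient values are invented for illustration): a coordinate with a tiny, slowly-varying gradient and a coordinate with a large gradient both receive steps of comparable, learning-rate-scale magnitude, with no quadratic model involved.

```python
import numpy as np

# One Adam step, following the update rule of [26].
def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

g = np.array([1e-6, 1e-1])    # one nearly flat direction, one steep one
m, v = np.zeros(2), np.zeros(2)
step, m, v = adam_step(g, m, v, t=1)
print(step)   # both coordinates get steps of roughly lr magnitude
```

The flat direction is scaled up by the square root of its own (tiny) second moment, so the method makes steady progress through a gradient-flat region without ever forming the untrustworthy quadratic model that trips up exact Newton-type methods.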
Overall, our results underscore the difficulty of searching for critical points of singular nonconvex functions, including deep network loss functions, and shed new light on other numerical results in this field. In this setting, second-order methods for finding critical points can fail badly, by converging to gradient-flat points. This failure can be hard to detect unless it is specifically measured. Furthermore, gradient-flat points are generally places where quadratic approximations become untrustworthy, and so our observations are of relevance for the design of exact and approximate second-order optimization methods as well.
Acknowledgements
The authors would like to thank Yasaman Bahri, Jesse Livezey, Dhagash Mehta, Dylan Paiton, and Ryan Zarcone for useful discussions. Authors CF & AL were supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. NW was supported by the Google PhD Fellowship. Author AL was supported by a National Institutes of Health training grant, 5T32NS095939. MRD was supported in part by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract W911NF-13-1-0390. KB was funded by a DOE/LBNL LDRD, 'Deep Learning for Science' (PI, Prabhat).
References
[1] (2000) Saddles in the energy landscape probed by supercooled liquids. Physical Review Letters 85 (25), pp. 5356–5359.
[2] (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2 (1), pp. 53–58.
[3] (2017) Energy landscapes for machine learning. Phys. Chem. Chem. Phys. 19, pp. 12585–12603.
[4] (2013) Numerically solving polynomial systems with Bertini (Software, Environments and Tools). Society for Industrial and Applied Mathematics. ISBN 1611972698.
[5] (2004) Convex optimization. Cambridge University Press, New York, NY, USA. ISBN 0521833787.
[6] (2000) Energy landscape of a Lennard-Jones liquid: statistics of stationary points. Physical Review Letters 85 (25), pp. 5360–5363.
[7] (2004) On the convergence of Newton iterations to non-stationary points. Mathematical Programming 99 (1), pp. 127–148.
[8] (1981) On finding transition states. The Journal of Chemical Physics 75 (6), pp. 2800–2806.
[9] (2011) MINRES-QLP: a Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing 33 (4), pp. 1810–1836.
[10] (1997) 488 solutions to the XOR problem. In Advances in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), pp. 410–416.
[11] (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR abs/1406.2572.
[12] (2019) Suboptimal local minima exist for almost all over-parameterized neural networks. arXiv:1911.01413.
[13] (2002) Saddle points and dynamics of Lennard-Jones clusters, solids, and supercooled liquids. The Journal of Chemical Physics 116 (9), pp. 3777–3788.
[14] (2019) Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[15] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
[16] (2019) Numerically recovering the critical points of a deep linear autoencoder. arXiv:1901.10603.
[17] (2018) Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv:1802.10026.
[18] (2019) An investigation into neural net optimization via Hessian eigenvalue density. arXiv:1901.10159.
[19] (2014) Qualitatively characterizing neural network optimization problems. CoRR abs/1412.6544.
[20] (1983) Analysis of Newton's method at irregular singularities. SIAM Journal on Numerical Analysis 20 (4), pp. 747–773.
[21] (2020) Training two-layer ReLU networks with gradient descent is inconsistent. arXiv:2002.04861.
[22] (2014) Newton-type methods for optimization and variational problems. Springer International Publishing.
[23] (2017) How to escape saddle points efficiently. CoRR abs/1703.00887.
[24] (2018) Minimizing nonconvex population risk from rough empirical risk. CoRR abs/1803.09357.
[25] (2018) Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory, Proceedings of Machine Learning Research, Vol. 75, pp. 1042–1085.
[26] (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
[27] (2017) The multilinear structure of ReLU networks. arXiv:1712.10132.
[28] (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[29] (2016) Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 49, pp. 1246–1257.
[30] (2018) On the benefit of width for neural networks: disappearance of bad basins. arXiv:1812.11039.
[31] (2016) Modeling, inference and optimization with composable differentiable procedures. Ph.D. Thesis, Harvard University.
[32] (2015) Optimizing neural networks with Kronecker-factored approximate curvature. arXiv:1503.05671.
[33] (1972) Structure of transition states in organic reactions. General theory and an application to the cyclobutene-butadiene isomerization using a semiempirical molecular orbital method. Journal of the American Chemical Society 94 (8), pp. 2625–2633.
[34] (2018) The loss surface of deep linear networks viewed through the algebraic geometry lens. arXiv:1810.07716.
[35] (2018) Loss surface of XOR artificial neural networks. Physical Review E 97 (5).
[36] (2006) Numerical optimization. 2nd edition, Springer Series in Operations Research and Financial Engineering, Springer, New York, NY. ISBN 9780387303031.
[37] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
[38] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[39] (2017) Geometry of neural network loss surfaces via random matrix theory. In International Conference on Learning Representations (ICLR).
[40] (2020) Complexity control by gradient descent in deep networks. Nature Communications 11 (1).
[41] (1970) A hybrid method for nonlinear equations. In Numerical Methods for Nonlinear Algebraic Equations.
[42] (2017) Searching for activation functions. arXiv:1710.05941.
[43] (2018) Newton-MR: Newton's method without smoothness or convexity. arXiv:1810.00303.
[44] (2017) Empirical analysis of the Hessian of over-parametrized neural networks. CoRR abs/1706.04454.
[45] (1993) The fundamental theorem of linear algebra. The American Mathematical Monthly 100 (9), p. 848.
[46] (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957.
[47] (2016) Understanding deep learning requires rethinking generalization. CoRR abs/1611.03530.
Appendix A Appendices
A.1 Networks and Datasets
A.1.1 Datasets
For the experiments in Fig. 1, 10000 16-dimensional Gaussian vectors with mean 0 and diagonal covariance with linearly-spaced values between 1 and 16 were generated and then mean-centered.
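A minimal sketch of this dataset generation, reading the linearly spaced covariance values as variances (the text does not say whether they are variances or standard deviations, and the random seed is our choice):

```python
import numpy as np

# 10000 16-dimensional Gaussian vectors, mean 0, diagonal covariance with
# entries linearly spaced from 1 to 16, then mean-centered.
rng = np.random.default_rng(0)               # seed is our choice
variances = np.linspace(1.0, 16.0, 16)
X = rng.normal(0.0, np.sqrt(variances), size=(10000, 16))
X = X - X.mean(axis=0, keepdims=True)        # mean-center

print(X.shape)    # (10000, 16)
```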
For the experiments in Fig. 3 and Fig. A.1, 10000 images from the MNIST dataset [28] were cropped to 20 x 20 and rescaled to 4 x 4 using PyTorch [37], then z-scored. This was done for two reasons: first, to improve the conditioning of the data covariance, which is very poor for MNIST due to low variance in the border pixels; second, to reduce the number of parameters in the network, since computing a high-quality inexact Newton solution becomes more expensive as the parameter count grows. Nonlinear classification networks trained on this downsampled data could still obtain accuracies above 90%, better than the performance of logistic regression (87%).

For the experiments in Fig. A.2, 50 random images of 0s and 1s from the MNIST dataset were PCA-downsampled to 32 dimensions using sklearn [38]. This provided an alternative approach to improving the conditioning of the data covariance and reducing the parameter counts in the network. The labels for these images were then shuffled.
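The PCA-downsampling and label-shuffling step can be sketched as follows. We use a random stand-in for the 50 MNIST images rather than downloading the dataset, and implement the PCA projection directly with an SVD (the paper used sklearn's PCA, which is equivalent up to signs):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((50, 784))          # placeholder for flattened digits
labels = np.repeat([0, 1], 25)          # 25 zeros and 25 ones

# PCA-downsample to 32 dimensions: center, then project onto the top
# 32 right singular vectors of the centered data matrix.
centered = images - images.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
X = centered @ Vt[:32].T

shuffled_labels = rng.permutation(labels)   # shuffle the labels

print(X.shape)    # (50, 32)
```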
A.1.2 Networks
All networks, their optimization algorithms, and the critical point-finding algorithms were defined in the autograd Python package [31]. For the experiments in Fig. 1, two networks were trained: a linear autoencoder with a single, fully-connected hidden layer of 4 units and a deep nonlinear autoencoder with two fully-connected hidden layers of 16 and 4 units with Swish [42] activations. Performance of the critical point-finding algorithms was even worse for networks with rectified linear units (results not shown), as reported by others (Pennington and Bahri, personal communication). Non-smooth losses need not have gradients that smoothly approach 0 near local minimizers, so it is only sensible to apply critical point-finding to smooth losses; see [27]. The nonlinear autoencoder used L2 regularization. Neither network had biases. All autoencoding networks were trained with mean squared error.

For the experiments in Fig. 3, a fully-connected autoencoder with two hidden layers of 8 and 16 units, with Swish activations and biases, was used. This network had no regularization.
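A minimal numpy sketch of the Fig. 3 network: a fully-connected autoencoder with hidden layers of 8 and 16 units, Swish activations, biases, and mean squared error. The layer sizes and loss follow the text; the initialization scheme and the 16-dimensional input (the downsampled MNIST of Appendix A.1.1) are our assumptions, and the paper defines its networks in autograd rather than raw numpy.

```python
import numpy as np

def swish(x):
    # Swish activation [42]: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def init(sizes, rng):
    # one (weights, biases) pair per layer; 1/sqrt(fan-in) scaling is our choice
    return [(rng.normal(0, 1 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = swish(x @ W + b)
    W, b = params[-1]
    return x @ W + b                   # linear output layer

rng = np.random.default_rng(0)
sizes = [16, 8, 16, 16]                # input, hidden 8, hidden 16, output
params = init(sizes, rng)
X = rng.normal(size=(32, 16))          # a batch of 16-dimensional inputs
loss = np.mean((forward(params, X) - X) ** 2)   # MSE reconstruction loss
print(loss)
```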
For the experiments in Fig. A.1, a fully-connected classifier with two hidden layers of 12 and 8 units, with Swish activations and biases, was used. This network had L2 regularization, since the cross-entropy loss with which it was trained can otherwise have critical points at infinity.
For the experiments in Fig. A.2, a fully-connected classifier with two hidden layers of 32 and 4 units, with Swish activations, was used. This network had no biases. It also used L2 regularization and was trained with the cross-entropy loss. Networks were trained to near-perfect training performance: 48 to 50 correctly-classified examples out of 50.
A.2 Critical Point-Finding Experiments
For all critical point-finding experiments, we followed the basic procedure pioneered in [11] and used in [39] and [16]. First, an optimization algorithm was used to train the network multiple times. For the results in Fig. 1, this algorithm was full-batch gradient descent, while for the remainder of the results, this algorithm was full-batch gradient descent with momentum (learning rates 0.1 in both cases; momentum 0.9 in the latter).
The parameter values produced during optimization were then used as starting positions for Newton-MR. Following [39] and based on the arguments in [16], we selected these initial points uniformly at random with respect to their loss value.
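One plausible reading of this selection rule can be sketched as follows (the function name and the nearest-checkpoint matching are our illustration, not the paper's exact procedure): draw target loss values uniformly between the minimum and maximum observed losses, then pick the saved checkpoint whose loss is closest to each target.

```python
import numpy as np

def select_uniform_in_loss(losses, n_starts, rng):
    """Pick checkpoint indices whose losses are ~uniform in loss value."""
    targets = rng.uniform(losses.min(), losses.max(), size=n_starts)
    # for each target loss, take the checkpoint whose loss is closest
    return np.abs(losses[None, :] - targets[:, None]).argmin(axis=1)

# hypothetical training-loss curve: 200 checkpoints, loss decaying to 1e-4
losses = np.geomspace(1.0, 1e-4, 200)
idx = select_uniform_in_loss(losses, 10, np.random.default_rng(0))
print(sorted(idx))
```

Sampling uniformly in loss, rather than uniformly in training time, avoids over-representing the late, low-loss portion of the trajectory where checkpoints cluster.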
Newton-MR [43] computes an inexact Newton update with the MINRES-QLP solver [9] and then performs backtracking line search based on the squared gradient norm. For pseudocode of the equivalent exact algorithm, see Appendix A.3.
The MINRES-QLP solver has the following hyperparameters for determining stopping behavior: maximum number of iterations (maxit), maximum solution norm (maxxnorm), relative residual tolerance (rtol), and condition number limit (acondlim). We set maxit equal to the number of parameters, since with exact arithmetic this is sufficient to solve the Newton system. We found that the maxxnorm and acondlim parameters did not affect stopping behavior, which was driven by the tolerance rtol. For Fig. 1, we used a tolerance of 1e-10. For Fig. 3 and Fig. A.1, we used a tolerance of 5e-4, based on the values of the relative residuals found after maxit iterations on test points. See Appendix A.4 for details about the relative residual stopping criterion. We do not provide pseudocode for this algorithm; see [9].

The backtracking line search has the following hyperparameters: the starting step size, the multiplicative factor by which the step size is reduced, and the degree of improvement required to terminate the line search. We fixed these hyperparameters across all experiments. Furthermore, in backtracking line search for Newton methods, it is important to always check a unit step size in order to attain superlinear convergence [36]. So before running the line search, we also check the unit step size with a stricter improvement requirement.
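The line search just described can be sketched as follows. The merit function is the squared gradient norm; the hyperparameter names and default values (alpha0, beta, c, c_unit) are ours, standing in for the paper's symbols, which were lost in extraction.

```python
import numpy as np

def backtracking(merit, x, p, dderiv, alpha0=1.0, beta=0.5,
                 c=1e-4, c_unit=0.9, max_iter=50):
    """Armijo backtracking on a merit function (squared gradient norm).

    dderiv: directional derivative of the merit function along p (< 0).
    """
    f0 = merit(x)
    # check the full Newton step first, with a stricter acceptance
    # constant, to preserve superlinear convergence near a solution
    if merit(x + p) <= f0 + c_unit * dderiv:
        return 1.0
    alpha = alpha0
    for _ in range(max_iter):          # standard Armijo condition
        if merit(x + alpha * p) <= f0 + c * alpha * dderiv:
            break
        alpha *= beta
    return alpha

# toy usage: gradient g(x) = x, merit ||g||^2, Newton step p = -x
x = np.array([1.0, 0.0])
print(backtracking(lambda z: float(z @ z), x, -x, dderiv=-2.0))
```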
A.3 Newton-MR Pseudocode
The pseudocode in Algorithm 1 defines an exact least-squares Newton method with exact line search. To obtain the inexact Newton-MR algorithm, the least-squares problem that determines the update direction should be solved approximately using the MINRES-QLP solver [9], and the exact line search that determines the step size should be replaced by Armijo-type backtracking line search. For details, see [43].
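Since Algorithm 1 itself did not survive extraction, here is a runnable sketch of the exact algorithm as described: at each step, solve the Newton system in the least-squares sense (here via numpy's minimum-norm least-squares solver), then choose the step size minimizing the squared gradient norm, with a grid search standing in for exact line search.

```python
import numpy as np

def least_squares_newton(grad, hess, x, n_steps=100):
    """Exact least-squares Newton with (approximate) exact line search."""
    for _ in range(n_steps):
        g, H = grad(x), hess(x)
        # minimum-norm least-squares solution of the Newton system H p = -g
        p = np.linalg.lstsq(H, -g, rcond=None)[0]
        # grid search over step sizes, standing in for exact line search
        alphas = np.linspace(0.0, 2.0, 201)
        merits = [np.sum(grad(x + a * p) ** 2) for a in alphas]
        x = x + alphas[int(np.argmin(merits))] * p
    return x

# usage on a toy quartic with a known critical point at the origin
grad = lambda x: x ** 3
hess = lambda x: np.diag(3 * x ** 2)
x_star = least_squares_newton(grad, hess, np.array([1.0, -0.5]))
print(np.sum(grad(x_star) ** 2))   # squared gradient norm near zero
```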
A.4 Relative Residual and Relative Co-Kernel Residual
The relative residual norm r1 measures the size of the error of an approximate solution to the Newton system. Introducing the symbols H and g for the Hessian and gradient at a query point, the Newton system may be written Hp = -g, and the relative residual of an approximate solution p is then

r1 = ||Hp + g|| / (||H||_F ||p|| + ||g||)

where ||H||_F is the Frobenius norm of H. Since all quantities are nonnegative, r1 is nonnegative; because the denominator bounds the numerator, by the triangle inequality and the compatibility of the Frobenius and Euclidean norms, r1 is at most 1. For an exact solution of the Newton system, r1 is 0, the minimum value, while for the zero vector p = 0 it is 1, the maximum value. Note that small values of ||g|| do not imply large values of this quantity, since the numerator ||Hp + g|| also goes to 0 when a Newton method converges towards a critical point, as ||g|| goes to 0.
When g is partially in the kernel of H, the Newton system is unsatisfiable, as -g will also be partly in the co-image of H, the linear subspace into which H cannot map any vector. In this case, the minimal value for r1 will no longer be 0. The optimal solution instead has the property that its residual is 0 once restricted to the co-kernel of H, the linear subspace orthogonal to the kernel of H. This co-kernel residual can be measured by applying the matrix H to the residual vector Hp + g. After normalization, it becomes

r2 = ||H(Hp + g)|| / (||H||_F ||Hp + g||)
Note that this value is also small when the residual lies primarily along the eigenvectors of H with eigenvalues of smallest magnitude. On each internal iteration, MINRES-QLP checks whether either of these values is below a tolerance level (in our experiments, 5e-4) and, if either is, it ceases iteration. With exact arithmetic, one or the other of these values should go to 0 within a finite number of iterations; with inexact arithmetic, they should just become small. See [9] for details. Less than 5% of Newton steps were obtained from the plain relative residual going below the tolerance, with the remainder stopping on the co-kernel residual, indicating that almost all points of the loss surface had an approximately unsatisfiable Newton system.
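The two normalized residuals above can be computed directly (the function name is ours; this reconstructs the formulas in the notation Hp = -g, with residual r = Hp + g, not MINRES-QLP's internal implementation):

```python
import numpy as np

def relative_residuals(H, p, g):
    """Relative residual r1 and relative co-kernel residual r2."""
    r = H @ p + g                       # residual of the Newton system
    nH = np.linalg.norm(H, 'fro')
    r1 = np.linalg.norm(r) / (nH * np.linalg.norm(p) + np.linalg.norm(g))
    r2 = np.linalg.norm(H @ r) / (nH * np.linalg.norm(r))
    return r1, r2

# when g has a kernel component, the least-squares solution drives the
# co-kernel residual r2 to zero even though r1 cannot reach zero
H = np.diag([1.0, 0.0])                # singular Hessian
g = np.array([1.0, 1.0])               # gradient partly in ker(H)
p = np.linalg.lstsq(H, -g, rcond=None)[0]
r1, r2 = relative_residuals(H, p, g)
print(r1, r2)
```

(The r2 expression is undefined when the residual is exactly zero; in that case the r1 criterion has already fired.)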
A.5 Replication of Results from Sec. 3 on MNIST MLP
We repeated the experiments whose results are visualized in Fig. 3 on the loss surface of a fully-connected classifier on the same modified MNIST dataset (details in Appendix A.1). We again found that the performance of the Newton-MR critical point-finding algorithm was poor (Fig. A.1A) and that around 90% of runs encountered a point with gradient-flatness above our cutoff (Fig. A.1C, bottom row). However, we observed that fewer runs terminated at a gradient-flat point (Fig. A.1C, top row), perhaps because the algorithm was bouncing in and out of gradient-flat regions (Fig. A.1B, top and bottom rows), rather than because of another type of spurious Newton convergence. If we measure the loss-index relationship at these non-gradient-flat points (Fig. A.1D), we see the same pattern as in Fig. 3. This also holds if we look at the maximally flat points, as in Fig. 3D (results not shown).
A.6 Replication of Results from Sec. 3 on Binary MNIST Subset Memorization
We repeated the experiments whose results are visualized in Fig. 3 on the loss surface of a fully-connected classifier on a small subset of 50 0s and 1s from the MNIST dataset (details in Appendix A.1). In this setting, the network is overparameterized, in that it has a hidden layer almost as wide as the dataset (32 units vs 50 points) and has more parameters than there are points in the dataset (1160 vs 50). It is also capable of achieving 100% accuracy on this training set, which has random labels, as in [47]. We again observe that the majority of runs do not terminate with squared gradient norm under 1e-8 (33 out of 50 runs) and that a similar fraction (31 out of 50 runs) encounter gradient-flat points (Fig. A.2A and C, bottom panel). The loss-index relationship looks qualitatively different, as might be expected for a task with random labels. Notice the appearance of a bad local minimum: the blue point at index 0 with a loss well above the minimum.