On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width

01/14/2020
by Etai Littwin, et al.

The Hessian of neural networks can be decomposed into a sum of two matrices: (i) the positive semidefinite generalized Gauss-Newton matrix G, and (ii) the matrix H, which may contain negative eigenvalues. We observe that in wider networks, minimizing the loss with gradient descent moves through regions of positive curvature at the start and end of training, and through regions of near-zero curvature in between. In other words, during the crucial parts of training, the Hessian of a wide network appears to be dominated by the component G. To explain this phenomenon, we show that under common initialization schemes, the gradients of over-parameterized networks are approximately orthogonal to H, so that the curvature of the loss surface is strictly positive in the direction of the gradient.
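A minimal sketch of the decomposition the abstract describes, not the authors' code: for a loss L(theta) = l(f(theta)), the Hessian splits into the Gauss-Newton term G = J^T (Hess l) J and a residual term H, and the curvature along the gradient g is g^T (G + H) g. The small ReLU network, the MSE loss, the widths, and the random data below are hypothetical placeholders chosen only to make the sketch runnable; at larger width, g^T H g should be small relative to g^T G g at initialization.

import jax
import jax.numpy as jnp

def init_mlp(key, widths):
    # He-style initialization for a fully connected ReLU network.
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in))
    return params

def forward(params, x):
    # Network output f(theta, x); ReLU on all but the last layer.
    h = x
    for w in params[:-1]:
        h = jax.nn.relu(h @ w)
    return h @ params[-1]

def loss(params, x, y):
    # Mean-squared-error loss l(f(theta, x), y).
    return 0.5 * jnp.mean((forward(params, x) - y) ** 2)

key = jax.random.PRNGKey(0)
kx, ky, kp = jax.random.split(key, 3)
width = 512                                   # try smaller widths to see g^T H g grow
x = jax.random.normal(kx, (64, 10))
y = jax.random.normal(ky, (64, 1))
params = init_mlp(kp, [10, width, width, 1])

# Gradient direction g.
g = jax.grad(loss)(params, x, y)

# Full Hessian-vector product (Hess L) g via forward-over-reverse differentiation.
hvp = jax.jvp(lambda p: jax.grad(loss)(p, x, y), (params,), (g,))[1]

# Inner product over parameter pytrees.
def tree_dot(a, b):
    return sum(jnp.vdot(u, v) for u, v in zip(jax.tree_util.tree_leaves(a),
                                              jax.tree_util.tree_leaves(b)))

# Curvature from the full Hessian: g^T (G + H) g.
full_curv = tree_dot(g, hvp)

# Gauss-Newton part: g^T G g = (J g)^T (Hess l) (J g); for this MSE, Hess l = I / n.
jg = jax.jvp(lambda p: forward(p, x), (params,), (g,))[1]
gn_curv = jnp.sum(jg ** 2) / jg.size

print("g^T (G+H) g =", full_curv)
print("g^T G g     =", gn_curv)               # always >= 0
print("g^T H g     =", full_curv - gn_curv)   # close to zero for wide networks at init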


Related research

02/06/2019
Negative eigenvalues of the Hessian in deep neural networks
The loss function of deep networks is known to be non-convex but the pre...

02/19/2018
BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks
We propose a block-diagonal approximation of the positive-curvature Hess...

12/26/2017
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
We show that the (stochastic) gradient descent algorithm provides an imp...

12/02/2019
On the Delta Method for Uncertainty Approximation in Deep Learning
The Delta method is a well known procedure used to quantify uncertainty ...

05/08/2020
The critical locus of overparameterized neural networks
Many aspects of the geometry of loss functions in deep learning remain m...

02/12/2021
Applicability of Random Matrix Theory in Deep Learning
We investigate the local spectral statistics of the loss surface Hessian...

01/22/2018
Rover Descent: Learning to optimize by learning to navigate on prototypical loss surfaces
Learning to optimize - the idea that we can learn from data algorithms t...
