Deep learning, stochastic gradient descent and diffusion maps

04/04/2022
by Carmina Fjellström, et al.

Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss function of over-parametrized deep networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace spanned by the eigenvectors corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper we pursue a truly data-driven approach to gaining a potentially deeper understanding of the high-dimensional parameter surface, and in particular of the landscape traced out by SGD: we analyze the data generated through SGD, or any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration we use diffusion maps, introduced by R. Coifman and coauthors.
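
The following is a minimal, self-contained sketch (not the authors' implementation) of the idea described above: record the iterates produced by SGD on a toy loss whose Hessian has a few large and many near-zero eigenvalues, then embed the trajectory with a basic diffusion map (Gaussian kernel, row-normalized Markov matrix, eigendecomposition). The toy quadratic loss, the median-distance bandwidth heuristic, and the step counts are illustrative assumptions.

```python
# Minimal sketch: diffusion map (Coifman-Lafon) applied to SGD iterates.
# The toy loss, bandwidth heuristic and step counts are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Generate a trajectory of SGD iterates on a toy loss ---------------
# Toy quadratic loss with a few "stiff" directions (large Hessian eigenvalues)
# and many nearly flat ones, mimicking the observed Hessian spectrum.
dim = 50
hess_eigs = np.concatenate([np.array([10.0, 5.0]), 1e-3 * np.ones(dim - 2)])
theta = rng.normal(size=dim)
trajectory = []
for _ in range(500):
    grad = hess_eigs * theta + 0.1 * rng.normal(size=dim)  # noisy gradient
    theta = theta - 0.05 * grad                             # SGD step
    trajectory.append(theta.copy())
X = np.array(trajectory)                                    # (n_steps, dim)

# --- 2. Diffusion map on the collected iterates ----------------------------
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
eps = np.median(sq_dists)          # common heuristic bandwidth choice
W = np.exp(-sq_dists / eps)        # Gaussian affinity kernel
d = W.sum(axis=1)
P = W / d[:, None]                 # row-normalized Markov transition matrix

# Eigendecomposition via the symmetric conjugate of P for numerical stability.
S = W / np.sqrt(np.outer(d, d))
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
psi = evecs / np.sqrt(d)[:, None]  # right eigenvectors of P

# Diffusion coordinates (skip the trivial constant eigenvector): if the SGD
# dynamics concentrate on a low-dimensional manifold, a few coordinates
# suffice to describe the trajectory.
n_coords = 3
diffusion_coords = evals[1:n_coords + 1] * psi[:, 1:n_coords + 1]
print("leading nontrivial eigenvalues:", np.round(evals[1:6], 4))
```

For an actual network one would record the (possibly projected) parameter vectors along training instead of the toy iterates above; the diffusion-map step itself is unchanged.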

research
03/01/2018

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Understanding the generalization of deep learning has raised lots of con...
research
09/09/2023

Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape

Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm use...
research
09/29/2018

Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep Learning

Although stochastic gradient descent (SGD) is a driving force behind the...
research
01/31/2022

On the Power-Law Spectrum in Deep Learning: A Bridge to Protein Science

It is well-known that the Hessian matters to optimization, generalizatio...
research
11/06/2016

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

This paper proposes a new optimization algorithm called Entropy-SGD for ...
research
11/12/2020

Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian

Over the last decade, a single algorithm has changed many facets of our ...
research
12/20/2014

Explorations on high dimensional landscapes

Finding minima of a real valued non-convex function over a high dimensio...
