Deep learning, stochastic gradient descent and diffusion maps
Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss function of over-parametrized deep networks are close to zero, while only a small number of eigenvalues are large. A zero eigenvalue indicates zero diffusion along the corresponding direction, so the process of minima selection mainly takes place in the relatively low-dimensional subspace spanned by the top eigenvectors of the Hessian. Although the parameter space is very high-dimensional, these findings suggest that the SGD dynamics may essentially live on a low-dimensional manifold. In this paper we pursue a truly data-driven approach to gaining a potentially deeper understanding of the high-dimensional parameter surface, and in particular of the landscape traced out by SGD. We analyze the data generated by SGD, or by any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for this exploration we use diffusion maps, introduced by R. Coifman and coauthors.
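As a rough illustration of the idea (not the authors' pipeline), the following minimal Python sketch applies a standard diffusion-map construction to a cloud of flattened SGD parameter iterates; the function name, the kernel bandwidth `epsilon`, and the synthetic data are all illustrative assumptions.

```python
# Minimal sketch: diffusion-map embedding of SGD iterates (illustrative only).
import numpy as np

def diffusion_map(X, epsilon=1.0, n_components=2, t=1):
    """X: (n_samples, n_params) array of flattened parameter iterates."""
    # Pairwise squared Euclidean distances between iterates.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Gaussian kernel affinity matrix.
    K = np.exp(-sq_dists / epsilon)
    # Row-normalize to obtain the Markov (diffusion) matrix P.
    P = K / K.sum(axis=1, keepdims=True)
    # Eigen-decompose P; eigenvalues are real since K is symmetric.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    # Diffusion coordinates: drop the trivial constant eigenvector,
    # scale the next ones by eigenvalue^t.
    return eigvecs[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t

# Example with synthetic "iterates": 200 points in a 10,000-dimensional space.
rng = np.random.default_rng(0)
iterates = rng.normal(size=(200, 10_000))
embedding = diffusion_map(iterates, epsilon=50.0)
print(embedding.shape)  # (200, 2)
```

In practice the rows of `X` would be parameter vectors recorded along an SGD trajectory, and the leading diffusion coordinates would serve as candidate low-dimensional coordinates for the region of the loss landscape being explored.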