Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

11/06/2016
by Pratik Chaudhari, et al.
Microsoft
New York University
Politecnico di Torino

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian, with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD, where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, show improved generalization over SGD using uniform stability. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
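The nested-loop structure described in the abstract can be sketched in a few lines. The sketch below is a minimal illustration on a toy least-squares problem, not the paper's experimental setup: the loss, the batch construction, and every hyperparameter value (eta, gamma, L, eta_prime, eps, alpha) are assumptions chosen only to make the example runnable. The inner loop runs stochastic gradient Langevin dynamics (SGLD) around the current weights w to estimate the mean mu of the local Gibbs distribution; the outer step then descends the negative local entropy, whose gradient is approximately gamma * (w - mu).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data set: linear regression with mini-batches.
X_full = rng.standard_normal((256, 10))
w_true = rng.standard_normal(10)
y_full = X_full @ w_true + 0.1 * rng.standard_normal(256)

def sample_batch(size=32):
    idx = rng.integers(0, len(y_full), size)
    return X_full[idx], y_full[idx]

def loss_grad(w, batch):
    """Stochastic gradient of a least-squares loss (stand-in for a network's loss)."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

def entropy_sgd_step(w, eta=0.1, gamma=0.1, L=20, eta_prime=0.1, eps=1e-4, alpha=0.75):
    """One outer Entropy-SGD update (hyperparameter values are illustrative, not the paper's).

    Inner loop: L steps of SGLD around w, pulled back toward w by the gamma-coupling term,
    maintaining a running estimate mu of the mean of the local Gibbs distribution.
    Outer step: gradient descent on the negative local entropy, approximated by gamma * (w - mu).
    """
    w_prime = w.copy()
    mu = w.copy()
    for _ in range(L):
        g = loss_grad(w_prime, sample_batch()) + gamma * (w_prime - w)
        noise = np.sqrt(eta_prime) * eps * rng.standard_normal(w.shape)
        w_prime = w_prime - eta_prime * g + noise    # Langevin (SGLD) step
        mu = (1.0 - alpha) * mu + alpha * w_prime    # exponential running average of w'
    return w - eta * gamma * (w - mu)                # outer SGD step

w = np.zeros(10)
for _ in range(200):
    w = entropy_sgd_step(w)
```

The paper also increases gamma over the course of training (its "scoping" schedule), which gradually narrows the local-entropy neighborhood; the fixed gamma above is a simplification.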


Related Research

06/14/2020 · Entropic gradient descent algorithms and wide flat minima
The properties of flat minima in the empirical risk landscape of neural ...

07/25/2021 · SGD May Never Escape Saddle Points
Stochastic gradient descent (SGD) has been deployed to solve highly non-...

01/12/2022 · On generalization bounds for deep networks based on loss surface implicit regularization
The classical statistical learning theory says that fitting too many par...

04/04/2022 · Deep learning, stochastic gradient descent and diffusion maps
Stochastic gradient descent (SGD) is widely used in deep learning due to...

01/20/2022 · Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape
In this paper, we study the sharpness of a deep learning (DL) loss lands...

06/16/2020 · Flatness is a False Friend
Hessian based measures of flatness, such as the trace, Frobenius and spe...

01/31/2019 · Improving SGD convergence by tracing multiple promising directions and estimating distance to minimum
Deep neural networks are usually trained with stochastic gradient descen...

Code Repositories

chainer-entropy-adam

Chainer-based implementation of Entropy-Adam: https://arxiv.org/abs/1611.01838


