Natural Langevin Dynamics for Neural Networks

by   Gaétan Marceau-Caron, et al.

One way to avoid overfitting in machine learning is to use model parameters distributed according to a Bayesian posterior given the data, rather than the maximum likelihood estimator. Stochastic gradient Langevin dynamics (SGLD) is one algorithm to approximate such Bayesian posteriors for large models and datasets. SGLD is a standard stochastic gradient descent to which is added a controlled amount of noise, specifically scaled so that the parameter converges in law to the posterior distribution [WT11, TTV16]. The posterior predictive distribution can be approximated by an ensemble of samples from the trajectory. Choice of the variance of the noise is known to impact the practical behavior of SGLD: for instance, noise should be smaller for sensitive parameter directions. Theoretically, it has been suggested to use the inverse Fisher information matrix of the model as the variance of the noise, since it is also the variance of the Bayesian posterior [PT13, AKW12, GC11]. But the Fisher matrix is costly to compute for large- dimensional models. Here we use the easily computed Fisher matrix approximations for deep neural networks from [MO16, Oll15]. The resulting natural Langevin dynamics combines the advantages of Amari's natural gradient descent and Fisher-preconditioned Langevin dynamics for large neural networks. Small-scale experiments on MNIST show that Fisher matrix preconditioning brings SGLD close to dropout as a regularizing technique.


page 1

page 2

page 3

page 4


True Asymptotic Natural Gradient Optimization

We introduce a simple algorithm, True Asymptotic Natural Gradient Optimi...

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

In this paper we address the following question: Can we approximately sa...

Stochastic natural gradient descent draws posterior samples in function space

Natural gradient descent (NGD) minimises the cost function on a Riemanni...

TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

This work proposes a time-efficient Natural Gradient Descent method, cal...

Machine learning in and out of equilibrium

The algorithms used to train neural networks, like stochastic gradient d...

Optimal friction matrix for underdamped Langevin sampling

A systematic procedure for optimising the friction coefficient in underd...

Explicit natural gradient updates for Cholesky factor in Gaussian variational approximation

Stochastic gradient methods have enabled variational inference for high-...

Please sign up or login with your details

Forgot password? Click here to reset