Partitioned integrators for thermodynamic parameterization of neural networks

08/30/2019
by   Benedict Leimkuhler, et al.

Stochastic Gradient Langevin Dynamics, the "unadjusted Langevin algorithm", and Adaptive Langevin Dynamics (also known as Stochastic Gradient Nosé-Hoover dynamics) are examples of existing thermodynamic parameterization methods in use for machine learning, but these can be substantially improved. We find that by partitioning the parameters based on the natural layer structure of the network we obtain schemes with rapid convergence for data sets with complicated loss landscapes. We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including LaLa (a multi-layer Langevin algorithm), AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods in numerical studies and compare their performance both with one another and with standard alternatives such as stochastic gradient descent (SGD) and Adam. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms incorporated into machine learning frameworks, in particular for data sets with complicated loss landscapes. Moreover, our numerical studies show that sampling-based methods excite many degrees of freedom; the equipartition property, a consequence of their ergodicity, means that these methods keep an ensemble of low-loss states in play during training. By drawing parameter states from a sufficiently rich distribution of nearby candidate states, the thermodynamic schemes produce smoother classifiers, improve generalization and reduce overfitting compared to traditional optimizers.
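The partitioned schemes assign different stochastic dynamics to different blocks of parameters, most naturally layer by layer. As a rough illustration only (not the paper's exact integrators), the sketch below applies an underdamped Langevin update to the hidden-layer parameters of a tiny feed-forward classifier and an overdamped, SGLD-style update to its output layer; the toy data, network sizes and hyperparameters (step size h_step, friction gamma, temperature tau) are all illustrative assumptions.

```python
# Minimal sketch of a layerwise-partitioned Langevin / overdamped-Langevin scheme.
# Everything here (data, architecture, hyperparameters, discretization) is illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data with an XOR-like decision boundary.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# Two-layer network, partitioned by layer; only layer 1 carries momenta.
W1, b1 = rng.normal(scale=0.5, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
p1 = [np.zeros_like(W1), np.zeros_like(b1)]

def forward(Xb, W1, b1, W2, b2):
    h = np.tanh(Xb @ W1 + b1)
    return h, 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # hidden activations, sigmoid output

def grads(Xb, yb, W1, b1, W2, b2):
    """Minibatch gradients of the binary cross-entropy loss."""
    h, out = forward(Xb, W1, b1, W2, b2)
    d_out = (out - yb[:, None]) / len(Xb)
    gW2, gb2 = h.T @ d_out, d_out.sum(0)
    d_h = (d_out @ W2.T) * (1.0 - h**2)
    gW1, gb1 = Xb.T @ d_h, d_h.sum(0)
    return [gW1, gb1], [gW2, gb2]

h_step, gamma, tau = 0.05, 1.0, 1e-4
a = np.exp(-gamma * h_step)                          # Ornstein-Uhlenbeck damping factor

for step in range(2000):
    idx = rng.choice(len(X), 32, replace=False)      # stochastic (minibatch) gradient
    g1, g2 = grads(X[idx], y[idx], W1, b1, W2, b2)

    # Layer 1: underdamped Langevin step (damped momenta plus thermal noise).
    theta1 = [W1, b1]
    for k in range(2):
        p1[k] = a * p1[k] - h_step * g1[k] \
                + np.sqrt((1 - a**2) * tau) * rng.normal(size=p1[k].shape)
        theta1[k] += h_step * p1[k]

    # Layer 2: overdamped Langevin (SGLD-style) step, no momentum.
    theta2 = [W2, b2]
    for k in range(2):
        theta2[k] += -h_step * g2[k] \
                     + np.sqrt(2 * h_step * tau) * rng.normal(size=theta2[k].shape)
```

In this toy setting the temperature tau controls how rich the ensemble of nearby low-loss states is: setting tau = 0 reduces the first-layer update to heavy-ball momentum SGD and the second-layer update to plain SGD.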


