Iterate Averaging Helps: An Alternative Perspective in Deep Learning

03/02/2020
by   Diego Granziol, et al.

Iterate averaging has a rich history in optimisation, but has only very recently been popularised in deep learning. We investigate its effects in a deep learning context and argue that previous explanations of its efficacy, which place great importance on the local geometry (flatness vs. sharpness) of the final solutions, are not necessarily relevant. We argue instead that the key reasons for the performance gain are the robustness of iterate averaging to the typically very high estimation noise in deep learning and the various regularisation effects that averaging exerts; this effect is made even more prominent by the over-parameterisation of modern networks. Inspired by this, we propose Gadam, which combines Adam with iterate averaging to address one of the key problems of adaptive optimisers: that they often generalise worse. Without compromising adaptivity and with minimal additional computational burden, we show that Gadam (and its variant GadamX) achieve generalisation performance that is consistently superior to tuned SGD and on par with or better than SGD with iterate averaging on various image classification (CIFAR-10/100 and ImageNet 32×32) and language tasks (PTB).
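
For illustration, below is a minimal sketch of iterate averaging layered on top of Adam using PyTorch's SWA utilities. This is not the authors' Gadam implementation; the toy data, the model, the data loader, and the averaging start epoch are placeholder assumptions chosen only to make the example self-contained.

    import torch
    from torch import nn
    from torch.optim.swa_utils import AveragedModel, update_bn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy data and model stand in for a real task (assumptions for illustration only).
    data = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
    loader = DataLoader(data, batch_size=64, shuffle=True)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    averaged_model = AveragedModel(model)   # maintains a running average of the weights
    criterion = nn.CrossEntropyLoss()
    avg_start_epoch = 10                    # start averaging after a warm-up phase (assumed hyperparameter)

    for epoch in range(30):
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        if epoch >= avg_start_epoch:
            averaged_model.update_parameters(model)   # accumulate the iterate average

    # Refresh BatchNorm statistics for the averaged weights (a no-op here, since this toy model has no BN layers).
    update_bn(loader, averaged_model)
    # Evaluate averaged_model rather than model at test time.

The key design point the abstract describes is that adaptivity is retained: Adam performs the per-step updates as usual, and the averaged copy is only updated once per epoch after the warm-up, adding negligible computational overhead.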
