Stochastic Weight Averaging Revisited

01/03/2022
by Hao Guo, et al.

Stochastic weight averaging (SWA) is recognized as a simple yet effective approach for improving the generalization of stochastic gradient descent (SGD) when training deep neural networks (DNNs). A common explanation for its success is that averaging weights along an SGD trajectory run with a cyclical or high constant learning rate discovers wider optima, which in turn lead to better generalization. We offer a new insight that does not concur with this view. We show that SWA's performance depends strongly on how far the SGD process that precedes it has converged, and that the weight-averaging operation itself contributes only to variance reduction. This new insight suggests practical guidelines for better algorithm design. As an instantiation, we show that when the preceding SGD process has not sufficiently converged, running SWA repeatedly yields continual incremental gains in generalization. Our findings are corroborated by extensive experiments across different network architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10, VGG16, ResNet-50, ResNet-152, and DenseNet-161, and across different datasets, including CIFAR-10, CIFAR-100, and ImageNet.
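For reference, the sketch below illustrates the standard SWA procedure that the paper revisits: run SGD with a high constant learning rate and maintain a running average of the visited weights. It uses PyTorch's torch.optim.swa_utils; the toy model, data, and hyper-parameters (swa_lr=0.05, swa_start=75) are illustrative assumptions, not the paper's experimental setup.

# Minimal SWA sketch with PyTorch's torch.optim.swa_utils.
# The toy model/data and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Stand-in for, e.g., PreResNet-164 on CIFAR-10.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
data = [(torch.randn(128, 32), torch.randint(0, 10, (128,))) for _ in range(50)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

swa_model = AveragedModel(model)               # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # high constant learning rate for the SWA phase

swa_start, epochs = 75, 100                    # assumed schedule
for epoch in range(epochs):
    for x, y in data:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # average in the current SGD iterate
        swa_scheduler.step()

# Recompute BatchNorm statistics for the averaged weights before evaluation
# (a no-op here, since the toy model has no BatchNorm layers).
update_bn(data, swa_model)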

