Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

09/29/2022
by Jean Kaddour, et al.

Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and to facilitate research on reusing historical weights for faster convergence.
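The idea described in the abstract is simple: keep the k most recent end-of-epoch checkpoints and average their weights element-wise, then evaluate the averaged model. As a rough illustration only, the sketch below shows this in PyTorch; the toy linear model, random data, k=6, and the training loop are assumptions for the sake of a runnable example, not the authors' released code (the paper links the official code and checkpoint trajectory).

```python
import collections
import copy

import torch
import torch.nn as nn

# Illustrative stand-ins: the paper trains ResNet50 / RoBERTa-Base,
# here a toy linear model on random data keeps the sketch self-contained.
k = 6                      # number of most recent checkpoints to average (assumed)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def average_state_dicts(state_dicts):
    """Element-wise average of a list of model state dicts."""
    return {key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}


# Rolling buffer holding the k latest end-of-epoch checkpoints.
recent = collections.deque(maxlen=k)

for epoch in range(20):
    # One toy "epoch" of training on random data.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Store a CPU copy of the end-of-epoch weights.
    recent.append({name: param.detach().cpu().clone()
                   for name, param in model.state_dict().items()})

    # Evaluate the averaged weights without disturbing the live training weights.
    avg_model = copy.deepcopy(model)
    avg_model.load_state_dict(average_state_dicts(list(recent)))
```

The averaged model is used only for evaluation here; training continues from the unaveraged weights, which matches the checkpoint-averaging idea the abstract describes.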


Related research

01/07/2020
Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well
We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm ...

03/12/2019
A Distributed Hierarchical SGD Algorithm with Sparse Global Reduction
Reducing communication overhead is a big challenge for large-scale distr...

04/06/2023
PopulAtion Parameter Averaging (PAPA)
Ensemble methods combine the predictions of multiple models to improve p...

09/14/2017
ImageNet Training in Minutes
Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 G...

05/19/2022
Diverse Weight Averaging for Out-of-Distribution Generalization
Standard neural networks struggle to generalize under distribution shift...

10/08/2021
Speeding up Deep Model Training by Sharing Weights and Then Unsharing
We propose a simple and efficient approach for training the BERT model. ...

11/18/2021
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
Despite their success, modern language models are fragile. Even small ch...
