PopulAtion Parameter Averaging (PAPA)

Ensemble methods combine the predictions of multiple models to improve performance, but they incur significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights (model soups). However, this usually performs significantly worse than ensembling. Weight averaging is only beneficial when the weights are similar enough (in weight or feature space) to average well, yet different enough to benefit from combining them. Based on this idea, we propose PopulAtion Parameter Averaging (PAPA): a method that combines the generality of ensembling with the efficiency of weight averaging. PAPA leverages a population of diverse models (trained on different data orders, augmentations, and regularizations) while occasionally (not too often, not too rarely) replacing the weights of the networks with the population average of the weights. PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 1.1% on CIFAR-10 and 2.4% on CIFAR-100 compared to training independent (non-averaged) models.
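
A minimal sketch of the training loop this describes, assuming a PyTorch setup: a population of models is trained on differently ordered/augmented data, and every few epochs each member's weights are replaced by the element-wise population average. The model constructor, data loaders, and averaging frequency below are illustrative placeholders, not the paper's exact implementation.

```python
# Hypothetical sketch of population parameter averaging, not the authors' code.
import torch
import torch.nn as nn


def average_population_(models):
    """Replace every model's parameters with the element-wise population mean."""
    with torch.no_grad():
        # Iterate over corresponding parameters across all population members.
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.mean(torch.stack([p.data for p in params]), dim=0)
            for p in params:
                p.data.copy_(mean)


def train_population(make_model, loaders, epochs=100, average_every=10, lr=0.1):
    """Train one model per data loader (each loader uses its own data order /
    augmentations), occasionally resetting all members to the weight average."""
    models = [make_model() for _ in loaders]
    optimizers = [torch.optim.SGD(m.parameters(), lr=lr) for m in models]
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for model, optimizer, loader in zip(models, optimizers, loaders):
            model.train()
            for x, y in loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()

        # "Not too often, not too rarely": periodically collapse the population
        # onto its weight average, then keep training the diverse copies.
        if (epoch + 1) % average_every == 0:
            average_population_(models)

    return models
```

The averaging frequency is the key knob implied by the abstract: averaging too often removes the diversity that makes combining worthwhile, while averaging too rarely lets the weights drift too far apart to average well.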
