Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

03/10/2022
by Mitchell Wortsman, et al.

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs – we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically.
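As a minimal illustration of the weight-averaging recipe described in the abstract, the sketch below builds a uniform soup by averaging several fine-tuned checkpoints into a single set of weights. It assumes PyTorch-style state dicts from models fine-tuned off the same pre-trained initialization (so all keys and shapes match); the checkpoint paths and the `model` variable in the usage comments are hypothetical.

```python
# Minimal sketch of a "model soup": element-wise averaging of the weights
# of several fine-tuned checkpoints. Assumes PyTorch state dicts that share
# the same architecture and pre-trained initialization.
import torch

def uniform_soup(state_dicts):
    """Return the element-wise average of a list of state dicts with identical keys."""
    keys = state_dicts[0].keys()
    return {
        k: sum(sd[k].float() for sd in state_dicts) / len(state_dicts)
        for k in keys
    }

# Hypothetical usage with checkpoints from a hyperparameter sweep:
# paths = ["ft_lr1e-5.pt", "ft_lr3e-5.pt", "ft_wd0.1.pt"]
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup)  # one model at inference time, no extra cost
```

Because the result is a single set of weights, inference and memory costs are identical to those of any one fine-tuned model, in contrast to a logit ensemble that must run every member.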

Related research

09/04/2021 · Robust fine-tuning of zero-shot models
Large pre-trained models such as CLIP offer consistent accuracy across a...

01/19/2022 · Enhanced Performance of Pre-Trained Networks by Matched Augmentation Distributions
There exists a distribution discrepancy between training and testing, in...

12/08/2019 · Individual predictions matter: Assessing the effect of data ordering in training fine-tuned CNNs for medical imaging
We reproduced the results of CheXNet with fixed hyperparameters and 50 d...

11/29/2022 · Context-Aware Robust Fine-Tuning
Contrastive Language-Image Pre-trained (CLIP) models have zero-shot abil...

04/06/2023 · PopulAtion Parameter Averaging (PAPA)
Ensemble methods combine the predictions of multiple models to improve p...

09/14/2020 · Can Fine-tuning Pre-trained Models Lead to Perfect NLP? A Study of the Generalizability of Relation Extraction
Fine-tuning pre-trained models have achieved impressive performance on s...

06/07/2023 · Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets an...
