
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

03/10/2022
by   Mitchell Wortsman, et al.

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs – we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically.


1 Introduction

In recent years, research has shown that models pre-trained on large and diverse datasets learn representations that transfer well to a variety of tasks. As a result, machine learning practitioners now commonly develop solutions for downstream tasks by fine-tuning large pre-trained models

[girshick2014rich, yosinski2014transferable, kornblith2019better, kolesnikov2020big]. Typically, the fine-tuning process involves two steps: (1) fine-tune models with a variety of hyperparameter configurations, and (2) select the model which achieves the highest accuracy on the held-out validation set. The remaining models are then discarded.

Method | ImageNet acc. (top-1, %) | Distribution shifts
ViT-G [zhai2021scaling] | 90.45 |
CoAtNet-7 [dai2021coatnet] | 90.88 |
Our models/evaluations based on ViT-G:
ViT-G (reevaluated) | 90.47 | 82.06
Best model in hyperparam search | 90.78 | 84.68
Greedy soup | 90.94 | 85.02
Table 1: Model soups improve accuracy over the best individual model when fine-tuning a JFT-3B pre-trained ViT-G/14 model [zhai2021scaling] on ImageNet. Instead of selecting the best model from a hyperparameter sweep during fine-tuning, model soups average the weights of multiple fine-tuned models. To evaluate model performance under distribution shift we consider average accuracy on ImageNet-V2 [pmlr-v97-recht19a], ImageNet-R [imagenetr], ImageNet-Sketch [imagenetsketch], ObjectNet [objectnet], and ImageNet-A [imageneta]. Additional results and details are provided by Table 4 and Section 3.3.2.
Figure 1: Model soups improve accuracy over the best individual model when performing a large, random hyperparameter search for fine-tuning a CLIP ViT-B/32 model on ImageNet. The uniform soup (blue circle) averages all fine-tuned models (green diamonds) in a random hyperparameter search over learning rate, weight-decay, iterations, data augmentation, mixup, and label smoothing. The greedy soup adds models sequentially to the model soup, keeping a model in the soup if accuracy on the held-out validation set does not decrease.

Selecting a single model and discarding the rest has several downsides. For one, the selected model may not achieve the best performance. In particular, ensembling the outputs of many models can outperform the best single model, albeit at a high computational cost during inference. For another, fine-tuning a model on downstream tasks can sometimes reduce out-of-distribution performance [radford2021learning, andreassen2021evolution, wortsman2021robust, pham2021scaling].

In this work, we propose a more accurate and robust alternative to the second step of the conventional recipe in the context of fine-tuning a large pre-trained model. Instead of selecting the individual fine-tuned model which achieves the highest accuracy on the held-out validation set, we average the weights of models fine-tuned independently, and refer to the result as a model soup. Given the results of the first step—a hyperparameter sweep over fine-tuned models—averaging several of these models to form a model soup requires no additional training and adds no cost at inference time.

Since neural networks are non-linear with potentially many solutions in different loss basins, it is perhaps surprising that averaging the weights of independently fine-tuned models achieves high performance. However, recent work [neyshabur2020being] observes that fine-tuned models optimized independently from the same initialization lie in the same basin of the error landscape, inspiring our method. Weight averaging along a single training trajectory has previously been shown to improve the performance of models trained from random initialization [szegedy2016rethinking, izmailov2018averaging]. Our approach extends weight averaging to the context of fine-tuning, where we find that it also works across many independent runs.


We perform a comprehensive experimental study of fine-tuning to understand the behavior of model soups. For our main results we fine-tune CLIP [radford2021learning] and ALIGN [jia2021scaling], which are pre-trained with a contrastive loss on image-text pairs, and a ViT-G model pre-trained on JFT [zhai2021scaling]. Our results show that model soups often outperform the best individual model on both the in-distribution and natural distribution shift test sets (Table 1, Figure 1, Figure 4). A model soup composed of ViT-G models achieves 90.94% on ImageNet [deng2009imagenet], surpassing the previous state of the art of 90.88% obtained by the CoAtNet model [dai2021coatnet] while requiring 25% fewer FLOPs at inference time. In general, model soups can approach the performance of ensembling, with no additional computational cost or memory relative to a single model during inference. Beyond ImageNet and associated distribution shifts, our results show that model soups are applicable when fine-tuning on tasks from the WILDS [wilds2021] benchmark, and when fine-tuning transformer models [vaswani2017attention, devlin-etal-2019-bert, raffel2020t5] for text classification.

While the most straightforward approach to making a model soup is to average all the weights uniformly, we find that greedy soups, where models are sequentially added to the soup if they improve accuracy on held-out data, outperform uniform averaging. Greedy soups avoid adding models which may lie in a different basin of the error landscape, which could happen if, for example, models are fine-tuned with high learning rates.

In addition to empirical observations, we analytically relate the similarity in loss between weight-averaging and logit-ensembling to the flatness of the loss (i.e., its second derivative on a line between models) and the confidence of the predictions (expressed via the variance, under the weight-averaged model's softmax distribution, of the difference between the endpoint logits). We empirically validate our approximation on a subset of the models we train and show that it is strongly correlated with the true averaging vs. ensembling performance difference, particularly in the learning rate regimes where soups are effective and models achieve higher accuracy.

Paper outline. Our method of model soups is presented and evaluated in Sections 2 and 3, respectively. Next, Section 4 includes our analysis relating model soups and ensembles, Section 5 details the scope and limitations of the proposed method, and Section 6 contextualizes model soups by reviewing related work.

2 Method

Method | Inference cost (relative to a single model) | Reference
Best on val. set | same as a single model |
Ensemble | one forward pass per model |
Uniform soup | same as a single model |
Greedy soup | same as a single model | Recipe 1
Learned soup | same as a single model | Appendix B.1
Table 2: The primary methods contrasted in this work. Each method selects or combines models fine-tuned from a shared initialization. Cost refers to the memory and compute requirements during inference relative to a single model. All methods require the same training.

This section highlights three recipes for model souping, the uniform, greedy, and learned soup, though the greedy soup is our central method. We summarize the methods described in this section in Table 2.

We consider a neural network f(x; θ) with input data x and parameters θ. Fine-tuning is analogous to standard neural network training but includes an important distinction: the parameters are initialized to those found via pre-training.

Let θ = FineTune(θ_0, h) denote the parameters obtained by fine-tuning with pre-trained initialization θ_0 and hyperparameter configuration h. The hyperparameter configuration h can include the choice of optimizer, data augmentation, training iterations, and a random seed which will determine data order.

For hyperparameter configurations h_1, ..., h_k, let θ_i = FineTune(θ_0, h_i). Conventionally, the parameters θ_j which attain the highest accuracy on a held-out validation set are selected, and the remaining parameters are discarded. Instead, model soups use an average of a subset S of the fine-tuned parameters, i.e., θ_soup = (1/|S|) Σ_{i∈S} θ_i, where S ⊆ {1, ..., k}. The uniform soup is constructed by averaging all fine-tuned models, so S = {1, ..., k}.
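For concreteness, a minimal Python sketch of the uniform soup follows, assuming each fine-tuned model is available as a PyTorch state dict with identical keys; the function name and structure are illustrative and not taken from any released implementation.

# Minimal sketch: a uniform soup is an element-wise average of fine-tuned
# checkpoints that share an architecture and pre-trained initialization.
# Assumes each checkpoint is a PyTorch state_dict with identical keys.
import torch

def uniform_soup(state_dicts):
    """Average the floating-point entries of k state dicts with equal weight 1/k."""
    k = len(state_dicts)
    soup = {}
    for key, value in state_dicts[0].items():
        if value.is_floating_point():
            soup[key] = sum(sd[key] for sd in state_dicts) / k
        else:
            # Integer buffers (e.g., step counters) are copied, not averaged.
            soup[key] = value.clone()
    return soup

# Usage: average all checkpoints from a hyperparameter sweep, then load as usual.
# soup_sd = uniform_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup_sd)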

There are settings in which a hyperparameter configuration can produce a model with low accuracy that results in a low accuracy uniform soup. This issue can be circumvented with a greedy soup (Recipe 1). The greedy soup is constructed by sequentially adding each model as a potential ingredient in the soup, and only keeping the model in the soup if performance on a held-out validation set (disjoint from the training and test sets) improves. Before running this procedure we sort the models in decreasing order of validation set accuracy, and so the greedy soup can be no worse than the best individual model on the held-out validation set. We also explore a more advanced learned soup recipe that optimizes model interpolation weights by gradient-based minibatch optimization (see Appendix B.1 for details). This procedure requires simultaneously loading all models in memory, which currently hinders its use with large networks.

  Input: Potential soup ingredients {θ_1, ..., θ_k} (optionally sorted in decreasing order of ValAcc(θ_i)).
  ingredients ← {}
  for i = 1 to k do
     if ValAcc(average(ingredients ∪ {θ_i})) ≥ ValAcc(average(ingredients)) then
        ingredients ← ingredients ∪ {θ_i}
  return average(ingredients)
Recipe 1: Greedy soup.
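A Python sketch of Recipe 1 is given below, reusing uniform_soup from the sketch above; evaluate is a hypothetical helper that returns held-out validation accuracy for a given state dict, and is an assumption of this sketch rather than part of the paper.

# Sketch of Recipe 1 (greedy soup). `evaluate(state_dict) -> float` is a
# hypothetical helper returning accuracy on the held-out validation set.
def greedy_soup(state_dicts, evaluate):
    # Sort candidates by individual held-out validation accuracy, best first.
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_acc = evaluate(uniform_soup(ingredients))
    for candidate in ranked[1:]:
        trial_acc = evaluate(uniform_soup(ingredients + [candidate]))
        if trial_acc >= best_acc:
            # Keep the ingredient only if held-out accuracy does not decrease.
            ingredients.append(candidate)
            best_acc = trial_acc
    return uniform_soup(ingredients)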

3 Experiments

This section presents our key experimental findings. We begin with experimental setup (Section 3.1) and provide intuition for model soups by examining error landscape visualizations (Section 3.2). Next we present our main results (Section 3.3), using model soups as an alternative to selecting the best performing individual model. Finally, we explore model soups in the context of robust fine-tuning (Section 3.4), and examine model soups constructed by fine-tuning on different datasets (Section 3.5).

3.1 Experimental setup

Our experiments explore the application of model soups when fine-tuning various models. The primary models we fine-tune are the CLIP [radford2021learning] and ALIGN [jia2021scaling] models pre-trained with contrastive supervision from image-text pairs, a ViT-G/14 model pre-trained on JFT-3B [zhai2021scaling], and transformer models for text classification [devlin-etal-2019-bert, colin2020exploring]. Unless otherwise mentioned, experiments use the CLIP ViT-B/32 model. Fine-tuning is performed end-to-end (all parameters are modified) which often results in better accuracy than training only the final linear layer [kornblith2019better, agrawal2014analyzing, chatfield2014return, azizpour2015generic].

We consider two different methods for initializing the final linear layer before fine-tuning. The first method initializes the model from a linear probe (LP), as described in kumar2021finetuning, and we refer to this method as LP initialization. The second method uses the zero-shot initialization, e.g., using the classifier produced by the text tower of CLIP or ALIGN as the initialization. Both methods for initializing the model produce similar trends when applicable, and unless otherwise stated we use the LP initialization.

For the ensemble baselines [dietterich2000ensemble, deepensembles] we ensemble the logits (unnormalized outputs) of models as in gontijo2021no. Fine-tuning uses a supervised cross-entropy loss and, unless otherwise mentioned, is conducted on ImageNet [deng2009imagenet]. When fine-tuning on ImageNet we also evaluate on the five natural distribution shifts: ImageNet-V2 [pmlr-v97-recht19a], ImageNet-R [imagenetr], ImageNet-Sketch [imagenetsketch], ObjectNet [objectnet], and ImageNet-A [imageneta]. We often report results averaged over these five distribution shifts. Since the official ImageNet validation set is used as the test set, we use roughly 2% of the ImageNet training set as a held-out validation set for constructing greedy soups.
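For reference, a minimal sketch of the logit-ensemble baseline described above; unlike a soup, it requires one forward pass per model at inference time. The helper name is ours.

# Logit ensemble: average the unnormalized outputs of several fine-tuned
# models, then predict the argmax. Every model must be run on each input.
import torch

@torch.no_grad()
def ensemble_logits(models, images):
    stacked = torch.stack([model(images) for model in models], dim=0)  # (k, batch, classes)
    return stacked.mean(dim=0)

# predictions = ensemble_logits(models, images).argmax(dim=-1)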

3.2 Intuition with error landscape visualizations

Figure 2: The solution with the highest accuracy is often not a fine-tuned model but rather lies between fine-tuned models. This figure shows loss and error on a two-dimensional slice of the loss and error landscapes. We use the zero-shot initialization θ_0 and fine-tune twice (illustrated by the gray arrows), independently, to obtain solutions θ_1 and θ_2. As in garipov2018loss, we obtain an orthonormal basis for the plane spanned by these three models, and the x- and y-axes show movement in parameter space in these directions, respectively.

To provide intuition, we visualize a two-dimensional slice of the training loss and test error landscape when fine-tuning CLIP on ImageNet. In these experiments, we use the zero-shot initialization θ_0 and fine-tune twice, independently, to produce solutions θ_1 and θ_2. The points θ_0, θ_1, and θ_2 define a plane in parameter space, and we evaluate the ImageNet train loss, ImageNet test error, and the test error on the five aforementioned distribution shifts on this plane. The results are illustrated in Figure 2, where the zero-shot initialization θ_0 is shown as a star and a solution θ_1 fine-tuned with a fixed learning rate is shown as a blue square. For θ_2 we either use the same learning rate as θ_1 (but vary the random seed) or a 10x smaller learning rate. For both the in-distribution and out-of-distribution test sets, the loss/error contours are basin-shaped, and none of the three points is optimal.

These results suggest that (1) interpolating the weights of two fine-tuned solutions can improve accuracy compared to individual models and (2) more uncorrelated solutions—models that form an angle (in particular, the angle between θ_1 − θ_0 and θ_2 − θ_0, i.e., the angle between the arrows shown in Figure 2) closer to 90 degrees—may lead to higher accuracy on the linear interpolation path.

Figure 3: The advantage of averaging solutions (y-axis) is correlated with the angle between solutions, and varying hyperparameter configurations between pairs enables a larger angle. Each point corresponds to a pair of models that are fine-tuned independently from a shared initialization with different hyperparameter configurations. The angle between solutions refers to the angle between θ_1 − θ_0 and θ_2 − θ_0 (i.e., the initialization is treated as the origin). Accuracy is averaged over ImageNet and the five distribution shifts described in Section 3.1.
Figure 4: Model soups improve accuracy when fine-tuning ALIGN.

To investigate the correlation between accuracy improvement and angle, we consider a series of models trained with different seeds, learning rates, and data augmentation. For each pair (θ_1, θ_2), we compare the accuracy of their average, Acc((θ_1 + θ_2)/2), with the average of their accuracies, (Acc(θ_1) + Acc(θ_2))/2; we refer to the difference as the interpolation advantage. Figure 3 illustrates the results, in which we observe that the interpolation advantage is correlated with the angle and that varying the learning rate, seed, or data augmentation can produce solutions which are more orthogonal. Experimental details and a discussion of high learning rates are provided in Appendix B.2.
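A simple way to compute this angle (treating the shared initialization as the origin) is sketched below. This version flattens all floating-point parameters, whereas the computation described in Appendix B.2 restricts attention to the repeated weights and ignores gain terms; the function names are ours.

# Sketch: angle between two fine-tuned solutions, with the shared
# initialization theta_0 treated as the origin. All inputs are state dicts.
import math
import torch

def flatten_delta(theta, theta_0):
    return torch.cat([(theta[k] - theta_0[k]).flatten()
                      for k in theta_0 if theta_0[k].is_floating_point()])

def angle_degrees(theta_1, theta_2, theta_0):
    d1 = flatten_delta(theta_1, theta_0)
    d2 = flatten_delta(theta_2, theta_0)
    cosine = torch.dot(d1, d2) / (d1.norm() * d2.norm())
    return math.degrees(torch.acos(cosine.clamp(-1.0, 1.0)).item())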

Finally, in Appendix C we ask the question: for a one-dimensional grid of hyperparameters h_1, ..., h_m, how does averaging the two models fine-tuned with the endpoint configurations h_1 and h_m compare with picking the best individual model fine-tuned with any single configuration h_i from the grid? The hyperparameters we vary are optimizer, augmentation, and learning rate. For the vast majority of grid searches, the average of the endpoints outperforms the best individual model in the grid.

3.3 Model soups

With the gains of averaging two fine-tuned models in mind, we turn our attention to averaging many models with different hyperparameters: this section presents our main results, which show that averaging fine-tuned models can be used as an alternative to the conventional procedure of selecting the single model which performs best on the held-out validation set. We explore CLIP [radford2021learning] and ALIGN [jia2021scaling] fine-tuned on ImageNet [deng2009imagenet] and WILDS [wilds2021] (Section 3.3.1), ViT-G pre-trained on JFT-3B [zhai2021scaling] and fine-tuned on ImageNet (Section 3.3.2), and transformer models fine-tuned on text classification tasks (Section 3.3.3). Appendix E additionally explores CLIP ViT-L fine-tuned on CIFAR-10 and an ImageNet-22k-pretrained ViT-B/32 fine-tuned on ImageNet.

3.3.1 Fine-tuning CLIP and ALIGN

Figure 5: The greedy soup requires fewer models to reach the same accuracy as selecting the best individual model on the held-out validation set. On the x-axis we show the number of models considered in a random search over hyperparameters, while the y-axis displays the accuracy of various methods for model selection which are summarized in Table 2. All methods require the same amount of training and compute cost during inference, with the exception of the ensembles, which require a separate pass through each model. Results are for fine-tuning CLIP ViT-B/32, averaged over three random orders (shown with faded lines).

We begin our study of model soups by considering two pre-trained models, CLIP ViT-B/32 and ALIGN EfficientNet-L2, and performing a hyperparameter sweep for fine-tuning each model on ImageNet. For CLIP we use a random hyperparameter search over learning rate, weight decay, training epochs, label smoothing, and data augmentation, obtaining 72 fine-tuned models (details in Appendix B.3.1). For ALIGN we use a grid search over learning rate, data augmentation, and mixup, obtaining 12 fine-tuned models (details in Appendix B.3.2). To form our greedy soups, we sort models in order of decreasing accuracy on the held-out validation set before applying Recipe 1. For both CLIP and ALIGN, the greedy soup selects 5 models. Figures 1 and 4 show the performance of the resulting models and their uniform and greedy soups for CLIP and ALIGN. The greedy soup improves over the best model in the hyperparameter sweep by 0.7 and 0.5 percentage points, respectively.

Method | ImageNet acc. | Distribution shifts acc.
Best individual model on ImageNet | 80.38 | 47.83
Second best individual model on ImageNet | 79.89 | 43.87
Uniform soup | 79.97 | 51.45
Greedy soup (decreasing order of held-out val accuracy) | 81.03 | 50.75
Greedy soup (random order) | |
Learned soup | 80.89 | 51.07
Learned soup (by layer) | 81.37 | 50.87
Ensemble | 81.19 | 50.77
Greedy ensemble | 81.90 | 49.44
Table 3: Ablation on multiple methods from Table 2 and their variants when fine-tuning CLIP ViT-B/32 with the random hyperparameter search described in Section 3.3.1. For "Greedy soup (random order)", we try three random orders and report mean and standard deviation. The "Learned soup" and its variants are described in Appendix B.1.
Figure 6: Model soups improve accuracy when fine-tuning on the diverse classification tasks WILDS-FMoW [wilds2021, christie2018functional] and WILDS-iWildCam [wilds2021, beery2021iwildcam]. Results are shown for the CLIP ViT-L/14 model and a random hyperparameter search over learning rate, weight-decay, iterations, data augmentation, mixup, and label smoothing.

Furthermore, we show that the greedy soup requires fewer models to reach the same accuracy as selecting the best individual model on the held-out validation set. We consider an additional setting where we prepare a sequence of soups by sequentially adding CLIP models from the hyperparameter sweep in random order. Figure 5 shows the performance of the uniform and greedy soup, as well as the best single model so far and a logit ensemble, as a function of the number of models considered. For essentially any number of models, the greedy soup outperforms the best single model on both the ImageNet and out-of-distribution test sets; the greedy soup is better than the uniform soup on ImageNet and comparable to it out-of-distribution. The logit ensemble is better than the greedy soup on ImageNet, but worse out-of-distribution.

Table 3 lists the performance of the CLIP soups and baselines described above, as well as additional soup variants described in Appendix B.1.

To further establish the generality of the model soup, we replicate the CLIP hyperparameter sweep experiment on two image classification tasks from WILDS [wilds2021], namely FMoW [christie2018functional] and iWildCam [beery2021iwildcam]; Figure 6 shows results qualitatively similar to our ImageNet experiment, and Appendix B.3.1 describes experimental details.

We report several additional variants and baselines for the experiment described above. In Appendix D we present results for different hyperparameter sweeps and fine-tuning initializations, when fine-tuning CLIP on ImageNet. For instance, we try a standard grid search which is similar to the grid search described for ALIGN above, and an extreme grid search which includes solutions fine-tuned with extreme hyperparameters that result in badly performing models (details in Appendix B.3.1). Moreover, Appendix G compares model soups with additional baselines, including distillation from an ensemble as in [hinton2014dark], models which apply weight-averaging along their trajectory, and sharpness aware minimization [foret2021sharpnessaware].

We highlight a few interesting takeaways from these experiments: (1) The greedy soup outperforms the best individual model—with no extra training and no extra compute during inference, we were able to produce a better model. (2) While the uniform soup can outperform the best individual model, we only observe this when all individual models achieve high accuracy (e.g., when fine-tuning ALIGN in Figure 4); unlike the examples in Figure 2, there can be an error barrier between fine-tuned models. We mainly observe this when fine-tuning with high learning rates (this is illustrated in Appendix B.2, Figure B.1). However, these high learning rate models also have a lower accuracy, and are therefore excluded by the greedy soup.

3.3.2 Fine-tuning a ViT-G model pre-trained on JFT-3B

To test whether the gains obtained by model soups are additive with other techniques used to obtain state-of-the-art models, we applied our greedy soup technique to 58 ViT-G/14 models fine-tuned on ImageNet. We vary the learning rate, decay schedule, loss function, and minimum crop size in the data augmentation, and optionally apply RandAugment [cubuk2020randaugment], mixup [zhang2017mixup], or CutMix [yun2019cutmix]. We also train four models with sharpness-aware minimization (SAM) [foret2021sharpnessaware]. For further details of our hyperparameter sweep, see Appendix B.3.3. For each model training run, we save exponential moving averages (EMA) of the weights [szegedy2016rethinking] computed with decay factors of 0.999 (low EMA) and 0.9999999 (high EMA). Whereas high EMA generally provides the best accuracy over the course of an individual training run, both greedy soup and greedy ensembling obtain higher validation accuracy when applied to parameters with low EMA. We report the highest single model accuracy numbers obtained with either EMA decay value, but perform greedy soup and ensembling with models trained with EMA decay of 0.999. For each combination of training run and EMA decay rate, we evaluate accuracy on our held out validation set every 1000 steps. We use these accuracy values to pick the best checkpoint for ensembling, souping, and subsequent evaluation.
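An exponential moving average of weights of the kind referred to above can be maintained with a few lines of code; the sketch below is a generic EMA implementation (the class name and structure are ours), not the training code used for these experiments.

# Generic EMA of model weights: shadow <- decay * shadow + (1 - decay) * current.
# Decay 0.999 ("low EMA") tracks training closely; 0.9999999 ("high EMA") barely moves.
import torch

class WeightEMA:
    def __init__(self, model, decay):
        self.decay = decay
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items() if v.is_floating_point()}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)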

In Table 4, we report results on the ImageNet validation set and the five distribution shift datasets studied above as well as two relabeled ImageNet validation sets, ReaL [beyer2020we] and multilabel [shankar2020evaluating]. Our greedy soup procedure selects 14 of the 58 models fine-tuned as part of our hyperparameter sweep, and this soup performs statistically significantly better than the best individually fine-tuned model selected based on our held out validation set on all datasets except for ObjectNet. Even when we give an unfair advantage to individually fine-tuned models by selecting them based on their performance on each test set (denoted “oracle” in Table 4), the greedy soup, which was selected using only in-distribution data, remains superior on most datasets. Only on ReaL and ObjectNet does there exist an individual model that performs statistically significantly better than the soup, and the best model differs between those two datasets. Greedy ensembling performs similarly to the greedy soup in terms of ImageNet top-1 and multilabel accuracy, and slightly better on ReaL, but significantly worse on all distribution shift datasets except for ImageNet-V2. Thus, greedy soup can provide additional gains on top of standard hyperparameter tuning even in the extremely high accuracy regime.

Method | ImageNet Top-1 | ReaL | Multilabel | IN-V2 | IN-R | IN-Sketch | ObjectNet | IN-A | Avg shifts
ViT-G/14 [zhai2021scaling] | 90.45 | 90.81 | | 83.33 | | | 70.53 | |
CoAtNet-7 [dai2021coatnet] | 90.88 | | | | | | | |
Our models/evaluations based on ViT-G/14:
ViT-G/14 [zhai2021scaling] (reevaluated) | 90.47 | 90.86 | 96.89 | 83.39 | 94.38 | 72.37 | 71.16 | 89.00 | 82.06
Best model on held out val set | 90.72 | 91.04 | 96.94 | 83.76 | 95.04 | 73.16 | 78.20 | 91.75 | 84.38
Best model on each test set (oracle) | 90.78 | 91.78 | 97.29 | 84.31 | 95.04 | 73.73 | 79.03 | 92.16 | 84.68
Greedy ensemble | 90.93 | 91.29 | 97.23 | 84.14 | 94.85 | 73.07 | 77.87 | 91.69 | 84.33
Greedy soup | 90.94 | 91.20 | 97.17 | 84.22 | 95.46 | 74.23 | 78.52 | 92.67 | 85.02
Table 4: Greedy soup improves over the best individual models obtained in a hyperparameter sweep for ViT-G/14 pre-trained on JFT-3B and fine-tuned on ImageNet, both in- and out-of-distribution. Accuracy numbers not significantly different from the best are bold-faced. Statistical comparisons are performed using an exact McNemar test or permutation test. Avg shift accuracy of the best model on each test set is the best average accuracy of any individual model.
Model | Method | MRPC | RTE | CoLA | SST-2
BERT [devlin2019bert] | Best individual model | 88.3 | 61.0 | 59.1 | 92.5
BERT [devlin2019bert] | Greedy soup | 88.3 (+0.0) | 61.7 (+0.7) | 59.1 (+0.0) | 93.0 (+0.5)
T5 [raffel2020t5] | Best individual model | 91.8 | 78.3 | 58.8 | 94.6
T5 [raffel2020t5] | Greedy soup | 92.4 (+0.6) | 79.1 (+0.8) | 60.2 (+0.4) | 94.7 (+0.1)
Table 5: Performance of model soups on four text classification datasets from the GLUE benchmark [wang2018glue].

3.3.3 Fine-tuning on text classification tasks

To test whether the gains obtained by model soups extend to domains beyond image classification, we conduct preliminary experiments with natural language processing (NLP). While more investigation is warranted to establish the applicability of model soups for NLP, we believe our experiments are a promising initial step. In particular, we fine-tune BERT [devlin2019bert] and T5 [raffel2020t5] models on four text classification tasks from the GLUE benchmark [wang2018glue]: MRPC [dolan2005automatically], RTE [dagan2005pascal, bar2006second, giampiccolo2007third, bentivogli2009fifth], CoLA [warstadt2018neural] and SST-2 [socher2013recursive], as in [dodge2020fine]. We use the standard metric for each dataset: the average of accuracy and F1 score for MRPC, accuracy for RTE, Matthews correlation for CoLA [matthews1975comparison], and accuracy for SST-2. More details are provided in Appendix H.

We fine-tune 32 models for each dataset with a random hyper-parameter search over learning rate, batch size, number of epochs and random seed. Table 5 reports the corresponding metric on the validation set for BERT-base uncased [devlin-etal-2019-bert] and T5-base [raffel2020t5]. Additional experimental details and results for more models are provided in Appendix I. While the improvements are not as pronounced as in image classification, the greedy soup can improve performance over the best individual model in many cases.

3.4 Robust fine-tuning

Figure 7: Model soups compared to baselines for robust fine-tuning. WiSE-FT [wortsman2021robust] improves the robustness of a model θ_1 fine-tuned from initialization θ_0 by interpolating between θ_0 and θ_1. Above we display the accuracy of models along these interpolation curves, both for regular fine-tuned models and model soups (left: random hyperparameter search using the LP initialization, right: grid search using the zero-shot initialization). The model soups lie beyond the WiSE-FT curves generated by any individual model, and accuracy can be improved on the distribution shifts by applying WiSE-FT to the model soups.

wortsman2021robust introduce WiSE-FT, a method for improving the robustness of a model θ_1 which is fine-tuned from initialization θ_0 by linearly interpolating θ_0 and θ_1. An intriguing observation was that, once the data augmentation is fixed, interpolating between θ_0 and θ_1 often traces a similar curve regardless of hyperparameters. (This is visible in Figure 7, right, where different data augmentations are shown with different colors; in Figure 7, left, there are many different methods of data augmentation as we conduct a random hyperparameter search.) In other words, a reasonable hypothesis was that this curve is Pareto optimal—no hyperparameter configuration would surpass it. In Figure 7, we trace the curves when interpolating between θ_0 and θ_1 for a random hyperparameter search (left) and the standard grid search described in Appendix B.3.1 (right) when fine-tuning CLIP ViT-B/32. We find that the uniform soup and greedy soup lie beyond these interpolation curves. Moreover, we find that interpolating between these soups and the initialization also provides additional accuracy improvements on the distribution shifts.
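The WiSE-FT curves above are traced by sweeping a mixing coefficient between the initialization and a fine-tuned solution (or a soup); a minimal sketch follows, with hypothetical helper names and state dicts as inputs.

# Sketch of WiSE-FT-style interpolation between the initialization theta_0
# and a fine-tuned solution (or soup) theta_1, both given as state dicts.
def interpolate(theta_0, theta_1, alpha):
    return {k: ((1 - alpha) * theta_0[k] + alpha * theta_1[k])
            if theta_0[k].is_floating_point() else theta_1[k].clone()
            for k in theta_0}

# Tracing a curve: evaluate the interpolated model for several values of alpha.
# curve = [evaluate(interpolate(theta_0, theta_soup, a)) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]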

3.5 Cross-dataset soups

So far, our experiments have studied soups of models fine-tuned on the same dataset with different hyperparameters. In this section, we prepare soups containing models fine-tuned on different datasets. We evaluate the resulting soups on a held-out dataset, from which no labeled training data is used (i.e., zero-shot evaluation).

Concretely, we consider soups based on the CLIP zero-shot initialization along with six models fine-tuned independently on CIFAR-10 [krizhevsky2009learning], Describable Textures [dtd], Food-101 [food101], SUN397 [sun397], Stanford Cars [cars], and ImageNet [deng2009imagenet]. We evaluate on CIFAR-100 [krizhevsky2009learning], which does not share classes with CIFAR-10. Since each task has a different set of classes, the last layers cannot be part of the soup. Hence, during fine-tuning, we freeze the linear head produced by CLIP's text tower so that task-specific learning is captured only in the backbone weights. At test time, we use the "backbone soup" with a zero-shot head constructed from CLIP's text tower and the CIFAR-100 class names with the prompt-ensembling used for ImageNet by radford2021learning. Figure 8 (left) shows that a model soup containing models trained on each of these datasets and the zero-shot model improves zero-shot performance on CIFAR-100 by 6.4 percentage points over the CLIP baseline. Moreover, Figure 8 (right) shows that the choice of which fine-tuned models to include can have a substantial impact on the accuracy of the resulting soup. See Appendix B.4 for additional details.

Figure 8: Model soups can improve zero-shot performance on new downstream tasks. (left) Starting with zero-shot CLIP we create a soup by adding models fine-tuned on ImageNet, CIFAR-10, Food-101, SUN397, DTD, and Cars, and evaluate on CIFAR-100. Different orders for adding models are shown with faded lines. (right) The average change in CIFAR-100 accuracy when a model fine-tuned on the dataset listed on the x-axis is added to the model soup.

4 Analytically comparing soups to ensembles

Figure 9: Validation of the analytical approximation (1) for the performance difference of a 2-model soup and ensemble. Each marker on the scatter plots represents a different choice of endpoint models (θ_1, θ_2) and interpolation weight α. In every scatter plot, the vertical axis shows the true performance difference between the soup and ensemble (in loss for the left and center panes, and error for the right pane), where a positive value indicates the ensemble is better. The horizontal axis shows our approximation for the loss difference. The top row shows results with the inverse temperature β chosen to calibrate the soup, and the bottom row shows results with β held fixed.

The goal of this section is to obtain complementary analytical insight into the effectiveness of model soups. For simplicity, we consider a soup consisting of only two models with parameters θ_1 and θ_2. For a weighting parameter α in [0, 1] we let θ_α = (1 − α)θ_1 + αθ_2 denote the weight-averaged soup. We would like to understand when the expected error of the soup is lower than the best of both endpoints.

Note that convexity of the error in α does not by itself imply superiority of the soup to both endpoints, as the minimum over α in [0, 1] may be attained at an endpoint even when the error is convex. To get further leverage on the problem, we compare the soup to the logit-level ensemble, which averages the logits of the two endpoint models with the same weights. The rich literature on ensembles (see Section 6) tells us that the expected error of the ensemble is often strictly below that of both endpoint models for neural networks. Therefore, whenever the soup performs similarly to the ensemble, we expect the soup to outperform both endpoint models.

To analytically compare the soup and the ensemble, we replace the 0-1 loss with a differentiable surrogate. Specifically, we consider the cross-entropy loss. We let L_β^soup denote the β-calibrated expected loss of the soup, and similarly define L_β^ens for the ensemble. We derive the following approximation for the loss difference:

(1)

where p denotes the "softmax" distribution induced by the soup's logits and δ is the difference between the endpoint logits. We obtain our approximation in the regime where the logits are not too far from linear; see Appendix F.3 for a detailed derivation.

The first term in approximation (1) is negatively proportional to the second derivative of the loss along the trajectory: when the approximation holds, convexity of the loss indeed favors the soup. However, the second term in the approximation does not follow from the "convex basin" intuition. This term always favors the ensemble, but is small in one of two cases: (a) the somewhat trivial case when the endpoint models are similar (so that δ is small) and (b) when the soup produces confident predictions, implying that p is close to a point mass and consequently the variance term is small.

To test our approximation, we evaluate it over a set of fine-tuned models with different learning rates, augmentation strategies, random seeds, and values of α. We set β to calibrate the soup model, and find that this improves the ability of our approximation to predict the soup/ensemble error difference; see Appendix F.4 for a detailed description of our setup.

Figure 9 summarizes the results of our empirical evaluations. When excluding the highest learning rate (center and right panels; fine-tuned models trained with this learning rate are far in weight space from the initial model and are often rejected when forming greedy soups, so we do not expect our approximation to be tight for them), we see that the approximation is strongly correlated with both the true difference in loss as well as the difference in error, and the approximation and true loss difference generally agree in sign. Additional details are provided in Appendix F.

5 Scope and limitations

While this work has so far demonstrated that averaging many fine-tuned models is a useful technique for improving accuracy, this section explores two limitations of the approach. The first is the applicability of model soups, and the second is the failure of model soups to substantially improve calibration.

Applicability. So far our experiments have mainly explored models pre-trained on large, heterogeneous datasets. In Appendix E we also explore model soups for an ImageNet-22k pre-trained model. While the greedy soup still provides improvements on ImageNet, these improvements are less substantial compared to those observed when fine-tuning CLIP and ALIGN.

Calibration. While ensembles improve model calibration [guo2017calibration, roelofs2020mitigating], model soups do not have the same effect. As hyperparameters can also have an effect on calibration, we consider the ensemble and soup of 20 models which are identical other than random seed. Results are illustrated in Figure A.1 using the calibration metrics of roelofs2020mitigating.

6 Related work

Averaging model weights.

Averaging the weights of models is a popular approach in convex optimization and deep learning. Most applications study models along the same optimization trajectory, e.g., [ruppert1988efficient, polyak1990new, szegedy2016rethinking, izmailov2018averaging, zhang2019lookahead]. By contrast, frankle2020linear, neyshabur2020being and matena2021merging weight-average models which share an initialization but are optimized independently. frankle2020linear find that, when training a pair of models from scratch with the same hyperparameter configuration but different data order, interpolating weights achieves no better than random accuracy. However, if the two models share a portion of their optimization trajectory, accuracy does not drop when they are averaged. Analogously, neyshabur2020being demonstrate that when two models are fine-tuned with the same pre-trained initialization, the interpolated model attains at least the accuracy of the endpoints. Unlike frankle2020linear and neyshabur2020being we consider averaging many models with varied hyperparameter configurations.

matena2021merging merge models with the same pre-trained initialization that are fine-tuned on different text classification tasks. They also propose Fisher information as an alternative technique for model merging. Unlike experiments in our Section 3.5, matena2021merging use data from the target distribution for fine-tuning. Moreover, wortsman2021robust average zero-shot and fine-tuned models, finding improvements in- and out-of-distribution. In contrast to wortsman2021robust, we average models across many independent runs which provides more substantial improvements in-distribution.

Stochastic Weight Averaging (SWA) [izmailov2018averaging], which averages weights along a single optimization trajectory, is also motivated by the relation between ensembling model outputs and averaging model weights. In contrast, the averaging we propose is across independent runs. Moreover, while their analysis relates the averaged network outputs (i.e., the logit ensemble) to the output of a network with the averaged weights, our analysis (Section 4) goes a step further and relates the classification losses associated with these two vectors.

Pre-training and fine-tuning.

In computer vision and natural language processing, the best performing models are often pre-trained on a large dataset before being fine-tuned on data from the target task [donahue2014decaf, yosinski2014transferable, sharif2014cnn, girshick2014rich, mahajan2018exploring, kornblith2019better, yalniz2019billion, kolesnikov2020big, bommasani2021opportunities]. This paradigm is also referred to as transfer learning. Recently, image-text pre-training has become increasingly popular in computer vision as a pre-training task [radford2021learning, jia2021scaling, mu2021slip, pham2021scaling]. Recent work has explored alternative strategies for adapting these models to specific target tasks [coop, gao2021clip, zhang2021tip], for instance via a lightweight residual feature adapter. In contrast, our work explores standard end-to-end fine-tuned models. Other work has attempted to improve transfer learning by regularizing models toward their initialization [xuhong2018explicit], choosing layers to tune on a per-example basis [guo2019spottune], reinitializing layers over the course of training [li2020rifle], or using multiple pretrained models with data-dependent gating [shu2021zoo].

Ensembles.

Combining the outputs of many models is a foundational technique for improving the accuracy and robustness of machine learning models [dietterich2000ensemble, bauer1999empirical, breiman1996bagging, friedman2001elements, deepensembles, FREUND1997119]. ovadia2019can show that ensembles exhibit high accuracy under distribution shift. mustafa2020deep propose a method for identifying subsets of pre-trained models for fine-tuning and later ensembling them, finding strong in-distribution accuracy and robustness to distribution shift. gontijo2021no conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to uncorrelated errors and better ensemble accuracy. Finally, previous work has explored building ensembles of models produced by hyperparameter searches [snoek2015scalable, mendoza2016towards, saikia2020optimized], including greedy selection strategies [caruana2004ensemble, caruana2006getting, levesque2016bayesian, wenzel2020hyperparameter]. Importantly, ensembles require a separate inference pass through each model, which increases computational costs. When the number of models is large, this can be prohibitively expensive. Unlike ensembles, model soups require no extra compute at inference time.

7 Conclusion

Our results challenge the conventional procedure of selecting the best model on the held-out validation set when fine-tuning. With no extra compute during inference, we are often able to produce a better model by averaging the weights of multiple fine-tuned solutions.

Acknowledgements

We thank Ting Chen, Jesse Dodge, Ben Eysenbach, David Fleet, Pieter-Jan Kindermans, Mohammad Norouzi, Sarah Pratt and Vivek Ramanujan for helpful discussions and draft feedback, Lucas Beyer and Xiaohua Zhai for assistance with ViT-G/14 fine-tuning, and Hyak at UW for computing support.

YC was supported in part by the Israeli Science Foundation (ISF) grant no. 2486/21, the Len Blavatnik and the Blavatnik Family foundation, and The Yandex Initiative for Machine Learning.

References

Appendix A Additional Figures

Figure A.1: Like model ensembling, model soups improve accuracy, but unlike model ensembling, model soups do not improve calibration. Expected calibration error (ECE) is computed using equal-mass binning. The soup in this figure is the uniform soup, and all models in this experiment are fine-tuned CLIP ViT-B/32 models with the same hyperparameters but different random seeds. The calibrated soup and calibrated ensemble refer to a soup and ensemble composed of models which are calibrated through temperature scaling [guo2017calibration]. Calibrating models before ensembling or souping had no effect on accuracy and so these curves are omitted from the plots on the left.
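For reference, expected calibration error with equal-mass binning can be computed as sketched below; this follows the generic definition (sort by confidence, split into bins of equal size, average the |accuracy − confidence| gaps) rather than any particular released implementation.

# Expected calibration error (ECE) with equal-mass binning.
import numpy as np

def ece_equal_mass(confidences, correct, num_bins=15):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    bins = np.array_split(order, num_bins)  # bins with (nearly) equal counts
    n = len(confidences)
    ece = 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        gap = abs(correct[idx].mean() - confidences[idx].mean())
        ece += (len(idx) / n) * gap
    return ece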

Appendix B Experimental Details

B.1 Learned soup

In addition to the greedy soup method described in the text, we also explore a more advanced souping procedure, which removes the sequential constraint from the greedy soup and requires only a single pass through the held-out validation set. We refer to this method as the learned soup, as it involves learning the soup mixing coefficients for each of the ingredients on the held-out validation set. However, the learned soup has the downside of requiring all models to be simultaneously loaded in memory. In practice we combine the models on CPU before moving the parameters to GPU for each batch. For a loss \ell and held-out validation set \{(x_j, y_j)\}_{j=1}^{n}, we find mixing coefficients \alpha_1, \dots, \alpha_k and a temperature scaling parameter \beta via

\min_{\alpha, \beta} \; \sum_{j=1}^{n} \ell\big(\beta \, f(x_j; \textstyle\sum_{i=1}^{k} \alpha_i \theta_i),\; y_j\big).    (2)

In practice we find better results when α is parameterized as the output of a softmax, so that each α_i is positive and the values sum to one. We optimize the aforementioned objective with gradient-based mini-batch optimization for three epochs over the held-out validation set with the AdamW optimizer and a constant learning rate of 0.1.

As presented in Table 3, we also try a "by layer" variant of the learned soup. For this variant we learn a separate set of mixing coefficients for each layer of the network.
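A hedged sketch of the learned soup is given below: the mixing coefficients are parameterized through a softmax, a temperature scales the logits, and both are optimized with AdamW on held-out batches. The use of torch.func.functional_call, the einsum-based mixing, and all names are illustrative choices for this sketch, not the implementation used in the paper.

# Sketch of the learned soup: optimize softmax-parameterized mixing
# coefficients (and a temperature on the logits) on the held-out validation set.
import torch
import torch.nn.functional as F

def learn_soup(model, state_dicts, val_loader, epochs=3, lr=0.1):
    keys = [k for k in state_dicts[0] if state_dicts[0][k].is_floating_point()]
    stacked = {k: torch.stack([sd[k] for sd in state_dicts]) for k in keys}
    raw_alpha = torch.zeros(len(state_dicts), requires_grad=True)  # softmax -> mixing weights
    log_beta = torch.zeros(1, requires_grad=True)                  # temperature on the logits
    opt = torch.optim.AdamW([raw_alpha, log_beta], lr=lr)

    model.eval()
    for _ in range(epochs):
        for images, labels in val_loader:
            alpha = torch.softmax(raw_alpha, dim=0)
            # Weighted combination of the ingredients, kept differentiable in alpha.
            params = {k: torch.einsum("i,i...->...", alpha, stacked[k]) for k in keys}
            logits = torch.func.functional_call(model, params, (images,))
            loss = F.cross_entropy(log_beta.exp() * logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    alpha = torch.softmax(raw_alpha.detach(), dim=0)
    return {k: torch.einsum("i,i...->...", alpha, stacked[k]) for k in keys}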

B.2 Error landscape visualizations

To supplement Figure 2, we provide an identical experiment but with a 10x bigger learning rate instead of 10x smaller. Results are illustrated in Figure B.1 with linear instead of log scaling for the contour lines; since the error difference is more substantial, linear scaling is clearer. When fine-tuning with a larger learning rate, error increases on the path between the two fine-tuned solutions. All error landscape visualizations use CLIP ViT-B/32 fine-tuned on ImageNet for 10 epochs with minimal data augmentation, as used by CLIP during pre-training. When computing angles between the two fine-tuned solutions, as in Figure 3, we use the repeated weights which constitute the majority of the network parameters. We ignore gain terms, which tend to skew positive if occurring before ReLU activations.

In Figure 3 we only consider solutions fine-tuned with learning rates below a threshold. As in Figure B.1, if a large learning rate is used, accuracy decreases on the path in weight space between the two models.

Figure B.1: Replicating Figure 2 with a 10x larger learning rate instead of 10x smaller in the second row.

B.3 Model soups

This section describes the set of hyperparameters used for searches. For all ImageNet experiments, we withhold 2% of the training set and use these examples as the held-out validation set for model selection in greedy and learned soup.

B.3.1 CLIP experiments

Unless otherwise mentioned, all experiments used the AdamW optimizer [loshchilov2018decoupled] with a cosine-annealing learning rate schedule [loshchilov2016sgdr] for 10 epochs at batch size 512. When necessary we discretize augmentation strength into minimal, medium, and strong. Minimal augmentation uses only a random crop consisting of 90%-100% of the total image area. Medium is the default augmentation used by the timm library [rw2019timm]. Strong refers to RandAugment [cubuk2020randaugment].

We now provide the low-level details for the hyperparameter searches, which are the standard grid, extreme grid, and random search. The standard grid sweeps over a small set of learning rates, of which the intermediate values typically perform best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at a fixed strength. We consider all combinations of the above, running each hyperparameter configuration with two random seeds.

The extreme grid considers a wider range of learning rates, including extreme values; again the intermediate learning rates typically perform best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at a fixed strength. Moreover, we include the initialization in this search, which often outperforms some of the extreme learning rates but is far from the most accurate model.

The random search chooses the learning rate as 10^(-x), where x is selected uniformly at random from 4 to 6. Weight decay is chosen analogously as 10^(-y), where y is selected uniformly at random from 0.2 to 4. With probability 0.5, label smoothing is set to 0, and otherwise it is selected uniformly at random between 0 and 0.25. Fine-tuning epochs are chosen randomly between four and sixteen. Mixup is 0 with probability 0.5, and otherwise is chosen uniformly at random from 0 to 0.9. With some probability we use minimal augmentation; otherwise we use RandAugment with magnitude chosen uniformly at random between 0 and 20 and the number of augmentation layers chosen uniformly at random between 0 and 2.

When fine-tuning on WILDS-FMoW and WILDS-iWildCam for Figure 6, we use the same random search as when we fine-tune CLIP on ImageNet. The only difference is that we are able to use a larger ViT-L/14 model as the datasets are smaller. This also requires us to change the default batch size from 512 to 128.

B.3.2 ALIGN experiments

We fine-tuned ALIGN EfficientNet-L2 models using AdamW with a weight decay of 0.1 for 25 epochs, with the final layer initialized from a linear probe trained without data augmentation. We fine-tuned 5 models with standard Inception-style random crops (consisting of 5% to 100% of the total image area with an aspect ratio between 0.75 and 1.33) and different learning rates. We also fine-tuned 7 additional models at a single learning rate with different data augmentation strategies. Specifically, we varied the random cropping strategy (either Inception-style crops or less aggressive crops consisting of 90% to 100% of the total image area with an aspect ratio between 0.95 and 1.05), the use of RandAugment [cubuk2020randaugment], and the use of mixup [zhang2017mixup], and trained models with all combinations of these strategies. Our soups are obtained by considering these 12 models as well as the linear probe initialization. We perform evaluation using a square center crop. The accuracy we attain with the greedy soup approaches that reported by jia2021scaling, which evaluated at a different resolution.

B.3.3 ViT-G/14 experiments

These models are initialized with a backbone that was pretrained on the JFT-3B dataset [zhai2021scaling] and linear probes obtained at either the resolution at which the ViT-G/14 was pretrained or at the resolution used for fine-tuning. Models are fine-tuned at a batch size of 512 for either 10,000 or 20,000 steps (approximately 4 or 8 epochs) using the Adafactor optimizer [shazeer2018adafactor] with learning rates of 3e-5 or 5e-5; a constant or cosine decay learning rate schedule; and softmax or binary cross-entropy loss. When fine-tuning with binary cross-entropy loss, we use a linear probe that is also trained with binary cross-entropy loss. We vary data augmentation, applying RandAugment [cubuk2020randaugment], mixup [zhang2017mixup], or CutMix [yun2019cutmix] of varying strengths and random cropping with a minimum crop size of 5%, 70%, 90%, or 100% of the full image. When applying SAM, we consider models with perturbations either synchronized or unsynchronized across accelerators, including one model with synchronized perturbations and a combination of CutMix and SAM. All models are fine-tuned at a fixed resolution and evaluated by rescaling test images (without preserving the aspect ratio) and taking a central crop.

We manually tuned hyperparameters with the goal of maximizing single-model accuracy. After settling on the use of Adafactor as the optimizer, we included all subsequently trained models in the pool of models to be used for greedy soup. The model that performs best on the holdout set is initialized with a linear probe and fine-tuned with a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax cross-entropy loss, a minimum crop size of 90%, and CutMix. The model that performs best on the official ImageNet validation set is initialized with a linear probe and fine-tuned at a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax cross-entropy loss, a minimum crop size of 90%, CutMix, and SAM. The greedy soup contains models trained with a wide range of different hyperparameter values, including different learning rates, linear probes, loss functions, and every form of data augmentation and minimum crop size investigated. Notably, although models trained with SAM with synchronized perturbations are included in the greedy soup, the greedy soup process skips over the models trained with SAM with unsynchronized perturbations because adding them produces a large drop in holdout accuracy.

B.4 Cross-dataset soups details

This section provides additional details for the findings presented in Section 3.5. When fine-tuning we initialize with CLIP ViT-B/32 and use a fixed learning rate for 10 epochs with a mini-batch size of 512. We train with minimal augmentation.

Appendix C Analysis of 1D hyperparameter grids

This section asks: for a one-dimensional grid of hyperparameters h_1, ..., h_m, how does averaging the two models fine-tuned with the endpoint configurations h_1 and h_m compare with picking the best individual model fine-tuned with any single configuration h_i from the grid?

Figure C.1: Analysis of 1D hyperparameter grids, where the average of the models at the endpoints often outperforms the best individual model in the grid. In particular, colors and numbers indicate the percentage-point improvement obtained by averaging the models on the x- and y-axes versus taking the best individual model in the range between them. Results are shown for the CLIP ViT-B/32 model fine-tuned on ImageNet.

The results are illustrated in Figure C.1, where each square represents a grid. The average of the endpoints often outperforms the best individual model in the grid. A notable exception is when the highest learning rate we consider is the left endpoint of the grid. As this experiment uses AdamW, this learning rate is too high for fine-tuning and, unlike the examples in Figure 2, there is a high error barrier between the two fine-tuned solutions (see Figure B.1, lower right for example).

When varying the optimizer we use minimal data augmentation and the same learning rate for RMSProp [rmsprop], Adam [kingma2014adam], and AdamW [loshchilov2018decoupled]; SGD requires a larger learning rate, and so we use a higher value for it. When varying augmentation strength, we use a fixed learning rate.

Appendix D Additional grid searches and initializations

This section recreates Figure 5 with different initializations (linear probe initialization or zero-shot) and different grid searches (standard and extreme grid) when fine-tuning CLIP ViT-B/32. The standard and extreme grid searches are described in Section B.3.1.

Figure D.1 considers the linear probe (LP) initialization and the standard grid. Figure D.2 considers the linear probe (LP) initialization and the extreme grid. Figure D.3 considers the zero-shot initialization and the standard grid. Figure D.4 considers the zero-shot initialization and the extreme grid.

Figure D.1: Replicating Figure 5 with the LP initialization and the standard grid hyperparameter search.
Figure D.2: Replicating Figure 5 with the LP initialization and the extreme grid hyperparameter search.
Figure D.3: Replicating Figure 5 with the zero-shot initialization and the standard grid hyperparameter search.
Figure D.4: Replicating Figure 5 with the zero-shot initialization and the extreme grid hyperparameter search.

Appendix E Additional fine-tuning and pre-training datasets

Figure E.1: Fine-tuning a CLIP ViT-L model on CIFAR-10 [krizhevsky2009learning] with the random hyperparameter search described in Section B.3.1. The vertical axis displays accuracy on CIFAR-10.1 [pmlr-v97-recht19a], a reproduction of CIFAR-10 with a distribution shift.
Figure E.2: Fine-tuning on ImageNet, using a ViT-B/32 [dosovitskiy2021an] pre-trained on ImageNet-22k [deng2009imagenet].

In this section we explore fine-tuning or pre-training on additional datasets. Figure E.1 displays results for fine-tuning a CLIP ViT-L model on CIFAR-10 [krizhevsky2009learning]. The vertical axis of Figure E.1 displays accuracy on CIFAR-10.1 [pmlr-v97-recht19a], a reproduction of CIFAR-10 with a distribution shift. The individual models are fine-tuned with the random hyperparameter search described in Section B.3.1.

Next, Figure E.2 shows results for a ViT-B/32 [dosovitskiy2021an] model pre-trained on ImageNet-22k [deng2009imagenet] and fine-tuned on ImageNet. This differs from many of our other experiments in that the dataset used for pre-training is smaller and less diverse. While the greedy soup offers an improvement, the improvement is less substantial than in, e.g., Figure 1, which uses the same model and hyperparameter search but a different pre-training dataset.

Finally, we fine-tune a ViT-B/32 model five times on ImageNet, using the best hyperparameters found by the hyperparameter sweep and varying only the random seed. This experiment is conducted both for a model pre-trained on ImageNet-22k [deng2009imagenet] and for a pre-trained CLIP model. The results are shown in Figure E.3, comparing, for an experimental budget of $k$ models: (i) the individual model trained with random seed $k$, (ii) the model soup composed of the models with random seeds 1 through $k$, and (iii) the ensemble composed of the models with random seeds 1 through $k$. The performance of the model soup appears correlated with the performance of the ensemble: we find that CLIP models are more amenable to both ensembling and souping than models pre-trained on ImageNet-22k.
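The sketch below spells out the soup and ensemble quantities compared in Figure E.3; the model constructor, data loader, and accuracy helpers are assumptions, not part of our released code.

```python
# Sketch: for k checkpoints that differ only in random seed, compare the
# uniform weight average (soup) with the logit ensemble. `model_fn` builds the
# architecture; `accuracy` and `accuracy_from_logits` are assumed helpers.
import torch

def soup_accuracy(model_fn, state_dicts, loader, accuracy):
    avg = {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
           for k in state_dicts[0]}
    model = model_fn()
    model.load_state_dict(avg)
    return accuracy(model, loader)          # one forward pass per example

def ensemble_accuracy(model_fn, state_dicts, loader, accuracy_from_logits):
    models = []
    for sd in state_dicts:
        m = model_fn()
        m.load_state_dict(sd)
        models.append(m.eval())
    def ensemble_logits(x):                 # k forward passes per example
        with torch.no_grad():
            return torch.stack([m(x) for m in models]).mean(dim=0)
    return accuracy_from_logits(ensemble_logits, loader)
```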

Figure E.3: For a CLIP and an ImageNet-22k pre-trained ViT-B/32 model, we use the best hyperparameters found by the hyperparameter sweep to fine-tune multiple times, varying only the random seed. For an experimental budget of $k$ models, we show: (i) the individual model trained with random seed $k$, (ii) the model soup composed of the models with random seeds 1 through $k$, and (iii) the ensemble composed of the models with random seeds 1 through $k$.

Appendix F Analytical comparison details

f.1 Notation and preliminaries

We begin by restating and adding to the notation used in Section 4. For a model with parameter vector $\theta$ and input vector $x$, we let $f(x; \theta) \in \mathbb{R}^k$ denote the model's logit output for $k$-way classification. Throughout, we fix two endpoint models $\theta_0$ and $\theta_1$, and for an interpolation parameter $\alpha \in [0, 1]$ define

$\theta_\alpha = (1 - \alpha)\,\theta_0 + \alpha\,\theta_1 \quad\text{and}\quad f_\alpha(x) = f(x; \theta_\alpha)$

to be the "soup" weight-averaged model and its corresponding logits. We also write

$f^{\mathrm{ens}}_\alpha(x) = (1 - \alpha)\, f(x; \theta_0) + \alpha\, f(x; \theta_1)$

for the logits of the ensemble model. We write

$\delta = \theta_1 - \theta_0$

for the difference of the two endpoints.

For a logit vector $z \in \mathbb{R}^k$ and a ground-truth label $y \in \{1, \dots, k\}$, denote the cross-entropy loss by

$\ell(z, y) = -z_y + \log \sum_{i=1}^{k} \exp(z_i).$

For some distribution $\mathcal{D}$ over input–label pairs $(x, y)$, we write the expected $\beta$-calibrated log losses of the soup and ensemble as

$L_{\mathrm{soup}}(\beta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(\beta f_\alpha(x), y)\big] \quad\text{and}\quad L_{\mathrm{ens}}(\beta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(\beta f^{\mathrm{ens}}_\alpha(x), y)\big],$

respectively.

We have the following expressions for the derivatives of the cross-entropy loss with respect to the logits. Writing $p = \mathrm{softmax}(z)$, the gradient is

$\nabla_z \ell(z, y) = p - e_y,$

where $e_y$ is the $y$th standard basis vector and $p$ has $\exp(z_i) / \sum_j \exp(z_j)$ in its $i$th entry. The Hessian is

$\nabla_z^2 \ell(z, y) = \mathrm{diag}(p) - p\,p^\top,$

so that for any $v \in \mathbb{R}^k$, we have

$v^\top \nabla_z^2 \ell(z, y)\, v = \sum_i p_i v_i^2 - \Big(\sum_i p_i v_i\Big)^2 = \mathrm{Var}_{I \sim p}[v_I] \ge 0.$

Finally, we use $\delta^\top \nabla_\theta f(x; \theta)$ to denote a vector in $\mathbb{R}^k$ whose $i$th entry is $\delta^\top \nabla_\theta f_i(x; \theta)$. Similarly, $\delta^\top \nabla^2_\theta f(x; \theta)\, \delta$ denotes a vector in $\mathbb{R}^k$ whose $i$th entry is $\delta^\top \nabla^2_\theta f_i(x; \theta)\, \delta$, where the gradients and Hessians are taken with respect to $\theta$.
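These expressions are easy to verify numerically; the following sketch (purely illustrative) checks the gradient and Hessian of the cross-entropy loss with respect to the logits using autograd.

```python
# Numerical check that, for cross-entropy over logits z with label y,
# the gradient is softmax(z) - e_y and the Hessian is diag(p) - p p^T.
import torch
import torch.nn.functional as F

k, y = 5, 2
z = torch.randn(k, requires_grad=True)
target = torch.tensor([y])

loss = F.cross_entropy(z.unsqueeze(0), target)
grad = torch.autograd.grad(loss, z)[0]

p = torch.softmax(z.detach(), dim=0)
e_y = F.one_hot(torch.tensor(y), k).float()
print(torch.allclose(grad, p - e_y, atol=1e-6))

hessian = torch.autograd.functional.hessian(
    lambda v: F.cross_entropy(v.unsqueeze(0), target), z.detach())
print(torch.allclose(hessian, torch.diag(p) - torch.outer(p, p), atol=1e-5))
```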

f.2 An exact expression for logit difference

We use the fundamental theorem of calculus and elementary algebraic manipulation to obtain an exact integral form for the difference between the soup and ensemble logits. To streamline notation, we drop the dependence of the logits on the input $x$.

(3)

where

Note that
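As a simple sanity check (an illustration, not part of the derivation above): when the logits are exactly linear in the parameters, the soup and ensemble logits coincide, so any gap between them must come from nonlinearity.

```python
# For logits that are linear in the parameters (a single linear layer with no
# bias), weight averaging and logit averaging give identical outputs.
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)                                # batch of inputs
w0, w1 = torch.randn(10, 16), torch.randn(10, 16)     # two fine-tuned endpoints
alpha = 0.3

soup_logits = x @ ((1 - alpha) * w0 + alpha * w1).T
ensemble_logits = (1 - alpha) * (x @ w0.T) + alpha * (x @ w1.T)
print(torch.allclose(soup_logits, ensemble_logits, atol=1e-6))   # True
```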

f.3 Derivation of approximation

We continue to suppress the dependence on $x$ in order to simplify notation. We begin with the following first-order approximation of the pointwise log-loss difference between the ensemble and the soup, which is also a lower bound due to convexity:

$\ell(f^{\mathrm{ens}}_\alpha, y) - \ell(f_\alpha, y) \;\approx\; \nabla_z \ell(f_\alpha, y)^\top \big(f^{\mathrm{ens}}_\alpha - f_\alpha\big).$

Now, we approximate the ensemble–soup logit difference using eq. 3 by assuming that the second derivative of the logits along the segment between the checkpoints, $\delta^\top \nabla^2_\theta f(\theta_t)\, \delta$, is the same for all $t$; this holds when the logits are approximately quadratic along the line between the checkpoints. The resulting approximation is

$f^{\mathrm{ens}}_\alpha - f_\alpha \;\approx\; \frac{\alpha(1 - \alpha)}{2}\, \delta^\top \nabla^2_\theta f(\theta_\alpha)\, \delta.$

Combining the two approximations above, we obtain

$\ell(f^{\mathrm{ens}}_\alpha, y) - \ell(f_\alpha, y) \;\approx\; \frac{\alpha(1 - \alpha)}{2}\, \nabla_z \ell(f_\alpha, y)^\top \big(\delta^\top \nabla^2_\theta f(\theta_\alpha)\, \delta\big).$

To relate this expression to the Hessian of the loss with respect to the parameters, we note that for any $\theta$,

$\delta^\top \nabla^2_\theta\, \ell\big(f(\theta), y\big)\, \delta \;=\; \big(\delta^\top \nabla_\theta f(\theta)\big)^\top \nabla^2_z \ell\big(f(\theta), y\big)\, \big(\delta^\top \nabla_\theta f(\theta)\big) \;+\; \nabla_z \ell\big(f(\theta), y\big)^\top \big(\delta^\top \nabla^2_\theta f(\theta)\, \delta\big)$

(by the chain rule). When setting $\theta = \theta_\alpha$, we note that the second term on the right-hand side is (up to a constant) our approximation for the loss difference. Recalling the expression for the cross-entropy Hessian, the first term is $\mathrm{Var}_{I \sim p}\big[(\delta^\top \nabla_\theta f(\theta_\alpha))_I\big]$, the variance of the entries of the directional derivative of the logits under the softmax probabilities $p$.

As a final approximation, we let

$\delta^\top \nabla_\theta f(\theta_\alpha) \;\approx\; f(\theta_1) - f(\theta_0);$

this holds when the logits are not too far from linear in $\theta$.

Substituting back and making the dependence on the input $x$ explicit, we obtain

where we have used

Scaling all logits by the inverse temperature $\beta$, the approximation becomes

Averaging the result over the data distribution $\mathcal{D}$, we arrive at the approximation (1), which we repeat here for ease of reference:

f.4 Detailed empirical evaluations

Evaluation setup.

We evaluated our bounds on checkpoints from the ViT-B/32 fine-tuning experiments in the extreme grid search described in Section B.3.1. We selected three learning rate values, two augmentation levels (none and RandAugment+MixUp), and two different random seeds. From these checkpoints (as well as the initialization) we constructed the following pairs:

  • All pairs with different learning rate, the same augmentation level and seed 0,

  • All pairs with the same learning rate, different augmentation level and seed 0,

  • All pairs with the same learning rate and augmentation level, but different seeds,

  • All checkpoints with seed 0 coupled with the initialization.

This results in 21 pairs overall. For each pair and each value of $\alpha$, we evaluated the difference in loss between the ensemble and the soup, as well as the approximation (1). We performed this evaluation on the ImageNet validation set as well as on the 5 distribution shift test sets considered throughout this paper.

The effect of temperature calibration.

Since our ultimate goal is to accurately predict the difference in error rather than the difference in loss, we introduce the inverse-temperature parameter $\beta$ to the loss and tune it to calibrate the soup model. Specifically, for every model pair, value of $\alpha$, and test set, we take the $\beta$ that minimizes the soup's calibrated loss, i.e., $\beta = \arg\min_{\beta > 0} L_{\mathrm{soup}}(\beta)$.

While choosing $\beta$ based on the soup rather than the ensemble might skew the loss in favor of the soup, it has no effect on the difference in prediction error. Moreover, in preliminary experiments, calibrating the ensemble instead produced very similar results. In contrast, as shown in Figure 9, fixing $\beta = 1$ throughout results in far poorer prediction of the difference in error.
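A sketch of this calibration step is shown below; the grid of candidate values and the precomputed logits and labels are assumptions.

```python
# Sketch: pick the inverse temperature beta minimizing the soup's loss on a
# test set, then apply the same beta to both soup and ensemble logits.
import torch
import torch.nn.functional as F

def best_beta(soup_logits, labels, betas=torch.logspace(-1, 1, steps=41)):
    losses = torch.stack([F.cross_entropy(b * soup_logits, labels) for b in betas])
    return betas[losses.argmin()]

# beta = best_beta(soup_logits, labels)   # soup_logits: [N, k], labels: [N]
# gap = (F.cross_entropy(beta * ensemble_logits, labels)
#        - F.cross_entropy(beta * soup_logits, labels))
```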

Appendix G Additional baselines

This section explores additional baselines for model soups, including distillation from an ensemble as in hinton2014dark (Table G.1), fix-augmentation as in fixaug (Table G.2), weight-averaging along a trajectory as in szegedy2016rethinking (Table G.3), and Sharpness Aware Minimization as in foret2021sharpnessaware (Table G.4).

Unless otherwise mentioned, we fine-tune CLIP ViT-B/32 models with AdamW [loshchilov2018decoupled] and cosine annealing learning rate [loshchilov2016sgdr] for 10 epochs on ImageNet with a learning rate of 2e-5 and medium augmentation (data augmentation policies are discussed in more detail in Section B.3.1).

We explore the baseline of distillation [hinton2014dark, hinton2015distilling] from the ensemble of three models trained with different data augmentation. As previously reported [bagherinezhad2018label, beyer2021knowledge], we find that running distillation with data augmentation improves accuracy. Unfortunately, this substantially increases the computational resources required to distill from the ensemble: since the augmented inputs differ at every step, we cannot cache the predictions of the models in the ensemble, and a forward pass is required for each ensemble member at each step of fine-tuning. This makes distilling from an ensemble roughly as expensive as training the models which constitute the ensemble in the first place. Nevertheless, as illustrated in Table G.1, model soups still perform favorably.
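For concreteness, one step of the ensemble-distillation baseline might look like the sketch below. The KL-divergence form follows hinton2015distilling; the student, teachers, optimizer, and temperature are placeholders, and details may differ from our exact setup.

```python
# Sketch of one distillation step from an ensemble of teachers: because the
# (augmented) inputs change every step, each teacher must run a forward pass.
# student, teachers, optimizer, and images are assumed; T is a temperature.
import torch
import torch.nn.functional as F

def distill_step(student, teachers, optimizer, images, T=1.0):
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t(images) / T, dim=-1) for t in teachers]).mean(dim=0)
    student_log_probs = F.log_softmax(student(images) / T, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```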

Table G.1 also introduces stochastic augmentation. For each data point, stochastic augmentation randomly applies minimal, medium, or strong data augmentation. Additionally, Table G.2 explores an alternative method for merging augmentations together. This augmentation policy, which we refer to as fix-aug, is introduced by fixaug. For fix-aug, strong augmentation is used for all but the final epoch, which uses minimal augmentation.

Next, Table G.3 applies model soups to solutions which already average along the fine-tuning trajectory. Methods for averaging along an individual optimization trajectory include exponential moving averages (EMA) [szegedy2016rethinking] and stochastic weight averages (SWA) [izmailov2018averaging]. Since accuracy is high even from initialization we use EMA. While EMA improves the accuracy of a single model, we find that models without EMA are more amenable to souping. Regardless, the model soup improves over the best single model with EMA.
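A brief sketch of the EMA bookkeeping used for this baseline is given below; the decay value matches Table G.3, while the exact placement of the update call is an implementation detail left out here.

```python
# Exponential moving average (EMA) of weights along a fine-tuning trajectory.
# The shadow copy is what gets evaluated (and, in Table G.3, souped).
import torch

class WeightEMA:
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v, alpha=1 - self.decay)
            else:
                self.shadow[k].copy_(v)  # integer buffers are copied directly

# ema = WeightEMA(model); call ema.update(model) after every optimizer step.
```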

Finally, Table G.4 explores the relation between model soups and sharpness-aware minimization (SAM) [foret2021sharpnessaware]. In line with previous results, we find that SAM improves accuracy over vanilla fine-tuning. Souping two models trained with SAM improves over either individual model, although the magnitude of the gain is smaller than for vanilla fine-tuning. Souping models trained with and without SAM yields higher accuracy than souping models trained only with vanilla fine-tuning or only with SAM.

ImageNet Distribution shifts
Individual model (LR 3e-05, minimal aug) 76.42 43.21
Individual model (LR 3e-05, medium aug) 78.83 43.55
Individual model (LR 3e-05, strong aug) 79.08 43.75
Individual model (LR 3e-05, stochastic aug) 78.94 45.04
Individual model (LR 3e-05, stochastic aug 3x epochs) 78.38 42.18
Distillation from the ensemble (LR 3e-05, no aug) 78.59 43.45
Distillation from the ensemble (LR 3e-05, stochastic aug) 79.79 45.63
Soup minimal, medium, and strong aug (LR 3e-05) 80.24 47.97
Ensemble minimal, medium, and strong aug (LR 3e-05) 80.19 46.33
Individual model (LR 1e-05, minimal aug) 77.19 47.98
Individual model (LR 1e-05, medium aug) 79.51 46.74
Individual model (LR 1e-05, strong aug) 79.33 46.62
Individual model (LR 1e-05, stochastic aug) 79.48 48.07
Individual model (LR 1e-05, stochastic aug 3x epochs) 79.59 46.89
Distillation from the ensemble (LR 1e-05, no aug) 79.13 47.28
Distillation from the ensemble (LR 1e-05, stochastic aug) 79.88 47.49
Soup minimal, medium, and strong aug (LR 1e-05) 80.08 49.75
Ensemble minimal, medium, and strong aug (LR 1e-05) 80.17 49.36
Table G.1: Comparing model soups to network distillation from an ensemble of models trained with different data augmentations. Stochastic data augmentation randomly applies minimal, medium, or strong data augmentation.
ImageNet Distribution shifts
Individual model (LR 3e-05, minimal aug) 76.42 43.21
Individual model (LR 3e-05, medium aug) 78.83 43.55
Individual model (LR 3e-05, strong aug) 79.08 43.75
Individual model (LR 3e-05, fix aug) 79.43 45.46
Individual model (LR 3e-05, fix aug 4x epochs) 78.57 41.53
Soup minimal, medium, and strong aug (LR 3e-05) 80.24 47.97
Soup minimal, medium, strong, and fix aug (LR 3e-05) 80.41 48.14
Individual model (LR 1e-05, minimal aug) 77.19 47.98
Individual model (LR 1e-05, medium aug) 79.51 46.74
Individual model (LR 1e-05, strong aug) 79.33 46.62
Individual model (LR 1e-05, fix aug) 79.70 48.18
Individual model (LR 1e-05, fix aug 4x epochs) 79.96 45.86
Soup minimal, medium, and strong aug (LR 1e-05) 80.08 49.75
Soup minimal, medium, strong, and fix aug (LR 1e-05) 80.17 49.71
Table G.2: Comparing model soups over different augmentations with another method which combines different augmentation strategies, fix aug, as described in fixaug. For fix aug we use strong data augmentation for all epochs except the final one, for which we apply minimal augmentation.
ImageNet Distribution shifts
Individual model (no EMA, LR 3e-05, minimal aug) 76.42 43.21
Individual model (no EMA, LR 3e-05, medium aug) 78.83 43.55
Individual model (no EMA, LR 3e-05, strong aug) 79.08 43.75
Soup minimal, medium, and strong aug without EMA (LR 3e-05) 80.24 47.97
Individual model (EMA decay 0.9999, LR 3e-05, minimal aug) 77.61 47.45
Individual model (EMA decay 0.9999, LR 3e-05, medium aug) 79.37 46.89
Individual model (EMA decay 0.9999, LR 3e-05, strong aug) 79.17 46.85
Soup minimal, medium, strong aug with EMA (LR 3e-05) 79.76 49.69
Individual model (no EMA, LR 1e-05, minimal aug) 77.19 47.98
Individual model (no EMA, LR 1e-05, medium aug) 79.51 46.74
Individual model (no EMA, LR 1e-05, strong aug) 79.33 46.62
Soup minimal, medium, and strong aug without EMA (LR 1e-05) 80.08 49.75
Individual model (EMA decay 0.9999, LR 1e-05, minimal aug) 77.47 50.94
Individual model (EMA decay 0.9999, LR 1e-05, medium aug) 78.93 49.76
Individual model (EMA decay 0.9999, LR 1e-05, strong aug) 78.85 49.02
Soup minimal, medium, and strong aug with EMA (LR 1e-05) 79.16 51.76
Table G.3: Applying model soups to methods which average models along the trajectory. While taking the exponential moving average (EMA) of weights along the trajectory can improve the performance of a single model, the EMA solution is less amenable to souping.
ImageNet Distribution shifts
Vanilla fine-tuning (seed 0) 79.32 45.09
Vanilla fine-tuning (seed 1) 79.16 45.12
SAM fine-tuning (seed 0) 79.61 43.78
SAM fine-tuning (seed 1) 79.59 43.79
Soup (vanilla fine-tuning, seeds 0 and 1) 79.78 46.46
Soup (SAM fine-tuning, seeds 0 and 1) 79.85 44.44
Soup (vanilla fine-tuning and SAM fine-tuning, seed 0) 80.04 45.38
Table G.4: Applying model soups to models trained with sharpness aware minimization (SAM) [foret2021sharpnessaware].

Appendix H Text classification datasets

We study four text classification datasets from the GLUE benchmark [wang2018glue].

Microsoft Research Paraphrase Corpus

(MRPC; [dolan2005automatically]) contains pairs of sentences, labeled as either nearly semantically equivalent or not. The dataset is evaluated using the average of F1 and accuracy. The training set consists of 3.7 thousand samples and the validation set of 409 samples.

Recognizing Textual Entailment

(RTE; [wang2018glue]) contains pairs of sentences, and the task is to predict whether the first sentence (the premise) entails or contradicts the second sentence (the hypothesis). The data originally comes from a series of textual entailment challenges [dagan2005pascal, bar2006second, giampiccolo2007third, bentivogli2009fifth]. The dataset is evaluated using classification accuracy. The training set consists of 2.5 thousand samples and the validation set of 277 samples.

Corpus of Linguistic Acceptability

(CoLA; [warstadt2018neural]) contains sentences labeled as either grammatical or ungrammatical. Models are evaluated on Matthews correlation (MCC; [matthews1975comparison]), which ranges between -1 and 1. The training set consists of 8.6 thousand samples and the validation set consists of 1043 samples.

Stanford Sentiment Treebank

(SST-2; [socher2013recursive]) contains sentences labelled as expressing positive or negative sentiment, collected from movie reviews. The dataset is evaluated using classification accuracy. The training set consists of 67 thousand samples and the validation set consists of 873 samples.

Appendix I Fine-tuning details for text classification tasks

Model Method MRPC RTE CoLA SST-2
BERT-base [devlin2019bert] Best individual model 88.3 61.0 59.1 92.5
Uniform soup 76.0 52.7 0.0 89.9
Greedy soup 88.3 61.7 59.1 93.0
BERT-large [devlin2019bert] Best individual model 88.8 56.7 63.1 92.2
Uniform soup 15.8 52.7 1.90 50.8
Greedy soup 88.8 56.7 63.1 92.3
T5-small [raffel2020t5] Best individual model 89.7 70.0 42.2 91.7
Uniform soup 82.7 61.7 10.4 91.1
Greedy soup 89.7 70.0 43.0 91.7
T5-base [raffel2020t5] Best individual model 91.8 78.3 58.8 94.6
Uniform soup 86.4 71.8 12.3 94.6
Greedy soup 92.4 79.1 60.2 94.7
T5-large [raffel2020t5] Best individual model 93.4 82.7 61.7 96.3
Uniform soup 74.8 50.2 0.00 96.0
Greedy soup 93.4 84.8 62.7 96.3
Table I.1: Performance of model soups on four text classification datasets from the GLUE benchmark [wang2018glue].

Each model is fine-tuned 32 times on each dataset using a random hyperparameter search. The learning rate is chosen uniformly in log space, and the batch size and number of epochs are each chosen uniformly from a small discrete set. Evaluation is conducted once at the end of training, without early stopping. We use a maximum sequence length of 128 tokens and train with Adam [kingma2014adam] with gradient clipping, no weight decay, and a learning rate decayed linearly to zero by the end of training. We use pre-trained weights from the Huggingface Transformers library [wolf2020transformers]. For BERT models, we use the uncased version.

Fine-tuning occurs without adding any new parameters, to avoid distorting the features of the pre-trained models [kumar2021finetuning]. To this end, the classification tasks are adapted to suit the pre-training objective of BERT and T5. For T5, the tasks are cast as a sequence-to-sequence problem. For instance, for sentiment analysis, an example is to predict “A) positive” from “sentence: The best movie I’ve ever seen! | options: A) positive B) negative | label:”. For BERT, the tasks are cast as a masked language modeling problem. For instance, for linguistic acceptability, an example is to predict “A) acceptable” for the input “sentence: model soups are grammatical. | options: A) acceptable B) unacceptable | label: [MASK] [MASK] [MASK]”. For evaluation, we select whichever option is assigned the highest probability by the model.
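A rough sketch of this evaluation for T5 is shown below. Model and tokenizer loading follow the Huggingface Transformers library; the scoring used here, the per-option label loss, is a simplified stand-in rather than the exact pipeline.

```python
# Sketch: cast SST-2 as sequence-to-sequence for T5 and pick the option with
# the highest likelihood (lowest per-token loss). Scoring is a simplification.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

prompt = ("sentence: The best movie I've ever seen! | "
          "options: A) positive B) negative | label:")
options = ["A) positive", "B) negative"]

def option_loss(prompt, option):
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(**inputs, labels=labels).loss.item()

prediction = min(options, key=lambda o: option_loss(prompt, o))
print(prediction)
```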

The full set of results is shown in Table I.1. On 10 out of the 20 combinations of models and datasets, the greedy soup shows better performance than the best individual model from the hyperparameter search. Uniform soups show worse performance than the best individual model on all experiments, which could be an artifact of the broad range of hyperparameters used in the search. While the experiments varied only basic hyperparameters such as learning rate and batch size, we hypothesize that a broader set of hyperparameter choices (e.g. data augmentation [wei2019eda, ma2019nlpaug]) could lead to more diverse models and better soups.

Finally, as a word of caution for practitioners, we remind readers that many recent language models tie the weights of the output and embedding layers [press2017using]. For this reason, caution is needed when averaging models in place; doing so might inadvertently lead to undesired behavior.
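A conservative pattern is to build the averaged weights in a fresh state dict and load them into a newly constructed model rather than modifying an existing model's parameters in place; the sketch below illustrates this, with the model class and checkpoint paths as placeholders.

```python
# Sketch: average checkpoints out of place and load into a fresh model, which
# avoids mutating tensors that may be shared through weight tying.
import torch
from transformers import AutoModelForSequenceClassification

def averaged_model(model_name, checkpoint_paths):
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    avg = {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
           for k in state_dicts[0]}
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.load_state_dict(avg)
    return model
```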