1 Introduction
In recent years, research has shown that models pretrained on large and diverse datasets learn representations that transfer well to a variety of tasks. As a result, machine learning practitioners now commonly develop solutions for downstream tasks by finetuning large pretrained models
[girshick2014rich, yosinski2014transferable, kornblith2019better, kolesnikov2020big]. Typically, the finetuning process involves two steps: (1) finetune models with a variety of hyperparameter configurations, and (2) select the model which achieves the highest accuracy on the heldout validation set. The remaining models are then discarded.

Selecting a single model and discarding the rest has several downsides. For one, the selected model may not achieve the best performance. In particular, ensembling the outputs of many models can outperform the best single model, albeit at a high computational cost during inference. For another, finetuning a model on downstream tasks can sometimes reduce outofdistribution performance [radford2021learning, andreassen2021evolution, wortsman2021robust, pham2021scaling].
In this work, we propose a more accurate and robust alternative to the second step of the conventional recipe in the context of finetuning a large pretrained model. Instead of selecting the individual finetuned model which achieves the highest accuracy on the heldout validation set, we average the weights of models finetuned independently, and refer to the result as a model soup. Given the results of the first step—a hyperparameter sweep over finetuned models—averaging several of these models to form a model soup requires no additional training and adds no cost at inference time.
Since neural networks are nonlinear with potentially many solutions in different loss basins, it is perhaps surprising that averaging the weights of independently finetuned models achieves high performance. However, recent work
[neyshabur2020being] observes that finetuned models optimized independently from the same initialization lie in the same basin of the error landscape, inspiring our method. Weight averaging along a single training trajectory has previously been shown to improve the performance of models trained from random initialization [szegedy2016rethinking, izmailov2018averaging]. Our approach extends weight averaging to the context of finetuning, where we find that it also works across many independent runs.

We perform a comprehensive experimental study of finetuning to understand the behavior of model soups. For our main results we finetune CLIP [radford2021learning] and ALIGN [jia2021scaling], which are pretrained with a contrastive loss on imagetext pairs, and a ViTG model pretrained on JFT [zhai2021scaling]. Our results show that model soups often outperform the best individual model on both the indistribution and natural distribution shift test sets (Table 1, Figure 1, Figure 4). A model soup composed of ViTG models achieves 90.94% on ImageNet [deng2009imagenet], surpassing the previous state of the art of 90.88% obtained by the CoAtNet model [dai2021coatnet] while requiring 25% fewer FLOPs at inference time. In general, model soups can approach the performance of ensembling, with no additional computational cost or memory relative to a single model during inference. Beyond ImageNet and associated distribution shifts, our results show that model soups are applicable when finetuning on tasks from the WILDS [wilds2021] benchmark, and when finetuning transformer models [vaswani2017attention, devlinetal2019bert, raffel2020t5] for text classification.
While the most straightforward approach to making a model soup is to average all the weights uniformly, we find that greedy soups, where models are sequentially added to the soup if they improve accuracy on heldout data, outperform uniform averaging. Greedy soups avoid adding in models which may lie in a different basin of the error landscape, which could happen if, for example, models are finetuned with high learning rates.
In addition to empirical observations, we analytically relate the similarity in loss between weightaveraging and logitensembling to the flatness of the loss (i.e., its second derivative on a line between models) and confidence of the predictions (expressed via the variance of a logits difference drawn from the weightaverage softmax). We empirically validate our approximation on a subset of the models we train and show that it is strongly correlated with the true averaging vs. ensembling performance difference, particularly in the learning rate regimes where soups are effective and models achieve higher accuracy.
Paper outline. Our method of model soups is presented and evaluated in Sections 2 and 3, respectively. Next, Section 4 includes our analysis relating model soups and ensembles, Section 5 details the scope and limitations of the proposed method, and Section 6 contextualizes model soups by reviewing related work.
2 Method
Method  Cost  
Best on val. set  
Ensemble  
Uniform soup  
Greedy soup  Recipe 1  
Learned soup  Appendix B.1 
This section highlights three recipes for model souping, the uniform, greedy, and learned soup, though the greedy soup is our central method. We summarize the methods described in this section in Table 2.
We consider a neural network f(x; θ) with input data x and parameters θ. Finetuning is analogous to standard neural network training but includes an important distinction: the parameters are initialized to those found via pretraining.

Let θ = FineTune(θ₀, h) denote the parameters obtained by finetuning with pretrained initialization θ₀ and hyperparameter configuration h. The hyperparameter configuration can include the choice of optimizer, data augmentation, training iterations, and a random seed which will determine data order.

For hyperparameter configurations h₁, …, h_k, let θᵢ = FineTune(θ₀, hᵢ). Conventionally, the parameters θⱼ which attain the highest accuracy on a held out validation set are selected, and the remaining parameters are discarded. Instead, model soups use an average over a subset S of the models, i.e., θ_soup = (1/|S|) ∑_{i∈S} θᵢ where S ⊆ {1, …, k}. The uniform soup is constructed by averaging all finetuned models, and so S = {1, …, k}.
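As a minimal sketch, assuming each finetuned model's parameters are stored as a dict of numpy arrays (a hypothetical format; in practice these would be framework checkpoints), the uniform soup is elementwise averaging:

```python
import numpy as np

def uniform_soup(models):
    """Average the parameters of finetuned models sharing one architecture.

    `models` is a list of dicts mapping parameter names to arrays, e.g. the
    checkpoints produced by a finetuning sweep (a hypothetical format here).
    """
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}
```

The resulting dict has the same shapes as any single model, so inference cost is unchanged.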
There are settings in which a hyperparameter configuration can produce a model with low accuracy that results in a low accuracy uniform soup. This issue can be circumvented with a greedy soup (Recipe 1). The greedy soup is constructed by sequentially adding each model as a potential ingredient in the soup, and only keeping the model in the soup if performance on a held out validation set (disjoint from the training and test sets) improves. Before running this procedure we sort the models in decreasing order of validation set accuracy, and so the greedy soup can be no worse than the best individual model on the heldout validation set. We also explore a more advanced learned soup recipe that optimizes model interpolation weights by gradientbased minibatch optimization (see Appendix B.1 for details). This procedure requires simultaneously loading all models in memory, which currently hinders its use with large networks.

3 Experiments
This section presents our key experimental findings. We begin with experimental setup (Section 3.1) and provide intuition for model soups by examining error landscape visualizations (Section 3.2). Next we present our main results (Section 3.3), using model soups as an alternative to selecting the best performing individual model. Finally, we explore model soups in the context of robust finetuning (Section 3.4), and examine model soups constructed by finetuning on different datasets (Section 3.5).
3.1 Experimental setup
Our experiments explore the application of model soups when finetuning various models. The primary models we finetune are the CLIP [radford2021learning] and ALIGN [jia2021scaling] models pretrained with contrastive supervision from imagetext pairs, a ViTG/14 model pretrained on JFT3B [zhai2021scaling], and transformer models for text classification [devlinetal2019bert, colin2020exploring]. Unless otherwise mentioned, experiments use the CLIP ViTB/32 model. Finetuning is performed endtoend (all parameters are modified) which often results in better accuracy than training only the final linear layer [kornblith2019better, agrawal2014analyzing, chatfield2014return, azizpour2015generic].
We consider two different methods for initializing the final linear layer before finetuning. The first method initializes the model from a linear probe (LP), as described in kumar2021finetuning
, and we refer to this method as LP initialization. The second method uses the zeroshot initialization, e.g., using the classifier produced by the text tower of CLIP or ALIGN as the initialization. Both methods for initializing the model produce similar trends when applicable, and unless otherwise stated we use the LP initialization.
For the ensemble baselines [dietterich2000ensemble, deepensembles] we ensemble the logits (unnormalized outputs) of models as in gontijo2021no. Finetuning uses a supervised crossentropy loss and, unless otherwise mentioned, is conducted on ImageNet [deng2009imagenet]. When finetuning on ImageNet we also evaluate on the five natural distribution shifts: ImageNetV2 [pmlrv97recht19a], ImageNetR [imagenetr], ImageNetSketch [imagenetsketch], ObjectNet [objectnet], and ImageNetA [imageneta]. We often report results averaged over these five distribution shifts. Since the official ImageNet validation set is used as the test set, we use roughly 2% of the ImageNet training set as a heldout validation set for constructing greedy soups.
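A logit ensemble, unlike a soup, averages model outputs rather than model weights, so it requires a separate forward pass through every member at inference time; a minimal sketch:

```python
import numpy as np

def logit_ensemble_predict(list_of_logits):
    """Ensemble by averaging the unnormalized outputs (logits) of each model
    on the same batch, then taking the argmax class.

    `list_of_logits` holds one (batch, classes) array per model."""
    avg = np.mean(list_of_logits, axis=0)
    return avg.argmax(axis=-1)
```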
3.2 Intuition with error landscape visualizations
To provide intuition, we visualize a two dimensional slice of the training loss and test error landscape when finetuning CLIP on ImageNet. In these experiments, we use the zeroshot initialization θ₀ and finetune twice, independently, to produce solutions θ₁ and θ₂. The points θ₀, θ₁, and θ₂ define a plane in parameter space, and we evaluate the ImageNet train loss, ImageNet test error, and the test error on the five aforementioned distribution shifts on this plane. The results are illustrated in Figure 2, where the zeroshot initialization θ₀ is shown as a star and a finetuned solution θ₁ is shown as a blue square. For θ₂ we either use the same learning rate as θ₁ (but vary the random seed) or a different learning rate. For both the indistribution and outofdistribution test sets, the loss/error contours are basinshaped, and none of the three points is optimal.
These results suggest that (1) interpolating the weights of two finetuned solutions can improve accuracy compared to individual models and (2) more uncorrelated solutions—models that form an angle closer to 90 degrees (the angle between θ₁ − θ₀ and θ₂ − θ₀, i.e., between the arrows shown in Figure 2)—may lead to higher accuracy on the linear interpolation path.
To investigate the correlation between accuracy improvement and angle, we consider a series of models trained with different seeds, learning rates, and data augmentation. For each pair (θ₁, θ₂), we compare the accuracy of their weight average with the average of their individual accuracies, and refer to the difference as the interpolation advantage. Figure 4 illustrates the results, in which we observe that the interpolation advantage is correlated with the angle, and that varying the learning rate, seed, or data augmentation can produce solutions which are more orthogonal. Experimental details and a discussion of high learning rates are provided in Appendix B.2.
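Both quantities can be computed directly from flattened parameter vectors; a minimal sketch, with a hypothetical `acc_of` callback standing in for evaluation on heldout data:

```python
import numpy as np

def angle_deg(theta0, theta1, theta2):
    """Angle (degrees) between theta1 - theta0 and theta2 - theta0,
    with parameters flattened into vectors."""
    u, v = theta1 - theta0, theta2 - theta0
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def interpolation_advantage(acc_of, theta1, theta2):
    """Accuracy of the weight average minus the mean endpoint accuracy.
    `acc_of` is a hypothetical callback evaluating parameters on heldout data."""
    return acc_of((theta1 + theta2) / 2) - (acc_of(theta1) + acc_of(theta2)) / 2
```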
Finally, in Appendix C we ask the question: for a one dimensional grid of hyperparameter configurations h₁, …, h_m, how does averaging the two models finetuned with the endpoint configurations h₁ and h_m compare with picking the best individual model finetuned with any configuration in the grid? The hyperparameters we vary are optimizer, augmentation, and learning rate. For the vast majority of grid searches, the average of the endpoints outperforms the best individual model in the grid.
3.3 Model soups
With the gains of averaging two finetuned models in mind, we turn our attention to averaging many models with different hyperparameters: this section presents our main results, which show that averaging finetuned models can be used as an alternative to the conventional procedure of selecting the single model which performs best on the heldout validation set. We explore CLIP [radford2021learning] and ALIGN [jia2021scaling] finetuned on ImageNet [deng2009imagenet] and WILDS [wilds2021] (Section 3.3.1), ViTG pretrained on JFT3B [zhai2021scaling] and finetuned on ImageNet (Section 3.3.2), and transformer models finetuned on text classification tasks (Section 3.3.3). Appendix E
additionally explores CLIP ViTL finetuned on CIFAR10 and ImageNet22kpretrained ViTB/32 finetuned on ImageNet.
3.3.1 Finetuning CLIP and ALIGN
We begin our study of model soups by considering two pretrained models, CLIP ViTB/32 and ALIGN EfficientNetL2, and performing a hyperparameter sweep for finetuning each model on ImageNet. For CLIP we use a random hyperparameter search over learning rate, weight decay, training epochs, label smoothing, and data augmentation, obtaining 72 finetuned models (details in Appendix
B.3.1). For ALIGN we use a grid search over learning rate, data augmentation, and mixup, obtaining 12 finetuned models (details in Appendix B.3.2). To form our greedy soups, we sort models in order of decreasing accuracy on the heldout validation set before applying Recipe 1. For both CLIP and ALIGN, the greedy soup selects 5 models. Figures 1 and 4 show the performance of the resulting models and their uniform and greedy soups for CLIP and ALIGN. The greedy soup improves over the best model in the hyperparameter sweep by 0.7 and 0.5 percentage points, respectively.

Method  ImageNet acc.  Distribution shifts acc.
Best individual model on ImageNet  80.38  47.83 
Second best individual model on ImageNet  79.89  43.87 
Uniform soup  79.97  51.45 
Greedy soup (decreasing order of heldout val accuracy)  81.03  50.75 
Greedy soup (random order)  
Learned soup  80.89  51.07 
Learned soup (by layer)  81.37  50.87 
Ensemble  81.19  50.77 
Greedy ensemble  81.90  49.44 
For “Greedy soup (random order)”, we try three random orders and report mean and standard deviation. The “Learned soup” and its variants are described in Appendix B.1.

Furthermore, we show that the greedy soup requires fewer models to reach the same accuracy as selecting the best individual model on the heldout validation set. We consider an additional setting where we prepare a sequence of soups by sequentially adding CLIP models from the hyperparameter sweep in random order. Figure 5 shows the performance of the uniform and greedy soup, as well as the best single model so far and a logit ensemble, as a function of the number of models considered. For essentially any number of models, the greedy soup outperforms the best single model on both the ImageNet and outofdistribution test sets; the greedy soup is better than the uniform soup on ImageNet and comparable to it outofdistribution. The logit ensemble is better than the greedy soup on ImageNet, but worse outofdistribution.
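The greedy procedure (Recipe 1) used throughout can be sketched in code, assuming parameters stored as dicts of numpy arrays and a hypothetical `val_acc` callback that evaluates a parameter dict on the heldout validation set:

```python
import numpy as np

def average(models):
    """Elementwise average of parameter dicts."""
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}

def greedy_soup(models, val_acc):
    """Recipe 1 sketch: sort candidates by heldout validation accuracy, then
    keep each one only if adding it to the running average does not reduce
    that accuracy (whether ties are kept is a minor implementation choice)."""
    models = sorted(models, key=val_acc, reverse=True)
    ingredients = [models[0]]
    best = val_acc(average(ingredients))
    for m in models[1:]:
        candidate = average(ingredients + [m])
        if val_acc(candidate) >= best:
            ingredients.append(m)
            best = val_acc(candidate)
    return average(ingredients)
```

Because candidates are sorted by validation accuracy first, the result is never worse than the best individual model on the heldout set.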
Table 3 lists the performance of the CLIP soups and baselines described above, as well as additional soup variants described in Appendix B.1.
To further establish the generality of the model soup, we replicate the CLIP hyperparameter sweep experiment on two image classification tasks from WILDS [wilds2021], namely FMoW [christie2018functional] and iWildCam [beery2021iwildcam]; Figure 6 shows results qualitatively similar to our ImageNet experiment, and Appendix B.3.1 describes experimental details.
We report several additional variants and baselines for the experiment described above. In Appendix D we present results for different hyperparameter sweeps and finetuning initializations, when finetuning CLIP on ImageNet. For instance, we try a standard grid search which is similar to the grid search described for ALIGN above, and an extreme grid search which includes solutions finetuned with extreme hyperparameters that result in badly performing models (details in Appendix B.3.1). Moreover, Appendix G compares model soups with additional baselines, including distillation from an ensemble as in [hinton2014dark], models which apply weightaveraging along their trajectory, and sharpness aware minimization [foret2021sharpnessaware].
We highlight a few interesting takeaways from these experiments: (1) The greedy soup outperforms the best individual model—with no extra training and no extra compute during inference, we were able to produce a better model. (2) While the uniform soup can outperform the best individual model, we only observe this when all individual models achieve high accuracy (e.g., when finetuning ALIGN in Figure 1); unlike the examples in Figure 2, there can be an error barrier between finetuned models. We mainly observe this when finetuning with high learning rates (this is illustrated in Appendix B.2, Figure B.1). However, these high learning rate models also have a lower accuracy, and are therefore excluded by the greedy soup.
3.3.2 Finetuning a ViTG model pretrained on JFT3B
To test whether the gains obtained by model soups are additive with other techniques used to obtain stateoftheart models, we applied our greedy soup technique to 58 ViTG/14 models finetuned on ImageNet. We vary the learning rate, decay schedule, loss function, and minimum crop size in the data augmentation, and optionally apply RandAugment
[cubuk2020randaugment], mixup [zhang2017mixup], or CutMix [yun2019cutmix]. We also train four models with sharpnessaware minimization (SAM) [foret2021sharpnessaware]. For further details of our hyperparameter sweep, see Appendix B.3.3. For each model training run, we save exponential moving averages (EMA) of the weights [szegedy2016rethinking] computed with decay factors of 0.999 (low EMA) and 0.9999999 (high EMA). Whereas high EMA generally provides the best accuracy over the course of an individual training run, both greedy soup and greedy ensembling obtain higher validation accuracy when applied to parameters with low EMA. We report the highest single model accuracy numbers obtained with either EMA decay value, but perform greedy soup and ensembling with models trained with EMA decay of 0.999. For each combination of training run and EMA decay rate, we evaluate accuracy on our held out validation set every 1000 steps. We use these accuracy values to pick the best checkpoint for ensembling, souping, and subsequent evaluation.

In Table 4, we report results on the ImageNet validation set and the five distribution shift datasets studied above as well as two relabeled ImageNet validation sets, ReaL [beyer2020we] and multilabel [shankar2020evaluating]. Our greedy soup procedure selects 14 of the 58 models finetuned as part of our hyperparameter sweep, and this soup performs statistically significantly better than the best individually finetuned model selected based on our held out validation set on all datasets except for ObjectNet. Even when we give an unfair advantage to individually finetuned models by selecting them based on their performance on each test set (denoted “oracle” in Table 4), the greedy soup, which was selected using only indistribution data, remains superior on most datasets.
Only on ReaL and ObjectNet does there exist an individual model that performs statistically significantly better than the soup, and the best model differs between those two datasets. Greedy ensembling performs similarly to the greedy soup in terms of ImageNet top1 and multilabel accuracy, and slightly better on ReaL, but significantly worse on all distribution shift datasets except for ImageNetV2. Thus, greedy soup can provide additional gains on top of standard hyperparameter tuning even in the extremely high accuracy regime.
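The weight EMA described above can be sketched as a single update step, again assuming parameters stored as dicts of numpy arrays (a hypothetical format):

```python
import numpy as np

def update_ema(ema, weights, decay):
    """One exponential-moving-average step over a training run's weights,
    as used with decay factors of 0.999 (low EMA) and 0.9999999 (high EMA).
    Higher decay means the average changes more slowly."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in ema}
```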
ImageNet  Distribution shifts  
Method  Top1  ReaL  Multilabel  INV2  INR  INSketch  ObjectNet  INA  Avg shifts 
ViT/G14 [zhai2021scaling]  90.45  90.81  –  83.33  –  –  70.53  –  – 
CoAtNet7 [dai2021coatnet]  90.88  –  –  –  –  –  –  –  – 
Our models/evaluations based on ViTG/14:  
ViT/G14 [zhai2021scaling] (reevaluated)  90.47  90.86  96.89  83.39  94.38  72.37  71.16  89.00  82.06 
Best model on held out val set  90.72  91.04  96.94  83.76  95.04  73.16  78.20  91.75  84.38 
Best model on each test set (oracle)  90.78  91.78  97.29  84.31  95.04  73.73  79.03  92.16  84.68 
Greedy ensemble  90.93  91.29  97.23  84.14  94.85  73.07  77.87  91.69  84.33 
Greedy soup  90.94  91.20  97.17  84.22  95.46  74.23  78.52  92.67  85.02 
Model  Method  MRPC  RTE  CoLA  SST2 
BERT [devlin2019bert]  Best individual model  88.3  61.0  59.1  92.5 
Greedy soup  88.3 (+0.0)  61.7 (+0.7)  59.1 (+0.0)  93.0 (+0.5)  
T5 [raffel2020t5]  Best individual model  91.8  78.3  58.8  94.6 
Greedy soup  92.4 (+0.6)  79.1 (+0.8)  60.2 (+0.4)  94.7 (+0.1) 
3.3.3 Finetuning on text classification tasks
To test whether the gains obtained by model soups extend to domains beyond image classification, we conduct preliminary experiments with natural language processing (NLP). While more investigation is warranted to establish the applicability of model soups for NLP, we believe our experiments are a promising initial step. In particular, we finetune BERT [devlin2019bert] and T5 [raffel2020t5] models on four text classification tasks from the GLUE benchmark [wang2018glue]: MRPC [dolan2005automatically], RTE [dagan2005pascal, bar2006second, giampiccolo2007third, bentivogli2009fifth], CoLA [warstadt2018neural] and SST2 [socher2013recursive], as in [dodge2020fine]. We use the standard metric for each dataset: average of accuracy and F1 score for MRPC, accuracy for RTE, Matthews correlation for CoLA [matthews1975comparison] and accuracy for SST2. More details are provided in Appendix H.
We finetune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs and random seed. Table 5 reports the corresponding metric on the validation set for BERTbase uncased [devlinetal2019bert] and T5base [raffel2020t5]. Additional experimental details and results for more models are provided in Appendix I. While the improvements are not as pronounced as in image classification, the greedy soup can improve performance over the best individual model in many cases.
3.4 Robust finetuning
wortsman2021robust introduce WiSEFT, a method for improving the robustness of a finetuned model θ by linearly interpolating θ with its initialization θ₀. An intriguing observation was that, once the data augmentation is fixed, interpolating between θ₀ and θ often traces a similar curve regardless of hyperparameters (this is visible in Figure 7, right, where different data augmentations are shown with different colors; in Figure 7, left, many different methods of data augmentation appear, as we conduct a random hyperparameter search). In other words, a reasonable hypothesis was that this curve is Pareto optimal—no hyperparameter configuration would surpass it. In Figure 7, we trace the curves obtained when interpolating between θ₀ and θ for a random hyperparameter search (left) and the standard grid search described in Appendix B.3.1 (right) when finetuning CLIP ViTB/32. We find that the uniform soup and greedy soup lie beyond these interpolation curves. Moreover, we find that interpolating between these soups and the initialization also provides additional accuracy improvements on the distribution shifts.
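Interpolating toward the initialization can be sketched as follows, assuming parameter dicts of numpy arrays; sweeping the mixing coefficient traces the curves in Figure 7:

```python
import numpy as np

def interpolate(theta_zeroshot, theta_finetuned, alpha):
    """WiSE-FT-style interpolation between the initialization and a finetuned
    (or souped) model: alpha = 0 returns the zeroshot model, alpha = 1 the
    finetuned one, and intermediate values trade off the two."""
    return {k: (1.0 - alpha) * theta_zeroshot[k] + alpha * theta_finetuned[k]
            for k in theta_zeroshot}
```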
3.5 Crossdataset soups
So far, our experiments have studied soups of models finetuned on the same dataset with different hyperparameters. In this section, we prepare soups containing models finetuned on different datasets. We evaluate the resulting soups on a heldout dataset, from which no labeled training data is used (i.e., zeroshot evaluation).
Concretely, we consider soups based on the CLIP zeroshot initialization along with six models finetuned independently on CIFAR10 [krizhevsky2009learning], Describable Textures [dtd], Food101 [food101], SUN397 [sun397], Stanford Cars [cars] and ImageNet [deng2009imagenet]. We evaluate on CIFAR100 [krizhevsky2009learning], which does not share classes with CIFAR10. Since each task has a different set of classes, the last layers cannot be part of the soup. Hence, during finetuning, we freeze the linear head produced by CLIP’s text tower so that taskspecific learning is captured only in the backbone weights. At test time, we use the “backbone soup” with a zeroshot head constructed from CLIP’s text tower and the CIFAR100 class names with the promptensembling used for ImageNet by radford2021learning. Figure 8 (left) shows that a model soup containing models trained on each of these datasets and the zeroshot model improves zeroshot performance on CIFAR100 by 6.4 percentage points over the CLIP baseline. Moreover, Figure 8 (right) shows that the choice of which finetuned models to include can have a substantial impact on the accuracy of the resulting soup. See Appendix B.4 for additional details.

4 Analytically comparing soups to ensembles
The goal of this section is to obtain complementary analytical insight into the effectiveness of model soups. For simplicity, we consider a soup consisting of only two models with parameters θ₀ and θ₁. For weighting parameter α ∈ [0, 1] we let θ_α = (1 − α)θ₀ + αθ₁ denote the weightaveraged soup. We would like to understand when the soup error, Err(θ_α), would be lower than the best of both endpoints, min{Err(θ₀), Err(θ₁)}.

Note that convexity of Err(θ_α) in α does not by itself imply superiority of the soup to both endpoints, as the minimum of Err(θ_α) over α may be obtained at the endpoints even when it is convex. To get further leverage on the problem, we compare the soup to the logitlevel ensemble, whose output is the corresponding average of the endpoint logits, (1 − α)f(x; θ₀) + αf(x; θ₁). The rich literature on ensembles (see Sec. 6) tells us that the expected error of the ensemble is often strictly below min{Err(θ₀), Err(θ₁)} for neural networks. Therefore, whenever the soup behaves similarly to the ensemble, we expect the soup to outperform both endpoint models.
To analytically compare the soup and the ensemble, we replace the 01 loss with a differentiable surrogate. Specifically, we consider the crossentropy loss. We let ℓ̄_soup(α) denote the calibrated expected loss of the soup, and similarly define ℓ̄_ens(α) for the ensemble. We derive the following approximation for the loss difference:

ℓ̄_soup(α) − ℓ̄_ens(α) ≈ −(α(1 − α)/2) · (d²/dα²) ℓ̄_soup(α) + (α(1 − α)/2) · E_x[ Var_{y∼p_α(x)}(δ(x)_y) ]   (1)

where p_α(x) is the standard “softmax” distribution induced by the soup’s logits and δ(x) = f(x; θ₁) − f(x; θ₀) is the difference between the endpoint logits. We obtain our approximation in the regime where the logits are not too far from linear; see Appendix F.3 for a detailed derivation.
The first term in approximation (1) is negatively proportional to the second derivative of the loss along the trajectory: when the approximation holds, convexity of the loss indeed favors the soup. However, the second term in the approximation does not follow from the “convex basin” intuition. This term always favors the ensemble, but is small in one of two cases: (a) the somewhat trivial case when the endpoint models are similar (so that is small) and (b) when the soup produces confident predictions, implying that is close to a point mass and consequently the variance term is small.
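The variance term can be made concrete for a single input. The sketch below uses hypothetical three-class endpoint logits and approximates the soup's softmax by the softmax of the averaged logits (an assumption that is reasonable only in the near-linear regime the approximation itself requires):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def variance_term(f0, f1, alpha=0.5):
    """Var_{y ~ p}(delta_y), where delta = f1 - f0 is the logit difference
    and p stands in for the soup's softmax via the averaged logits.
    It vanishes when the endpoints agree (delta constant across classes) or
    when the soup is confident (p close to a point mass)."""
    delta = f1 - f0
    p = softmax((1.0 - alpha) * f0 + alpha * f1)
    mean = p @ delta
    return p @ (delta - mean) ** 2
```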
To test our approximation, we evaluate it over a set of finetuned models with different learning rates, augmentation strategies, random seeds, and values of α. We calibrate the soup model, and find that calibration improves the ability of our approximation to predict the soup/ensemble error difference; see Appendix F.4 for a detailed description of our setup.
Figure 9 summarizes the results of our empirical evaluations. When excluding the highest learning rate (center and right panels), we see that the approximation is strongly correlated with both the true difference in loss as well as the difference in error, and the approximation and true loss difference generally agree in sign. (Models finetuned with this learning rate are far in weight space from the initial model and are often rejected when forming greedy soups, so we do not expect our approximation to be tight for them.) Additional details are provided in Appendix F.
5 Scope and limitations
While this work has so far demonstrated that averaging many finetuned models is a useful technique for improving accuracy, this section explores two limitations of the approach. The first is the applicability of model soups, and the second is the failure of model soups to substantially improve calibration.
Applicability. So far our experiments have mainly explored models pretrained on large, heterogeneous datasets. In Appendix E we also explore model soups for an ImageNet22k pretrained model. While the greedy soup still provides improvements on ImageNet, these improvements are less substantial compared to those observed when finetuning CLIP and ALIGN.
Calibration. While ensembles improve model calibration [guo2017calibration, roelofs2020mitigating], model soups do not have the same effect. As hyperparameters can also have an effect on calibration, we consider the ensemble and soup of 20 models which are identical other than random seed. Results are illustrated in Figure A.1 using the calibration metrics of roelofs2020mitigating.
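As a concrete example of one standard calibration metric (expected calibration error; roelofs2020mitigating study several variants and refinements), a minimal sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average the gap between each bin's
    mean confidence and its empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```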
6 Related work
Averaging model weights.
Averaging the weights of models is a popular approach in convex optimization and deep learning. Most applications study models along the same optimization trajectory, e.g.
[ruppert1988efficient, polyak1990new, szegedy2016rethinking, izmailov2018averaging, zhang2019lookahead]. By contrast, frankle2020linear, neyshabur2020being and matena2021merging weightaverage models which share an initialization but are optimized independently. frankle2020linear find that, when training a pair of models from scratch with the same hyperparameter configuration but different data order, interpolating weights achieves no better than random accuracy. However, if the two models share a portion of their optimization trajectory, accuracy does not drop when they are averaged. Analogously, neyshabur2020being demonstrate that when two models are finetuned with the same pretrained initialization, the interpolated model attains at least the accuracy of the endpoints. Unlike frankle2020linear and neyshabur2020being, we consider averaging many models with varied hyperparameter configurations.

matena2021merging merge models with the same pretrained initialization that are finetuned on different text classification tasks. They also propose Fisher information as an alternative technique for model merging. Unlike the experiments in our Section 3.5, matena2021merging use data from the target distribution for finetuning. Moreover, wortsman2021robust average zeroshot and finetuned models, finding improvements both indistribution and outofdistribution. In contrast to wortsman2021robust, we average models across many independent runs, which provides more substantial improvements indistribution.
Stochastic Weight Averaging (SWA) [izmailov2018averaging], which averages weights along a single optimization trajectory, is also motivated by the relation between ensembling model outputs and averaging model weights. In contrast, the averaging we propose is across independent runs. Moreover, while their analysis relates the averaged network outputs (i.e., the logit ensemble) to the output of a network with the averaged weights, our analysis (Section 4) goes a step further and relates the classification losses associated with these two vectors.
Pretraining and finetuning.
In computer vision and natural language processing, the best performing models are often pretrained on a large dataset before being finetuned on data from the target task
[donahue2014decaf, yosinski2014transferable, sharif2014cnn, girshick2014rich, mahajan2018exploring, kornblith2019better, yalniz2019billion, kolesnikov2020big, bommasani2021opportunities]. This paradigm is also referred to as transfer learning. Recently, imagetext pretraining has become increasingly popular in computer vision as a pretraining task
[radford2021learning, jia2021scaling, mu2021slip, pham2021scaling]. Recent work has explored alternative strategies for adapting these models to specific target tasks [coop, gao2021clip, zhang2021tip], for instance via a lightweight residual feature adapter. In contrast, our work explores standard end-to-end finetuned models. Other work has attempted to improve transfer learning by regularizing models toward their initialization [xuhong2018explicit], choosing layers to tune on a per-example basis [guo2019spottune], reinitializing layers over the course of training [li2020rifle], or using multiple pretrained models with data-dependent gating [shu2021zoo].

Ensembles.
Combining the outputs of many models is a foundational technique for improving the accuracy and robustness of machine learning models [dietterich2000ensemble, bauer1999empirical, breiman1996bagging, friedman2001elements, deepensembles, FREUND1997119]. ovadia2019can show that ensembles exhibit high accuracy under distribution shift. mustafa2020deep propose a method for identifying subsets of pretrained models for finetuning and later ensembling them, finding strong in-distribution accuracy and robustness to distribution shift. gontijo2021no conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to uncorrelated errors and better ensemble accuracy. Finally, previous work has explored building ensembles of models produced by hyperparameter searches [snoek2015scalable, mendoza2016towards, saikia2020optimized], including greedy selection strategies [caruana2004ensemble, caruana2006getting, levesque2016bayesian, wenzel2020hyperparameter]. Importantly, ensembles require a separate inference pass through each model, which increases computational costs. When the number of models is large, this can be prohibitively expensive. Unlike ensembles, model soups require no extra compute at inference time.
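The inference-cost contrast can be made concrete: an output (logit) ensemble must run a forward pass through every member at test time, whereas a soup is a single model. A toy sketch, where the linear "models" and the call counter are hypothetical stand-ins for real networks:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def logit_ensemble(models, x):
    """Average the logits of several models: one forward pass per member."""
    outputs = [model(x) for model in models]   # k forward passes
    k = len(models)
    return [sum(column) / k for column in zip(*outputs)]

# Toy "models": each is just a forward function mapping x to two logits.
calls = {"n": 0}
def make_model(w):
    def forward(x):
        calls["n"] += 1
        return [wi * x for wi in w]
    return forward

models = [make_model([1.0, 0.0]), make_model([0.0, 1.0])]
avg_logits = logit_ensemble(models, 2.0)
probs = softmax(avg_logits)
assert calls["n"] == len(models)  # inference cost grows with ensemble size
```

A soup, by contrast, is one set of averaged weights, so inference is a single forward pass no matter how many ingredients went in.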
7 Conclusion
Our results challenge the conventional procedure of selecting the best model on the heldout validation set when finetuning. With no extra compute during inference, we are often able to produce a better model by averaging the weights of multiple finetuned solutions.
Acknowledgements
We thank Ting Chen, Jesse Dodge, Ben Eysenbach, David Fleet, Pieter-Jan Kindermans, Mohammad Norouzi, Sarah Pratt and Vivek Ramanujan for helpful discussions and draft feedback, Lucas Beyer and Xiaohua Zhai for assistance with ViTG/14 finetuning, and Hyak at UW for computing support.
YC was supported in part by the Israeli Science Foundation (ISF) grant no. 2486/21, the Len Blavatnik and the Blavatnik Family Foundation, and the Yandex Initiative for Machine Learning.
References
Appendix A Additional Figures
Appendix B Experimental Details
b.1 Learned soup
In addition to the greedy soup method described in the text, we also explore a more advanced souping procedure, which removes the sequential constraint from the greedy soup and requires only a single pass through the heldout validation set. We refer to this method as the learned soup, as it involves learning the soup mixing coefficients for each of the ingredients on the heldout validation set. However, the learned soup has the downside of requiring all models to be simultaneously loaded in memory. In practice, we combine the models on CPU before moving the parameters to GPU for each batch. For a given loss and heldout validation set, we find the mixing coefficients and a temperature scaling parameter via
(2) 
In practice we find better results when the mixing coefficients are parameterized as the output of a softmax, so that each coefficient is positive and they sum to one. We optimize the aforementioned equation with gradient-based minibatch optimization for three epochs over the heldout validation set with the AdamW optimizer and a constant learning rate of 0.1.
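The softmax parameterization can be sketched as follows. This is a minimal plain-Python illustration with toy two-ingredient checkpoints; a real implementation would optimize the underlying parameters with AdamW on the heldout set as described:

```python
import math

def softmax(beta):
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    s = sum(exps)
    return [e / s for e in exps]

def mix_parameters(checkpoints, beta):
    """Combine ingredient checkpoints with softmax-normalized coefficients.

    Parameterizing the mixing weights as alpha = softmax(beta) guarantees
    that each coefficient is positive and that they sum to one, as described
    above. `checkpoints` maps parameter names to flat lists of floats.
    """
    alpha = softmax(beta)
    return {
        k: [sum(a * ckpt[k][i] for a, ckpt in zip(alpha, checkpoints))
            for i in range(len(checkpoints[0][k]))]
        for k in checkpoints[0]
    }

ingredients = [{"w": [0.0, 0.0]}, {"w": [1.0, 1.0]}]
mixed = mix_parameters(ingredients, beta=[0.0, 0.0])  # equal coefficients
assert mixed["w"] == [0.5, 0.5]
```

Because the optimizer only ever touches the unconstrained parameters, no projection step is needed to keep the mixture valid.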
As presented in Table 3, we also try a “by layer” variant of the learned soup. For this we learn a separate set of mixing coefficients for each layer of the network.
b.2 Error landscape visualizations
To supplement Figure 2, we provide an identical experiment but with a 10x larger learning rate instead of 10x smaller. Results are illustrated in Figure B.1 with linear instead of log scaling for the contour lines; since the error difference is more substantial, linear scaling is clearer. When finetuning with a larger learning rate, error increases on the path between the two finetuned solutions. All error landscape visualizations use CLIP ViTB/32 finetuned on ImageNet for 10 epochs with minimal data augmentation, as used by CLIP during pretraining. When computing angles between the two finetuned solutions, as in Figure 4, we use the repeated weights which constitute the majority of the network parameters. We ignore gain terms, which tend to skew positive if occurring before ReLU activations.
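As a sketch, the angle between two fine-tuned solutions can be computed from their displacement vectors relative to the shared initialization. The flattened toy parameter lists below are illustrative; the restriction to repeated weights and the exclusion of gain terms are omitted here:

```python
import math

def angle_at_init(theta_init, theta_a, theta_b):
    """Angle (degrees) at the initialization between two fine-tuned solutions.

    Computes the angle between the displacement vectors theta_a - theta_init
    and theta_b - theta_init, for parameters flattened into plain lists.
    """
    u = [a - i for a, i in zip(theta_a, theta_init)]
    v = [b - i for b, i in zip(theta_b, theta_init)]
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp for float safety
    return math.degrees(math.acos(cos))

# Orthogonal displacements from the initialization give a 90 degree angle.
init = [0.0, 0.0]
assert abs(angle_at_init(init, [1.0, 0.0], [0.0, 1.0]) - 90.0) < 1e-9
```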
b.3 Model soups
This section describes the set of hyperparameters used for searches. For all ImageNet experiments, we withhold 2% of the training set and use these examples as the heldout validation set for model selection in greedy and learned soup.
b.3.1 CLIP experiments
Unless otherwise mentioned, all experiments used the AdamW optimizer [loshchilov2018decoupled] with cosine annealing learning rate schedule [loshchilov2016sgdr] for 10 epochs at batch size 512. When necessary we discretize augmentation strength into minimal, medium, and strong. Minimal augmentation uses only a random crop consisting of 90%100% of the total image area. Medium is the default augmentation used by the timm library [rw2019timm]. Strong refers to RandAugment [cubuk2020randaugment] (, ).
We now provide the low-level details for the hyperparameter searches, which are standard grid, extreme grid, and random search. The standard grid includes learning rates , where typically perform the best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at . We consider all combinations of the above, running each hyperparameter configuration with two random seeds.
The extreme grid considers learning rates , where typically perform the best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at . Moreover, we include the initialization in this search, which often outperforms some of the extreme learning rates but is far from the most accurate model.
The random search chooses learning rate where is selected uniformly at random from 4 to 6. Weight decay is chosen randomly as where
is selected uniformly at random from 0.2 to 4. With probability 0.5, label smoothing is set to 0 and otherwise it is selected uniformly at random between 0 and 0.25. Finetuning epochs are chosen randomly between four and sixteen. Mixup is 0 with probability 0.5, and otherwise is chosen uniformly at random from 0 to 0.9. With probability
we use minimal augmentation, otherwise we use RandAugment where the magnitude and the number of operations are chosen uniformly at random between 0 and 20 and between 0 and 2, respectively.

When finetuning on WILDS-FMoW and WILDS-iWildCam for Figure 6, we use the same random search as when we finetune CLIP on ImageNet. The only difference is that we are able to use a larger ViTL/14 model as the datasets are smaller. This also requires us to change the default batch size from 512 to 128.
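The random search above can be summarized as a sampler. In this sketch, the bases of the log-uniform draws and the probability of choosing minimal augmentation are elided in the text, so base 10 and probability 0.5 are assumptions:

```python
import random

def sample_hyperparameters(rng):
    """Sample one configuration from the random search described above.

    Assumptions (elided in the text): log-uniform draws use base 10, and
    minimal augmentation is chosen with probability 0.5.
    """
    config = {
        "lr": 10 ** -rng.uniform(4, 6),        # log-uniform in [1e-6, 1e-4]
        "weight_decay": 10 ** -rng.uniform(0.2, 4),
        "label_smoothing": 0.0 if rng.random() < 0.5 else rng.uniform(0, 0.25),
        "epochs": rng.randint(4, 16),
        "mixup": 0.0 if rng.random() < 0.5 else rng.uniform(0, 0.9),
    }
    if rng.random() < 0.5:                     # assumed probability
        config["augmentation"] = "minimal"
    else:
        config["augmentation"] = ("randaug",
                                  rng.uniform(0, 20),  # magnitude
                                  rng.randint(0, 2))   # number of operations
    return config

cfg = sample_hyperparameters(random.Random(0))
assert 1e-6 <= cfg["lr"] <= 1e-4 and 4 <= cfg["epochs"] <= 16
```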
b.3.2 ALIGN experiments
We finetuned ALIGN EfficientNetL2 models using AdamW with weight decay of 0.1 at a resolution of for 25 epochs, with the final layer initialized from a linear probe without data augmentation. We finetuned 5 models with standard Inceptionstyle random crops (consisting of 5% to 100% of the total image area with an aspect ratio between 0.75 and 1.33) and different learning rates (, , , , and ). We also finetuned 7 additional models at a learning rate of with different data augmentation strategies. Specifically, we varied the random cropping strategy (either Inceptionstyle crops or less aggressive crops consisting of 90% to 100% of the total image area with an aspect ratio between 0.95 and 1.05), the use of RandAugment [cubuk2020randaugment] (off or , ), and the use of mixup [zhang2017mixup] (off or ) and trained models with all combinations of these strategies. Our soups are obtained by considering these 12 models as well as the linear probe initialization. We perform evaluation at resolution using a square center crop from images. The accuracy we attain with greedy soup approaches that reported by jia2021scaling, which evaluated at resolution.
b.3.3 ViTG/14 experiments
These models are initialized with a backbone that was pretrained on the JFT3B dataset [zhai2021scaling] and linear probes obtained at either the resolution at which the ViTG/14 was pretrained or at the resolution used for finetuning. Models are finetuned at a batch size of 512 for either 10,000 or 20,000 steps (approximately 4 or 8 epochs) using the Adafactor optimizer [shazeer2018adafactor] with learning rates of 3e-5 or 5e-5; a constant or cosine decay learning rate schedule; and softmax or binary crossentropy loss. When finetuning with binary crossentropy loss, we use a linear probe that is also trained with binary crossentropy loss. We vary data augmentation, applying RandAugment [cubuk2020randaugment], mixup [zhang2017mixup], or CutMix [yun2019cutmix] of varying strengths and random cropping with a minimum crop size of 5%, 70%, 90%, or 100% of the full image. When applying SAM, we consider models with perturbations either synchronized or unsynchronized across accelerators, including one model with synchronized perturbations and a combination of CutMix and SAM. All models are finetuned at resolution and evaluated by rescaling test images to (without preserving the aspect ratio) and taking a central crop.
We manually tuned hyperparameters with the goal of maximizing singlemodel accuracy. After settling on the use of Adafactor as the optimizer, we included all subsequently trained models in the pool of models to be used for greedy soup. The model that performs best on the holdout set is initialized with a linear probe and finetuned with a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax crossentropy loss, a minimum crop size of 90%, and CutMix with . The model that performs best on the official ImageNet validation set is initialized with a linear probe and finetuned at a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax crossentropy loss, a minimum crop size of 90%, CutMix with , and SAM. The greedy soup contains models trained with a wide range of different hyperparameter values including different learning rates, linear probes, loss functions, and every form of data augmentation and minimum crop size investigated. Notably, although models trained with SAM with synchronized perturbations are included in the greedy soup, the greedy soup process skips over the models trained with SAM with unsynchronized perturbations because adding them produces a large drop in holdout accuracy.
b.4 Crossdataset soups details
This section provides additional details for the findings presented in Section 3.5. When finetuning we initialize with CLIP ViTB/32 and use learning rate for 10 epochs with minibatch size of 512. We train with minimal augmentation.
Appendix C Analysis of 1D hyperparameter grids
This section asks: for a one-dimensional grid of hyperparameter configurations, how does averaging the two models finetuned with the configurations at the endpoints of the grid compare with picking the best individual model from the grid?
The results are illustrated in Figure C.1, where each square represents a grid . The average of the endpoints often outperforms the best individual model in the grid. A notable exception is when the learning rate is the left endpoint of the grid. As this experiment uses AdamW, this learning rate is too high for finetuning and, unlike the examples in Figure 2, there is a high error barrier between the two finetuned solutions (see Figure B.1, lower right for example).
When varying optimizer we use minimal data augmentation and LR
for RMSProp
[rmsprop], Adam [kingma2014adam], and AdamW [loshchilov2018decoupled]. SGD requires a larger learning rate, and so we use . When varying augmentation strength, we use minimal data augmentation and LR .

Appendix D Additional grid searches and initializations
This section recreates Figure 5 with different initializations (linear probe initialization or zeroshot) and different grid searches (standard and extreme grid) when finetuning CLIP ViTB/32. The standard and extreme grid searches are described in Section B.3.1.
Figure D.1 considers the linear probe (LP) initialization and the standard grid. Figure D.2 considers the linear probe (LP) initialization and the extreme grid. Figure D.3 considers the zeroshot initialization and the standard grid. Figure D.4 considers the zeroshot initialization and the extreme grid.
Appendix E Additional finetuning and pretraining datasets
In this section we explore finetuning or pretraining on additional datasets. Figure E.1 displays results for finetuning a CLIP ViTL model on CIFAR10 [krizhevsky2009learning]. One axis of Figure E.1 displays accuracy on CIFAR10.1 [pmlrv97recht19a], a reproduction of CIFAR10 with a distribution shift. The individual models are finetuned with the random hyperparameter search described in Section B.3.1.
Next, Figure E.2 shows results for a ViTB/32 [dosovitskiy2021an] model pretrained on ImageNet22k [deng2009imagenet] and finetuned on ImageNet. This differs from many of our other experiments as the dataset used for pretraining is smaller and less diverse. While the greedy soup offers an improvement, the improvement is less substantial than, e.g., Figure 1, which uses the same model and hyperparameter search but a different pretraining dataset.
Finally, we finetune a ViTB/32 model five times on ImageNet, using the best hyperparameters found by the hyperparameter sweep, varying only the random seed. This experiment is conducted both for a model pretrained on ImageNet22k [deng2009imagenet] and a pretrained CLIP model. The results are shown in Figure E.3, comparing, for an experimental budget of k models: (i) the individual model with random seed k, (ii) the model soup composed of models with random seeds 1 through k, and (iii) the ensemble composed of models with random seeds 1 through k. The performance of the model soup appears correlated with the performance of the ensemble, as we find that CLIP models are more amenable to both ensembling and souping than models pretrained on ImageNet22k.
Appendix F Analytical comparison details
f.1 Notation and preliminaries
We begin by restating and adding to the notation used in Section 4. For a model with parameter vector $\theta$ and input $x$, we let $f(x;\theta)\in\mathbb{R}^k$ denote the model's logit output for $k$-way classification. Throughout, we fix two endpoint models $\theta_0$ and $\theta_1$, and for an interpolation parameter $\alpha\in[0,1]$ define
$$\theta_\alpha := (1-\alpha)\theta_0 + \alpha\theta_1 \quad\text{and}\quad f_{\mathrm{soup}}^{\alpha}(x) := f(x;\theta_\alpha)$$
to be the “soup” weight-averaged model and its corresponding logits. We also write
$$f_{\mathrm{ens}}^{\alpha}(x) := (1-\alpha)\, f(x;\theta_0) + \alpha\, f(x;\theta_1)$$
for the logits of the ensemble model. We write $\Delta := \theta_1 - \theta_0$ for the difference of the two endpoints.
For a logit vector $z\in\mathbb{R}^k$ and a ground-truth label $y\in[k]$, denote the cross-entropy loss by
$$\ell(z; y) := -z_y + \log\Big(\sum_{j\le k} e^{z_j}\Big).$$
For some distribution $\mathcal{D}$ over $(x,y)$ we write the expected calibrated log losses of the soup and ensemble as
$$L_{\mathrm{soup}}(\alpha;\beta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big(\beta f_{\mathrm{soup}}^{\alpha}(x); y\big) \quad\text{and}\quad L_{\mathrm{ens}}(\alpha;\beta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big(\beta f_{\mathrm{ens}}^{\alpha}(x); y\big),$$
respectively, where $\beta>0$ is an inverse-temperature calibration parameter.
We have the following expressions for the derivatives of the cross-entropy loss with respect to the logits. The gradient is
$$\nabla_z \ell(z; y) = p(z) - e_y,$$
where $e_y$ is the $y$th standard basis vector and $p(z)$ is the softmax distribution with $p_i(z) = e^{z_i} / \sum_{j\le k} e^{z_j}$ in its $i$th entry. The Hessian is
$$\nabla_z^2 \ell(z; y) = \mathrm{diag}(p(z)) - p(z)p(z)^\top,$$
so that for any $v\in\mathbb{R}^k$, we have
$$v^\top \nabla_z^2 \ell(z; y)\, v = \mathbb{E}_{i\sim p(z)}\big[v_i^2\big] - \big(\mathbb{E}_{i\sim p(z)}[v_i]\big)^2 = \mathrm{Var}_{i\sim p(z)}[v_i].$$
Finally, we use $f'(x)$ to denote a vector in $\mathbb{R}^k$ whose $i$th entry is $\Delta^\top \nabla_\theta f_i(x;\theta)$. Similarly, $f''(x)$ denotes a vector in $\mathbb{R}^k$ whose $i$th entry is $\Delta^\top \nabla_\theta^2 f_i(x;\theta)\,\Delta$, where gradients and Hessians are with respect to $\theta$.
f.2 An exact expression for logit difference
We use the fundamental theorem of calculus and elementary algebraic manipulation to obtain an exact integral form for the difference between the soup and ensemble logits. To streamline notation we drop the dependence of the logits on the input .
(3) 
where
Note that
f.3 Derivation of approximation
We continue to suppress the dependence on the input in order to simplify notation. We begin with the following first-order approximation of the pointwise log-loss difference between the ensemble and soup, which is also a lower bound due to convexity.
Now, we approximate the ensemble and soup logit difference using eq. 3 by assuming that for all ; this holds when the logits are approximately quadratic along the line between the checkpoints. The resulting approximation is
Combining the two approximations above, we obtain
To relate this expression to the Hessian of the loss with respect to the parameters, we note that for any
(by the chain rule)
When setting , we note that the second term on the RHS is (up to a constant) our approximation for the loss difference. Recalling the expression for the cross-entropy Hessian, the first term is
As a final approximation, we let
this holds when the logits are not too far from linear in .
Substituting back and making explicit, we obtain
where we have used
Scaling all logits by , the approximation becomes
Averaging the result over , we arrive at the approximation (1), which we repeat here for ease of reference:
f.4 Detailed empirical evaluations
Evaluation setup.
We evaluated our bounds on checkpoints from the ViTB/32 finetuning experiments from the extreme grid search described in Section B.3.1. We selected three learning rate values, two levels of augmentation (none and RandAugment+MixUp), and two different random seeds. From these checkpoints (as well as the initialization) we constructed the following pairs:
- All pairs with different learning rate, the same augmentation level, and seed 0;
- All pairs with the same learning rate, different augmentation level, and seed 0;
- All pairs with the same learning rate and augmentation level, but different seeds;
- All checkpoints with seed 0 coupled with the initialization.
This results in 21 pairs overall. For each pair and each we evaluated , as well as the approximation (1). We performed this evaluation on the ImageNet validation set as well as on the 5 distribution shift test sets considered throughout this paper.
The effect of temperature calibration.
Since our ultimate goal is to accurately predict the difference in error rather than the difference in loss, we introduce the inverse-temperature parameter β to the loss and tune it to calibrate the soup model, separately for every model pair, value of α, and test set.
While choosing β based on the soup rather than the ensemble might skew the loss in favor of the soup, it has no effect on the difference in prediction error. Moreover, in preliminary experiments, calibrating the ensemble produced very similar results. In contrast, as shown in Figure 9, fixing β throughout results in far poorer prediction of the difference in error.
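A sketch of the calibration step: choose the inverse temperature that minimizes the soup's log loss over a small grid. The grid values and toy logits below are illustrative, not the paper's:

```python
import math

def log_loss(logits, labels, beta):
    """Average cross-entropy of temperature-scaled logits beta * z."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [beta * v for v in z]
        m = max(scaled)
        lse = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += lse - scaled[y]
    return total / len(labels)

def calibrate(logits, labels, betas):
    """Pick the inverse temperature minimizing the log loss on held-out data."""
    return min(betas, key=lambda b: log_loss(logits, labels, b))

# Overconfident toy logits with one wrong prediction: shrinking the logits
# (beta < 1) lowers the loss, so calibration picks a small beta here.
logits = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
labels = [0, 1, 1]
beta = calibrate(logits, labels, betas=[0.1, 0.5, 1.0, 2.0])
```

Because temperature scaling is monotone in each example's logits, the choice of β changes the loss but never the argmax prediction, which is why it has no effect on the difference in prediction error.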
Appendix G Additional baselines
This section explores additional baselines for model soups, including distillation from an ensemble as in hinton2014dark (Table G.1), fixaugmentation as in fixaug (Table G.2), weightaveraging along a trajectory as in szegedy2016rethinking (Table G.3), and Sharpness Aware Minimization as in foret2021sharpnessaware (Table G.4).
Unless otherwise mentioned, we finetune CLIP ViTB/32 models with AdamW [loshchilov2018decoupled] and cosine annealing learning rate [loshchilov2016sgdr] for 10 epochs on ImageNet with a learning rate of 2e-5 and medium augmentation (data augmentation policies are discussed in more detail in Section B.3.1).
We explore the baseline of distillation [hinton2014dark, hinton2015distilling] from the ensemble of three models trained with different data augmentation. As previously reported [bagherinezhad2018label, beyer2021knowledge], we find that it improves accuracy to run distillation with data augmentation. Unfortunately, this substantially increases the computational resources necessary to distill from the ensemble. As we cannot cache the predictions of the models in the ensemble, it is necessary to perform a forward pass for each model in the ensemble at each step of finetuning. This makes distilling from an ensemble roughly as expensive as training the models which constitute the ensemble. Nevertheless, as illustrated in Table G.1, model soups still perform favorably.
Table G.1 also introduces stochastic augmentation. For each data point, stochastic augmentation randomly applies minimal, medium, or strong data augmentation. Additionally, Table G.2 explores an alternative method for merging augmentations together. This augmentation policy, which we refer to as fixaug, is introduced by fixaug. For fixaug, strong augmentation is used for all but the final epoch, which uses minimal augmentation.
Next, Table G.3 applies model soups to solutions which already average along the finetuning trajectory. Methods for averaging along an individual optimization trajectory include exponential moving averages (EMA) [szegedy2016rethinking] and stochastic weight averages (SWA) [izmailov2018averaging]. Since accuracy is high even from initialization we use EMA. While EMA improves the accuracy of a single model, we find that models without EMA are more amenable to souping. Regardless, the model soup improves over the best single model with EMA.
Finally, Table G.4 explores the relation between model soups and sharpnessaware minimization (SAM) [foret2021sharpnessaware]. In line with previous results, we find that SAM improves accuracy over vanilla finetuning. Souping two models trained with SAM improves over either individual model, although the magnitude of the gain is smaller than for vanilla finetuning. Souping models trained with and without SAM yields higher accuracy than souping models trained only with vanilla finetuning or only with SAM.
ImageNet  Distribution shifts  
Individual model (LR 3e-05, minimal aug)  76.42  43.21
Individual model (LR 3e-05, medium aug)  78.83  43.55
Individual model (LR 3e-05, strong aug)  79.08  43.75
Individual model (LR 3e-05, stochastic aug)  78.94  45.04
Individual model (LR 3e-05, stochastic aug 3x epochs)  78.38  42.18
Distillation from the ensemble (LR 3e-05, no aug)  78.59  43.45
Distillation from the ensemble (LR 3e-05, stochastic aug)  79.79  45.63
Soup minimal, medium, and strong aug (LR 3e-05)  80.24  47.97
Ensemble minimal, medium, and strong aug (LR 3e-05)  80.19  46.33
Individual model (LR 1e-05, minimal aug)  77.19  47.98
Individual model (LR 1e-05, medium aug)  79.51  46.74
Individual model (LR 1e-05, strong aug)  79.33  46.62
Individual model (LR 1e-05, stochastic aug)  79.48  48.07
Individual model (LR 1e-05, stochastic aug 3x epochs)  79.59  46.89
Distillation from the ensemble (LR 1e-05, no aug)  79.13  47.28
Distillation from the ensemble (LR 1e-05, stochastic aug)  79.88  47.49
Soup minimal, medium, and strong aug (LR 1e-05)  80.08  49.75
Ensemble minimal, medium, and strong aug (LR 1e-05)  80.17  49.36
ImageNet  Distribution shifts  
Individual model (LR 3e-05, minimal aug)  76.42  43.21
Individual model (LR 3e-05, medium aug)  78.83  43.55
Individual model (LR 3e-05, strong aug)  79.08  43.75
Individual model (LR 3e-05, fix aug)  79.43  45.46
Individual model (LR 3e-05, fix aug 4x epochs)  78.57  41.53
Soup minimal, medium, and strong aug (LR 3e-05)  80.24  47.97
Soup minimal, medium, strong, and fix aug (LR 3e-05)  80.41  48.14
Individual model (LR 1e-05, minimal aug)  77.19  47.98
Individual model (LR 1e-05, medium aug)  79.51  46.74
Individual model (LR 1e-05, strong aug)  79.33  46.62
Individual model (LR 1e-05, fix aug)  79.70  48.18
Individual model (LR 1e-05, fix aug 4x epochs)  79.96  45.86
Soup minimal, medium, and strong aug (LR 1e-05)  80.08  49.75
Soup minimal, medium, strong, and fix aug (LR 1e-05)  80.17  49.71
ImageNet  Distribution shifts  
Individual model (no EMA, LR 3e-05, minimal aug)  76.42  43.21
Individual model (no EMA, LR 3e-05, medium aug)  78.83  43.55
Individual model (no EMA, LR 3e-05, strong aug)  79.08  43.75
Soup minimal, medium, and strong aug without EMA (LR 3e-05)  80.24  47.97
Individual model (EMA decay 0.9999, LR 3e-05, minimal aug)  77.61  47.45
Individual model (EMA decay 0.9999, LR 3e-05, medium aug)  79.37  46.89
Individual model (EMA decay 0.9999, LR 3e-05, strong aug)  79.17  46.85
Soup minimal, medium, and strong aug with EMA (LR 3e-05)  79.76  49.69
Individual model (no EMA, LR 1e-05, minimal aug)  77.19  47.98
Individual model (no EMA, LR 1e-05, medium aug)  79.51  46.74
Individual model (no EMA, LR 1e-05, strong aug)  79.33  46.62
Soup minimal, medium, and strong aug without EMA (LR 1e-05)  80.08  49.75
Individual model (EMA decay 0.9999, LR 1e-05, minimal aug)  77.47  50.94
Individual model (EMA decay 0.9999, LR 1e-05, medium aug)  78.93  49.76
Individual model (EMA decay 0.9999, LR 1e-05, strong aug)  78.85  49.02
Soup minimal, medium, and strong aug with EMA (LR 1e-05)  79.16  51.76
ImageNet  Distribution shifts  
Vanilla finetuning (seed 0)  79.32  45.09 
Vanilla finetuning (seed 1)  79.16  45.12 
SAM finetuning (seed 0)  79.61  43.78 
SAM finetuning (seed 1)  79.59  43.79 
Soup (vanilla finetuning, seeds 0 and 1)  79.78  46.46 
Soup (SAM finetuning, seeds 0 and 1)  79.85  44.44 
Soup (vanilla finetuning and SAM finetuning, seed 0)  80.04  45.38 
Appendix H Text classification datasets
We study four text classification datasets from the GLUE benchmark [wang2018glue].
Microsoft Research Paraphrase Corpus
(MRPC; [dolan2005automatically]) contains pairs of sentences, labeled as either nearly semantically equivalent or not. The dataset is evaluated using the average of F1 and accuracy. The training set consists of 3.7 thousand samples and the validation set of 409 samples.
Recognizing Textual Entailment
(RTE; [wang2018glue]) contains pairs of sentences, and the task is to predict whether the first sentence (the premise) entails or contradicts the second sentence (the hypothesis). The data is originally from a series of datasets [dagan2005pascal, bar2006second, giampiccolo2007third, bentivogli2009fifth]. The dataset is evaluated using classification accuracy. The training set consists of 2.5 thousand samples and the validation set of 277 samples.
Corpus of Linguistic Acceptability
(CoLA; [warstadt2018neural]) contains sentences labeled as either grammatical or ungrammatical. Models are evaluated on Matthews correlation (MCC; [matthews1975comparison]), which ranges between -1 and 1. The training set consists of 8.6 thousand samples and the validation set consists of 1043 samples.
Stanford Sentiment Treebank
(SST2; [socher2013recursive]) contains sentences labeled as expressing positive or negative sentiment, collected from movie reviews. The dataset is evaluated using classification accuracy. The training set consists of 67 thousand samples and the validation set consists of 873 samples.
Appendix I Finetuning details for text classification tasks
Model  Method  MRPC  RTE  CoLA  SST2 
BERTbase [devlin2019bert]  Best individual model  88.3  61.0  59.1  92.5 
Uniform soup  76.0  52.7  0.0  89.9  
Greedy soup  88.3  61.7  59.1  93.0  
BERTlarge [devlin2019bert]  Best individual model  88.8  56.7  63.1  92.2 
Uniform soup  15.8  52.7  1.90  50.8  
Greedy soup  88.8  56.7  63.1  92.3  
T5small [raffel2020t5]  Best individual model  89.7  70.0  42.2  91.7 
Uniform soup  82.7  61.7  10.4  91.1  
Greedy soup  89.7  70.0  43.0  91.7  
T5base [raffel2020t5]  Best individual model  91.8  78.3  58.8  94.6 
Uniform soup  86.4  71.8  12.3  94.6  
Greedy soup  92.4  79.1  60.2  94.7  
T5large [raffel2020t5]  Best individual model  93.4  82.7  61.7  96.3 
Uniform soup  74.8  50.2  0.00  96.0  
Greedy soup  93.4  84.8  62.7  96.3 
Each model is finetuned 32 times on each dataset, performing a random hyperparameter search. The learning rate is chosen uniformly in log space over [, ], the batch size is chosen uniformly from and the number of epochs from . Evaluation is conducted once at the end of training, without early stopping. We use a maximum sequence length of 128 tokens and train with Adam [kingma2014adam] using , and
, gradient clipping of
, no weight decay, and with the learning rate decayed linearly to zero at the end of training. We use pretrained weights from the Huggingface Transformers library [wolf2020transformers]. For BERT models, we use the uncased version.

Finetuning occurs without any additional parameters to avoid distorting the features from the pretrained models [kumar2021finetuning]. To this end, the classification tasks are adapted to suit the pretraining objectives of BERT and T5. For T5, the tasks are cast as a sequence-to-sequence problem. For instance, for sentiment analysis, an example is to predict “A) positive” from “sentence: The best movie I’ve ever seen! options: A) positive B) negative label:”. For BERT, the tasks are cast as a masked language modeling problem. For instance, for linguistic acceptability, an example is to predict “A) acceptable” for the input “sentence: model soups are grammatical. options: A) acceptable B) unacceptable label: [MASK] [MASK] [MASK]”. For evaluation, we select which of the options is given the highest probability according to the model.
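The prompt construction and option-scoring evaluation described above can be sketched as follows. The template string and the scoring function are simplified stand-ins, not the exact format used:

```python
def build_prompt(sentence, options):
    """Cast a classification example as a text prompt with lettered options,
    in the style described above (assumed formatting for illustration)."""
    opts = " ".join(f"{chr(ord('A') + i)}) {o}" for i, o in enumerate(options))
    return f"sentence: {sentence} options: {opts} label:"

def classify(sentence, options, score):
    """Pick the option the model scores highest. `score(prompt, option)` is a
    hypothetical stand-in for the model's log-probability of that option."""
    prompt = build_prompt(sentence, options)
    return max(options, key=lambda o: score(prompt, o))

# Toy scorer that prefers "positive" for sentences containing "best".
score = lambda prompt, option: ("best" in prompt) == (option == "positive")
label = classify("The best movie I've ever seen!",
                 ["positive", "negative"], score)
```

Scoring full option strings rather than adding a classification head is what keeps finetuning free of additional parameters.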
The full set of results is shown in Table I.1. On 10 out of the 20 combinations of models and datasets, the greedy soup shows better performance than the best individual model from the hyperparameter search. Uniform soups show worse performance than the best individual model on all experiments, which could be an artifact of the broad range of hyperparameters used in the search. While the experiments varied only basic hyperparameters such as learning rate and batch size, we hypothesize that a broader set of hyperparameter choices (e.g. data augmentation [wei2019eda, ma2019nlpaug]) could lead to more diverse models and better soups.
Finally, as a word of caution for practitioners, we remind readers that many recent language models have tied weights on the output and embedding layers
[press2017using]. For this reason, caution is needed when averaging models in place; doing so might inadvertently lead to undesired behavior.
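A sketch of the failure mode: when two parameter names alias one tensor (tied weights), in-place averaging visits the shared storage twice, while allocating fresh storage avoids the problem. Toy Python lists stand in for tensors here:

```python
def inplace_average(dst, src):
    """Average src into dst by overwriting dst's parameters in place.

    Dangerous with tied weights: if two parameter names alias the SAME
    storage, the shared parameters get averaged twice.
    """
    for name in dst:
        p, q = dst[name], src[name]
        for i in range(len(p)):
            p[i] = (p[i] + q[i]) / 2

def fresh_average(a, b):
    """Safe souping: allocate new storage instead of mutating in place."""
    return {k: [(x + y) / 2 for x, y in zip(a[k], b[k])] for k in a}

def tied_model(value):
    shared = [value]                        # one tensor shared by two names
    return {"embed": shared, "lm_head": shared}

model_a, model_b = tied_model(2.0), tied_model(4.0)
assert fresh_average(model_a, model_b)["embed"][0] == 3.0  # intended average

inplace_average(model_a, model_b)
# The alias was visited twice: (2 + 4) / 2 = 3, then (3 + 4) / 2 = 3.5.
assert model_a["embed"][0] == 3.5
```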