## 1 Introduction

The success of Neural Networks (NNs) is surprising, considering the hard optimization problem to be solved when training NNs. Specifically, NN training is NP-complete (blumTraining3NodeNeural1988), the loss surface and hence the optimization problem are non-convex (dauphinIdentifyingAttackingSaddle2014; goodfellowQualitativelyCharacterizingNeural2015; lecunDeepLearning2015), and the parameter space to fit during training is high-dimensional (brownLanguageModelsAre2020). Additionally, NN training is sensitive to random initialization and hyperparameter selection (haninHowStartTraining2018; liVisualizingLossLandscape2018). Together, this leads to an interesting characteristic of NN training: given a dataset and an architecture, different random initializations or hyperparameters lead to different minima on the loss surface and therefore to different model parameters (i.e., weights and biases). Consequently, multiple training runs result in different NN models. The resulting population of NNs (referred to as a model zoo) is an interesting object to study: Do individual models of the model zoo have something in common? Do they form structures in weight space? What can we infer from such structures? Can we learn representations of them? Lastly, can such structures be exploited to generate new models with controllable properties?

These questions have been partially answered in prior work. Theoretical and empirical work demonstrates increasingly well-behaved loss surfaces for growing numbers of parameters (goodfellowQualitativelyCharacterizingNeural2015; dauphinMetaInitInitializingLearning2019; liVisualizingLossLandscape2018). The shape of the loss surface and the starting point on it are determined by the hyperparameters and the initialization, respectively (liVisualizingLossLandscape2018). NN training navigates the loss surface with iterative, gradient-based update schemes smoothed by momentum. The step length along a trajectory as well as its curvature are determined by the change of the loss and by how aligned subsequent updates are (cazenavetteDatasetDistillationMatching2022; schurholtInvestigationWeightSpace2021). Together, these findings suggest that populations of NN models evolve on unique and smooth trajectories in weight space. Related work has empirically confirmed the existence of such structures in NNs (denilPredictingParametersDeep2013), demonstrated the feasibility of learning representations of them, showed that they encode information on model properties (unterthinerPredictingNeuralNetwork2020; eilertsenClassifyingClassifierDissecting2020; schurholtSelfSupervisedRepresentationLearning2021), and showed that they can be used to generate unseen models with desirable properties (schurholtHyperRepresentationsPreTrainingTransfer2022; schurholtHyperRepresentationsGenerativeModels2022; zhmoginovHyperTransformerModelGeneration2022; knyazevParameterPredictionUnseen2021). To thoroughly answer the questions above, a large and systematically created dataset of model weights is necessary.

Unfortunately, so far only few model zoos with specific properties have been published (unterthinerPredictingNeuralNetwork2020; eilertsenClassifyingClassifierDissecting2020; suchAtariModelZoo2019; schurholtSelfSupervisedRepresentationLearning2021). While many machine learning domains have standardized datasets, there is neither a standard model zoo nor a benchmark to evaluate and compare against. The lack of standardized model zoos has three significant disadvantages: (i) existing model zoos are usually designed for a specific purpose and of limited general utility; their design space is rather sparse, covering only small portions of all available hyperparameter combinations, and some existing zoos are generated on synthetic tasks and contain only small populations of models; (ii) researchers have to choose between using an existing zoo or generating a new one for each new experiment, weighing the disadvantages of existing zoos against the effort and computational resources required to generate a new zoo; (iii) a new model zoo causes subsequent work to lose comparability to existing research. Therefore, the lack of a benchmark model zoo significantly increases the friction for new research.

Our contributions: To study the behaviour of populations of NNs, we publish a large-scale dataset of diverse populations of neural network models with controlled generating factors of model training. Special care has been taken in the design of the zoos and in the protocols used for training: we define and restrict the generating factors of model zoo training to achieve the desired zoo characteristics.

The zoos are trained on eight standard image classification datasets with a broad range of hyperparameters and contain thousands of configurations. Further, we add sparsified model zoo twins to each of these zoos. Altogether, the zoos include a total of 50'360 unique image classification NNs, resulting in over 3'844'360 collected model states.

Potential use-cases for the model zoos include (a) model analysis for reliability, bias, fairness, or adversarial vulnerability, (b) inference of learning dynamics for efficiency gains, model selection, or early stopping, (c) representation learning on such populations, or (d) model generation. Additionally, we present an analysis of the model zoos and a set of experimental setups for benchmarks on these use-cases, with initial results as a foundation for evaluation and comparison.

With this work we provide the machine learning research community with a standardized dataset of diverse model zoos connected to popular image datasets, their corresponding meta-data, and performance evaluations. All data is made publicly available to foster community building around the topic and to provide grounds for use beyond the defined benchmark tasks. An overview of the proposed dataset and benchmark as well as potential use-cases can be found in Fig. 1.

## 2 Existing Populations of Neural Network Models

With the increased usage of neural networks, requirements for evaluation, testing, and certification have grown. Methods to analyze NN models may attempt to visualize salient features for a given class (zeilerVisualizingUnderstandingConvolutional2014; karpathyVisualizingUnderstandingRecurrent2015; yosinskiUnderstandingNeuralNetworks2015), investigate the robustness of models to specific types of noise (zintgrafVisualizingDeepNeural2017; dabkowskiRealTimeImage2017), predict model properties from model features (yakTaskArchitectureIndependentGeneralization2019; jiangPredictingGeneralizationGap2019; corneanuComputingTestingError2020), or compare models based on their activations (raghuSVCCASingularVector2017; morcosInsightsRepresentationalSimilarity2018; nguyenWideDeepNetworks2020). However, while most of these methods rely on common (image) datasets to train and evaluate their models, there is no common dataset of neural network models on which to compare the evaluation methods. Model zoos as common evaluation datasets can be a step towards evaluating these evaluation methods.

There are only few publications that use model zoos. In (liuKnowledgeFlowImprove2019), zoos of pre-trained models are used as teacher models to train a target model. Similarly, (shuZooTuningAdaptiveTransfer2021) propose a method to learn a combination of the weights of models from a zoo for a new task. (zhouJittorGANFasttrainingGenerative2021) use a zoo of GAN models trained with different methods to accelerate GAN training. To facilitate continual learning, (rameshModelZooGrowing2022) propose to generate zoos of models trained on different tasks or experiences, and to ensemble them for future tasks.

Larger model zoos containing a few thousand models are used in (unterthinerPredictingNeuralNetwork2020) to predict the accuracy of the models from their weights. Similarly, (eilertsenClassifyingClassifierDissecting2020) use zoos of larger models to predict hyperparameters from the weights. In (gavrikovCNNFilterDB2022), a large collection of 3x3 convolutional filters trained on different datasets is presented and analysed. Other work identifies structures in the form of subspaces with beneficial properties (lucasAnalyzingMonotonicLinear; wortsmanLearningNeuralNetwork2021; bentonLossSurfaceSimplexes2021). (schurholtSelfSupervisedRepresentationLearning2021) use zoos to learn self-supervised representations of the weights of the models in the zoo. The authors demonstrate that the learned representations have high predictive capabilities for model properties such as accuracy, generalization gap, epoch, and various hyperparameters. Further, they investigate the impact of the generating factors of model zoos on these properties.

(schurholtHyperRepresentationsPreTrainingTransfer2022; schurholtHyperRepresentationsGenerativeModels2022) demonstrate that learned representations can be instantiated as new models, e.g., as initialization for fine-tuning or transfer learning. This work systematically extends their zoos to more datasets and architectures.

## 3 Model Zoo Generation

The proposed model zoo datasets contain systematically generated and diverse populations of neural networks. Since the applicability of the model zoos for downstream tasks largely depends on the composition and properties of the zoos, special care has to be taken in their design and in the protocol used for training. The entire procedure can be considered as defining and restricting the generating factors of model zoo training with respect to their latent relation to desired zoo characteristics. The described procedure and protocol can also serve as a general blueprint for the generation of model zoos.

In our paper, the term 'architecture' means the structure of a NN, i.e., the set of operations and their connectivity. We use 'model' to denote an instantiation of an architecture with weights over all stages of training, 'model state' to denote a model with the specific state of weights at a specific training epoch, and the weights w to denote all trainable parameters (weights and biases).
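To make the vectorization of model states concrete, the following sketch flattens a model state into a single weight vector w. The layer names and shapes are purely illustrative; a real implementation would iterate over a framework checkpoint (e.g., a PyTorch `state_dict`), for which plain NumPy arrays stand in here.

```python
import numpy as np

def flatten_weights(state):
    """Concatenate all weight/bias tensors of a model state into one flat vector.

    `state` maps parameter names to arrays; the iteration order must be fixed
    (here: insertion order) so that every model in a zoo is vectorized
    consistently.
    """
    return np.concatenate([p.ravel() for p in state.values()])

# Illustrative checkpoint of a tiny two-layer network (shapes are made up).
state = {
    "conv1.weight": np.zeros((8, 1, 3, 3)),
    "conv1.bias": np.zeros(8),
    "fc.weight": np.zeros((10, 8)),
    "fc.bias": np.zeros(10),
}
w = flatten_weights(state)
print(w.shape)  # one flat vector per model state: (170,)
```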

### 3.1 Model Zoo Design

#### Generating Factors

Following (unterthinerPredictingNeuralNetwork2020), we define the tuple $\{\mathcal{D}, \lambda, \mathcal{A}\}$ as a configuration of a model zoo's generating factors. We denote the dataset of image samples with their corresponding labels as $\mathcal{D} = \{(x_i, y_i)\}$. The NN architecture is denoted by $\mathcal{A}$. We denote the set of hyperparameters used for training (e.g., loss function, optimizer, learning rate, weight initialization, seed, batch size, epochs) as $\lambda$. While dataset $\mathcal{D}$ and architecture $\mathcal{A}$ are fixed for a model zoo, $\lambda$ provides not only the set of hyperparameters but also configures the ranges for individual hyperparameters, such as the learning rate, for model zoo generation. Training with such differing configurations results in a population of NN models, i.e., the model zoo. We convert the weights and biases of each model to a vectorized form. In the resulting model zoo $\mathcal{W} = \{w_1, \dots, w_M\}$, $w_j$ denotes the flattened vector of the weights and biases of one trained NN model from the set of $M$ models of the zoo.

#### Configurations & Diversity

The model zoos have to be representative of real-world models, but also diverse and span an interesting range of properties. Defining the diversity of model zoos, as well as choosing how much diversity to include, is as difficult as for image datasets, e.g., (dengImageNetLargeScaleHierarchical; feiConstructionAnalysisLarge). Model zoos can be diverse in their properties (e.g., performance), in their generating factors, or in their weights w. We aim at generating model zoos with a rich set of models and diversity in all of these aspects. As these zoo properties are effects of the generating factors, we tune the diversity of the generating factors and evaluate the resulting diversity in Section 4.

| Dataset | Arch | Config | Init | Activation | Optim | LR | WD | Dropout | Seed |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | CNN (s) | Seed | U | T | AD | 3e-4 | 0 | 0 | 1-1000 |
| MNIST | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4 | 0, 0.5 | random |
| MNIST | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4 | 0, 0.5 | 1-10 |
| F-MNIST | CNN (s) | Seed | U | T | AD | 3e-4 | 0 | 0 | 1-1000 |
| F-MNIST | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4 | 0, 0.5 | random |
| F-MNIST | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4 | 0, 0.5 | 1-10 |
| SVHN | CNN (s) | Seed | U | T | AD | 3e-3 | 0 | 0 | 1-1000 |
| SVHN | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4, 0 | 0, 0.3, 0.5 | random |
| SVHN | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-3, 1e-4, 0 | 0, 0.3, 0.5 | 1-10 |
| USPS | CNN (s) | Seed | U | T | AD | 3e-4 | 1e-3 | 0 | 1-1000 |
| USPS | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | random |
| USPS | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | 1-10 |
| CIFAR10 | CNN (s) | Seed | KU | G | AD | 1e-4 | 1e-2 | 0 | 1-1000 |
| CIFAR10 | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3 | 1e-2, 1e-3 | 0, 0.5 | random |
| CIFAR10 | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3 | 1e-2, 1e-3 | 0, 0.5 | 1-10 |
| CIFAR10 | CNN (m) | Seed | KU | G | AD | 1e-4 | 1e-2 | 0 | 1-1000 |
| CIFAR10 | CNN (m) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3 | 1e-2, 1e-3 | 0, 0.5 | random |
| CIFAR10 | CNN (m) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3 | 1e-2, 1e-3 | 0, 0.5 | 1-10 |
| STL | CNN (s) | Seed | KU | T | AD | 1e-4 | 1e-3 | 0 | 1-1000 |
| STL | CNN (s) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | random |
| STL | CNN (s) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | 1-10 |
| STL | CNN (m) | Seed | KU | T | AD | 1e-4 | 1e-3 | 0 | 1-1000 |
| STL | CNN (m) | Hyp-10-r | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | random |
| STL | CNN (m) | Hyp-10-f | U, N, KU, KN | T, S, R, G | AD, SGD | 1e-3, 1e-4 | 1e-2, 1e-3 | 0, 0.5 | 1-10 |
| CIFAR10 | RN-18 | Seed | KU | R | SGD | 1e-4* | 5e-4 | 0 | 1-1000 |
| CIFAR100 | RN-18 | Seed | KU | R | SGD | 1e-4* | 5e-4 | 0 | 1-1000 |
| Tiny ImageNet | RN-18 | Seed | KU | R | SGD | 1e-4* | 5e-4 | 0 | 1-1000 |

Init denotes the initialization method; Activation denotes the activation function: T - Tanh, S - Sigmoid, R - ReLU, G - GeLU. Optim denotes the optimizer: AD - Adam, SGD - Stochastic Gradient Descent. Models with learning rates marked with * have been trained with a one-cycle LR scheduler; the listed LR is the maximum value.

Prior work discusses the impact of random seeds on the properties of model zoos. While (yakTaskArchitectureIndependentGeneralization2019) use multiple random seeds for the same hyperparameter configuration, (unterthinerPredictingNeuralNetwork2020) explicitly argue against that, to prevent information leakage between models from the train to the test set. To achieve diverse model zoos and to disentangle the generating factors (seeds and hyperparameters), we train model zoos in three different configurations: some with random seeds, others with fixed seeds.

#### Random Seeds

The first configuration, denoted as Hyp-10-rand, varies a broad range of hyperparameters to define a hyperparameter grid. To include the effect of different random initializations, each node in the grid is repeated with ten randomly drawn seeds. One model is configured by the combination of hyperparameters and seed, so each hyperparameter node yields ten models. It is very unlikely for two models in the zoo to share the same random seed. With this, we achieve the highest amount of diversity in properties, generating factors, and weights.
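A minimal sketch of how such a grid with repeated random seeds could be enumerated; the grid values below are an illustrative subset of Table 1, not the exact generation code used for the zoos.

```python
import itertools
import random

# Illustrative subset of the hyperparameter grid (values mirror Table 1;
# the exact ranges differ per dataset/zoo).
grid = {
    "init": ["uniform", "normal", "kaiming_uniform", "kaiming_normal"],
    "activation": ["tanh", "sigmoid", "relu", "gelu"],
    "optimizer": ["adam", "sgd"],
    "lr": [1e-3, 1e-4],
    "weight_decay": [1e-3, 1e-4],
    "dropout": [0.0, 0.5],
}

def hyp_10_rand_configs(grid, repetitions=10, meta_seed=0):
    """Enumerate every grid node and repeat it with freshly drawn random seeds."""
    rng = random.Random(meta_seed)
    keys = list(grid)
    configs = []
    for values in itertools.product(*grid.values()):
        node = dict(zip(keys, values))
        for _ in range(repetitions):
            configs.append({**node, "seed": rng.randrange(10**6)})
    return configs

configs = hyp_10_rand_configs(grid)
print(len(configs))  # 4*4*2*2*2*2 nodes x 10 seeds = 2560 configurations
```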

#### Fixed Seeds

The second configuration, denoted as Hyp-10-fix, uses the same hyperparameter grid as Hyp-10-rand, but repeats each node with the ten fixed seeds 1-10. Fixing the seeds allows evaluation methods to control for the seed, isolate the influence of hyperparameter choices, and still obtain robust results over 10 repetitions. A side effect of this (desired) isolation of influence factors is that fixing the seeds leads to repeated starting points in weight space for models with the same seed and initialization method. At the beginning of training, these models may have similar trajectories.
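The effect of fixing seeds on the starting point can be illustrated with a toy initializer; the uniform draw below is only a stand-in for the actual initialization methods used in the zoos.

```python
import numpy as np

def init_weights(n_params, seed, scale=0.1):
    """Draw an initial weight vector; the seed fully determines the start point."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-scale, scale, size=n_params)

w_a = init_weights(1000, seed=3)
w_b = init_weights(1000, seed=3)   # same fixed seed -> identical starting point
w_c = init_weights(1000, seed=4)   # different seed -> different starting point

print(np.array_equal(w_a, w_b), np.array_equal(w_a, w_c))
```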

#### Fixed Hyperparameters

For the third configuration, denoted as Seed, we fix one set of hyperparameters and repeat training with 1000 different seeds. With that, we achieve zoos that are very diverse in weights and cover a broad range of weight space. These zoos can be used to evaluate the impact of the weights and their starting point on model performance. The hyperparameters for the Seed zoos are chosen such that there is still a level of diversity in model performance.

### 3.2 Specification of Generating Factors for Model Zoos

This section describes the systematic specification of the trained model zoos. Multiple generating factors define a configuration for the model zoo generation, detailed in Table 1.

Datasets: We generate model zoos for the following image classification datasets: MNIST (lecunGradientbasedLearningApplied1998), Fashion-MNIST (xiaoFashionMNISTNovelImage2017), SVHN (netzerReadingDigitsNatural2011), CIFAR-10 (krizhevskyLearningMultipleLayers2009), STL-10 (coatesAnalysisSingleLayerNetworks2011), USPS (hullDatabaseHandwrittenText1994), CIFAR-100 (krizhevskyLearningMultipleLayers2009), and Tiny ImageNet (leTinyImageNetVisual).

Hyperparameters: The hyperparameters varied to train the models in the zoos are: (1) seed, (2) initialization method, (3) activation function, (4) dropout, (5) optimization algorithm, (6) learning rate, and (7) weight decay. The batch size and number of training epochs are kept constant within zoos.

Architecture: To preserve comparability within a model zoo, each zoo is generated using a single neural network architecture; one of three standard architectures is used per zoo. Our intention with this dataset is similar to that of research communities such as Neural Architecture Search (NAS), Meta-Learning, or Continual Learning (CL), where initial work started small-scale (zhmoginovHyperTransformerModelGeneration2022; rameshModelZooGrowing2022). Hence, the first two architectures are a small and a slightly larger Convolutional Neural Network (CNN); both have three convolutional and two fully-connected layers, but different numbers of channels (details in Appendix A). The third architecture is a standard ResNet-18 (heDeepResidualLearning2016). The small CNN (1) and the medium CNN (2) have on the order of a few thousand parameters each (exact counts in Appendix A); the ResNet-18 (3) has 11.2M-11.3M parameters. Compared to (1), the medium architecture (2) adds diversity to the collection of model zoos and performs significantly better on the more complex datasets CIFAR-10 and STL-10. These architectures are similar to the ones used in (schurholtHyperRepresentationsGenerativeModels2022). The ResNet-18 architecture is included to apply the model zoo blueprint to models of the widely used ResNet family and thus facilitate research on populations of real-world-sized models.

### 3.3 Training of Model Zoos

Neural network models are trained for each of the previously defined three configurations (Seed, Hyp-10-rand, Hyp-10-fix; see Sec. 3.1). With the 8 image datasets and the three configurations, this results in 27 model zoos. The zoos include a total of 50'360 unique neural network models.

Training Protocol: Every model in the collection of zoos is trained according to the same protocol. We keep the same train, validation, and test splits for each zoo, and train each model for 50 epochs with gradient-based methods (SGD with momentum or Adam). At every epoch, the model checkpoint as well as the accuracy and loss on all splits are recorded. Validation and test performance are also recorded before the first training epoch. This makes 51 checkpoints per model training trajectory, including the starting checkpoint representing the model initialization before training starts. The ResNet-18 zoos on CIFAR-100 and Tiny ImageNet require more updates and are trained for 60 epochs. In total, this results in a set of 2'585'360 collected model states.

Splits: To enable comparability, this set of models is split into training (70%), validation (15%), and test (15%) subsets. The split is done such that all checkpoints of one model training run (i.e., the 51 checkpoints per training) are entirely in either training, validation, or test, and therefore no information leaks between these subsets.
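A sketch of such a leakage-free split at the level of whole training runs; the helper below is hypothetical and not the released tooling.

```python
import random

def split_by_model(model_ids, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split at the level of whole training runs so that all checkpoints of
    one model end up in exactly one subset (no leakage across subsets)."""
    ids = sorted(model_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_by_model(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))  # 700 150 150
```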

Sparsified Model Zoo Twins: Model sparsification is an effective method to reduce the computational cost of models. However, methods to sparsify models to a high degree while preserving their performance are still actively researched (hoeflerSparsityDeepLearning2021). To allow systematic studies of sparsification, we extend the model zoos with sparsified model zoo twins serving as counterparts to the existing zoos in the dataset. Using Variational Dropout (VD) (molchanovVariationalDropoutSparsifies2017), we sparsify the trained models from the existing model zoos. VD generates a sparsification trajectory for each model, along which we track the performance, the degree of sparsity, and the sparsified checkpoint. With 25 sparsification epochs, this yields 1'259'000 sparsified model states.
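When working with the sparsified twins, the degree of sparsity of a checkpoint can be measured, e.g., as the fraction of near-zero weights; the threshold below is illustrative, not the criterion used by VD itself.

```python
import numpy as np

def sparsity(w, threshold=1e-3):
    """Fraction of weights whose magnitude is (near) zero after sparsification."""
    w = np.asarray(w)
    return float(np.mean(np.abs(w) < threshold))

w = np.array([0.0, 0.5, -0.0004, 0.2, 0.0, -0.7, 0.0001, 0.3])
print(sparsity(w))  # 4 of 8 entries below the threshold -> 0.5
```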

### 3.4 Data Management and Accessibility of Model Zoos

The model zoos are made publicly available to the research community in an accessible, standardized, and well-documented way under the Creative Commons Attribution 4.0 license (CC-BY 4.0). We ensure the technical accessibility of the data by hosting it on Zenodo, where the data will be hosted for at least 20 years. Further, we take steps to reduce access barriers by providing code for data loading and preprocessing, reducing the friction associated with analyzing the raw zoo files. All code can be found on the model zoo website www.modelzoos.cc. To ensure conceptual accessibility, we include detailed insights, visualizations, and the analysis of the model zoo (Sec. 4) with each zoo. Further details can be found in Appendix B.

## 4 Model Zoo Analysis

| Dataset | Architecture | Config | Accuracy | Agreement | cka | w | l2-dist | cos dist |
|---|---|---|---|---|---|---|---|---|
| MNIST | CNN (s) | Seed | 91.1 (0.9) | 88.5 (1.3) | 77.2 (5.2) | 18.9 (58.4) | 124.1 (4.9) | 77.1 (4.1) |
| MNIST | CNN (s) | Hyp-10-r | 79.9 (30.7) | 67.7 (35.5) | 58.6 (25.9) | 0.4 (46.5) | 150.6 (66.5) | 98.8 (7.2) |
| MNIST | CNN (s) | Hyp-10-f | 80.3 (30.3) | 68.3 (35.3) | 58.8 (25.7) | 0.3 (46.7) | 149.7 (66.8) | 97.7 (10.0) |
| F-MNIST | CNN (s) | Seed | 72.7 (1.0) | 79.8 (2.6) | 82.3 (12.6) | 22.6 (55.6) | 122.0 (4.9) | 74.5 (4.4) |
| F-MNIST | CNN (s) | Hyp-10-r | 68.4 (23.7) | 59.9 (29.1) | 64.6 (23.5) | 1.0 (46.0) | 149.6 (62.2) | 99.2 (6.8) |
| F-MNIST | CNN (s) | Hyp-10-f | 68.7 (23.4) | 60.4 (28.7) | 64.6 (22.7) | 0.9 (46.3) | 148.5 (61.9) | 97.9 (9.9) |
| SVHN | CNN (s) | Seed | 71.1 (8.0) | 67.2 (10.3) | 67.7 (15.7) | 7.1 (113.7) | 137.6 (8.3) | 94.5 (5.1) |
| SVHN | CNN (s) | Hyp-10-r | 35.9 (24.3) | 61.6 (35.9) | 17.8 (28.0) | 1.4 (42.2) | 170.5 (149.4) | 83.6 (30.4) |
| SVHN | CNN (s) | Hyp-10-f | 36.0 (24.4) | 61.4 (36.0) | 18.1 (27.9) | 1.3 (42.2) | 170.0 (149.0) | 83.2 (30.7) |
| USPS | CNN (s) | Seed | 87.0 (1.7) | 87.3 (2.2) | 86.7 (6.3) | 8.2 (26.9) | 123.1 (5.2) | 75.9 (5.0) |
| USPS | CNN (s) | Hyp-10-r | 64.7 (30.8) | 55.3 (31.4) | 50.9 (30.5) | 2.1 (39.6) | 155.5 (92.6) | 99.1 (8.9) |
| USPS | CNN (s) | Hyp-10-f | 65.0 (30.7) | 55.4 (31.3) | 50.4 (30.4) | 1.9 (40.1) | 154.2 (93.1) | 97.3 (13.7) |
| CIFAR10 | CNN (s) | Seed | 48.7 (1.4) | 65.7 (3.1) | 72.9 (11.3) | 1.1 (11.0) | 138.7 (5.6) | 96.3 (5.1) |
| CIFAR10 | CNN (s) | Hyp-10-r | 35.1 (16.3) | 33.3 (22.9) | 47.5 (34.0) | -0.2 (17.0) | 155.6 (71.0) | 97.5 (10.8) |
| CIFAR10 | CNN (s) | Hyp-10-f | 35.1 (16.2) | 33.3 (22.8) | 47.3 (34.2) | -0.2 (16.9) | 155.3 (70.0) | 97.2 (11.1) |
| CIFAR10 | CNN (m) | Seed | 61.5 (0.7) | 76.0 (1.6) | 92.4 (1.7) | 0.1 (18.2) | 137.0 (7.9) | 94.1 (9.2) |
| CIFAR10 | CNN (m) | Hyp-10-r | 39.6 (21.8) | 34.5 (27.1) | 43.2 (36.5) | -0.4 (23.0) | 158.9 (79.9) | 98.6 (12.2) |
| CIFAR10 | CNN (m) | Hyp-10-f | 39.6 (21.7) | 34.4 (26.7) | 42.8 (37.8) | -0.4 (22.9) | 158.1 (77.2) | 98.0 (13.1) |
| STL | CNN (s) | Seed | 39.0 (1.0) | 48.4 (3.0) | 81.5 (3.9) | -0.1 (19.1) | 141.2 (5.0) | 99.8 (4.2) |
| STL | CNN (s) | Hyp-10-r | 23.1 (12.3) | 23.4 (20.9) | 39.0 (30.7) | 3.0 (40.0) | 158.7 (107.3) | 98.7 (10.9) |
| STL | CNN (s) | Hyp-10-f | 23.0 (12.2) | 23.3 (21.1) | 38.1 (30.0) | 3.0 (39.8) | 157.1 (107.2) | 96.8 (16.3) |
| STL | CNN (m) | Seed | 47.4 (0.9) | 53.9 (2.2) | 83.3 (2.3) | 0.1 (26.6) | 141.3 (6.0) | 99.9 (5.8) |
| STL | CNN (m) | Hyp-10-r | 24.3 (14.7) | 23.2 (24.2) | 34.1 (30.0) | 2.3 (45.7) | 159.3 (103.0) | 99.1 (12.5) |
| STL | CNN (m) | Hyp-10-f | 24.4 (14.7) | 23.7 (24.5) | 34.6 (30.3) | 2.3 (46.5) | 157.4 (104.1) | 97.6 (20.1) |
| CIFAR10 | ResNet-18 | Seed | 92.1 (0.2) | 93.4 (0.7) | –.- (-.-) | -0.01 (1.7) | 122.1 (3.9) | 72.2 (2.3) |
| CIFAR100 | ResNet-18 | Seed | 74.2 (0.3) | 77.6 (1.2) | –.- (-.-) | -0.1 (1.6) | 130.8 (4.1) | 83.1 (2.6) |
| Tiny ImageNet | ResNet-18 | Seed | 63.9 (0.7) | 66.1 (1.9) | –.- (-.-) | -0.1 (1.9) | 125.4 (4.9) | 77.1 (3.0) |

The columns group into Performance (Accuracy), Agreement (agreement rate, cka), and Weights (mean weight w, pairwise l2-dist, pairwise cos dist); values are reported as mean (std) over each zoo.

The model zoos have been created aiming at diversity in generating factors, weights, and performance. In this section, we analyse the zoos and their properties. Zoo cards with key values and visualizations are provided along with the zoos online. We consider models at their last epoch for this analysis. For all later analyses, non-viable checkpoints are excluded from each zoo. This includes the removal of every checkpoint with NaN values or values beyond a threshold; the threshold is set per zoo such that it only excludes diverged models.
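A possible filter for non-viable checkpoints might look as follows; the threshold value here is hypothetical, whereas in the zoos it is chosen per zoo.

```python
import numpy as np

def viable(w, threshold=1e3):
    """Reject checkpoints with NaN/Inf values or weights beyond a threshold,
    i.e., diverged models. The threshold is illustrative; in practice it is
    set per zoo."""
    w = np.asarray(w)
    return bool(np.all(np.isfinite(w)) and np.max(np.abs(w)) < threshold)

checkpoints = [
    np.array([0.1, -0.3, 0.5]),    # healthy model -> kept
    np.array([0.1, np.nan, 0.5]),  # NaN values -> excluded
    np.array([0.1, 5e4, 0.5]),     # exploded weights -> excluded
]
kept = [w for w in checkpoints if viable(w)]
print(len(kept))  # 1
```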

#### Performance

To investigate the performance diversity, we consider the accuracy of the models in the zoos, see Table 2 and Figure 2. As expected, the zoos with variation only in the seed show the smallest variation in performance. Changing the hyperparameters induces a broader range of variation. Randomizing (Hyp-10-rand) or fixing (Hyp-10-fix) the seeds does not affect the accuracy distribution.

#### Model Agreement

To get more in-depth insight into the diversity of model behavior, we investigate the models' pairwise agreement, see Table 2. To that end, we compute the rate of agreement of class predictions between two models $A$ and $B$ as $\kappa = \frac{1}{M}\sum_{i=1}^{M}\mathbb{1}(\hat{y}_i^A = \hat{y}_i^B)$, where $\hat{y}_i^A$ and $\hat{y}_i^B$ are the predictions of models $A$ and $B$ for sample $x_i$ of $M$ samples, and $\mathbb{1}(a=b) = 1$ if $a = b$ and $0$ otherwise. Further, we compute the pairwise centered kernel alignment (cka) score between intermediate and last-layer outputs. The cka score evaluates the correlation of activations, compensating for equivariances typical for neural networks (nguyenWideDeepNetworks2020). In empirical evaluations, we found the cka score robust even for a relatively small number of image samples, and compute the score using 50 images to reduce the computational load. Both agreement metrics confirm the expectations and the performance results. Zoos with higher overall performance naturally have higher agreement on average, as there are fewer samples on which to disagree. Zoos with varying hyperparameters (Hyp-10-rand and Hyp-10-fix) agree less on average than zoos with changes in seed only (Seed). What is more, the distribution of $\kappa$ and cka in the Seed zoos is unimodal and approximately Gaussian. In the Hyp-10 zoos, the distributions are bimodal, with one mode around 0.1 (0.0) and the other around 0.9 (0.75) in hard agreement (cka score). In these zoos, models agree to a rather high degree with some models, and disagree with others.
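The agreement rate between two models can be computed directly from their class predictions; a small sketch with made-up prediction vectors:

```python
import numpy as np

def agreement(preds_a, preds_b):
    """Rate of identical class predictions between two models over M samples."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

# Illustrative class predictions of two models on the same 8 samples.
y_a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
y_b = np.array([3, 1, 4, 0, 5, 9, 0, 6])
print(agreement(y_a, y_b))  # 6 of 8 predictions agree -> 0.75
```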

#### Weights

Lastly, we investigate the diversity of the model zoos in weight space, see again Table 2. By design, the mean weight value of the zoos varying only in the seed is larger than in the other zoos, while the standard deviation does not differ greatly (Table 2, column w). To get a better intuition of the distribution of models in weight space, we compute the pairwise l2 and cosine distances between the weight vectors and investigate their distributions. Here, too, varying the hyperparameters introduces higher amounts of diversity, while randomizing or fixing the seeds does not affect the weight diversity much. As these values are computed at the end of model training, repeated starting points due to fixed seeds appear not to reduce weight diversity significantly. In a more hands-off approach, we compute 2d reductions of the weights over all epochs using UMAP (mcinnesUMAPUniformManifold2018a). In the 2d reductions (see Figure 3), the zoos varying in seed only show little to no structure. Zoos with hyperparameter changes and random seeds are similarly unstructured. Zoos with varying hyperparameters and fixed seeds show clear clusters of models with the same initialization method and activation function. These findings are further supported by the predictability of the initialization method and activation function (Table 3). The structures are unsurprising, considering that the activation function is very influential in shaping the loss surface, while the initialization method and the seed determine the starting point on it. Depending on the downstream task, this property can be desirable or should be avoided, which is why we provide both configurations.

#### Model Property Prediction

As a set of benchmark results on the proposed model zoos, and to further evaluate the zoos, we use linear models to predict hyperparameters or performance values of the individual models. As features, we use the model weights w or per-layer statistics s(w) (quintiles of the weights) as in (unterthinerPredictingNeuralNetwork2020). Linear models are used to evaluate the properties of the dataset and the quality of the features. We report these results in Table 3. The layer-wise weight statistics s(w) generally have higher predictive performance than the raw weights w. In particular, s(w) are not affected by using fixed or random seeds and thus generalize well to unseen seeds. For the ResNet-18 zoos, w becomes too large to be used as a feature and is therefore omitted. Across all zoos, the accuracy as well as the hyperparameters can be predicted very accurately. The generalization gap and the epoch appear to be more difficult to predict. These findings hold for all zoos, regardless of the different architectures, model sizes, task complexities, and performance ranges. The weights w can be used to predict the initialization method and activation function to very high accuracy if the seeds are fixed; the performance drops drastically if seeds are varied. This result confirms our expectation of diversity in weight space induced by fixing or varying the seed. These results show (i) that the model weights of our zoos contain rich information on their properties; (ii) confirm the notions of diversity that were design goals for the zoos; and (iii) leave room for improvement on the properties that are more difficult to predict, in particular the generalization gap.
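The layer-wise statistics features can be sketched as follows. The feature set below (mean, std, and a few quantiles per layer) follows the spirit of (unterthinerPredictingNeuralNetwork2020); the exact quantile levels used in the benchmark may differ, and the layer contents are illustrative.

```python
import numpy as np

def layer_statistics(state):
    """Per-layer summary features: mean, std, and quantiles of each layer's
    weights, concatenated into one feature vector per model.

    Note: quantile levels here are illustrative (min, quartiles, max); the
    paper's exact feature set may use different levels.
    """
    feats = []
    for p in state.values():
        v = np.asarray(p).ravel()
        feats.extend([v.mean(), v.std()])
        feats.extend(np.quantile(v, [0.0, 0.25, 0.5, 0.75, 1.0]))
    return np.array(feats)

# Hypothetical two-layer model state standing in for a real checkpoint.
state = {
    "layer1": np.linspace(-1.0, 1.0, 101),
    "layer2": np.linspace(-0.5, 0.5, 51),
}
feats = layer_statistics(state)
print(feats.shape)  # 2 layers x (2 moments + 5 quantiles) = (14,)
```

These fixed-size features can then be fed into any linear probe (e.g., logistic or ridge regression) regardless of the underlying architecture's parameter count.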

## 5 Potential Use-Cases & Applications

While populations of NNs have been used in previous work, they still are relatively novel as a dataset. As use-cases for such datasets may not be obvious, this section presents potential use-cases and applications. For all use-cases, we collect related work that uses model populations. Here, the zoos may be used as data or to evaluate the methods. For some of the use-cases, the analysis above provides support. Lastly, we suggest ideas for future work which we hope can inspire the community to make use of the model zoos.

### 5.1 Model Analysis

The analysis of trained models is an important and difficult step in the machine learning pipeline. Commonly, models are evaluated on hold-out test sets, which may contain difficult cases with specific properties (lecunDeepLearning2015). Other approaches identify subsets of the input data that are relevant for a specific output (yosinskiUnderstandingNeuralNetworks2015; karpathyVisualizingUnderstandingRecurrent2015; zintgrafVisualizingDeepNeural2017). A third group of methods compares the activations of models, e.g., the cka method used in Sec. 4 to measure diversity (kornblithSimilarityNeuralNetwork2019).

Populations of models have been used to identify commonalities in model weights, activations, or graph structures which are predictive of model properties. Some methods use the weights, weight statistics or eigenvalues of the weight matrices as features to predict a model's accuracy or hyperparameters (unterthinerPredictingNeuralNetwork2020; eilertsenClassifyingClassifierDissecting2020; martinTraditionalHeavyTailedSelf2019). Recently, (schurholtSelfSupervisedRepresentationLearning2021) have learned self-supervised representations of the weights and demonstrated their usefulness for predicting model properties. Other publications use activations to approximate intermediate margins (yakTaskArchitectureIndependentGeneralization2019; jiangPredictingGeneralizationGap2019) or graph connectivity features (corneanuComputingTestingError2020) to predict the generalization gap or test accuracy. Standardized, diverse model zoos may facilitate the development of new methods, or serve as evaluation datasets for existing model analysis, interpretability or comparison methods. Previous work as well as the experimental results in Sec. 4 indicate that even more complex model properties might be predicted from the weights. By studying populations of models, in-depth diagnostics of models, such as whether a model has learned a specific bias, may be based on the weights or topology of models. Lastly, model properties as well as the weights may be used to derive a model 'identity' along the training trajectory, to allow for NN versioning.

### 5.2 Learning Dynamics

Analysing and utilizing the learning dynamics of models has been a useful practice. For example, early stopping (FinnoffImprovingModelSelection1993) determines when to end training at minimal generalization error based on a cross-validation set, and has become standard in machine learning practice.
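A minimal sketch of the early-stopping criterion on a recorded validation curve (the numbers are synthetic):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch with minimal validation loss, stopping once the
    loss has not improved for `patience` consecutive epochs."""
    best_epoch, best_loss, since_best = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, since_best = epoch, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # generalization error stopped improving
    return best_epoch, best_loss

# Validation loss dips at epoch 3, then overfitting sets in.
losses = [1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7]
print(train_with_early_stopping(losses))  # (3, 0.4)
```

Since the zoos record per-epoch performance metrics for every model, criteria like this one can be evaluated post hoc across a whole population.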

More recently, methods have exploited zoos of models. Population based training (jaderbergPopulationBasedTraining2017) evaluates the performance of model candidates in a population, and decides which of the candidates to pursue further and which to give up. HyperBand evaluates performance metrics for groups of models to optimize hyperparameters (liHyperbandNovelBanditBased2018; liSystemMassivelyParallel2020). Research in Neural Architecture Search was greatly simplified by the NASBench dataset family (yingNASBench101ReproducibleNeural2019), which contains performance metrics for varying hyperparameter choices. Our model zoos extend these datasets by additionally providing the model weights at multiple states throughout training, which may open the door for new approaches.

The accuracy distributions of our model zoos become relatively broad if hyperparameters are varied (Figure 2). For early stopping or population based methods, identifying a good range of hyperparameters to try, and then identifying those candidates that will perform best towards the end of training, is a challenging and relevant task. Our model zoos may be used to develop and evaluate methods to that end. Beyond that, diverse model zoos offer the opportunity to take further steps in understanding and exploiting the learning dynamics of models, e.g., by studying the regularities of generalizing and overfitting models. The shape and curvature of training trajectories may contain rich information on the state of model training. Such information could be used to monitor model training, or to adjust hyperparameters to achieve better results. The sparsified model zoos add several potential use-cases: they may be used to study sparsification performance on a population level, emerging patterns in populations of sparse models, or the relation between full models and their sparse counterparts.

### 5.3 Representation Learning

NN models have grown in recent years, and with them the dimensionality of their parameter space. Empirically, it is more effective to train large models to high performance and distill them in a second step, than to directly train the small models (hoeflerSparsityDeepLearning2021; liuWeActuallyNeed2021). This and other related problems raise interesting questions. What are useful regularities in NN weights? How can the weight space be navigated in a more efficient way?

Recent work has attempted to learn lower-dimensional representations of the weights of NNs (haHyperNetworks2016; ratzlaffHyperGANGenerativeModel2019; zhangGraphHyperNetworksNeural2019; knyazevParameterPredictionUnseen2021; schurholtSelfSupervisedRepresentationLearning2021; schurholtHyperRepresentationsGenerativeModels2022; schurholtHyperRepresentationsPreTrainingTransfer2022). Such representations can reveal the latent structure of NN weights. Other approaches identify subspaces in the weight space which relate to high performance or generalization (wortsmanLearningNeuralNetwork2021; lucasMonotonicLinearInterpolation2021; bentonLossSurfaceSimplexes2021). In (schurholtSelfSupervisedRepresentationLearning2021), representations learned on model zoos achieve higher performance in predicting model properties than raw weights or weight statistics. (knyazevParameterPredictionUnseen2021) propose a method that learns from a population of diverse neural architectures to generate weights for unseen architectures in a single forward pass.

Our model zoos can serve either as datasets to train representations on, as in (schurholtSelfSupervisedRepresentationLearning2021) or (bentonLossSurfaceSimplexes2021), or as common datasets to validate such methods. Learned representations may bring a better understanding of the weight space and thus help to reduce the computational cost and improve the performance of NNs.

### 5.4 Generating New Models

In conventional machine learning, models are randomly initialized and then trained on data. As that procedure may require large amounts of data and computational resources, fine-tuning and transfer learning are more efficient training approaches that re-use already trained models for a different task or dataset (yosinskiHowTransferableAre2014; fengTransferredDiscrepancyQuantifying2020). Other publications have extended the concept of transfer learning from a one-to-one setup to many-to-one setups (liuKnowledgeFlowImprove2019; shuZooTuningAdaptiveTransfer2021). Both approaches attempt to combine learned knowledge from several source models into a single target model. Most recently, (schurholtHyperRepresentationsGenerativeModels2022; schurholtHyperRepresentationsPreTrainingTransfer2022) have generated unseen NN models with desirable properties from representations learned on model zoos. The generated models were able to outperform random initialization and pretraining in transfer-learning regimes. In (peeblesLearningLearnGenerative2022), a transformer is trained with a diffusion objective on a population of models to generate model weights.

All these approaches require suitable and diverse models to be available. Further, the exact properties of models suitable for generative use, transfer learning or ensembles are still under discussion (fengTransferredDiscrepancyQuantifying2020). Population based transfer learning methods such as zoo-tuning (shuZooTuningAdaptiveTransfer2021), knowledge flow (liuKnowledgeFlowImprove2019) or model-zoo (rameshModelZooGrowing2022) have been demonstrated on populations with only a few models. Populations for these methods ideally are as diverse as possible, so that they provide different features. Investigating the models in the proposed zoos may help identify models which lend themselves to transfer learning or ensembling.
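As a toy sketch of combining knowledge from several source models, the snippet below ensembles the predictions of a small synthetic population of noisy linear classifiers; the task, models and numbers are all stand-ins, not the zoo models themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic 2-class task and a small "population" of linear models that
# all perturb the same ground-truth decision boundary.
w_true = rng.normal(size=(5, 2))
X = rng.normal(size=(400, 5))
y = (X @ w_true).argmax(axis=1)
population = [w_true + 0.8 * rng.normal(size=(5, 2)) for _ in range(10)]

single_acc = np.mean([((X @ w).argmax(axis=1) == y).mean() for w in population])
ens_probs = np.mean([softmax(X @ w) for w in population], axis=0)
ensemble_acc = (ens_probs.argmax(axis=1) == y).mean()
print(f"avg single model: {single_acc:.3f}  ensemble: {ensemble_acc:.3f}")
```

Because the individual perturbations are independent, averaging the predicted probabilities cancels much of the noise; how well this carries over to a real zoo depends on how diverse, and how correlated, its members are.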

## 6 Conclusion

To enable the investigation of populations of neural network models, we release a novel dataset of model zoos with this work. These model zoos contain systematically generated and diverse populations of 50'360 neural network models, comprising 3'844'360 collective model states. The released model zoos come with a comprehensive analysis and initial benchmarks for multiple downstream tasks, and invite further work in the direction of the following use-cases: (i) model analysis, (ii) learning dynamics, (iii) representation learning and (iv) model generation.

## Acknowledgments

This work was partially funded by Google Research Scholar Award, the University of St.Gallen Basic Research Fund, and project PID2020-117142GB-I00 funded by MCIN/ AEI /10.13039/501100011033. We are thankful to Erik Vee for insightful discussions and Michael Mommert for editorial support.


## Appendix A Model Zoo Generation Details

In our model zoos, we use three architectures. Two of them rely on a general CNN architecture, the third is a common ResNet-18 (heDeepResidualLearning2016). For the first two architectures, we use the general CNN architecture in two sizes, detailed in Table 4. By varying the generating factors listed in Table 1, we create a grid of configurations, where each node represents a model. Each node is instantiated as a model and trained with the exact same training protocol. We chose the hyperparameters with diversity in mind. The ranges for each of the generating factors are chosen such that they can lead to functioning models with a corresponding set of other generating factors. Nonetheless, this leads to some nodes with uncommon and less than promising configurations.

The code to generate the models can be found on www.modelzoos.cc. With that code, the model zoos can be replicated, changed or extended. We trained our model zoos on CPU nodes with up to 64 CPUs. Training a zoo takes between 3h (small models, small configuration and small dataset) and 3 days (large models, large configuration and large dataset). Overall, the generation of the zoos took around 30’000 CPU hours.

| Layer | Component | CNN small | CNN large |
|---|---|---|---|
| Conv 1 | input channels | 1 or 3 | 3 |
| | output channels | 8 | 16 |
| | kernel size | 5 | 3 |
| | stride | 1 | 1 |
| | padding | 0 | 0 |
| Max Pooling | kernel size | 2 | 2 |
| Activation | | | |
| Conv 2 | input channels | 8 | 16 |
| | output channels | 6 | 32 |
| | kernel size | 5 | 3 |
| | stride | 1 | 1 |
| | padding | 0 | 0 |
| Max Pooling | kernel size | 2 | 2 |
| Activation | | | |
| Conv 3 | input channels | 6 | 32 |
| | output channels | 4 | 15 |
| | kernel size | 2 | 3 |
| | stride | 1 | 1 |
| | padding | 0 | 0 |
| Activation | | | |
| Linear 1 | input channels | 36 | 60 |
| | output channels | 20 | 20 |
| Activation | | | |
| Linear 2 | input channels | 20 | 20 |
| | output channels | 10 | 10 |
| Total Parameters | | 2464 or 2864 | 10853 |
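The parameter totals in Table 4 can be verified with a quick count (conv layer: in·out·k² weights plus one bias per output channel; linear layer: in·out weights plus biases):

```python
def conv_params(c_in, c_out, k):
    # weights (c_in * c_out * k * k) plus one bias per output channel
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

# CNN small, with 1 input channel (grayscale input)
small = (conv_params(1, 8, 5) + conv_params(8, 6, 5) + conv_params(6, 4, 2)
         + linear_params(36, 20) + linear_params(20, 10))
# CNN large, with 3 input channels
large = (conv_params(3, 16, 3) + conv_params(16, 32, 3) + conv_params(32, 15, 3)
         + linear_params(60, 20) + linear_params(20, 10))
print(small, large)  # 2464 10853
```

With 3 input channels, the first conv layer of the small CNN gains 2·8·25 = 400 weights, giving the alternative total of 2864.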

## Appendix B Data Management and Accessibility of Model Zoos

Data Management and Documentation:

To ensure that every zoo is reproducible, expandable, and understandable, we document each zoo. For each zoo, a Readme file is generated, displaying basic information about the zoo. The exact search pattern and the training protocol used to train the zoo are saved in a machine-readable json file. To make the zoos expandable, the dataset used to train the zoo and a file describing the model architecture are included. The model class definition in pytorch is included with the zoo. Each model is saved along with a json file containing its exact hyperparameter combination. A second json file contains the performance metrics during training. Model checkpoints are saved for every epoch. To enable further training of the models in the zoo, a checkpoint recording the optimizer state is saved for the final epoch of each model. All data can be found on the model zoo website as well as directly from Zenodo.

Accessibility:

We ensure the technical accessibility of the data by hosting it on Zenodo, where the data will be hosted for at least 20 years. Further, we take steps to reduce access barriers by providing code for data loading and preprocessing. This reduces the friction associated with analyzing the raw zoo files and improves consistency by reducing errors when extracting information from the zoos. To that end, we provide a PyTorch dataset class encapsulating all model zoos for easy and quick access within the PyTorch framework. A Tensorflow counterpart will follow. All code can be found on the model zoo website as well as in a code repository on github. To ensure conceptual accessibility, we include detailed insights, visualizations and the analysis of the model zoos (Sec. 4) with each zoo. More details can be found on the dataset website www.modelzoos.cc.

## Appendix C Dataset Documentation and Intended Uses

The main dataset documentation can be found at www.modelzoos.cc and is detailed in the paper in Section 3.4. There, we provide links to the zoos, which are hosted on Zenodo, as well as analysis of the zoos. In the future, the analysis will be systematically extended. The documentation includes code to reproduce, adapt or extend the zoos, code to reproduce the benchmark results, as well as code to load and preprocess the datasets. Dataset metadata and DOIs are automatically provided by Zenodo, which also guarantees the long-term availability of the data. Files are stored as zip, json and pt (pytorch) files. All libraries to read and use the files are common and open source. We provide the code necessary to read and interpret the data.

The datasets are synthetic and intended to investigate populations of neural network models, i.e., to develop or evaluate model analysis methods, progress the understanding of learning dynamics, serve as datasets for representation learning on neural network models, or as a basis for new model generation methods. More information regarding the usage is given in the paper.

## Appendix D Author Statement

The dataset is publicly available under www.modelzoos.cc and licensed under the Creative Commons Attribution 4.0 International license (CC-BY 4.0). The authors state that they bear responsibility under the CC-BY 4.0 license.

## Appendix E Hosting, Licensing, and Maintenance Plan

The dataset is publicly available under www.modelzoos.cc and licensed under the Creative Commons Attribution 4.0 International license (CC-BY 4.0). The landing page contains documentation, code and references to the datasets, as detailed in the paper in Section 3.4. The datasets are hosted on Zenodo, to ensure (i) long-term availability (at least 20 years), (ii) automatically searchable dataset metadata, (iii) DOIs for the datasets, and (iv) dataset versioning. The authors will maintain the datasets, but invite the community to engage. Code to recreate, correct, adapt, or extend the datasets is provided, such that maintenance can be taken over by the community if needed. The github repository allows the community to discuss, interact, and add or change code.