Over the last decade, many neural network architectures for solving different artificial intelligence problems have been proposed, and each new state-of-the-art model usually has more parameters than its predecessor. When working within a fixed architectural family, such as WideResNets (Zagoruyko and Komodakis, 2016) or Transformers (Vaswani et al., 2017), a variety of upscaling methods can be used to increase the network size. The two most common approaches are to increase the model depth, i.e., the number of layers (He et al., 2015; Wang et al., 2019; Pham et al., 2019), or to increase the model width by, for example, increasing the number of filters in a convolutional neural network or the dimensionality of embeddings and fully-connected layers in a Transformer. Both methods improve model performance and can also be combined for higher effectiveness (Tan and Le, 2019). In this work, we focus on increasing the network size by scaling the width of each layer, because straightforward upscaling of the network depth often leads to optimization difficulties (Huang et al., 2016; Bapna et al., 2018).
Rather than increasing the size of one model, one can train an ensemble of several models. One of the most popular ways to construct an ensemble of deep neural networks is to train individual networks from different random initializations and then average their predictions (Hansen and Salamon, 1990; Lakshminarayanan et al., 2017). Such ensembles are called Deep Ensembles. Increasing the ensemble size of Deep Ensembles was shown to improve both classification performance and the quality of the model's uncertainty estimates (Szegedy et al., 2016; Wu et al., 2016; Devlin et al., 2019; Ashukha et al., 2020).
The effects of increasing the network size or the ensemble size independently are well investigated in the literature. In this work, we investigate these effects in a fixed memory budget setting, in which increasing the network size entails decreasing the ensemble size. By a fixed memory budget, we mean a fixed total number of parameters. We focus on the following question: with a fixed number of parameters, what performs best: (a) one very wide neural network, (b) an ensemble of several medium-width neural networks, or (c) an ensemble of many thin neural networks? Our main empirical result is that, for large enough memory budgets, (b) is better than (a) and (c): splitting the memory budget between several mid-size neural networks results in better performance than spending the budget on one big network or on a huge number of small networks (see figure 1). We call this effect the Memory Split Advantage (MSA). We perform a rigorous empirical study of the MSA effect and demonstrate it for VGG (Simonyan and Zisserman, 2014) and WideResNet on the CIFAR-10/100 datasets, as well as for the Transformer on IWSLT'14 German-to-English. We observe that for many dataset–architecture pairs, the MSA effect holds even for small architecture configurations, i.e., configurations several times smaller than commonly used ones.
The MSA effect leads to a simple and effective way of improving model quality without changing the number of parameters. Moreover, splitting the network into an ensemble of several smaller ones allows for distributed training and prediction.
2 Related work
According to the conventional bias-variance trade-off theory (Hastie et al., 2004), for a particular task and model family, there should be an optimal model size, such that larger, more complex models overfit and perform worse. However, a tendency of larger neural networks to achieve higher performance is observed in many practical applications of modern deep learning (Huang et al., 2018; Tan and Le, 2019; Radford et al., 2018; Devlin et al., 2019), even though such models are overparametrized and are able to fit even random labels (Zhang et al., 2017). This phenomenon is actively researched, and recent works (Belkin et al., 2018; Nakkiran et al., 2019) confirm that, in the overparametrized regime, increasing the size of the network leads to better quality.
The standard ensembling approach for neural networks consists of training individual models independently and then averaging their predictions. Diversity in the error distributions across member networks is essential for constructing an effective ensemble (Hansen and Salamon, 1990; Krogh and Vedelsby, 1994). To achieve it, networks are usually trained on the same dataset but from different random initializations (Lakshminarayanan et al., 2017). For neural networks, this approach results in higher performance than standard bagging (Lee et al., 2015). Ensembling of neural networks was shown to substantially increase both classification performance and the quality of uncertainty estimation in many practical tasks (Szegedy et al., 2016; Wu et al., 2016; Devlin et al., 2019; Beluch et al., 2018; Ashukha et al., 2020). Moreover, many winning and top-performing solutions of Kaggle competitions (www.kaggle.com) use ensembles of deep neural networks.
Memory efficient ensembles.
While achieving higher performance, ensembles of neural networks are also more resource-consuming in terms of memory. Various compression methods may be used to compress individual networks, such as sparsification and quantization (Han et al., 2016) or more compact parametrizations of weight matrices (Novikov et al., 2015). To compress an ensemble, one may compress each member network independently or apply, for example, distillation (Hinton et al., 2015) to the whole ensemble. These approaches are orthogonal to our findings and can be straightforwardly combined with them.
More compact ensembles may also be constructed by sharing parameters between member networks (Lee et al., 2015; Gao et al., 2019). Gao et al. (2019) propose a technique of simultaneous training of several sub-networks inside one network so that these sub-networks are then used as an ensemble. Hence, the authors also split one network into an ensemble, but in a different way: the sub-networks are dependent and share a large portion of parameters, while our findings show that even splitting one network into several independent ones boosts the performance for large enough budgets.
Multi-branch architectures (Szegedy et al., 2016; Xie et al., 2017), which split some building blocks of a network into sub-blocks and aggregate (e.g., sum or concatenate) the sub-blocks' outputs, can also be seen as an ensemble of a large number of subnetworks. These subnetworks are again closely connected and trained jointly, while in our work we investigate the effect of independent training of the networks in the ensemble.
In this paper, we focus on solving a machine learning task with an ensemble of one or more neural networks with a fixed total number of parameters. For a fixed memory budget, we consider ensembles containing different numbers of networks and vary the width of the member networks, so that the resulting combination of the two parameters, the number of networks and the network width, matches the given memory budget. We call such ensembles with a fixed total number of parameters memory splits. We do not change any architectural properties of the individual networks other than their width.
The dependency of an individual neural network's performance on its width usually looks as shown in figure 1: quality increases as the width grows and saturates for larger widths (Tan and Le, 2019). As a result, in practice it is preferable to use a network that is wide enough to achieve high performance but not too wide, since an excessively wide network needs a lot of resources while providing only a negligible quality improvement.
The same type of dependency is observed between quality and the number of networks in Deep Ensembles: the more networks the better, but for large numbers of networks the quality saturates (Ashukha et al., 2020). An ensemble quality curve (the network size is fixed, the ensemble size varies) may be depicted in the same plot as the quality curve of an individual network by using the number of parameters on the x-axis (instead of the width or the number of networks). We show this curve for ensembles with member networks of two sizes in figure 1.
When working with a fixed memory budget, one needs to choose how to spend it: train one large network, or split the budget into two or more parts and train several smaller networks. Intuitively, the choice depends on the budget and on the relative arrangement of the quality curves of an individual model and of ensembles. For a small budget (the left vertical line in figure 1), one model is expected to give higher performance, because the curve for one model increases faster than that for an ensemble. We expect such behavior because smaller models have a high bias, while ensembling tends to decrease the variance (Tumer and Ghosh, 1996). For a large budget (the right vertical line in figure 1), the performance of an individual network saturates, while common practice shows that ensembling increases performance even for very large networks (Szegedy et al., 2016; Devlin et al., 2019). Based on these observations, we hypothesize that the MSA effect holds for large budgets: it is preferable to split the budget and train an ensemble of several networks instead of one large network.
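The intuition above can be sketched with a toy model. This is purely illustrative and is not the paper's model: we assume a saturating quality curve for a single network and a small, saturating ensembling gain as a crude stand-in for variance reduction; all constants are made up.

```python
# Toy illustration of the MSA intuition (all constants are made up).

def single_quality(params, q_max=100.0, scale=1e6):
    """Saturating quality of one network as a function of its parameter count."""
    return q_max * params / (params + scale)

def ensemble_quality(k, budget):
    """Quality of k independent networks of size budget / k, with a small,
    saturating ensembling gain (a crude stand-in for variance reduction)."""
    member = single_quality(budget / k)
    gain = 2.0 * (1.0 - 1.0 / k)  # grows with k, saturates
    return min(member + gain, 100.0)

def best_split(budget, k_grid=(1, 2, 4, 8, 16)):
    """Ensemble size maximizing toy quality under the fixed budget."""
    return max(k_grid, key=lambda k: ensemble_quality(k, budget))
```

In this toy model, a small budget (e.g. `1e5` parameters) is best spent on one network, while a large budget (e.g. `1e8`) is best split between several networks, mirroring the hypothesis above.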
In this paper, we consider several dataset–architecture pairs and empirically investigate the following research questions (RQs):
Does the MSA effect hold for various datasets and architectures?
For which budgets does it hold? Does it hold for standard budgets that are usually used in practice?
What does the optimal split look like for different budgets?
4 Experimental design
We perform experiments with convolutional neural networks (WideResNet28x10 and VGG16) on CIFAR-10 (Krizhevsky et al., a) and CIFAR-100 (Krizhevsky et al., b), and with the Transformer on IWSLT'14 German-English (De-En) (Cettolo et al., 2014). For each dataset–architecture pair, we investigate memory splitting for a range of memory budgets. To obtain different memory splits, we select the width factor, a hyperparameter controlling layer widths, so that the size of each member network is approximately equal to the budget divided by the ensemble size.
Since good model performance is usually obtained with carefully tuned hyperparameters, and the optimal hyperparameters may change with the network size, we investigate memory splitting in two settings:
Setting A, without hyperparameter tuning: we use the same hyperparameters for all memory splits and turn off regularization (weight decay, dropout);
Setting B, with hyperparameter tuning: for each network size, we use grid search to tune the hyperparameters of a single network on the validation set; to train memory splits, we use the hyperparameters that are optimal for the member networks.
We can only find good hyperparameters approximately, which introduces additional noise into the results. Setting A avoids this noise and yields smoother plots; it is also similar to the setting of Nakkiran et al. (2019). Setting B is more practically oriented.
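The per-size tuning of setting B amounts to an exhaustive search over a hyperparameter grid. A minimal sketch, in which the grid contents and the scoring callback are hypothetical stand-ins for training a single network and evaluating it on the validation set:

```python
import itertools

def grid_search(train_and_eval, grid):
    """Exhaustive search over a hyperparameter grid.

    grid: dict mapping hyperparameter name -> list of candidate values.
    train_and_eval: callback taking a hyperparameter dict and returning a
    validation score (higher is better); here it stands in for training a
    single network of the given size and evaluating it.
    """
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        hp = dict(zip(names, values))
        score = train_and_eval(hp)
        if best is None or score > best[1]:
            best = (hp, score)
    return best[0]
```

The best hyperparameters found for a single network of a given size are then reused for every memory split whose members have that size.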
For each dataset–architecture pair, in each setting, we provide a memory split plot.
Memory split plot.
Each plot contains several lines; each line corresponds to a fixed memory budget. For a budget B (a number of parameters), we train several memory splits, each containing k networks of size B/k, with k on a logarithmic grid. To obtain a network of size B/k, we adjust the width factor of the network. We then plot the test quality vs. the ensemble size k and analyze which k is optimal. The ensemble corresponding to the optimal k is called the optimal memory split for budget B. The optimal k usually varies for different budgets B.
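The construction of one budget line can be sketched as follows. The doubling grid and the rounding via integer division are assumptions for illustration; the actual member sizes are obtained by adjusting the width factor, as described above.

```python
# Sketch: enumerate the memory splits for one budget line of the plot.

def memory_splits(budget, max_networks=64):
    """Return (ensemble_size, member_size) pairs for a fixed parameter budget,
    with ensemble sizes on a logarithmic (doubling) grid."""
    splits = []
    k = 1
    while k <= max_networks:
        splits.append((k, budget // k))  # each member gets ~budget/k parameters
        k *= 2
    return splits
```

For example, a 36.8M-parameter budget with up to 8 networks yields splits of 1×36.8M, 2×18.4M, 4×9.2M, and 8×4.6M parameters.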
For the majority of points on the plots, we average the test quality over 3–5 runs (3 for more computationally expensive runs, 5 for less expensive ones) and plot the mean ± standard deviation of the quality. The most computationally expensive ensembles (the largest k for one or two biggest budgets) were trained only once, but the standard deviation of their test quality is small because of the large ensemble size.
In the rest of the paper, we denote an ensemble of k networks, each having s parameters, by E(k, s). The total number of parameters in E(k, s) equals k·s. Q(E) denotes the test quality of ensemble E. B_std denotes the standard budget, equal to the number of parameters in a single network of the commonly used size for the specific architecture.
Experimental details for CNNs.
We consider VGG (Simonyan and Zisserman, 2014) (16 layers) and WideResNet (Zagoruyko and Komodakis, 2016) (28 layers). We use the implementation provided by Garipov et al. (2018) (https://github.com/timgaripov/dnn-mode-connectivity) and scale the number of filters of the convolutional layers and the number of neurons of the fully-connected layers with a width factor. For VGG, the standard, commonly used configuration has 15.3M parameters; we use this size as the standard budget in the experiments with VGG. For WideResNet, the standard model is WideResNet-28-10 (width factor 10); we use its size (36.8M parameters) as the standard budget in the experiments with WideResNet.
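Because every layer's input and output widths are scaled together, the parameter count of a network grows roughly quadratically with the width factor. A sketch with a made-up toy conv stack (not the exact VGG or WideResNet configuration):

```python
# Sketch: the parameter count is (roughly) a quadratic function of the width
# factor, because a conv layer with c_in*w input and c_out*w output channels
# has about (c_in*w)*(c_out*w)*k*k weights, i.e. O(w^2).
# The channel list below is an illustrative toy stack.

def conv_params(width_factor, base_channels=(3, 64, 128, 256), kernel=3):
    """Approximate weight count of a conv stack scaled by width_factor."""
    total = 0
    for c_in, c_out in zip(base_channels, base_channels[1:]):
        # the input image channels (the first c_in) are not scaled
        scaled_in = c_in if c_in == base_channels[0] else round(c_in * width_factor)
        scaled_out = round(c_out * width_factor)
        total += scaled_in * scaled_out * kernel * kernel
    return total
```

Doubling the width factor roughly quadruples the parameter count, which is why halving the parameter budget of a member network shrinks its width only by a factor of about √2.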
We train all the networks for 200 epochs with SGD, with an annealing learning rate schedule and a batch size of 128. In setting A, we use an initial learning rate 10 times smaller than in the reference implementation to ensure that training converges for all considered models. In setting B, we consider both the standard learning rate and the smaller one. For VGG, we use weight decay and binary dropout for the fully-connected layers. For WideResNet, we use weight decay and batch normalization (Ioffe and Szegedy, 2015). Dropout does not affect quality much, so we do not use it for WideResNet, to reduce the hyperparameter grid size. Hyperparameter choices for all cases are listed in table LABEL:tab:hyper. Quality is measured in accuracy. More details are given in Appendix A.
Experimental details for Transformer.
We use the standard Transformer architecture (Vaswani et al., 2017) with 6 layers and 4 attention heads per layer, and scale the model dimension and the feed-forward dimension with a width factor to obtain neural networks fitting different memory budgets. The standard, commonly used configuration has 39.5M parameters; we use this size as the standard budget in the experiments with the Transformer. We use the implementation provided by fairseq (https://github.com/pytorch/fairseq) (Ott et al., 2019).
We use label smoothing and batches of at most 4096 tokens. We train the models using Adam (Kingma and Ba, 2015) with an inverse square-root learning rate schedule with 4000 warm-up steps. We stop training after the convergence of the validation loss or after 100/50 epochs for setting A/B, whichever comes first. In setting A, we use a reduced learning rate to ensure stable training for all network sizes. In setting B, we use weight decay and binary dropout, and choose the optimal hyperparameters using grid search (see table LABEL:tab:hyper). For evaluation, we use beam search with a beam size of 5 and a length penalty of 1.0. Translation quality is measured in BLEU (Papineni et al., 2002). More details are given in Appendix A.
5 Memory split advantage effect
The memory split plots for CNNs and the Transformer are given in figures 2 and 3, respectively. The two columns show the results for the two settings (without hyperparameter tuning and with tuned hyperparameters), and each row corresponds to a dataset–architecture pair.
Verifying assumptions for the MSA effect.
Our experiments confirm the generally accepted view that increasing the size of a single model results in an increase and saturation of test quality, both with and without hyperparameter tuning for each model size. We refer to this effect as individual quality saturation. It may be seen in figures 2 and 3 by going from bottom to top along a vertical line (moving across the colored lines); we also give more convenient plots in Appendix B. Similar results without hyperparameter tuning, and hence without proper regularization, were shown in (Nakkiran et al., 2019), but we did not find a comparable experiment for the regularized setting.
We also checked that, when the member network size is fixed, increasing the ensemble size results in an increase and saturation of ensemble quality (referred to later as ensemble quality saturation). This observation may also be made from figures 2 and 3, and similar results are given in (Ashukha et al., 2020). These two observations allow us to move ahead and answer the research questions stated in section 3.
RQ1: Does the MSA effect hold for various datasets and architectures?
For a particular memory budget, the MSA effect holds if the number of networks in the optimal memory split is larger than one.
In other words, the line on the plot corresponding to the budget has an optimum at an abscissa greater than one. We observe the MSA effect for all considered dataset–architecture pairs, for a wide range of budgets. For example, consider WideResNet with tuned hyperparameters on CIFAR-100 and a memory budget equivalent to one standard model, WideResNet-28-10. The red line, corresponding to this budget, has an optimum at abscissa 16. That is, sixteen WideResNets with a width factor smaller than the standard one (the number of parameters in a network is a quadratic function of the width factor) perform significantly better than one standard WideResNet-28-10 (82.52% test accuracy vs. 80.60%).
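Programmatically, the MSA criterion is just an argmax over the measured quality line for one budget. A minimal sketch; in the usage below, the two WideResNet-28-10/CIFAR-100 endpoint accuracies (80.60 at ensemble size 1, 82.52 at 16) are quoted from the text, while any intermediate values would be illustrative.

```python
# Sketch of the MSA criterion: the effect holds for a budget if the best
# measured test quality over ensemble sizes is achieved at k > 1.

def optimal_split(quality_by_k):
    """Return (k*, quality) maximizing test quality over ensemble sizes,
    given a dict mapping ensemble size k -> measured test quality."""
    best_k = max(quality_by_k, key=quality_by_k.get)
    return best_k, quality_by_k[best_k]

def msa_holds(quality_by_k):
    """True iff the optimal memory split uses more than one network."""
    return optimal_split(quality_by_k)[0] > 1
```

For instance, a quality line with 80.60 at k = 1 and 82.52 at k = 16 yields an optimal split of 16 networks, so the MSA effect holds for that budget.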
The MSA effect holds in both designed settings. The results in setting A are smoother in the sense that there is no additional noise caused by hyperparameter tuning, but the quality in setting A is lower due to the absence of proper regularization. In setting B, hyperparameter optimization is approximate, and for some model sizes the found hyperparameters may be more suitable than for others. To make the hyperparameter search fairer, we used the same uniform grid for all model sizes. We present additional experiments on the hyperparameter search below.
RQ2: For which budgets does the MSA effect hold?
The MSA effect holds for all considered budgets except the several smallest ones. Notably, for each considered architecture, the MSA effect holds for the standard budget, shown as the red line on all plots. This means that the widely used configurations of popular architectures are not optimal, and a simple technique, memory splitting, can significantly improve their quality while retaining the same number of parameters. In the majority of cases, the MSA effect also holds for budgets several times smaller than the standard one.
For large budgets, the MSA effect is expected because of the individual quality saturation: if we split the budget, the resulting ensemble easily outperforms the single network, because the quality gap between the member networks and the single large network is negligible. However, for smaller budgets, there is usually a significant quality gap between the single network and the member network of the optimal memory split. Ensembling the member networks pushes the quality of the memory split up enough to cause the MSA effect.
RQ3: What does the optimal split look like for different budgets?
Figure 4 shows optimal memory splits for VGG and WideResNet on CIFAR-100 for different memory budgets. The results for other dataset–architecture pairs look similar and are presented in Appendix C. We note that the depicted values of the optimal ensemble size and member network size are approximate, due to the discreteness of the ensemble size grid and the variance in accuracy of neural networks trained from different random seeds.
For all dataset–architecture pairs, the optimal memory split grows with increasing memory budget, both in terms of ensemble size and member network size. Hence, there is no single globally optimal ensemble size or member network size. For small budgets, the optimal decision is not to split the memory, and therefore the optimal ensemble size does not grow (see small budgets for VGG with hyperparameter tuning). For large budgets, the quality of one network saturates, and therefore the optimal member network size saturates too (see large budgets for all tasks).
In the case of VGG and WideResNet, hyperparameter tuning mostly consists of choosing the right regularization and therefore primarily improves large networks. As a result, optimal memory splits in the setting with hyperparameter tuning contain fewer networks of larger size.
To compare optimal memory splits between different architectures, we depict the results for VGG and WideResNet on CIFAR-100 with hyperparameter tuning together in one plot (figure 4, right). We align the budgets by measuring them in numbers of parameters. While achieving much higher quality than VGG, WideResNet also uses parameters more efficiently inside one network. Hence, for WideResNet, the optimal memory split consists of smaller networks, and the optimal ensemble size becomes greater than one starting from much smaller budgets than for VGG.
Tuning hyperparameters with Bayesian optimization.
In order to further verify that the MSA effect holds in the regularized setting, we conduct additional experiments with more careful hyperparameter tuning. We use Bayesian optimization (BO) (Snoek et al., 2012) to find hyperparameters better than the ones selected by grid search.
For different model sizes of WideResNet on CIFAR-100, we choose the best learning rate, weight decay, and dropout rate on the validation set using BO with 20 iterations. We use an open-source implementation of BO (https://github.com/fmfn/BayesianOptimization). BO takes more time than grid search, so we omit the largest budgets in this experiment.
The results are given in figure 5. The MSA effect holds for both hyperparameter tuning methods, even though the optimization methods differ considerably: grid search explores the hyperparameter space on a regular grid, while BO combines random search with a focused search within a narrow region around the best discovered points.
Hyperparameters found by BO in most cases result in higher test accuracies of single models than hyperparameters found by grid search. This is visible at abscissa 1 in figure 5. However, the test accuracies of memory splits are not always higher for BO than for grid search. This is because the optimal hyperparameters for an ensemble may differ from the optimal hyperparameters for the single model. Optimizing hyperparameters for the ensemble directly is extremely expensive, but potentially it could make memory splits even stronger compared to single models. In total, the maximum achieved test accuracy for the largest considered budget is higher for BO (82.75%) than for grid search (82.52%).
MSA effect for uncertainty estimation
Ensembles of deep neural networks are often used for uncertainty estimation: probability estimates produced by ensembles are more accurate and reliable than those produced by a single network (Lakshminarayanan et al., 2017), in both the in-domain (Ashukha et al., 2020) and out-of-domain (Snoek et al., 2019) settings.
However, this comparison is usually performed in a fixed-width setting, and hence with unequal numbers of parameters. We show that ensembles outperform the single network in in-domain uncertainty estimation when the total numbers of parameters in both models are fixed and approximately equal.
We use the calibrated test negative log-likelihood (NLL) to measure the quality of in-domain uncertainty estimation. Ashukha et al. (2020) highlight the importance of temperature scaling for both single networks and ensembles, since comparing log-likelihoods without temperature scaling may lead to arbitrary results. We measure the calibrated test NLL for the image classification models trained in the main experiments and discussed in paragraphs RQ1–RQ3. We show the results for WideResNet with tuned hyperparameters in figure 6; more results may be found in Appendix D.
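Temperature scaling and calibrated NLL can be sketched as follows. This is a minimal illustration of the general recipe (not the paper's code): pick the temperature minimizing validation NLL of the predicted probabilities, then report test NLL at that temperature. Since probabilities coming from a softmax satisfy p_i^(1/T) ∝ exp(z_i/T), scaling in probability space and renormalizing is equivalent to scaling the logits.

```python
import math

def nll(probs, labels, temperature=1.0, eps=1e-12):
    """Average negative log-likelihood after temperature scaling.

    probs: list of per-class probability lists; labels: true class indices.
    Scaling p_i -> p_i^(1/T) (renormalized) is equivalent to softmax(z/T)
    when probs came from a softmax over logits z.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        scaled = [max(pi, eps) ** (1.0 / temperature) for pi in p]
        z = sum(scaled)
        total += -math.log(scaled[y] / z)
    return total / len(labels)

def calibrate_temperature(probs, labels, grid=None):
    """Grid-search the temperature minimizing (validation) NLL."""
    grid = grid or [0.25 * i for i in range(1, 21)]  # 0.25 .. 5.0
    return min(grid, key=lambda t: nll(probs, labels, t))
```

In practice, the temperature is chosen on the validation set and the resulting calibrated NLL is reported on the test set, for the single network and for every memory split alike.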
The MSA effect holds for the calibrated test NLL in all the same cases (architecture–dataset–budget triplets) as for the test accuracy. The number of networks in the memory split that is optimal w.r.t. the calibrated test NLL is usually equal to or slightly larger than the one optimal w.r.t. test accuracy.
In this work, we introduce the MSA effect and the resulting simple method for improving neural network performance in a limited memory setting. Investigating the MSA effect for various datasets, architectures, and budgets, we find that the effect holds even for small configurations of popular architectures, and that the larger the configuration, the larger the ensemble size and member network size corresponding to the optimal memory split. Finding the optimal split without computing the full memory split plot is an interesting direction for future work.
We would like to thank Dmitry Molchanov for the valuable feedback. Results for convolutional neural networks were supported by the Russian Science Foundation grant №19-71-30020. Results for Transformer were supported by Samsung Research, Samsung Electronics. This research was supported in part through computational resources of HPC facilities at NRU HSE.
References
- Ashukha et al. (2020). Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations.
- Bapna et al. (2018). Training deeper neural machine translation models with transparent attention. In Conference on Empirical Methods in Natural Language Processing.
- Belkin et al. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.
- Beluch et al. (2018). The power of ensembles for active learning in image classification.
- Cettolo et al. (2014). Report on the 11th IWSLT evaluation campaign. In IWSLT.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Gao et al. (2019). Intra-ensemble in neural networks. arXiv preprint arXiv:1904.04466.
- Garipov et al. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems.
- Han et al. (2016). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations.
- Hansen and Salamon (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12.
- Hastie et al. (2004). The elements of statistical learning: data mining, inference, and prediction. Math. Intell. 27.
- He et al. (2015). Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition.
- Hinton et al. (2015). Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop at the Neural Information Processing Systems.
- Huang et al. (2016). Deep networks with stochastic depth. In European Conference on Computer Vision.
- Huang et al. (2018). GPipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Krizhevsky et al. (a). CIFAR-10 (Canadian Institute for Advanced Research).
- Krizhevsky et al. (b). CIFAR-100 (Canadian Institute for Advanced Research).
- Krogh and Vedelsby (1994). Neural network ensembles, cross validation and active learning. In International Conference on Neural Information Processing Systems.
- Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems.
- Lee et al. (2015). Why M heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314.
- Nakkiran et al. (2019). Deep double descent: where bigger models and more data hurt. arXiv preprint arXiv:1912.02292.
- Novikov et al. (2015). Tensorizing neural networks. In International Conference on Neural Information Processing Systems.
- Ott et al. (2019). fairseq: a fast, extensible toolkit for sequence modeling. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Demonstrations.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics.
- Pham et al. (2019). Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.
- Radford et al. (2018). Language models are unsupervised multitask learners.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Annual Meeting of the Association for Computational Linguistics.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Snoek et al. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems.
- Snoek et al. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems.
- Szegedy et al. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Workshop of the International Conference on Learning Representations.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning.
- Tumer and Ghosh (1996). Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
- Wang et al. (2019). Learning deep transformer models for machine translation. In Annual Meeting of the Association for Computational Linguistics.
- Wu et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xie et al. (2017). Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition.
- Zagoruyko and Komodakis (2016). Wide residual networks. In British Machine Vision Conference.
- Zhang et al. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
Appendix A Experimental details
A.1 Details for CNNs
Data. We conduct experiments on the CIFAR-100 and CIFAR-10 datasets, each containing 50000 training and 10000 test examples. For tuning hyperparameters, we randomly select 5000 training examples as a validation set. After choosing the optimal hyperparameters, we retrain the models on the full training dataset. We use a standard data augmentation scheme following Garipov et al. (2018): zero-padding with 4 pixels on each side, random cropping to produce 32×32 images, and horizontal mirroring with probability 0.5. In the experiments without hyperparameter tuning, we do not use data augmentation.
Training. Following Garipov et al. (2018), we train all models for 200 epochs using SGD with momentum and the following learning rate schedule: constant (100 epochs), linear annealing (80 epochs), constant (20 epochs). The final learning rate is 100 times smaller than the initial one. We use a batch size of 128.
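The schedule above can be sketched as a piecewise function of the epoch. The base learning rate of 0.1 in the sketch is an assumed placeholder (the actual initial rates are the hyperparameters discussed in section 4); the shape of the schedule follows the description above.

```python
# Sketch of the schedule: constant for the first 100 epochs, linearly
# annealed over the next 80 epochs to 1/100 of the initial value, then
# constant for the last 20 epochs (200 epochs total).
# base_lr=0.1 is an illustrative placeholder, not the paper's exact value.

def learning_rate(epoch, base_lr=0.1):
    """Piecewise schedule: constant (100) - linear anneal (80) - constant (20)."""
    final_lr = base_lr / 100.0
    if epoch < 100:
        return base_lr
    if epoch < 180:
        frac = (epoch - 100) / 80.0  # 0 at epoch 100, 1 at epoch 180
        return base_lr + frac * (final_lr - base_lr)
    return final_lr
```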
Testing. We measure the test accuracy and the calibrated negative log-likelihood (NLL) of each ensemble or individual model.
Bayesian optimization. For Bayesian optimization, we search over ranges of the weight decay, the dropout rate, and the learning rate (with separate learning rate ranges for VGG and WideResNet).
A.2 Details for Transformers
Data. We conduct experiments on the IWSLT’14 German-English (De-En) and English-German (En-De) translation tasks (Cettolo et al., 2014) that contain 160K training sentences and 7K validation sentences randomly sampled from the training data. We test on the concatenation of tst2010, tst2011, tst2012, tst2013 and dev2010. We preprocess the data using byte pair encoding (BPE) (Sennrich et al., 2016).
A.3 Details of the memory splitting procedure
For a budget B (a number of parameters), we train several memory splits, each containing k networks of size B/k, with k on a logarithmic grid. The number of parameters in a network is a quadratic function of the width factor. To obtain a network of size B/k, we solve this quadratic equation with respect to the width factor and then round the width factor to the nearest integer. The difference in the number of parameters between the resulting memory split and the budget B is negligible compared to B.
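The width-factor selection can be sketched as solving the quadratic for its positive root. The coefficients a, b, c are architecture-specific and the values in the usage below are made up for illustration.

```python
import math

# Sketch of the width-factor selection described above, assuming the
# parameter count of a network is p(w) = a*w^2 + b*w + c for some
# architecture-specific coefficients.

def width_factor_for_size(target_params, a, b, c=0.0):
    """Solve a*w^2 + b*w + c = target_params for w > 0, rounded to an integer."""
    disc = b * b - 4.0 * a * (c - target_params)
    w = (-b + math.sqrt(disc)) / (2.0 * a)
    return max(1, round(w))
```

For example, with illustrative coefficients a = 100, b = 0, a target of one million parameters gives a width factor of 100; rounding the root to the nearest integer is what makes the split's size only approximately equal to B/k.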
To make predictions with an ensemble, we average the predictions of the individual networks after the softmax, i.e., we average the discrete distributions over classes.
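The prediction rule above, sketched for a single example (the class dimension and logit values in the usage are illustrative):

```python
import math

# Sketch of the ensemble prediction rule: average the per-network class
# distributions (softmax outputs), then predict the argmax class.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ensemble_predict(all_logits):
    """all_logits: list over networks of per-class logits for one example.
    Returns (predicted class, averaged class distribution)."""
    probs = [softmax(l) for l in all_logits]
    n = len(probs)
    avg = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    return avg.index(max(avg)), avg
```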
A.4 Computing infrastructure
Experiments were conducted on NVIDIA Tesla V100, Tesla P40, and Tesla P100 GPUs.
Appendix B Individual network quality
Figure 7 shows the test quality of an individual network for different model sizes. We present results for the settings with and without hyperparameter tuning for each network size. The results confirm the generally accepted view that increasing the network size leads to higher quality.
Appendix C Optimal memory splits
Optimal memory splits for VGG/WideResNet on CIFAR-10 and for the Transformer on IWSLT'14 De-En for different memory budgets are shown in figures 8 and 9, respectively. The general pattern for these dataset–architecture pairs looks very similar to the results on CIFAR-100: with increasing memory budget, the optimal memory split grows both in terms of ensemble size and member network size.
In the case of VGG and WideResNet on CIFAR-10, similarly to CIFAR-100, the optimal memory splits in the setting with hyperparameter tuning contain fewer networks of larger size. However, in the case of the Transformer, the comparison is less clear-cut. The reason is that, for the Transformer, hyperparameter tuning not only makes large networks better by regularizing them, but also makes smaller networks better by employing higher learning rates.
The comparison of optimal memory splits between VGG and WideResNet on CIFAR-10 generally looks similar to the results on CIFAR-100, but a bit noisier (figure 8, right).
Appendix D MSA effect for uncertainty estimation
Memory split plots for the calibrated test NLL for VGG with tuned hyperparameters are shown in figure 10. The results generally look similar to the case of WideResNet: the MSA effect holds for the calibrated test NLL in all the same cases as for the test accuracy.