Neural network ensembles are a popular technique for improving a model’s evaluation metrics with minimal effort. The most common approach in the current literature trains the same neural architecture on the same dataset with different random initializations and averages the output activations of the resulting models. This is known as ensemble averaging, a simple type of committee machine. For instance, for image classification on the ImageNet dataset, one can typically expect a 1-2% top-1 accuracy improvement when ensembling two models this way, as demonstrated by AlexNet [6]. Evidence suggests that averaging ensembles works because the models make errors that are partly independent of one another, owing to the high variance inherent in neural networks with millions of parameters [3, 9, 2].
For ensembles with more than two models, accuracy can increase further, but with diminishing returns. As such, this technique is typically used in the final stages of model tuning on the largest available model architectures to slightly increase the best evaluation metrics. However, this method can be regarded as impractical for production use-cases that are under latency and size constraints, as it greatly increases computational cost for a modest reduction in error.
One may expect that increasing the number of parameters in a single network should result in higher evaluation performance than an ensemble of the same number of parameters or FLOPs, at least for models that do not overfit too heavily. After all, the ensemble network will have less connectivity than the corresponding single network. But we show cases where there is evidence to the contrary.
In this paper, we show that we can consistently find averaged ensembles of networks with fewer FLOPs and yet higher accuracy than single models with the same underlying architecture. This is true even for families of networks that are highly optimized in terms of their accuracy-to-FLOPs ratio. We also show how this gap widens as the number of parameters and FLOPs increases. We demonstrate this trend with a family of ResNets on CIFAR-10 [13] and EfficientNets on ImageNet [12].
This finding implies that a large model, especially one so large that it begins to overfit the dataset, can be replaced with an ensemble of smaller versions of the same model for both higher accuracy and fewer FLOPs. This can result in faster training and inference with minimal changes to an existing model architecture. As an added benefit, the individual models in the ensemble can be distributed across multiple workers, which can speed up inference further and potentially ease the design of specialized hardware.
Lastly, we experiment with this finding by varying the architectures of the models in ensemble averaging using neural architecture search to study if it can learn more diverse information associated with each model architecture. Our experiments show that, surprisingly, we are unable to improve over the baseline approach of duplicating the same architecture in the ensemble in this manner. Several factors could be attributed to this, including the choice of search space, architectural features, and reward function. With this in mind, either more advanced methods are necessary to provide gains based on architecture, or it is the case that finding optimal single models would be more suitable for reducing errors and FLOPs than searching for different architectures in one ensemble.
2 Approaches and Experiments
For our experiments, we train and evaluate convolutional neural networks for image classification at various model sizes and ensemble them. When ensembling, we train the same model architecture independently with random initializations, produce softmax predictions from each model, and calculate a geometric mean across the model predictions. (Since the softmax applies a transformation in log-space, a geometric mean respects the relationship. We notice slightly improved ensemble accuracy when compared to an arithmetic mean.) For $n$ models with softmax prediction vectors $y_1, \dots, y_n$, we ensemble them by
$$\hat{y} = \left( y_1 \cdot y_2 \cdots y_n \right)^{1/n}$$
where the multiplication is element-wise for each prediction vector.
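A minimal NumPy sketch of this ensembling step follows. It assumes the softmax outputs of all models are stacked into one array; the function name, the `eps` guard, and the final renormalization are illustrative choices rather than details taken from the paper. The mean is computed in log-space, which is equivalent to taking the n-th root of the element-wise product but is less prone to underflow.

```python
import numpy as np


def geometric_mean_ensemble(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine per-model softmax outputs with an element-wise geometric mean.

    probs has shape (n_models, batch_size, n_classes).
    """
    # The n-th root of a product equals exp(mean(log(.))).
    log_mean = np.mean(np.log(probs + eps), axis=0)
    ensembled = np.exp(log_mean)
    # Renormalizing is an added assumption; it does not change the argmax.
    return ensembled / ensembled.sum(axis=-1, keepdims=True)


# Two hypothetical models, one example, three classes.
probs = np.array([
    [[0.7, 0.2, 0.1]],
    [[0.5, 0.4, 0.1]],
])
prediction = geometric_mean_ensemble(probs)
predicted_class = prediction.argmax(axis=-1)
```

Because renormalization rescales every class by the same factor, the predicted class is the same with or without it; it only matters if the ensembled output is later treated as a probability distribution.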
We split our evaluation into two main experiments and a third follow-up experiment.
2.1 Image Classification on CIFAR-10
For the first experiment, we train wide residual networks (Wide ResNets) on the CIFAR-10 dataset [13, 5]. We train and evaluate the Wide ResNets at various width and depth scales to examine the relationship between classification accuracy and FLOPs, and compare them with the ensembled versions of each of those models. We train 8 models for each scale and ensemble them as described. We select a depth parameter of 16, increase the model width scale, and provide the corresponding FLOPs on images with a 32x32 resolution. We use a standard training setup for each model as outlined in [13].
Note that we use smaller models than typically used (e.g., Wide ResNet 28-10) to show that our findings can work on smaller models that are less prone to overfitting.
2.2 Image Classification on ImageNet
To further show that the ensemble behavior described above scales to larger datasets and more sophisticated models, we run a similar experiment using EfficientNets on ImageNet [12, 10]. EfficientNet provides a family of models using compound scaling on the network width, network depth, and image resolution, producing models from b0 to b7. We adopt the first five of these for our experiments, training and ensembling up to three instances of the same model architecture on ImageNet and evaluating on the validation set. We use the original training code and hyperparameters as provided by [12] for each model size with no additional modifications.
3 Results
In this section, we plot the relationship between accuracy and FLOPs for each ensembled model. For single models that are not ensembled, we plot the median accuracy. We observe that the standard deviation of the evaluation accuracy of each model architecture size never exceeds 0.1%, so we exclude it from the results for readability. For models that are ensembled, we vary the number of trained models and choose the models randomly.
For the first experiment on CIFAR-10, Figure 1 plots a comparison of Wide ResNets with a depth parameter of 16 and varying width scales. For clarity of presentation, we show a smaller subset of all the networks we trained. For each network (e.g., “Wide ResNet 16-8”, which stands for a depth parameter of 16 and a width scale of 8), we vary the number of models in an ensemble and label it alongside the curve.
In the second experiment on ImageNet, Figure 2 plots a comparison of EfficientNets b0 to b5. Notably, we re-train all models using the current official EfficientNet code (available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). Unlike the original paper, which uses AutoAugment, we do not use any specialized augmentation such as AutoAugment or RandAugment, so that we can better observe the effects of overfitting.
First, and unsurprisingly, accuracy increases across the board as the number of FLOPs increases for a single model. This is also true of the ensembles, which essentially multiply the base FLOPs by $n$ for $n$ models.
What is more interesting is that the results show that there can be cases where ensembles of models with fewer collective FLOPs can achieve higher accuracy than a single larger model. This is indicated by points that are above and to the left of other points. For instance, an ensemble of eight Wide ResNet 16-2 models achieves the same accuracy of 95% as a much wider Wide ResNet 16-8 at a fraction of the FLOPs (80M vs. 150M). An added benefit is that ensembles can easily be distributed to multiple workers to speed up computation even more.
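The comparison above can be sketched in a few lines of Python. The helper names are hypothetical, and the per-model cost of roughly 10M FLOPs for Wide ResNet 16-2 is inferred from the quoted 80M total for eight models rather than stated in the text. A point "above and to the left" of another is one that is no worse on both axes and strictly better on at least one.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    flops: float     # total FLOPs per inference
    accuracy: float  # top-1 accuracy in [0, 1]


def ensemble_point(name: str, base_flops: float, n_models: int, accuracy: float) -> Point:
    """An ensemble of n identical models costs n times the base FLOPs."""
    return Point(f"{n_models}x {name}", base_flops * n_models, accuracy)


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.flops <= b.flops and a.accuracy >= b.accuracy
    strictly_better = a.flops < b.flops or a.accuracy > b.accuracy
    return no_worse and strictly_better


# 80M total for eight models implies about 10M per model (an inference).
small_ensemble = ensemble_point("Wide ResNet 16-2", 10e6, 8, 0.95)
large_single = Point("Wide ResNet 16-8", 150e6, 0.95)
ensemble_wins = dominates(small_ensemble, large_single)
```

Here the ensemble matches the single model's accuracy at lower total FLOPs, so it dominates under this definition.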
Increasing the number of models in an ensemble eventually yields diminishing returns, resulting in a crossover point where an ensemble of the next largest model provides a better trade-off in terms of accuracy to FLOPs. On CIFAR-10, we observe that the optimal ensemble size is 2-4 models, after which the accuracy improvement slows down.
Finally, an interesting trend is that ensembles of smaller models have a harder time improving over larger single models. But as the models become larger, and thus increasingly likely to be over-parameterized and to overfit the dataset, ensembling provides a bigger accuracy boost, even over larger single models. For instance, the ensembles of EfficientNet-b0 do not come close to the accuracy-to-FLOPs trade-off of EfficientNet-b1. However, as the models become increasingly large, the ensemble of two EfficientNet-b3 models achieves higher accuracy with fewer FLOPs than EfficientNet-b4, and hence a better trade-off.
Despite EfficientNet’s scaling ability producing highly optimized models, we can still see gaps in performance where ensembles perform better under the same number of total FLOPs, especially as the model size grows from b3 onwards. In other words, in application scenarios that permit ensembling, it offers an alternative and more effective scaling method than the compound scaling in EfficientNet.
5 Neural Architecture Search (NAS) for Diverse Ensembles
Having noted the observations above, we hypothesize that ensembles can be improved further by varying the architectures of each model in an ensemble rather than duplicating the same architecture. The idea is that different architectures will naturally provide alternative features and therefore may enhance ensemble diversity. This should, in turn, provide improved accuracy at no increase to the number of FLOPs.
5.1 NAS Experiment Setup
To test this hypothesis, we adopt the same NAS framework as MnasNet [11]. We use a search space predicting model depth, width, and convolution type. We also augment the search space to include varying input resolution scales. As a result, each model contributes its own set of hyperparameters to search. Additionally, we expand this to a joint search space for an ensemble of $n$ models by replicating the per-model search space $n$ times, once for each model, so that the total number of hyperparameters grows linearly with the ensemble size. Each model is trained individually and ensembled as described in earlier experiments.
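To illustrate how the joint search space is built, the sketch below replicates a per-model space once per ensemble member. The dimension names follow the text (depth, width, convolution type, input resolution), but the candidate values and the flat dictionary layout are hypothetical simplifications; the actual MnasNet search space is factorized per block and is not reproduced here.

```python
from typing import Dict, List

# Hypothetical per-model choices: the source names these dimensions
# but does not list the candidate values.
PER_MODEL_SPACE: Dict[str, List] = {
    "depth": [1, 2, 3],
    "width": [0.75, 1.0, 1.25],
    "conv_type": ["conv", "depthwise_separable", "inverted_bottleneck"],
    "resolution": [160, 192, 224],
}


def joint_search_space(per_model: Dict[str, List], n_models: int) -> Dict[str, List]:
    """Replicate the per-model space once for each ensemble member."""
    return {
        f"model_{i}/{name}": list(choices)
        for i in range(n_models)
        for name, choices in per_model.items()
    }


joint = joint_search_space(PER_MODEL_SPACE, n_models=3)
# The number of searched hyperparameters grows linearly with ensemble size.
num_hyperparameters = len(joint)
```

Each member gets an independent copy of every hyperparameter, so the controller is free to pick a different architecture for each model in the ensemble.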
We alter the reward function so that it is penalized not by the total latency of the ensemble but by the maximum latency among the models in the ensemble, and we simulate this latency on a Pixel Phone 1. Assuming that each model can be run in parallel on separate workers, this pushes the search to optimize the largest model in the ensemble at any given point and reduces the likelihood of producing ensembles where one model is large and the rest are anemic. Lastly, we train each searched model for 10 epochs before evaluating the accuracy, which is part of the reward, on a held-out set.
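A hedged sketch of such a reward is shown below. It assumes the MnasNet-style soft-constraint form, accuracy multiplied by (latency / target) raised to a negative exponent, with the latency term replaced by the maximum over ensemble members. The function name, the exponent of -0.07, and the example latencies are illustrative; the source does not state the values used in its search.

```python
from typing import Sequence


def ensemble_reward(
    accuracy: float,
    latencies_ms: Sequence[float],
    target_ms: float,
    w: float = -0.07,  # assumed exponent; not stated in the source
) -> float:
    """Reward penalized by the slowest ensemble member, not the summed latency."""
    # With one worker per model, wall-clock latency is set by the slowest model.
    max_latency = max(latencies_ms)
    return accuracy * (max_latency / target_ms) ** w


# Hypothetical three-model ensemble whose slowest member is exactly on target.
on_target = ensemble_reward(0.75, [50.0, 80.0, 60.0], target_ms=80.0)
# The same accuracy with one oversized member is penalized.
oversized = ensemble_reward(0.75, [50.0, 160.0, 60.0], target_ms=80.0)
```

Under this form, shrinking the smaller members does not raise the reward, while growing any one member past the others lowers it, which is the incentive against lopsided ensembles described above.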
5.2 NAS Results
We show the Pareto curves of ensemble accuracy with respect to maximum model latency for ensembles of size one, two, and three in Figure 3. This plot demonstrates the inherent trade-off between model accuracy and computation speed, with the best models lying on the outer edge of the point cloud.
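The outer edge (skyline) of such a point cloud can be extracted with a single sorted pass, as in the sketch below. The function name and the sample points are hypothetical and are not values from Figure 3.

```python
from typing import List, Tuple


def skyline(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the Pareto-optimal (latency, accuracy) points, sorted by latency.

    A point is kept only if no other point has lower-or-equal latency
    and higher accuracy.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by latency; at equal latency, visit the most accurate point first.
    for latency, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((latency, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical (max latency in ms, accuracy) pairs for searched ensembles.
candidates = [(50, 0.70), (60, 0.68), (70, 0.74), (70, 0.72), (90, 0.73)]
frontier = skyline(candidates)
```

Sorting first means each candidate only needs to be compared against the best accuracy seen so far, so the pass is linear after the sort.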
Results show that one-, two-, and three-model ensembles are surprisingly close to one another. The skyline two-model ensembles tend to beat out single models, but only by 1% at best. Skyline three-model ensembles show nearly identical performance to single models. We see that the median model accuracy does increase as the ensemble size grows, but at the cost of increased maximum latency.
Out of the searched diverse models, we pick the most promising candidates for a target latency. When trained to convergence, we find that two-model and three-model ensembles perform just as well as single models (assuming roughly equal max image latency). Somewhat frustratingly, we find that simply duplicating the best single model for a given latency target and ensembling the copies provides the best improvement in accuracy.
This experiment presents evidence that ensembles benefit the most from choosing the most accurate models rather than models that are architecturally diverse, at least under our current NAS context. For a fixed computational budget, this corresponds to using the best model architecture across the ensemble. We of course caution that we have only tested this with a simple NAS setup on a single large image classification dataset. This could change with a noisier and smaller dataset, or with more stringent constraints on model losses, regularization, or architectural mechanisms.
6 Related Work
Model ensembling has a long history with many different proposed techniques. Most work in this area predates the popularization of deep learning. For instance, [7] define different subsets of the training data and use cross-validation to divide data into different groups. [1] developed bagging, where a different training set is given to each model to promote diversified feature learning. And [8] is one of the earliest attempts at constructing ensembles with different models by changing the number of hidden nodes in each network.
7 Conclusion
We have demonstrated how averaging ensembles can achieve higher accuracy with fewer FLOPs than popular single models on image classification. This provides an interesting insight: ensembles of smaller models can match or improve on the accuracy-to-efficiency trade-offs of larger models. We advocate further investigation of the trade-offs of ensembling, especially for applications where distributed inference is plausible.
References
[1] (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §6.
[2] (2016) Deep learning. MIT press. Cited by: §1.
[3] (1990) Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence 12 (10), pp. 993–1001. Cited by: §1.
[4] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
[5] (2009) Learning multiple layers of features from tiny images. Cited by: §2.1.
[6] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
[7] Neural network ensembles, cross validation, and active learning. In Advances in neural information processing systems, pp. 231–238. Cited by: §6.
[8] (1996) Engineering multiversion neural-net systems. Neural computation 8 (4), pp. 869–893. Cited by: §6.
[9] (1992) When networks disagree: ensemble methods for hybrid neural networks. Technical report, Brown Univ Providence RI Inst for Brain and Neural Systems. Cited by: §1.
[10] Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §2.2.
[11] Mnasnet: platform-aware neural architecture search for mobile. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §5.1.
[12] (2019) Efficientnet: rethinking model scaling for convolutional neural networks. ICML 2019. arXiv preprint arXiv:1905.11946. Cited by: §1, §2.2.
[13] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §1, §2.1.