1 Introduction
Overparametrized deep models can memorize datasets with entirely randomized labels [ZBH16]. It is consequently not entirely clear why such extremely flexible models, trained with algorithms as simple as stochastic gradient descent, are able to generalize well to unseen data, although a lot of progress on these questions has recently been reported [DR17, JGH18, BM19, MMN18, RVE18, GML18]. The high capacity of neural network models, and their ability to easily overfit complex datasets, makes them especially vulnerable to calibration issues. In many situations, standard deep-learning approaches are known to produce probabilistic forecasts that are overconfident [GPSW17]. In this text, we consider the regime where the training sets are very small, which typically amplifies these issues. This can lead to problematic behaviours when deep neural networks are deployed in scenarios where a proper quantification of the uncertainty is necessary. Indeed, a host of methods [LPB17, MGI19, SHK14, GG16, Pre98] have been proposed to mitigate these calibration issues, even though no gold standard has so far emerged. Many different forms of regularization techniques [PW17, ZBH16, ZH05] have been shown to reduce overfitting in deep neural networks. Importantly, practical implementations and approximations of Bayesian methodologies [MGI19, WHSX16, BCKW15, Gra11, LW16, RMW14, Mac92] have demonstrated their worth in several settings, although some of these techniques are not entirely straightforward to implement in practice. Ensembling approaches such as dropout [GG16] have been widely adopted, largely due to their ease of implementation. In this text, we investigate the practical use of Deep Ensembles [LPB17, BC17, LPC15, SLJ15, FHL19, GPSW17], a straightforward approach that displays state-of-the-art performance in most regimes. Although deep ensembles can be difficult to implement when training datasets are large (but calibration issues are less pronounced in this regime), the focus of this text is the data-scarce regime where the computational burden associated with deep ensembles is not a significant problem.

Contributions: we study the interaction between three of the simplest and most widely used methods for scaling deep learning to the low-data regime: ensembling, temperature scaling, and mixup data-augmentation.

Despite the general belief that averaging models improves calibration properties, we show that, in general, standard ensembling practices do not lead to better-calibrated models. Instead, we show that averaging the predictions of a set of neural networks generally leads to less confident predictions: this is only beneficial in the often-encountered regime where each network is overconfident. Although our results are based on Deep Ensembles, our empirical analysis extends to any class of model averaging, including sampling-based Bayesian Deep Learning.

We empirically demonstrate that networks trained with the mixup data-augmentation scheme, a very common practice in computer vision, are typically underconfident. Consequently, subtle interactions between ensembling techniques and modern data-augmentation pipelines have to be taken into account for proper uncertainty quantification. The typical distributional shift induced by the mixup data-augmentation strategy influences the calibration properties of the resulting trained neural networks.

Post-processing techniques such as temperature scaling can be successfully used in conjunction with deep-ensembling methods, but the order in which the aggregation and the calibration procedures are carried out greatly influences the quality of the resulting uncertainty quantification. These findings lead us to formulate the straightforward Pool-Then-Calibrate strategy for post-processing deep ensembles: (1) in a first stage, separately train the deep models; (2) in a second stage, fit a single temperature parameter for the pooled predictions by minimizing a proper scoring rule (e.g. cross-entropy) on a validation set. In the low-data regime, this simple procedure can halve the Expected Calibration Error (ECE) on a range of benchmark classification problems when compared to standard deep ensembles.
2 Background
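Consider a classification task with $C \geq 2$ possible classes $\mathcal{Y} \equiv \{1, \ldots, C\}$. For a sample $x \in \mathcal{X}$, the quantity $\mathbf{p}(x) \in \Delta_C \equiv \{\mathbf{p} \in \mathbb{R}_+^C : p_1 + \ldots + p_C = 1\}$ represents a probabilistic prediction, often obtained as $\mathbf{p}(x) = \sigma_{\mathrm{SM}}[f_{\mathbf{w}}(x)]$ for a neural network $f_{\mathbf{w}}: \mathcal{X} \to \mathbb{R}^C$ with weight $\mathbf{w}$ and softmax function $\sigma_{\mathrm{SM}}: \mathbb{R}^C \to \Delta_C$. We set $\hat{y}(x) \equiv \arg\max_c p_c(x)$ and $\hat{p}(x) \equiv \max_c p_c(x)$.
Augmentation: Consider a training dataset $\mathcal{D} \equiv \{(x_i, y_i)\}_{i=1}^N$ and denote by $\mathbf{y}_i \in \Delta_C$ the one-hot encoded version of the label $y_i$. A stochastic augmentation process maps a pair $(x, \mathbf{y})$ to another augmented pair $(x_\star, \mathbf{y}_\star)$. In computer vision, standard augmentation strategies include rotations, translations, brightness and contrast manipulations. In this text, in addition to these standard augmentations, we also make use of the more recently proposed mixup augmentation strategy [ZCDL17] that has proven beneficial in several settings. For a pair $(x, \mathbf{y})$, its mixup-augmented version $(x_\star, \mathbf{y}_\star)$ is defined as

$$x_\star = \gamma \, x + (1 - \gamma) \, x_J \quad \text{and} \quad \mathbf{y}_\star = \gamma \, \mathbf{y} + (1 - \gamma) \, \mathbf{y}_J \tag{1}$$

for a random coefficient $\gamma \in (0,1)$ drawn from a fixed mixing distribution often chosen as $\mathrm{Beta}(\alpha, \alpha)$, and a random index $J$ drawn uniformly within $\{1, \ldots, N\}$.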
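As an illustration of the mixup rule (1), the following is a minimal standard-library sketch and not the implementation used for the experiments. The function names `one_hot` and `mixup_pair`, the toy two-dimensional inputs and the value `alpha=0.4` are assumptions made for the example.

```python
import random


def one_hot(label, num_classes):
    """Return the one-hot encoding of an integer label as a list of floats."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def mixup_pair(x, y_onehot, dataset, alpha, rng):
    """Mix (x, y_onehot) with a uniformly chosen pair of the training set.

    dataset: list of (features, one-hot label) pairs.
    alpha:   parameter of the Beta(alpha, alpha) mixing distribution (assumed).
    """
    gamma = rng.betavariate(alpha, alpha)             # mixing coefficient
    x_j, y_j = dataset[rng.randrange(len(dataset))]   # random partner index J
    x_mix = [gamma * a + (1.0 - gamma) * b for a, b in zip(x, x_j)]
    y_mix = [gamma * a + (1.0 - gamma) * b for a, b in zip(y_onehot, y_j)]
    return x_mix, y_mix


rng = random.Random(0)
# Toy dataset: two samples with 2 features and 3 classes (illustrative only).
data = [([0.0, 1.0], one_hot(0, 3)), ([1.0, 0.0], one_hot(2, 3))]
x_mix, y_mix = mixup_pair(data[0][0], data[0][1], data, alpha=0.4, rng=rng)
```

The mixed label `y_mix` is a soft label that still sums to one, which is why training on mixup pairs tends to discourage very confident predictions.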
Model averaging: Ensembling methods leverage a set of models by combining them into an aggregated model. In the context of deep learning, Bayesian averaging consists in weighting the predictions according to the Bayesian posterior $\pi_{\mathrm{post}}(d\mathbf{w})$ on the neural weights. Instead of finding an optimal set of weights by minimizing a loss function, predictions are averaged. Denoting by $\mathbf{p}_{\mathbf{w}}(x)$ the probabilistic prediction associated to sample $x$ and neural weight $\mathbf{w}$, the Bayesian approach advocates to consider

$$\mathbf{p}(x) \equiv \int \mathbf{p}_{\mathbf{w}}(x) \, \pi_{\mathrm{post}}(d\mathbf{w}). \tag{2}$$

Designing sensible prior distributions is still an active area of research, and data-augmentation schemes, crucial in practice, are not entirely straightforward to fit into this framework. Furthermore, the high-dimensional integral (2) is (extremely) intractable: the posterior distribution is multimodal, high-dimensional, concentrated along low-dimensional structures, and any local exploration algorithm (e.g. MCMC, Langevin dynamics and their variations) is bound to only explore a tiny fraction of the state space. Because of the typically large number of symmetries, many of these local modes correspond to essentially similar predictions, indicating that it is likely not necessary to explore all the modes in order to approximate (2). A detailed understanding of the geometric properties of the posterior distribution in Bayesian neural networks is still lacking, although a lot of recent progress has been made. Indeed, variational approximations have been reported to improve, in some settings, over standard empirical risk minimization procedures. Deep ensembles can be understood as crude, but practical, approximations of the integral in Equation (2). The high-dimensional integral can be approximated by a simple non-weighted average over several modes $\mathbf{w}_1^\star, \ldots, \mathbf{w}_K^\star$ of the posterior distribution found by minimizing the negative log-posterior, or some approximations of it, with standard optimization techniques:

$$\mathbf{p}(x) \approx \frac{1}{K} \sum_{k=1}^{K} \mathbf{p}_{\mathbf{w}_k^\star}(x). \tag{3}$$
We refer the interested reader to [Nea12, MMK03, WI20] for different perspectives on Bayesian neural networks. Although simple and not very well understood, deep ensembles have been shown to provide extremely robust uncertainty quantification when compared to more sophisticated approaches [LPB17, BC17, LPC15, SLJ15].
Post-processing Calibration Methods: The article [GPSW17] proposes a class of post-processing calibration methods that extend the more standard Platt Scaling approach [Pla99]. Temperature Scaling, the simplest of these methods, transforms the probabilistic outputs $\mathbf{p}(x)$ into a tempered version $\mathrm{Scale}[\mathbf{p}(x), \tau]$ defined through the scaling function

$$\mathrm{Scale}(\mathbf{p}, \tau) \equiv \sigma_{\mathrm{SM}}(\log \mathbf{p} / \tau) = \frac{(p_1^{1/\tau}, \ldots, p_C^{1/\tau})}{p_1^{1/\tau} + \ldots + p_C^{1/\tau}} \tag{4}$$

for a temperature parameter $\tau > 0$, where $\sigma_{\mathrm{SM}}$ denotes the softmax function.
The optimal parameter $\tau_\star$ is usually found by minimizing a proper scoring rule [GR07], often chosen as the negative log-likelihood, on a validation dataset.
Crucially, during this post-processing step, the parameters of the probabilistic model are kept fixed: the only parameter being optimized is the temperature $\tau$.
In the low-data regime considered in this article, the validation set being also extremely small, we have empirically observed that the more sophisticated Vector and Matrix scaling post-processing calibration methods [GPSW17] do not offer any significant advantage over the simple and robust temperature scaling approach.
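The effect of the temperature can be checked on a small example. The sketch below assumes the convention $\mathrm{Scale}(\mathbf{p}, \tau) = \sigma_{\mathrm{SM}}(\log \mathbf{p} / \tau)$, under which temperatures above one flatten the prediction and temperatures below one sharpen it; the function names and the example vector are illustrative choices.

```python
import math


def temperature_scale(p, tau):
    """Return softmax(log(p) / tau) for a probability vector p."""
    logits = [math.log(max(pc, 1e-12)) / tau for pc in p]
    m = max(logits)                        # subtract the max for stability
    w = [math.exp(v - m) for v in logits]
    z = sum(w)
    return [v / z for v in w]


def entropy(p):
    """Shannon entropy (natural logarithm) of a probability vector."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0.0)


p = [0.7, 0.2, 0.1]
flat = temperature_scale(p, 2.0)    # tau > 1: less confident prediction
sharp = temperature_scale(p, 0.5)   # tau < 1: more confident prediction
```

With this convention, `entropy(flat)` exceeds `entropy(p)`, which in turn exceeds `entropy(sharp)`, and `tau = 1` leaves the prediction unchanged.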
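Calibration Metrics: The Expected Calibration Error (ECE) measures the discrepancy between prediction confidence and empirical accuracy. In this text, we also define the signed Expected Calibration Error (sECE) in order to differentiate underconfidence from overconfidence. For a partition $0 = c_0 < c_1 < \ldots < c_M = 1$ of the unit interval and a labelled set $\{(x_i, y_i)\}_{i=1}^N$, with predicted label $\hat{y}(x_i) = \arg\max_c p_c(x_i)$ and confidence $\hat{p}(x_i) = \max_c p_c(x_i)$, set $B_m = \{i : c_{m-1} < \hat{p}(x_i) \leq c_m\}$ and $\mathrm{acc}_m = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}(x_i) = y_i)$ and $\mathrm{conf}_m = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}(x_i)$. The quantities ECE and sECE are defined as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \, \big| \mathrm{conf}_m - \mathrm{acc}_m \big| \quad \text{and} \quad \mathrm{sECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \, \big( \mathrm{conf}_m - \mathrm{acc}_m \big). \tag{5}$$

A model is calibrated if $\mathrm{acc}_m = \mathrm{conf}_m$ for all $1 \leq m \leq M$, i.e. $\mathrm{ECE} = 0$. A large (resp. low) value of the sECE indicates overconfidence (resp. underconfidence). It is often instructive to display the associated reliability curve, i.e. the curve with $\mathrm{conf}_m$ on the x-axis and the difference $\mathrm{acc}_m - \mathrm{conf}_m$ on the y-axis. Figure 1 displays examples of such reliability curves. The reliability curve of a perfectly calibrated model is flat (i.e. $\mathrm{acc}_m - \mathrm{conf}_m = 0$), while the reliability curve associated to an underconfident (resp. overconfident) model predominantly lies above (resp. below) the flat line $y = 0$. In the sequel, we sometimes report the value of the Brier score [Bri50] defined as $\frac{1}{N} \sum_{i=1}^{N} \| \mathbf{p}(x_i) - \mathbf{y}_i \|_2^2$, where $\mathbf{y}_i$ denotes the one-hot encoding of the label $y_i$.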
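The two calibration metrics can be computed as in the following sketch. It assumes equal-width confidence bins; the number of bins (15 here) and the function name `calibration_errors` are choices made for the example and are not specified in the text.

```python
import math


def calibration_errors(probs, labels, num_bins=15):
    """Return (ECE, sECE) with equal-width confidence bins (c_{m-1}, c_m]."""
    n = len(labels)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)                      # confidence of the prediction
        pred = p.index(conf)               # predicted label
        m = min(num_bins - 1, max(0, math.ceil(conf * num_bins) - 1))
        bins[m].append((conf, 1.0 if pred == y else 0.0))
    ece = sece = 0.0
    for b in bins:
        if not b:
            continue                       # empty bins do not contribute
        conf_m = sum(c for c, _ in b) / len(b)
        acc_m = sum(a for _, a in b) / len(b)
        gap = conf_m - acc_m               # positive when overconfident
        ece += len(b) / n * abs(gap)
        sece += len(b) / n * gap
    return ece, sece


# Overconfident toy model: confidence 0.9 but only 60% accuracy.
over = calibration_errors([[0.9, 0.1]] * 10, [0] * 6 + [1] * 4)
# Underconfident toy model: confidence 0.6 but 100% accuracy.
under = calibration_errors([[0.6, 0.4]] * 10, [0] * 10)
```

In the first toy case both ECE and sECE are close to 0.3; in the second the ECE is close to 0.4 while the sECE is close to -0.4, which is how the signed version separates the two failure modes.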
3 Empirical Observations
Linear pooling: It has been observed in several studies that averaging the probabilistic predictions of a set of independently trained neural networks, i.e. deep ensembles, often leads to more accurate and better-calibrated forecasts [LPB17, BC17, LPC15, SLJ15, FHL19]. Figure 1 displays the reliability curves across three different datasets of a set of independently trained neural networks, as well as the reliability curves of the aggregated forecasts obtained by simply linearly averaging the individual probabilistic predictions. These results suggest that deep ensembles consistently lead to predictions that are less confident than those of their individual constituents. This can indeed be beneficial in the often-encountered situation when each individual neural network is overconfident. Nevertheless, this phenomenon should not be mistaken for an intrinsic property of deep ensembles to lead to better-calibrated forecasts. For example, and as discussed further in Section 4, networks trained with the popular mixup data-augmentation are typically underconfident. Ensembling such a set of individual networks typically leads to predictions that are even more underconfident. In order to gain some insights into this phenomenon, recall the definition of the entropy functional $\mathcal{H}$ on the probability simplex $\Delta_C$,

$$\mathcal{H}(\mathbf{p}) \equiv - \sum_{c=1}^{C} p_c \log p_c. \tag{6}$$

The entropy functional is concave on the probability simplex $\Delta_C$, i.e. $\mathcal{H}(\lambda \, \mathbf{p} + (1 - \lambda) \, \mathbf{q}) \geq \lambda \, \mathcal{H}(\mathbf{p}) + (1 - \lambda) \, \mathcal{H}(\mathbf{q})$ for any $\mathbf{p}, \mathbf{q} \in \Delta_C$ and $\lambda \in [0, 1]$. Furthermore, tempering a probability distribution, i.e. replacing $\mathbf{p}$ by the normalized version of $\mathbf{p}^{1/\tau}$, leads to an increase in entropy if $\tau > 1$, as can be proved by examining the derivative of the function $\tau \mapsto \mathcal{H}[\mathbf{p}^{1/\tau} / \|\mathbf{p}^{1/\tau}\|_1]$. The entropy functional is consequently a natural surrogate measure of (lack of) confidence. The concavity property of the entropy functional shows that ensembling a set of individual networks leads to predictions whose entropies are higher than the average of the entropies of the individual predictions. We have not been able to prove a similar property for the ECE functional.

In order to obtain a more quantitative understanding of this phenomenon, consider a binary classification framework. For a pair of random variables $(X, Y)$, with $X \in \mathcal{X}$ and $Y \in \{0, 1\}$, and a classification rule $p: \mathcal{X} \to [0, 1]$ that approximates the conditional probability $p(x) \approx \mathbb{P}(Y = 1 \mid X = x)$, define the Deviation from Calibration score as

$$\mathrm{DC}(p) \equiv \mathbb{E}\big[ (Y - p(X))^2 - p(X) \, (1 - p(X)) \big]. \tag{7}$$

The term $\mathbb{E}[(Y - p(X))^2]$ is equivalent to the Brier score of the classification rule $p$ and the quantity $\mathbb{E}[p(X) \, (1 - p(X))]$ is an entropic term (i.e. large for predictions close to uniform). Note that DC can take both positive and negative values and $\mathrm{DC}(p) = 0$ for a well-calibrated classification rule, i.e. $p(x) = \mathbb{P}(Y = 1 \mid X = x)$ for all $x \in \mathcal{X}$. Furthermore, among a set of classification rules with the same Brier score, the ones with less confident predictions (i.e. larger entropy) have a lesser DC score. In summary, the DC score is a measure of confidence that vanishes for well-calibrated classification rules, and that is low (resp. high) for underconfident (resp. overconfident) classification rules. Contrary to the entropy functional (6), the DC score is extremely tractable. Since $Y^2 = Y$, the integrand in (7) equals $Y - 2 \, Y \, p(X) - p(X) + 2 \, p(X)^2$, and algebraic manipulations readily show that, for a set of classification rules $p^{(1)}, \ldots, p^{(K)}$ and non-negative weights $\omega_1 + \ldots + \omega_K = 1$, the linearly averaged classification rule $\bar{p} \equiv \sum_{k=1}^{K} \omega_k \, p^{(k)}$ satisfies

$$\mathrm{DC}(\bar{p}) = \sum_{k=1}^{K} \omega_k \, \mathrm{DC}(p^{(k)}) - \sum_{i,j=1}^{K} \omega_i \, \omega_j \, \mathbb{E}\big[ (p^{(i)}(X) - p^{(j)}(X))^2 \big]. \tag{8}$$

Equation (8) shows that averaging classification rules decreases the DC score (i.e. the aggregated estimates are less confident). Furthermore, the more dissimilar the individual classification rules, the larger the decrease. Even if each individual model is well-calibrated, i.e. $\mathrm{DC}(p^{(k)}) = 0$ for $1 \leq k \leq K$, the averaged model is not well-calibrated as soon as at least two of them are not identical.
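The decrease of the DC score under linear averaging can be verified numerically. The sketch below uses the empirical version of the score, $\frac{1}{n} \sum_i [(y_i - p_i)^2 - p_i (1 - p_i)]$, and checks that the score of the pooled rule equals the weighted average of the individual scores minus the weighted sum of the pairwise mean squared differences between rules. The random classification rules and the weights are arbitrary, hypothetical inputs.

```python
import random


def dc_score(preds, ys):
    """Empirical DC score: mean of (y - p)^2 - p * (1 - p)."""
    return sum((y - p) ** 2 - p * (1.0 - p) for p, y in zip(preds, ys)) / len(ys)


rng = random.Random(0)
n, K = 2000, 3
ys = [rng.randint(0, 1) for _ in range(n)]
# Hypothetical classification rules: arbitrary probabilities in (0, 1).
rules = [[rng.random() for _ in range(n)] for _ in range(K)]
weights = [0.5, 0.3, 0.2]                  # non-negative, summing to one

pooled = [sum(w * r[i] for w, r in zip(weights, rules)) for i in range(n)]
lhs = dc_score(pooled, ys)

avg_dc = sum(w * dc_score(r, ys) for w, r in zip(weights, rules))
# Weighted sum over all ordered pairs (i, j) of the mean squared differences.
spread = sum(
    weights[i] * weights[j]
    * sum((a - b) ** 2 for a, b in zip(rules[i], rules[j])) / n
    for i in range(K)
    for j in range(K)
)
rhs = avg_dc - spread
```

Because the labels are binary, the identity holds exactly for the empirical measure, so `lhs` and `rhs` agree up to floating-point error, and `lhs` is strictly smaller than `avg_dc` whenever two rules differ.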
Distance to the training set: in order to gain some additional insights into the calibration properties of neural networks trained on small datasets, as well as the influence of the popular mixup augmentation strategy, we examine several metrics (i.e. signed ECE (sECE), Negative Log-likelihood (NLL), entropy) as a function of the distance to the (small) training set $\mathcal{D}_{\mathrm{train}}$. We focus on the CIFAR10 dataset and train our networks on a balanced subset of $N = 1000$ training examples. Since there is no straightforward and semantically meaningful distance between images, we first use an unsupervised method (i.e. labels were not used) for learning a low-dimensional and semantically meaningful representation of dimension $d$. For these experiments, we obtained a mapping $\Phi: \mathcal{X} \to \mathbb{S}$, where $\mathbb{S}$ denotes the unit sphere in $\mathbb{R}^d$, with the simCLR method of [CKNH20], although experiments with other metric learning approaches [HFW19, YZYC19] have led to essentially similar conclusions. We used the distance $d(x, x') = \| \Phi(x) - \Phi(x') \|_2$, which in this case is equivalent to the cosine distance between the $d$-dimensional representations of the CIFAR10 images $x$ and $x'$. The distance of a test image $x$ to the training dataset is defined as $d(x, \mathcal{D}_{\mathrm{train}}) = \min_{x_i \in \mathcal{D}_{\mathrm{train}}} d(x, x_i)$. We computed the distances to the training set for each image contained in the standard CIFAR10 test set (last column of Figure 2). Not surprisingly, we note that the average entropy, negative log-likelihood and error rate all increase as test samples are chosen further away from the training set.
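The distance to the training set can be sketched as follows, assuming that the embeddings are already available and that the distance to a set is the minimum over its elements. The three-dimensional toy vectors stand in for learned representations and are purely illustrative. For unit vectors the squared Euclidean distance equals $2 - 2 \cos$, which is why the Euclidean and cosine distances induce the same ordering.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_to_training_set(z, train_embeddings):
    """Distance from an embedding to its nearest training embedding."""
    return min(distance(z, t) for t in train_embeddings)


# Toy embeddings (illustrative stand-ins for learned representations).
train = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
test_point = normalize([1.0, 1.0, 0.2])

d = distance_to_training_set(test_point, train)
# Largest cosine similarity between the test point and a training point.
cos_sim = max(sum(a * b for a, b in zip(test_point, t)) for t in train)
```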

Overconfidence: the predictions associated to samples chosen further away from the training set have a higher sECE. This indicates that the overconfidence of the predictions increases with the distance to the training set. In other words, even if the entropy increases as the distance increases (as it should), calibration issues do not vanish as the distance to the training set increases. This phenomenon holds irrespective of the amount of mixup used for training the network.

Effect of mixup augmentation: The first row of Figure 2 shows that increasing the amount of mixup augmentation consistently leads to an increase in entropy, a decrease in overconfidence (i.e. sECE), as well as more accurate predictions (lower NLL and higher accuracy). Additionally, the effect is less pronounced for . This is confirmed in Figure 3, which displays more generally the effect of the mixup augmentation on the reliability curves, over four different datasets.

Temperature Scaling: importantly, the second row of Figure 2 indicates that a post-processing temperature scaling of the individual models almost washes out all the differences due to the mixup augmentation scheme. For this experiment, an ensemble of networks is considered: before averaging the predictions, each network has been individually temperature scaled by fitting a temperature parameter (through negative log-likelihood minimization) on a validation set of size .
4 Calibrating Deep Ensembles
In order to calibrate deep ensembles, several methodologies can be considered:

(A) Do nothing and hope that the averaging process intrinsically leads to better calibration.

(B) Calibrate each individual network before aggregating all the results.

(C) Simultaneously aggregate and calibrate the probabilistic forecasts of each individual model.

(D) Aggregate first the estimates of each individual model before eventually calibrating the pooled estimate.
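Pooling methods: as recognized in the operations research literature [JW08, WGCLJ19], simple pooling/aggregation rules that do not require a large number of tuning parameters are usually preferred, especially when training data is scarce. Simple aggregation rules are usually robust, conceptually easy to understand, and straightforward to implement and optimize. The standard average and median pooling of a set $\mathbf{p}^{1:K} \equiv (\mathbf{p}^1, \ldots, \mathbf{p}^K)$ of probabilistic predictions are defined as

$$\mathrm{Agg}_{\mathrm{avg}}(\mathbf{p}^{1:K}) = \frac{\mathbf{p}^1 + \ldots + \mathbf{p}^K}{K} \quad \text{and} \quad \mathrm{Agg}_{\mathrm{med}}(\mathbf{p}^{1:K}) = \frac{1}{Z} \, \mathrm{median}(\mathbf{p}^{1:K}) \tag{9}$$

for a normalization constant $Z > 0$, the median operation being executed componentwise over the $C$ components. Finally, $\mathrm{Trim}(z_{1:K})$, the trimmed mean [JW08] of $K$ real numbers $z_1, \ldots, z_K$, is obtained by first discarding the $\kappa$ largest and $\kappa$ smallest values before averaging the remaining elements. This means that $\mathrm{Trim}(z_{1:K}) = (z_{\sigma(\kappa+1)} + \ldots + z_{\sigma(K-\kappa)}) / (K - 2\kappa)$ where $\sigma$ is a permutation such that $z_{\sigma(1)} \leq \ldots \leq z_{\sigma(K)}$. The trimmed mean pooling method is consequently defined as

$$\mathrm{Agg}_{\mathrm{trim}}(\mathbf{p}^{1:K}) = \frac{1}{Z} \, \mathrm{Trim}(\mathbf{p}^{1:K}) \tag{10}$$

for a normalization constant $Z > 0$, with the trimmed averaging being executed componentwise.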
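The three pooling rules can be sketched in a few lines. The function names and the trimming level (`trim=1`, i.e. one value removed at each end) are assumptions made for the example; the toy ensemble contains one member that disagrees strongly with the others.

```python
import statistics


def normalize(v):
    """Rescale non-negative scores so that they sum to one."""
    z = sum(v)
    return [a / z for a in v]


def mean_pool(preds):
    """Componentwise average of K probability vectors."""
    return [sum(col) / len(col) for col in zip(*preds)]


def median_pool(preds):
    """Componentwise median, renormalized to a probability vector."""
    return normalize([statistics.median(col) for col in zip(*preds)])


def trimmed_mean_pool(preds, trim=1):
    """Drop the `trim` largest and smallest values per class, then average.

    Requires 2 * trim < K so that at least one value remains per class.
    """
    pooled = []
    for col in zip(*preds):
        kept = sorted(col)[trim:len(col) - trim]
        pooled.append(sum(kept) / len(kept))
    return normalize(pooled)


# Five hypothetical members; the third one is an outlier favouring class 2.
preds = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
]
avg = mean_pool(preds)
med = median_pool(preds)
trimmed = trimmed_mean_pool(preds, trim=1)
```

On this example the outlier pulls the linear average towards the last class, while the median and the trimmed mean are much less affected by it.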
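Pool-Then-Calibrate: any of the above-mentioned aggregation procedures can be used as a pooling strategy before fitting a temperature $\tau_\star$ by minimizing a proper scoring rule on a validation set. In all our experiments, we minimized the negative log-likelihood (i.e. cross-entropy). In other words, given a set $\mathbf{p}^{1:K}$ of probabilistic forecasts, the final prediction is defined as

$$\mathbf{p}_\star \equiv \mathrm{Scale}\big[ \mathrm{Agg}(\mathbf{p}^{1:K}), \tau_\star \big], \tag{11}$$

where $\mathrm{Agg}$ denotes the pooling rule and $\mathrm{Scale}[\cdot, \tau]$ denotes temperature scaling with temperature $\tau$.
Note that the aggregation procedure can be carried out entirely independently from the fitting of the optimal temperature $\tau_\star$.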
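Joint Pool-and-Calibrate: there are several situations where the so-called end-to-end training strategy, which consists in jointly optimizing several components of a composite system, leads to increased performance [MKS15, MPV16, GWR16]. In our setting, this means learning the optimal temperature concurrently with the aggregation procedure. The optimal temperature $\tau_\star$ is found by minimizing a proper scoring rule $\mathrm{Score}(\cdot, \cdot)$ on a validation set $\mathcal{D}_{\mathrm{val}} \equiv \{(x_i, y_i)\}_{i=1}^{N_{\mathrm{val}}}$,

$$\tau_\star = \arg\min_{\tau} \; \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{i \in \mathcal{D}_{\mathrm{val}}} \mathrm{Score}\big( \mathbf{p}_i^{\tau}, y_i \big) \tag{12}$$

where $\mathbf{p}_i^{\tau} \equiv \mathrm{Agg}\big[ \mathrm{Scale}(\mathbf{p}^{1:K}(x_i), \tau) \big]$ denotes the aggregated probabilistic prediction for sample $x_i$, obtained by temperature scaling each individual prediction with the common temperature $\tau$ before pooling.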
In all our experiments, we have found it computationally more efficient and robust to use a simple grid search for finding the optimal temperature; we used temperatures equally spaced on a logarithmic scale in between and .
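The difference between pooling before scaling (group [D]) and scaling with a common temperature before pooling (group [C]) can be sketched with a grid search. This is only an illustration on synthetic predictions: the grid end points (0.1 and 10), the number of grid points, the linear pooling rule and the synthetic ensemble are all assumptions made for the example, and the scaling convention assumed is softmax of the log-probabilities divided by the temperature.

```python
import math
import random


def temperature_scale(p, tau):
    """Return softmax(log(p) / tau) for a probability vector p."""
    logits = [math.log(max(pc, 1e-12)) / tau for pc in p]
    m = max(logits)
    w = [math.exp(v - m) for v in logits]
    z = sum(w)
    return [v / z for v in w]


def mean_pool(preds):
    """Linear pooling: componentwise average of the members' predictions."""
    return [sum(col) / len(col) for col in zip(*preds)]


def nll(probs, labels):
    """Average negative log-likelihood of the labels."""
    return -sum(math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)) / len(labels)


# Assumed grid: 101 temperatures, log-spaced in [0.1, 10]; it contains tau = 1.
GRID = [10.0 ** (-1.0 + 2.0 * i / 100.0) for i in range(101)]


def pool_then_calibrate(ensemble, labels):
    """Group [D]: pool first, then fit one temperature on the pooled forecast."""
    pooled = [mean_pool(preds) for preds in ensemble]

    def score(tau):
        return nll([temperature_scale(p, tau) for p in pooled], labels)

    tau = min(GRID, key=score)
    return tau, score(tau)


def joint_pool_and_calibrate(ensemble, labels):
    """Group [C]: scale every member by a common temperature, then pool."""

    def score(tau):
        return nll(
            [mean_pool([temperature_scale(p, tau) for p in preds]) for preds in ensemble],
            labels,
        )

    tau = min(GRID, key=score)
    return tau, score(tau)


# Synthetic validation set: 200 samples, 5 noisy but informative members.
rng = random.Random(0)
ensemble, labels = [], []
for _ in range(200):
    y = rng.randrange(3)
    members = []
    for _ in range(5):
        logits = [rng.gauss(0.0, 1.0) for _ in range(3)]
        logits[y] += 1.5
        members.append(temperature_scale([math.exp(v) for v in logits], 1.0))
    ensemble.append(members)
    labels.append(y)

baseline = nll([mean_pool(preds) for preds in ensemble], labels)
tau_d, nll_d = pool_then_calibrate(ensemble, labels)
tau_c, nll_c = joint_pool_and_calibrate(ensemble, labels)
```

Since the grid contains the temperature one, neither procedure can do worse than the uncalibrated linear pool on the validation set used for fitting. In the pool-then-calibrate variant the pooled predictions are computed once, outside the search over temperatures, which is what makes it slightly simpler to implement.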
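Importance of the Pooling and Calibration order: Figure 4 shows calibration curves when individual models are temperature scaled separately (i.e. group [B] of methods), as well as when the models are scaled with a common temperature parameter (i.e. group [C] of methods). Furthermore, the calibration curves of the pooled model (group [B] and [C] of methods) are also displayed. More formally, the group [B] of methods obtains for each individual model an optimal temperature $\tau_\star^k$ as solution of the optimization procedure

$$\tau_\star^k = \arg\min_{\tau} \; \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{i \in \mathcal{D}_{\mathrm{val}}} \mathrm{Score}\big( \mathrm{Scale}[\mathbf{p}_i^k, \tau], y_i \big)$$

where $\mathbf{p}_i^k$ denotes the probabilistic output of the $k$-th model for the $i$-th example in the validation dataset, $\mathrm{Scale}[\cdot, \tau]$ denotes temperature scaling with temperature $\tau$, and $\mathrm{Score}$ is a proper scoring rule. The light blue calibration curves correspond to the outputs $\mathrm{Scale}[\mathbf{p}^k, \tau_\star^k]$ for the different models. The deep blue calibration curve corresponds to the linear pooling of the individually scaled predictions. For the group [C] of methods, a single common temperature $\tau_\star$ is obtained as solution of the optimization procedure

$$\tau_\star = \arg\min_{\tau} \; \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{i \in \mathcal{D}_{\mathrm{val}}} \mathrm{Score}\big( \mathbf{p}_i^{\tau}, y_i \big) \tag{13}$$

where $\mathbf{p}_i^{\tau} \equiv \mathrm{Agg}\big[ \mathrm{Scale}(\mathbf{p}_i^{1:K}, \tau) \big]$ denotes the aggregated probabilistic prediction for sample $x_i$. The orange calibration curves are generated using the predictions $\mathrm{Scale}[\mathbf{p}^k, \tau_\star]$ and the red one corresponds to the prediction $\mathrm{Agg}\big[ \mathrm{Scale}(\mathbf{p}^{1:K}, \tau_\star) \big]$.
Notice that when scaled separately (by $\tau_\star^k$) each of the individual models (light blue) is close to being calibrated, but the resulting pooled model (deep blue) is underconfident. However, when scaled by a common temperature, the optimization chooses a temperature that makes the individual models (orange) slightly overconfident, so that the resulting pooled model is nearly calibrated. This is in line with the findings discussed in Section 3 and it also shows why the ordering of pooling and scaling is important.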
Figure 5 compares the four methodologies (A), (B), (C) and (D) identified at the start of this section, with the three different pooling approaches (linear averaging, median and trimmed mean). These methods are compared to the baseline approach (in dashed red line) consisting of fitting a single network trained with the same amount of mixup augmentation before being temperature scaled. All the experiments are executed 50 times, on the same training set, but with different validation sets of size for CIFAR10, IMAGENETTE, IMAGEWOOF and for CIFAR100, and for the Diabetic Retinopathy dataset. The results indicate that on most metrics and datasets, the (naive) method (A) consisting of simply averaging predictions is not competitive. Secondly, and as explained in the previous section, the method (B) consisting in first calibrating the individual networks before pooling the predictions is less efficient across metrics than the last two methods (C) and (D). Finally, the two methods (C) and (D) perform comparably, the method (D) (i.e. Pool-Then-Calibrate) being slightly more straightforward to implement. As regards the pooling methods, the intuitive robustness of the median and trimmed-averaging approaches does not seem to lead to any consistent gain across metrics and datasets. Note that ensembling a set of networks (without any form of post-processing) does lead to a very significant improvement in NLL and Brier score but leads to a serious deterioration of the ECE. The Pool-Then-Calibrate methodology allows one to benefit from the gains in NLL and Brier score without any loss in ECE.
| Metric | Group [A], Linear Pool | Group [B], Linear Pool | Group [C], Linear Pool | Group [D], Linear Pool |
|---|---|---|---|---|
| CIFAR10, 1000 samples | | | | |
| test acc | 70.67 | 69.94 | 69.93 | 69.95 |
| test ECE | 13.9 | 11.1 ± 3.6 | 4.8 ± 2.7 | 4.9 ± 2.9 |
| test NLL | 0.961 | 0.956 ± 0.031 | 0.915 ± 0.013 | 0.916 ± 0.015 |
| test BRIER | 0.431 | 0.431 ± 0.011 | 0.416 ± 0.004 | 0.417 ± 0.005 |
| CIFAR100, 5000 samples | | | | |
| test acc | 55.32 | 54.03 | 53.99 | 54.05 |
| test ECE | 17.8 | 13.1 ± 1.2 | 3.5 ± 0.9 | 2.1 ± 0.5 |
| test NLL | 1.911 | 1.883 ± 0.016 | 1.799 ± 0.002 | 1.787 ± 0.002 |
| test BRIER | 0.623 | 0.616 ± 0.004 | 0.594 ± 0.001 | 0.592 ± 0.0 |
| Diabetic Retinopathy, 5000 samples | | | | |
| test acc | 64.38 | 64.41 | 64.34 | 64.38 |
| test ECE | 4.9 | 2.8 ± 0.7 | 2.8 ± 0.8 | 2.9 ± 0.8 |
| test NLL | 0.641 | 0.636 ± 0.001 | 0.637 ± 0.002 | 0.637 ± 0.002 |
| test BRIER | 0.450 | 0.445 ± 0.001 | 0.445 ± 0.001 | 0.446 ± 0.001 |
| Imagewoof, 1000 samples | | | | |
| test acc | 66.89 | 66.05 | 66.05 | 66.03 |
| test ECE | 8.9 | 7.5 ± 3.5 | 4.2 ± 2.2 | 4.3 ± 2.1 |
| test NLL | 1.044 | 1.065 ± 0.26 | 1.044 ± 0.013 | 1.045 ± 0.12 |
| test BRIER | 0.452 | 0.463 ± 0.008 | 0.456 ± 0.003 | 0.457 ± 0.003 |
| Imagenette, 1000 samples | | | | |
| test acc | 80.91 | 80.72 | 80.74 | 80.75 |
| test ECE | 18.2 | 7.3 ± 2.7 | 3.1 ± 1.1 | 3.5 ± 1.0 |
| test NLL | 0.753 | 0.659 ± 0.018 | 0.637 ± 0.006 | 0.638 ± 0.005 |
| test BRIER | 0.312 | 0.279 ± 0.005 | 0.272 ± 0.001 | 0.273 ± 0.001 |

Table 1: Results for linear pooling within each group [A]–[D] of calibration procedures on different datasets. The number of samples used for the different setups is the same as mentioned in the main text. The mean and standard deviation are reported over 50 different validation sets.
Importance of the validation set: it would be practically useful to be able to fit the temperature without relying on a validation set. We report that using the training set instead (obviously) does not lead to better-calibrated models (i.e. the optimal temperature is close to $\tau = 1$). We have tried to use a different amount of mixup augmentation (and other types of augmentation) on the training set for fitting the temperature parameter, but have not been able to obtain satisfying results.
Size of the ensembles: Figure 7 shows the performance of the different pooling methods (i.e. groups [B]–[D]) on the CIFAR10 dataset, as a function of the number of individual models in the ensemble. For clarity, the (non-calibrated) group [A] of methods is not reported. Recall that the group [A] pools the predictions without any calibration procedure, the group [B] first calibrates each individual model separately before aggregating the results, the group [C] jointly calibrates and aggregates the predictions, and finally the group [D] first aggregates the results before calibrating the resulting prediction. Methods in groups [C] and [D] perform similarly. For the CIFAR10 dataset, we observe that the performance under most metrics saturates for ensembles of sizes .
Table 1 reports the numerical results obtained when a linear averaging aggregation method is used within each group [A]–[D] of calibration procedures. Experiments are carried out on 50 different validation sets (and a single training set).
Role and effect of mixup augmentation: the mixup augmentation strategy is popular and straightforward to implement. As already empirically described in Section 3, increasing the amount of mixup augmentation typically leads to a decrease in the confidence and an increase in the entropy of the predictions. This can be beneficial in some situations but also indicates that this approach should be employed with care for producing calibrated probabilistic predictions. Contrary to geometric data-augmentation transformations such as image flipping, rotations, and dilations, the mixup strategy produces non-realistic images that consequently lie outside the data manifold of natural images: this typically leads to a large distributional shift. The mixup strategy relies on a subtle trade-off between the increase in training data diversity, which can help mitigate overfitting problems, and the distributional shift that can be detrimental to the calibration properties of the resulting method. Figure 6 compares the performance of the Pool-Then-Calibrate approach when applied to a deep ensemble of networks trained with different amounts of mixup augmentation. The results are compared to the same approach (i.e. Pool-Then-Calibrate with networks) with no mixup augmentation. The results indicate a clear benefit in using the mixup augmentation in conjunction with temperature scaling.
Ablation study: For our ablation study, we focus on the CIFAR10 dataset with 1000 examples. As mentioned earlier, we reduce the training dataset by 50 training examples for steps involving the validation dataset. As for Table 1, we evaluate the methods requiring a post-processing optimization on a random set of 50 different validation datasets. We provide the results of our ablation study in Table 2. For setups involving the training of a single model, we report the mean and standard deviation of each metric over 30 different trained models.
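| Metric | (Ours) 30 models, temp scaled, Augment + mixup | 30 models, mixup, Augment | single model, mixup, Augment | single model, no mixup, Augment | single model, no mixup, no Augment |
|---|---|---|---|---|---|
| test acc | 69.92 ± 0.04 | 70.67 | 66.45 ± 0.61 | 63.73 ± 0.51 | 49.85 ± 0.66 |
| test ECE | 3.3 ± 1.9 | 13.9 | 7.03 ± 0.7 | 20.7 ± 0.4 | 23.4 ± 1.0 |
| test NLL | 0.910 ± 0.012 | 0.961 | 1.03 ± 0.13 | 1.509 ± 0.017 | 1.770 ± 0.045 |
| test BRIER | 0.414 ± 0.002 | 0.431 | 0.463 ± 0.005 | 0.556 ± 0.006 | 0.718 ± 0.009 |

Table 2: Ablation study on the CIFAR10 dataset.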
Cold posteriors: the article [WRV20] reports gains in several metrics when fitting Bayesian neural networks to a tempered posterior of type $\pi_T(\mathbf{w}) \propto \pi_{\mathrm{post}}(\mathbf{w})^{1/T}$, where $\pi_{\mathrm{post}}$ is the standard Bayesian posterior, for temperatures $T$ smaller than one. Although not identical to our setting, it should be noted that in all our experiments, the optimal temperature was consistently smaller than one. In our setting, this is because simply averaging predictions leads to underconfident results. We postulate that related mechanisms are responsible for the observations reported in [WRV20].
5 Discussion
The problem of calibrating deep ensembles has received surprisingly little attention in the literature. In this text, we examined the interaction between three of the simplest and most widely used methods for scaling deep learning to the low-data regime: ensembling, temperature scaling, and mixup data-augmentation. We highlight that ensembling in itself does not lead to better-calibrated predictions, that the mixup augmentation strategy is practically important and relies on non-trivial trade-offs, and that these methods subtly interact with each other. Crucially, we demonstrate that the order in which the pooling and temperature scaling procedures are executed is important for obtaining calibrated deep ensembles. We advocate the Pool-Then-Calibrate approach, which consists of first pooling the individual neural network predictions together before eventually post-processing the result with a simple and robust temperature scaling step. Furthermore, we note that this approach is insensitive to the choice of pooling method, the simple linear averaging procedure being essentially as robust as the median and trimmed averaging methods.
References
 [BC17] Hamed Bonab and Fazli Can. Less is more: a comprehensive framework for the number of components of ensemble classifiers. arXiv preprint arXiv:1709.02925, 2017.
 [BCKW15] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. ICML, 2015.
 [BM19] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pages 12873–12884, 2019.
 [Bri50] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
 [CB09] Jorge Cuadros and George Bresnick. EyePACS: An Adaptable Telemedicine System for Diabetic Retinopathy Screening. Journal of Diabetes Science and Technology, 3(3):509–516, May 2009.
 [CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
 [DR17] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
 [FHL19] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.

 [GG16] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 [GML18] Marylou Gabrié, Andre Manoel, Clément Luneau, Nicolas Macris, Florent Krzakala, Lenka Zdeborová, et al. Entropy and mutual information in models of deep neural networks. In Advances in Neural Information Processing Systems, pages 1821–1831, 2018.
 [GPSW17] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.
 [GR07] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
 [Gra11] A. Graves. Practical variational inference for neural networks. NIPS, 2011.
 [GWR16] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
 [HFW19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
 [How] Jeremy Howard. Imagenette and imagewoof.
 [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
 [JW08] Victor Richmond R Jose and Robert L Winkler. Simple robust averages of forecasts: Some empirical results. International journal of forecasting, 24(1):163–169, 2008.
 [LPB17] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.
 [LPC15] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
 [LW16] C. Louizos and M. Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.

 [Mac92] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
 [MGI19] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson. A simple baseline for Bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476, 2019.

 [MKS15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [MMK03] David J. C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
 [MMN18] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
 [MPV16] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
 [Nea12] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

 [Pla99] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
 [Pre98] Lutz Prechelt. Early stopping - but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 1998.
 [PW17] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
 [RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [RVE18] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
 [SHK14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

 [SLJ15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [WGCLJ19] Robert L. Winkler, Yael Grushka-Cockayne, Kenneth C. Lichtendahl, and Victor Richmond R. Jose. Probability forecasts and their combination: A research perspective. Decision Analysis, 16(4):239–260, 2019.
 [WHSX16] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
 [WI20] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
 [WRV20] Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
 [YZYC19] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
 [ZBH16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 [ZCDL17] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
 [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.