This repository contains code and further information for the work in https://arxiv.org/abs/2003.09946
Deep Neural Networks (DNN) represent the state of the art in many tasks. However, due to their overparameterization, their generalization capabilities are in doubt and are still under study. Consequently, DNN can overfit and assign overconfident predictions, as they tend to learn highly oscillating decision thresholds. This has been shown to affect the calibration of the confidences assigned to unseen data. Data Augmentation (DA) strategies have been proposed to overcome some of these limitations. One of the most popular is Mixup, which has shown a great ability to improve the accuracy of these models. Recent work has provided evidence that Mixup also improves the uncertainty quantification and calibration of DNN. In this work, we argue and provide empirical evidence that, due to its fundamentals, Mixup does not necessarily improve calibration. Based on our observations we propose a new loss function that improves the calibration, and also sometimes the accuracy. Our loss is inspired by Bayes decision theory and introduces a new training framework for designing losses for probabilistic modelling. We provide state-of-the-art accuracy with consistent improvements in calibration performance.
Deep Neural Networks (DNN) are probabilistic models (PM) that represent the state of the art in many tasks, either as end-to-end models or as part of complex decision systems. Many of the applications in which DNN have widely outperformed previous approaches require that the parameterized probability distributions are interpretable. This means that both the prediction (for instance, the class selected in a classification problem) and the probability assigned to that prediction are important for the correct performance of the whole system. Examples of these applications are medical diagnosis, self-driving cars and language recognition. In all these problems, deciding on an action with high probability is very different from deciding on it with a more moderate one. The consequences of the decision process can be drastic if these probabilities are not reliable, i.e., not well calibrated. In other words, our model is calibrated if the probabilities assigned reflect the uncertainty present in the data distribution. Moreover, our PM must be able to discriminate between, i.e., to separate, the different classes. Note that discrimination is inherent to the data distribution, which means that we cannot expect to separate our data if it is not separable at its origin. Both good discrimination and a correct modelling of data uncertainty are mandatory to achieve optimal classification performance through the Bayes Decision Rule (BDR).
The calibration and discrimination of a PM can be improved by optimizing the expected value of a proper scoring rule (PSR), an additive scalar representing both quantities. Because both quantities are folded into a single scalar, this optimization does not guarantee optimal calibration, as all the effort can be pushed into better discriminative capabilities. This effect has recently been observed in the context of DNN, where it was shown that although these models are typically trained by optimizing the Negative Log-Likelihood (NNL), their calibration is compromised in the direction of over-confidence. This means that even though the accuracy provided by these models on several benchmarks is among the best published, the probabilities assigned are ultimately extreme and badly calibrated. One should not be surprised by this generalization limitation, as many theories that study the generalization capabilities of probabilistic models, such as the VC dimension or the use of marginal likelihoods and Bayes rule for model selection, are instances of the Occam’s Razor principle. For instance, recent work has shown that DNN can memorize the data input distribution and that many state-of-the-art models overfit the test set.
For that reason, the community has been exploring different regularization techniques that can improve the generalization of these models, with Data Augmentation (DA) being one of the gold standards. These techniques aim to increase the support on the input manifold through transformations that are typically driven by expert knowledge, e.g. rotations or translations when the inputs are images. Thus, in many domains, it is not clear which kind of augmentations might be useful, which motivates the analysis of general-purpose DA techniques such as Mixup, whose fundamentals rely on empirical risk minimization (ERM). However, both Mixup and human-driven DA techniques share a common issue: they are not designed by analyzing the properties of the input distribution and the intersection of these with the PM, mainly because modern instances of these, such as DNN, are difficult to interpret. For that reason, the selection and performance of DA techniques depend, basically, on cross-validation, and there is no principled way to establish whether a particular DA technique will boost the performance of a particular application.
Motivated by the fundamentals and good performance of Mixup, a very recent work has studied how Mixup affects the uncertainty quantification and the calibration performance of DNN. It shows that Mixup improves the calibration, and attributes this fact to the smoothness that Mixup induces in the decision regions learned by a PM. Our work is built on top of this observation. We argue that the fundamentals of ERM and Mixup do not allow us to claim that learning smoother decision thresholds is a sufficient condition for having a properly calibrated PM, because this decision is not based on the uncertainty of our input distribution. This also extrapolates to other strategies that have shown good regularization in terms of accuracy, uncertainty quantification or calibration, such as label smoothing or, more recently, DA techniques [24, 13].
In this work, we first provide empirical evidence that Mixup can degrade calibration. Secondly, we propose a new loss function to correct this calibration degradation by encouraging the PM to learn its discriminative capabilities through the incorporation of a simple measure of data uncertainty. Our loss function is inspired by how optimality is achieved in a BDR scenario, and we claim that this has to be done to achieve reliable probability distributions. Note that learning to assign extreme probabilities only makes sense if the input distribution does not present any kind of overlap between classes, which is something really hard to assess. For that reason, it should not be surprising that a modern PM, such as a DNN, can exhibit undesirable effects such as memorization or overconfident, badly calibrated probabilities when forced to achieve this assignment, as happens when learning through the categorical cross-entropy (CE). Note that a modern DNN, due to overparameterization, can successfully make this assignment without any guarantee of generalization, and it typically relies on learning highly oscillating decision thresholds, which are also responsible for its vulnerability to adversarial attacks. The results of this work open new perspectives for designing losses in this fashion, aiming at representing more sophisticated forms of data uncertainty.
The first work that showed the badly calibrated probabilities of DNN compared different classical calibration techniques and proposed Temperature Scaling. Building on this work, it has been shown how complex techniques can be employed for offline calibration if uncertainty is correctly incorporated, through the use of Bayesian Neural Networks. On the other hand, [12, 11] have shown that using self-supervised learning and pre-trained models improves model robustness, uncertainty and calibration. Moreover, the same author has measured robustness against common perturbations, and other work has measured the performance on calibration and uncertainty of several strategies under dataset shift. Deep ensembles have also shown good performance for uncertainty quantification and calibration. Finally, on the side of DA strategies, the robustness and calibration of Mixup training have been measured, and On-Manifold Adversarial Data Augmentation has been proposed, which attempts to generate challenging examples by following an on-manifold adversarial attack path in the latent space of a generative model. A technique similar to Mixup, but applied to the hidden layers of a DNN, has also been proposed with good results in robustness against perturbations, and Augmix has been proposed with good results in uncertainty quantification and robustness.
In this section we describe calibration in the context of image classification and provide the fundamentals of Mixup before presenting our loss function in the next section. We are given pairs of observed i.i.d. samples (an input image and its class label) drawn from some unknown joint probability distribution. We then learn a categorical posterior distribution by means of a function that maps input images to class probabilities, fitted by maximum a posteriori estimation. To make decisions we rely on the BDR and choose the action that minimizes the Bayes risk, that is, the expected loss under the posterior, where the loss term represents the cost incurred when taking a given action if the ground truth is a given class. In this work we consider equal losses for all errors, which means that we choose the class with maximum posterior probability. This rule guarantees optimality when we plug in the data generating distribution. In practice this distribution is substituted with the model and thus, the smaller the gap between the model and the data generating distribution, the closer we will be to an optimal decision.
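The decision rule described above can be sketched in a few lines. This is only an illustration under assumed names (`bayes_decision`, `posterior`, `loss` are not taken from the paper): it computes the expected loss of each action under a given posterior and checks that, with equal losses for every error, the rule reduces to choosing the maximum-posterior class.

```python
import numpy as np


def bayes_decision(posterior, loss):
    """Return the index of the action with minimum expected loss.

    posterior: shape (K,), class probabilities for one input.
    loss: shape (A, K), loss[a, k] is the cost of action a when the truth is class k.
    Both names are illustrative assumptions, not the paper's notation.
    """
    risk = loss @ posterior  # expected loss (Bayes risk) of each action
    return int(np.argmin(risk))


posterior = np.array([0.2, 0.5, 0.3])

# Equal losses: zero cost when correct, unit cost for every error.
zero_one = 1.0 - np.eye(3)
assert bayes_decision(posterior, zero_one) == int(np.argmax(posterior))

# A hypothetical asymmetric loss can move the decision away from the argmax.
asymmetric = np.array([[0.0, 1.0, 1.0],
                       [5.0, 0.0, 5.0],
                       [1.0, 1.0, 0.0]])
decision = bayes_decision(posterior, asymmetric)
```

With the asymmetric matrix the chosen action is no longer the most probable class, which is why the quality of the probabilities, and not only the ranking of the classes, matters for decisions.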
In a classification scenario, we say that a model is calibrated if the confidences it assigns to a set of samples towards a given class are equal to the real proportion of samples in that set that the model assigns to this class. This means that, to be calibrated, a model should assign confidences considering the proportion of samples assigned to each of the classes. In addition to calibration, a model should also present a sharpened probability distribution, a property known as discrimination or refinement [2, 5]. With this property, we guarantee that our model can discriminate between classes. Thus, good calibration and discrimination together imply recovering how the data from the different classes is distributed or, in other words, recovering data uncertainty. By doing so, our model will be forced to match the data generating distribution, and this will guarantee asymptotic optimality in the decisions to be taken.
Note that the goal of a PM is to map any data distribution to a linearly separable manifold. We can only achieve separability if: 1) the data is separable at its origin and 2) the model has enough capacity to do so. If 1) or 2) does not hold (which is something that we will not typically know), then it seems unreasonable to force the model to learn towards extreme probabilities, and we should expect an overparameterized model to exhibit different pathologies such as overfitting, memorization or bad calibration. A very illustrative example of this pathology is the following question: why should we push probabilities towards extreme values in a 1-dimensional generative Gaussian classifier if Gaussians have support over the whole real line? Based on this observation, a training loss in a modern PM should somehow consider this inherent structure (uncertainty) in the data to reliably target the underlying distribution, and avoid the great ability of DNN to assign extreme probabilities when we do not know whether the distribution to be modelled is or can be linearly separated. This is the core idea of our proposed loss function, and we will further use it to justify why Mixup should not necessarily provide calibrated distributions.
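The Gaussian example can be made concrete with a small sketch. The means, standard deviation and prior below are hypothetical values chosen only for illustration; the point is that the true posterior of a class stays strictly between 0 and 1 wherever both class densities are positive, so one-hot targets do not match the data uncertainty.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_class0(x, mean0=-1.0, mean1=1.0, std=1.0, prior0=0.5):
    """Bayes posterior of class 0 for two overlapping 1-D Gaussians.

    All parameter values are illustrative assumptions.
    """
    p0 = prior0 * gaussian_pdf(x, mean0, std)
    p1 = (1.0 - prior0) * gaussian_pdf(x, mean1, std)
    return p0 / (p0 + p1)


# The posterior is 0.5 at the midpoint and approaches, but never reaches, 1.
for x in (0.0, -1.0, -3.0):
    print(x, round(posterior_class0(x), 4))
```

Even at the mean of class 0 the posterior is well below 1, because samples from class 1 can still fall there.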
Mixup has its fundamentals in vicinal risk minimization (VRM), which is derived as a solution to the limitations present in ERM [31, 37, 32]; for unfamiliar readers we provide a wider description in appendix 0.A. Contrary to other vicinal distributions, Mixup assumes that the samples in the vicinity distribution do not belong to the same class. For that reason, the Mixup vicinity distribution is defined as the expected value of a linear interpolation between two input samples and their corresponding labels. The interpolation is controlled by a coefficient drawn from a beta distribution. An unbiased estimate of the empirical risk can be obtained by evaluating the average loss function on a set of samples drawn from this distribution.
As a consequence, training with Mixup ensures a linear-soft transition between the confidence assigned by a model in the different parts of the input space. However, this only ensures smoothness in the confidence assigned to different regions of the input space, reducing the overconfidence but without any guarantee of improved calibration, because the uncertainty is not considered at all. Note that Mixup just relies on an assumption about how the samples in the vicinity are distributed but does not take into consideration the proportion of samples present, which is at the core of proper calibration.
As a consequence, only if the data distribution presents a linear relation between its classes can one expect the resulting distribution to be calibrated when applying this technique. In the experimental section, we show that some models trained with Mixup do not necessarily improve the calibration, as has recently been noted. In fact, we show that Mixup tends to worsen the calibration in many cases.
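The interpolation at the heart of Mixup can be sketched as follows. This is a minimal NumPy illustration, not the repository's code: the function name, the use of a single coefficient per batch and the pairing by random permutation are assumptions made for brevity.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Mix each sample with a randomly chosen partner from the same batch.

    x: array of inputs, first axis is the batch.
    y_onehot: array of shape (batch, classes) with one-hot labels.
    alpha: parameter of the symmetric Beta(alpha, alpha) distribution.
    One coefficient per batch is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # interpolation coefficient in [0, 1]
    perm = rng.permutation(len(x))        # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam, perm


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 3, 4, 4))             # toy batch of "images"
labels = np.eye(5)[rng.integers(0, 5, size=8)]     # toy one-hot labels
mixed_images, mixed_labels, lam, perm = mixup_batch(images, labels, rng=rng)
```

Note that the soft label depends only on the coefficient and the two original labels, never on how many samples of each class actually populate that region of the input space, which is the property the argument above relies on.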
As illustrated in previous sections, our objective is to benefit from the improved accuracy of Mixup while providing better calibrated distributions. To do so, we introduce a new loss function, which is a weighted combination of our proposed loss, named Auto-Regularized-Confidence (ARC), and the categorical cross entropy (CE). The ARC loss is inspired by the Expected Calibration Error (ECE); see appendix 0.B for a detailed description of calibration metrics. The idea, as argued in the sections above, is to incorporate data uncertainty in the predictions. This is done by matching the confidence assigned to a set of samples (i.e., a batch) to its own accuracy by means of either of two variants.
The idea is to partition the confidence range assigned to a set of samples into bins. For each bin, the accuracy is computed, and either the average confidence or the individual confidences are forced to match the accuracy. If a single bin is used, our loss function is computed over the entire batch. We make the accuracy a constant value so that learning gradients only depend on the confidence assigned by the model. Our loss is combined with the CE to avoid the local minimum in which the network parameterizes a prior classifier (i.e., the one which assigns prior confidences to samples), as we found in our initial analysis. A prior classifier is useless, yet it is the trivial way of optimizing calibration. The overall loss is thus a weighted combination of the CE and ARC losses.
The weight is a hyperparameter that controls the relative importance given to each of the losses and is established with a validation set. As mentioned, this new loss targets the uncertainty of the learned representation through the accuracy. The accuracy is used to summarize the proportion of samples from different classes that are being “mixed”, so it somehow represents how the data, or the representation that the model can learn, is distributed. The accuracy is clearly a very simple statistical summary of the data uncertainty, and the search for other quantifiers that could encode more useful information, such as how samples are distributed in the input space, is left to future work. Consequently, by evaluating the CE loss on the Mixup image and the ARC loss on the two mixing images, one can benefit from the improved discrimination learned through the CE, while the ultimate confidences are assigned according to how the classifier classifies samples from the generating distribution and not those virtually generated by Mixup. ARC thus incorporates data uncertainty, which will improve the model representation of the underlying distribution, and thus its calibration. To validate this procedure, we experiment with variants that compute the ARC loss over the two mixing images, and with variants that compute it over the Mixup image. In general, datasets benefit more from computing it over the mixing images. A discussion is provided in the experimental section.
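The binning procedure described above can be sketched in NumPy. This is an illustration under stated assumptions rather than the paper's exact formula: the squared distance, the uniform average over non-empty bins, the equal-width bins, the additive form `ce + beta * arc` and all names (`arc_loss`, `combined_loss`, `beta`) are choices made here for the example.

```python
import numpy as np


def arc_loss(confidences, correct, n_bins=1, per_sample=False):
    """Sketch of a confidence-to-accuracy matching loss.

    confidences: predicted confidence of each sample, values in [0, 1].
    correct: 1 if the prediction is right, 0 otherwise.
    per_sample=False matches the bin's average confidence to its accuracy;
    per_sample=True matches every individual confidence to it.
    The squared distance is an assumption of this sketch.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes to the last bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    terms = []
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        # Accuracy acts as a constant target: in an autodiff framework it
        # would be excluded from the gradient computation.
        acc = correct[mask].mean()
        if per_sample:
            terms.append(np.mean((confidences[mask] - acc) ** 2))
        else:
            terms.append((confidences[mask].mean() - acc) ** 2)
    return float(np.mean(terms))


def combined_loss(ce_value, arc_value, beta=1.0):
    """Weighted combination of CE and ARC; the additive form is assumed."""
    return ce_value + beta * arc_value


conf = [0.9, 0.7, 0.8, 0.6]
hits = [1, 1, 1, 0]
avg_variant = arc_loss(conf, hits)                      # close to 0: mean confidence equals accuracy
sample_variant = arc_loss(conf, hits, per_sample=True)  # positive: individual gaps remain
```

In the toy batch the average confidence already equals the accuracy of 0.75, so the averaged variant is satisfied while the per-sample variant still penalizes the spread of the individual confidences.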
An additional analysis of this loss function is provided in appendix 0.C and the experimental section. This includes the motivation behind experimenting with the two variants of the loss and an analysis of why this loss might improve the accuracy, as we have found that some datasets improve this metric when the ARC loss is applied.
Finally, we discuss one drawback of our proposal when used as a general-purpose calibration tool. If it is applied to a DNN that presents near-perfect accuracy on the training dataset (which is the case in many of the standard databases tested), then the ARC loss will provide the same learning signal as the CE, because it will force the average confidences to match that accuracy. This means that it will not work on datasets where the training set is overfitted, as in CIFAR100. To solve this, we experiment with the following variant. We take a validation split from the training dataset on which the DNN presents uncalibrated over-confidences. Let us say that this validation set presents an 80% accuracy with a higher average confidence. We then use the validation set to compute the ARC loss, while the training dataset is only used for the CE.
We perform several experiments that illustrate the main claims of this work. We show average results in the main work and provide specific results on GitHub, alongside code and details on loss hyperparameters. We evaluate a collection of classical benchmarks for this task: CIFAR100, CIFAR10 and SVHN; and we also evaluate our model on more realistic problems such as the ones provided by Caltech Birds and Stanford Cars, which contain bigger and more realistic images. Due to computational restrictions, we did not evaluate our model on ImageNet. We experiment with state-of-the-art configurations of computer vision DNN: Residual Networks, Wide Residual Networks and Densely Connected Neural Networks. Moreover, for each variant, we evaluate several configurations and models with and without dropout. We use models pre-trained on ImageNet for Birds and Cars. We evaluate different calibration metrics, detailed in appendix 0.B. In the main work, we report the accuracy and ECE (with a partition of 15 bins), while the rest of the calibration metrics are reported in appendix 0.D.1. We compare against the best performing approach of MMCE, a recent technique designed to implicitly calibrate a probabilistic DNN. More details are provided in appendix 0.D.
For the sake of illustration, we provide average results of all the models in table 1, and results for the best-performing model per task in table 2. First, as shown in rows B and B+M in the tables, we see how Mixup degrades the calibration except in CIFAR100. By comparing with previously reported results, we can conclude that Mixup behaves particularly well in CIFAR100, probably because the intersection between classes can be explained through a linear relation. However, our tables demonstrate that this is not a general behaviour of Mixup, as shown by the rest of the datasets. It is surprising how Mixup degrades calibration in Birds and Cars, even though the DNN used for these datasets are pre-trained models, which have been shown to provide better calibrated distributions. In general, our results contrast with those of the earlier study that reported a general improvement in calibration performance due to Mixup. We can explain this difference by the fact that different models are used. For instance, while that study uses a VGG-16 and a ResNet-34, we are using much deeper models, such as a ResNet-101 or a DenseNet-121. The difference can be connected to the observation that calibration is further degraded by deeper architectures. Moreover, we emphasize that our results on CIFAR10 are at the state of the art in accuracy and much better calibrated, both for the top model and on average, than those reported in that study (see tables 1 and 2).
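Table 1: Average results over all models. ACC and ECE in %.

| Method | CIFAR10 ACC | CIFAR10 ECE | CIFAR100 ACC | CIFAR100 ECE | SVHN ACC | SVHN ECE | Birds ACC | Birds ECE | Cars ACC | Cars ECE |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline + Mixup (B+M) | 96 | 4.23 | 80.04 | 4.44 | 96.45 | 5.55 | 79.63 | 14.22 | 86.67 | 18.13 |
| MMCE + Mixup (M+M) | 94.95 | 3.74 | 78.27 | 4.83 | 96.59 | 2.83 | 79.99 | 12.37 | 86.03 | 13.07 |
| ARC + Mixup (A+M) | 95.90 | 1.62 | 80.20 | 2.46 | 96.02 | 2.17 | 79.74 | 4.95 | 89.63 | 2.84 |

Table 2: Results for the best-performing model per task. ACC and ECE in %.

| Method | CIFAR10 ACC | CIFAR10 ECE | CIFAR100 ACC | CIFAR100 ECE | SVHN ACC | SVHN ECE | Birds ACC | Birds ECE | Cars ACC | Cars ECE |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline + Mixup (B+M) | 97.19 | 4.65 | 82.34 | 1.42 | 96.97 | 4.91 | 82.09 | 10.14 | 89.45 | 18.10 |
| MMCE + Mixup (M+M) | 97.02 | 1.11 | 81.31 | 4.46 | 97.17 | 3.69 | 82.41 | 10.93 | 88.47 | 11.56 |
| ARC + Mixup (A+M) | 97.09 | 1.03 | 82.02 | 0.98 | 96.82 | 2.20 | 82.45 | 1.28 | 91.13 | 2.40 |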
Analyzing our loss function, we see how it can correct the miscalibration introduced by Mixup training. In CIFAR10 and CIFAR100, A+M is the best performing approach. In SVHN we see that A+M corrects the calibration error introduced in B+M, but the approach behaves similarly to the others. SVHN is a dataset on which many models present good calibration over the test set, as noted also in [8, 22]. Finally, regarding Birds and Cars, we see how our loss can largely correct the miscalibration introduced by Mixup. This means that our approach also performs well with models pre-trained on ImageNet. It should be noted that in this case we do not achieve the same ECE in Birds and Cars as with the baseline model. However, we obtain much better accuracy (see the average results in Cars). In fact, our work reports nearly state-of-the-art accuracy in Cars using a DenseNet, where the best performing reported model has an accuracy only two points above ours but uses much more complex architectures, such as EfficientNet or Inception. On the other hand, our method is better than the recently proposed MMCE. We found that method to be unstable in some cases, as some models saturated during training or tended to degrade the accuracy, as shown in the tables.
Regarding the parameterization of the loss function, we found that most of the time the best configuration of hyperparameters was with of our loss function. This can be explained by the fact that DNN typically learn invariant representations and thus, we avoid the pathological behaviour that can present, which is discussed in appendix 0.C. Besides, we found that only in Birds did the ARC loss computed over the Mixup image work better than when computed over the two mixing images, even though the latter configuration also improved the calibration. Thus, as we claim in section 4, it seems reasonable that a loss function that takes into account, separately, the underlying structure present in the data can provide better calibrated uncertainties.
Finally, by looking at the results of applying the ARC loss over the Baseline model (A in the tables), we see that the improvements in calibration are not significant, or at least not as large as when it is combined with Mixup. We have already argued the reason in section 4, where we mentioned that a possible solution could be to apply the ARC loss on a separate validation set. Surprisingly, the DNN learns to minimize the ARC loss by increasing the accuracy on this validation set rather than by relaxing the confidences assigned.
This work has shown that Mixup does not ensure calibrated class distributions. The results and theory presented suggest that a similar analysis should be carried out for other DA techniques, which is left for future work. We have also opened a new perspective to reduce overconfidence in DNN. As we cannot control how a model might overfit the dataset to achieve high discriminative performance, a good practice is to auto-regularize the model to incorporate the uncertainty of the learned representations. This work has shown a way of doing this on Mixup training, reporting state-of-the-art results in accuracy and calibration. Future work is concerned with the exploration of new loss functions for this purpose.
We gratefully acknowledge the support of NVIDIA-CORPORATION through the donation of two NVIDIA TITAN XP. The research leading to these results has received funding from the European Union through Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) from Comunitat Valencia (2014-2020) under project Sistemas de frabricación inteligentes para la industria 4.0 (grant agreement IDIFEDER/2018/025). Juan Maroñas is supported by grant FPI-UPV. Daniel Ramos is supported by the Spanish National Ministry of Education through grant RTI2018-098091-B-I00.
Gal, Y., et al.: Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059. ICML’16, JMLR.org (2016)
Gulcehre, C., et al.: On integrating a language model into neural machine translation. Comput. Speech Lang. 45(C), 137–148 (Sep 2017)
Tan, M., et al.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) ICML. pp. 6105–6114 (2019)
Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
In this appendix, we extend the explanation given in section 4 of why Mixup should not necessarily provide calibrated distributions. Mixup has its fundamentals in vicinal risk minimization (VRM), which is derived as a solution to the limitations present in ERM [31, 37, 32]. The ideal learning signal in a frequentist paradigm should be provided by the gradient of the expected value of a loss function over the underlying probability density, which in practice is impossible to compute as we only have access to an i.i.d. sample. As a consequence, we attempt to minimize the expected value of the loss function over the empirical distribution, a process known as empirical risk minimization (ERM). This distribution is given by Dirac delta distributions centred at the observed points.
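For reference, the quantities described in this paragraph are usually written as below. The notation is assumed here for illustration (model f, loss \ell, unknown density p(x, y), sample of size N) and may differ from the symbols used in the original equations.

```latex
% Expected risk under the unknown data density p(x, y)
R(f) = \int \ell\big(f(x), y\big)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y

% Empirical distribution: Dirac deltas centred at the observed points
p_{\delta}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \delta(x = x_i,\, y = y_i)

% Empirical risk: the expected loss under p_{\delta}
R_{\delta}(f) = \int \ell\big(f(x), y\big)\, p_{\delta}(x, y)\, \mathrm{d}x\, \mathrm{d}y
             = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big)
```

The last equality makes the lack of support explicit: the loss is only ever evaluated at the N observed points.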
Thus, ERM clearly lacks support in many different parts of the input space, which gives this learning paradigm some limitations such as over/under-fitting, memorization, or sensitivity to adversarial examples. VRM is proposed to solve this lack of support in the input manifold. To achieve this goal, the Dirac delta distribution is substituted with a vicinity distribution, which aims at exploring different parts of the input space in the vicinity of the observed set. For instance, a vicinity distribution can be implemented as a Gaussian centred at each sample. In practice we then sample from this Gaussian distribution and recover an unbiased estimate of the risk computed with this new set of generated samples, which is used in conjunction with stochastic gradient guided learning algorithms. Thus, any DA technique, such as Gaussian noise addition, can be understood under the VRM paradigm.
The main motivation behind Mixup is that other DA techniques assume that the samples in the vicinity distribution belong to the same class. For that reason, the Mixup vicinity distribution is defined as the expected value of a linear interpolation between two input samples and their corresponding labels. This interpolation is parameterized by a coefficient drawn from a beta distribution. An unbiased estimate of the vicinal risk can be obtained by evaluating the average loss function on a set of samples drawn from this distribution.
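In the usual formulation of Mixup, the sampling step and the resulting risk estimate read as follows. The symbols are assumed for illustration (model f, loss \ell, beta parameter \alpha, M virtual samples) and are not necessarily those of the original equations.

```latex
% Draw the interpolation coefficient and build one virtual sample
\lambda \sim \mathrm{Beta}(\alpha, \alpha), \qquad
\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda) y_j

% Monte Carlo estimate of the vicinal risk over M virtual samples
R_{\nu}(f) = \frac{1}{M} \sum_{m=1}^{M} \ell\big(f(\tilde{x}_m), \tilde{y}_m\big)
```

The target \tilde{y} is a function of \lambda and the two labels only, so the confidence the model is asked to produce along the segment between x_i and x_j changes linearly regardless of the actual class proportions there.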
As a consequence, training with Mixup smooths the predictions made by a model in the intersection between samples from the unknown distribution. However, even if this might reduce high oscillations in the predictions made in these regions of the feature space, or smooth the ultimate confidence assigned to these regions, it only ensures that the model will be less overconfident, which does not necessarily mean that the ultimate probability distribution will be calibrated. This is because Mixup only ensures a linear-soft transition between the confidence assigned by the model in different parts of the input space. As a consequence, only if the data distribution presents a linear relation between its classes can one expect the ultimate distribution to be calibrated. Mixup interpolation does not consider the proportion of samples present in the input distribution, which is at the core of proper calibration. In the experimental section we show that some models trained with Mixup do not necessarily improve the calibration, as has recently been noted. In fact, our results show that Mixup can strongly degrade calibration in many cases.
Calibration can be measured in different ways, each with its own properties. While some metrics are directly PSR, such as the Brier score (BS) or the logarithmic score (NNL), averaged over empirical samples, others are merely measures of calibration, such as the expected calibration error (ECE) and the maximum calibration error (MCE). Given a set of samples, the binned metrics are computed by dividing the [0,1] confidence range equally into bins. In each of these bins, the accuracy (acc) and the average confidence (conf) of the samples that lie in that particular bin are computed. ECE is the average absolute gap between acc and conf, weighted by the fraction of samples in each bin, and MCE is the maximum of that gap over the bins.
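The four metrics can be sketched in NumPy as follows. The function name and the bin-edge convention (left-closed bins, with a confidence of exactly 1.0 placed in the last bin) are assumptions of this sketch, and the Brier score is taken here as the squared error summed over classes and averaged over samples, which is one of several conventions in use.

```python
import numpy as np


def calibration_metrics(probs, labels, n_bins=15):
    """Compute ECE, MCE, Brier score and NNL for a set of predictions.

    probs: array (n_samples, n_classes) of predicted probabilities.
    labels: integer array (n_samples,) of true classes.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    conf = probs.max(axis=1)                       # confidence of the decision
    correct = (probs.argmax(axis=1) == labels).astype(float)

    onehot = np.eye(probs.shape[1])[labels]
    brier = float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))
    nnl = float(-np.mean(np.log(probs[np.arange(n), labels])))

    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap                   # weight by fraction of samples
        mce = max(mce, gap)
    return {"ece": float(ece), "mce": float(mce), "brier": brier, "nnl": nnl}


toy_probs = np.array([[0.95, 0.05], [0.85, 0.15], [0.65, 0.35], [0.25, 0.75]])
toy_labels = np.array([0, 0, 1, 1])
metrics = calibration_metrics(toy_probs, toy_labels, n_bins=10)
```

In this toy set the third sample is confidently wrong, which dominates both the MCE and the Brier score.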
Note, for instance, that the NNL score highly penalizes important errors (i.e. extreme and wrong probabilities), but it is not able to separate which part of those errors is due to discrimination and which to calibration. This means that the NNL will always be penalized under non-separable data manifolds, even in the ideal situation in which the model has recovered the data generating distribution (that is, when it presents perfect calibration and has recovered data discrimination). In this situation, a perfect model will present zero ECE, because ECE is just a measure of calibration.
On the other hand, while ECE is not sensitive to an extreme error made on only one sample (assigning 1.0 confidence towards an incorrect class), NNL penalizes this error by assigning an infinite score. In this sense, NNL is much more sensitive to strong overconfidence errors than the rest of the calibration performance metrics, and can be more useful in applications where overconfidence errors must be avoided in a very restrictive way.
In this appendix, we provide additional analysis of the proposed loss function. First, note that while the CE loss aims at pushing the probability of a given sample towards full confidence of belonging to its associated class, our loss encourages the model to auto-adjust its confidence depending on the accuracy of each batch of data being forwarded through the model. This loss function and the CE clearly play different, and opposite, roles regarding the probabilistic information that the model should provide. For this reason, we might think that the combination of both losses could lead to a suboptimal result, as each of the losses pushes in the opposite direction. In other words, both losses play a give-and-take game. However, note that the learning signals provided by the losses are somehow complementary. At the beginning of the learning process, when the network is initialized at random, the network typically parameterizes a quasi-constant output distribution, and the accuracy provided by the model is close to that of a prior classifier. Thus, the learning signal provided by the ARC loss is negligible compared to the one provided by the CE. On the other hand, when the optimization of the CE stalls, the ARC loss plays its role by adjusting the ultimate confidences if they are uncalibrated. This trade-off between ARC and CE can be seen as a type of regularization of the CE by ARC, preventing the CE from reaching discrimination without taking care of calibration. Our loss will not let the CE push the probability towards extreme values.
Moreover, it should be noted that this cost presents other desirable properties that aim at improving regularization. First, consider a set of samples lying in a given confidence range. If the accuracy of these samples is located in this range, then our loss function will encourage the model to adjust the confidences to be as close as possible to the accuracy. Second, if the accuracy provided by the model has a value above this range, then the model will raise these confidences to recover a calibrated model. In this case, our loss function will not change the accuracy, as we are just pushing upwards the confidence of samples which were originally correctly or incorrectly assigned, and thus the decision of which class should be assigned to each sample remains intact. Third, consider the same set of samples but with an accuracy below this range. Our loss function will encourage this set of samples to reduce its confidences. Reducing this confidence clearly has to be done at the cost of raising the confidence towards other classes. By doing this, we have a chance of changing the decision made by the model towards another class, thus helping to improve the discrimination of the model and consequently raising the accuracy.
On the other hand, the idea of experimenting with the two variants of our loss named and is based on the following observation. The only difference between the two variants is whether we force the average confidence of a set of samples to match the accuracy, as performed by , or we force each individual sample to match the accuracy, as done by .
The per-sample variant is proposed to avoid solutions in which the set of confidences assigned by the model presents high variance. This avoids solutions in which, for instance, the network attains a given accuracy on a set of samples while the model assigns one confidence value to half of the samples and a very different one to the other half. In such a setting, the averaged loss can reach its minimum, but the ultimate goal will not be achieved. The possibility of computing our loss over separate bins is incorporated to reduce this effect. However, in practice, we expect both losses to work, as the ideal behaviour of a good representation learned by a model should be to map all the samples of a given class to the same (ideally linearly separable) representation. If this happens, the aforementioned variance of the confidence assigned by the model is reduced.
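The difference between matching the average confidence and matching each sample's confidence can be seen on a toy batch. The sketch below is an illustration only: it assumes a squared-error penalty, uses hypothetical function names, and picks example values (accuracy 0.5, confidences 1.0 and 0.0) purely for the demonstration; they are not taken from the paper.

```python
def arc_mean_matching(confidences, correct):
    """Squared gap between the MEAN confidence and the accuracy of the set."""
    accuracy = sum(correct) / len(correct)
    mean_conf = sum(confidences) / len(confidences)
    return (mean_conf - accuracy) ** 2


def arc_per_sample(confidences, correct):
    """Mean squared gap between EACH confidence and the accuracy of the set."""
    accuracy = sum(correct) / len(correct)
    return sum((c - accuracy) ** 2 for c in confidences) / len(confidences)


def binned(loss_fn, confidences, correct, num_bins):
    """Average loss_fn over equal-width confidence bins, skipping empty bins."""
    bins = [[] for _ in range(num_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * num_bins), num_bins - 1)  # c == 1.0 goes to last bin
        bins[index].append((c, y))
    losses = [
        loss_fn([c for c, _ in members], [y for _, y in members])
        for members in bins
        if members
    ]
    return sum(losses) / len(losses)


# Toy batch: accuracy is 0.5, half the samples get confidence 1.0, half get 0.0.
confidences = [1.0, 1.0, 0.0, 0.0]
correct = [1, 0, 1, 0]

print(arc_mean_matching(confidences, correct))          # 0.0
print(arc_per_sample(confidences, correct))             # 0.25
print(binned(arc_mean_matching, confidences, correct, 2))  # 0.25
```

On this batch the averaged penalty is already zero although no individual confidence matches the accuracy, whereas both the per-sample penalty and the binned version of the averaged penalty remain positive.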
We choose several datasets to evaluate our approach. We rely on classical benchmarks such as CIFAR100 (100), CIFAR10 (10) and SVHN (10) (number of classes in brackets), and we also evaluate our model on more realistic problems such as the ones provided by Caltech-Birds (200) and Stanford-Cars (196). The latter datasets are made up of bigger and more realistic images, and a padding preprocessing step must be applied. Due to computational restrictions, we did not evaluate our model on ImageNet.
We evaluate our model on several state-of-the-art configurations of computer vision neural networks over the mentioned datasets: Residual Networks, Wide Residual Networks and Densely Connected Neural Networks. Moreover, for each variant, we evaluate a model with and without Dropout. We find this interesting because a Dropout model can be used to quantify uncertainties [6, 15]. For the ResNet we add a Dropout layer after the whole network. We set the Dropout values according to the ones provided in the original works or in the model implementations, except for the ResNet, where we set the Dropout rate ourselves. For the Birds and Cars datasets we use the models pre-trained on ImageNet provided by the PyTorch API. On these pre-trained models, we add a Dropout layer at the end. Models are optimized with stochastic gradient descent with momentum and by placing a Gaussian prior over the parameters. The precision of this Gaussian prior is set according to the provided implementations. The initial learning rate used for Birds and Cars differs from the one used for the rest of the datasets. We use a step learning rate scheduler that varies depending on the model. Additional details can be found in the code.
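The link between the Gaussian prior and the usual weight decay term can be written down explicitly. The following is the standard MAP identity, stated under the assumption of a zero-mean isotropic Gaussian prior; the symbols $w$ (parameters), $\lambda$ (precision), $D$ (number of parameters) and $\mathcal{L}_{\text{data}}$ (negative log-likelihood) are notation introduced here.

```latex
% Assumption: zero-mean isotropic Gaussian prior with precision \lambda.
\begin{align}
p(w) &= \mathcal{N}\!\left(w \mid 0, \lambda^{-1} I\right), \\
-\log p(w) &= \frac{\lambda}{2}\,\lVert w \rVert_2^2 - \frac{D}{2}\log\frac{\lambda}{2\pi}, \\
\hat{w}_{\text{MAP}} &= \arg\min_{w}\; \mathcal{L}_{\text{data}}(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2 .
\end{align}
```

The gradient of the penalty is $\lambda w$, so under this assumption the precision plays the role of an L2 weight decay coefficient in the SGD update, up to any rescaling introduced by averaging the data loss over samples.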
Regarding Mixup hyperparameters, we used the ones provided in the original work. On the datasets where this technique was not evaluated, we searched for the optimal value on a validation set. This hyperparameter is then fixed for the rest of the experiments carried out. More details are available on GitHub.
Our loss hyperparameters, including the number of bins and the type of cost used, were searched using a validation set with the ResNet-18, and the result was then applied to all the models with and without Dropout. We did this because we wanted to draw conclusions about a good general configuration of our loss function, which requires a large battery of experiments (we trained more than 1000 neural networks to evaluate the loss), and our computational restrictions made a search of this size feasible only on a single architecture.
As explained in the experimental section, this search allows us to settle on a reliable choice of the hyperparameters, including the number of bins. Our search includes all the possible combinations of: the two variants of the loss; several settings for the number of bins, one of which combines three values (for this one the loss is computed three times, once per number of bins, and the three losses are then averaged); and evaluation of the ARC loss over the Mixup image or over the two separate images that are mixed. This last experiment was essential to validate our claim regarding data uncertainty and calibration, as discussed in Section 4. We ran experiments over all these combinations, searching for the optimal value of the remaining loss hyperparameter, and selected the value that provides good accuracy with low calibration error. In some cases the selected value makes our loss weigh more than the CE loss (see details on GitHub). This highlights the beneficial influence that our loss function can have in several problems.
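The structure of such an exhaustive search can be sketched as follows. Every concrete value below (variant labels, bin counts, candidate weights) is a placeholder chosen for illustration and is not the configuration used in the paper; the names `build_grid` and `averaged_over_bins` are hypothetical.

```python
from itertools import product

# Placeholder search space: none of these values come from the paper.
LOSS_VARIANTS = ["mean_matching", "per_sample"]
BIN_SETTINGS = [(1,), (15,), (5, 15, 30)]      # last entry: three bin counts
LOSS_WEIGHTS = [0.5, 1.0, 2.0]
ARC_INPUTS = ["mixup_image", "separate_images"]


def build_grid():
    """Enumerate every combination of the placeholder hyperparameters."""
    return [
        {"variant": v, "bins": b, "weight": w, "arc_input": i}
        for v, b, w, i in product(
            LOSS_VARIANTS, BIN_SETTINGS, LOSS_WEIGHTS, ARC_INPUTS
        )
    ]


def averaged_over_bins(loss_for_num_bins, bin_counts):
    """Compute the loss once per bin count and average the results."""
    losses = [loss_for_num_bins(num_bins) for num_bins in bin_counts]
    return sum(losses) / len(losses)


grid = build_grid()
print(len(grid))  # 2 * 3 * 3 * 2 = 36 configurations
```

Each entry of `grid` would then be trained and scored on the validation set; the multi-bin setting is handled by passing its tuple of bin counts to `averaged_over_bins`.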
Note that this way of searching for hyperparameters is not optimal. In general, the extrapolated hyperparameter performed well in the rest of the models, as detailed in the per-model results on GitHub. However, we sometimes observed accuracy degradation on the training set. This is because a pathological way of optimizing the ARC loss is to set the parameters so that the model outputs the data prior probability. This solution drives the ARC loss to its minimum, but at the cost of parameterizing a useless prior classifier. As an example, when training a DenseNet-121 with the hyperparameter found to be optimal on the ResNet-18, the accuracy over the training set degraded in exchange for perfect calibration. When this effect was observed, we simply picked the hyperparameter with the next best performance on the ResNet-18, and repeated this until the training accuracy was no longer degraded.
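This fallback procedure can be sketched as a simple loop over candidates ranked by their ResNet-18 validation performance. The function name, the `tolerance` threshold used to decide when training accuracy counts as degraded, and the numbers in the usage example are all assumptions made for illustration.

```python
def pick_hyperparameter(ranked_candidates, train_accuracy, baseline_accuracy,
                        tolerance=0.01):
    """Return the first candidate that does not degrade training accuracy.

    ranked_candidates: values ordered from best to worst on the reference
        model (here, the ResNet-18 validation results).
    train_accuracy: callable returning the training accuracy reached by the
        target model with a given candidate.
    baseline_accuracy: training accuracy of the target model without the loss.
    tolerance: assumed threshold below the baseline that counts as degradation.
    """
    for candidate in ranked_candidates:
        if train_accuracy(candidate) >= baseline_accuracy - tolerance:
            return candidate
    return None  # every candidate degraded the training accuracy


# Hypothetical usage: the top-ranked value collapses towards a prior classifier.
observed = {2.0: 0.60, 1.0: 0.97, 0.5: 0.99}
chosen = pick_hyperparameter([2.0, 1.0, 0.5], observed.get,
                             baseline_accuracy=0.99, tolerance=0.03)
print(chosen)  # 1.0
```

In practice each call to `train_accuracy` corresponds to a full training run, so the loop stops at the first acceptable candidate rather than evaluating all of them.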
On the other hand, in CIFAR100 with Mixup we found this way of searching for the hyperparameter to be less effective. As shown in the per-model results for CIFAR100 in the tables provided on GitHub, all the models except ResNet-18 improve calibration when using Mixup. Thus, we cannot expect the hyperparameter to extrapolate as it does with other datasets. We confirmed this by training any of the deeper models with a validation set. To solve this, we simply perform a hyperparameter search over one of the deeper models in which Mixup showed great calibration performance, and use the resulting hyperparameter with the rest of the models. Due to computational limitations, this search was not as exhaustive as the one performed with the ResNet-18; we only considered a subset of the hyperparameters, chosen according to the wider analysis performed over the ResNet-18.
We finally include additional results from our experiments. Table 3 shows average results and Table 4 shows the results of the best performing model for other calibration metrics. We can see that the results are consistent with those in the experimental section, showing the improvement achieved by A+M.
Finally, Table 5 shows the results of applying the ARC loss only to a validation set that is uncalibrated. Surprisingly, instead of relaxing the confidences, the DNN can increase the accuracy on this validation set without using the CE. This shows the great ability of DNN to overfit, and manifests the unpredictable behaviour of these models when used in probabilistic machine learning. This motivates the search for new losses that can encourage these powerful models to better represent the underlying distribution, and thus move them towards better generalization, which is mandatory for critical applications.