Neural Networks Out-of-Distribution Detection: Hyperparameter-Free Isotropic Maximization Loss, The Principle of Maximum Entropy, Cold Training, and Branched Inferences

06/07/2020 · by David Macêdo, et al. · UFPE

Current out-of-distribution detection (ODD) approaches present severe drawbacks that make their large-scale adoption in real-world applications impracticable. In this paper, we propose a novel loss called Hyperparameter-Free IsoMax that overcomes these limitations. We modified the original IsoMax loss to improve ODD performance while maintaining benefits such as high classification accuracy, fast and energy-efficient inference, and scalability. The global hyperparameter is replaced by learnable parameters to increase performance. Additionally, a theoretical motivation to explain the high ODD performance of the proposed loss is presented. Finally, to keep high classification performance, slightly different inference mathematical expressions for classification and ODD are developed. No access to out-of-distribution samples is required, as there is no hyperparameter to tune. Our solution works as a straightforward SoftMax loss drop-in replacement that can be incorporated without relying on adversarial training or validation, model structure changes, ensemble methods, or generative approaches. The experiments showed that our approach is competitive against state-of-the-art solutions while avoiding their additional requirements and undesired side effects.


1 Introduction

Neural networks have been used in classification tasks in many real-world applications (devries2018leveraging). In such cases, the system usually needs to be able to identify whether a given input belongs to any of the classes on which it was trained. hendrycks2017baseline called this capability out-of-distribution detection (ODD) and proposed datasets and metrics to allow standardized performance evaluation and comparison. Major ODD approaches such as Mahalanobis (lee2018simple) achieve high ODD performance on datasets like CIFAR10, CIFAR100, and SVHN. However, current ODD solutions present severe limitations from a practical use perspective, as they impose additional constraints and concerns (macdo2019isotropic).

Firstly, ODD solutions commonly present hyperparameters that usually require access to out-of-distribution (OOD) samples to be defined (liang2018enhancing; Liang2017PrincipledDO; lee2018simple; lee2017training; DeVries2018LearningNetworks). A direct consequence of using OOD samples to validate hyperparameters and then evaluating ODD performance on the same distribution is the production of overestimated performance estimations (shafaei2018biased). To avoid unrealistic access to OOD samples and overestimated performance, lee2018simple proposed to validate hyperparameters using adversarial samples. However, this introduces the cumbersome procedure of generating adversarial examples, which itself requires the determination of hyperparameters that are typically unknown when dealing with novel datasets. Similar arguments hold for solutions based on adversarial training (Hein2018WhyRN; lakshminarayanan2017simple; li2018anomaly; kliger2018novelty; lee2017training), which also result in higher training time. Approaches based on the generation of adversarial examples or the use of adversarial training may also have limited scalability when dealing with large images such as those presented in ImageNet (Deng2009ImageNet:Database). Hsu2020GeneralizedOD proposed to use the in-distribution validation set to avoid the need for accessing OOD samples to determine the hyperparameters presented by the solution. However, considering that CIFAR10 and CIFAR100 do not have separate sets for validation and test, the results may also be overestimated because the validation sets used to define the hyperparameters were reused during the test for ODD performance estimation. A more realistic ODD performance estimation could be achieved by removing the in-distribution validation set from the in-distribution training data.

Additionally, many solutions make use of the so-called input preprocessing technique introduced in ODIN (liang2018enhancing). However, this technique increases inference time and power consumption by at least three times, since a first inference, followed by a backpropagation, followed by a second inference are required to produce a single useful inference (liang2018enhancing; lee2018simple; Hsu2020GeneralizedOD; DeVries2018LearningNetworks). From a practical point of view, this is indeed a severe drawback, as inferences are performed thousands or millions of times in the field. Therefore, such approaches are prohibitive (not sustainable) from an environmental (Schwartz2019GreenA) and a real-world cost-based perspective. Furthermore, lee2018simple introduced a technique called feature ensemble that requires access to the features of many neural network layers at inference time. It may present scalability problems when applied to real-size images, as the number and size of layers increase dramatically (macdo2019isotropic).

In some cases, an ensemble of classifiers themselves is used (vyas2018out). In Deep Ensembles, lakshminarayanan2017simple proposed an ensemble of same-architecture models trained with different random initial weights. However, despite all the complexity and additional processing power required, this approach produces ODD performance lower than ODIN (shafaei2018biased). Some proposals require undesired structural changes to the model to tackle ODD (yu2019unsupervised). Another harmful common side effect is the so-called classification accuracy drop (techapanurak2019hyperparameterfree; Hsu2020GeneralizedOD), in which higher ODD performance is achieved at the expense of a drop in classification accuracy. From a practical perspective, this is extremely undesirable because the detection of out-of-distribution samples may be a rare event, while classification is the main aim of the designed system (carlini2019evaluating).

There have been attempts to use uncertainty or confidence estimation/calibration techniques (kendall2017uncertainties; Leibig2017LeveragingUI; malinin2018predictive; kuleshov2018accurate; subramanya2017confidence). However, the Bayesian neural networks used in most of them are usually harder to implement and require much more computation to train. Moreover, computational constraints usually force approximations that compromise performance, which is also affected by the chosen prior distribution (lakshminarayanan2017simple). For example, MC-Dropout uses pretrained models with dropout activated during test time, and an average of many inferences is used to perform a single decision (gal2016dropout). Despite all the complexity and additional processing power required, MC-Dropout produces ODD performance lower than ODIN (shafaei2018biased).

The combination of IsoMax loss and the Entropic Score (ES) avoids all the previously mentioned requirements, drawbacks, and side effects, including the classification accuracy drop (macdo2019isotropic). Indeed, the mentioned solution entirely avoids any source of possible ODD performance overestimation, as it neither requires access to OOD or adversarial samples nor uses the in-distribution validation set (the global hyperparameter is not validated for each novel dataset used). Moreover, since input preprocessing is not used, the inferences do not require higher energy consumption, nor are they three times slower. Finally, IsoMax loss trained networks are as scalable as regular SoftMax loss trained ones (macdo2019isotropic) because no adversarial training or adversarial examples are required. However, the global hyperparameter $E_s$ may not be optimal for all datasets. Additionally, as $E_s$ is the same for all classes, harder classes may present higher features-prototypes distances and consequently higher entropies. This imbalance may result in lower overall ODD performance.

Contributions.

Our contribution in this paper is threefold. First, we replace the IsoMax loss global hyperparameter $E_s$ with learnable class-dependent scales and tilts. This procedure simultaneously solves the previously mentioned IsoMax loss imbalance and the fact that the global hyperparameter may not be optimal for all datasets. Besides, since we do not know a priori an optimal mathematical expression to produce logits (log probabilities) from features-prototypes distances, it makes sense to allow logits to be expressed as a learnable function of the features-prototypes distances rather than a predefined one. Second, we present a theoretical motivation, based on the Principle of Maximum Entropy, to explain our proposal's high ODD performance. Third, to simultaneously maximize the classification accuracy and the ODD performance, we formulate slightly different probability expressions for classification and for ODD entropy calculation. We call our approach Hyperparameter-Free Isotropic Maximization Loss or IsoMax Loss Version 2 (IsoMax2).

Our experiments showed that IsoMax2 loss surpasses IsoMax loss performance besides keeping all its advantages. Indeed, IsoMax2 loss is accurate (no classification accuracy drop), fast, energy-efficient, and sustainable (inferences avoid input preprocessing, which would make them at least three times slower and three times less power-efficient), scalable (no feature ensemble, no model ensemble, no adversarial training), turnkey (no hyperparameters to be validated), and seamless (no architectural changes required, no undesired side effects). Additionally, considering that IsoMax2 loss works as a SoftMax loss drop-in replacement, incorporating it into current projects is straightforward. Furthermore, our solution outperforms ODIN (liang2018enhancing), ACET (Hein2018WhyRN), and the original IsoMax loss (macdo2019isotropic), besides being competitive with Mahalanobis (lee2018simple).

Figure 1: (a) SoftMax loss simultaneously minimizes both the cross-entropy and the average entropy. (b) In agreement with the Principle of Maximum Entropy, IsoMax2 loss minimizes the cross-entropy while keeping high mean entropy, which is calculated using ODD inference probabilities.

2 Hyperparameter-Free Isotropic Maximization Loss

Learnable Class-Dependent Scales and Tilts.

Consider an input $\mathbf{x}^{(i)}$ applied to a neural network that performs a parametrized transformation $f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$. Moreover, consider $\mathbf{p}_j$ to be the learnable prototype associated with the class $j$. Additionally, let the expression $\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_j\|$ represent the non-squared Euclidean distance between $f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$ and $\mathbf{p}_j$. Finally, consider $\mathbf{p}_{y^{(i)}}$ the learnable prototype associated with the correct class for the input $\mathbf{x}^{(i)}$. Therefore, we can write the IsoMax loss (macdo2019isotropic) for a batch of $N$ examples by the equation below:

$$\mathcal{L}_{\text{IsoMax}} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{\exp\!\big(-E_s\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_{y^{(i)}}\|\big)}{\sum_{j}\exp\!\big(-E_s\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_{j}\|\big)}\right) \quad (1)$$

In Equation (1), $E_s$ is a global hyperparameter set to ten. From the mentioned equation, we can observe that the logits in regular IsoMax loss are expressed by $-E_s\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_j\|$. A fixed $E_s$ may not be adequate for all classes. Additionally, we must recognize that we do not know a priori an optimal predetermined expression to obtain logits from features-prototypes distances. To solve this, we introduce learnable parameters into the logits formation. For each class $j$, we introduce a scale $\alpha_j$ (initialized to ten) and a tilt $\beta_j$ (initialized to zero) that are learned by the usual training procedure. Therefore, the IsoMax2 loss logits, with learnable class-dependent scales and tilts, may be expressed by $-(\alpha_j\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_j\| + \beta_j)$. By substituting this expression into (1), we obtain the IsoMax2 loss:

$$\mathcal{L}_{\text{IsoMax2}} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{\exp\!\big(-(\alpha_{y^{(i)}}\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_{y^{(i)}}\| + \beta_{y^{(i)}})\big)}{\sum_{j}\exp\!\big(-(\alpha_{j}\,\|f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - \mathbf{p}_{j}\| + \beta_{j})\big)}\right) \quad (2)$$
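To make the construction concrete, below is a minimal PyTorch sketch of the IsoMax2 loss under the notation above. The module and attribute names (IsoMax2Loss, prototypes, scales, tilts) and the prototype initialization are our own illustrative assumptions; the official repository may differ in such details.

```python
# Minimal sketch of the IsoMax2 loss (Equation 2).
# The random prototype initialization is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IsoMax2Loss(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        # Learnable class prototypes p_j in feature space.
        self.prototypes = nn.Parameter(torch.randn(num_classes, num_features))
        # Learnable class-dependent scales alpha_j (initialized to ten)
        # and tilts beta_j (initialized to zero).
        self.scales = nn.Parameter(torch.full((num_classes,), 10.0))
        self.tilts = nn.Parameter(torch.zeros(num_classes))

    def logits(self, features):
        # Non-squared Euclidean distances between features and prototypes.
        distances = torch.cdist(features, self.prototypes, p=2)
        # Logits as a learnable function of the features-prototypes distances.
        return -(self.scales * distances + self.tilts)

    def forward(self, features, targets):
        # Cross-entropy over the scaled and tilted negative distances.
        return F.cross_entropy(self.logits(features), targets)
```

In practice, such a module would replace the last fully connected layer plus SoftMax of a standard network, so the backbone only has to output feature vectors.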
Figure 2: Robustness analyses: (a) SoftMax and IsoMax2 losses classification accuracy for different numbers of training examples per class on a set of datasets. (b) Mean AUROC over all possible out-distributions for SoftMax and IsoMax2 losses for distinct numbers of training examples per class. The Entropic Score was used.

Theoretical Motivation: The Principle of Maximum Entropy and Cold Training.

The Principle of Maximum Entropy, which was formulated by E. T. Jaynes to essentially unify the statistical mechanics and information theory entropy concepts (PhysRev.106.620; PhysRev.108.171), states that, when estimating probability distributions, we should choose the one that produces the maximum entropy consistent with the given constraints (10.5555/1146355). Following this principle, we avoid introducing additional assumptions or bias (see https://mtlsites.mit.edu/Courses/6.050/2003/notes/chapter10.pdf). In other words, from a set of trial probability distributions that satisfactorily describe the prior knowledge available, the one that presents the maximal information entropy (the least informative option) represents the best choice.

$$\min_{\boldsymbol{\theta}}\left(-\frac{1}{N}\sum_{i=1}^{N}\log\,p(y^{(i)}\,|\,\mathbf{x}^{(i)})\right) \;\Longrightarrow\; p(y^{(i)}\,|\,\mathbf{x}^{(i)}) \to 1 \;\Longrightarrow\; \mathcal{H}\big(p(\,\cdot\,|\,\mathbf{x}^{(i)})\big) \to 0 \quad (3)$$

Nevertheless, the loss minimization process currently used in neural networks does not prioritize posterior probability distributions with high average entropy. Actually, exactly the opposite is true. Indeed, the maximum likelihood estimation process, which is equivalent to the minimization of the cross-entropy loss in neural networks, tends to generate posterior probability distributions with extremely low mean entropy. We emphasize that both SoftMax and IsoMax2 are cross-entropy based losses, since both minimize the cross-entropy during training. However, while SoftMax loss also minimizes the mean entropy of the posterior probability distribution, the IsoMax2 loss produces posterior probability distributions with high average entropy. Equation (3) explains the behavior of the cross-entropy and the entropy for the SoftMax loss. Minimizing the cross-entropy (the part before the first implication sign) generates extremely high probabilities (the part between the implication signs) and, consequently, posterior probability distributions with very low entropy (the part after the second implication sign). The cross-entropy loss minimization tends to generate unrealistic overconfident probabilities. Therefore, we have a contradiction between cross-entropy loss minimization and the Principle of Maximum Entropy, as the former usually produces extremely low entropy posterior distributions.

$$\alpha_j \gg 1 \;\Longrightarrow\; \mathcal{L}_{\text{IsoMax2}} \to 0 \;\text{ while }\; p_{\text{ODD}}(y^{(i)}\,|\,\mathbf{x}^{(i)}) \not\to 1 \;\Longrightarrow\; \mathcal{H}\big(p_{\text{ODD}}(\,\cdot\,|\,\mathbf{x}^{(i)})\big) \gg 0 \quad (4)$$

The IsoMax2 loss reconciles these two apparently contradictory objectives by combining what we call cold training and ODD inference. Cold training is performed by introducing the high multiplicative scales $\alpha_j$ (initialized to ten), which is equivalent to using a low temperature ($T \ll 1$) during training. For calculating the entropy to perform ODD (macdo2019isotropic), we remove the mentioned high multiplicative term to construct what we call the ODD inference. In other words, the ODD inference is performed at normal or regular temperature ($T = 1$). Equation (4) demonstrates the reconciliation between cross-entropy loss minimization and the maximum entropy principle in the IsoMax2 loss. Indeed, the high multiplicative term used during cold training (the part before the first implication sign) allows the cross-entropy loss to go to zero while preventing the ODD inference probabilities, which are computed at normal or regular temperature, from becoming extremely high. Hence, it is possible to build posterior probability distributions with high entropies. Therefore, cold training and ODD inference are essential to simultaneously minimize the cross-entropy loss while allowing the construction of high entropy posterior probability distributions at inference time. Fig. 1 shows that the SoftMax loss simultaneously minimizes both the cross-entropy and the entropy of the posterior distribution. Contrary to the SoftMax loss, IsoMax2 loss is capable of minimizing the cross-entropy while keeping high average posterior probability entropy, as recommended by the Principle of Maximum Entropy.
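The effect can be verified with a toy numeric example (three classes, assumed distance values of our choosing): the same features-prototypes distances produce a near-deterministic distribution under the cold (scaled) softmax, so the cross-entropy can approach zero, while the regular-temperature distribution keeps high entropy.

```python
# Toy illustration of cold training vs. ODD inference (assumed values).
import torch
import torch.nn.functional as F

distances = torch.tensor([1.0, 1.5, 1.6])   # features-prototypes distances
cold = F.softmax(-10.0 * distances, dim=0)  # training view (scale = 10)
regular = F.softmax(-distances, dim=0)      # ODD inference view (scale = 1)

def entropy(p):
    return -(p * p.log()).sum()

print(cold)     # ~[0.991, 0.007, 0.002]: cross-entropy can approach zero
print(regular)  # ~[0.464, 0.281, 0.255]: high entropy is preserved
print(entropy(cold).item(), entropy(regular).item())  # low vs. high entropy
```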

Out-of-Distribution Detection: Accurate, Fast, Energy-Efficient, Scalable, Turnkey, and Seamless
Model | In-Data (training) | Out-Data (unseen) | TNR@TPR95 (%) | AUROC (%) | DTACC (%)
Each metric cell reports: SoftMax/MPS / SoftMax/ES / IsoMax/ES / IsoMax2/ES
DenseNet CIFAR10 SVHN 43.9 / 47.4 / 66.0 / 87.2 91.8 / 92.2 / 94.9 / 97.8 85.9 / 85.9 / 88.4 / 93.5
ImageNet 45.1 / 48.8 / 77.7 / 84.3 90.1 / 91.0 / 96.4 / 97.1 83.7 / 83.8 / 90.4 / 91.8
LSUN 56.5 / 61.6 / 87.6 / 87.0 93.8 / 94.5 / 97.7 / 97.5 88.1 / 88.2 / 93.0 / 92.2
CIFAR100 SVHN 16.9 / 18.8 / 23.9 / 59.2 77.0 / 78.1 / 88.7 / 93.7 71.1 / 71.4 / 83.7 / 86.9
ImageNet 20.3 / 25.3 / 47.9 / 52.8 77.6 / 79.4 / 90.7 / 91.2 71.2 / 71.7 / 83.1 / 83.4
LSUN 23.2 / 28.8 / 54.3 / 67.5 78.9 / 80.9 / 92.7 / 93.9 72.0 / 72.7 / 85.9 / 86.7
SVHN CIFAR10 81.8 / 84.8 / 93.8 / 92.5 96.5 / 97.0 / 98.5 / 98.2 91.9 / 92.0 / 94.8 / 94.2
ImageNet 87.8 / 90.4 / 96.6 / 95.6 97.7 / 98.3 / 99.1 / 98.9 93.3 / 93.6 / 95.9 / 95.4
LSUN 84.5 / 87.3 / 94.4 / 95.5 97.1 / 97.6 / 98.6 / 98.8 92.5 / 92.8 / 95.0 / 95.4
ResNet CIFAR10 SVHN 52.7 / 54.8 / 82.2 / 79.5 92.9 / 93.3 / 97.0 / 96.7 87.2 / 87.2 / 91.6 / 91.5
ImageNet 46.6 / 48.3 / 62.2 / 64.7 90.8 / 91.1 / 91.7 / 93.1 85.2 / 85.2 / 83.6 / 86.5
LSUN 56.3 / 58.4 / 75.1 / 73.0 93.5 / 93.9 / 95.0 / 95.2 88.0 / 88.1 / 87.6 / 89.4
CIFAR100 SVHN 11.9 / 11.6 / 26.2 / 61.0 69.8 / 70.7 / 85.5 / 93.7 64.7 / 64.9 / 78.6 / 87.5
ImageNet 18.3 / 21.7 / 48.8 / 43.5 75.5 / 77.1 / 91.0 / 88.2 69.1 / 69.5 / 84.0 / 80.5
LSUN 20.0 / 23.6 / 45.5 / 40.1 77.8 / 79.6 / 91.1 / 88.9 71.4 / 71.8 / 85.4 / 81.8
SVHN CIFAR10 66.7 / 67.2 / 48.1 / 81.2 92.0 / 92.0 / 90.7 / 96.4 87.4 / 87.4 / 84.0 / 91.2
ImageNet 69.2 / 69.8 / 35.4 / 83.9 93.1 / 93.2 / 86.3 / 96.8 88.3 / 88.3 / 79.4 / 91.7
LSUN 67.2 / 67.7 / 32.1 / 81.4 92.3 / 92.3 / 84.5 / 96.2 87.6 / 87.6 / 77.7 / 90.9
Table 1: Fair comparison: Accurate, fast, energy-efficient, scalable, turnkey, and seamless out-of-distribution detection approaches. The best results are in bold (1% tolerance). To the best of our knowledge, IsoMax2/ES presents state-of-the-art performance under these assumptions (accurate, fast, energy-efficient, scalable, turnkey, seamless, and free of undesired side effects).

Branched Inferences: Classification and Out-of-Distribution Detection Probabilities.

Based on the above theoretical motivation, we propose slightly different expressions for the probabilities used for classification and for out-of-distribution detection. We call this branched inferences. For classification inference, we define the probabilities based on the loss actually used to train the neural network (Equation 2). Therefore, to maximize the classification accuracy, the probabilities need to be calculated at cold temperature, keeping the scales and tilts:

$$p_{\text{class}}(y = j\,|\,\mathbf{x}) = \frac{\exp\!\big(-(\alpha_j\,\|f_{\boldsymbol{\theta}}(\mathbf{x}) - \mathbf{p}_j\| + \beta_j)\big)}{\sum_{k}\exp\!\big(-(\alpha_k\,\|f_{\boldsymbol{\theta}}(\mathbf{x}) - \mathbf{p}_k\| + \beta_k)\big)} \quad (5)$$

However, the probabilities used in the Entropic Score (ES) (macdo2019isotropic) to perform ODD should be computed at normal or regular temperature to maximize the entropy of the posterior probability distribution. Therefore, for ODD inference, the expression for the probabilities is given by:

$$p_{\text{ODD}}(y = j\,|\,\mathbf{x}) = \frac{\exp\!\big(-\|f_{\boldsymbol{\theta}}(\mathbf{x}) - \mathbf{p}_j\|\big)}{\sum_{k}\exp\!\big(-\|f_{\boldsymbol{\theta}}(\mathbf{x}) - \mathbf{p}_k\|\big)} \quad (6)$$
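A minimal sketch of the branched inferences, reusing the hypothetical IsoMax2Loss module sketched earlier: Equation (5) keeps the scales and tilts for classification, while Equation (6) drops them to compute the Entropic Score at regular temperature.

```python
# Sketch of the branched inferences using the IsoMax2Loss module above.
import torch
import torch.nn.functional as F

@torch.no_grad()
def classification_probabilities(loss_module, features):
    # Equation (5): cold temperature, keeping scales and tilts.
    return F.softmax(loss_module.logits(features), dim=1)

@torch.no_grad()
def entropic_score(loss_module, features):
    # Equation (6): regular temperature, raw negative distances only.
    distances = torch.cdist(features, loss_module.prototypes, p=2)
    probabilities = F.softmax(-distances, dim=1)
    # Entropic Score: negative entropy of the ODD inference probabilities,
    # so higher scores indicate in-distribution samples.
    entropy = -(probabilities * torch.log(probabilities + 1e-12)).sum(dim=1)
    return -entropy
```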
ODIN/ACET/Mahalanobis present special requirements and produce undesired side effects.
Model | In-Data (training) | Out-Data (unseen) | AUROC (%) | DTACC (%)
Each metric cell reports: ODIN / ACET / IsoMax2/ES / Mahalanobis
DenseNet CIFAR10 SVHN 92.8 / NA / 97.8 / 97.6 86.5 / NA / 93.5 / 92.6
ImageNet 97.2 / NA / 97.1 / 98.8 92.1 / NA / 91.8 / 95.0
LSUN 98.5 / NA / 97.5 / 99.2 94.3 / NA / 92.2 / 96.2
CIFAR100 SVHN 88.2 / NA / 93.7 / 91.8 80.7 / NA / 86.9 / 84.6
ImageNet 85.3 / NA / 91.2 / 97.0 77.2 / NA / 83.4 / 91.8
LSUN 85.7 / NA / 93.9 / 97.9 77.3 / NA / 86.7 / 93.8
SVHN CIFAR10 91.9 / NA / 98.2 / 98.8 86.6 / NA / 94.2 / 96.3
ImageNet 94.8 / NA / 98.9 / 99.8 90.2 / NA / 95.4 / 98.9
LSUN 94.1 / NA / 98.8 / 99.9 89.1 / NA / 95.4 / 99.2
ResNet CIFAR10 SVHN 86.5 / 98.1 / 96.7 / 95.5 77.8 / NA / 91.5 / 89.1
ImageNet 93.9 / 85.9 / 93.1 / 99.0 86.0 / NA / 86.5 / 95.4
LSUN 93.7 / 85.8 / 95.2 / 99.5 85.8 / NA / 89.4 / 97.2
CIFAR100 SVHN 72.0 / 91.2 / 93.7 / 84.4 67.7 / NA / 87.5 / 76.5
ImageNet 83.6 / 75.2 / 88.2 / 87.9 75.9 / NA / 80.5 / 84.6
LSUN 81.9 / 69.8 / 88.9 / 82.3 74.6 / NA / 81.8 / 79.7
SVHN CIFAR10 92.1 / 97.3 / 96.4 / 97.6 89.4 / NA / 91.2 / 94.6
ImageNet 92.9 / 97.7 / 96.8 / 99.3 90.1 / NA / 91.7 / 98.8
LSUN 90.7 / 99.7 / 96.2 / 99.9 88.2 / NA / 90.9 / 99.5
Table 2: Unfair comparison: out-of-distribution detection approaches with distinct requirements. ODIN and Mahalanobis present inferences that are 3X slower and 3X less power-efficient. IsoMax2/ES is the only compared solution that is turnkey (no hyperparameters to validate) and seamless (no model changes, no training modifications, no undesired side effects). The best results are in bold (1% tolerance).

3 Experimental Results

To allow standardized evaluation and comparison, we used the datasets, models, training procedures, and metrics established in hendrycks2017baseline and adopted in many subsequent ODD studies such as liang2018enhancing; lee2018simple; Hein2018WhyRN (Supplementary Material A). We compared our solution with ODIN (liang2018enhancing), since it demonstrates better performance (shafaei2018biased) than some ODD proposals (gal2016dropout; lakshminarayanan2017simple; salimans2017pixelcnn++; Bendale2016TowardsOS) and similar performance to others (masana2018metric; shalev2018out). We also compared with the more recent methods Mahalanobis (lee2018simple) and ACET (Hein2018WhyRN). None of the compared approaches produces a classification accuracy drop. The deterministic code (many runs of any experiment return the same values to all decimal places) to reproduce the results is available at https://github.com/dlmacedo/hyperparameter-free-isotropic-maximization-loss. We performed an ablation study to decide whether to use weight decay and to find an optimal learning rate for the scales and tilts (Supplementary Material B). We present the classification accuracy of the experiments to demonstrate that IsoMax2 loss produces no classification accuracy drop (Supplementary Material C).

Out-of-Distribution Detection Performance Assessment: Fair Scenario.

Table 1 summarizes the results of the fair ODD comparison. In the mentioned table, all approaches are accurate (no classification accuracy drop), fast and power-efficient (inferences are performed without input preprocessing), scalable (no feature ensemble, no model ensemble, no adversarial training), and turnkey (no validation required to define hyperparameters). SoftMax/MPS means SoftMax loss training followed by ODD using the Maximum Probability Score (MPS) (hendrycks2017baseline). IsoMax/ES (IsoMax2/ES) means IsoMax (IsoMax2) loss training followed by ODD using the Entropic Score (ES) (macdo2019isotropic). The best results are in bold. The use of entropy as an ODD score slightly improves the SoftMax loss performance compared to using MPS. Nevertheless, the Entropic Score applied to the IsoMax/IsoMax2 loss provides better results. In most cases, IsoMax2 loss surpasses IsoMax loss performance and provides more stable results.

Figure 3: (a) In SoftMax loss, out-distribution logits mimic in-distribution inter-class logits. (b) In IsoMax2 loss, out-distribution logits mimic in-distribution intra-class logits.

Classification and ODD Performance: Robustness Study.

Fig. 2 shows that IsoMax2 loss presents no classification accuracy drop compared to SoftMax loss while consistently and substantially providing higher ODD performance (mean AUROC over all out-distributions). These results are observed regardless of the density (number of examples per class) used for training the network.

Out-of-Distribution Detection Performance Assessment: Unfair Scenario.

Table 2 summarizes the results of the unfair ODD comparison. The compared methods have different requirements and produce distinct side effects. ODIN (liang2018enhancing) and Mahalanobis (lee2018simple) are not turnkey approaches, as they require adversarial samples to be generated to validate their hyperparameters for each dataset. Moreover, ODIN and Mahalanobis use input preprocessing, which makes their inferences at least three times slower and three times less energy-efficient. Additionally, Mahalanobis (lee2018simple) uses feature ensemble, which makes the approach not scalable to large images. ACET (Hein2018WhyRN) uses adversarial training. Adversarial sample validation and adversarial training are cumbersome procedures to perform from scratch on novel datasets, as parameters such as optimal adversarial perturbations are unknown in such cases. IsoMax2/ES is hyperparameter-free and has none of those special requirements, nor does it produce the mentioned undesired side effects. ACET overcomes ODIN in some cases, and in other situations, ODIN outperforms ACET. In most cases, IsoMax2/ES provides higher results than ACET and ODIN, in some situations by a large margin. IsoMax2/ES is competitive with Mahalanobis. In some situations, IsoMax2/ES even outperforms Mahalanobis despite presenting much fewer requirements and no undesired side effects, besides being a seamless approach.

Figure 4: (a) SoftMax loss produces extremely low entropy outputs, sometimes even for out-distribution samples. (b) IsoMax2 loss produces high entropy outputs. Additionally, a better separation between in-distribution and out-distribution is observed.

Additional Analyses.

Fig. 3 shows that while out-of-distribution logits tend to mimic inter-class logits in SoftMax loss trained networks, they tend to mimic intra-class logits in IsoMax2 loss trained networks. This fact allows the Entropic Score to produce higher ODD performance when using IsoMax2 loss, as many more in-distribution logits are different from out-of-distribution ones. Fig. 4 shows that SoftMax loss trained networks produce overconfident predictions and that the entropy is a high-quality metric to distinguish out-of-distribution samples when using IsoMax2 loss. Maximum probability analyses are presented in Supplementary Material D. Training metrics are shown in Supplementary Material E to demonstrate that those measures are extremely similar for both SoftMax and IsoMax2 losses. See Supplementary Material F for an extended version of Table 1.

4 Conclusion

In this paper, we enhanced the IsoMax loss by replacing its global hyperparameter with learnable scales and tilts. We also created two slightly different expressions for calculating the inference probabilities for classification and for Entropic Score calculation (branched inferences). The experiments showed that these modifications produce higher and more stable ODD performance while keeping all the desired IsoMax loss characteristics (no classification accuracy drop, fast and power-efficient inference, scalability, turnkey, and seamless). Additionally, we presented a theoretical motivation, based on the Principle of Maximum Entropy, to explain the high ODD performance of IsoMax2 loss. Finally, we explained how cold training may be used to produce high entropy posterior probability distributions.

Broader Impact

Our proposal may make the fundamental task of detecting out-of-distribution samples a practical reality, as it allows the incorporation of such functionality in an economically and environmentally sustainable way. Many real-world applications in medicine, finance, agriculture, and business would benefit from adding ODD capabilities to their solutions. Indeed, the presented solution makes the inferences of ODD-enabled systems as energy-efficient as those of usual neural networks, which is not the case for the majority of current out-of-distribution approaches, since they use input preprocessing. Additionally, unlike current methods, our solution is seamless in the sense that it can be trivially incorporated into future projects by only replacing a loss, without classification accuracy drop or any other undesired side effect.


Appendix A Experiment Details

In this work, we trained many 100-layer DenseNets (Huang2017DenselyNetworks) and 34-layer ResNets (He_2016) on the CIFAR10 (Krizhevsky2009LearningImages), CIFAR100 (Krizhevsky2009LearningImages), and SVHN (Netzer2011ReadingLearning) datasets with the SoftMax, IsoMax, and IsoMax2 losses using exactly the same protocol (initial learning rate, learning rate schedule, weight decay, etc.) presented in lee2018simple.

We used resized images from the TinyImageNet dataset (Deng2009ImageNet:Database) and the Large-scale Scene UNderstanding dataset (LSUN) (Yu2015LSUN:Loop), both obtained from https://github.com/facebookresearch/odin, following lee2018simple, to create out-distribution samples. To evaluate the ODD performance of the competing approaches, we added these out-of-distribution images to the test sets of the CIFAR10, CIFAR100, and SVHN datasets to form the final test sets.

The ODD performance was evaluated using the following metrics. First, we calculated the True Negative Rate at 95% True Positive Rate (TNR@TPR95). Besides, we evaluated the Area Under the Receiver Operating Characteristic Curve (AUROC). Finally, the Detection Accuracy (DTACC) corresponds to the maximum detection accuracy over all possible thresholds $\delta$:

$$\text{DTACC} = 1 - \min_{\delta}\Big\{P_{\text{in}}\big(s(\mathbf{x}) \le \delta\big)\,P(\mathbf{x} \in \mathcal{D}_{\text{in}}) + P_{\text{out}}\big(s(\mathbf{x}) > \delta\big)\,P(\mathbf{x} \in \mathcal{D}_{\text{out}})\Big\},$$

where $s(\mathbf{x})$ is an ODD score. We assume that both positive and negative samples have equal probability of being in the test set, i.e., $P(\mathbf{x} \in \mathcal{D}_{\text{in}}) = P(\mathbf{x} \in \mathcal{D}_{\text{out}}) = 0.5$. All the mentioned metrics follow the same calculation procedures detailed in lee2018simple.
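For reference, the sketch below (our own illustration, not the authors' evaluation code) computes the three metrics from in-distribution and out-of-distribution score arrays, assuming higher scores indicate in-distribution samples.

```python
# Sketch of TNR@TPR95, AUROC, and DTACC from raw ODD scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def odd_metrics(scores_in, scores_out):
    # Label in-distribution samples as positives (1) and OOD as negatives (0).
    labels = np.concatenate([np.ones(len(scores_in)), np.zeros(len(scores_out))])
    scores = np.concatenate([scores_in, scores_out])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    # TNR@TPR95: true negative rate at the first threshold reaching 95% TPR.
    tnr_at_tpr95 = 1.0 - fpr[np.searchsorted(tpr, 0.95)]
    # DTACC: best balanced detection accuracy over all thresholds,
    # assuming P_in = P_out = 0.5.
    dtacc = 0.5 * (tpr + (1.0 - fpr)).max()
    return tnr_at_tpr95, auroc, dtacc
```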

Appendix B Ablation Study

We performed an ablation study to decide whether we should use weight decay and/or a slower learning rate for the scales and tilts of the IsoMax2 loss. We trained DenseNets on CIFAR10 using resized fooling images (NguyenYC15) (http://anhnguyen.me/project/fooling/) as out-of-distribution data. Based on the results presented in Table 3, we decided not to use weight decay for the scales and tilts. Moreover, we also decided to use a ten times slower learning rate for them.

Weight Decay | Learning Rate | AUROC (%) | DTACC (%)
Yes | Regular | 93.30 | 86.10
Yes | 10X Slower | 93.34 | 86.37
No | Regular | 92.73 | 86.47
No | 10X Slower | 93.63 | 86.87
Table 3: IsoMax2 loss out-of-distribution performance comparison for a set of scales and tilts training options. The best results are in bold.

Appendix C Classification Accuracy

Table 4 presents the classification results for the experiments performed. It clearly shows that IsoMax2 loss provides no classification accuracy drop compared to SoftMax loss.

Model | Data | SoftMax Loss | IsoMax Loss | IsoMax2 Loss
(each cell: Train Accuracy (%) / Test Accuracy (%))
DenseNet | SVHN | 98.0 / 96.6 | 98.1 / 96.6 | 98.1 / 96.6
DenseNet | CIFAR10 | 100.0 / 95.0 | 100.0 / 94.9 | 100.0 / 94.9
DenseNet | CIFAR100 | 99.8 / 75.9 | 99.9 / 75.8 | 99.9 / 76.4
ResNet | SVHN | 100.0 / 96.5 | 99.5 / 96.4 | 100.0 / 96.5
ResNet | CIFAR10 | 100.0 / 94.2 | 100.0 / 94.3 | 100.0 / 94.4
ResNet | CIFAR100 | 100.0 / 76.7 | 100.0 / 76.3 | 100.0 / 76.4
Table 4: Classification accuracy of networks trained using SoftMax, IsoMax, and IsoMax2 losses.

Appendix D Maximum Probability

Fig. 5 shows that SoftMax loss trained networks usually make predictions with extremely high confidence, while IsoMax2 loss trained ones are much less overconfident. It also shows that IsoMax2 loss never predicts out-of-distribution samples with high probability (for example, predictions with probabilities higher than 0.3 are almost certainly in-distribution samples), which is usually the case in SoftMax loss trained neural networks.

Figure 5: (a) SoftMax loss produces extremely high probability outputs, even for out-distribution samples. (b) IsoMax2 loss does not produce high probability outputs. Additionally, a better separation between in-distribution and out-distribution is observed.

Appendix E Training Metrics

Figure 6: Classification accuracy during training for a set of models, datasets, and losses.
Figure 7: Loss values during training for a set of models, datasets, and losses.

Appendix F Extended Fair Comparison

Table 5 is an extended version of Table 1. The fooling out-distribution is formed of resized fooling images from NguyenYC15 (http://anhnguyen.me/project/fooling/). The Gaussian out-distribution is formed of Gaussian noise images, in which every pixel is independently and identically sampled from a Gaussian distribution with mean 0.5 and unit variance, with values clipped into the range [0, 1]. The uniform out-distribution is formed of uniform noise images, in which every pixel is independently and identically sampled from a uniform distribution on [0, 1].
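As an illustration of how such images can be produced, the following sketch (assumed image count and CIFAR-like shape) generates the two synthetic out-distributions with PyTorch.

```python
# Sketch of the Gaussian and uniform noise out-distributions described above.
import torch

num_images, image_shape = 10000, (3, 32, 32)  # assumed count and shape
# Gaussian noise: mean 0.5, unit variance, clipped to [0, 1].
gaussian_ood = torch.clamp(torch.randn(num_images, *image_shape) + 0.5, 0.0, 1.0)
# Uniform noise on [0, 1].
uniform_ood = torch.rand(num_images, *image_shape)
```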

Out-of-Distribution Detection: Accurate, Fast, Energy-Efficient, Scalable, Turnkey, and Seamless
Model | In-Data (training) | Out-Data (unseen) | TNR@TPR95 (%) | AUROC (%) | DTACC (%)
Each metric cell reports: SoftMax/MPS / SoftMax/ES / IsoMax/ES / IsoMax2/ES
DenseNet CIFAR10 SVHN 43.9 / 47.4 / 66.0 / 87.2 91.8 / 92.2 / 94.9 / 97.8 85.9 / 85.9 / 88.4 / 93.5
ImageNet 45.1 / 48.8 / 77.7 / 84.3 90.1 / 91.0 / 96.4 / 97.1 83.7 / 83.8 / 90.4 / 91.8
LSUN 56.5 / 61.6 / 87.6 / 87.0 93.8 / 94.5 / 97.7 / 97.5 88.1 / 88.2 / 93.0 / 92.2
Fooling 43.0 / 46.3 / 62.6 / 62.9 89.9 / 90.3 / 93.4 / 93.3 83.5 / 83.6 / 86.6 / 86.4
Gaussian 0.0 / 0.0 / 99.9 / 100.0 60.3 / 60.6 / 97.4 / 99.0 74.2 / 74.3 / 91.6 / 99.0
Uniform 0.0 / 0.0 / 96.8 / 100.0 68.4 / 68.5 / 96.5 / 98.2 77.4 / 77.5 / 89.2 / 98.4
CIFAR100 SVHN 16.9 / 18.8 / 23.9 / 59.2 77.0 / 78.1 / 88.7 / 93.7 71.1 / 71.4 / 83.7 / 86.9
ImageNet 20.3 / 25.3 / 47.9 / 52.8 77.6 / 79.4 / 90.7 / 91.2 71.2 / 71.7 / 83.1 / 83.4
LSUN 23.2 / 28.8 / 54.3 / 67.5 78.9 / 80.9 / 92.7 / 93.9 72.0 / 72.7 / 85.9 / 86.7
Fooling 19.9 / 21.8 / 27.6 / 55.0 75.9 / 77.0 / 83.0 / 90.3 70.1 / 70.4 / 75.8 / 81.7
Gaussian 0.0 / 0.0 / 0.0 / 0.0 45.7 / 45.4 / 76.4 / 89.5 61.5 / 61.4 / 82.7 / 91.9
Uniform 2.2 / 0.4 / 5.2 / 51.0 70.6 / 70.0 / 88.2 / 94.9 71.7 / 71.8 / 87.3 / 95.1
SVHN CIFAR10 81.8 / 84.8 / 93.8 / 92.5 96.5 / 97.0 / 98.5 / 98.2 91.9 / 92.0 / 94.8 / 94.2
ImageNet 87.8 / 90.4 / 96.6 / 95.6 97.7 / 98.3 / 99.1 / 98.9 93.3 / 93.6 / 95.9 / 95.4
LSUN 84.5 / 87.3 / 94.4 / 95.5 97.1 / 97.6 / 98.6 / 98.8 92.5 / 92.8 / 95.0 / 95.4
Fooling 79.0 / 80.7 / 79.8 / 73.8 96.1 / 96.4 / 96.5 / 93.6 90.2 / 90.4 / 89.1 / 86.1
Gaussian 91.3 / 95.1 / 99.6 / 99.2 98.1 / 99.0 / 99.6 / 99.1 95.0 / 95.5 / 97.7 / 97.3
Uniform 85.5 / 90.4 / 99.3 / 99.7 97.5 / 98.4 / 99.5 / 99.2 93.7 / 94.2 / 97.5 / 97.9
ResNet CIFAR10 SVHN 52.7 / 54.8 / 82.2 / 79.5 92.9 / 93.3 / 97.0 / 96.7 87.2 / 87.2 / 91.6 / 91.5
ImageNet 46.6 / 48.3 / 62.2 / 64.7 90.8 / 91.1 / 91.7 / 93.1 85.2 / 85.2 / 83.6 / 86.5
LSUN 56.3 / 58.4 / 75.1 / 73.0 93.5 / 93.9 / 95.0 / 95.2 88.0 / 88.1 / 87.6 / 89.4
Fooling 43.6 / 44.8 / 67.4 / 66.0 88.6 / 88.8 / 93.9 / 94.0 82.9 / 83.0 / 87.2 / 87.9
Gaussian 45.2 / 46.8 / 0.0 / 92.5 94.1 / 94.2 / 73.8 / 97.1 91.2 / 91.2 / 76.1 / 95.9
Uniform 59.7 / 61.9 / 0.1 / 76.4 95.3 / 95.5 / 75.8 / 95.9 92.0 / 92.1 / 78.0 / 95.5
CIFAR100 SVHN 11.9 / 11.6 / 26.2 / 61.0 69.8 / 70.7 / 85.5 / 93.7 64.7 / 64.9 / 78.6 / 87.5
ImageNet 18.3 / 21.7 / 48.8 / 43.5 75.5 / 77.1 / 91.0 / 88.2 69.1 / 69.5 / 84.0 / 80.5
LSUN 20.0 / 23.6 / 45.5 / 40.1 77.8 / 79.6 / 91.1 / 88.9 71.4 / 71.8 / 85.4 / 81.8
Fooling 19.6 / 20.6 / 40.1 / 57.4 78.3 / 79.4 / 87.8 / 91.4 71.9 / 72.2 / 80.5 / 83.7
Gaussian 0.0 / 0.0 / 13.4 / 99.9 53.9 / 55.1 / 93.4 / 98.8 70.0 / 70.4 / 94.4 / 97.8
Uniform 0.1 / 0.0 / 59.3 / 99.9 63.4 / 65.4 / 95.1 / 99.1 73.1 / 73.6 / 95.4 / 98.0
SVHN CIFAR10 66.7 / 67.2 / 48.1 / 81.2 92.0 / 92.0 / 90.7 / 96.4 87.4 / 87.4 / 84.0 / 91.2
ImageNet 69.2 / 69.8 / 35.4 / 83.9 93.1 / 93.2 / 86.3 / 96.8 88.3 / 88.3 / 79.4 / 91.7
LSUN 67.2 / 67.7 / 32.1 / 81.4 92.3 / 92.3 / 84.5 / 96.2 87.6 / 87.6 / 77.7 / 90.9
Fooling 62.3 / 62.7 / 46.3 / 69.2 88.3 / 88.2 / 88.3 / 93.7 84.6 / 84.6 / 80.6 / 87.8
Gaussian 79.3 / 80.1 / 5.27 / 87.5 96.3 / 96.5 / 72.8 / 97.7 92.3 / 92.3 / 69.8 / 93.4
Uniform 84.3 / 85.1 / 5.35 / 88.6 97.0 / 97.1 / 73.9 / 97.7 93.4 / 93.5 / 71.1 / 93.8
Table 5: Fair comparison: Accurate, fast, energy-efficient, scalable, turnkey, and seamless out-of-distribution detection approaches. The best results are in bold (1% tolerance). To the best of our knowledge, IsoMax2/ES presents state-of-the-art performance under these assumptions (accurate, fast, energy-efficient, scalable, turnkey, seamless, and free of undesired side effects).