1 Introduction
Deep learning has significantly improved the performance of machine learning systems in fields such as computer vision, natural language processing, and speech recognition. In turn, these algorithms are integral to commercial applications such as autonomous driving, medical diagnosis, and web search. In these applications, it is critical to detect sensor failures, unusual environments, novel biological phenomena, and cyber attacks. To accomplish this, systems must be capable of detecting when inputs are anomalous or out-of-distribution (OOD). In this work, we propose an out-of-distribution detection method for deep neural networks and demonstrate its performance across several OOD classification tasks on state-of-the-art deep neural networks such as DenseNet [8] and Wide ResNet (WRN) [22]. We propose a novel margin-based loss term, added to the cross-entropy loss over in-distribution samples, which maintains a margin of at least m between the average entropy of OOD and ID samples. We propose an ensemble of leave-out classifiers for OOD detection: the training dataset with K classes is partitioned into subsets such that the classes of each partition are mutually exclusive with respect to each other. Each classifier samples one of the subsets without replacement as out-of-distribution training data and uses the rest of the subsets as in-distribution training data. We also propose a new OOD detection score which combines the softmax prediction score and entropy with temperature scaling [13]. We demonstrate the efficacy of our method on the standard benchmarks proposed in ODIN [13] and outperform it. Our contributions are (i) a novel loss for OOD detection, (ii) a self-supervised OOD detection method, and (iii) advancing the state of the art by outperforming the current best methods.
The rest of the paper is organized as follows. Section 2 describes previous work on OOD detection. Section 3 describes our method in detail. Section 4 describes various evaluation metrics used to measure the performance of OOD detection algorithms, and presents ablation results for various design choices and hyperparameters. We then compare our method against the recently proposed ODIN algorithm [13] and demonstrate that it outperforms ODIN on various OOD detection benchmarks. Finally, Section 5 discusses observations about our method, future directions, and conclusions.

2 Related Work
Traditionally, based on the availability of data labels, OOD detection methods can be categorized into supervised [16], semi-supervised [4], and unsupervised methods [15], [3]. All these classes of methods have access to OOD data during training but differ in their access to labels: supervised OOD detection assumes the classifier has labels for normal as well as OOD classes during training, semi-supervised methods have labels only for the normal classes, and unsupervised methods have no labels at all and typically rely on the fact that anomalies occur much less frequently than normal data. In contrast, our method is able to detect anomalies in test OOD datasets the very first time it encounters them during testing. We use one OOD dataset as a validation set to search for hyperparameters.
Notable OOD detection algorithms that work in the same setting as ours are isolation forests [14], Hendrycks and Gimpel [7], ODIN [13], and Lee et al. [12]. Isolation forests [14] exploit the fact that anomalies are scarce and different: when an isolation tree is constructed, anomalous samples tend to appear close to the root of the tree. Anomalies are then identified by measuring the length of the path from the root to a terminating node; the closer a node is to the root, the higher its chance of representing an OOD sample. Hendrycks and Gimpel [7] build on the observation that the prediction probability of incorrect and out-of-distribution samples tends to be lower than the prediction probability of correct samples. Lee et al. [12] modify the formulation of generative adversarial networks [6] to generate OOD samples for the given in-distribution. They achieve this by simultaneously training a GAN [6] and a standard supervised neural network. The joint loss consists of the individual losses and an additional connecting term which reduces the KL divergence between the generated sample's softmax distribution and the uniform distribution.
Another set of related works are open set classification methods [17], [18], [19], [1], [2]. Scheirer et al. [19] introduce and formalize "open space risk", which intuitively is the risk associated with labeling as positive those areas of the output feature space that have no density support from the training data. The approximation to the ideal risk is then defined as a linear combination of "open space risk" and the standard "empirical risk". Bendale and Boult [1] extend the definition of open set risk to open world recognition, where the unknown samples are not a static set. Open world recognition defines a multi-class open set recognition function, a labeling process, and an incremental learning function: the recognition function detects novel classes, which are labeled using the labeling process and finally fed to the incremental learning function, which updates the model. The OSDN work [2] proposes the OpenMax function, which extends the softmax function by adding an additional unknown class to the classification layer. The value for the unknown class is computed by taking a weighted average over all other classes. The weights are obtained from a Weibull distribution learnt over the pairwise distances between the penultimate activation vectors (AVs) of the top farthest correctly classified samples. For an OOD test sample these weights will be high, while for an in-distribution sample these scores will be low. The final activation vector is re-normalized using the softmax function.
The current state-of-the-art is ODIN [13], which proposes to increase the difference between the maximum softmax scores of in-distribution and OOD samples by (i) calibrating the softmax scores by scaling the logits that feed into the softmax by a large constant (referred to as temperature) and (ii) preprocessing the input by perturbing it with the loss gradient. ODIN [13] demonstrated that at high temperature values, the softmax score for the predicted class is proportional to the relative difference between the largest unnormalized output (logit) and the remaining outputs (logits). Moreover, they empirically showed that the difference between the largest logit and the remaining logits is higher for in-distribution images than for out-of-distribution images. Thus temperature scaling pushes the softmax scores of in- and out-of-distribution images further apart compared to plain softmax. Perturbing the input image through gradient ascent w.r.t. the score of the predicted label was demonstrated [13] to have a stronger effect on in-distribution images than on out-of-distribution images, thereby further pushing apart the softmax scores of in- and out-of-distribution images. We leverage the effectiveness of both of these techniques. The proposed method outperforms all the above methods by considerable margins.

3 Out-of-Distribution (OOD) Classifier
In this section, we introduce three important components of our method: the entropy-based margin loss function (Section 3.1), the training of an ensemble of leave-out classifiers (Section 3.2), and the OOD detection score (Section 3.3).

3.1 Entropy-based Margin Loss
Given a labeled set D_in of in-distribution (ID) samples and a set D_out of out-of-distribution (OOD) samples, we propose a novel loss term in addition to the standard cross-entropy loss on ID samples. This loss term seeks to maintain a margin of at least m between the average entropy of OOD and ID samples. Formally, a multi-layer neural network F, which maps an input x to a probability distribution over K classes and is parametrized by θ, is learned by minimizing the cross-entropy loss on ID samples together with a margin loss on the difference of the average entropies over OOD and ID samples. The loss function is given by Equation 1:

L(θ) = −(1/|D_in|) Σ_{(x,y)∈D_in} log F_y(x) + β · max(0, m + (1/|D_in|) Σ_{x∈D_in} H(F(x)) − (1/|D_out|) Σ_{x∈D_out} H(F(x)))   (1)

where F_y(x) is the predicted probability of sample x for its ground-truth class y, H(·) is the entropy of the softmax distribution, m is the margin, and β is the weight on the margin entropy loss.
The new loss term evaluates to its minimum value of zero when the difference between the average entropy of OOD and ID samples is greater than the margin m. For ID samples, the entropy loss encourages the softmax probabilities of the non-ground-truth classes to decrease, while the cross-entropy loss encourages the softmax probability of the ground-truth class to increase. For OOD samples, the entropy loss encourages the probabilities of all classes to be equal. Our experiments suggest that maximizing OOD entropy without bound leads to overfitting; bounding the difference of average entropies by the margin m helps prevent overfitting and is thus better for model generalization [11].
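As a concrete illustration, the loss of Equation 1 can be sketched in a few lines of NumPy. The function and argument names (`margin_entropy_loss`, `beta`) are our own, and the batch-mean reductions are our reading of the text, not the paper's exact implementation:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of each row of softmax probabilities p."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def margin_entropy_loss(p_id, y_id, p_ood, margin=0.4, beta=1.0):
    """Cross-entropy on ID samples plus a hinge on the entropy gap.

    The hinge term vanishes once the average OOD entropy exceeds the
    average ID entropy by at least `margin` (Equation 1).
    """
    # Cross-entropy over ID samples, indexed by their ground-truth labels.
    ce = -np.mean(np.log(p_id[np.arange(len(y_id)), y_id] + 1e-12))
    # Margin term on the difference of average entropies.
    gap = margin + entropy(p_id).mean() - entropy(p_ood).mean()
    return ce + beta * max(0.0, gap)
```

When the OOD softmax outputs are near uniform (high entropy), the hinge is inactive and only the cross-entropy term remains.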
3.2 Training an Ensemble of Leave-out Classifiers
Given in-distribution training data with K classes, the data is divided into partitions such that the classes of each partition are mutually exclusive to those of all other partitions. A set of classifiers is learned, where classifier i uses partition i as OOD data and the rest of the data as in-distribution data. A particularly simple way of partitioning the classes is to divide them into partitions with an equal number of classes. For example, dividing a dataset of 100 classes into 5 random, equal partitions gives partitions of 20 classes each; each classifier would then use 20 classes as OOD and 80 classes as ID. Each classifier is learned by minimizing the proposed margin entropy loss (Equation 1) using the assigned OOD and ID data.
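The partitioning scheme above can be sketched as follows. The helper name and the assumption that the number of classes divides evenly into the splits are ours:

```python
import random

def make_leave_out_splits(num_classes, num_splits, seed=0):
    """Randomly partition class labels into mutually exclusive splits.

    Classifier i treats split i as OOD and the union of the remaining
    splits as in-distribution, mirroring Section 3.2.
    """
    labels = list(range(num_classes))
    random.Random(seed).shuffle(labels)
    size = num_classes // num_splits          # assumes even divisibility
    splits = [labels[i * size:(i + 1) * size] for i in range(num_splits)]
    assignments = []
    for i, ood in enumerate(splits):
        id_classes = [c for j, s in enumerate(splits) if j != i for c in s]
        assignments.append({"ood": ood, "id": id_classes})
    return assignments
```

For CIFAR-100 with 5 splits, each classifier sees 20 OOD classes and 80 ID classes.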
During training, we also assume that a small number of out-of-distribution images are available as a validation dataset. At every epoch, we save the model with the best OOD detection rate on this small OOD validation set, among models within a small accuracy bound of the current best accuracy. The complete algorithm for training the leave-out classifiers is presented in Algorithm 1.

3.3 OOD Detection Score for a Test Image
At test time, an input image is forward propagated through all the leave-out classifiers and the softmax vectors of all the networks are remapped to their original class indices; the left-out classes are assigned a score of zero. For classification of an input sample, the softmax vectors of all the classifiers are first averaged, and the class with the highest averaged softmax score is taken as the prediction. For the OOD detection score, for each of the classifiers we first compute both the maximum value and the negative entropy of the softmax vector with temperature scaling, and then average all of these values to obtain the OOD detection score.
An in-distribution sample with class label c acts as OOD for exactly one of the classifiers. This is because the classes are divided into mutually exclusive partitions, and class c can be part of only one of these partitions. When an in-distribution sample is forward propagated through the classifiers, we expect the negative entropy and maximum softmax score to be high for the classifiers that treated its class as in-distribution. For an OOD sample, however, we expect the negative entropy and maximum softmax score to be relatively low for all classifiers. We thus expect a higher OOD detection score for ID samples than for OOD samples, which differentiates the two.
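The score of Section 3.3 might be computed per test sample roughly as below. This is a minimal NumPy sketch: the function name is ours, the averaging over the ensemble is our reading of the text, and `T` defaults to the temperature used later in the experiments:

```python
import numpy as np

def ood_score(logits_per_classifier, T=1000.0):
    """Average of max temperature-scaled softmax plus negative entropy
    over all classifiers in the ensemble. Higher means more ID-like."""
    scores = []
    for logits in logits_per_classifier:
        z = logits / T
        z = z - z.max()                       # numerical stability
        p = np.exp(z) / np.exp(z).sum()
        neg_entropy = float(np.sum(p * np.log(p + 1e-12)))
        scores.append(float(p.max()) + neg_entropy)
    return float(np.mean(scores))
```

A confidently classified ID sample yields a peaked softmax (high max, high negative entropy) in most classifiers, so its score exceeds that of an OOD sample with near-uniform outputs.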
Following the work of ODIN [13], we use both temperature scaling and input preprocessing at test time. In temperature scaling, the logits feeding into the softmax layer are scaled by a constant factor T. It has been established that temperature scaling calibrates the classification score, and in the context of OOD detection [13] it pushes the softmax scores of in- and out-of-distribution samples further apart compared to plain softmax. We modify input preprocessing by perturbing with respect to the entropy loss instead of the cross-entropy loss used by ODIN [13]; perturbing with the entropy loss decreases the entropy of ID samples much more than that of OOD samples. For an input test image x, after it is forward propagated through the neural network, the gradient of the entropy loss with respect to x is computed and the input is perturbed as in Equation 2:

x̃ = x − ε · sign(∇_x H(F(x)))   (2)

The OOD detection score is then calculated from the combination of maximum softmax value and entropy as described previously, both with temperature scaling.
The complete algorithm for OOD detection on a test image is presented in Algorithm 2.
4 Experimental Results
In this section, we describe our experimental results. We first detail the in-distribution and OOD datasets, the neural network architectures, and the evaluation metrics. We then present ablation studies on the various hyperparameters of the algorithm and draw conclusions. Finally, our method is compared against the current state-of-the-art, ODIN [13], and is shown to significantly outperform it.
Table 1: Test error rates (%) of the proposed method.

Architecture | CIFAR-10 | CIFAR-100
DenseNet-BC | 5.0 | 19.9
WRN-28-10 | 5.0 | 20.4
4.1 Experimental Setup
We use CIFAR-10 (10 classes) and CIFAR-100 (100 classes) [10] as in-distribution datasets to train deep neural networks for image classification. Both consist of 50,000 training images and 10,000 test images, each of size 32×32. The classes of both CIFAR-10 and CIFAR-100 are randomly divided into five parts. As described in Section 3.2, each part is assigned as OOD to a unique network, which is then trained; for each network, the other parts act as in-distribution samples.
Following the benchmarks given in [13], the following OOD datasets are used in our experiments. The datasets are described in ODIN [13] and provided as part of their code release; we restate the descriptions here for completeness.

TinyImageNet [9] (TIN) is a subset of the ImageNet dataset, as used in [13]. Tiny ImageNet contains 200 classes drawn from the original 1,000 classes of ImageNet, for a total of 10,000 images. By randomly cropping and by downsampling each image to 32×32, two datasets, TinyImageNet-crop (TINc) and TinyImageNet-resize (TINr), are constructed.
LSUN is the Large-scale Scene UNderstanding dataset [21] created by Princeton, using deep learning classifiers with humans in the loop. It contains 10,000 images of 10 scene categories. By randomly cropping and by downsampling each image to 32×32, two datasets, LSUNc and LSUNr, are constructed.

iSUN [20] is collected by gaze tracking from Amazon Mechanical Turk using a webcam. It contains 8,925 scene images. As with the other datasets, images are downsampled to 32×32.

Uniform Noise (UNFM) is a synthetic dataset consisting of 10,000 noise images. The RGB value of each pixel in an image is drawn from a uniform distribution over the range [0, 1].

Gaussian Noise (GSSN) is a synthetic dataset consisting of 10,000 noise images. The RGB value of each pixel is drawn i.i.d. from a Gaussian with mean 0.5 and unit variance, and each pixel value is clipped to the range [0, 1].
Neural network architecture. Following ODIN [13], two state-of-the-art neural network architectures, DenseNet [8] and Wide ResNet (WRN) [22], are adopted to evaluate our method. For DenseNet, we use the DenseNet-BC setup as in [8], with dropout rate 0. For Wide ResNet, we use the WRN-28-10 setup, with depth 28, width 10, and dropout rate 0.3. We train both DenseNet-BC and Wide ResNet on CIFAR-10 and CIFAR-100 for 100 epochs with batch size 100, momentum 0.9, weight decay 0.0005, and margin 0.4. The initial learning rate is 0.1, linearly decayed to 0.0001 over the course of training. During training, we augment the training data with random flips and random crops. We use the smallest OOD dataset, iSUN, as validation data for the hyperparameter search, and test the trained networks on the remaining four out-of-distribution datasets. During testing, we use batch size 100. Similar to ODIN [13], input preprocessing with ε = 0.002 is used.
Table 1 shows the test error rates when our method is trained and tested on CIFAR-10 and CIFAR-100 using Algorithms 1 and 2. For both DenseNet-BC [8] and WRN [22], the difference in error rates between the vanilla network and the proposed method is marginal on both datasets. The small difference in performance on CIFAR-10 can be partly explained by the fact that our method did not use the ZCA whitening preprocessing while the vanilla network did.
4.2 Evaluation Metrics
To measure the effectiveness of our method at distinguishing between in-distribution and out-of-distribution samples, we adopt five different metrics, the same as those used in the ODIN [13] paper. We restate these metrics below for completeness. In the rest of the manuscript, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
FPR at 95% TPR measures the probability that an out-of-distribution sample is misclassified as in-distribution when the true positive rate (TPR) is 95%. Here, TPR is computed as TP / (TP + FN), and FPR is computed as FP / (FP + TN).
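With ID samples as positives and scores where higher means more in-distribution, this metric can be sketched as below; using the 5th percentile of the ID scores as the threshold is our implementation choice:

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR on OOD samples at the score threshold where TPR on ID is 95%."""
    thresh = np.percentile(id_scores, 5)      # 95% of ID scores lie above
    return float(np.mean(np.asarray(ood_scores) >= thresh))
```

A perfect detector would report 0.0 here; a random one about 0.95.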
Detection Error measures the minimum misclassification probability over all possible score thresholds, as defined in ODIN [13]. For a fair comparison with ODIN, the same number of positive and negative samples is used during testing.
AUROC is the Area Under the Receiver Operating Characteristic curve. In a ROC curve, the TPR is plotted as a function of the FPR for different threshold settings. AUROC equals the probability that a classifier will rank a randomly chosen positive sample higher than a randomly chosen negative one; an AUROC of 100% means perfect separation between positive and negative samples.
AUPR-In measures the Area Under the Precision-Recall curve. In a PR curve, the precision, TP / (TP + FP), is plotted as a function of the recall, TP / (TP + FN), for different threshold settings. Since precision is directly influenced by class imbalance (through FP), PR curves can highlight performance differences that are lost in ROC curves for imbalanced datasets [5]. An AUPR of 100% means perfect separation between positive and negative samples. For the AUPR-In metric, in-distribution images are taken as positives.
AUPR-Out is similar to AUPR-In, except that out-of-distribution images are taken as positives.
CLS Acc is the classification accuracy for ID samples.
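The rank-based interpretation of AUROC quoted above can be computed directly, which serves as a sanity check against library implementations; this brute-force version is ours and is only suitable for small sample counts:

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive outranks a random
    negative, with ties counted as half a win."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))
```

A score of 1.0 means every positive outranks every negative; 0.5 corresponds to chance.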
4.3 Ablation Studies

Table 2: Ablation studies on DenseNet-BC trained with CIFAR-100 as in-distribution data, evaluated with iSUN as OOD validation data. All values are percentages.

Parameters | FPR at 95% TPR | Detection Error | AUROC | AUPR-In | AUPR-Out | CLS Acc
Number of splits:
3 | 32.37 | 13.94 | 93.50 | 94.39 | 92.22 | 76.41
5 | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
10 | 28.71 | 12.53 | 94.48 | 95.37 | 93.26 | 81.94
20 | 23.85 | 10.95 | 95.49 | 96.24 | 94.36 | 82.33
Type of splits:
Random | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
Manual | 40.16 | 16.26 | 91.57 | 92.90 | 89.57 | 79.79
Epsilon:
0.000000 | 53.51 | 16.37 | 90.75 | 93.16 | 86.71 | 80.32
0.000313 | 41.62 | 14.37 | 92.8 | 94.52 | 90.16 | 80.22
0.000625 | 34.64 | 12.83 | 94.09 | 95.4 | 92.19 | 80.17
0.001250 | 25.74 | 11.19 | 95.38 | 96.31 | 94.04 | 80.08
0.002000 | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
0.003000 | 29.07 | 11.79 | 94.73 | 95.9 | 92.43 | 79.97
Temperature:
1 | 38.57 | 17.32 | 91.44 | 92.7 | 90.12 | 80.01
10 | 27.84 | 11.93 | 94.86 | 95.81 | 93.39 | 80.01
100 | 24.44 | 10.86 | 95.6 | 96.5 | 94.17 | 80.01
1000 | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
5000 | 22.7 | 10.81 | 95.66 | 96.53 | 94.28 | 80.01
Loss function:
SFX | 84.09 | 36.55 | 68.96 | 72.38 | 63.77 | 54.18
SFX+MaxEntropyDiff | 50.70 | 19.65 | 88.26 | 89.71 | 86.18 | 72.99
SFX+MarginEntropy | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
OOD detection score:
SFX | 50.52 | 19.91 | 88.69 | 90.91 | 86.19 | 80.01
Entropy | 36.23 | 16.48 | 91.92 | 93.03 | 90.74 | 80.01
SFX+Entropy | 38.57 | 17.32 | 91.44 | 92.7 | 90.12 | 80.01
SFX@Temp | 22.71 | 10.83 | 95.65 | 96.52 | 94.26 | 80.01
Entropy@Temp | 37.0 | 14.05 | 93.33 | 94.76 | 91.04 | 80.01
(SFX+Entropy)@Temp | 22.95 | 10.79 | 95.69 | 96.55 | 94.3 | 80.01
In this section, we perform ablation studies on the effects of the various hyperparameters used in our model. We perform the ablation studies on the DenseNet-BC [8] network with CIFAR-100 [10] as in-distribution data for training and iSUN [20] as the OOD validation data for testing. By default, we use 5 random splits of CIFAR-100, ε = 0.002, the SFX+MarginEntropy loss for training, a small accuracy bound for saving the models, and (Softmax + Entropy)@Temperature with T = 1000 for detecting out-of-distribution samples. Results are given in Table 2.
(1) Number of splits: This analysis characterizes the sensitivity of our algorithm to the number of splits of the training classes, which equals the number of classifiers in the ensemble. As the number of splits increases, the number of classifiers for which a particular training class is in-distribution also increases. This helps the ensemble discriminate in-distribution samples from OOD samples, but it also increases the computational cost. For CIFAR-100, we studied 3, 5, 10, and 20 splits. While 5 splits gave the best result, 3 splits also provide a good trade-off between accuracy and computational cost. We choose 5 splits as the default value.
(2) Type of splits: This study characterizes the way in which the classes are split into mutually exclusive sets. We experiment with splitting the classes manually using prior knowledge and splitting them randomly. For the manual split, the class labels are first clustered into semantically consistent groups, and classes from each group are then distributed across the splits. The results show that OOD detection rates for random splits are better than for the manual split. This suggests that good OOD detection rates can be achieved by random selection of classes even when the number of classes is very large.
(3) Different ε for input preprocessing: For input preprocessing, we sweep over ε ∈ {0, 0.000313, 0.000625, 0.00125, 0.002, 0.003}. Our results show that as ε increases from 0, the performance of the out-of-distribution detector increases, reaching its best at ε = 0.002; increasing ε further does not help.
(4) Different T for temperature scaling: For temperature scaling, we sweep over T ∈ {1, 10, 100, 1000, 5000}. Our results show that for DenseNet-BC with CIFAR-100, as T increases from 1 to 1000, the performance of the out-of-distribution detector increases; beyond T = 1000, the performance does not change much.
(5) Loss function variants: We study the effects of training our method with different losses. The training regime follows the strategy given in Section 3.2, where the training data is split into mutually exclusive partitions; classifier i uses partition i as OOD data and the rest as in-distribution data.

SFX: We assign an additional label to all the OOD samples and train the classification network using the cross-entropy loss.

SFX+MaxEntropyDiff: Along with the cross-entropy loss, we maximize the difference between the entropy of in- and out-of-distribution samples across all in-distribution classes, without any bound.

SFX+MarginEntropy: Along with the cross-entropy loss, we maximize the difference between the entropy of in- and out-of-distribution samples across all in-distribution classes, but the difference is bounded by a margin as given in Equation 1.
Our results show that the proposed SFX+MarginEntropy loss works dramatically better than the other losses, both for detecting out-of-distribution samples and for accurate classification. These results demonstrate that the proposed loss (Equation 1) is the major factor behind the significant improvements over the current state-of-the-art, ODIN [13].
(6) Out-of-distribution detection score: We study different OOD scoring methods for discriminating out-of-distribution samples from in-distribution samples.

Softmax score: Given an input image, the score is given by the average of maximum softmax outputs over all the classifiers in the ensemble.

Entropy score: Given an input image, the score is given by the average of the negative entropy of the softmax vector over all the classifiers in the ensemble.

Softmax + Entropy: Given an input image, both the above scores are added.

Softmax@Temperature: Given an input image, the above Softmax score is computed on temperature-scaled (T = 1000) softmax vectors.

Entropy@Temperature: Given an input image, the above Entropy score is computed on temperature-scaled (T = 1000) softmax vectors.

(Softmax + Entropy)@Temperature: Given an input image, the Softmax@Temperature and Entropy@Temperature scores above are both computed (T = 1000) on the softmax vectors and then added.
Among the above OOD scoring methods, (Softmax + Entropy)@Temperature (T = 1000) achieved the best performance, and Softmax@Temperature (T = 1000) the second best.
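For one classifier of the ensemble, the six scoring variants above can be sketched as follows (ensemble scores average these over all classifiers); the helper names are ours:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def variant_scores(logits, T=1000.0):
    """The six OOD-score variants from the ablation, for one classifier."""
    p1, pT = softmax(logits), softmax(logits, T)
    neg_ent = lambda p: float(np.sum(p * np.log(p + 1e-12)))
    return {
        "SFX": float(p1.max()),
        "Entropy": neg_ent(p1),
        "SFX+Entropy": float(p1.max()) + neg_ent(p1),
        "SFX@Temp": float(pT.max()),
        "Entropy@Temp": neg_ent(pT),
        "(SFX+Entropy)@Temp": float(pT.max()) + neg_ent(pT),
    }
```

All variants use "higher is more in-distribution"; entropy therefore enters with a negative sign.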
4.4 Results and Analysis







Table 3: Comparison with ODIN [13] on the OOD detection benchmarks. Each cell is in ODIN [13] / Our Method format. All values are percentages.

OOD dataset | FPR at 95% TPR | Detection Error | AUROC | AUPR-In | AUPR-Out
DenseNet-BC, CIFAR-10:
TINc | 4.30/1.23 | 4.70/2.63 | 99.10/99.65 | 99.10/99.68 | 99.10/99.64
TINr | 7.50/2.93 | 6.10/3.84 | 98.50/99.34 | 98.60/99.37 | 98.50/99.32
LSUNc | 8.70/3.42 | 6.00/4.12 | 98.20/99.25 | 98.50/99.29 | 97.80/99.24
LSUNr | 3.80/0.77 | 4.40/2.1 | 99.20/99.75 | 99.30/99.77 | 99.20/99.73
UNFM | 0.00/2.61 | 0.20/3.6 | 100/98.55 | 100/98.94 | 100/97.52
GSSN | 0.00/0.00 | 0.50/0.2 | 99.90/99.84 | 100/99.89 | 99.90/99.6
DenseNet-BC, CIFAR-100:
TINc | 17.30/8.29 | 8.80/6.27 | 97.10/98.43 | 97.40/98.58 | 96.80/98.3
TINr | 44.30/20.52 | 17.50/9.98 | 90.70/96.27 | 91.40/96.66 | 90.10/95.82
LSUNc | 17.60/14.69 | 9.40/8.46 | 96.80/97.37 | 97.10/97.62 | 96.50/97.18
LSUNr | 44.00/16.23 | 16.80/8.77 | 91.50/97.03 | 92.40/97.37 | 90.60/96.6
UNFM | 0.50/79.73 | 2.50/9.46 | 99.50/92.0 | 99.60/94.77 | 99.00/83.81
GSSN | 0.20/38.52 | 1.90/8.21 | 99.60/94.89 | 99.70/96.36 | 99.10/90.01
WRN-28-10, CIFAR-10:
TINc | 23.40/0.82 | 11.60/2.24 | 94.20/99.75 | 92.80/99.77 | 94.70/99.75
TINr | 25.50/2.94 | 13.40/3.83 | 92.10/99.36 | 89.00/99.4 | 93.60/99.36
LSUNc | 21.80/1.93 | 9.80/3.24 | 95.90/99.55 | 95.80/99.57 | 95.50/99.55
LSUNr | 17.60/0.88 | 9.70/2.52 | 95.40/99.7 | 93.80/99.72 | 96.10/99.68
UNFM | 0.00/16.39 | 0.20/5.39 | 100/96.77 | 100/97.78 | 100/94.18
GSSN | 0.00/0.00 | 0.10/1.03 | 100/99.58 | 100/99.71 | 100/99.2
WRN-28-10, CIFAR-100:
TINc | 43.90/9.17 | 17.20/6.67 | 90.80/98.22 | 91.40/98.39 | 90.00/98.07
TINr | 55.90/24.53 | 23.30/11.64 | 84.00/95.18 | 82.80/95.5 | 84.40/94.78
LSUNc | 39.60/14.22 | 15.60/8.2 | 92.00/97.38 | 92.40/97.62 | 91.60/97.16
LSUNr | 56.50/16.53 | 21.70/9.14 | 86.00/96.77 | 86.20/97.03 | 84.90/96.41
UNFM | 0.10/99.9 | 2.20/14.86 | 99.10/83.44 | 99.40/89.43 | 97.50/71.2
GSSN | 1.00/98.26 | 2.90/16.88 | 98.50/93.04 | 99.10/88.64 | 95.90/71.62






Table 4: Mean and standard deviation of FPR at 95% TPR over five random class partitions, for the OOD datasets TINc, TINr, LSUNc, LSUNr, UNFM, and GSSN.
Table 3 shows the comparison between our results and ODIN [13] on various benchmarks. The results are reported for all combinations of neural network architecture, in-distribution dataset, and OOD dataset. Our hyperparameters are tuned using the iSUN dataset. From Table 3, it is clear that our approach significantly outperforms ODIN [13] across all neural network architectures on almost all dataset pairs. The combination of the novel loss function, the OOD scoring method, and the ensemble of models enables our method to significantly improve OOD detection on the more challenging datasets, such as LSUN (resized), iSUN, and TinyImageNet (resized), where the images contain full objects as opposed to cropped parts of objects. The proposed method is slightly worse on the uniform noise and some of the Gaussian noise benchmarks. Moreover, our method achieves significant gains on both CIFAR-10 and CIFAR-100 with the same number of splits (5), even though the number of classes increases by a factor of ten from CIFAR-10 to CIFAR-100. Thus the number of splits need not scale linearly with the number of classes, which makes our method practical. We implicitly outperform Hendrycks and Gimpel [7] and Lee et al. [12], since ODIN outperforms both of these works and our method outperforms ODIN on all but two benchmarks. All three components of our method, namely the novel loss function, the ensemble of leave-out classifiers, and the improved OOD detection score, contribute to the improvement over the state-of-the-art ODIN; their contributions can be seen in Table 2 in the rows marked "Loss function", "Number of splits", and "OOD detection score".
Our algorithm has stochasticity in the form of the random splits of the classes. Given the 100 classes of CIFAR-100, there are many ways to split them into 5 partitions. Table 4 gives the mean and standard deviation of the results across five random partitions of the data when training with 5 splits. We note that even our worst-case results outperform ODIN [13] on the more challenging datasets.
Figure 1 compares the histograms of OOD detection scores on ID and OOD samples when different loss functions are used for training. Figure 1(a) is trained with only the cross-entropy loss, while Figure 1(b) is trained with the proposed margin entropy loss together with cross-entropy; the proposed OOD detector is used in both cases. As shown in Figure 1, the proposed margin entropy loss separates the ID and OOD distributions better than the cross-entropy loss alone. Figure 2 presents the histograms of OOD detection scores on ID and OOD samples for both our method and ODIN [13]. As shown in Figure 2, the proposed method has less overlap between OOD and ID samples than ODIN [13] and thus separates the two distributions better.
5 Conclusion and Future Work
As deep learning is widely adopted in many commercially important applications, it is critical that anomaly detection algorithms be developed for these systems. In this work, we have proposed an anomaly detection algorithm for deep neural networks based on an ensemble of leave-out classifiers. These classifiers are learned by maximizing, up to a margin, the difference between the entropy of OOD samples and in-distribution samples, where a random subset of the training classes serves as OOD data while the rest serves as in-distribution data. We show that our algorithm significantly outperforms the current state-of-the-art methods [7], [12], [13] across almost all benchmarks. Our method contains three important components: a novel loss function, an ensemble of leave-out classifiers, and a novel out-of-distribution detection score. Each component improves OOD detection performance, and each can be applied independently on top of other methods.
We also note that this method opens up several directions of research. First, the proposed ensemble of neural networks requires large memory and computational resources. This can potentially be alleviated by having all the networks share most of their parameters and branch out individually. The number of splits can also be used to trade off detection performance against computational overhead: based on the ablation study (Table 2) and the detailed 3-split results in the supplementary document, even 3 splits outperform ODIN [13]. For use cases where reducing computation time is critical, we recommend using 3 splits; please see the supplementary material for detailed 3-split results. Second, our current work requires an OOD dataset for the hyperparameter search. This could potentially be addressed by investigating surrogate functions for entropy that are better behaved over the training epochs.
References

[1] Bendale, A., Boult, T.E.: Towards open world recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1893–1902 (2015)
[2] Bendale, A., Boult, T.E.: Towards open set deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1563–1572 (2016)
[3] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)
[4] Fujimaki, R., Yairi, T., Machida, K.: An approach to spacecraft anomaly detection problem using kernel feature space. In: KDD. pp. 401–410 (2005)
[5] Goadrich, M., Oliphant, L., Shavlik, J.: Creating ensembles of first-order clauses to improve recall-precision curves. Machine Learning 64, 231–262 (2006)
[6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS) (2014)
[7] Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations (ICLR) (2017)
[8] Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
[9] Tiny ImageNet: https://tiny-imagenet.herokuapp.com
[10] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
[11] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.: A tutorial on energy-based learning. In: Predicting Structured Data (2006)
[12] Lee, K., Lee, H., Lee, K., Shin, J.: Training confidence-calibrated classifiers for detecting out-of-distribution samples. In: International Conference on Learning Representations (ICLR) (2018), https://openreview.net/forum?id=ryiAv2xAZ
[13] Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. In: International Conference on Learning Representations (ICLR) (2018)
[14] Liu, F.T., Ting, K.M., Zhou, Z.: Isolation forest. In: ICDM. pp. 413–422 (2008)
[15] Lu, W., Traoré, I.: Unsupervised anomaly detection using an evolutionary extension of k-means algorithm. IJICS 2(2), 107–139 (2008)
[16] Phua, C., Alahakoon, D., Lee, V.C.S.: Minority report in fraud detection: classification of skewed data. SIGKDD Explorations 6(1), 50–59 (2004)
[17] Rudd, E.M., Jain, L.P., Scheirer, W.J., Boult, T.E.: The extreme value machine. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 762–768 (2018)
[18] Scheirer, W.J., Jain, L.P., Boult, T.E.: Probability models for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2317–2324 (2014)
[19] Scheirer, W.J., de Rezende Rocha, A., Sapkota, A., Boult, T.E.: Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1757–1772 (2013)
[20] Xu, P., Ehinger, K.A., Zhang, Y., Finkelstein, A., Kulkarni, S.R., Xiao, J.: TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755 (2015)
[21] Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
[22] Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)