When deploying machine learning models in the open world, it is important to ensure the reliability of the model in the presence of out-of-distribution (OOD) inputs—samples from an unknown distribution that the network has not been exposed to during training, and therefore should not be predicted with high confidence at test time. We desire models that are not only accurate when the input is drawn from the known distribution, but are also aware of the unknowns outside the training categories. This gives rise to the task of OOD detection, where the goal is to determine whether an input is in-distribution (ID) or not.
A plethora of OOD detection algorithms have been developed recently, among which distance-based methods have demonstrated promise. These approaches circumvent the shortcoming of using the model's confidence score for OOD detection, which can be abnormally high on OOD samples and hence indistinguishable from ID data. Distance-based methods leverage feature embeddings extracted from a model, and operate under the assumption that test OOD samples are relatively far away from the centroids or prototypes of ID classes. For example, Lee et al. used the maximum Mahalanobis distance from the test sample to all class centroids for OOD detection.
Arguably, the efficacy of distance-based approaches can depend largely on the quality of the feature embeddings. Prior works [41, 51] directly employ off-the-shelf contrastive losses for OOD detection. However, existing training objectives produce embeddings that suffice for classifying ID samples, but remain sub-optimal for OOD detection. Tack et al. employ sophisticated data augmentations and test-time ensembling, which can be difficult to search for and apply in practice. It remains underexplored what properties of learned embeddings benefit OOD detection, and how to design training methods that directly achieve them.
In this work, we bridge the gap by proposing CIDER, a Compactness and DispErsion Regularized learning framework designed for OOD detection. Our method is motivated by the desirable properties of embeddings for OOD detection. Intuitively, we desire embeddings where different classes are relatively far apart (i.e. high inter-class dispersion), and samples in each class form a compact cluster (i.e. high intra-class compactness). This is illustrated in Figure 1, where fox (OOD) can be viewed as a hard OOD example w.r.t. cat (ID) and dog (ID). A larger angular distance between the ID classes cat and dog in the hyperspherical space improves the separability from fox, and allows for more effective OOD detection. To formalize our idea, we introduce two losses: a new dispersion loss that promotes large angular distances among different class prototypes, along with a compactness loss that encourages samples to be close to their class prototypes. During training, the class-conditional prototypes are dynamically updated in an exponential-moving-average (EMA) manner.
We quantitatively measure how the feature embedding quality affects the downstream OOD detection performance. In particular, we show that inter-class dispersion is the key to stronger ID-OOD separability. Unlike prior works that directly use existing contrastive losses such as SupCon, CIDER explicitly encourages prototype-wise dispersion and establishes state-of-the-art OOD detection results with competitive ID classification accuracy. Our key results and contributions are summarized as follows:
We propose CIDER, a simple and effective representation learning framework for OOD detection. CIDER establishes state-of-the-art results and is easy to use. Compared to the current best method SSD+, CIDER reduces the average FPR95 by 22.5% on CIFAR-100.
We establish the unexplored relationship between OOD detection performance and the embedding quality in the hyperspherical space, and provide measurements based on the notion of compactness and dispersion. This allows future research to quantify the embedding in the hyperspherical space for effective OOD detection.
We conduct extensive ablations to understand the efficacy and behavior of our method under various settings and hyperparameters, including different loss components, batch size, temperature, regularization weights, and the model architecture. CIDER is effective under a wide range of hyperparameters.
Notations. We consider multi-class classification, where $\mathcal{X}$ denotes the input space and $\mathcal{Y} = \{1, 2, \dots, C\}$ denotes the set of ID labels. The training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is drawn i.i.d. from the joint distribution $P_{\mathcal{X}\mathcal{Y}}$. Let $P_{\mathcal{X}}$ denote the marginal distribution over $\mathcal{X}$, which is called the in-distribution (ID).
Out-of-distribution detection. OOD detection can be viewed as a binary classification problem. At test time, the goal of OOD detection is to decide whether a sample $\mathbf{x}$ is from $P_{\mathcal{X}}$ (ID) or not (OOD). In practice, OOD is often defined by a distribution that simulates unknowns encountered during deployment, such as samples from an irrelevant distribution whose label set has no intersection with $\mathcal{Y}$ and which therefore should not be predicted by the model. Mathematically, let $\mathcal{D}_{\text{test}}^{\text{OOD}}$ denote an OOD test set whose label space satisfies $\mathcal{Y}^{\text{OOD}} \cap \mathcal{Y} = \emptyset$. The decision can be made via a thresholding mechanism:

$$G_{\lambda}(\mathbf{x}) = \begin{cases} \text{ID} & \text{if } S(\mathbf{x}) \ge \lambda, \\ \text{OOD} & \text{if } S(\mathbf{x}) < \lambda, \end{cases}$$
where by convention samples with higher scores $S(\mathbf{x})$ are classified as ID and vice versa. The threshold $\lambda$ is typically chosen so that a high fraction of ID data (e.g. 95%) is correctly classified.
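A minimal sketch of this thresholding mechanism (the function names and the percentile-based choice of $\lambda$ are ours, not from the paper):

```python
import numpy as np

def choose_threshold(id_scores, tpr=0.95):
    """Pick lambda so that a `tpr` fraction of ID samples (those with
    scores >= lambda) is correctly classified as ID."""
    return np.percentile(id_scores, 100 * (1 - tpr))

def detect(scores, lam):
    """Binary ID/OOD decision: higher score => classified as ID."""
    return np.where(scores >= lam, "ID", "OOD")
```

Any scoring function $S(\mathbf{x})$ (e.g. the Mahalanobis score used later) can be plugged into this decision rule.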
In this section, we start with an overview of the architecture. Next, we characterize and provide quantitative measures on the desirable properties of hyperspherical embeddings for OOD detection (Section 3.1). We then present our training objectives that are designed to promote the desirable properties (Section 3.2).
Architecture overview. As shown in Figure 2, the general architecture consists of two components: (1) a deep neural network encoder $f: \mathcal{X} \to \mathbb{R}^e$ that maps the augmented input $\tilde{\mathbf{x}}$ to a high-dimensional feature embedding $f(\tilde{\mathbf{x}})$ (often referred to as the penultimate layer); and (2) a projection head $h: \mathbb{R}^e \to \mathbb{R}^d$ that maps the high-dimensional embedding to a lower-dimensional feature representation $\tilde{\mathbf{z}} := h(f(\tilde{\mathbf{x}}))$. Following convention [4, 41, 50], the loss is applied to the normalized feature embedding $\mathbf{z} := \tilde{\mathbf{z}} / \lVert \tilde{\mathbf{z}} \rVert_2$.
The normalized embeddings are also referred to as hyperspherical embeddings, since they are on a unit hypersphere as shown in Figure 1. Our goal is to shape the hyperspherical embedding space so that the learned embeddings should (1) be sufficiently discriminative for ID classification, and (2) effectively separate unseen OOD data from ID data.
3.1 Characterizing Hyperspherical Embeddings for OOD Detection
We start by characterizing the desirable properties of hyperspherical embeddings for OOD detection, which inspire the design of an effective training objective, as we will show in Section 3.2. Intuitively, we desire embeddings where samples in each class form a compact cluster on the hypersphere, and different classes are relatively far apart. This is beneficial for distance-based OOD detection, where OOD samples are hypothesized to be relatively far from in-distribution classes.
To formalize our idea, we introduce two measurements: inter-class dispersion and intra-class compactness. We will show in Section 4 that these are key properties to determine the OOD detection performance.
Inter-class dispersion. We denote by $\{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_C\}$ the prototype embeddings of the $C$ ID classes. The prototype for each sample is assigned based on the ground-truth class label. The extent of inter-class dispersion can be measured by the average cosine similarity among pair-wise class prototypes:

$$\text{Dispersion} = \frac{1}{C} \sum_{i=1}^{C} \frac{1}{C-1} \sum_{\substack{j=1 \\ j \ne i}}^{C} \boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_j. \tag{1}$$
The unnormalized prototype $\tilde{\boldsymbol{\mu}}_c$ can be estimated empirically by averaging the embeddings of all samples in class $c$. The prototype $\boldsymbol{\mu}_c$ is the $\ell_2$-normalized vector with unit length; that is, $\boldsymbol{\mu}_c = \tilde{\boldsymbol{\mu}}_c / \lVert \tilde{\boldsymbol{\mu}}_c \rVert_2$. Prototypes with larger pairwise angular distances are desirable for OOD detection, as we will show later in Section 4.2. The importance of inter-class dispersion can also be seen in Figure 1, where samples in the fox class (OOD) are semantically close to cat (ID) and dog (ID). A larger angular distance (i.e. smaller cosine similarity) between ID classes cat and dog in the hyperspherical space improves the separability from fox, and allows for more effective detection.
Intra-class compactness. The embeddings of samples in the same class should ideally form a compact cluster on the hypersphere. The compactness can be measured by the average cosine similarity between each feature embedding and its corresponding class prototype. We define the compactness (e.g. of the training set) as follows:

$$\text{Compactness} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{z}_i^{\top} \boldsymbol{\mu}_{y_i}, \tag{2}$$

where $\mathbf{z}_i$ is the normalized embedding of sample $\mathbf{x}_i$ and $\boldsymbol{\mu}_{y_i}$ is the prototype of its class $y_i$, for all $i \in \{1, \dots, n\}$. We proceed by introducing our training objective, which encourages these two characteristics of the learned embeddings.
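The two measurements above can be computed directly from normalized embeddings. The sketch below (function names are ours) also converts the average cosine similarities to angular degrees, the more interpretable form we use when reporting results:

```python
import numpy as np

def dispersion(prototypes):
    """Average pairwise cosine similarity among class prototypes,
    converted to angular degrees (more degrees = more dispersed)."""
    C = prototypes.shape[0]
    sims = prototypes @ prototypes.T                      # (C, C) cosine similarities
    avg_sim = (sims.sum() - np.trace(sims)) / (C * (C - 1))  # exclude the diagonal
    return np.degrees(np.arccos(np.clip(avg_sim, -1.0, 1.0)))

def compactness(z, labels, prototypes):
    """Average cosine similarity between each normalized embedding and its
    class prototype, in degrees (fewer degrees = more compact clusters)."""
    avg_sim = np.mean(np.einsum('id,id->i', z, prototypes[labels]))
    return np.degrees(np.arccos(np.clip(avg_sim, -1.0, 1.0)))
```

For example, three mutually orthogonal prototypes yield a dispersion of 90 degrees, while embeddings sitting exactly on their prototypes yield a compactness of 0 degrees.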
3.2 CIDER: Compactness and Dispersion Regularized Learning
Training objective. To facilitate both inter-class dispersion and intra-class compactness for improving OOD detection, we introduce a novel training objective termed CIDER (compactness and dispersion regularized learning):

$$\mathcal{L}_{\text{CIDER}} = \mathcal{L}_{\text{dis}} + \lambda_c \mathcal{L}_{\text{comp}}.$$
CIDER consists of two losses: a dispersion loss that promotes large angular distances among different class prototypes, along with a compactness loss that encourages samples to be aligned with their class prototypes. Specifically, the two terms are defined as follows:

$$\mathcal{L}_{\text{dis}} = \frac{1}{C} \sum_{i=1}^{C} \log \frac{1}{C-1} \sum_{\substack{j=1 \\ j \ne i}}^{C} e^{\boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_j / \tau},$$

$$\mathcal{L}_{\text{comp}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\mathbf{z}_i^{\top} \boldsymbol{\mu}_{c(i)} / \tau\right)}{\sum_{j=1}^{C} \exp\left(\mathbf{z}_i^{\top} \boldsymbol{\mu}_j / \tau\right)},$$

where $B$ is the batch size, $C$ is the total number of ID classes, $c(i)$ denotes the class index of sample $\mathbf{x}_i$, and $\tau$ is the temperature parameter.
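The two loss terms can be sketched in numpy as follows. This is a minimal illustration (the function name and the `lambda_c` default are ours); a practical implementation would compute the same quantities inside an autodiff framework:

```python
import numpy as np

def cider_losses(z, labels, prototypes, tau=0.1, lambda_c=0.5):
    """Sketch of the CIDER objective.

    z:          (B, d) L2-normalized sample embeddings
    labels:     (B,)   class indices c(i)
    prototypes: (C, d) L2-normalized class prototypes
    """
    B, C = z.shape[0], prototypes.shape[0]
    # Compactness: cross-entropy over sample-to-prototype cosine similarities.
    logits = z @ prototypes.T / tau                       # (B, C)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_comp = -log_probs[np.arange(B), labels].mean()
    # Dispersion: penalize large pairwise prototype similarities.
    sim = prototypes @ prototypes.T / tau                 # (C, C)
    off_diag = ~np.eye(C, dtype=bool)
    l_dis = np.mean([np.log(np.exp(sim[i][off_diag[i]]).mean()) for i in range(C)])
    return l_dis + lambda_c * l_comp, l_dis, l_comp
```

Note that the dispersion term depends only on the prototypes, so well-separated prototypes (e.g. mutually orthogonal ones) yield a smaller dispersion loss than clustered ones.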
Prototype estimation and update. During training, an important step is to estimate the class prototype $\boldsymbol{\mu}_c$ for each class $c$. One canonical way to estimate the prototypes is to use the mean vector of all training samples of each class and update it frequently during training. Despite its simplicity, this method incurs a heavy computational toll and causes undesirable training latency. Instead, we update the class-conditional prototypes in an exponential-moving-average (EMA) manner:

$$\boldsymbol{\mu}_c := \text{Normalize}\left(\alpha \boldsymbol{\mu}_c + (1 - \alpha)\, \mathbf{z}\right), \quad \forall c \in \{1, 2, \dots, C\}, \tag{5}$$

where the prototype $\boldsymbol{\mu}_c$ for class $c$ is updated during training as the moving average of all embeddings with label $c$, and $\mathbf{z}$ denotes the normalized embedding of a sample of class $c$. We ablate the effect of the prototype update factor $\alpha$ in Section 4.3.
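The EMA update can be sketched as a per-sample numpy routine (function name is ours; `alpha` is the moving-average discount factor):

```python
import numpy as np

def ema_update_prototypes(prototypes, z, labels, alpha=0.95):
    """EMA prototype update: blend each class prototype with the incoming
    normalized embeddings of that class, then re-normalize to the hypersphere."""
    protos = prototypes.copy()
    for zi, c in zip(z, labels):
        p = alpha * protos[c] + (1.0 - alpha) * zi   # moving average
        protos[c] = p / np.linalg.norm(p)            # project back onto the unit sphere
    return protos
```

Because only a blend-and-normalize step is applied per batch, the update is far cheaper than recomputing class means over the full training set.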
Remark 1: CIDER vs. SSD+. A recent work, SSD+, directly uses the supervised contrastive loss (i.e. SupCon) for OOD detection. However, prior works do not dive deeper into how such representations impact OOD detection. As we will show in Section 4.2, SupCon produces embeddings that suffice for classifying ID samples but lack the inter-class dispersion needed for OOD detection. Moreover, it can be shown that SupCon leads to a uniform hyperspherical distribution only when the number of negative samples goes to infinity, which in practice requires a large batch size and incurs a significant computational burden. In contrast, CIDER enforces inter-class dispersion by explicitly maximizing the angular distances among different ID class prototypes. We show that CIDER achieves strong inter-class dispersion and OOD detection performance with moderate batch sizes (e.g. 256).
Remark 2: CIDER vs. proxy-based methods. Our work also bears significant differences w.r.t. prior proxy-based metric learning methods. (1) Our primary task is OOD detection, whereas deep metric learning is commonly used for face verification and image retrieval tasks; (2) prior methods such as ProxyAnchor lack explicit prototype-to-prototype dispersion, which we show is essential for improving OOD detection; (3) ProxyAnchor initializes the proxies randomly and updates them through gradients, while we estimate prototypes directly from sample embeddings using EMA. We provide experimental comparisons in the next section.
4.1 Common Setup
Datasets and training details. Following the common benchmarks in the literature, we consider CIFAR-10 and CIFAR-100 as in-distribution datasets. For OOD test datasets, we use a suite of natural image datasets including SVHN, Places365, Textures, LSUN, and iSUN. In our main experiments, we use ResNet-18 as the backbone for CIFAR-10 and ResNet-34 for CIFAR-100. We train the model using stochastic gradient descent with momentum 0.9 and weight decay. To demonstrate the simplicity and effectiveness of CIDER, we adopt the same hyperparameters as in SSD+ with the SupCon loss: the initial learning rate (with cosine scheduling), the batch size, and the number of training epochs. We choose the default weight $\lambda_c$ so that the values of the different loss terms are similar upon model initialization. Following the literature, we use an embedding dimension of 128 for the projection head. The temperature $\tau$ and the prototype update factor $\alpha$ are set to fixed default values. We adjust the batch size, temperature, loss weight, prototype update factor, training time, and model architecture in our ablation study (Section 4.3). We report the ID classification results for SSD+ and CIDER following the common linear evaluation protocol, where a linear classifier is trained on top of the penultimate-layer features. More details are provided in Appendix B. Code and data will be released publicly for reproducible research.
OOD detection scoring function. During test time, we employ a distance-based method for OOD detection. An input is considered OOD if it is relatively far from the ID data in the embedding space. Under the same OOD detection scoring function, we will show the effect of representation quality with different training methods (Section 4.2). In particular, we consider the commonly used maximum Mahalanobis distance as in [22, 41, 51]:

$$S(\mathbf{x}) = \max_{c \in [C]} \; -\left(\mathbf{z}_{\mathbf{x}} - \hat{\boldsymbol{\mu}}_c\right)^{\top} \hat{\Sigma}^{-1} \left(\mathbf{z}_{\mathbf{x}} - \hat{\boldsymbol{\mu}}_c\right), \tag{6}$$

where $\mathbf{z}_{\mathbf{x}}$ is the embedding of test sample $\mathbf{x}$, $\hat{\boldsymbol{\mu}}_c$ is the estimated class centroid for class $c$, and $\hat{\Sigma}$ is the estimated covariance matrix for ID data:

$$\hat{\boldsymbol{\mu}}_c = \frac{1}{n_c} \sum_{i: y_i = c} \mathbf{z}_i, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{c=1}^{C} \sum_{i: y_i = c} \left(\mathbf{z}_i - \hat{\boldsymbol{\mu}}_c\right)\left(\mathbf{z}_i - \hat{\boldsymbol{\mu}}_c\right)^{\top}.$$
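A numpy sketch of this scoring function under a shared (tied) covariance estimate (function name is ours):

```python
import numpy as np

def mahalanobis_score(z_test, train_feats, train_labels):
    """Maximum-Mahalanobis OOD score: negative squared Mahalanobis distance
    to the closest class centroid. Higher score => more ID-like."""
    classes = np.unique(train_labels)
    # Per-class centroids, then a single covariance pooled over all classes.
    mus = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    centered = train_feats - mus[np.searchsorted(classes, train_labels)]
    cov = centered.T @ centered / len(train_feats)
    cov_inv = np.linalg.pinv(cov)                    # pseudo-inverse for stability
    diffs = z_test[None, :] - mus                    # (C, d)
    dists = np.einsum('cd,de,ce->c', diffs, cov_inv, diffs)
    return -dists.min()
```

A test input near an ID class centroid receives a high score, while one far from every centroid receives a low score, matching the thresholding rule of Section 2.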
4.2 Main Results and Discussions
[Table 1 appears here: FPR95 and AUROC on CIFAR-100 for methods trained without contrastive learning and with contrastive learning (including CE + SimCLR); see the discussion below.]
CIDER achieves SOTA results on OOD detection with high ID accuracy. Table 1 contains a wide range of competitive methods for OOD detection. All methods are trained on ResNet-34 using CIFAR-100, without assuming access to auxiliary outlier datasets during training. For clarity, we divide the methods into two categories: trained with and without contrastive losses. For pre-trained model-based scores such as MSP, ODIN, Mahalanobis, and Energy, the model is trained with the softmax cross-entropy (CE) loss by convention. GODIN is trained using the DeConf-C loss, which does not involve a contrastive loss either. For methods involving contrastive losses, we use the same network structure and embedding dimension, while only varying the training objective. The maximum Mahalanobis distance (Eq. 6) is used for OOD detection in SSD+, CE+SimCLR, ProxyAnchor, and CIDER.
As shown in Table 1, OOD detection performance is significantly improved with CIDER. Three trends can be observed: (1) compared to the current best method SSD+, CIDER explicitly optimizes for inter-class dispersion, which is beneficial for OOD detection. Under the same training settings as SSD+, CIDER (with default parameters) reduces FPR95 owing to more desirable embeddings for OOD detection; on some OOD test sets, CIDER outperforms SSD+ by as much as 22.15% in FPR95. (2) While CSI relies on sophisticated data augmentations and test-time ensembling, CIDER only uses the default data augmentations and is thus simpler; performance-wise, CIDER also reduces the average FPR95 compared to CSI. (3) Lastly, as a result of the improved embedding quality, CIDER improves the ID accuracy compared to both SSD+ with the SupCon loss and training with the CE loss. We provide results on the less challenging task (CIFAR-10 as ID) in Appendix D, where CIDER's strong performance remains.
CIDER learns distinguishable representations. We visualize the learned feature embeddings in Figure 3 using UMAP , where the colors encode different class labels. A salient observation is that embeddings obtained with CIDER (2(b)) enjoy much better compactness compared to embeddings trained with the CE loss (2(a)). Moreover, the classes are distributed more uniformly in the space, highlighting the efficacy of the dispersion loss.
CIDER improves inter-class dispersion and intra-class compactness. Beyond visualization, we also quantitatively measure the intra-class compactness by Eq. 2 and the inter-class dispersion by Eq. 1. To make the measurements more interpretable, we convert cosine similarities to angular degrees. Hence, a higher inter-class dispersion (in degrees) indicates more separability among class prototypes, which is desirable. Similarly, lower intra-class compactness (in degrees) is better. The results are shown in Table 2 based on the CIFAR-10 test set. Compared to SSD+ (with SupCon loss), CIDER significantly improves the inter-class dispersion by 12.03 degrees. Different from SupCon, CIDER explicitly optimizes the inter-class dispersion, which especially benefits OOD detection.
[Table 2 appears here: for each training loss, inter-class dispersion (in degrees), intra-class compactness (in degrees), and ID-OOD separability (in degrees) on CIFAR-100, LSUN, iSUN, Texture, and SVHN, with the average.]
CIDER improves ID-OOD separability. Next, we quantitatively measure how the feature embedding quality affects the ID-OOD separability. We introduce a separability score, which measures on average how close the embedding of a sample from the OOD test set is to the closest ID class prototype, compared to that of an ID sample. The traditional notion of "OOD being far away from ID classes" is now translated to "OOD being somewhere between ID clusters on the hypersphere". A higher separability score indicates that the OOD test set is easier to detect. Formally, we define the separability measurement as:

$$\text{Separability} = \frac{1}{|\mathcal{D}_{\text{ID}}^{\text{test}}|} \sum_{\mathbf{x} \in \mathcal{D}_{\text{ID}}^{\text{test}}} \max_{j} \; \mathbf{z}_{\mathbf{x}}^{\top} \boldsymbol{\mu}_j \;-\; \frac{1}{|\mathcal{D}_{\text{OOD}}^{\text{test}}|} \sum_{\mathbf{x}' \in \mathcal{D}_{\text{OOD}}^{\text{test}}} \max_{j} \; \mathbf{z}_{\mathbf{x}'}^{\top} \boldsymbol{\mu}_j,$$

where $\mathcal{D}_{\text{OOD}}^{\text{test}}$ is the OOD test dataset, $\mathcal{D}_{\text{ID}}^{\text{test}}$ is the ID test dataset, and $\mathbf{z}_{\mathbf{x}}$ denotes the normalized embedding of sample $\mathbf{x}$. Table 2 shows that our method leads to higher separability and consequently superior OOD detection performance (c.f. Table 1). Averaging across 5 OOD test datasets, our method displays a relative improvement in ID-OOD separability compared to SupCon. This further verifies the effectiveness of the compactness and dispersion losses for improving OOD detection.
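The separability measurement described above can be sketched directly in numpy (function name is ours): the mean maximum cosine similarity to any class prototype over ID test samples, minus the same quantity over OOD test samples.

```python
import numpy as np

def separability(z_id, z_ood, prototypes):
    """ID-OOD separability: how much closer ID test embeddings are to
    their nearest class prototype than OOD test embeddings are.
    Higher value => the OOD set is easier to detect."""
    mean_max_sim = lambda z: (z @ prototypes.T).max(axis=1).mean()
    return mean_max_sim(z_id) - mean_max_sim(z_ood)
```

For instance, when ID embeddings sit on their prototypes while OOD embeddings fall between ID clusters, the score is strictly positive.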
CIDER benefits hard OOD detection. OOD samples that are semantically similar to ID samples are particularly challenging for OOD detection algorithms. As a concrete example, in Figure 1, the fox can be viewed as a hard OOD w.r.t. the cat and dog classes (ID). Larger dispersion (in degrees) between the two ID classes improves the ID-OOD separability from fox. In the literature, CIFAR-10 (ID) vs. CIFAR-100 (OOD) and vice versa are often used as hard OOD detection tasks. As shown in Table 2 and Figure 2(c), when CIFAR-10 is used as ID, the embedding of CIFAR-100 is barely separable from CIFAR-10 under the cross-entropy loss (c.f. the ID-OOD separability is only 7.1 degrees). In contrast, CIDER displays a large ID-OOD separability of 31.4 degrees, as shown in Figure 2(d).

To measure the detection performance, we contrast different training loss functions while keeping the test-time OOD detection score to be the maximum Mahalanobis distance for all. As shown in Table 3, CIDER improves the CE baseline by 17.99% in AUROC on CIFAR-10 (ID) vs. CIFAR-100 (OOD). For the more challenging case where CIFAR-100 is used as ID, CIDER improves AUROC by 14.18% over the CE baseline. This further highlights the importance of inter-class dispersion via CIDER for hard OOD detection.
4.3 Ablation Studies
In this section, we provide ablation results on how different factors impact the performance of CIDER. For consistency, we present the analyses below based on CIFAR-100. Similar trends also hold for the less challenging task (CIFAR-10). Note that when ablating on one factor, we keep the other hyperparameters the same as default (c.f. Section 4.1).
Inter-class dispersion is critical for OOD detection. Here we examine the effects of the loss components on OOD detection. As shown in Table 4, we have the following observations: (1) for ID classification, training with $\mathcal{L}_{\text{comp}}$ alone leads to an accuracy similar to that of SupCon. This suggests that promoting intra-class compactness and a moderate level of inter-class dispersion (as a result of the sample-to-prototype negative pairs in $\mathcal{L}_{\text{comp}}$) are sufficient to discriminate different ID classes; (2) for OOD detection, further inter-class dispersion is beneficial, which is explicitly encouraged through the dispersion loss $\mathcal{L}_{\text{dis}}$. As a result, adding $\mathcal{L}_{\text{dis}}$ improves the average AUROC. However, promoting inter-class dispersion via $\mathcal{L}_{\text{dis}}$ alone, without $\mathcal{L}_{\text{comp}}$, is sufficient for neither ID classification nor OOD detection. Our ablation suggests that $\mathcal{L}_{\text{dis}}$ and $\mathcal{L}_{\text{comp}}$ work synergistically to produce hyperspherical embeddings that are desirable for both ID classification and OOD detection.
[Table 4 appears here: AUROC and ID accuracy for each combination of loss components.]
Ablation on the loss weight. In the main results (Table 1), we demonstrate the effectiveness of CIDER where the loss weight $\lambda_c$ is simply set to balance the initial scale between $\mathcal{L}_{\text{dis}}$ and $\mathcal{L}_{\text{comp}}$. In fact, CIDER can be further improved by adjusting $\lambda_c$. As shown in Figure 3(a), the performance of CIDER is relatively stable under moderate adjustments of $\lambda_c$ (e.g. 0.5 to 2), with the best FPR95 attained within this range. This indicates that CIDER provides a simple and effective solution for improving OOD detection performance, without much need for hyperparameter tuning on the loss scale.
Ablation on the learning rate. Prior works [17, 41] use a relatively large default initial learning rate (lr) to train contrastive losses, which is also the default setting of CIDER. We further investigate the impact of the initial learning rate on OOD detection. As shown in Figure 3(b), a relatively high initial lr (e.g. 0.4-0.5) is indeed desirable for competitive performance. Too small an lr (e.g. 0.1) leads to performance degradation.
Adjusting the prototype update factor improves CIDER. We show in Figure 3(c) the performance obtained by varying the moving-average discount factor $\alpha$ in Eq. 5. We observe that the detection performance (averaged over 5 test sets) remains competitive across a wide range of $\alpha$; our main results are based on the default $\alpha$ without tuning. In particular, a moderate $\alpha$ results in the best FPR95.
Small temperature leads to better performance. Figure 3(d) demonstrates the detection performance as we vary the temperature parameter $\tau$. We observe that the OOD detection performance is desirable at a relatively small temperature. Complementary to our finding, a relatively small temperature has been shown to be desirable for ID classification [17, 48], as it penalizes hard negative samples with larger gradients and leads to more separable features.
CIDER is effective under various batch sizes. Figure 4(a) shows that CIDER remains competitive under different batch-size configurations compared to SupCon. To explain this, note that the standard SupCon loss requires instance-to-instance distance measurements, whereas the compactness loss reduces the complexity to instance-to-prototype. The class-conditional prototypes are updated during training; they capture the average statistics of each class and alleviate the dependency on the batch size. This leads to an overall memory-efficient solution for OOD detection.
[Table 5 appears here: AUROC and ID accuracy when CIDER is jointly trained with the cross-entropy loss.]
The potential of using CIDER as a regularizer. Following convention, the ID classification results in the main paper are obtained via the linear evaluation scheme. We further investigate the benefits of the explicit geometric regularization in CIDER by jointly training it with the cross-entropy loss $\mathcal{L}_{\text{CE}}$ (dubbed Triple). We realize $\mathcal{L}_{\text{CE}}$ by adding a linear classifier on top of the penultimate layer. Comparing the first and second rows in Table 5, jointly training the cross-entropy loss with CIDER improves the OOD detection performance significantly, while the ID accuracy is maintained. This further highlights the simplicity of CIDER as a regularizer that facilitates desirable geometry in the representation space. We also investigate the effects of adding CIDER to further regularize SupCon in Appendix C.
Ablation on network capacity and architecture. Lastly, we verify the effectiveness of CIDER under networks with higher capacity such as ResNet-50. The results are shown in Figure 4(b). The trend is similar to what we observed with ResNet-34. Specifically, as a result of the improved representation, training with CIDER improves the FPR95 for various test sets compared to training with the SupCon loss. We provide ablations on different network architectures such as DenseNet-101 in Appendix D.2.
5 Related Works
Out-of-distribution detection. The majority of works in the OOD detection literature focus on the supervised setting, where the goal is to derive a binary ID-OOD classifier along with a classification model for the in-distribution data. Compared to generative model-based methods [20, 32, 38, 42, 53], OOD detection based on supervised discriminative models typically yields more competitive performance. Among methods based on deep neural networks, most derive confidence scores either from the output [2, 10, 12, 14, 16, 25, 26, 43], gradient information, or the feature embeddings [22, 40, 41, 44, 51]. Our method can be categorized as a distance-based OOD detection method that exploits the hyperspherical embedding space. While some works assume access to auxiliary outlier datasets during training [3, 13, 30, 36], our method does not rely on any external information and is hence more generally applicable and flexible.
Contrastive representation learning. Contrastive representation learning [7, 46] aims to learn a discriminative embedding where positive samples are aligned while negative ones are dispersed. It has demonstrated remarkable success for visual representation learning in unsupervised [4, 6, 11, 39], semi-supervised, and supervised settings. Recently, Li et al. proposed a prototype-based contrastive loss for unsupervised learning where prototypes are generated via a clustering algorithm, while our method is supervised, with prototypes updated based on labels. Li et al. incorporate a prototype-based loss to tackle data noise. Wang and Isola analyze the asymptotic behavior of contrastive losses theoretically, while Wang and Liu empirically investigate properties of contrastive losses for classification. However, none of these works focus on OOD detection. We aim to fill the gap and facilitate the understanding of contrastive losses for OOD detection in supervised learning.
Contrastive learning for out-of-distribution detection. Self-supervised learning has been shown to improve OOD detection. Prior works [41, 51] verify the effectiveness of directly applying the off-the-shelf multi-view contrastive losses such as SupCon and SimCLR for OOD detection. However, these training objectives are primarily designed for ID classification, lacking ID-OOD separability consideration in design. Similarly, CSI  investigates the type of data augmentations that are particularly beneficial for OOD detection. Different from all prior works, we take the perspective from the embedding and propose a new learning framework that explicitly encourages the desirable properties for OOD detection, and thus alleviates the dependence on specific data augmentations or self-supervision. We provide comparisons with all these approaches in Section 4.
Deep metric learning. Learning a desirable embedding with neural networks has been a fundamental goal in the deep metric learning community. Various losses have been proposed for applications such as face verification [9, 27, 49], person re-identification [5, 52], and image retrieval [19, 31, 35, 45]. However, none of these works focus on desirable embeddings for OOD detection. Compared to other proxy-based metric learning methods, CIDER bears several key differences: (1) prior methods such as ProxyAnchor lack explicit prototype-to-prototype dispersion, which we show is essential for improving OOD detection; (2) ProxyAnchor initializes the proxies randomly and updates them through gradients, while we estimate prototypes directly from sample embeddings using EMA.
6 Conclusion and Outlook
In this work, we propose CIDER, a novel representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes the compactness and dispersion losses to promote strong ID-OOD separability. We show that CIDER achieves state-of-the-art performance on common OOD benchmarks, including hard OOD detection tasks. Moreover, we introduce new measurements to quantify the hyperspherical embedding, and establish the relationship with OOD detection performance. We conduct extensive ablations to understand the efficacy and behavior of CIDER under various settings and hyperparameters. We hope our work can inspire future methods of exploiting hyperspherical representations for OOD detection.
References

- Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, pp. 8443–8452.
- (2016) Towards open set deep networks. In CVPR.
- (2021) ATOM: robustifying out-of-distribution detection using outlier mining. In ECML PKDD.
- (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, pp. 403–412.
- (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
- (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Vol. 1, pp. 539–546.
- (2014) Describing textures in the wild. In CVPR, pp. 3606–3613.
- ArcFace: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699.
- (2018) Learning confidence for out-of-distribution detection in neural networks.
- (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735.
- (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR.
- Deep anomaly detection with outlier exposure. In ICLR.
- (2020) Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. In CVPR, pp. 10951–10960.
- (2021) On the importance of gradients for detecting distributional shifts in the wild. In NeurIPS.
- (2021) Towards scaling out-of-distribution detection for large semantic space. In CVPR.
- (2020) Supervised contrastive learning. In NeurIPS, Vol. 33, pp. 18661–18673.
- (2020) Proxy anchor loss for deep metric learning. In CVPR.
- (2019) Deep metric learning beyond binary supervision. In CVPR, pp. 2288–2297.
- (2020) Why normalizing flows fail to detect out-of-distribution data. In NeurIPS, Vol. 33.
- (2009) Learning multiple layers of features from tiny images.
- (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, pp. 7167–7177.
- (2021) Learning from noisy data with robust representation learning. In ICCV, pp. 9485–9494.
- (2021) Prototypical contrastive learning of unsupervised representations. In ICLR.
- (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR.
- (2020) Energy-based out-of-distribution detection. In NeurIPS.
- (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR.
- (2018) Exploring the limits of weakly supervised pretraining. In ECCV, pp. 181–196.
- UMAP: uniform manifold approximation and projection. The Journal of Open Source Software, 3(29), pp. 861.
- Self-supervised learning for generalizable out-of-distribution detection. In AAAI, Vol. 34, pp. 5216–5223.
- (2017) No fuss distance metric learning using proxies. In ICCV, pp. 360–368.
- (2019) Do deep generative models know what they don't know? In ICLR.
-  (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1.
-  (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §1.
-  (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4004–4012. Cited by: §5.
-  (2021) Outlier exposure with confidence control for out-of-distribution detection. Neurocomputing 441, pp. 138–150. Cited by: §5.
-  (2021) A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022. Cited by: Appendix B.
-  (2019) Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pp. 14680–14691. Cited by: §5.
-  (2021) Contrastive learning with hard negative samples. In International Conference on Learning Representations, Cited by: §5.
-  (2020) Detecting out-of-distribution examples with Gram matrices. In Proceedings of the 37th International Conference on Machine Learning, Cited by: §5.
SSD: a unified framework for self-supervised outlier detection. In International Conference on Learning Representations, Cited by: Appendix B, Appendix B, §D.1, §D.2, Table 6, Table 7, §1, §3.2, §3, §4.1, §4.1, §4.2, §4.3, Table 1, Table 3, §5, §5.
-  (2020) Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations, Cited by: §5.
-  (2021) ReAct: out-of-distribution detection with rectified activations. In Advances in Neural Information Processing Systems, Cited by: §5.
CSI: novelty detection via contrastive learning on distributionally shifted instances. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Table 6, §1, §4.2, Table 1, Table 3, §5, §5.
-  (2020) Proxynca++: revisiting and revitalizing proxy neighborhood component analysis. In European Conference on Computer Vision, pp. 448–464. Cited by: §5.
-  (2019) Representation learning with contrastive predictive coding. External Links: Cited by: §5.
-  (2021) Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2495–2504. Cited by: §5.
-  (2021) Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2495–2504. Cited by: §4.3.
-  (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274. Cited by: §5.
-  (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. Cited by: §3.2, §3, §5.
-  (2020) Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566. Cited by: Appendix B, Appendix B, §D.1, Table 6, Table 7, §1, §4.1, §4.2, §4.2, Table 1, Table 3, §5, §5.
-  (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3415–3424. Cited by: §5.
-  (2020) Likelihood regret: an out-of-distribution detection score for variational auto-encoder. Advances in Neural Information Processing Systems 33. Cited by: §5.
-  (2020) Distance-based learning from errors for confidence calibration. In International Conference on Learning Representations, Cited by: §1.
-  (2015) Turkergaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755. Cited by: §4.1.
Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.1.
Places: a 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464. Cited by: §4.1.
Appendix A CIDER Training Scheme
The first stage of the training scheme of our compactness and dispersion regularized (CIDER) learning framework is shown in Algorithm 1. We jointly optimize (1) a compactness loss that encourages samples to be close to their class prototypes, and (2) a dispersion loss that encourages large angular distances among different class prototypes. The second stage of training (for ID classification) follows convention: we train a linear classifier on the penultimate-layer features extracted from the first stage.
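As a concrete illustration, the two losses and the EMA prototype update can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; the exact loss forms, the temperature `tau`, and the EMA factor `alpha` here are illustrative assumptions.

```python
import numpy as np

def normalize(v, axis=-1):
    """Project vectors onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def compactness_loss(z, prototypes, labels, tau=0.1):
    """Pull each normalized embedding z_i toward its own class prototype:
    cross-entropy over cosine similarities to all prototypes."""
    logits = z @ prototypes.T / tau              # (N, C) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def dispersion_loss(prototypes, tau=0.1):
    """Push different class prototypes apart: penalize large pairwise
    similarity among distinct prototypes (soft maximum via log-mean-exp)."""
    C = prototypes.shape[0]
    sim = prototypes @ prototypes.T / tau
    off_diag = sim[~np.eye(C, dtype=bool)].reshape(C, C - 1)
    return np.log(np.exp(off_diag).mean(axis=1)).mean()

def ema_update(prototypes, z, labels, alpha=0.95):
    """Update each class prototype as an exponential moving average of the
    embeddings assigned to that class, re-normalizing after each step."""
    for zi, yi in zip(z, labels):
        prototypes[yi] = normalize(alpha * prototypes[yi] + (1 - alpha) * zi)
    return prototypes
```

In training, the two terms would be combined into a weighted sum (e.g. the compactness loss plus a weighted dispersion loss), with the weight a tunable hyperparameter.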
Appendix B Experimental Details
Software and hardware.
All methods are implemented in PyTorch 1.10. We run all experiments on NVIDIA GeForce RTX 2080 Ti GPUs for small-to-medium batch sizes, and on NVIDIA A100 GPUs for large batch sizes and larger network encoders.
Architecture. As shown in Figure 2, the overall architecture of CIDER consists of a projection head on top of a deep neural network encoder. Following common practice, and for fair comparison with prior works, we fix the output dimension of the projection head to 128. We use a linear projection head for the simpler CIFAR-10 task and a two-layer non-linear projection head for the more complex CIFAR-100 task.
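For concreteness, the two head variants can be sketched as follows. This is a NumPy sketch of the forward pass only; the 512-dim input matches a typical ResNet-18 penultimate layer, and the hidden width of the non-linear head is an assumption, not the paper's configuration.

```python
import numpy as np

def l2_normalize(z):
    """L2-normalize each row so embeddings live on the unit hypersphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def linear_head(feat, W):
    """Linear projection head (used for CIFAR-10): single matrix, then normalize."""
    return l2_normalize(feat @ W)

def mlp_head(feat, W1, W2):
    """Two-layer non-linear head (used for CIFAR-100): Linear -> ReLU -> Linear."""
    h = np.maximum(feat @ W1, 0.0)
    return l2_normalize(h @ W2)

# Shapes: penultimate features (N, 512) -> 128-dim embeddings.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 512))
z_lin = linear_head(feat, rng.standard_normal((512, 128)) * 0.01)
z_mlp = mlp_head(feat,
                 rng.standard_normal((512, 512)) * 0.01,   # hidden width assumed
                 rng.standard_normal((512, 128)) * 0.01)
```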
Training. For methods based on pre-trained models such as MSP, ODIN, Mahalanobis, and Energy, we follow common practice and train with the cross-entropy loss for 200 epochs. For fair comparison, methods involving contrastive learning are trained for 500 epochs. For CIDER, we adopt the same key hyperparameters for the contrastive losses, such as the initial learning rate (0.5), temperature (0.1), and batch size (512), as SSD+ in the main experiments, to demonstrate the effectiveness and simplicity of our loss.
Evaluation. For each OOD test set, if it is larger than the ID test set, we randomly select a subset of the same size for fairness. At test time, we fix the OOD detection score throughout our study, which allows us to isolate the effect of representation quality under different training methods. For simplicity, we use the maximum Mahalanobis distance based on a single layer, with a shared covariance matrix estimated from the training set as in Eq. 6. Note that more sophisticated variants exist, such as estimating class-wise covariance matrices, or calculating Mahalanobis scores from multiple intermediate layers and training a logistic regression to find a good combination of the per-layer scores. A recent work proposed a Relative Mahalanobis score to improve near-OOD detection. We believe these variants of the Mahalanobis score would also benefit from improved representation quality, and leave this as future work.
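The single-layer scoring rule with shared covariance described above can be sketched as follows (a NumPy sketch of the general recipe, not the paper's code; Eq. 6's exact notation is in the main text):

```python
import numpy as np

def fit_mahalanobis(train_feats, train_labels, num_classes):
    """Estimate per-class feature means and a single covariance matrix
    shared across all classes, from ID training features."""
    means = np.stack([train_feats[train_labels == c].mean(axis=0)
                      for c in range(num_classes)])
    centered = train_feats - means[train_labels]
    cov = centered.T @ centered / len(train_feats)
    return means, np.linalg.pinv(cov)          # return precision matrix

def mahalanobis_score(z, means, prec):
    """OOD score: negative of the minimum squared Mahalanobis distance from
    the test feature to any class mean. Higher score -> more ID-like."""
    diffs = z[None, :] - means                  # (C, D)
    d2 = np.einsum('cd,de,ce->c', diffs, prec, diffs)
    return -d2.min()
```

At test time, the score is compared against a threshold (chosen, e.g., so that 95% of ID samples are accepted); samples scoring below it are flagged as OOD.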
Appendix C The Potential of Using CIDER to Regularize SupCon
In Section 4.3, we demonstrated the effectiveness of embedding regularization via CIDER by jointly training CIDER with the cross-entropy loss. In this section, we provide another set of experiments investigating whether CIDER also facilitates better inter-class dispersion when jointly trained with SupCon. Specifically, we use SupCon+CIDER to denote the joint loss. As shown in Figures 5(a) and 5(b), compared to training with the SupCon loss alone, adding CIDER improves OOD detection performance across different batch-size configurations. However, under moderate to large batch sizes (e.g. 512 to 1024), training with CIDER alone still yields the best performance.
Appendix D Results on CIFAR-10
D.1 Main Results
In the main paper, we mainly focus on the more challenging CIFAR-100 task. In this section, we additionally evaluate on CIFAR-10, a commonly used benchmark in the literature. For methods involving contrastive losses, we use the same network encoder and embedding dimension, varying only the training objective. The Mahalanobis distance (Eq. 6) is used for OOD detection in SSD+, CE+SimCLR, SupCon, and CIDER. The results are shown in Table 6. Similar trends hold as described in Section 4.2: (1) CIDER achieves state-of-the-art OOD detection performance on CIFAR-10 as a result of better inter-class dispersion and intra-class compactness; for example, using the same Mahalanobis score for detection, training with CIDER reduces the FPR95 compared to training with the cross-entropy loss, averaged over 5 diverse test sets. (2) Although the ID classification accuracy of CIDER is similar to that of another proxy-based loss, ProxyAnchor, CIDER significantly improves OOD detection performance due to the addition of explicit inter-class dispersion, which we show is critical for OOD detection in Section 4.3. The significant improvements highlight the importance of representation learning for OOD detection. (3) CIDER also improves the ID classification accuracy compared to training with the cross-entropy loss.
Table 6 (excerpt, contrastive-learning group): FPR95↓ / AUROC↑ on five OOD test sets, their average, and ID accuracy (%).

| Method | OOD set 1 | OOD set 2 | OOD set 3 | OOD set 4 | OOD set 5 | Average | ID ACC |
|---|---|---|---|---|---|---|---|
| CE + SimCLR | 6.98 / 99.22 | 54.39 / 86.70 | 64.53 / 85.60 | 59.62 / 86.78 | 16.77 / 96.56 | 40.46 / 90.97 | 93.12 |
D.2 Ablation on Other Architectures
In this section, we further verify the generality of CIDER on an alternative popular network architecture, DenseNet-101. For fair comparison and for memory-efficiency considerations, we use a batch size of 128 for all methods during training. The results are shown in Table 7. The trends are similar to Table 1 (Section 4.2) with ResNet-34 and Table 6 with ResNet-18: (1) training with contrastive losses generally improves OOD detection performance compared to training with the softmax cross-entropy (CE) loss; (2) CIDER achieves state-of-the-art OOD detection performance; for example, compared to SSD+, CIDER further reduces the average FPR95; (3) compared to training with the CE loss, CIDER also improves the ID classification accuracy.
Table 7 (excerpt, contrastive-learning group): FPR95↓ / AUROC↑ on five OOD test sets, their average, and ID accuracy (%).

| Method | OOD set 1 | OOD set 2 | OOD set 3 | OOD set 4 | OOD set 5 | Average | ID ACC |
|---|---|---|---|---|---|---|---|
| CE + SimCLR | 12.19 / 97.57 | 81.50 / 69.35 | 22.19 / 96.19 | 18.42 / 96.70 | 28.95 / 90.83 | 32.65 / 90.13 | 94.66 |