
CIDER: Exploiting Hyperspherical Embeddings for Out-of-Distribution Detection

Out-of-distribution (OOD) detection is a critical task for reliable machine learning. Recent advances in representation learning give rise to developments in distance-based OOD detection, where testing samples are detected as OOD if they are relatively far away from the centroids or prototypes of in-distribution (ID) classes. However, prior methods directly take off-the-shelf loss functions that suffice for classifying ID samples, but are not optimally designed for OOD detection. In this paper, we propose CIDER, a simple and effective representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes two losses to promote strong ID-OOD separability: (1) a dispersion loss that promotes large angular distances among different class prototypes, and (2) a compactness loss that encourages samples to be close to their class prototypes. We show that CIDER is effective under various settings and establishes state-of-the-art performance. On a hard OOD detection task, CIFAR-100 vs. CIFAR-10, our method substantially improves the AUROC by 14.20% compared to the cross-entropy loss.


1 Introduction

When deploying machine learning models in the open world, it is important to ensure the reliability of the model in the presence of out-of-distribution (OOD) inputs—samples from an unknown distribution that the network has not been exposed to during training, and therefore should not be predicted with high confidence at test time. We desire models that are not only accurate when the input is drawn from the known distribution, but are also aware of the unknowns outside the training categories. This gives rise to the task of OOD detection, where the goal is to determine whether an input is in-distribution (ID) or not.

A plethora of OOD detection algorithms have been developed recently, among which distance-based methods demonstrated promise [54]. These approaches circumvent the shortcoming of using the model’s confidence score for OOD detection [12], which can be abnormally high on OOD samples [34] and hence not distinguishable from ID data. Distance-based methods leverage feature embeddings extracted from a model, and operate under the assumption that the test OOD samples are relatively far away from the centroids or prototypes of ID classes. For example, Lee et al. [22] used the maximum Mahalanobis distance from the test sample to all class centroids for OOD detection.

Figure 1: Illustration of desirable hyperspherical embeddings for OOD detection. The embeddings of images from the same class are clustered; each cluster is well separated from another on the hypersphere. OOD samples lie between clusters of ID samples.

Arguably, the efficacy of distance-based approaches can depend largely on the quality of feature embeddings. Prior works [41, 51] directly employ off-the-shelf contrastive losses for OOD detection. However, existing training objectives produce embeddings that suffice for classifying ID samples, but remain sub-optimal for OOD detection. Tack et al. [44] employ sophisticated data augmentations and ensembling in testing, which can be difficult to search and use in practice. It remains underexplored what properties of learned embeddings can benefit OOD detection, and how to design training methods to directly achieve them.

In this work, we bridge the gap by proposing CIDER, a Compactness and DispErsion Regularized learning framework designed for OOD detection. Our method is motivated by the desirable properties of embeddings for OOD detection. Intuitively, we desire embeddings where different classes are relatively far apart (i.e. high inter-class dispersion), and samples in each class form a compact cluster (i.e. high intra-class compactness). This is illustrated in Figure 1, where fox (OOD) can be viewed as a hard OOD example w.r.t cat (ID) and dog (ID). Larger angular distance between ID classes cat and dog in the hyperspherical space improves the separability from fox, and allows for more effective OOD detection. To formalize our idea, we introduce two losses: a new dispersion loss that promotes large angular distances among different class prototypes, along with a compactness loss that encourages samples to be close to their class prototypes. During training, the class-conditional prototypes are dynamically updated in an exponential-moving-average (EMA) manner.

We quantitatively measure how the feature embedding quality affects the downstream OOD detection performance. In particular, we show that inter-class dispersion is the key to stronger ID-OOD separability. Unlike prior works that directly use existing contrastive losses such as SupCon, CIDER explicitly encourages prototype-wise dispersion and establishes state-of-the-art OOD detection results with competitive ID classification accuracy. Our key results and contributions are summarized as follows:

  • We propose CIDER, a simple and effective representation learning framework for OOD detection. CIDER establishes state-of-the-art results and is easy to use. Compared to the current best method SSD+, CIDER reduces the average FPR95 by 22.5% on CIFAR-100.

  • We establish the unexplored relationship between OOD detection performance and the embedding quality in the hyperspherical space, and provide measurements based on the notion of compactness and dispersion. This allows future research to quantify the embedding in the hyperspherical space for effective OOD detection.

  • We conduct extensive ablations to understand the efficacy and behavior of our method under various settings and hyperparameters, including different loss components, batch size, temperature, regularization weights, and the model architecture. CIDER is effective under a wide range of hyperparameters.

2 Preliminaries

Notations. We consider multi-class classification, where $\mathcal{X}$ denotes the input space and $\mathcal{Y} = \{1, 2, \dots, C\}$ denotes the set of ID labels. The training set $\mathcal{D}^{\text{in}} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is drawn i.i.d. from the joint distribution $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$. Let $\mathcal{P}^{\text{in}}$ denote the marginal distribution over $\mathcal{X}$, which is called the in-distribution (ID).

Out-of-distribution detection. OOD detection can be viewed as a binary classification problem. At test time, the goal of OOD detection is to decide whether a sample $\mathbf{x} \in \mathcal{X}$ is from $\mathcal{P}^{\text{in}}$ (ID) or not (OOD). In practice, OOD is often defined by a distribution that simulates unknowns encountered during deployment, such as samples from an irrelevant distribution whose label set has no intersection with $\mathcal{Y}$ and therefore should not be predicted by the model. Mathematically, let $\mathcal{D}^{\text{ood}}_{\text{test}}$ denote an OOD test set whose label space satisfies $\mathcal{Y}^{\text{ood}} \cap \mathcal{Y} = \emptyset$. The decision can be made via a thresholding mechanism:

$G_{\lambda}(\mathbf{x}) = \begin{cases} \text{ID} & \text{if } S(\mathbf{x}) \ge \lambda \\ \text{OOD} & \text{if } S(\mathbf{x}) < \lambda, \end{cases}$

where $S(\cdot)$ is a scoring function and, by convention, samples with higher scores are classified as ID and vice versa. The threshold $\lambda$ is typically chosen so that a high fraction of ID data (e.g. 95%) is correctly classified [12].

Figure 2: Overview of our compactness and dispersion regularized (CIDER) learning framework for OOD detection. We jointly optimize two complementary terms to encourage desirable properties of the embedding space: (1) a dispersion loss to encourage larger angular distances among different class prototypes, and (2) a compactness loss to encourage samples to be close to their class prototypes.

3 Method

In this section, we start with an overview of the architecture. Next, we characterize and provide quantitative measures on the desirable properties of hyperspherical embeddings for OOD detection (Section 3.1). We then present our training objectives that are designed to promote the desirable properties (Section 3.2).

Architecture overview. As shown in Figure 2, the general architecture consists of two components: (1) a deep neural network encoder $f: \mathcal{X} \to \mathbb{R}^{d_e}$ that maps the augmented input $\tilde{\mathbf{x}}$ to a high-dimensional feature embedding $f(\tilde{\mathbf{x}})$ (often referred to as the penultimate layer); (2) a projection head $h: \mathbb{R}^{d_e} \to \mathbb{R}^{d}$ that maps the high-dimensional embedding to a lower-dimensional feature representation $\tilde{\mathbf{z}} := h(f(\tilde{\mathbf{x}}))$. Following convention [4, 41, 50], the loss is applied to the normalized feature embedding $\mathbf{z} := \tilde{\mathbf{z}} / \lVert \tilde{\mathbf{z}} \rVert_2$.

The normalized embeddings are also referred to as hyperspherical embeddings, since they are on a unit hypersphere as shown in Figure 1. Our goal is to shape the hyperspherical embedding space so that the learned embeddings should (1) be sufficiently discriminative for ID classification, and (2) effectively separate unseen OOD data from ID data.
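As a concrete reference, below is a minimal PyTorch sketch of this two-component design: an encoder followed by a projection head whose output is normalized onto the unit hypersphere. The backbone surgery, layer widths, and names are illustrative and may differ from the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class HypersphericalNet(nn.Module):
    """Encoder f followed by a projection head h; returns unit-norm embeddings z."""
    def __init__(self, feat_dim: int = 128, penultimate_dim: int = 512):
        super().__init__()
        backbone = resnet18()                                           # any CNN backbone works here
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])   # drop the final fc layer
        self.head = nn.Sequential(                                      # two-layer non-linear projection head
            nn.Linear(penultimate_dim, penultimate_dim),
            nn.ReLU(inplace=True),
            nn.Linear(penultimate_dim, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(x).flatten(1)            # penultimate-layer features f(x)
        z_tilde = self.head(feat)                    # lower-dimensional representation h(f(x))
        return F.normalize(z_tilde, dim=1)           # project onto the unit hypersphere

model = HypersphericalNet()
z = model(torch.randn(4, 3, 32, 32))                 # each row of z has unit l2 norm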

3.1 Characterizing Hyperspherical Embeddings for OOD Detection

We start by characterizing the desirable properties of hyperspherical embeddings for OOD detection, which inspire the design of an effective training objective, as we will show in Section 3.2. Intuitively, we desire embeddings where samples in each class form a compact cluster on the hypersphere, and different classes are relatively far apart. This is beneficial for distance-based OOD detection, where OOD samples are hypothesized to be relatively far from in-distribution classes.

To formalize our idea, we introduce two measurements: inter-class dispersion and intra-class compactness. We will show in Section 4 that these are key properties to determine the OOD detection performance.

Inter-class dispersion. We denote by $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_C$ the prototype embeddings for the $C$ ID classes. The prototype for each sample is assigned based on the ground-truth class label. The extent of inter-class dispersion can be measured by the average cosine similarity among pair-wise class prototypes:

$\text{Dispersion}(\boldsymbol{\mu}) := \frac{1}{C}\sum_{i=1}^{C}\frac{1}{C-1}\sum_{j=1, j \neq i}^{C}\boldsymbol{\mu}_i^{\top}\boldsymbol{\mu}_j.$    (1)

The unnormalized prototype $\tilde{\boldsymbol{\mu}}_c$ can be estimated empirically by averaging the embeddings of all samples in class $c$. The prototype $\boldsymbol{\mu}_c$ is the $\ell_2$-normalized vector with unit length; that is, $\boldsymbol{\mu}_c = \tilde{\boldsymbol{\mu}}_c / \lVert \tilde{\boldsymbol{\mu}}_c \rVert_2$. Prototypes with larger pairwise angular distances (i.e. lower average cosine similarity) are desirable for OOD detection, as we will show later in Section 4.2. The importance of inter-class dispersion can also be seen in Figure 1, where samples in the fox class (OOD) are semantically close to cat (ID) and dog (ID). A larger angular distance (i.e. smaller cosine similarity) between ID classes cat and dog in the hyperspherical space improves the separability from fox, and allows for more effective detection.

Intra-class compactness. The embeddings of samples in the same class should ideally form a compact cluster on the hypersphere. The compactness can be measured by the average cosine similarity between each feature embedding and its corresponding class prototype. We define the compactness (e.g. of the training set $\mathcal{D}^{\text{in}}$) as follows:

$\text{Compactness}(\mathcal{D}^{\text{in}}, \boldsymbol{\mu}) := \frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_i^{\top}\boldsymbol{\mu}_{c(i)},$    (2)

where $c(i)$ denotes the class index of sample $\mathbf{x}_i$, and $\mathbf{z}_i$ is the normalized embedding of $\mathbf{x}_i$ for all $i \in \{1, \dots, n\}$. We proceed by introducing our training objective, which encourages these two characteristics of the learned embeddings.
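Both measurements can be computed directly from the normalized embeddings. The sketch below is a minimal PyTorch implementation of Eqs. (1) and (2); the helper names are ours, and the inputs are assumed to be l2-normalized embeddings z with integer labels y in which every class appears at least once.

import torch
import torch.nn.functional as F

def class_prototypes(z: torch.Tensor, y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Per-class mean embedding, l2-normalized to unit length (shape C x d)."""
    protos = torch.stack([z[y == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def dispersion(mu: torch.Tensor) -> torch.Tensor:
    """Eq. (1): average pairwise cosine similarity among prototypes (lower = more dispersed)."""
    sim = mu @ mu.T                                                      # C x C cosine similarities
    off_diag = ~torch.eye(len(mu), dtype=torch.bool, device=mu.device)
    return sim[off_diag].mean()

def compactness(z: torch.Tensor, y: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Eq. (2): average cosine similarity to the own-class prototype (higher = more compact)."""
    return (z * mu[y]).sum(dim=1).mean()

def to_degrees(cos_sim: torch.Tensor) -> torch.Tensor:
    """Convert cosine similarity to angular degrees, as reported in Section 4.2."""
    return torch.rad2deg(torch.arccos(cos_sim.clamp(-1.0, 1.0)))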

3.2 CIDER: Compactness and Dispersion Regularized Learning

Training objective. To facilitate both inter-class dispersion and intra-class compactness for improving OOD detection, we introduce a novel training objective termed CIDER (compactness and dispersion regularized learning):

$\mathcal{L}_{\text{CIDER}} = \mathcal{L}_{\text{dis}} + \lambda_c \mathcal{L}_{\text{comp}}.$    (3)

CIDER consists of two losses: a dispersion loss $\mathcal{L}_{\text{dis}}$ that promotes large angular distances among different class prototypes, along with a compactness loss $\mathcal{L}_{\text{comp}}$ that encourages each sample to be aligned with its class prototype. Specifically, the two terms are defined as follows:

$\mathcal{L}_{\text{dis}} = \frac{1}{C}\sum_{i=1}^{C}\log\frac{1}{C-1}\sum_{j=1, j \neq i}^{C}e^{\boldsymbol{\mu}_i^{\top}\boldsymbol{\mu}_j / \tau}, \qquad \mathcal{L}_{\text{comp}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathbf{z}_i^{\top}\boldsymbol{\mu}_{c(i)} / \tau)}{\sum_{j=1}^{C}\exp(\mathbf{z}_i^{\top}\boldsymbol{\mu}_j / \tau)},$    (4)

where $N$ is the batch size, $C$ is the total number of ID classes, $c(i)$ denotes the class index of sample $\mathbf{x}_i$, and $\tau$ is the temperature parameter.

Prototype estimation and update. During training, an important step is to estimate the class prototype $\boldsymbol{\mu}_c$ for each class $c \in \{1, 2, \dots, C\}$. One canonical way to estimate the prototypes is to use the mean vector of all training samples for each class and update it frequently during training. Despite its simplicity, this method incurs a heavy computational toll and causes undesirable training latency. Instead, we update the class-conditional prototypes in an exponential-moving-average (EMA) manner:

$\boldsymbol{\mu}_c := \text{Normalize}\big(\alpha\boldsymbol{\mu}_c + (1-\alpha)\mathbf{z}\big), \quad \forall c \in \{1, 2, \dots, C\},$    (5)

where the prototype $\boldsymbol{\mu}_c$ for class $c$ is updated during training as the moving average of all embeddings with label $c$, $\mathbf{z}$ denotes the normalized embedding of a sample of class $c$, and $\alpha$ is the prototype update (discount) factor. We ablate the effect of the prototype update factor $\alpha$ in Section 4.3.
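Putting Eqs. (3)-(5) together, the following is a minimal PyTorch sketch of the CIDER objective with EMA-updated prototypes. The hyperparameter defaults, the class name, and the exact placement of the EMA update relative to the loss computation are illustrative and may deviate from the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CIDERLoss(nn.Module):
    """Sketch of the CIDER objective: EMA class prototypes plus compactness and dispersion losses."""
    def __init__(self, num_classes: int, feat_dim: int = 128,
                 tau: float = 0.1, lambda_c: float = 1.0, alpha: float = 0.95):
        super().__init__()
        self.tau, self.lambda_c, self.alpha = tau, lambda_c, alpha
        # Prototypes live on the unit hypersphere and are not learned by gradient descent.
        self.register_buffer("prototypes",
                             F.normalize(torch.randn(num_classes, feat_dim), dim=1))

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """z: (N, d) l2-normalized batch embeddings; y: (N,) integer class labels."""
        # Eq. (5): EMA update of the prototypes with the current batch embeddings.
        protos = self.prototypes.clone()
        for zi, yi in zip(z, y):
            protos[yi] = F.normalize(self.alpha * protos[yi] + (1 - self.alpha) * zi, dim=0)
        self.prototypes = protos.detach()                  # keep a detached running copy

        # Compactness loss (Eq. 4): align each sample with its own class prototype.
        logits = z @ protos.t() / self.tau                 # (N, C)
        loss_comp = F.cross_entropy(logits, y)

        # Dispersion loss (Eq. 4): push class prototypes apart on the hypersphere.
        num_cls = protos.shape[0]
        sim = protos @ protos.t() / self.tau               # (C, C)
        off_diag = ~torch.eye(num_cls, dtype=torch.bool, device=sim.device)
        loss_dis = torch.log((sim.exp() * off_diag).sum(dim=1) / (num_cls - 1)).mean()

        # Eq. (3): overall CIDER objective.
        return loss_dis + self.lambda_c * loss_comp

In a training loop, one would compute the normalized batch embeddings z (e.g. with the projection network sketched above) and call loss = cider_loss(z, y) before a standard backward pass.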

Remark 1: CIDER vs. SSD+. A recent work, SSD+ [41], directly uses the supervised contrastive loss (i.e. SupCon [17]) for OOD detection. However, prior works do not dive deeper into how such representations impact OOD detection. As we will show in Section 4.2, SupCon produces embeddings that suffice for classifying ID samples but lack sufficient inter-class dispersion needed for OOD detection. Moreover, it can be shown that SupCon leads to a uniform hyperspherical distribution only when the number of negative samples goes to infinity [50], which in practice requires a large batch size and incurs a significant computational burden. In contrast, CIDER enforces inter-class dispersion by explicitly maximizing the angular distances among different ID class prototypes. We show that CIDER achieves strong inter-class dispersion and OOD detection performance with moderate batch sizes (e.g. 256).

Remark 2: CIDER vs. Proxy-based methods. Our work also bears significant differences w.r.t prior proxy-based metric learning methods. (1) Our primary task is OOD detection, whereas deep metric learning is commonly used for face verification and image retrieval tasks; (2) Prior methods such as ProxyAnchor [18] lack explicit prototype-to-prototype dispersion, which we show is essential for improving OOD detection; (3) ProxyAnchor initializes the proxies randomly and updates them through gradients, while we estimate prototypes directly from sample embeddings using EMA. We provide experimental comparisons in the next section.

4 Experiments

4.1 Common Setup

Datasets and training details. Following the common benchmarks in the literature, we consider CIFAR-10 and CIFAR-100 [21] as in-distribution datasets. For OOD test datasets, we use a suite of natural image datasets including SVHN [33], Places365 [57], Textures [8], LSUN [56], and iSUN [55]. In our main experiments, we use ResNet-18 as the backbone for CIFAR-10 and ResNet-34 for CIFAR-100. We train the model using stochastic gradient descent with momentum 0.9 and weight decay. To demonstrate the simplicity and effectiveness of CIDER, we adopt the same hyperparameters as in SSD+ [41] with the SupCon loss: the initial learning rate is 0.5 with cosine scheduling, the batch size is 512, and the training time is 500 epochs. We choose the default loss weight $\lambda_c$ so that the values of the different loss terms are similar upon model initialization. Following the literature [17], we use an embedding dimension of 128 for the projection head. The temperature $\tau$ is 0.1, and the prototype update factor $\alpha$ is kept at its default value. We adjust the batch size, temperature, loss weight, prototype update factor, training time, and model architecture in our ablation study (Section 4.3). We report the ID classification results for SSD+ and CIDER following the common linear evaluation protocol [4], where a linear classifier is trained on top of the penultimate layer features. More details are provided in Appendix B. Code and data will be released publicly for reproducible research.

OOD detection scoring function. During test time, we employ a distance-based method for OOD detection. An input is considered OOD if it is relatively far from the ID data in the embedding space. Under the same OOD detection scoring function, we will show the effect of representation quality with different training methods (Section 4.2). In particular, we consider the commonly used maximum Mahalanobis distance as in [22, 41, 51]:

$S(\mathbf{x}) = \max_{c \in [C]} \; -\big(\mathbf{z}_{\mathbf{x}} - \hat{\boldsymbol{\mu}}_c\big)^{\top}\hat{\Sigma}^{-1}\big(\mathbf{z}_{\mathbf{x}} - \hat{\boldsymbol{\mu}}_c\big),$    (6)

where $\mathbf{z}_{\mathbf{x}}$ is the embedding of test sample $\mathbf{x}$, $\hat{\boldsymbol{\mu}}_c$ is the estimated class centroid for class $c$, and $\hat{\Sigma}$ is the covariance matrix estimated from the ID training data:

$\hat{\boldsymbol{\mu}}_c = \frac{1}{n_c}\sum_{i: y_i = c}\mathbf{z}_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{c=1}^{C}\sum_{i: y_i = c}\big(\mathbf{z}_i - \hat{\boldsymbol{\mu}}_c\big)\big(\mathbf{z}_i - \hat{\boldsymbol{\mu}}_c\big)^{\top}.$
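As a reference, a minimal NumPy sketch of Eq. (6) with per-class centroids and a single shared covariance matrix is given below; the function names are ours, and the pseudo-inverse is used only as a numerical safeguard.

import numpy as np

def fit_gaussian(train_emb: np.ndarray, train_labels: np.ndarray, num_classes: int):
    """Estimate class centroids and the shared (tied) covariance from ID training embeddings."""
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in range(num_classes)])
    centered = train_emb - centroids[train_labels]
    cov = centered.T @ centered / len(train_emb)
    return centroids, np.linalg.pinv(cov)                  # return centroids and the precision matrix

def mahalanobis_score(z: np.ndarray, centroids: np.ndarray, precision: np.ndarray) -> np.ndarray:
    """Eq. (6): maximum over classes of the negative Mahalanobis distance (higher = more ID-like)."""
    diffs = z[:, None, :] - centroids[None, :, :]          # (N, C, d)
    dists = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)
    return -dists.min(axis=1)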

Evaluation metrics.

We report the following metrics: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of in-distribution samples is at 95%, (2) the area under the receiver operating characteristic curve (AUROC), and (3) ID classification accuracy (ID ACC).
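Both detection metrics only require the scores of ID and OOD test samples, under the convention that higher scores indicate ID. A minimal sketch, using scikit-learn for the AUROC, is shown below.

import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """FPR95: fraction of OOD samples kept when the threshold retains 95% of ID samples."""
    threshold = np.quantile(id_scores, 0.05)               # 95% of ID scores lie above this value
    return float((ood_scores >= threshold).mean())

def auroc(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """AUROC with ID treated as the positive class."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    return float(roc_auc_score(labels, scores))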

4.2 Main Results and Discussions

Method OOD Dataset Average
SVHN Places365 LSUN iSUN Texture
FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC ID ACC
Without Contrastive Learning
MSP [12] 78.89 79.80 84.38 74.21 83.47 75.28 84.61 74.51 86.51 72.53 83.12 75.27 74.59
ODIN [25] 70.16 84.88 82.16 75.19 76.36 80.10 79.54 79.16 85.28 75.23 78.70 79.11 74.59
Mahalanobis [22] 87.09 80.62 84.63 73.89 84.15 79.43 83.18 78.83 61.72 84.87 80.15 79.53 74.59
Energy [26] 66.91 85.25 81.41 76.37 59.77 86.69 66.52 84.49 79.01 79.96 70.72 82.55 74.59
GODIN [14] 74.64 84.03 89.13 68.96 93.33 67.22 94.25 65.26 86.52 69.39 87.57 70.97 74.92
With Contrastive Learning
ProxyAnchor [18] 87.21 82.43 70.10 79.84 37.19 91.68 70.01 84.96 65.64 84.99 66.03 84.78 NA
CE + SimCLR [51] 24.82 94.45 86.63 71.48 56.40 89.00 66.52 83.82 63.74 82.01 59.62 84.15 73.54
CSI [44] 44.53 92.65 79.08 76.27 75.58 83.78 76.62 84.98 61.61 86.47 67.48 84.83 NA
SSD+ [41] 31.19 94.19 77.74 79.90 79.39 85.18 80.85 84.08 66.63 86.18 67.16 85.90 75.02
CIDER () 11.33 97.75 76.94 79.08 67.92 86.99 80.11 84.90 40.20 91.96
CIDER () 13.86 97.07 78.38 78.64 32.62 93.62 53.96 88.59 44.41 90.46
Table 1: Main results. OOD detection performance based on Mahalanobis distance for ResNet-34 trained on CIFAR-100. Training with CIDER significantly improves both OOD detection performance and ID classification accuracy.

CIDER achieves SOTA results on OOD detection with high ID accuracy. Table 1 contains a wide range of competitive methods for OOD detection. All methods are trained on ResNet-34 using CIFAR-100, without assuming access to auxiliary outlier datasets during training. For clarity, we divide the methods into two categories: trained with and without contrastive losses. For pre-trained model-based scores such as MSP [12], ODIN [25], Mahalanobis [22], and Energy [26], the model is trained with the softmax cross-entropy (CE) loss by convention. GODIN [14] is trained using the DeConf-C loss, which does not involve contrastive loss either. For methods involving contrastive losses, we use the same network structure and embedding dimension, while only varying the training objective. The maximum Mahalanobis distance (Eq. 6) is used for OOD detection in SSD+ [41], CE+SimCLR [51], ProxyAnchor [18], and CIDER.

As shown in Table 1, OOD detection performance is significantly improved with CIDER. Three trends can be observed: (1) Compared to the current best method SSD+, CIDER explicitly optimizes for inter-class dispersion, which is beneficial for OOD detection. Under the same training settings as SSD+, CIDER (with default parameters) reduces the average FPR95, owing to more desirable embeddings for OOD detection. The improvement becomes even more pronounced under the alternative hyperparameter configuration in Table 1, where CIDER outperforms SSD+ by 22.15% in FPR95; (2) While CSI [44] relies on sophisticated data augmentations and ensembles in testing, CIDER only uses the default data augmentations and thus is simpler. Performance-wise, CIDER also reduces the average FPR95 compared to CSI; (3) Lastly, as a result of the improved embedding quality, CIDER improves the ID accuracy by 1.03% compared to SSD+ with the SupCon loss, and by 1.46% compared to training with the CE loss. We provide results on the less challenging task (CIFAR-10 as ID) in Appendix D, where CIDER's strong performance remains.

CIDER learns distinguishable representations. We visualize the learned feature embeddings in Figure 3 using UMAP [29], where the colors encode different class labels. A salient observation is that embeddings obtained with CIDER (Figure 3(b)) enjoy much better compactness compared to embeddings trained with the CE loss (Figure 3(a)). Moreover, the classes are distributed more uniformly in the space, highlighting the efficacy of the dispersion loss.

CIDER improves inter-class dispersion and intra-class compactness. Beyond visualization, we also quantitatively measure the intra-class compactness by Eq. 2 and the inter-class dispersion by Eq. 1. To make the measurements more interpretable, we convert cosine similarities to angular degrees. Hence, a higher inter-class dispersion (in degrees) indicates more separability among class prototypes, which is desirable. Similarly, lower intra-class compactness (in degrees) is better. The results are shown in Table 2 based on the CIFAR-10 test set. Compared to SSD+ (with SupCon loss), CIDER significantly improves the inter-class dispersion by 12.03 degrees. Different from SupCon, CIDER explicitly optimizes the inter-class dispersion, which especially benefits OOD detection.

(a) ID embeddings of CE
(b) ID embeddings of CIDER
Figure 3: (a), (b): UMAP [29] visualization of the feature embedding when the model is trained with CE vs. CIDER for CIFAR-10 (ID). (c)-(d): compared to the embeddings of CE, CIDER makes OOD samples more separable from ID (c.f. Table 2).
Training Loss Dispersion (ID) Compactness (ID) ID-OOD Separability (in degree)
(in degree) (in degree) CIFAR-100 LSUN iSUN Texture SVHN AVG
Cross-Entropy 67.17 24.53 7.11 14.57 13.70 13.76 11.08 12.04
SupCon 75.50 22.08 23.90 28.55 25.70 33.45 37.70 29.86
CIDER (ours) 87.53 21.35 31.41 48.37 41.54 39.60 51.65 42.51
Table 2: Compactness and dispersion of CIFAR-10 feature embedding, along with the separability w.r.t each OOD test set. We convert cosine similarity to angular degrees for better readability. All numbers are based on the Mahalanobis distance as OOD detection score.

CIDER improves ID-OOD separability. Next, we quantitatively measure how the feature embedding quality affects the ID-OOD separability. We introduce a separability score, which measures on average how close the embedding of a sample from the OOD test set is to the closest ID class prototype, compared to that of an ID sample. The traditional notion of "OOD being far away from ID classes" is now translated to "OOD being somewhere between ID clusters on the hypersphere". A higher separability score indicates that the OOD test set is easier to be detected. Formally, we define the separability measurement as:

$\text{Separability} := \frac{1}{|\mathcal{D}^{\text{in}}_{\text{test}}|}\sum_{\mathbf{x}\in\mathcal{D}^{\text{in}}_{\text{test}}}\max_{j\in[C]}\mathbf{z}_{\mathbf{x}}^{\top}\boldsymbol{\mu}_j \;-\; \frac{1}{|\mathcal{D}^{\text{ood}}_{\text{test}}|}\sum_{\mathbf{x}'\in\mathcal{D}^{\text{ood}}_{\text{test}}}\max_{j\in[C]}\mathbf{z}_{\mathbf{x}'}^{\top}\boldsymbol{\mu}_j,$    (7)

where $\mathcal{D}^{\text{ood}}_{\text{test}}$ is the OOD test dataset, $\mathcal{D}^{\text{in}}_{\text{test}}$ is the ID test dataset, and $\mathbf{z}_{\mathbf{x}}$ denotes the normalized embedding of sample $\mathbf{x}$. As in Table 2, the score is reported in angular degrees. Table 2 shows that our method leads to higher separability and consequently superior OOD detection performance (c.f. Table 1). Averaging across 5 OOD test datasets, our method displays a relative improvement of roughly 42% in ID-OOD separability compared to SupCon. This further verifies the effectiveness of the compactness and dispersion losses for improving OOD detection.
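A minimal PyTorch sketch of this measurement, reported in angular degrees as in Table 2, is given below; it assumes l2-normalized test embeddings and class prototypes, and the function name is ours.

import torch

def separability_deg(id_z: torch.Tensor, ood_z: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """ID-OOD separability (Eq. 7) in degrees: how much farther OOD samples sit from the
    closest ID class prototype than ID samples do (higher = easier to detect)."""
    def angle_to_nearest(z: torch.Tensor) -> torch.Tensor:
        cos = (z @ mu.T).max(dim=1).values                 # similarity to the closest prototype
        return torch.rad2deg(torch.arccos(cos.clamp(-1.0, 1.0)))
    return angle_to_nearest(ood_z).mean() - angle_to_nearest(id_z).mean()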

CIDER benefits hard OOD detection. OOD samples that are semantically similar to ID samples are particularly challenging for OOD detection algorithms [51]. As a concrete example, in Figure 1, the fox can be viewed as a hard OOD w.r.t the cat and dog classes (ID). Larger dispersion (in degrees) between the two ID classes improves the ID-OOD separability from fox. In the literature, CIFAR-10 (ID) vs. CIFAR-100 (OOD) and vice versa are often used as hard OOD detection tasks. As shown in Table 2 and Figure 3(c), when CIFAR-10 is used as ID, the embedding of CIFAR-100 is barely separable from CIFAR-10 under the cross-entropy loss (c.f. the ID-OOD separability is only 7.1 degrees). In contrast, CIDER displays a large ID-OOD separability of 31.4 degrees, as shown in Figure 3(d).

To measure the detection performance, we contrast different training loss functions while keeping the test-time OOD detection score fixed to the maximum Mahalanobis distance for all methods. As shown in Table 3, CIDER improves the CE baseline by 17.99% in AUROC on CIFAR-10 (ID) vs. CIFAR-100 (OOD). For the more challenging case where CIFAR-100 is used as ID, CIDER improves AUROC by 14.18% over the CE baseline. This further highlights the importance of inter-class dispersion via CIDER for hard OOD detection.

Training Loss   ID: CIFAR-10 / OOD: CIFAR-100   ID: CIFAR-100 / OOD: CIFAR-10
CE 75.05 62.83
CE+SimCLR [51] 88.16 68.56
CSI [44] 91.19 70.74
SSD+ [41] 91.68 75.67
CIDER (ours) 93.04 77.02
Table 3: Performance comparison (in AUROC) on hard OOD detection tasks. Results are based on ResNet-18 for CIFAR-10 (ID) and ResNet-34 for CIFAR-100 (ID). All numbers are based on the Mahalanobis score.

4.3 Ablation Studies

In this section, we provide ablation results on how different factors impact the performance of CIDER. For consistency, we present the analyses below based on CIFAR-100. Similar trends also hold for the less challenging task (CIFAR-10). Note that when ablating on one factor, we keep the other hyperparameters the same as default (c.f. Section 4.1).

Inter-class dispersion is critical for OOD detection. Here we examine the effects of the loss components on OOD detection. As shown in Table 4, we have the following observations: (1) for ID classification, training with $\mathcal{L}_{\text{comp}}$ alone leads to an accuracy of 75.25%, similar to the ID accuracy of SupCon (75.02%). This suggests that promoting intra-class compactness and a moderate level of inter-class dispersion (as a result of sample-to-prototype negative pairs in $\mathcal{L}_{\text{comp}}$) are sufficient to discriminate different ID classes; (2) For OOD detection, further inter-class dispersion is beneficial, which is explicitly encouraged through the dispersion loss $\mathcal{L}_{\text{dis}}$. As a result, adding $\mathcal{L}_{\text{dis}}$ improves the average AUROC by 4.58%. However, promoting inter-class dispersion via $\mathcal{L}_{\text{dis}}$ alone without $\mathcal{L}_{\text{comp}}$ is not sufficient for either ID classification or OOD detection. Our ablation suggests that $\mathcal{L}_{\text{comp}}$ and $\mathcal{L}_{\text{dis}}$ work synergistically to improve the hyperspherical embeddings in ways that are desirable for both ID classification and OOD detection.

Loss Components AUROC ID ACC
Places365 LSUN iSUN Texture SVHN AVG
$\mathcal{L}_{\text{comp}}$ only 78.81 83.50 82.25 83.95 89.28 83.56 75.25
$\mathcal{L}_{\text{dis}}$ only 54.35 70.84 51.39 37.25 42.68 51.30 1.99
$\mathcal{L}_{\text{comp}} + \mathcal{L}_{\text{dis}}$ (CIDER) 79.08 86.99 84.90 91.96 97.75 88.14 76.05
Table 4: Ablation study on loss components. Results (in AUROC) are based on CIFAR-100 trained with ResNet-34. Training with $\mathcal{L}_{\text{comp}}$ only suffices for ID classification. Inter-class dispersion induced by $\mathcal{L}_{\text{dis}}$ is key to OOD detection.
(a) weight $\lambda_c$
(b) initial learning rate
(c) moving-average factor $\alpha$
(d) temperature $\tau$
Figure 4: Ablation on (a) the weight $\lambda_c$ of the compactness loss; (b) the initial learning rate; (c) the prototype update discount factor $\alpha$; and (d) the temperature $\tau$. The results are based on CIFAR-100 (ID) and averaged over 5 OOD test sets.

Ablation on the loss weights. In the main results (Table 1), we demonstrate the effectiveness of CIDER where the loss weight $\lambda_c$ is simply set to balance the initial scale between $\mathcal{L}_{\text{dis}}$ and $\mathcal{L}_{\text{comp}}$. In fact, CIDER can be further improved by adjusting $\lambda_c$. As shown in Figure 4(a), the performance of CIDER is relatively stable for moderate adjustments of $\lambda_c$ (e.g. 0.5 to 2), with the best FPR95 attained within this range. This indicates that CIDER provides a simple and effective solution for improving OOD detection performance, without much need for hyperparameter tuning on the loss scale.

Ablation on the learning rate. Prior works [17, 41] use a default initial learning rate (lr) of 0.5 to train contrastive losses, which is also the default setting of CIDER. We further investigate the impact of the initial learning rate on OOD detection. As shown in Figure 4(b), a relatively high initial lr (e.g. 0.4-0.5) is indeed desirable for competitive performance. Too small an lr (e.g. 0.1) leads to performance degradation.

Adjusting the prototype update factor improves CIDER. We show in Figure 4(c) the performance obtained by varying the moving-average discount factor $\alpha$ in Eq. (5). We observe that the detection performance (averaged over 5 test sets) remains competitive across a wide range of $\alpha$. Our main results are based on the default $\alpha$ without tuning. In particular, a moderate $\alpha$ results in the best performance.

Small temperature leads to better performance. Figure 4(d) shows the detection performance as we vary the temperature parameter $\tau$. We observe that the OOD detection performance is desirable at a relatively small temperature. Complementary to our finding, a relatively small temperature has been shown to be desirable for ID classification [17, 48], as it penalizes hard negative samples with larger gradients and leads to more separable features.

CIDER is effective under various batch sizes. Figure 5(a) shows that CIDER remains competitive under different batch size configurations compared to SupCon. To explain this, the standard SupCon loss requires instance-to-instance distance measurements, whereas the compactness loss reduces the complexity to instance-to-prototype. The class-conditional prototypes are updated during training; they capture the average statistics of each class and alleviate the dependency on the batch size. This leads to an overall memory-efficient solution for OOD detection.

Loss Components AUROC ID ACC
Places365 LSUN iSUN Texture SVHN AVG
$\mathcal{L}_{\text{CE}}$ 74.21 75.28 74.51 72.53 79.80 75.27 74.59
$\mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{CIDER}}$ 79.72 85.36 84.16 88.11 94.13 86.30 74.35
$\mathcal{L}_{\text{CIDER}}$ 79.08 86.99 84.90 91.96 97.75 88.14 76.05
Table 5: Ablation on jointly training CIDER with the cross-entropy loss ($\mathcal{L}_{\text{CE}}$).

The potential of using CIDER as regularizers. Following the convention [17], the ID classification results in the main paper are obtained via the linear evaluation scheme. We further investigate the benefits of the explicit geometric regularization in CIDER by jointly training $\mathcal{L}_{\text{CIDER}}$ with $\mathcal{L}_{\text{CE}}$ (dubbed as Triple), where $\mathcal{L}_{\text{CE}}$ denotes the cross-entropy loss. We realize $\mathcal{L}_{\text{CE}}$ by adding a linear classifier on top of the penultimate layer. By comparing the first and second rows in Table 5, jointly training the cross-entropy loss with CIDER improves the OOD detection performance significantly, while the ID accuracy is maintained. This further highlights the simplicity of CIDER as a regularizer that facilitates desirable geometry in the representation space. We also investigate the effects of adding CIDER to further regularize SupCon in Appendix C.

(a) Ablation on batch size
(b) Ablation on architecture
Figure 5: (a): Ablation on batch size. The results are averaged across the 5 OOD test sets based on ResNet-34. CIDER outperforms SupCon across different batch sizes; (b): Ablation on architecture. Results are based on ResNet-50.

Ablation on network capacity and architecture. Lastly, we verify the effectiveness of CIDER with networks of higher capacity such as ResNet-50. The results are shown in Figure 5(b). The trend is similar to what we observed with ResNet-34. Specifically, as a result of the improved representation, training with CIDER improves the FPR95 on various test sets compared to training with the SupCon loss. We provide ablations on different network architectures such as DenseNet-101 in Appendix D.2.

5 Related Works

Out-of-distribution detection. The majority of works in the OOD detection literature focus on the supervised setting, where the goal is to derive a binary ID-OOD classifier along with a classification model for the in-distribution data. Compared to generative model-based methods [20, 32, 38, 42, 53], OOD detection based on supervised discriminative models typically yields more competitive performance. Among methods based on deep neural networks, most derive confidence scores either based on the output [2, 10, 12, 14, 16, 25, 26, 43], gradient information [15], or the feature embeddings [22, 40, 41, 44, 51]. Our method can be categorized as a distance-based OOD detection method by exploiting the hyperspherical embedding space. While some works assume access to auxiliary outlier datasets during training [3, 13, 30, 36], our method does not rely on any external information and is hence more generally applicable and flexible.

Contrastive representation learning. Contrastive representation learning [7, 46] aims to learn a discriminative embedding where positive samples are aligned while negative ones are dispersed. It has demonstrated remarkable success for visual representation learning in unsupervised [4, 6, 11, 39], semi-supervised [1], and supervised settings [17]. Recently, Li et al. [24] propose a prototype-based contrastive loss for unsupervised learning where prototypes are generated via a clustering algorithm, while our method is supervised, with prototypes updated based on labels. Li et al. [23] incorporate a prototype-based loss to tackle data noise. Wang and Isola [50] analyze the asymptotic behavior of contrastive losses theoretically, while Wang and Liu [47] empirically investigate properties of contrastive losses for classification. However, none of these works focus on OOD detection. We aim to fill the gap and facilitate the understanding of contrastive losses for OOD detection in supervised learning.

Contrastive learning for out-of-distribution detection. Self-supervised learning has been shown to improve OOD detection. Prior works [41, 51] verify the effectiveness of directly applying the off-the-shelf multi-view contrastive losses such as SupCon and SimCLR for OOD detection. However, these training objectives are primarily designed for ID classification, lacking ID-OOD separability consideration in design. Similarly, CSI [44] investigates the type of data augmentations that are particularly beneficial for OOD detection. Different from all prior works, we take the perspective from the embedding and propose a new learning framework that explicitly encourages the desirable properties for OOD detection, and thus alleviates the dependence on specific data augmentations or self-supervision. We provide comparisons with all these approaches in Section 4.

Deep metric learning. Learning a desirable embedding with neural networks has been a fundamental goal in the deep metric learning community. Various losses have been proposed for applications such as face verification [9, 27, 49], person re-identification [5, 52], and image retrieval [19, 31, 35, 45]. However, none of these works focus on desirable embeddings for OOD detection. Compared to other proxy-based metric learning methods, CIDER bears several key differences: (1) prior methods such as ProxyAnchor [18] lack explicit prototype-to-prototype dispersion, which we show is essential for improving OOD detection; (2) ProxyAnchor initializes the proxies randomly and updates them through gradients, while we estimate prototypes directly from sample embeddings using EMA.

6 Conclusion and Outlook

In this work, we propose CIDER, a novel representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes the compactness and dispersion losses to promote strong ID-OOD separability. We show that CIDER achieves state-of-the-art performance on common OOD benchmarks, including hard OOD detection tasks. Moreover, we introduce new measurements to quantify the hyperspherical embedding, and establish the relationship with OOD detection performance. We conduct extensive ablations to understand the efficacy and behavior of CIDER under various settings and hyperparameters. We hope our work can inspire future methods of exploiting hyperspherical representations for OOD detection.

References

  • [1] M. Assran, M. Caron, I. Misra, P. Bojanowski, A. Joulin, N. Ballas, and M. Rabbat (2021) Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8443–8452. Cited by: §5.
  • [2] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In CVPR, Cited by: §5.
  • [3] J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha (2021) ATOM: robustifying out-of-distribution detection using outlier mining. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). Cited by: §5.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §3, §4.1, §5.
  • [5] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 403–412. Cited by: §5.
  • [6] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §5.
  • [7] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 539–546 vol. 1. Cited by: §5.
  • [8] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: §4.1.
  • [9] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699. Cited by: §5.
  • [10] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. External Links: 1802.04865 Cited by: §5.
  • [11] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 9726–9735. Cited by: §5.
  • [12] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Cited by: Appendix B, Table 6, Table 7, §1, §2, §4.2, Table 1, §5.
  • [13] D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, Cited by: §5.
  • [14] Y. Hsu, Y. Shen, H. Jin, and Z. Kira (2020) Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960. Cited by: Table 6, Table 7, §4.2, Table 1, §5.
  • [15] R. Huang, A. Geng, and Y. Li (2021) On the importance of gradients for detecting distributional shifts in the wild. In Advances in Neural Information Processing Systems, Cited by: §5.
  • [16] R. Huang and Y. Li (2021) Towards scaling out-of-distribution detection for large semantic space. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §5.
  • [17] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18661–18673. Cited by: Appendix B, §3.2, §4.1, §4.3, §4.3, §4.3, §5.
  • [18] S. Kim, D. Kim, M. Cho, and S. Kwak (2020) Proxy anchor loss for deep metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, Table 7, §3.2, §4.2, Table 1, §5.
  • [19] S. Kim, M. Seo, I. Laptev, M. Cho, and S. Kwak (2019) Deep metric learning beyond binary supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2288–2297. Cited by: §5.
  • [20] P. Kirichenko, P. Izmailov, and A. G. Wilson (2020) Why normalizing flows fail to detect out-of-distribution data. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • [21] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [22] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: Appendix B, Appendix B, Table 6, Table 7, §1, §4.1, §4.2, Table 1, §5.
  • [23] J. Li, C. Xiong, and S. C.H. Hoi (2021) Learning from noisy data with robust representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9485–9494. Cited by: §5.
  • [24] J. Li, P. Zhou, C. Xiong, and S. C.H. Hoi (2021) Prototypical contrastive learning of unsupervised representations. In ICLR, Cited by: §5.
  • [25] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: Appendix B, Table 6, Table 7, §4.2, Table 1, §5.
  • [26] W. Liu, X. Wang, J. Owens, and Y. Li (2020) Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems. Cited by: Appendix B, Table 6, Table 7, §4.2, Table 1, §5.
  • [27] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • [28] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §D.1.
  • [29] L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: Figure 3, §4.2.
  • [30] S. Mohseni, M. Pitale, J. Yadawa, and Z. Wang (2020) Self-supervised learning for generalizable out-of-distribution detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5216–5223. Cited by: §5.
  • [31] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §5.
  • [32] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019) Do deep generative models know what they don’t know?. In International Conference on Learning Representations, Cited by: §5.
  • [33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1.
  • [34] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §1.
  • [35] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4004–4012. Cited by: §5.
  • [36] A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang (2021) Outlier exposure with confidence control for out-of-distribution detection. Neurocomputing 441, pp. 138–150. Cited by: §5.
  • [37] J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan (2021) A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022. Cited by: Appendix B.
  • [38] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pp. 14680–14691. Cited by: §5.
  • [39] J. D. Robinson, C. Chuang, S. Sra, and S. Jegelka (2021) Contrastive learning with hard negative samples. In International Conference on Learning Representations, Cited by: §5.
  • [40] C. S. Sastry and S. Oore (2020) Detecting out-of-distribution examples with Gram matrices. In Proceedings of the 37th International Conference on Machine Learning, Cited by: §5.
  • [41] V. Sehwag, M. Chiang, and P. Mittal (2021) SSD: a unified framework for self-supervised outlier detection. In International Conference on Learning Representations, Cited by: Appendix B, §D.1, §D.2, Table 6, Table 7, §1, §3.2, §3, §4.1, §4.2, §4.3, Table 1, Table 3, §5.
  • [42] J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque (2020) Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations, Cited by: §5.
  • [43] Y. Sun, C. Guo, and Y. Li (2021) ReAct: out-of-distribution detection with rectified activations. In Advances in Neural Information Processing Systems, Cited by: §5.
  • [44] J. Tack, S. Mo, J. Jeong, and J. Shin (2020) CSI: novelty detection via contrastive learning on distributionally shifted instances. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Table 6, §1, §4.2, Table 1, Table 3, §5.
  • [45] E. W. Teh, T. DeVries, and G. W. Taylor (2020) Proxynca++: revisiting and revitalizing proxy neighborhood component analysis. In European Conference on Computer Vision, pp. 448–464. Cited by: §5.
  • [46] A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. External Links: 1807.03748 Cited by: §5.
  • [47] F. Wang and H. Liu (2021) Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2495–2504. Cited by: §5.
  • [48] F. Wang and H. Liu (2021) Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2495–2504. Cited by: §4.3.
  • [49] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274. Cited by: §5.
  • [50] T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. Cited by: §3.2, §3, §5.
  • [51] J. Winkens, R. Bunel, A. G. Roy, R. Stanforth, V. Natarajan, J. R. Ledsam, P. MacWilliams, P. Kohli, A. Karthikesalingam, S. Kohl, et al. (2020) Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566. Cited by: Appendix B, Appendix B, §D.1, Table 6, Table 7, §1, §4.1, §4.2, §4.2, Table 1, Table 3, §5, §5.
  • [52] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3415–3424. Cited by: §5.
  • [53] Z. Xiao, Q. Yan, and Y. Amit (2020) Likelihood regret: an out-of-distribution detection score for variational auto-encoder. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • [54] C. Xing, S. Arik, Z. Zhang, and T. Pfister (2020) Distance-based learning from errors for confidence calibration. In International Conference on Learning Representations, Cited by: §1.
  • [55] P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, and J. Xiao (2015) Turkergaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755. Cited by: §4.1.
  • [56] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.1.
  • [57] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464. Cited by: §4.1.

Appendix A CIDER Training Scheme

The first stage of the training scheme of our compactness and dispersion regularized (CIDER) learning framework is shown in Algorithm 1. We jointly optimize: (1) a compactness loss to encourage samples to be close to their class prototypes, and (2) a dispersion loss to encourage larger angular distances among different class prototypes. Note that the second stage of training (for ID classification) follows the convention by training a linear classifier based on the extracted penultimate layer features from the first stage.

Input: training dataset $\mathcal{D}^{\text{in}}$, neural network encoder $f$, projection head $h$, class prototypes $\boldsymbol{\mu}_c$ for $c \in \{1, \dots, C\}$, loss weight $\lambda_c$, prototype update factor $\alpha$, temperature $\tau$
for epoch = 1, 2, ... do
        for iteration = 1, 2, ... do
                sample a mini-batch $B = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
                obtain the augmented batch $\tilde{B}$ by applying two random augmentations to each $\mathbf{x}_i \in B$
                for each augmented sample $\tilde{\mathbf{x}}_i \in \tilde{B}$ do
                        // obtain the normalized embedding
                        $\tilde{\mathbf{z}}_i = h(f(\tilde{\mathbf{x}}_i))$, $\mathbf{z}_i = \tilde{\mathbf{z}}_i / \lVert \tilde{\mathbf{z}}_i \rVert_2$
                        // update the class prototype of class $y_i$ (Eq. 5)
                        $\boldsymbol{\mu}_{y_i} := \text{Normalize}(\alpha \boldsymbol{\mu}_{y_i} + (1 - \alpha)\mathbf{z}_i)$
                // calculate the compactness loss $\mathcal{L}_{\text{comp}}$ (Eq. 4)
                // calculate the dispersion loss $\mathcal{L}_{\text{dis}}$ (Eq. 4)
                // calculate the overall loss $\mathcal{L}_{\text{CIDER}} = \mathcal{L}_{\text{dis}} + \lambda_c \mathcal{L}_{\text{comp}}$ (Eq. 3)
                // update the network weights
                update the weights of the encoder $f$ and the projection head $h$ with one gradient step on $\mathcal{L}_{\text{CIDER}}$
Algorithm 1: Pseudo-code of CIDER.

Appendix B Experimental Details

Software and hardware.

All methods are implemented in PyTorch 1.10. We run all experiments on NVIDIA GeForce RTX-2080Ti GPUs for small to medium batch sizes, and on NVIDIA A100 GPUs for large batch sizes and larger network encoders.

Architecture. As shown in Figure 2, the overall architecture of CIDER consists of a projection head $h$ on top of a deep neural network encoder $f$. Following common practice and for fair comparison with prior works [41, 17], we fix the output dimension of the projection head to be 128. We use a linear projection head for the simpler task CIFAR-10 and a two-layer non-linear projection head for the more complex task CIFAR-100.

Training. For methods based on pre-trained models such as MSP [12], ODIN [25], Mahalanobis [22], and Energy [26], we follow the common practice and train with the cross-entropy loss for 200 epochs. For fair comparison, methods involving contrastive learning [51, 44, 41] are trained for 500 epochs. For CIDER, we adopt the same key hyperparameters for contrastive losses, such as the initial learning rate (0.5), temperature (0.1), and batch size (512), as SSD+ [41] in the main experiments to demonstrate the effectiveness and simplicity of our loss.

Evaluation. For each OOD test set, if the dataset is larger than the ID test set, we randomly select a subset of the same size for fairness consideration. During test time, we fix the OOD detection score throughout our study, which allows us to isolate the effect of representation quality under different training methods. For simplicity, we consider the maximum Mahalanobis distance [22] based on a single layer and with a shared covariance estimated from the training set as in Eq. 6. Note that more complicated variants exist, such as estimating the class-wise covariance [51], or calculating the Mahalanobis score from multiple intermediate layers and training a logistic regression to find a good combination of scores from different layers [22]. A recent work [37] proposed a Relative Mahalanobis score to improve near-OOD detection. We believe that such variants of the Mahalanobis score also benefit from improved representation quality, and leave this exploration as future work.
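A minimal sketch of the subsampling step described above is shown below; the function name and the fixed seed are ours.

import numpy as np
from torch.utils.data import Dataset, Subset

def match_ood_size(ood_dataset: Dataset, id_test_size: int, seed: int = 0) -> Dataset:
    """Randomly subsample an OOD test set so it is no larger than the ID test set."""
    if len(ood_dataset) <= id_test_size:
        return ood_dataset
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ood_dataset), size=id_test_size, replace=False)
    return Subset(ood_dataset, idx.tolist())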

Appendix C The Potential of Using CIDER to Regularize SupCon

In Section 4.3, we demonstrate the effectiveness of embedding regularization via CIDER by jointly training CIDER with the cross-entropy loss. In this section, we provide another set of experiments where we investigate whether CIDER also facilitates better inter-class dispersion when jointly trained with SupCon. Specifically, we use SupCon+CIDER to denote jointly training the SupCon loss with the CIDER losses. As shown in Figures 6(a) and 6(b), compared to training with the SupCon loss alone, adding CIDER improves the OOD detection performance across different batch size configurations. However, under moderate to large batch sizes (e.g. 512 to 1024), training with CIDER alone still results in the best performance.

(a) FPR95 under different batch sizes
(b) AUROC under different batch sizes
Figure 6: Ablation on using CIDER as regularizers to the SupCon loss for better inter-class dispersion. The results are averaged across the 5 OOD test sets based on ResNet-34. SupCon+CIDER (i.e. SupCon jointly trained with the CIDER losses) outperforms SupCon across different batch sizes, suggesting the effectiveness of explicitly facilitating prototype-wise dispersion.

Appendix D Results on CIFAR-10

D.1 Main Results

In the main paper, we mainly focus on the more challenging task, CIFAR-100. In this section, we additionally evaluate on CIFAR-10, a commonly used benchmark in the literature. For methods involving contrastive losses, we use the same network encoder and embedding dimension, while only varying the training objective. The Mahalanobis distance (Eq. 6) is used for OOD detection in SSD+ [41], CE+SimCLR [51], SupCon, and CIDER. The results are shown in Table 6. Similar trends hold as described in Section 4.2: (1) CIDER achieves the state-of-the-art OOD detection performance on CIFAR-10 as a result of better inter-class dispersion and intra-class compactness. For example, using the same Mahalanobis score for detection, training with CIDER reduces the average FPR95 from 57.15% to 11.12% over 5 diverse test sets compared to training with the cross-entropy loss [28]; (2) Although the ID classification accuracy of CIDER is similar to that of another proxy-based loss, ProxyAnchor, CIDER significantly improves the OOD detection performance due to the addition of explicit inter-class dispersion, which we show is critical for OOD detection in Section 4.3. The significant improvements highlight the importance of representation learning for OOD detection; (3) CIDER also improves the ID classification accuracy by 1.81% compared to training with the cross-entropy loss.

Method OOD Dataset Average
SVHN Places365 LSUN iSUN Texture
FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC ID ACC
Without Contrastive Learning
MSP [12] 82.75 81.69 66.79 64.56 69.71 85.35 59.77 90.39 69.33 78.97 69.67 80.19 92.82
Energy [26] 77.31 84.49 65.91 87.22 56.42 91.22 73.53 83.21 64.35 86.62 67.50 86.55 92.82
ODIN [25] 86.74 80.55 48.49 90.38 31.45 94.93 37.35 94.09 70.51 85.52 54.91 89.09 92.82
GODIN [14] 15.51 96.60 62.63 87.31 32.43 95.08 34.03 94.94 46.91 89.69 38.30 92.72 93.96
Mahalanobis [22] 19.01 95.74 89.28 65.99 77.15 78.94 71.98 80.06 28.32 93.36 57.15 82.82 92.82
With Contrastive Learning
CE + SimCLR [51] 6.98 99.22 54.39 86.70 64.53 85.60 59.62 86.78 16.77 96.56 40.46 90.97 93.12
CSI [44] 37.38 94.69 38.31 93.04 5.88 98.86 10.36 98.01 28.85 94.87 24.16 95.89 NA
SSD+ [41] 2.47 99.51 22.05 95.57 10.56 97.83 28.44 95.67 9.27 98.35 14.56 97.38 94.55
ProxyAnchor [18] 39.27 94.55 43.46 92.06 21.04 97.02 23.53 96.56 42.70 93.16 34.00 94.67 94.21
CIDER 1.17 99.73 21.91 94.19 3.58 99.16 16.01 96.85 12.91 97.44 11.12 97.47 94.63
Table 6: Results on CIFAR-10. OOD detection performance based on Mahalanobis distance for ResNet-18 trained on CIFAR-10 with and without contrastive loss. CIDER improves OOD detection performance and ID classification accuracy.

D.2 Ablation on Other Architectures

In this section, we further verify the generality of CIDER on an alternative popular network architecture, DenseNet-101. For fair comparison and memory efficiency considerations, we use a batch size of 128 for all methods during training. The results are shown in Table 7. The trends are similar to Table 1 (Section 4.2) with ResNet-34 and Table 6 with ResNet-18: (1) training with contrastive losses generally improves the OOD detection performance compared to training with the softmax cross-entropy (CE) loss; (2) CIDER achieves the state-of-the-art OOD detection performance. For example, compared to SSD+ [41], CIDER further reduces the average FPR95 by 12.19%; (3) Compared to training with the CE loss, CIDER also improves the ID classification accuracy by 1.71%.

Method OOD Dataset Average
SVHN Places365 LSUN iSUN Texture
FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC FPR AUROC ID ACC
Without Contrastive Learning
MSP [12] 64.49 90.89 67.91 86.36 53.96 92.46 55.49 91.96 69.32 85.25 62.23 89.38 93.28
ODIN [25] 56.60 92.01 50.30 89.39 22.43 96.47 26.31 96.02 54.79 86.16 42.09 92.01 93.28
GODIN [14] 10.94 97.31 63.22 84.83 14.46 97.00 12.85 97.25 39.45 87.57 28.18 92.79 94.37
Mahalanobis [22] 8.00 98.19 79.44 73.50 14.96 96.90 17.11 96.31 18.14 95.27 27.53 92.03 93.28
Energy [26] 18.42 94.46 73.58 78.82 26.42 95.25 24.44 95.45 44.63 89.12 37.50 90.62 93.28
With Contrastive Learning
CE + SimCLR [51] 12.19 97.57 81.50 69.35 22.19 96.19 18.42 96.70 28.95 90.83 32.65 90.13 94.66
SSD+ [41] 6.12 96.09 37.65 92.25 7.73 98.49 10.61 97.62 33.85 93.39 22.19 95.57 NA
ProxyAnchor [18] 12.68 96.65 35.40 92.95 3.40 99.30 10.95 98.25 27.43 95.10 17.97 96.45 NA
CIDER 4.13 99.29 17.53 96.20 4.68 98.69 5.59 98.28 18.07 96.59 10.00 97.81 94.99
Table 7: Ablation on additional architecture. OOD detection performance based on Mahalanobis distance for DenseNet-101 trained on CIFAR-10 with and without contrastive loss. CIDER significantly improves OOD detection performance, with improved ID accuracy compared to training with the softmax cross-entropy loss.