Out-of-distribution (OOD) detection [hodge2004survey], also referred to as novelty or anomaly detection, is the task of identifying whether a test input is drawn far from the training distribution (in-distribution) or not. In general, the OOD detection problem aims to detect OOD samples where a detector is allowed to access only training data. The space of OOD samples is typically huge (compared to that of in-distribution), i.e., an OOD sample can vary significantly and arbitrarily from the given training distribution. Hence, assuming specific prior knowledge, e.g., external data representing some specific OODs, may introduce a bias to the detector. OOD detection is a classic yet essential problem in machine learning, with a broad range of applications, including medical diagnosis [caruana2015intelligible], fraud detection [phua2010comprehensive], and autonomous driving [eykholt2018robust].
A long line of methods has thus been proposed, including density-based [zhai2016deep, nalisnick2019deep, choi2018waic, du2019implicit, ren2019likelihood, serra2020input, grathwohl2020your], reconstruction-based [schlegl2017unsupervised, zong2018deep, pidhorskyi2018generative, perera2019ocgan, choi2020novelty], one-class classifier [scholkopf2000support, ruff2018deep, ruff2020deep], and self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] approaches. Overall, a majority of recent literature is concerned with (a) modeling the representation for a given sample to better encode normality [hendrycks2019using_pre, hendrycks2019using_self], and (b) defining a new detection score [ruff2018deep, bergman2020classification]. For example, recent studies have shown that inductive biases from neural networks significantly help to learn discriminative features for OOD detection [ruff2018deep, hendrycks2019using_self].
Meanwhile, recent progress on unsupervised representation learning has proven the effectiveness of contrastive learning in various domains, e.g., computer vision [henaff2019data, tian2019contrastive, he2019momentum, chen2020simple], audio processing [oord2018representation], and reinforcement learning [srinivas2020curl]. More specifically, contrastive learning extracts a strong inductive bias from data by pulling similar samples together while pushing the others apart. Instance discrimination [wu2018unsupervised] is a special type of contrastive learning, which only pulls together the same instances under different augmentations, and it has achieved state-of-the-art results [he2019momentum, chen2020simple].
Inspired by the recent success of instance discrimination, we aim to utilize its power of representation learning for OOD detection. To this end, we investigate the following questions: (a) how to learn a (more) discriminative representation for detecting OODs and (b) how to design a score function utilizing the representation from (a). We remark that the desired representation for OOD detection may differ from that for standard representation learning [hendrycks2019using_pre, hendrycks2019using_self], as the former aims to discriminate in-distribution and OOD samples, while the latter aims to discriminate within in-distribution samples.
We first find that an existing contrastive learning scheme for visual representation is already reasonably effective for detecting OOD samples with an appropriate detection score. We further observe that one can improve its performance by utilizing “hard” augmentations, e.g., rotation, that were known to be harmful and were left unused in standard contrastive learning [chen2020simple]. In particular, while the existing contrastive learning schemes act by pulling all augmented samples toward the original sample, we suggest additionally pushing the samples with hard or distribution-shifting augmentations away from the original. We observe that contrasting shifted samples helps OOD detection (although it may not help in-distribution classification; see Section 3.2), as the model now learns a new task of discriminating between in- and out-of-distribution, in addition to the original task of discriminating within in-distribution.
Contribution. We propose a simple yet effective method for OOD detection, coined contrasting shifted instances (CSI). Built upon the existing contrastive learning scheme [chen2020simple], we propose two novel additional components: (a) a new training method which contrasts distributionally-shifted augmentations (of the given sample) in addition to other instances, and (b) a score function which utilizes both the contrastively learned representation and our new training scheme in (a). Finally, we show that CSI enjoys broader usage by applying it to improve the confidence-calibration of the classifiers: it relaxes the overconfidence issue in their predictions for both in- and out-of-distribution samples while maintaining the classification accuracy.
We verify the effectiveness of CSI under various environments of detecting OOD, including unlabeled one-class, unlabeled multi-class, and labeled multi-class settings. To the best of our knowledge, we are the first to demonstrate all three settings under a single framework. Overall, CSI outperforms the baseline methods for all tested datasets. In particular, CSI achieves new state-of-the-art results (we do not compare with the methods using external OOD samples [hendrycks2019deep, ruff2020deep]) on one-class classification, e.g., it improves the mean area under the receiver operating characteristic curve (AUROC) from 90.1% to 94.3% (+4.2%) for CIFAR-10 [krizhevsky2009learning], 79.8% to 89.6% (+9.8%) for CIFAR-100 [krizhevsky2009learning], and 85.7% to 91.6% (+5.9%) for ImageNet-30 [hendrycks2019using_self] one-class datasets, respectively. We remark that CSI gives a larger improvement on harder (or near-distribution) OOD samples. To verify this, we also release new benchmark datasets: fixed versions of the resized LSUN and ImageNet [liang2018enhancing].
We remark that learning representation to discriminate between in- and out-of-distributions is an important but under-explored problem. We think that our work would guide new interesting directions in the future, for both representation learning and OOD detection.
2 CSI: Contrasting shifted instances
For a given dataset $\{x_m\}_{m=1}^{M}$ sampled from a data distribution $p_{\text{data}}(x)$ on the data space $\mathcal{X}$, the goal of out-of-distribution (OOD) detection is to model a detector from $\{x_m\}$ that identifies whether $x$ is sampled from the data generating distribution (or in-distribution) or not. As modeling $p_{\text{data}}$ directly is prohibitive in most cases, many existing methods for OOD detection define a score function $s(x)$ such that a high value heuristically indicates that $x$ is from in-distribution.
In Section 2.1, we first briefly review the preliminaries on contrastive learning. Then, we describe two components of our method, contrasting shifted instances (CSI): the training scheme and the corresponding score function, in Section 2.2 and Section 2.3, respectively. Finally, we propose an extension of CSI for training confidence-calibrated classifiers in Section 2.4.
2.1 Contrastive learning
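The idea of contrastive learning is to learn an encoder $f_\theta$ that extracts the information necessary to distinguish similar samples from the others. Let $x$ be a query, $\{x_+\}$ and $\{x_-\}$ be sets of positive and negative samples, respectively, and $\mathrm{sim}(z, z') := z \cdot z' / (\lVert z \rVert \lVert z' \rVert)$ be the cosine similarity. Then, the primitive form of the contrastive loss is defined as follows:

$$\mathcal{L}_{\text{con}}(x, \{x_+\}, \{x_-\}) := -\frac{1}{|\{x_+\}|} \log \frac{\sum_{x' \in \{x_+\}} \exp(\mathrm{sim}(z(x), z(x'))/\tau)}{\sum_{x' \in \{x_+\} \cup \{x_-\}} \exp(\mathrm{sim}(z(x), z(x'))/\tau)}, \tag{1}$$

where $|\{x_+\}|$ denotes the cardinality of the set $\{x_+\}$, $z(x)$ denotes the output feature of the contrastive layer, and $\tau$ denotes a temperature hyper-parameter. One can define the contrastive feature $z(x)$ directly from the encoder $f_\theta$, i.e., $z(x) = f_\theta(x)$ [he2019momentum], or apply an additional projection layer $g_\phi$, i.e., $z(x) = g_\phi(f_\theta(x))$ [chen2020simple]. We use the projection layer following the recent studies [chen2020simple, chen2020improved].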
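In this paper, we specifically consider the simple contrastive learning (SimCLR) [chen2020simple], a simple and effective objective based on the task of instance discrimination [wu2018unsupervised]. Let $\tilde{x}_i^{(1)}$ and $\tilde{x}_i^{(2)}$ be two independent augmentations of $x_i$ from a pre-defined family $\mathcal{T}$, namely, $\tilde{x}_i^{(1)} := T_1(x_i)$ and $\tilde{x}_i^{(2)} := T_2(x_i)$, where $T_1, T_2 \sim \mathcal{T}$. Then the SimCLR objective can be defined by the contrastive loss (1), where each $(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)})$ and $(\tilde{x}_i^{(2)}, \tilde{x}_i^{(1)})$ are considered as query-key pairs while the others are negatives. Namely, for a given batch $\mathcal{B} := \{x_i\}_{i=1}^{B}$, the SimCLR objective is defined as follows:

$$\mathcal{L}_{\text{SimCLR}}(\mathcal{B}; \mathcal{T}) := \frac{1}{2B} \sum_{i=1}^{B} \Big[ \mathcal{L}_{\text{con}}(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)}, \tilde{\mathcal{B}}_{-i}) + \mathcal{L}_{\text{con}}(\tilde{x}_i^{(2)}, \tilde{x}_i^{(1)}, \tilde{\mathcal{B}}_{-i}) \Big], \tag{2}$$

where $\tilde{\mathcal{B}} := \{\tilde{x}_i^{(1)}\}_{i=1}^{B} \cup \{\tilde{x}_i^{(2)}\}_{i=1}^{B}$ and $\tilde{\mathcal{B}}_{-i} := \{\tilde{x}_j^{(1)}\}_{j \neq i} \cup \{\tilde{x}_j^{(2)}\}_{j \neq i}$.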
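To make the two objectives concrete, the following is a minimal pure-Python sketch of the contrastive loss (1) and the SimCLR objective (2). It is an illustration under stated assumptions rather than the reference implementation: features are assumed to be precomputed outputs of the contrastive layer given as plain lists of floats, the function names are hypothetical, and the default temperature of 0.5 is only a placeholder value.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(query, positives, negatives, tau=0.5):
    """Negative log-ratio of positive similarity mass to total similarity mass."""
    pos = [math.exp(cosine_sim(query, p) / tau) for p in positives]
    neg = [math.exp(cosine_sim(query, n) / tau) for n in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg))) / len(positives)

def simclr_loss(view1, view2, tau=0.5):
    """Average the contrastive loss over both query-key orderings of each pair.

    view1[i] and view2[i] are features of two augmentations of the same sample;
    the features of all other samples in the batch act as negatives.
    """
    batch = len(view1)
    total = 0.0
    for i in range(batch):
        negatives = [z for j in range(batch) if j != i for z in (view1[j], view2[j])]
        total += contrastive_loss(view1[i], [view2[i]], negatives, tau)
        total += contrastive_loss(view2[i], [view1[i]], negatives, tau)
    return total / (2 * batch)
```

In this sketch the loss is small when the two views of a sample are close to each other and far from every other sample in the batch, which is the instance discrimination behavior described above.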
2.2 Contrastive learning for distribution-shifting transformations
Chen et al. [chen2020simple] performed an extensive study on which family of augmentations $\mathcal{T}$ leads to a better representation when used in SimCLR, i.e., which transformations $f_\theta$ should consider as positives. Overall, the authors report that some of the examined augmentations (e.g., rotation) sometimes degrade the discriminative performance of SimCLR. One of our key findings is that such augmentations can be useful for OOD detection when they are considered as negatives, i.e., contrasted from the original sample. In this paper, we explore which family of augmentations $\mathcal{S}$, which we call distribution-shifting transformations, or simply shifting transformations, would lead to a better representation in terms of OOD detection when used as negatives in SimCLR.
Contrasting shifted instances. We consider a set $\mathcal{S}$ consisting of $K$ different (random or deterministic) transformations, including the identity $I$: namely, we denote $\mathcal{S} := \{S_0 = I, S_1, \ldots, S_{K-1}\}$. In contrast to the vanilla SimCLR that considers augmented samples as positive to each other, we attempt to consider them as negative if the augmentation is from $\mathcal{S}$. For a given batch of samples $\mathcal{B} = \{x_i\}_{i=1}^{B}$, this can be done simply by augmenting $\mathcal{B}$ via $\mathcal{S}$ before putting it into the SimCLR loss defined in (2): namely, we define the contrasting shifted instances (con-SI) loss as follows:

$$\mathcal{L}_{\text{con-SI}} := \mathcal{L}_{\text{SimCLR}}\Big( \bigcup_{S \in \mathcal{S}} \mathcal{B}_S;\, \mathcal{T} \Big), \quad \text{where } \mathcal{B}_S := \{S(x_i)\}_{i=1}^{B}. \tag{3}$$

Here, our intuition is to regard each distributionally-shifted sample (i.e., $S \neq I$) as an OOD sample with respect to the original. In this respect, con-SI attempts to discriminate an in-distribution (i.e., $S = I$) sample from other OOD (i.e., $S \in \{S_1, \ldots, S_{K-1}\}$) samples. We further verify the effectiveness of con-SI in our experimental results: although con-SI does not improve the representation for standard classification, it does improve OOD detection significantly.
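Classifying shifted instances. In addition to contrasting shifted instances, we consider an auxiliary task that predicts which shifting transformation $y^{\mathcal{S}} \in \mathcal{S}$ is applied to a given input $x$, in order to help $f_\theta$ discriminate each shifted instance. Specifically, we add a linear layer to $f_\theta$ for modeling an auxiliary softmax classifier $p_{\text{cls-SI}}(y^{\mathcal{S}} \mid x)$, as in [golan2018deep, hendrycks2019using_self, bergman2020classification]. Let $\tilde{\mathcal{B}}_S$ be the batch augmented from $\mathcal{B}_S$ via SimCLR; then, we define the classifying shifted instances (cls-SI) loss as follows:

$$\mathcal{L}_{\text{cls-SI}} = \frac{1}{2B} \frac{1}{K} \sum_{S \in \mathcal{S}} \sum_{\tilde{x}_S \in \tilde{\mathcal{B}}_S} -\log p_{\text{cls-SI}}(y^{\mathcal{S}} = S \mid \tilde{x}_S). \tag{4}$$

The final loss of our proposed method, CSI, is defined by combining the two objectives:

$$\mathcal{L}_{\text{CSI}} = \mathcal{L}_{\text{con-SI}} + \lambda \cdot \mathcal{L}_{\text{cls-SI}}, \tag{5}$$

where $\lambda > 0$ is a balancing hyper-parameter. We simply set $\lambda = 1$ for all our experiments.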
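The auxiliary objective and the combined loss can be sketched as follows. This is a minimal illustration, assuming that the auxiliary classifier outputs one logit per shifting transformation, that the cls-SI loss is the mean cross-entropy of predicting the applied shift, and that the balancing hyper-parameter defaults to 1; all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def cls_si_loss(shift_logits, shift_labels):
    """Mean cross-entropy of predicting which shift was applied to each sample.

    shift_logits[n] holds one logit per shifting transformation for sample n,
    and shift_labels[n] is the index of the shift that was actually applied.
    """
    losses = [-log_softmax(logits)[label]
              for logits, label in zip(shift_logits, shift_labels)]
    return sum(losses) / len(losses)

def csi_loss(con_si, cls_si, lam=1.0):
    """Combine the contrastive and classification terms (lam is an assumed default)."""
    return con_si + lam * cls_si
```

With uniform logits over four shifts, `cls_si_loss` evaluates to log 4, the loss of a classifier that cannot tell the shifts apart; the value approaches zero as the classifier becomes confident in the correct shift.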
2.3 Score functions for detecting out-of-distribution
Upon the representation $z(\cdot)$ learned by our proposed training objective, we define several score functions for detecting out-of-distribution samples, i.e., for deciding whether a given $x$ is OOD or not. We first propose a detection score that is applicable to any contrastive representation. We then introduce how one could incorporate additional information learned by contrasting (and classifying) shifted instances as in (5).
Detection score for contrastive representation. Overall, we find that two features from SimCLR representations are surprisingly effective for detecting OOD samples: (a) the cosine similarity to the nearest training sample in $\{x_m\}$, i.e., $\max_m \mathrm{sim}(z(x_m), z(x))$, and (b) the norm of the representation, i.e., $\lVert z(x) \rVert$. We discuss the detailed analysis of both features in Appendix H. We simply combine these two features to define a detection score $s_{\text{con}}$ for contrastive representation:

$$s_{\text{con}}(x; \{x_m\}) := \max_m \mathrm{sim}(z(x_m), z(x)) \cdot \lVert z(x) \rVert. \tag{6}$$
We also discuss how one can reduce the computation and memory cost by choosing a proper subset (i.e., coreset) of training samples in Appendix E.
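As a concrete illustration of this detection score, the sketch below computes the similarity to the nearest training feature and scales it by the feature norm. It is a minimal example, assuming that the two features are combined by multiplication and that features are precomputed lists of floats; the function names are hypothetical, and the brute-force search over all training features is exactly the cost that a coreset would reduce.

```python
import math

def norm(z):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(v * v for v in z))

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def s_con(z, train_feats):
    """Cosine similarity to the nearest training feature, scaled by the norm of z."""
    nearest = max(cosine_sim(z, z_m) for z_m in train_feats)
    return nearest * norm(z)
```

A higher value indicates that the input is more likely in-distribution: the feature both points in a direction close to some training sample and has a large norm.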
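Utilizing shifting transformations. Given that our proposed $\mathcal{L}_{\text{CSI}}$ is used for training, one can further improve the detection score significantly by incorporating shifting transformations $\mathcal{S}$. Here, we propose two additional scores, $s_{\text{con-SI}}$ and $s_{\text{cls-SI}}$, which correspond to $\mathcal{L}_{\text{con-SI}}$ (3) and $\mathcal{L}_{\text{cls-SI}}$ (4), respectively.

Firstly, we define $s_{\text{con-SI}}$ by taking an expectation of $s_{\text{con}}$ over $S \in \mathcal{S}$:

$$s_{\text{con-SI}}(x; \{x_m\}) := \sum_{S \in \mathcal{S}} \lambda_S^{\text{con}}\, s_{\text{con}}(S(x); \{S(x_m)\}), \tag{7}$$

where $\lambda_S^{\text{con}} := M / \sum_m s_{\text{con}}(S(x_m); \{S(x_m)\}) = M / \sum_m \lVert z(S(x_m)) \rVert$ for $M$ training samples are used as balancing terms, since the shifting transformation can change the feature statistics (see Appendix F for the details).

Secondly, we define $s_{\text{cls-SI}}$ utilizing the auxiliary classifier $p(y^{\mathcal{S}} \mid x)$ upon $f_\theta$ as follows:

$$s_{\text{cls-SI}}(x) := \sum_{S \in \mathcal{S}} \lambda_S^{\text{cls}}\, W_S f_\theta(S(x)), \tag{8}$$

where $\lambda_S^{\text{cls}} := M / \sum_m [W_S f_\theta(S(x_m))]$ are again balancing terms similar to the above, and $W_S$ is the weight vector in the linear layer of $p(y^{\mathcal{S}} \mid x)$ per $S \in \mathcal{S}$.

Finally, the combined score for CSI representation is defined as follows:

$$s_{\text{CSI}}(x; \{x_m\}) := s_{\text{con-SI}}(x; \{x_m\}) + s_{\text{cls-SI}}(x). \tag{9}$$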
Ensembling over random augmentations. In addition, we find one can further improve each of the proposed scores by ensembling it over random augmentations $T(x)$ where $T \sim \mathcal{T}$. Namely, for instance, the ensembled CSI score is defined by $s_{\text{CSI-ens}}(x) := \mathbb{E}_{T \sim \mathcal{T}}[s_{\text{CSI}}(T(x))]$. Unless otherwise noted, we use these ensembled versions of (6), (7), (8), and (9) in our experiments. See Appendix D for details.
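The shift-aware score and the ensembling step can be sketched as below. This is a minimal illustration under several assumptions: features of the shifted query and of the shifted training samples are precomputed, each balancing term is taken to be the reciprocal of the mean training feature norm under the corresponding shift, and the ensemble is approximated by a plain average over a finite set of augmented inputs. All function names are hypothetical.

```python
import math

def norm(z):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(v * v for v in z))

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def s_con(z, train_feats):
    """Nearest-neighbor cosine similarity scaled by the feature norm."""
    return max(cosine_sim(z, z_m) for z_m in train_feats) * norm(z)

def s_con_si(query_feats, train_feats_per_shift):
    """Sum the per-shift scores, each rescaled by a balancing term.

    query_feats[k] is the feature of the query under the k-th shift, and
    train_feats_per_shift[k] holds the training features under the same shift.
    """
    total = 0.0
    for z, train_feats in zip(query_feats, train_feats_per_shift):
        # Assumed balancing term: reciprocal of the mean training feature norm.
        balance = len(train_feats) / sum(norm(z_m) for z_m in train_feats)
        total += balance * s_con(z, train_feats)
    return total

def ensemble_score(score_fn, augmented_inputs):
    """Average a score over several randomly augmented copies of one input."""
    scores = [score_fn(x) for x in augmented_inputs]
    return sum(scores) / len(scores)
```

The balancing term keeps one shift from dominating the sum merely because its features have a larger typical norm, which is the motivation given above for rescaling each term.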
2.4 Extension for training confidence-calibrated classifiers
Furthermore, we show that our proposed method can be extended for training a confidence-calibrated classifier [hendrycks2017baseline, lee2018training] from a given labeled dataset $\{(x_m, y_m)\}_{m=1}^{M}$. Here, the goal is to model a classifier $p(y \mid x)$ such that (a) it is accurate in predicting $y$ for a given in-distribution sample $x$, and (b) its confidence $s_{\text{sup}}(x) := \max_y p(y \mid x)$ [hendrycks2017baseline] is well-calibrated, e.g., $s_{\text{sup}}(x)$ should be low when $x$ is an OOD sample or when the prediction $\arg\max_y p(y \mid x)$ is incorrect. In our experiments, we measure the performance of $s_{\text{sup}}(x)$ on detecting OOD samples to evaluate a confidence-calibrated classifier.
To this end, we extend CSI with supervised contrastive learning (SupCLR) [khosla2020supervised], a supervised extension of SimCLR that only contrasts different classes instead of individual samples. We extend our training method (Section 2.2) to SupCLR by contrasting the self-label augmented [lee2019rethinking] space $\mathcal{C} \times \mathcal{S}$, where $\mathcal{C}$ is the set of class labels and $\mathcal{S}$ is the set of shifting transformations. In a similar manner to (3), this can be done simply by augmenting the samples via $\mathcal{S}$ before putting them into the SupCLR loss. From the learned representation, we train two types of linear classifiers: (a) one that predicts the class label, and (b) one that predicts the joint probability distribution over the class label and the shifting transformation. We then marginalize the prediction (b) over the shifting transformations, similarly to the ensemble score in Section 2.3, obtaining an ensembled class prediction (c). More details on how CSI can be integrated with SupCLR are presented in Appendix B.
3 Experiments
In Section 3.1, we report OOD detection results on unlabeled one-class, unlabeled multi-class, and labeled multi-class datasets. In Section 3.2, we analyze the effects of various shifting transformations in the context of OOD detection, and present an ablation study on each component we propose.
Setup. We use the ResNet-18 [he2016deep] architecture for all the experiments. For data augmentations, we adopt those used by Chen et al. [chen2020simple]: namely, we use the combination of Inception crop [szegedy2015going], horizontal flip, color jitter, and grayscale as $\mathcal{T}$. Unless specified otherwise, we assume $\mathcal{S}$, the shifting transformation, to be the random rotation $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. We remark that one may further improve the performance by incorporating different transformations: see Table 5 for an ablation study on transformations other than rotation. By default, we train our models with the training objective in (5) and detect OOD samples with the ensembled version of the score in (9).
We mainly report the area under the receiver operating characteristic curve (AUROC) as a threshold-free evaluation metric for a detection score. In addition, we also report the test accuracy and the expected calibration error (ECE) [naeini2015obtaining, guo2017calibration] for the experiments on labeled multi-class datasets. Here, ECE estimates whether a classifier can indicate when it is likely to be incorrect for test samples (from in-distribution) by measuring the difference between prediction confidence and accuracy. The formal description of the metrics and the detailed experimental setups are in Appendix A.
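For readers who want a concrete picture of the two metrics, the following is a minimal pure-Python sketch. It assumes that higher scores indicate in-distribution, computes AUROC through its standard rank-statistic interpretation (the probability that a random in-distribution sample scores above a random OOD sample, with ties counted as one half), and computes ECE with equal-width confidence bins. The number of bins and the function names are assumptions for illustration; the exact protocol used in the experiments is the one described in Appendix A.

```python
def auroc(in_scores, ood_scores):
    """Probability that an in-distribution score exceeds an OOD score (ties count 1/2)."""
    wins = 0.0
    for s_in in in_scores:
        for s_ood in ood_scores:
            if s_in > s_ood:
                wins += 1.0
            elif s_in == s_ood:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))

def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

The pairwise AUROC loop is quadratic in the number of samples and is meant only to expose the definition; a sort-based implementation would be preferable for large test sets.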
3.1 Main results
Unlabeled one-class datasets. We start by considering the one-class setup: here, for a given multi-class dataset of $C$ classes, we conduct $C$ one-class classification tasks, where each task chooses one of the classes as in-distribution while the remaining classes are out-of-distribution. We run our experiments on three datasets, following the prior work [golan2018deep, hendrycks2019using_self, bergman2020classification]: CIFAR-10 [krizhevsky2009learning], CIFAR-100 labeled into 20 super-classes [krizhevsky2009learning], and ImageNet-30 [hendrycks2019using_self]. We compare our method with various prior methods, including one-class classifier [scholkopf2000support, ruff2018deep], reconstruction-based [schlegl2017unsupervised, perera2019ocgan], and self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] approaches. Table 1 summarizes the results, showing that CSI significantly outperforms the prior methods in all the tested cases. We provide the full, additional results, e.g., class-wise results on CIFAR-100 (super-class) and ImageNet-30, in Appendix C.
Unlabeled multi-class datasets. In this setup, we assume that in-distribution samples are from a specific multi-class dataset without labels, testing on various external datasets as out-of-distribution. We compare our method on two in-distribution datasets: CIFAR-10 [krizhevsky2009learning] and ImageNet-30 [hendrycks2019using_self]. We consider the following datasets as out-of-distribution: SVHN [netzer2011reading], resized LSUN and ImageNet [liang2018enhancing], CIFAR-100 [krizhevsky2009learning], and linearly-interpolated samples of CIFAR-10 (Interp.) [du2019implicit] for CIFAR-10 experiments, and CUB-200 [welinder2010caltech], Dogs [khosla2011novel], Pets [parkhi2012cats], Flowers [nilsback2006visual], Food-101 [bossard2014food], Places-365 [zhou2017places], Caltech-256 [griffin2007caltech], and DTD [cimpoi2014describing] for ImageNet-30. We compare our method with various prior methods, including density-based [du2019implicit, ren2019likelihood, serra2020input] and self-supervised [golan2018deep, bergman2020classification] approaches.
Table 2 shows the results. Overall, we observe that our method significantly outperforms prior methods in all the benchmarks tested. We remark that our method is particularly effective for detecting hard (i.e., near-distribution) OOD samples, e.g., CIFAR-100 and Interp. in the CIFAR-10 results of Table 2. Also, our method still shows notable performance in cases where prior methods often fail, e.g., Places-365 in the ImageNet-30 results of Table 2. Finally, we notice that the resized LSUN and ImageNet datasets officially released by Liang et al. [liang2018enhancing] might be misleading for evaluating detection performance on hard OODs: we find that those datasets contain some unintended artifacts, due to an incorrect resizing procedure. Such artifacts make those datasets easily detectable, e.g., via input statistics. In this respect, we produce and test on their fixed versions, coined LSUN (FIX) and ImageNet (FIX); we provide the code and datasets at https://github.com/alinlab/CSI. See Appendix I for details.
Labeled multi-class datasets. We also consider the labeled version of the above setting: namely, we now assume that every in-distribution sample also contains discriminative label information. We use the same datasets considered in the unlabeled multi-class setup for in- and out-of-distribution datasets. We train our model as proposed in Section 2.4, and compare it with those trained by other methods, the cross-entropy and supervised contrastive learning (SupCLR) [khosla2020supervised]. Since our goal is to calibrate the confidence, the maximum softmax probability is used to detect OOD samples (see [hendrycks2017baseline]).
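The confidence score used in this setting can be sketched as follows. This is a minimal illustration of the maximum softmax probability computed from classifier logits; the function names are hypothetical, and the logits are assumed to be given as a plain list of floats.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_probability(logits):
    """Confidence of the classifier: the largest predicted class probability."""
    return max(softmax(logits))
```

A sample is flagged as OOD when this confidence falls below a threshold; since AUROC is threshold-free, no specific threshold needs to be chosen for evaluation.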
Table 3 shows the results. Interestingly, our method consistently improves AUROC and ECE for ImageNet-30 while maintaining test accuracy. It supports our intuition that CSI learns the discriminative information for in- vs. out-of-distribution samples, in addition to that within in-distribution. One can also observe that CSI can further improve the performance by ensembling over the transformations. We remark that our results on unlabeled datasets (in Table 2) already show comparable performance to the supervised baselines (in Table 3).
3.2 Ablation study
We perform an ablation study on various shifting transformations, training losses, and detection scores. Throughout this section, we report the mean AUROC values on one-class CIFAR-10.
Shifting transformation. We test various data transformations for the shifting transformation. In particular, we consider Cutout [devries2017improved], Sobel filtering [kanopoulos1988design], Gaussian noise, Gaussian blur, and rotation [gidaris2018unsupervised]. We remark that these transformations are reported to be ineffective in improving the class discriminative power of SimCLR [chen2020simple]. In addition, we also consider the transformation coined “Perm”, which randomly permutes each part of the evenly partitioned image. Intuitively, such transformations commonly shift the input distribution, hence forcing them to be aligned can be harmful. Figure 1 visualizes all the considered transformations.
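The “Perm” transformation can be sketched as follows. This is a minimal illustration, assuming a 2x2 grid of equally sized patches and an image stored as a list of rows whose height and width are divisible by the grid size; the grid size, the function name, and the explicit `order` argument are assumptions made for the example.

```python
def permute_patches(image, order, grid=2):
    """Rearrange the grid*grid patches of an image according to `order`.

    The patch that ends up at row-major position d is the original patch
    number order[d]. Height and width must be divisible by `grid`.
    """
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid
    # Cut the image into patches, enumerated in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]
    out = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        r, c = divmod(dst, grid)
        for i in range(ph):
            out[r * ph + i][c * pw:(c + 1) * pw] = patches[src][i]
    return out
```

A random instance of the transformation is obtained by drawing the order at random, e.g., `permute_patches(image, random.sample(range(4), 4))` after importing the standard `random` module.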
Table 4 shows AUROC values of the vanilla SimCLR, where the in-distribution samples shifted by the chosen transformation are given as OOD samples. The shifted samples are easily detected, which validates our intuition that the considered transformations shift the input distribution. In particular, “Perm” and “Rotate” are the most distinguishable, which implies that they shift the distribution the most. Note that “Perm” and “Rotate” also turn out to be the most effective shifting transformations; this implies that the transformations that shift the distribution the most indeed perform best for our method. (We also tried contrasting some external OOD samples in a manner similar to [hendrycks2019deep, ruff2020deep]; however, we find that naïvely using them in our framework degrades the performance. This is because the contrastive loss also discriminates within external OOD samples, which is unnecessary and an additional learning burden for our purpose. Shifted samples can be the most effective (‘nearby’ but ‘not-too-nearby’) OOD samples without this issue.)
Besides, we apply the transformation upon the vanilla SimCLR in two ways: we either align the transformed samples to the original samples (i.e., use the transformation as $\mathcal{T}$) or consider them as the shifted samples (i.e., use it as $\mathcal{S}$). Table 1(a) shows that aligning the transformations degrades (or is on par with) the detection performance, while shifting the transformations gives consistent improvements. We also remove or convert-to-shift the transformation from the vanilla SimCLR in Table 1(b), and see similar results. We remark that one can further improve the performance by combining multiple shifting transformations (see Appendix G).
Linear evaluation. We also measure the linear evaluation [kolesnikov2019revisiting], i.e., the accuracy of a linear classifier trained to discriminate classes of in-distribution samples. It is widely used for evaluating the quality of an (unsupervised) learned representation. We report the linear evaluation of vanilla SimCLR and CSI (with shifting rotation), trained on unlabeled CIFAR-10. They show comparable results, 90.48% for SimCLR and 90.19% for CSI; CSI is more specialized to learn a representation for OOD detection.
Data-dependence of shifting transformation. We remark that the best choice of shifting transformation for detecting OODs depends on the dataset. For instance, consider a rotation-invariant dataset such as textures. Here, the rotation should not be a shifting transformation. Table 6 shows the AUROC values where the Describable Textures Dataset (DTD) [cimpoi2014describing] and ImageNet-30 are used as in- and out-of-distribution samples, respectively. We compare the vanilla SimCLR and CSI using rotation as $\mathcal{S}$, denoted as “Base” and “CSI (Rotation)”, respectively. Unlike for natural images, shifting rotated images degrades OOD detection. See Appendix J for additional discussion.
Training loss. In Table 1(c), we assess the individual effects of each component that constitutes our final training loss (5): namely, we compare the vanilla SimCLR (2), contrasting shifted instances (3), and classifying shifted instances (4) losses. For the evaluation of the models trained with the different losses (2), (3), (4), and (5), we use the detection scores defined in (6), (7), (8), and (9), respectively. We remark that both contrasting and classifying show better results than the vanilla SimCLR, and combining them (i.e., the final CSI loss (5)) gives further improvements, i.e., the two losses are complementary.
Detection score. Finally, Table 1(d) shows the effect of each component in our detection score: the vanilla contrastive (6), contrasting shifted instances (7), and classifying shifted instances (8) scores. We ensemble the scores over both $\mathcal{T}$ and $\mathcal{S}$ for (7), (8), and (9), and use a single sample for (6). All the reported values are evaluated from the model trained by the final loss (5). Similar to the above, both contrasting and classifying scores show better results than the vanilla contrastive score, and combining them (i.e., the final CSI score (9)) gives further improvements.
4 Related work
4.1 OOD detection
Out-of-distribution (OOD) detection is a classic and essential problem in machine learning, studied under different names, e.g., novelty or anomaly detection [hodge2004survey]. In this paper, we primarily focus on unsupervised OOD detection, which is arguably the most traditional and popular setup in the field [scholkopf2000support]. In this setting, the detector is only allowed to access in-distribution samples while being required to identify unseen OOD samples. There are other settings, e.g., the semi-supervised setting, where the detector can access a small subset of out-of-distribution samples [hendrycks2019deep, ruff2020deep], or the supervised setting, where the detector knows the target out-of-distribution, but we do not consider those settings in this paper. We remark that the unsupervised setting is the most practical and challenging scenario since there are infinitely many cases for out-of-distribution, and it is often not possible to have such external data.
Most recent works can be grouped into four categories: (a) density-based [zhai2016deep, nalisnick2019deep, choi2018waic, du2019implicit, ren2019likelihood, serra2020input, grathwohl2020your], (b) reconstruction-based [schlegl2017unsupervised, zong2018deep, deecke2018anomaly, pidhorskyi2018generative, perera2019ocgan, choi2020novelty], (c) one-class classifier [scholkopf2000support, ruff2018deep], and (d) self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] methods. We note that there is more extensive literature on this topic, but we mainly focus on the recent work based on deep learning due to the limited space (see [hodge2004survey, chandola2009anomaly, pimentel2014review] for surveys). A brief description of each category is as follows:
Density-based methods. Density-based methods are one of the most classic and principled approaches for OOD detection. Intuitively, they directly use the likelihood of the sample as the detection score. However, recent studies reveal that the likelihood is often not the best metric, especially for deep neural networks with complex datasets [nalisnick2019deep, choi2018waic]. Several works have thus proposed modified scores, e.g., WAIC [choi2018waic], likelihood ratio [ren2019likelihood], and input complexity [serra2020input], or utilized unnormalized likelihood (i.e., energy) [du2019implicit, grathwohl2020your].
Reconstruction-based methods. The reconstruction-based approach is another popular line of research for OOD detection. It trains an encoder-decoder network that reconstructs the training data in an unsupervised manner. Since the encoder-decoder network generalizes poorly to unseen OOD samples, these methods use the reconstruction loss as the detection score. There are also several works on this approach, utilizing generative auto-encoders [zong2018deep, pidhorskyi2018generative] or generative adversarial networks [schlegl2017unsupervised, deecke2018anomaly, perera2019ocgan].
One-class classifiers. One-class classifiers are also a classic and principled approach for OOD detection. They learn a decision boundary of in- vs. out-of-distribution samples, by giving some margin that covers the in-distribution samples [scholkopf2000support]. Recent work shows that the one-class classifier is effective upon the deep representation [ruff2018deep].
Self-supervised methods. Self-supervised methods are a relatively new approach based on the strong representation learned from the self-supervision [gidaris2018unsupervised]. They train a network with a pre-defined task (e.g., predict the angle of the rotated image) on the training set, and use the generalization error to detect OOD samples. Recent self-supervised methods show outstanding results on various OOD detection benchmark datasets [golan2018deep, hendrycks2019using_self, bergman2020classification].
Our work can be categorized into the self-supervised methods, but differs from the prior work as we consider the contrastive learning type of self-supervision [chen2020simple]. We validate that the power of contrastive representation learning can be extended to OOD detection, with proper modification.
4.2 Confidence-calibrated classifiers
Another line of research is on confidence-calibrated classifiers [hendrycks2017baseline], which relaxes the overconfidence issues of the classifiers. There are two types of calibration: (a) in-distribution calibration, that aligns the uncertainty and the actual accuracy, measured by ECE [naeini2015obtaining, guo2017calibration], and (b) out-of-distribution detection, that reduces the uncertainty of OOD samples, measured by AUROC [hendrycks2017baseline, lee2018training]. Note that the goal of confidence-calibrated classifiers is to regularize the prediction; hence all three tasks: classification, in-distribution calibration, and out-of-distribution detection, are done by the softmax probability. Namely, the detection score is given by the confidence (or maximum softmax probability) [hendrycks2017baseline]. There are also several works on designing a specific detection score utilizing the pre-trained classifier (e.g., [liang2018enhancing, lee2018simple]), but we do not consider those approaches in this paper.
4.3 Self-supervised learning
Self-supervised learning [gidaris2018unsupervised, kolesnikov2019revisiting] has recently shown remarkable success in representation learning. In particular, contrastive learning [oord2018representation], specifically instance discrimination [wu2018unsupervised], shows state-of-the-art results on visual representation learning [he2019momentum, chen2020simple]. However, most prior works primarily focus on improving the downstream task performance (e.g., classification), and other advantages of self-supervised learning (e.g., uncertainty or robustness) are rarely investigated [hendrycks2019using_self]. Our work first verifies the effectiveness of contrastive learning for OOD detection.
5 Conclusion
We propose a simple yet effective method named contrasting shifted instances (CSI), which extends the power of contrastive learning for out-of-distribution (OOD) detection problems. CSI demonstrates outstanding performance under various OOD detection scenarios. We believe our work would guide various future directions in OOD detection and self-supervised learning as an important baseline.
Appendix A Experimental details
Training details. We use ResNet-18 [he2016deep] as the base encoder network $f_\theta$ and a 2-layer multi-layer perceptron with an embedding dimension of 128 as the projection head $g_\phi$. All models are trained by minimizing the final loss (5) with a temperature of $\tau = 0.5$. We follow the same optimization steps as SimCLR [chen2020simple]. For optimization, we train CSI for 1,000 epochs under the LARS optimizer [you2017large] with a weight decay of $10^{-6}$ and a momentum of 0.9. For the learning rate scheduling, we use linear warmup [goyal2017accurate] for the first 10 epochs up to a learning rate of 1.0, and then decay it with a cosine decay schedule without restarts [loshchilov2016sgdr]. We use a batch size of 512 for both vanilla SimCLR and ours, where the batch is given by $\mathcal{B}$ for vanilla SimCLR and the aggregated one $\bigcup_{S \in \mathcal{S}} \mathcal{B}_S$ for ours. Furthermore, we use global batch normalization (BN) [ioffe2015batch], which shares the BN parameters (mean and variance) over the GPUs in distributed training.
For supervised contrastive learning (SupCLR) [khosla2020supervised] and supervised CSI, we select the best temperature among the candidate values: SupCLR recommends 0.07, but 0.5 was better in our experiments. For training the encoder, we use the same optimization scheme as above, except that we train for 700 epochs. For training the linear classifier, we train the model for 100 epochs with a batch size of 128, using stochastic gradient descent with momentum 0.9. The learning rate starts at 0.1 and is dropped by a factor of 10 at 60%, 75%, and 90% of the training progress.
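The two learning rate schedules described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the schedules are evaluated once per epoch, and the cosine schedule is assumed to decay toward zero at the final epoch.

```python
import math


def warmup_cosine_lr(epoch, total_epochs=1000, warmup_epochs=10, peak_lr=1.0):
    """Linear warmup to peak_lr, then cosine decay without restarts.

    Assumption: the rate is updated once per epoch and decays toward zero.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches peak_lr at the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def step_lr(epoch, total_epochs=100, base_lr=0.1, milestones=(60, 75, 90)):
    """Drop the rate by a factor of 10 at each milestone.

    Milestones are percentages of the training progress; integer arithmetic
    avoids floating-point issues at the boundaries.
    """
    drops = sum(epoch * 100 >= m * total_epochs for m in milestones)
    return base_lr * 0.1 ** drops
```

With the defaults, `warmup_cosine_lr` corresponds to the encoder schedule (1,000 epochs, 10 warmup epochs, peak rate 1.0) and `step_lr` to the linear classifier schedule (100 epochs, initial rate 0.1).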
Data augmentation details. We use the SimCLR augmentations, namely Inception crop [szegedy2015going], horizontal flip, color jitter, and grayscale, as the random augmentations, and rotation as the shifting transformation. The augmentations are described in detail as follows:
Inception crop. Randomly crops an area of the original image, with the area ratio sampled uniformly from 0.08 to 1.0. After the crop, the cropped image is resized to the original image size.
Horizontal flip. Flips the image horizontally with a probability of 50%.
Color jitter. Changes the hue, brightness, and saturation of the image. We transform the RGB (red, green, blue) image into the HSV (hue, saturation, value) format and add noise to the HSV channels. We apply color jitter with a probability of 80%.
Grayscale. Converts the image into a grayscale image. We apply grayscale with a probability of 20%.
Rotation. We use rotation as the shifting transformation. For a given batch, we apply each rotation degree to obtain the new, aggregated batch for CSI.
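As an illustration of how the aggregated batch can be built from rotations, the following NumPy sketch stacks rotated copies of a batch and records which rotation produced each copy. It assumes four rotations by multiples of 90 degrees and square images; these choices and the function name `build_shifted_batch` are illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np


def build_shifted_batch(batch, num_rotations=4):
    """Stack rotated copies of a batch of images with shape (B, H, W, C).

    Returns the aggregated batch of shape (num_rotations * B, H, W, C) and
    the index of the rotation applied to each image (the shift label).
    Assumption: rotations are multiples of 90 degrees and images are square,
    so every rotated copy keeps the same shape.
    """
    rotated = [np.rot90(batch, k=k, axes=(1, 2)) for k in range(num_rotations)]
    shift_labels = np.repeat(np.arange(num_rotations), batch.shape[0])
    return np.concatenate(rotated, axis=0), shift_labels
```

The shift labels returned here play the role of the self-supervised labels that the shifted-instance losses distinguish.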
Dataset details. For one-class datasets, we train on a single class of CIFAR-10 [krizhevsky2009learning], CIFAR-100 (super-class) [krizhevsky2009learning], and ImageNet-30 [hendrycks2019using_self]. CIFAR-10 and CIFAR-100 consist of 50,000 training and 10,000 test images with 10 and 20 (super-class) image classes, respectively. ImageNet-30 contains 39,000 training and 3,000 test images with 30 image classes.
For unlabeled and labeled multi-class datasets, we train ResNet on CIFAR-10 and ImageNet-30. For CIFAR-10, the out-of-distribution (OOD) samples are as follows: SVHN [netzer2011reading] consists of 26,032 test images of 10 digits; resized LSUN [liang2018enhancing] consists of 10,000 test images of 10 different scenes; resized ImageNet [liang2018enhancing] consists of 10,000 test images of 200 image classes from a subset of the full ImageNet dataset; Interp. consists of 10,000 test images obtained by linear interpolation of CIFAR-10 test images; and LSUN (FIX) and ImageNet (FIX) each consist of 10,000 test images, as detailed in Appendix I. For multi-class ImageNet-30, the OOD samples are as follows: CUB-200 [welinder2010caltech], Stanford Dogs [khosla2011novel], Oxford Pets [parkhi2012cats], Oxford Flowers [nilsback2006visual], Food-101 [bossard2014food] without the “hotdog” class to avoid overlap, the Places-365 [zhou2017places] validation set with small images (256 * 256), Caltech-256 [griffin2007caltech], and the Describable Textures Dataset (DTD) [cimpoi2014describing]. Here, we randomly sample 3,000 images to balance with the in-distribution test set.
Evaluation metrics. For evaluation, we use two metrics that measure (a) the effectiveness of the proposed score in distinguishing in- and out-of-distribution images, and (b) the confidence calibration of the softmax classifier.
Area under the receiver operating characteristic curve (AUROC). Let TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively. The ROC curve plots the true positive rate = TP / (TP+FN) against the false positive rate = FP / (FP+TN) as the threshold varies. AUROC is the area under this curve.
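The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form, which is convenient for small examples; the function name `auroc` and the convention that positives should score higher are assumptions made for illustration.

```python
import numpy as np


def auroc(pos_scores, neg_scores):
    """AUROC via pairwise comparison of positive and negative scores.

    Equivalent to the area under the ROC curve obtained by sweeping a
    threshold; quadratic in the number of samples, so suited to small sets.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).mean()   # positive ranked above negative
    ties = (pos == neg).mean()  # ties contribute one half
    return wins + 0.5 * ties
```

For large test sets, a rank-based or library implementation would be preferable to this quadratic comparison.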
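Expected calibration error (ECE). For given test data $\{(x_n, y_n)\}_{n=1}^{N}$, we group the predictions into $M$ interval bins (each of size $1/M$). Let $B_m$ be the set of indices of samples whose prediction confidence falls into the interval $(\frac{m-1}{M}, \frac{m}{M}]$. Then, the expected calibration error (ECE) [naeini2015obtaining, guo2017calibration] is as follows:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $\mathrm{acc}(B_m)$ is the accuracy of $B_m$: $\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}\{\hat{y}_i = y_i\}$, where $\hat{y}_i$ is the predicted label of $x_i$ and $\mathbb{1}$ is the indicator function, and $\mathrm{conf}(B_m)$ is the confidence of $B_m$: $\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} q(x_i)$, where $q(x_i)$ is the confidence of data $x_i$.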
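The binning procedure for ECE can be sketched in a few lines of NumPy. This is a minimal illustration of the standard definition with equal-width, right-closed bins; the function name and the default of 10 bins are assumptions, not values taken from the text above.

```python
import numpy as np


def expected_calibration_error(confidences, predictions, labels, num_bins=10):
    """ECE with equal-width confidence bins ((m-1)/M, m/M]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            accuracy = correct[in_bin].mean()          # acc of the bin
            confidence = confidences[in_bin].mean()    # conf of the bin
            # Weight the gap by the fraction of samples in the bin.
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece
```

Empty bins contribute nothing, which matches the weighting by bin size in the definition.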
Appendix B Detailed description for confidence-calibrated classifiers
We propose a simple extension of CSI for training a confidence-calibrated classifier [hendrycks2017baseline]. To this end, we first explain supervised contrastive learning (SupCLR) [khosla2020supervised], a supervised extension of SimCLR that contrasts samples of different classes instead of different samples. Following the notation of SimCLR, consider a training batch with class labels and its augmented batch, in which each augmented sample keeps the class label of its source sample. For each label, we define the subset of the augmented batch that consists of the samples with that label. Then, the SupCLR objective contrasts each augmented sample against the augmented batch, treating the other samples in its class subset as positives and the set complement of that subset as negatives. After learning the representation with the SupCLR objective (11), we train a linear classifier, which predicts the class labels, on top of the embedding network. Here, we use the confidence (or maximum softmax probability) [hendrycks2017baseline] of this classifier to detect OOD samples.
Similar to the contrasting loss for SimCLR (3), we extend the SupCLR objective by utilizing the shifting transformations. To this end, we consider the joint label of the class label and the shifting transformation, and form a shifted batch for each shifting transformation. Then, the supervised contrasting shifted instances (sup-CSI) loss is the SupCLR objective applied to the shifted batches with the joint labels, i.e., it is defined on the self-label augmented [lee2019rethinking] space. We observe that the classifying shifted instances loss (7) does not help supervised learning, which coincides with the observation of [lee2019rethinking] that self-supervised labels often conflict with class labels. Hence, we only use the contrasting shifted instances loss (3) for our supervised experiments.
From the learned representation, we train two types of linear classifiers: (a) a class classifier, which predicts the class labels, and (b) a joint classifier, which predicts the joint labels. For the former, we use its prediction directly to compute the confidence. For the latter, we marginalize the joint prediction over the shifting transformations, similar to Section 2.3. Formally, we take the logit values of the joint classifier that correspond to each shifting transformation, aggregate them over the transformations, and apply the softmax activation to obtain the ensembled probability. Here, we use the ensembled probability to compute the confidence. We denote the confidences computed by the class classifier and by the ensembled joint classifier as “CSI” and “CSI-ens”, respectively.
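One way to carry out this marginalization is sketched below. It is an assumption-laden illustration rather than the authors' implementation: it assumes the joint classifier outputs one vector of class logits per shift label, that the input is evaluated under each of the K shifting transformations, and that the logits matching the applied shift are averaged before the softmax. The array layout and function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensembled_probability(joint_logits):
    """Marginalize joint (class, shift) logits over the shifts.

    joint_logits has shape (K, K, C): joint_logits[k, j] holds the class
    logits assigned to shift label j when the input is transformed by the
    k-th shift. Assumption: only the matching entries (j == k) are used and
    they are averaged before the softmax.
    """
    num_shifts = joint_logits.shape[0]
    idx = np.arange(num_shifts)
    matched = joint_logits[idx, idx]  # shape (K, C)
    return softmax(matched.mean(axis=0))
```

The confidence is then the maximum entry of the returned probability vector.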
Appendix C Additional one-class OOD detection results
Table 8 presents the confusion matrix of AUROC values of our method on one-class CIFAR-10 datasets, where bold denotes the hard pairs. The results align with the human intuition that ‘car’ is confused with ‘ship’ and ‘truck’, and ‘cat’ is confused with ‘dog’.
Table 9 presents the OOD detection results of various methods on one-class CIFAR-100 (super-class) datasets, for all 20 super-classes. Our method outperforms the prior methods for all classes.
Table 10 presents the OOD detection results of our method on one-class ImageNet-30 dataset, for all 30 classes. Our method consistently performs well for all classes.
Appendix D Ablation study on random augmentation
We verify that ensembling the scores over the random augmentations improves OOD detection. However, naïve random sampling from the entire set of random augmentations is often sample inefficient. We find that choosing a proper subset improves the performance for a given number of samples. Specifically, we choose the subset as the set of the most common samples. For example, the size of the cropping area is sampled from a uniform distribution over 0.08 to 1.0 during training. Since rare samples, e.g., crops with an area near the boundary of this range, increase the noise, we only use samples with a controlled crop size during inference. Table 11 shows that random sampling from the controlled set often gives improvements.
|# of samples||Controlled||OC-CIFAR-10||OC-CIFAR-100|
Appendix E Efficient computation of (6) via coreset
One can reduce the computation and memory cost of the contrastive score (6) by selecting a proper subset, i.e., a coreset, of the training samples. To this end, we run K-means clustering [macqueen1967some] on the normalized features, using cosine similarity as the metric. Then, we use the center of each cluster as the coreset. For contrasting shifted instances (4), we choose a coreset for each shifting transformation. Table 12 shows the results for various coreset sizes, given as a ratio of the full training set. Keeping only a few (e.g., 1%) of the samples is sufficient.
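A coreset of this kind can be obtained with a spherical variant of K-means, sketched below in NumPy. The sketch is illustrative only: the initialization, the fixed number of iterations, the re-normalization of the centers, and the function name `cosine_kmeans_coreset` are assumptions rather than details given above.

```python
import numpy as np


def cosine_kmeans_coreset(features, num_clusters, num_iters=20, seed=0):
    """K-means on L2-normalized features with cosine similarity.

    Returns the unit-norm cluster centers, which serve as the coreset.
    """
    rng = np.random.default_rng(seed)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Initialize the centers with randomly chosen samples (an assumption).
    centers = unit[rng.choice(len(unit), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Assign each sample to the center with the highest cosine similarity.
        assignment = (unit @ centers.T).argmax(axis=1)
        for c in range(num_clusters):
            members = unit[assignment == c]
            if len(members) > 0:
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[c] = mean / norm  # keep centers on the unit sphere
    return centers
```

For the shifted-instance score, the same routine would be run once per shifting transformation on the features of the correspondingly shifted training samples.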
Appendix F Ablation study on the balancing terms
We study the effects of the balancing terms introduced in Section 2.3. To this end, we compare the results of our final loss (5) without (w/o) and with (w/) the balancing terms. When not using the balancing terms, we set them to the same value for all shifting transformations. We follow the experimental setup of Table 1, e.g., we use rotation for the shifting transformation. We run our experiments on the CIFAR-10, CIFAR-100 (super-class), and ImageNet-30 datasets. Table 13 shows that the balancing terms give a consistent improvement. CIFAR-10 does not show much gain since all of its balancing terms take similar values; in contrast, CIFAR-100 (super-class) and ImageNet-30 show a larger gain since their balancing terms vary considerably.
|Method||CIFAR-10||CIFAR-100 (super-class)||ImageNet-30|
|CSI (w/o balancing)||94.28||89.00||91.04|
|CSI (w/ balancing)||94.31||89.55||91.63|
Appendix G Combining multiple shifting transformations
We find that combining multiple shifting transformations can give further improvements: given two transformations, we use both together as the combined shifting transformation. Table 14 shows that combining “Noise”, “Blur”, and “Perm” with “Rotate” gives an additional gain. We remark that one can investigate better combinations; we choose rotation for our experiments due to its simplicity.
Appendix H Discussion on the features of the contrastive score (6)
We find that two features, (a) the cosine similarity to the nearest training sample in the training set $\{x_m\}$, i.e., $\max_m \mathrm{sim}(z(x_m), z(x))$, and (b) the feature norm of the representation, i.e., $\|z(x)\|$, are important for detecting OOD samples under the SimCLR representation.
In this section, we first demonstrate the properties of the two features under vanilla SimCLR. While we use vanilla SimCLR to validate that they are general properties of SimCLR, we remark that our training scheme (see Section 2.2) further improves the discriminative power of the features. Next, we verify that cosine similarity and feature norm are complementary, i.e., that combining both features (as in (6)) gives an additional gain. For the latter, we use our final training loss to match the reported values in prior experiments, but we note that the trend is consistent across models.
First, we demonstrate the effect of cosine similarity for OOD detection. To this end, we train vanilla SimCLR using CIFAR-10 and CIFAR-100 as in- and out-of-distribution datasets, respectively. Since SimCLR attracts the same image under different augmentations, it learns to cluster similar images; hence, it shows good discrimination performance as measured by linear evaluation [chen2020simple]. Figure 2(a) presents the t-SNE [maaten2008visualizing] plot of the normalized features, where each color denotes a different class. Even though SimCLR is trained in an unsupervised manner, samples of the same class are gathered together.
Figure 2(b) and Figure 2(c) present the histograms of the cosine similarities to the nearest training sample for the training and test datasets, respectively. For the training set, we choose the second nearest sample since the nearest one is the sample itself. One can see that the training samples are concentrated, even though contrastive learning pushes different samples apart. This complements the results of Figure 2(a). For the test sets, the in-distribution samples show a similar trend to the training samples. However, the OOD samples are farther from the training samples, which implies that the cosine similarity is an effective feature for detecting OOD samples.
Second, we demonstrate that the feature norm is a discriminative feature for OOD detection. Following the prior setting, we use CIFAR-10 and CIFAR-100 for in- and out-of-distribution datasets, respectively. Figure 3(a) shows that the discriminative power of feature norm improves as the training epoch increases. We observe that this phenomenon consistently happens over models and settings; the contrastive loss makes the norm of in-distribution samples relatively larger than OOD samples. Figure 3(b) shows the norm of CIFAR-10 is indeed larger than CIFAR-100, under the final model.
This is somewhat unintuitive since SimCLR uses the normalized features to compute the loss (1). To understand this phenomenon, we visualize the t-SNE [maaten2008visualizing] plot of the feature space in Figure 3(c), randomly choosing 100 images from both datasets. We randomly augment each image 100 times for better visualization. One can see that in-distribution samples tend to be spread out over a large sphere, while OOD samples are gathered near the center (a t-SNE plot does not tell the true behavior of the original feature space, but it may give some intuition). Also, note that the same image with different augmentations is highly clustered, while in-distribution samples are slightly more assembled. We also tried the local variance of the norm as a detection score; it also works well, but the norm is better.
We suspect that increasing the norm may be an easier way to maximize cosine similarity between two vectors: instead of directly reducing the feature distance of two augmented samples, one can also increase the overall norm of the features to reduce the relative distance of two samples.
Finally, we verify that cosine similarity (sim-only) and feature norm (norm-only) are complementary: combining them (sim+norm) gives additional improvements. Here, we use the model trained by our final objective (5), and follow the inference scheme of the main experiments (see Table 7). Table 15 shows AUROC values under sim-only, norm-only, and sim+norm scores. Using only sim or norm already shows good results, but combining them shows the best results.
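The sim-only, norm-only, and sim+norm scores can be computed from penultimate features as in the following NumPy sketch. It assumes that the combined score is the product of the two features and that features are given as row vectors; the function name `contrastive_scores` is hypothetical.

```python
import numpy as np


def contrastive_scores(test_feats, train_feats):
    """Return (sim-only, norm-only, sim+norm) scores for each test feature.

    sim-only:  cosine similarity to the nearest training feature.
    norm-only: L2 norm of the test feature.
    sim+norm:  product of the two (assumed form of the combined score).
    """
    test_feats = np.asarray(test_feats, dtype=float)
    train_feats = np.asarray(train_feats, dtype=float)
    norms = np.linalg.norm(test_feats, axis=1)
    test_unit = test_feats / norms[:, None]
    train_unit = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    max_sim = (test_unit @ train_unit.T).max(axis=1)
    return max_sim, norms, max_sim * norms
```

A higher score indicates that the sample is more likely to be in-distribution; the training features can be replaced by a coreset to reduce cost.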
Appendix I Rethinking OOD detection benchmarks
We find that resized LSUN and ImageNet [liang2018enhancing], among the most popular benchmark datasets for OOD detection, are visually far from in-distribution datasets (commonly, CIFAR [krizhevsky2009learning]). Figure 6 shows that resized LSUN and ImageNet contain artificial noise produced by broken image operations (this is also reported in https://twitter.com/jaakkolehtinen/status/1258102168176951299). This is problematic since one can detect such datasets with simple data statistics, without understanding semantics from neural networks. To progress OOD detection research one step further, one needs harder or more semantic OOD samples that cannot be easily detected by data statistics.
To verify this, we propose a simple detection score that measures the input smoothness of an image. Intuitively, noisy images would have a higher variation in input space than natural images. Formally, let $x_i$ be the $i$-th value of the vectorized image $x$. Here, we define the neighborhood as the set of spatially connected pairs of pixel indices. Then, the total variation distance aggregates the differences between the values of all neighboring pixel pairs.
Then, we define the smoothness score as the difference between the total variation of a given image and that of the training samples.
Table 16 shows that this simple score detects current benchmark datasets surprisingly well.
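A minimal sketch of such a smoothness score is given below. The exact form is an assumption: it uses squared differences between horizontally and vertically adjacent pixels as the total variation, and the absolute deviation from the mean total variation of the training images as the score. The function names are hypothetical.

```python
import numpy as np


def total_variation(image):
    """Sum of squared differences between adjacent pixels of an (H, W, C) image.

    Assumption: the neighborhood consists of vertically and horizontally
    adjacent pixel pairs, and differences are squared.
    """
    image = np.asarray(image, dtype=float)
    vertical = np.diff(image, axis=0)
    horizontal = np.diff(image, axis=1)
    return float((vertical ** 2).sum() + (horizontal ** 2).sum())


def smoothness_score(image, train_images):
    """Absolute gap between an image's total variation and the training mean."""
    train_tv = np.mean([total_variation(x) for x in train_images])
    return abs(total_variation(image) - train_tv)
```

In practice, the mean total variation of the training set would be computed once and reused for all test images.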
To address this issue, we construct new benchmark datasets using a fixed resize operation (the PyTorch torchvision.transforms.Resize() operation), hence coined LSUN (FIX) and ImageNet (FIX). For LSUN (FIX), we randomly sample 1,000 images from each of the ten classes of the LSUN training set. For ImageNet (FIX), we randomly sample 10,000 images from the entire training set of ImageNet-30, excluding the “airliner”, “ambulance”, “parking-meter”, and “schooner” classes to avoid overlap with CIFAR-10. We provide the datasets and data generation code at https://github.com/alinlab/CSI. Figure 6 shows that the new datasets are more visually realistic than the former ones (Figure 6). Also, Table 16 shows that the fixed datasets are not detected by the simple data statistics (15). We believe our newly produced datasets would be a stronger benchmark for hard or semantic OOD detection in future research.
|SVHN||LSUN||ImageNet||LSUN (FIX)||ImageNet (FIX)||CIFAR-100||Interp.|
Appendix J Additional discussion on shifting transformation
As remarked in Section 3.2, the appropriate shifting transformation can depend on the dataset. This is crucial for real-world scenarios since many practical applications deal with non-natural images, e.g., steel (https://www.kaggle.com/c/severstal-steel-defect-detection) or textile (https://lmb.informatik.uni-freiburg.de/resources/datasets/tilda.en.html) images in manufacturing, or aerial [xia2018dota] images. For such datasets, one should not use rotation as a shifting transformation. We present OOD detection results using Steel and ImageNet-30 as in- and out-of-distribution datasets, respectively. Table 17 shows results similar to those in Section 3.2: using rotation as the shifting transformation degrades the performance. Investigating new transformations that consider the characteristics of the datasets would be an interesting future direction.