1 Introduction
Out-of-distribution (OOD) detection [hodge2004survey], also referred to as novelty or anomaly detection, is the task of identifying whether a test input is drawn far from the training distribution (in-distribution) or not. In general, the OOD detection problem aims to detect OOD samples when the detector is allowed to access only the training data. The space of OOD samples is typically huge (compared to that of the in-distribution), i.e., an OOD sample can vary significantly and arbitrarily from the given training distribution. Hence, assuming specific prior knowledge, e.g., external data representing some specific OODs, may introduce a bias to the detector. OOD detection is a classic yet essential problem in machine learning, with a broad range of applications, including medical diagnosis [caruana2015intelligible], fraud detection [phua2010comprehensive], and autonomous driving [eykholt2018robust]. A long line of literature has thus been proposed, including density-based [zhai2016deep, nalisnick2019deep, choi2018waic, du2019implicit, ren2019likelihood, serra2020input, grathwohl2020your], reconstruction-based [schlegl2017unsupervised, zong2018deep, pidhorskyi2018generative, perera2019ocgan, choi2020novelty]
, one-class classifier [scholkopf2000support, ruff2018deep, ruff2020deep], and self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] approaches. Overall, a majority of the recent literature is concerned with (a) modeling the representation of a given sample to better encode normality [hendrycks2019using_pre, hendrycks2019using_self], and (b) defining a new detection score [ruff2018deep, bergman2020classification]. For example, recent studies have shown that inductive biases from neural networks significantly help to learn discriminative features for OOD detection
[ruff2018deep, hendrycks2019using_self]. Meanwhile, recent progress on unsupervised representation learning has proven the effectiveness of contrastive learning in various domains, e.g., computer vision [henaff2019data, tian2019contrastive, he2019momentum, chen2020simple], audio processing [oord2018representation], and reinforcement learning [srinivas2020curl]. More specifically, contrastive learning extracts a strong inductive bias from data by pulling similar samples together while pushing the others apart. Instance discrimination [wu2018unsupervised] is a special type of contrastive learning that has achieved state-of-the-art results [he2019momentum, chen2020simple]; it only pulls together copies of the same instance under different augmentations. Inspired by the recent success of instance discrimination, we aim to utilize its power of representation learning for OOD detection. To this end, we investigate the following questions: (a) how to learn a (more) discriminative representation for detecting OODs, and (b) how to design a score function utilizing the representation from (a). We remark that the desired representation for OOD detection may differ from that for standard representation learning [hendrycks2019using_pre, hendrycks2019using_self], as the former aims to discriminate in-distribution and OOD samples, while the latter aims to discriminate within in-distribution samples.
We first find that an existing contrastive learning scheme for visual representation is already reasonably effective for detecting OOD samples, given an appropriate detection score. We further observe that one can improve its performance by utilizing "hard" augmentations, e.g., rotation, that are known to be harmful to (and hence unused in) standard contrastive learning [chen2020simple]. In particular, while existing contrastive learning schemes act by pulling all augmented samples toward the original sample, we suggest additionally pushing samples under hard, distribution-shifting augmentations away from the original. We observe that contrasting shifted samples helps OOD detection (although it may not help in-distribution classification; see Section 3.2), as the model now learns a new task of discriminating between in- and out-of-distribution, in addition to the original task of discriminating within the in-distribution.
Contribution. We propose a simple yet effective method for OOD detection, coined contrasting shifted instances (CSI). Built upon the existing contrastive learning scheme [chen2020simple], we propose two novel additional components: (a) a new training method which contrasts distributionally-shifted augmentations (of the given sample) in addition to other instances, and (b) a score function which utilizes both the contrastively learned representation and the new training scheme in (a). Finally, we show that CSI enjoys broader usage by applying it to improve the confidence calibration of classifiers: it relaxes the overconfidence issue in their predictions for both in- and out-of-distribution samples while maintaining classification accuracy.
We verify the effectiveness of CSI under various environments for detecting OOD, including unlabeled one-class, unlabeled multi-class, and labeled multi-class settings. To the best of our knowledge, we are the first to demonstrate all three settings under a single framework. Overall, CSI outperforms the baseline methods on all tested datasets. In particular, CSI achieves new state-of-the-art results (we do not compare with methods using external OOD samples [hendrycks2019deep, ruff2020deep]) on one-class classification, e.g., it improves the mean area under the receiver operating characteristic (AUROC) from 90.1% to 94.3% (+4.2%) for CIFAR-10 [krizhevsky2009learning], from 79.8% to 89.6% (+9.8%) for CIFAR-100 [krizhevsky2009learning], and from 85.7% to 91.6% (+5.9%) for ImageNet-30 [hendrycks2019using_self] one-class datasets, respectively. We remark that CSI gives a larger improvement on harder (or near-distribution) OOD samples. To verify this, we also release new benchmark datasets: fixed versions of the resized LSUN and ImageNet [liang2018enhancing]. We remark that learning representations to discriminate between in- and out-of-distributions is an important but underexplored problem. We believe our work will guide new interesting directions in the future, for both representation learning and OOD detection.
2 CSI: Contrasting shifted instances
For a given dataset $\{x_m\}_{m=1}^{M}$ sampled from a data distribution $p_{\text{data}}(x)$ on the data space $\mathcal{X}$, the goal of out-of-distribution (OOD) detection is to model a detector that identifies whether $x$ is sampled from the data-generating distribution $p_{\text{data}}(x)$ (the in-distribution) or not. As modeling $p_{\text{data}}(x)$ directly is prohibitive in most cases, many existing methods for OOD detection instead define a score function $s(x)$ whose high value heuristically indicates that $x$ is from the in-distribution. In Section 2.1, we first briefly review the preliminaries on contrastive learning. Then, we describe the two components of our method, contrasting shifted instances (CSI): the training scheme and the corresponding score function, in Section 2.2 and Section 2.3, respectively. Finally, we propose an extension of CSI for training confidence-calibrated classifiers in Section 2.4.
2.1 Contrastive learning
The idea of contrastive learning is to learn an encoder $f_\theta$ that extracts the information necessary to distinguish similar samples from the others. Let $x$ be a query, and $\{x^{+}\}$ and $\{x^{-}\}$ be sets of positive and negative samples, respectively, and let $\text{sim}(z, z') := z \cdot z' / (\lVert z \rVert \lVert z' \rVert)$ be the cosine similarity. Then, the primitive form of the contrastive loss is defined as follows:

$$\mathcal{L}_{\text{con}}(x, \{x^{+}\}, \{x^{-}\}) := -\frac{1}{|\{x^{+}\}|} \log \frac{\sum_{x' \in \{x^{+}\}} \exp(\text{sim}(z(x), z(x')) / \tau)}{\sum_{x' \in \{x^{+}\} \cup \{x^{-}\}} \exp(\text{sim}(z(x), z(x')) / \tau)} \quad (1)$$

where $|\cdot|$ denotes the cardinality of the set, $z(\cdot)$ denotes the output feature of the contrastive layer, and $\tau$ denotes a temperature hyper-parameter. One can define the contrastive feature directly from the encoder, i.e., $z(x) = f_\theta(x)$ [he2019momentum], or apply an additional projection layer $g_\phi$, i.e., $z(x) = g_\phi(f_\theta(x))$ [chen2020simple]. We use the projection layer, following the recent studies [chen2020simple, chen2020improved].
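As a concrete reference, the primitive contrastive loss above can be sketched in a few lines of plain Python (a minimal sketch for a single query over raw feature vectors; the function names are illustrative, not the authors' implementation):

```python
import math

def _norm(z):
    return math.sqrt(sum(a * a for a in z))

def cosine_sim(z1, z2):
    # sim(z, z') = z.z' / (||z|| ||z'||)
    return sum(a * b for a, b in zip(z1, z2)) / (_norm(z1) * _norm(z2))

def contrastive_loss(z_q, z_pos, z_neg, tau=0.5):
    """Primitive contrastive loss (1) for a single query feature z_q:
    -(1/|pos|) * log( sum over positives of e^{sim/tau}
                      / sum over positives and negatives of e^{sim/tau} )."""
    pos = [math.exp(cosine_sim(z_q, z) / tau) for z in z_pos]
    neg = [math.exp(cosine_sim(z_q, z) / tau) for z in z_neg]
    return -math.log(sum(pos) / (sum(pos) + sum(neg))) / len(pos)
```

The loss is near zero when the query aligns with its positives and grows large when it aligns with a negative instead.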
In this paper, we specifically consider SimCLR [chen2020simple], a simple and effective contrastive objective based on the task of instance discrimination [wu2018unsupervised]. Let $\tilde{x}^{(1)} := T_1(x)$ and $\tilde{x}^{(2)} := T_2(x)$ be two independent augmentations of $x$, where $T_1, T_2$ are sampled from a pre-defined family of transformations $\mathcal{T}$. Then the SimCLR objective can be defined by the contrastive loss (1), where each pair $(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)})$ is considered as a query-key pair while the others are negatives. Namely, for a given batch $\mathcal{B} := \{x_i\}_{i=1}^{B}$, the SimCLR objective is defined as follows:

$$\mathcal{L}_{\text{SimCLR}}(\mathcal{B}; \mathcal{T}) := \frac{1}{2B} \sum_{i=1}^{B} \left[ \mathcal{L}_{\text{con}}(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)}, \tilde{\mathcal{B}}_{-i}) + \mathcal{L}_{\text{con}}(\tilde{x}_i^{(2)}, \tilde{x}_i^{(1)}, \tilde{\mathcal{B}}_{-i}) \right] \quad (2)$$

where $\tilde{\mathcal{B}} := \{\tilde{x}_i^{(1)}\}_{i=1}^{B} \cup \{\tilde{x}_i^{(2)}\}_{i=1}^{B}$ and $\tilde{\mathcal{B}}_{-i}$ denotes $\tilde{\mathcal{B}}$ excluding the query-key pair of $x_i$.
2.2 Contrastive learning for distribution-shifting transformations
chen2020simple performed an extensive study on which family of augmentations leads to a better representation when used in SimCLR, i.e., which transformations should be considered as positives. Overall, the authors report that some of the examined augmentations (e.g., rotation) sometimes degrade the discriminative performance of SimCLR. One of our key findings is that such augmentations can be useful for OOD detection when considered as negatives instead, i.e., contrasted against the original sample. In this paper, we explore which family of augmentations $\mathcal{S}$, which we call distribution-shifting transformations, or simply shifting transformations, leads to a better representation in terms of OOD detection when used as negatives in SimCLR.
Contrasting shifted instances. We consider a set $\mathcal{S}$ of $K$ different (random or deterministic) transformations, including the identity $I$: namely, we denote $\mathcal{S} := \{S_0 = I, S_1, \ldots, S_{K-1}\}$. In contrast to the vanilla SimCLR, which considers augmented samples as positives to each other, we instead consider samples as negatives if they are obtained from different shifting transformations. For a given batch of samples $\mathcal{B}$, this can be done simply by augmenting $\mathcal{B}$ via $\mathcal{S}$ before putting it into the SimCLR loss defined in (2): namely, we define the contrasting shifted instances (con-SI) loss as follows:

$$\mathcal{L}_{\text{con-SI}} := \mathcal{L}_{\text{SimCLR}}\Big( \bigcup_{S \in \mathcal{S}} \mathcal{B}_S;\ \mathcal{T} \Big), \quad \text{where}\ \mathcal{B}_S := \{S(x_i)\}_{i=1}^{B} \quad (3)$$
Here, our intuition is to regard each distributionally-shifted sample (i.e., $S \neq I$) as OOD with respect to the original. In this respect, con-SI attempts to discriminate an in-distribution (i.e., $S = I$) sample from the other, OOD-like (i.e., $S \neq I$) samples. We further verify the effectiveness of con-SI in our experimental results: although con-SI does not improve the representation for standard classification, it does improve OOD detection significantly.
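To make the batch construction concrete, the following sketch builds the aggregated batch when the shifting transformations are the four rotations (a hypothetical NumPy sketch, not the authors' code; it also returns a per-sample shift label, used later by the auxiliary shift classifier):

```python
import numpy as np

def rot90_batch(batch, k):
    # Rotate a batch of HxWxC images by k * 90 degrees (counter-clockwise).
    return np.rot90(batch, k=k, axes=(1, 2)).copy()

def shifted_batch(batch):
    """Build the aggregated batch for con-SI when S is the four rotations
    {0, 90, 180, 270} degrees (identity included): the union of the
    rotated copies of the batch, plus a shift label per sample."""
    xs = [rot90_batch(batch, k) for k in range(4)]
    ys = [np.full(len(batch), k) for k in range(4)]
    return np.concatenate(xs), np.concatenate(ys)
```

The aggregated batch is then fed to the SimCLR loss, so that differently rotated copies naturally become negatives of each other.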
Classifying shifted instances. In addition to contrasting shifted instances, we consider an auxiliary task that predicts which shifting transformation $y^{S} \in \mathcal{S}$ is applied to a given input $x$, in order to facilitate discriminating each shifted instance. Specifically, we add a linear layer to $f_\theta$ for modeling an auxiliary softmax classifier $p(y^{S} \mid x)$, as in [golan2018deep, hendrycks2019using_self, bergman2020classification]. Let $\tilde{\mathcal{B}}_S$ be the batch augmented from $\mathcal{B}_S$ via SimCLR; then, we define the classifying shifted instances (cls-SI) loss as follows:

$$\mathcal{L}_{\text{cls-SI}} := \frac{1}{2B} \frac{1}{K} \sum_{S \in \mathcal{S}} \sum_{\tilde{x}_S \in \tilde{\mathcal{B}}_S} -\log p(y^{S} = S \mid \tilde{x}_S) \quad (4)$$
The final loss of our proposed method, CSI, is defined by combining the two objectives:

$$\mathcal{L}_{\text{CSI}} := \mathcal{L}_{\text{con-SI}} + \lambda \cdot \mathcal{L}_{\text{cls-SI}} \quad (5)$$

where $\lambda > 0$ is a balancing hyper-parameter. We simply set $\lambda = 1$ for all our experiments.
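The auxiliary classification term is plain cross-entropy over the applied shifts; a minimal stdlib sketch (the function and argument names are illustrative assumptions):

```python
import math

def cls_si_loss(shift_logits, shift_labels):
    """cls-SI loss as plain cross-entropy: the average, over the
    shift-augmented batch, of -log softmax probability assigned to the
    shift that was actually applied. shift_logits[i] holds the K-way
    logits of the auxiliary shift classifier for sample i."""
    total = 0.0
    for logits, label in zip(shift_logits, shift_labels):
        m = max(logits)                    # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[label] - log_z)  # -log p(y_S = label | x)
    return total / len(shift_labels)
```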
2.3 Score functions for detecting out-of-distribution
Upon the representation learned with our proposed training objective, we define several score functions for detecting out-of-distribution, i.e., for deciding whether a given $x$ is OOD or not. We first propose a detection score that is applicable to any contrastive representation. We then introduce how one can incorporate the additional information learned by contrasting (and classifying) shifted instances as in (5).
Detection score for contrastive representation. Overall, we find that two features from SimCLR representations are surprisingly effective for detecting OOD samples: (a) the cosine similarity to the nearest training sample in $\{x_m\}$, i.e., $\max_m \text{sim}(z(x_m), z(x))$, and (b) the norm of the representation, i.e., $\lVert z(x) \rVert$. We discuss a detailed analysis of both features in Appendix H. We simply combine these two features to define a detection score for contrastive representation:

$$s_{\text{con}}(x; \{x_m\}) := \max_m \text{sim}(z(x_m), z(x)) \cdot \lVert z(x) \rVert \quad (6)$$
We also discuss how one can reduce the computation and memory cost by choosing a proper subset (i.e., coreset) of training samples in Appendix E.
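Over precomputed features, the score in (6) is a one-liner; a hypothetical NumPy sketch (feature extraction is assumed done elsewhere):

```python
import numpy as np

def _unit(z):
    # L2-normalize along the last axis.
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def con_score(z_x, z_train):
    """Contrastive detection score (6): max cosine similarity to the
    training features, times the norm of the test feature. Higher means
    more likely in-distribution. z_x has shape (d,), z_train shape (M, d)."""
    sims = _unit(z_train) @ _unit(z_x)
    return float(sims.max() * np.linalg.norm(z_x))
```

A test feature that is both well-aligned with some training feature and large in norm receives a high score.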
Utilizing shifting transformations. Given that our proposed loss (5) is used for training, one can further improve the detection score significantly by incorporating the shifting transformations $\mathcal{S}$. Here, we propose two additional scores, $s_{\text{con-SI}}$ and $s_{\text{cls-SI}}$, which correspond to (3) and (4), respectively.
Firstly, we define $s_{\text{con-SI}}$ by taking an expectation of $s_{\text{con}}$ over $S \in \mathcal{S}$:

$$s_{\text{con-SI}}(x; \{x_m\}) := \sum_{S \in \mathcal{S}} \lambda_S^{\text{con}} \, s_{\text{con}}(S(x); \{S(x_m)\}) \quad (7)$$

where the $\lambda_S^{\text{con}}$ are used as balancing terms, since the shifting transformations can change the feature statistics (see Appendix F for the details).
Secondly, we define $s_{\text{cls-SI}}$ utilizing the auxiliary classifier $p(y^{S} \mid x)$ as follows:

$$s_{\text{cls-SI}}(x) := \sum_{S \in \mathcal{S}} \lambda_S^{\text{cls}} \, W_S \, f_\theta(S(x)) \quad (8)$$

where the $\lambda_S^{\text{cls}}$ are again balancing terms, similarly to the above, and $W_S$ is the weight vector in the linear layer of $p(y^{S} \mid x)$ for each $S$. Finally, the combined score for the CSI representation is defined as follows:

$$s_{\text{CSI}}(x; \{x_m\}) := s_{\text{con-SI}}(x; \{x_m\}) + s_{\text{cls-SI}}(x) \quad (9)$$
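The classification-based component (8) reduces to weighted dot products once features are precomputed; a hypothetical sketch (uniform balancing terms are an illustrative default, not the paper's choice):

```python
import numpy as np

def cls_si_score(f_shift, W, lam=None):
    """Classification-based score (8): the sum over shifting
    transformations of lam[s] * W[s]^T f(S_s(x)) -- the (unnormalized)
    confidence that each shifted copy is recognized as its own shift.
    f_shift[s] is the encoder feature of the s-th shifted copy of x,
    W[s] the shift classifier's weight vector for shift s, and lam the
    balancing terms (uniform if omitted)."""
    K = len(f_shift)
    lam = [1.0 / K] * K if lam is None else lam
    return float(sum(l * (w @ f) for l, w, f in zip(lam, W, f_shift)))
```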
Ensembling over random augmentations. In addition, we find that one can further improve each of the proposed scores by ensembling it over random augmentations $T \sim \mathcal{T}$. Namely, for instance, the ensembled CSI score is defined by $s_{\text{CSI-ens}}(x) := \mathbb{E}_{T \sim \mathcal{T}}[s_{\text{CSI}}(T(x))]$. Unless otherwise noted, we use these ensembled versions of (6), (7), (8), and (9) in our experiments. See Appendix D for details.
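The ensembling step is a simple Monte-Carlo average and applies to any of the scores above; a generic sketch (`score_fn` and `augment_fn` are placeholders for any of the scores (6)-(9) and the augmentation family):

```python
import random

def ensembled_score(x, score_fn, augment_fn, n_aug=8, seed=0):
    """Approximate the ensembled score E_{T ~ T}[ s(T(x)) ] by averaging
    the score over n_aug randomly augmented copies of x."""
    rng = random.Random(seed)
    return sum(score_fn(augment_fn(x, rng)) for _ in range(n_aug)) / n_aug
```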
2.4 Extension for training confidence-calibrated classifiers
Furthermore, we show that our proposed method can be extended for training a confidence-calibrated classifier [hendrycks2017baseline, lee2018training] from a given labeled dataset $\{(x_m, y_m)\}_{m=1}^{M}$: Here, the goal is to model a classifier $p(y \mid x)$ that is (a) accurate in predicting $y$ for a given in-distribution sample $x$, and (b) well-calibrated in its confidence [hendrycks2017baseline], e.g., the confidence $\max_y p(y \mid x)$ should be low when $x$ is an OOD sample or when the prediction is incorrect. In our experiments, we measure the performance of the confidence on detecting OOD samples to evaluate a confidence-calibrated classifier.
To this end, we extend CSI with supervised contrastive learning (SupCLR) [khosla2020supervised], a supervised extension of SimCLR that contrasts different classes instead of individual samples. We extend our training method (Section 2.2) to SupCLR by contrasting the self-label augmented [lee2019rethinking] label space $\mathcal{Y} \times \mathcal{S}$, where $\mathcal{S}$ is the set of shifting transformations. In a similar manner to (3), this can be done simply by augmenting the samples via $\mathcal{S}$ before putting them into the SupCLR loss. From the learned representation, we train two types of linear classifiers: (a) $p(y \mid x)$, which predicts the class label, and (b) $p(y, y^{S} \mid x)$, which predicts the joint probability distribution over class labels and shifting transformations. We then marginalize the prediction (b) over $\mathcal{S}$, similarly to the ensembled score in Section 2.3, obtaining (c) $p_{\text{ens}}(y \mid x)$. More details on how CSI can be integrated with SupCLR are presented in Appendix B.

3 Experiments


In Section 3.1, we report OOD detection results on unlabeled one-class, unlabeled multi-class, and labeled multi-class datasets. In Section 3.2, we analyze the effects of various shifting transformations in the context of OOD detection, and present an ablation study on each component we propose.
Setup. We use the ResNet-18 [he2016deep] architecture for all experiments. For data augmentations, we adopt those used by chen2020simple: namely, we use the combination of Inception crop [szegedy2015going], horizontal flip, color jitter, and grayscale as $\mathcal{T}$. Unless specified otherwise, we take the shifting transformation $\mathcal{S}$ to be random rotation $\{0°, 90°, 180°, 270°\}$. We remark that one may further improve the performance by incorporating different transformations: see Table 5 for an ablation study on transformations other than rotation. By default, we train our models with the training objective in (5) and detect OOD samples with the ensembled version of the score in (9).
We mainly report the area under the receiver operating characteristic curve (AUROC) as a threshold-free evaluation metric for a detection score. In addition, we also report the test accuracy and the expected calibration error (ECE) [naeini2015obtaining, guo2017calibration] for the experiments on labeled multi-class datasets. Here, ECE estimates whether a classifier can indicate when it is likely to be incorrect on test samples (from the in-distribution), by measuring the difference between prediction confidence and accuracy. Formal descriptions of the metrics and detailed experimental setups are in Appendix A.

3.1 Main results
Unlabeled one-class datasets. We start by considering the one-class setup: here, for a given multi-class dataset of $C$ classes, we conduct $C$ one-class classification tasks, where each task chooses one of the classes as the in-distribution while the remaining classes are regarded as out-of-distribution. We run our experiments on three datasets, following prior work [golan2018deep, hendrycks2019using_self, bergman2020classification]: CIFAR-10 [krizhevsky2009learning], CIFAR-100 labeled into 20 super-classes [krizhevsky2009learning], and ImageNet-30 [hendrycks2019using_self]. We compare our method with various prior methods, including one-class classifier [scholkopf2000support, ruff2018deep], reconstruction-based [schlegl2017unsupervised, perera2019ocgan], and self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] approaches. Table 1 summarizes the results, showing that CSI significantly outperforms the prior methods in all tested cases. We provide the full, additional results, e.g., class-wise results on CIFAR-100 (super-class) and ImageNet-30, in Appendix C.
Unlabeled multi-class datasets. In this setup, we assume that in-distribution samples are from a specific multi-class dataset without labels, and test on various external datasets as out-of-distribution. We compare our method on two in-distribution datasets: CIFAR-10 [krizhevsky2009learning] and ImageNet-30 [hendrycks2019using_self]. We consider the following datasets as out-of-distribution: SVHN [netzer2011reading], resized LSUN and ImageNet [liang2018enhancing], CIFAR-100 [krizhevsky2009learning], and linearly-interpolated samples of CIFAR-10 (Interp.) [du2019implicit] for the CIFAR-10 experiments; and CUB-200 [welinder2010caltech], Dogs [khosla2011novel], Pets [parkhi2012cats], Flowers [nilsback2006visual], Food-101 [bossard2014food], Places365 [zhou2017places], Caltech-256 [griffin2007caltech], and DTD [cimpoi2014describing] for ImageNet-30. We compare our method with various prior methods, including density-based [du2019implicit, ren2019likelihood, serra2020input] and self-supervised [golan2018deep, bergman2020classification] approaches. Table 2 shows the results. Overall, we observe that our method significantly outperforms prior methods on all tested benchmarks. We remark that our method is particularly effective for detecting hard (i.e., near-distribution) OOD samples, e.g., CIFAR-100 and Interp. in Table 2. Also, our method still shows notable performance in cases where prior methods often fail, e.g., Places365 in Table 2. Finally, we notice that the resized LSUN and ImageNet datasets officially released by liang2018enhancing can be misleading for evaluating detection performance on hard OODs: we find that those datasets contain some unintended artifacts due to an incorrect resizing procedure. Such artifacts make those datasets easily detectable, e.g., via input statistics. In this respect, we produce and test on their fixed versions (we provide the code and datasets in https://github.com/alinlab/CSI), coined LSUN (FIX) and ImageNet (FIX). See Appendix I for details.
Labeled multi-class datasets. We also consider the labeled version of the above setting: namely, we now assume that every in-distribution sample also carries discriminative label information. We use the same in- and out-of-distribution datasets as in the unlabeled multi-class setup. We train our model as proposed in Section 2.4, and compare it with models trained by other methods, namely cross-entropy and supervised contrastive learning (SupCLR) [khosla2020supervised]. Since our goal is to calibrate the confidence, the maximum softmax probability is used to detect OOD samples (see [hendrycks2017baseline]).
Table 3 shows the results. Interestingly, our method consistently improves AUROC and ECE for ImageNet-30 while maintaining test accuracy. This supports our intuition that CSI learns discriminative information for in- vs. out-of-distribution samples, in addition to that within the in-distribution. One can also observe that CSI further improves the performance by ensembling over the transformations. We remark that our results on unlabeled datasets (in Table 2) already show comparable performance to the supervised baselines (in Table 3).
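The ECE metric used in these comparisons can be sketched as follows (a standard binned-ECE sketch; the 15-bin default is a common convention, not necessarily the exact setting used in the experiments):

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: partition predictions into equal-width confidence bins over
    (0, 1] and take the sample-weighted average gap between the mean
    confidence and the accuracy inside each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated classifier (confidence matching accuracy in every bin) attains ECE of zero; an always-confident but always-wrong classifier attains ECE of one.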
3.2 Ablation study
We perform an ablation study on various shifting transformations, training losses, and detection scores. Throughout this section, we report the mean AUROC values on one-class CIFAR-10.
Shifting transformation. We test various data transformations as the shifting transformation. In particular, we consider Cutout [devries2017improved], Sobel filtering [kanopoulos1988design], Gaussian noise, Gaussian blur, and rotation [gidaris2018unsupervised]. We remark that these transformations were reported to be ineffective in improving the class-discriminative power of SimCLR [chen2020simple]. In addition, we also consider a transformation coined "Perm", which evenly partitions the image and randomly permutes the parts. Intuitively, such transformations commonly shift the input distribution; hence, forcing them to be aligned can be harmful. Figure 1 visualizes all the considered transformations.
Cutout  Sobel  Noise  Blur  Perm  Rotate  
AUROC  79.5  69.2  74.4  76.0  83.8  85.2 


Table 4 shows AUROC values of the vanilla SimCLR, where the in-distribution samples shifted by the chosen transformation are given as OOD samples. The shifted samples are easily detected, which validates our intuition that the considered transformations shift the input distribution. In particular, "Perm" and "Rotate" are the most distinguishable, which implies they shift the distribution the most. Note that "Perm" and "Rotate" also turn out to be the most effective shifting transformations; this implies that the transformations that shift the distribution the most indeed perform best for our method. (We also tried contrasting against external OOD samples, in a similar manner to [hendrycks2019deep, ruff2020deep]; however, we find that naïvely using them in our framework degrades the performance. This is because the contrastive loss also discriminates within the external OOD samples, which is an unnecessary, additional learning burden for our purpose. Shifted samples can serve as the most effective ('nearby' but 'not-too-nearby') OOD without this issue.)
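For reference, the "Perm" transformation above can be sketched as follows (a hypothetical NumPy sketch; the 2x2 grid size is an illustrative assumption, not necessarily the partition used in the paper):

```python
import numpy as np

def perm_transform(img, n=2, rng=None):
    """The "Perm" shifting transformation: partition an HxWxC image into
    an n-by-n grid of equal patches and randomly permute the patches."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[0] // n, img.shape[1] // n
    patches = [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
               for r in range(n) for c in range(n)]
    order = rng.permutation(n * n)
    rows = [np.concatenate([patches[order[r * n + c]] for c in range(n)], axis=1)
            for r in range(n)]
    return np.concatenate(rows, axis=0)
```

Note that the transformation preserves the pixel values exactly (only their spatial arrangement changes), so any detectability of permuted images comes purely from the distribution shift.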
Besides, we apply each transformation on top of the vanilla SimCLR in two ways: aligning the transformed samples with the original samples (i.e., using the transformation as part of $\mathcal{T}$), or treating them as shifted samples (i.e., using it as $\mathcal{S}$). Table 5(a) shows that aligning the transformations degrades (or is on par with) the detection performance, while shifting the transformations gives consistent improvements. We also remove or convert-to-shift the transformations of the vanilla SimCLR in Table 5(b), and see similar results. We remark that one can further improve the performance by combining multiple shifting transformations (see Appendix G).
Linear evaluation. We also measure the linear evaluation [kolesnikov2019revisiting], i.e., the accuracy of a linear classifier trained to discriminate the classes of in-distribution samples. It is widely used for evaluating the quality of (unsupervised) learned representations. We report the linear evaluation of the vanilla SimCLR and CSI (with rotation as the shifting transformation), trained on unlabeled CIFAR-10. They show comparable results, 90.48% for SimCLR and 90.19% for CSI; CSI is simply more specialized toward learning a representation for OOD detection.
Data-dependence of the shifting transformation. We remark that the best choice of shifting transformation for detecting OODs depends on the dataset. For instance, consider a rotation-invariant dataset such as textures: here, rotation should not be a shifting transformation. Table 6 shows the AUROC values where the Describable Textures Dataset (DTD) [cimpoi2014describing] and ImageNet-30 are the in- and out-of-distribution samples, respectively. We compare the vanilla SimCLR and CSI using rotation as the shifting transformation, denoted "Base" and "CSI (Rotation)", respectively. Unlike for natural images, shifting rotated images degrades OOD detection. See Appendix J for additional discussion.
Training loss. In Table 5(c), we assess the individual effects of each component of our final training loss (5): namely, we compare the vanilla SimCLR (2), contrasting shifted instances (3), and classifying shifted instances (4) losses. For evaluating the models trained with the different losses (2)-(5), we use the detection scores defined in (6)-(9), respectively. We remark that both contrasting and classifying show better results than the vanilla SimCLR, and combining them (i.e., the final CSI loss (5)) gives further improvements, i.e., the two losses are complementary.
Detection score. Finally, Table 5(d) shows the effect of each component in our detection score: the vanilla contrastive (6), contrasting shifted instances (7), and classifying shifted instances (8) scores. We ensemble the scores over both $\mathcal{S}$ and $\mathcal{T}$ for (7)-(9), and use a single sample for (6). All reported values are evaluated on the model trained with the final loss (5). Similarly to the above, both the contrasting and classifying scores show better results than the vanilla contrastive score, and combining them (i.e., the final CSI score (9)) gives further improvements.
4 Related work
4.1 OOD detection
Out-of-distribution (OOD) detection is a classic and essential problem in machine learning, studied under different names, e.g., novelty or anomaly detection [hodge2004survey]. In this paper, we primarily focus on unsupervised OOD detection, which is arguably the most traditional and popular setup in the field [scholkopf2000support]. In this setting, the detector is only allowed to access in-distribution samples, while being required to identify unseen OOD samples. There are other settings, e.g., the semi-supervised setting, where the detector can access a small subset of out-of-distribution samples [hendrycks2019deep, ruff2020deep], or the supervised setting, where the detector knows the target out-of-distribution; we do not consider those settings in this paper. We remark that the unsupervised setting is the most practical and challenging scenario, since there are infinitely many cases of out-of-distribution, and it is often not possible to have such external data.
Most recent works can be grouped into four categories: (a) density-based [zhai2016deep, nalisnick2019deep, choi2018waic, du2019implicit, ren2019likelihood, serra2020input, grathwohl2020your], (b) reconstruction-based [schlegl2017unsupervised, zong2018deep, deecke2018anomaly, pidhorskyi2018generative, perera2019ocgan, choi2020novelty], (c) one-class classifier [scholkopf2000support, ruff2018deep], and (d) self-supervised [golan2018deep, hendrycks2019using_self, bergman2020classification] methods. We note that there is more extensive literature on this topic, but due to limited space we mainly focus on recent work based on deep learning (see [hodge2004survey, chandola2009anomaly, pimentel2014review] for surveys). Brief descriptions of each approach are as follows:
Density-based methods. Density-based methods are one of the most classic and principled approaches to OOD detection. Intuitively, they directly use the likelihood of the sample as the detection score. However, recent studies reveal that the likelihood is often not the best metric, especially for deep neural networks on complex datasets [nalisnick2019deep, choi2018waic]. Several works have thus proposed modified scores, e.g., WAIC [choi2018waic], likelihood ratio [ren2019likelihood], and input complexity [serra2020input], or utilized the unnormalized likelihood (i.e., energy) [du2019implicit, grathwohl2020your].

Reconstruction-based methods. The reconstruction-based approach is another popular line of research for OOD detection. It trains an encoder-decoder network that reconstructs the training data in an unsupervised manner. Since the encoder-decoder network generalizes less well to unseen OOD samples, the reconstruction loss can be used as the detection score. Several works follow this approach, utilizing generative autoencoders [zong2018deep, pidhorskyi2018generative] or generative adversarial networks [schlegl2017unsupervised, deecke2018anomaly, perera2019ocgan].

One-class classifiers. One-class classifiers are also a classic and principled approach to OOD detection. They learn a decision boundary between in- and out-of-distribution samples by giving some margin that covers the in-distribution samples [scholkopf2000support]. Recent work shows that one-class classifiers are effective on top of deep representations [ruff2018deep].

Self-supervised methods. Self-supervised methods are a relatively new approach based on the strong representations learned from self-supervision [gidaris2018unsupervised]. They train a network with a pre-defined task (e.g., predicting the angle of a rotated image) on the training set, and use the generalization error to detect OOD samples. Recent self-supervised methods show outstanding results on various OOD detection benchmarks [golan2018deep, hendrycks2019using_self, bergman2020classification].
Our work can be categorized among the self-supervised methods, but it differs from the prior work in that we consider the contrastive type of self-supervision [chen2020simple]. We validate that the power of contrastive representation learning can be extended to OOD detection, with proper modification.
4.2 Confidence-calibrated classifiers
Another line of research concerns confidence-calibrated classifiers [hendrycks2017baseline], which relax the overconfidence issue of classifiers. There are two types of calibration: (a) in-distribution calibration, which aligns the uncertainty with the actual accuracy, measured by ECE [naeini2015obtaining, guo2017calibration], and (b) out-of-distribution detection, which reduces the uncertainty on OOD samples, measured by AUROC [hendrycks2017baseline, lee2018training]. Note that the goal of confidence-calibrated classifiers is to regularize the prediction; hence all three tasks (classification, in-distribution calibration, and out-of-distribution detection) are done via the softmax probability. Namely, the detection score is given by the confidence (or maximum softmax probability) [hendrycks2017baseline]. There are also several works on designing specific detection scores utilizing a pre-trained classifier (e.g., [liang2018enhancing, lee2018simple]), but we do not consider those approaches in this paper.
4.3 Self-supervised learning
Self-supervised learning [gidaris2018unsupervised, kolesnikov2019revisiting] has recently shown remarkable success in representation learning. In particular, contrastive learning [oord2018representation], specifically instance discrimination [wu2018unsupervised], shows state-of-the-art results on visual representation learning [he2019momentum, chen2020simple]. However, most prior works primarily focus on improving downstream task performance (e.g., classification); other advantages of self-supervised learning (e.g., uncertainty or robustness) are rarely investigated [hendrycks2019using_self]. Our work is the first to verify the effectiveness of contrastive learning for OOD detection.

5 Conclusion
We propose a simple yet effective method, coined contrasting shifted instances (CSI), which extends the power of contrastive learning to out-of-distribution (OOD) detection. CSI demonstrates outstanding performance under various OOD detection scenarios. We believe our work will serve as an important baseline and guide various future directions in OOD detection and self-supervised learning.
References
Appendix A Experimental details
Training details. We use ResNet-18 [he2016deep] as the base encoder network and a 2-layer multi-layer perceptron with an embedding dimension of 128 as the projection head. All models are trained by minimizing the final loss (5) with a fixed temperature. We follow the same optimization procedure as SimCLR [chen2020simple]: we train CSI for 1,000 epochs under the LARS optimizer [you2017large] with weight decay and momentum 0.9. For learning rate scheduling, we use linear warmup [goyal2017accurate] for the first 10 epochs up to a learning rate of 1.0, followed by a cosine decay schedule without restarts [loshchilov2016sgdr]. We use a batch size of 512 for both vanilla SimCLR and ours, where ours uses the aggregated (shifted) batch. Furthermore, we use global batch normalization (BN) [ioffe2015batch], which shares the BN statistics (mean and variance) across GPUs in distributed training.
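The warmup-plus-cosine schedule described above can be sketched as follows; `lr_schedule`, its arguments, and the step-based formulation are illustrative assumptions, not the authors' exact code:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, base_lr=1.0):
    """Linear warmup followed by cosine decay without restarts.
    Hypothetical sketch; base_lr=1.0 mirrors the text above."""
    if step < warmup_steps:
        # linear warmup: ramp from 0 up to base_lr over the first warmup_steps
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice, such a schedule is usually applied per epoch or per optimizer step; here `step` is left generic.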
For supervised contrastive learning (SupCLR) [khosla2020supervised] and supervised CSI, we select the best temperature: the SupCLR paper recommends 0.07, but 0.5 performed better in our experiments. For training the encoder, we use the same optimization scheme as above, except that we train for 700 epochs. For training the linear classifier, we train the model for 100 epochs with batch size 128, using stochastic gradient descent with momentum 0.9. The learning rate starts at 0.1 and is dropped by a factor of 10 at 60%, 75%, and 90% of the training progress.
Data augmentation details. We use the SimCLR augmentations, i.e., Inception crop [szegedy2015going], horizontal flip, color jitter, and grayscale, as the random augmentations, and rotation as the shifting transformation. The detailed descriptions of the augmentations are as follows:

Inception crop. Randomly crops an area of the original image, with the crop area sampled uniformly from 0.08 to 1.0 of the original. After the crop, the cropped image is resized back to the original image size.

Horizontal flip. Flips the image horizontally with 50% probability.

Color jitter. Changes the hue, brightness, and saturation of the image. We transform the RGB (red, green, blue) image into the HSV (hue, saturation, value) format and add noise to the HSV channels. We apply color jitter with 80% probability.

Grayscale. Converts the image to grayscale. We apply grayscale with 20% probability.

Rotation. We use rotation as the shifting transformation. For a given batch, we apply each rotation of 0°, 90°, 180°, and 270° to obtain the new batch for CSI.
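The construction of the rotation-shifted batch can be sketched as follows; this is a minimal NumPy sketch, where the function name and the (B, C, H, W) layout are assumptions, not the authors' code:

```python
import numpy as np

def shifted_batch(x):
    """Build the CSI training batch by concatenating the four rotated
    copies of an image batch x of shape (B, C, H, W). Rotation by
    0/90/180/270 degrees as the shifting transformation is assumed."""
    rotations = [np.rot90(x, k, axes=(2, 3)) for k in range(4)]
    return np.concatenate(rotations, axis=0)  # shape (4B, C, H, W)
```

Each rotated copy also receives a distinct shift label during training, which the classifying-shifted-instances loss predicts.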
Dataset details. For one-class datasets, we train on one class of CIFAR-10 [krizhevsky2009learning], CIFAR-100 (super-class) [krizhevsky2009learning], and ImageNet-30 [hendrycks2019using_self]. CIFAR-10 and CIFAR-100 consist of 50,000 training and 10,000 test images with 10 and 20 (super-class) image classes, respectively. ImageNet-30 contains 39,000 training and 3,000 test images with 30 image classes.
For unlabeled and labeled multi-class datasets, we train ResNet on CIFAR-10 and ImageNet-30. For CIFAR-10, the out-of-distribution (OOD) samples are as follows: SVHN [netzer2011reading] consists of 26,032 test images with 10 digits; resized LSUN [liang2018enhancing] consists of 10,000 test images of 10 different scenes; resized ImageNet [liang2018enhancing] consists of 10,000 test images with 200 image classes from a subset of the full ImageNet dataset; Interp. consists of 10,000 test images of linear interpolations of CIFAR-10 test images; and LSUN (FIX) and ImageNet (FIX) consist of 10,000 test images each, with details in Appendix I. For multi-class ImageNet-30, the OOD samples are as follows: CUB-200 [welinder2010caltech], Stanford Dogs [khosla2011novel], Oxford Pets [parkhi2012cats], Oxford Flowers [nilsback2006visual], Food-101 [bossard2014food] without the "hot dog" class to avoid overlap, the Places-365 [zhou2017places] small-image (256×256) validation set, Caltech-256 [griffin2007caltech], and the Describable Textures Dataset (DTD) [cimpoi2014describing]. Here, we randomly sample 3,000 images to balance with the in-distribution test set.
Evaluation metrics. For evaluation, we measure two metrics: (a) the effectiveness of the proposed score in distinguishing in- and out-of-distribution images, and (b) the confidence calibration of the softmax classifier.

Area under the receiver operating characteristic curve (AUROC). Let TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The ROC curve plots the true positive rate = TP / (TP+FN) against the false positive rate = FP / (FP+TN) as a detection threshold is varied; AUROC is the area under this curve.

Expected calibration error (ECE). For a given test dataset of $n$ samples, we group the predictions into $M$ interval bins (each of size $1/M$). Let $B_m$ be the set of indices of samples whose prediction confidence falls into the interval $((m-1)/M, m/M]$. Then, the expected calibration error (ECE) [naeini2015obtaining, guo2017calibration] is as follows:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \quad (10)$$
where $\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}[\hat{y}_i = y_i]$ is the accuracy of $B_m$ ($\mathbb{1}$ is the indicator function), and $\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$ is the average confidence of $B_m$, where $\hat{p}_i$ is the confidence of sample $i$.
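The binning procedure above can be computed directly; the following is a hypothetical helper, not the authors' code, with bins following the half-open $((m-1)/M, m/M]$ convention:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap weighted by bin size.
    `confidences` holds max-softmax probabilities; `correct` is 0/1."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # accuracy within the bin
            conf = confidences[mask].mean()  # mean confidence within the bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated classifier (e.g., 90% confidence and 90% accuracy in every bin) yields ECE 0.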
Appendix B Detailed description for confidencecalibrated classifiers
We propose a simple extension of CSI for training a confidence-calibrated classifier [hendrycks2017baseline]. To this end, we first explain supervised contrastive learning (SupCLR) [khosla2020supervised], a supervised extension of SimCLR that contrasts the samples of different classes instead of different samples. Following the notation of SimCLR, let $\mathcal{B}$ be a training batch with class labels, and $\tilde{\mathcal{B}}$ be the corresponding augmented batch, where each augmented sample inherits the label of its source. We define $\tilde{\mathcal{B}}_y$ as the subset of $\tilde{\mathcal{B}}$ containing the samples of label $y$. Then, the SupCLR objective is:
$$\mathcal{L}_{\text{SupCLR}} = \frac{1}{|\tilde{\mathcal{B}}|} \sum_{\tilde{x} \in \tilde{\mathcal{B}}} \frac{1}{|\tilde{\mathcal{B}}_y \setminus \{\tilde{x}\}|} \sum_{\tilde{x}' \in \tilde{\mathcal{B}}_y \setminus \{\tilde{x}\}} -\log \frac{\exp(\mathrm{sim}(z(\tilde{x}), z(\tilde{x}'))/\tau)}{\sum_{\tilde{x}'' \in \tilde{\mathcal{B}} \setminus \{\tilde{x}\}} \exp(\mathrm{sim}(z(\tilde{x}), z(\tilde{x}''))/\tau)}, \quad (11)$$
where $\setminus$ denotes the set complement and $z(\cdot)$ is the embedding. After learning the representation with the SupCLR objective (11), we train a linear classifier that predicts the class labels on top of the embedding network. Here, we use the confidence (or maximum softmax probability) [hendrycks2017baseline] of this classifier to detect OOD samples.
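The maximum-softmax-probability detection score used above amounts to the following; a minimal sketch, where `msp_score` is a hypothetical name:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability [hendrycks2017baseline]: a higher
    score means the classifier is more confident that the input is
    in-distribution. `logits` has shape (N, n_classes)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return p.max(axis=-1)
```

Thresholding this score yields the OOD decision; AUROC sweeps the threshold.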
Similar to the contrasting loss for SimCLR (3), we extend the SupCLR objective by utilizing the shifting transformations. To this end, we consider the joint label of the class label and the shifting transformation, and form the shifted batch for each transformation. Then, the supervised contrasting shifted instances (sup-CSI) loss is given by
(12) 
defined on the self-label augmented [lee2019rethinking] space of joint labels. We observe that the classifying shifted instances loss (7) does not help supervised learning, which coincides with the observation of [lee2019rethinking] that self-supervised labels often conflict with the class labels. Hence, we only use the contrasting shifted instances loss (3) for our supervised experiments.
From the learned representation, we train two types of linear classifiers: (a) one that predicts the class labels, and (b) one that predicts the joint labels. For the former, we use the classifier directly to compute the confidence. For the latter, we marginalize the joint prediction over the shifting transformations, similar to Section 2.3. Formally, for each shifting transformation, we compute the logit values of the correspondingly shifted input and select the block of logits corresponding to that transformation. Then, the ensembled probability is given by:
(13)
where softmax denotes the softmax activation. Here, we use the ensembled probability to compute the confidence. We denote the confidence computed by the class-label classifier and by the joint-label classifier as "CSI" and "CSI-ens", respectively.
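The marginalization in (13) can be sketched as follows; the layout of the joint (class, shift) logits and all names here are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensembled_class_probs(joint_logits, n_classes, n_shifts):
    """Marginalize a joint (class, shift) prediction over the shifts.
    `joint_logits` has shape (n_shifts, n_classes * n_shifts): row k
    holds the joint logits for the input transformed by shift k. For
    each shift we take the logit block corresponding to that shift and
    average the resulting class probabilities. A sketch of Eq. (13);
    the assumed layout of the joint label space may differ."""
    probs = np.zeros(n_classes)
    for k in range(n_shifts):
        block = joint_logits[k, k * n_classes:(k + 1) * n_classes]
        probs += softmax(block)
    return probs / n_shifts
```

The "CSI-ens" confidence is then the maximum entry of the returned probability vector.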
Appendix C Additional one-class OOD detection results
Table 8 presents the confusion matrix of AUROC values of our method on the one-class CIFAR-10 datasets, where bold denotes the hard pairs. The results align with human intuition: 'car' is confused with 'ship' and 'truck', and 'cat' is confused with 'dog'.
Table 9 presents the OOD detection results of various methods on the one-class CIFAR-100 (super-class) datasets, for all 20 super-classes. Our method outperforms the prior methods for all classes.
Table 10 presents the OOD detection results of our method on the one-class ImageNet-30 dataset, for all 30 classes. Our method consistently performs well for all classes.
Plane  Car  Bird  Cat  Deer  Dog  Frog  Horse  Ship  Truck  Mean  
Plane    74.1  95.8  98.4  94.9  98.0  96.2  90.1  79.6  82.8  90.0 
Car  99.3    99.9  99.9  99.8  99.9  99.8  99.7  98.7  95.0  99.1 
Bird  91.1  97.5    97.3  87.0  92.5  96.1  83.2  96.4  98.0  93.2 
Cat  91.9  91.5  90.3    83.3  67.0  89.6  79.0  92.8  91.9  86.4 
Deer  95.7  98.4  94.9  96.6    94.7  98.7  69.0  97.4  98.8  93.8 
Dog  97.9  98.5  95.5  90.3  88.1    96.8  76.6  98.6  98.3  93.4 
Frog  93.6  92.3  94.6  96.1  96.8  96.3    95.2  94.4  97.3  95.2 
Horse  99.3  99.5  99.0  99.3  94.2  97.4  99.8    99.7  99.4  98.6 
Ship  96.6  91.2  99.5  99.7  99.4  99.7  99.5  99.3    96.6  97.9 
Truck  96.2  72.3  99.4  99.5  99.1  99.4  98.7  98.3  96.2    95.5 
Appendix D Ablation study on random augmentation
We verify that ensembling the scores over the random augmentations improves OOD detection. However, naïve random sampling over the entire augmentation distribution is often sample-inefficient. We find that choosing a proper subset improves the performance for a given number of samples. Specifically, we restrict sampling to the most common augmentations. For example, the size of the cropping area is sampled uniformly from 0.08 to 1.0 of the original area during training. Since the rare samples, e.g., crops with area near 0.08, increase the noise, we only use larger crop sizes during inference. Table 11 shows that random sampling from the controlled set often gives improvements.
# of samples  Controlled  OC-CIFAR-10  OC-CIFAR-100 
4    92.22  87.36 
40    94.13  89.51 
40  ✓  94.31  89.55 
Appendix E Efficient computation of (6) via coreset
One can reduce the computation and memory cost of the contrastive score (6) by selecting a proper subset, i.e., a coreset, of the training samples. To this end, we run K-means clustering [macqueen1967some] on the normalized features, using cosine similarity as the metric. Then, we use the center of each cluster as the coreset. For contrasting shifted instances (4), we choose a coreset for each shifting transformation. Table 12 shows the results for various coreset sizes, given as a ratio of the full training set. Keeping only a few (e.g., 1%) samples is sufficient.
Coreset (%)  OC-CIFAR-10  OC-CIFAR-100  OC-ImageNet-30 
1%  94.22  89.27  91.06 
10%  94.30  89.46  91.51 
100%  94.31  89.55  91.63 
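The coreset construction can be sketched as spherical K-means on the normalized features; a rough sketch under the cosine-similarity metric stated above, not the authors' exact procedure:

```python
import numpy as np

def kmeans_coreset(features, k, n_iters=20, seed=0):
    """Pick a coreset as K-means centers on L2-normalized features.
    Since normalized features lie on the unit sphere, assigning each
    point to its maximum-cosine-similarity center approximates
    clustering by cosine similarity. Names are illustrative."""
    rng = np.random.default_rng(seed)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    centers = z[rng.choice(len(z), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # assign each point to the most similar center
        assign = (z @ centers.T).argmax(axis=1)
        for j in range(k):
            members = z[assign == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)  # re-project to sphere
    return centers
```

The returned centers then replace the full training set in the nearest-sample search of the contrastive score.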
Appendix F Ablation study on the balancing terms
We study the effects of the balancing terms introduced in Section 2.3. To this end, we compare our final loss (5) without (w/o) and with (w/) the balancing terms. When not using the balancing terms, we weight all terms equally. We follow the experimental setup of Table 1, e.g., we use rotation for the shifting transformation. We run our experiments on the CIFAR-10, CIFAR-100 (super-class), and ImageNet-30 datasets. Table 13 shows that the balancing terms give a consistent improvement. CIFAR-10 does not show much gain since the balancing terms all have similar values; in contrast, CIFAR-100 (super-class) and ImageNet-30 show large gains since the terms vary more.
OC-CIFAR-10  OC-CIFAR-100  OC-ImageNet-30  
CSI (w/o balancing)  94.28  89.00  91.04 
CSI (w/ balancing)  94.31  89.55  91.63 
Appendix G Combining multiple shifting transformations
We find that combining multiple shifting transformations, i.e., given two transformations, using their composition as the combined shifting transformation, can give further improvements. Table 14 shows that combining "Noise", "Blur", or "Perm" with "Rotate" gives additional gains. We remark that one could investigate better combinations; we choose rotation for our experiments due to its simplicity.
Noise  Blur  Perm  Rotate  Rotate+Noise  Rotate+Blur  Rotate+Perm  
AUROC  89.29  89.15  90.68  94.31  94.65  94.66  94.60 
Appendix H Discussion on the features of the contrastive score (6)
We find that two features, (a) the cosine similarity to the nearest sample in the training set, and (b) the norm of the feature representation, are important for detecting OOD samples under the SimCLR representation.
In this section, we first demonstrate the properties of the two features under vanilla SimCLR. While we use vanilla SimCLR to validate that these are general properties of SimCLR, we remark that our training scheme (see Section 2.2) further improves the discriminative power of the features. Next, we verify that cosine similarity and feature norm are complementary: combining both features (i.e., (6)) gives an additional gain. For the latter, we use our final training loss to match the values reported in the prior experiments, but we note that the trend is consistent across models.
First, we demonstrate the effect of cosine similarity for OOD detection. To this end, we train vanilla SimCLR using CIFAR-10 and CIFAR-100 as the in- and out-of-distribution datasets, respectively. Since SimCLR attracts the same image under different augmentations, it learns to cluster similar images; hence, it shows good discrimination performance as measured by linear evaluation [chen2020simple]. Figure 2(a) presents the t-SNE [maaten2008visualizing] plot of the normalized features, where each color denotes a different class. Even though SimCLR is trained in an unsupervised manner, samples of the same class are gathered together.
Figures 2(b) and 2(c) present the histograms of the cosine similarities to the nearest training sample, for the training and test datasets, respectively. For the training set, we choose the second nearest sample, since the nearest one is the sample itself. One can see that the training samples are concentrated, even though contrastive learning pushes different samples apart; this complements the results of Figure 2(a). For the test sets, the in-distribution samples show a trend similar to the training samples. However, the OOD samples are farther from the training samples, which implies that cosine similarity is an effective feature for detecting OOD samples.
Second, we demonstrate that the feature norm is a discriminative feature for OOD detection. Following the prior setting, we use CIFAR-10 and CIFAR-100 as the in- and out-of-distribution datasets, respectively. Figure 3(a) shows that the discriminative power of the feature norm improves as training progresses. We observe that this phenomenon consistently occurs across models and settings: the contrastive loss makes the norm of in-distribution samples relatively larger than that of OOD samples. Figure 3(b) shows that the norm of CIFAR-10 features is indeed larger than that of CIFAR-100 features under the final model.
This is somewhat unintuitive, since SimCLR uses normalized features to compute the loss (1). To understand this phenomenon, we visualize the t-SNE [maaten2008visualizing] plot of the feature space in Figure 3(c), randomly choosing 100 images from both datasets. We randomly augment each image 100 times for better visualization. One can see that in-distribution samples tend to spread out over the large sphere, while OOD samples gather near the center. (Note that a t-SNE plot does not reflect the true behavior of the original feature space, but it may give some intuition.) Also, note that the same image under different augmentations is highly clustered, while in-distribution samples are slightly more assembled. (We also tried the local variance of the norm as a detection score; it also works well, but the norm is better.)
We suspect that increasing the norm may be an easier way to maximize the cosine similarity between two vectors: instead of directly reducing the feature distance between two augmented samples, one can increase the overall norm of the features to reduce the relative distance between the two samples.
Finally, we verify that the cosine similarity (sim-only) and the feature norm (norm-only) are complementary: combining them (sim+norm) gives additional improvements. Here, we use the model trained with our final objective (5) and follow the inference scheme of the main experiments (see Table 7). Table 15 shows the AUROC values under the sim-only, norm-only, and sim+norm scores. Using only the similarity or the norm already gives good results, but combining them gives the best results.
OC-CIFAR-10  OC-CIFAR-100  OC-ImageNet-30  
Simonly  90.12  86.57  83.18 
Normonly  92.70  87.71  88.56 
Sim+Norm  93.32  88.79  89.32 
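The structure of the combined score, i.e., the maximum cosine similarity to the training set multiplied by the feature norm, can be sketched as follows; the exact weighting in (6) may differ, and all names here are illustrative:

```python
import numpy as np

def contrastive_score(z_test, z_train):
    """Detection score combining the two features discussed above:
    (max cosine similarity to the training set) x (feature norm).
    z_test: (N, D) raw (unnormalized) test features;
    z_train: (M, D) training features (or a coreset thereof)."""
    z_train_n = z_train / np.linalg.norm(z_train, axis=1, keepdims=True)
    norms = np.linalg.norm(z_test, axis=1)        # norm-only feature
    z_test_n = z_test / norms[:, None]
    max_sim = (z_test_n @ z_train_n.T).max(axis=1)  # sim-only feature
    return max_sim * norms  # higher => more likely in-distribution
```

The sim-only and norm-only ablations of Table 15 correspond to returning `max_sim` or `norms` alone.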
Appendix I Rethinking OOD detection benchmarks
We find that resized LSUN and ImageNet [liang2018enhancing], among the most popular benchmark datasets for OOD detection, are visually far from the in-distribution datasets (commonly, CIFAR [krizhevsky2009learning]). Figure 6 shows that resized LSUN and ImageNet contain artificial noise produced by broken image operations. (This was also reported in https://twitter.com/jaakkolehtinen/status/1258102168176951299.) This is problematic, since one can detect such datasets with simple data statistics, without any semantic understanding from neural networks. To move OOD detection research one step further, one needs harder, semantic OOD samples that cannot be easily detected by data statistics.
To verify this, we propose a simple detection score that measures the input smoothness of an image. Intuitively, noisy images have higher variation in input space than natural images. Formally, let $x_i$ be the $i$-th value of the vectorized image $x$, and define the neighborhood $\mathcal{N}$ as the set of spatially connected pairs of pixel indices. Then, the total variation is given by
$$\mathrm{TV}(x) = \sum_{(i,j) \in \mathcal{N}} |x_i - x_j|. \quad (14)$$
Then, we define the smoothness score as the difference of the total variation from that of the training samples:
$$s(x) = \left| \mathrm{TV}(x) - \mathbb{E}_{x' \sim \mathcal{D}_{\text{train}}}\left[\mathrm{TV}(x')\right] \right|. \quad (15)$$
Table 16 shows that this simple score detects current benchmark datasets surprisingly well.
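The smoothness statistic can be sketched as follows, assuming total variation as the sum of absolute differences over spatially adjacent pixels, per (14)-(15); function names are hypothetical:

```python
import numpy as np

def total_variation(img):
    """Sum of absolute differences between vertically and horizontally
    adjacent pixels, a sketch of Eq. (14). `img` is (H, W) or (H, W, C)."""
    img = img.astype(float)
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()   # vertical neighbors
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()   # horizontal neighbors
    return dh + dw

def smoothness_score(img, train_tv_mean):
    """Deviation of an image's total variation from the training-set
    average, as in Eq. (15); noisy resized images get large scores."""
    return abs(total_variation(img) - train_tv_mean)
```

Thresholding this score alone suffices to detect the noisy resized LSUN and ImageNet benchmarks, as Table 16 shows.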
To address this issue, we construct new benchmark datasets using a fixed resize operation (the PyTorch torchvision.transforms.Resize() operation), hence coined LSUN (FIX) and ImageNet (FIX). For LSUN (FIX), we randomly sample 1,000 images from each of the ten classes of the LSUN training set. For ImageNet (FIX), we randomly sample 10,000 images from the entire training set of ImageNet-30, excluding the "airliner", "ambulance", "parking meter", and "schooner" classes to avoid overlap with CIFAR-10. (We provide the datasets and data generation code at https://github.com/alinlab/CSI.) Figure 6 shows that the new datasets are visually more realistic than the former ones. Also, Table 16 shows that the fixed datasets are not detected by the simple data statistic (15). We believe our newly produced datasets will serve as a stronger benchmark of hard, semantic OOD detection for future research.
CIFAR-10  
SVHN  LSUN  ImageNet  LSUN (FIX)  ImageNet (FIX)  CIFAR-100  Interp. 
85.88  95.70  90.53  44.13  52.76  52.14  66.17 
Appendix J Additional discussion on shifting transformation
As remarked in Section 3.2, the appropriate shifting transformation can depend on the dataset. This is crucial for real-world scenarios, since many practical applications deal with non-natural images, e.g., manufacturing images (steel (https://www.kaggle.com/c/severstalsteeldefectdetection) or textile (https://lmb.informatik.unifreiburg.de/resources/datasets/tilda.en.html) defects, for instance) or aerial [xia2018dota] images. For such datasets, one should not use rotation as a shifting transformation. We present OOD detection results using Steel and ImageNet-30 as the in- and out-of-distribution datasets, respectively. Table 17 shows results similar to those in Section 3.2: shifting by rotation degrades the performance. Investigating new transformations that account for the characteristics of the dataset would be an interesting future direction.
SimCLR  CSI (Rotation) 
74.0  36.0 