Mean-Shifted Contrastive Loss for Anomaly Detection

06/07/2021 ∙ by Tal Reiss, et al.

Deep anomaly detection methods learn representations that separate normal from anomalous samples. Very effective representations are obtained when powerful externally trained feature extractors (e.g. ResNets pre-trained on ImageNet) are fine-tuned on the training data, which consists of normal samples and no anomalies. However, this is a difficult task that can suffer from catastrophic collapse, i.e. it is prone to learning trivial and non-specific features. In this paper, we propose a new loss function which can overcome failure modes of both center-loss and contrastive-loss methods. Furthermore, we combine it with a confidence-invariant angular center loss, replacing the Euclidean distance used in previous work, which was sensitive to prediction confidence. Our improvements yield a new anomaly detection approach, based on the Mean-Shifted Contrastive Loss, which is both more accurate and less sensitive to catastrophic collapse than previous methods. Our method achieves state-of-the-art anomaly detection performance on multiple benchmarks, including 97.5% ROC-AUC on the CIFAR-10 dataset.




1 Introduction

Anomaly detection is a fundamental task for intelligent agents that aims to detect whether an observed pattern is normal or anomalous (unusual or unlikely). It has broad applications in scientific and industrial tasks such as detecting new physical phenomena (black holes, supernovae) or genetic mutations, as well as production line inspection and video surveillance. Due to the significance of the task, many efforts have focused on automatic anomaly detection, particularly on statistical and machine learning methods. A common paradigm used by many anomaly detection methods is to estimate the probability of samples and designate high-probability samples as normal and low-probability samples as anomalous. The quality of the density estimators is closely related to the quality of the features used to represent the data. Classical methods used statistical estimators such as K nearest-neighbors (kNN) or Gaussian mixture models (GMMs) on raw features; however, this often yields sub-optimal results on high-dimensional data such as images. Many recent methods learn features in a self-supervised way and use them to detect anomalies. Their main weakness is that anomaly detection datasets are typically small and do not include anomalous samples, resulting in weak features. An alternative direction, which has achieved better results, is to transfer features learned from auxiliary tasks on large-scale external datasets such as ImageNet classification. It was found that fine-tuning the pre-trained features on the normal training data can result in significant performance improvements; however, it is quite challenging. The main issue with fine-tuning on one-class classification (OCC) tasks such as anomaly detection is catastrophic collapse, i.e. after an initial improvement in efficacy, the features degrade and become uninformative. This phenomenon is caused by trivial solutions allowed by OCC objectives such as the center loss used by PANDA Reiss et al. (2020) and Deep-SVDD Ruff et al. (2018), which can achieve a perfect score by either ignoring the data or learning simple functions of the data. Although past methods Reiss et al. (2020); Ruff et al. (2018); Perera and Patel (2019) attempted to mitigate this collapse by various techniques such as architectural modifications, auxiliary tasks with extra supervision and continual learning, they do not solve the problem of catastrophic collapse.

Our contributions: we propose several advances that significantly improve the accuracy of anomaly detection and reduce catastrophic collapse. i) We introduce a new loss function that relies on ideas from contrastive learning for visual recognition tasks. Unlike standard contrastive learning, where the angular distances are measured in relation to the origin, we measure the angular distances in relation to the normalized center of the extracted features. We show that this modification is crucial for achieving strong performance when adapting features in the OCC setting. ii) We find that learning a deep representation and rescaling it to the unit sphere provides a substantial boost in accuracy. This constraint dramatically improves performance across fine-tuning settings. We show that this is related to disambiguating semantics from confidence. iii) We demonstrate a limitation of our proposed mean-shifted contrastive loss, and show that combining our new loss with an angular center loss solves this limitation. In extensive experiments we demonstrate that our method both achieves the top anomaly detection performance (e.g. 97.5% ROC-AUC on CIFAR-10) and nearly entirely eliminates catastrophic collapse.

2 Related Work

Classical anomaly detection methods:

Detecting anomalies in images has been researched for several decades. The methods follow three main paradigms: i) Reconstruction - this paradigm first attempts to characterize the normal data by a set of basis functions and then attempts to reconstruct a new example using these basis functions, typically under some constraint such as sparsity or weights with small norm. Samples with high reconstruction errors are atypical of the normal data distribution and are denoted anomalous. Some notable methods include principal component analysis Jolliffe (2011) and K nearest neighbors (kNN) Eskin et al. (2002). ii) Density estimation - another paradigm is to first estimate the density of normal data. A new test sample is denoted as anomalous if its estimated density is low. Parametric density estimation methods include Ensembles of Gaussian Mixture Models (EGMM) Glodek et al. (2013), and non-parametric methods include kNN (which is also a reconstruction-based method) as well as kernel density estimation Latecki et al. (2007). Both types of methods have weaknesses: parametric methods are sensitive to the parametric assumptions about the nature of the data, whereas non-parametric methods suffer from the difficulty of accurately estimating density in high dimensions. iii) One-class classification (OCC) - this paradigm attempts to fit a parametric classifier to distinguish between normal training data and all other data. The classifier is then used to classify new samples as normal or anomalous. Such methods include one-class support vector machine (OCSVM) Scholkopf et al. (2000) and support vector data description (SVDD) Tax and Duin (2004).

Self-supervised deep learning methods:

Instead of using supervision for learning deep representations, self-supervised methods train neural networks to solve an auxiliary task for which obtaining data is free or at least very inexpensive. Auxiliary tasks for learning high-quality image features include video frame prediction Mathieu et al. (2016), image colorization Zhang et al. (2016); Larsson et al. (2016) and puzzle solving Noroozi and Favaro (2016). RotNet Gidaris et al. (2018) used a set of image processing rotations around the image axis, and predicted the true image orientation to learn high-quality image features. GEOM Golan and El-Yaniv (2018) used a similar image-processing task prediction for detecting anomalies in images. This method was improved by Hendrycks et al. (2019), and extended to tabular data by Bergman and Hoshen (2020). Another commonly used self-supervised paradigm is contrastive learning Chen et al. (2020a), which learns representations by distinguishing similar views (augmentations) of the same sample from other data samples. Recently, variants of contrastive learning were also introduced to OCC. CSI Tack et al. (2020) treats augmented inputs as positive samples and distributionally-shifted inputs as negative samples. DROC Sohn et al. (2020) shares a similar technical formulation with CSI, without test-time augmentation or an ensemble of models.

Feature adaptation for one-class classification: This line of work is based on the idea of initializing a neural network with pre-trained weights and then obtaining stronger performance by further adaptation on the training data. Although feature adaptation has been extensively studied in the multi-class classification setting, limited work has been done in the OCC setting. DeepSVDD Ruff et al. (2018) suggested first training an auto-encoder on the normal training data, and then using the encoder as the initial feature extractor. Moreover, since the features of the encoder are not specifically fitted to anomaly detection, DeepSVDD further adapts the encoder on the training data. However, this naive training procedure leads to catastrophic collapse. An alternative direction is to use features learned from auxiliary tasks on large-scale external datasets such as ImageNet classification. Deep feature representations trained on the ImageNet dataset have been shown by Huh et al. (2016) to significantly boost performance on other datasets that are only vaguely related to some of the ImageNet classes. Transferring ImageNet pre-trained features for out-of-distribution detection has been proposed by Hendrycks et al. (2019). Analogous pre-training for OCC has been proposed by Perera and Patel (2019), who jointly train anomaly detection with the original task, which achieves only limited adaptation success. PANDA Reiss et al. (2020) proposed techniques based on early stopping and EWC Kirkpatrick et al. (2017), a continual learning method, to mitigate catastrophic collapse. Although PANDA achieved state-of-the-art performance on most datasets, it has yet to solve the problem of catastrophic collapse.

3 Background: Learning Representations for One-Class Classification

3.1 Preliminaries

In the one-class classification task, we are given a set of training samples that are all normal (and contain no anomalies), $x_1, x_2, \ldots, x_N \in \mathcal{X}_{train}$. The objective is to classify a new sample $x$ as being normal or anomalous. The methods considered here learn a deep representation of a sample parametrized by the neural network function $\phi: \mathcal{X} \rightarrow \mathbb{R}^d$, where $d$ is the feature dimension. In several methods, $\phi$ is initialized by pre-trained weights $\phi_0$, which can be learned either using external datasets (e.g. ImageNet classification) or using self-supervised tasks on the training set (e.g. RotNet Gidaris et al. (2018)). The representation is further tuned on the training data to form the final representation $\phi$. Finally, an anomaly scoring function $s(x)$ utilizes the representation of the sample $x$ and predicts how anomalous the sample is. The binary anomaly classification can be predicted by applying a threshold on $s(x)$. In Sec. 3.2 and Sec. 3.3, we review the most relevant methods for learning the representation $\phi$.

3.2 Center Loss

One simple method for adapting an initial feature extractor for OCC is using a center loss. The idea is that the features are adapted such that the training data, which consist only of normal samples, lie as near as possible to the center $c$. Specifically, the center loss for an input sample $x$ can be written as follows:

$$\mathcal{L}_{center}(x) = \|\phi(x) - c\|^2 \tag{1}$$

The feature extractor $\phi$ is initialized with some pre-trained feature extractor $\phi_0$ (an ImageNet pre-trained ResNet was shown by Reiss et al. (2020) to be very effective). The center $c$ can be set to be the mean of the training set pre-trained feature representation:

$$c = \mathbb{E}_{x \in \mathcal{X}_{train}}\left[\phi_0(x)\right] \tag{2}$$

This loss suffers from catastrophic collapse, i.e., the discriminative properties of the pre-trained extractor may be lost due to training on the simple center loss task. After an initial improvement in efficacy, the features degrade and become uninformative due to a trivial solution where $\phi(x) = c$ independently of the sample $x$. Such a representation cannot, of course, discriminate between normal and anomalous samples.
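The trivial solution of the center loss can be illustrated numerically. The following is a minimal sketch, assuming the center loss is the squared Euclidean distance to the mean of the pre-trained features; the random `pretrained` array is a hypothetical stand-in for real features and is not taken from the paper.

```python
import numpy as np


def center_loss(features, center):
    """Mean squared Euclidean distance of each feature vector to the center."""
    diff = features - center
    return float(np.mean(np.sum(diff ** 2, axis=1)))


rng = np.random.default_rng(0)

# Hypothetical stand-in for pre-trained features of 8 normal training samples.
pretrained = rng.normal(size=(8, 4))
center = pretrained.mean(axis=0)

# A collapsed extractor that maps every input to the center.
collapsed = np.tile(center, (8, 1))

loss_pretrained = center_loss(pretrained, center)
loss_collapsed = center_loss(collapsed, center)

print("pre-trained features:", loss_pretrained)
print("collapsed features:  ", loss_collapsed)
```

The collapsed extractor reaches the minimum possible loss while carrying no information about its input, which is why minimizing this objective alone does not guarantee useful features.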

DeepSVDD Ruff et al. (2018) suggested various architectural modifications to improve the utilization of the center loss. However, the proposed architectures were too constrained. PANDA Reiss et al. (2020) showed that the DeepSVDD feature adaptation does not perform better than linear post-processing of the initial feature extractor $\phi_0$. Instead, they use the center loss without any architectural modifications, and propose techniques based on early stopping and continual learning to mitigate catastrophic collapse. While achieving state-of-the-art performance on most datasets, this does not completely resolve the issue.

3.3 Self-supervised Auxiliary Tasks

Figure 1: Top: The angular representation in relation to the origin. $\mathcal{L}_{con}$ enlarges the angles between positive and negative samples, thus increasing their Euclidean distance to $c$. Bottom: The mean-shifted representation. $\mathcal{L}_{msc}$ does not affect the Euclidean distance between $c$ and the mean-shifted representations while maximizing the angles between the negative pairs.

An alternative line of work proposed to first learn the representations $\phi$ by using state-of-the-art self-supervised learning techniques. Although earlier works were based on RotNet ideas Gidaris et al. (2018), top performing methods are now based on contrastive learning. In the contrastive training procedure a mini-batch of size $B$ is randomly sampled and the contrastive prediction task is defined on pairs of augmented examples derived from the mini-batch, resulting in $2B$ data points. The typical contrastive loss for a positive pair $(x', x'')$, where $x'$ and $x''$ are augmentations of $x$, is written below:

$$\mathcal{L}_{con}(x', x'') = -\log \frac{\exp\left(\phi(x') \cdot \phi(x'') / \tau\right)}{\sum_{i=1}^{2B} \mathbb{1}_{[x_i \neq x']} \exp\left(\phi(x') \cdot \phi(x_i) / \tau\right)} \tag{3}$$

where $x_i$ is an augmented view of some $x \in \mathcal{X}_{train}$, $\tau$ denotes a temperature hyper-parameter and it holds that $\|\phi(x)\| = 1$. Augmentations include crops, flips, color jitter, grayscale and Gaussian blurs. The contrastive loss was shown by Wang and Isola (2020) to stimulate two properties: i) uniform distribution of $\{\phi(x)\}_{x \in \mathcal{X}_{train}}$ across the unit sphere, and ii) different augmentations of the same sample are mapped to the same representation.
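The contrastive objective described above can be sketched in NumPy as follows. This is an illustrative NT-Xent-style implementation under stated assumptions, not the authors' code: the function names, the batch layout (the two views stacked into $2B$ rows) and the temperature value `tau=0.25` are choices made for this example.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row to unit length."""
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)


def contrastive_loss(z1, z2, tau=0.25):
    """Contrastive loss over 2B unit vectors; z1[i] and z2[i] are two views of sample i."""
    z = np.concatenate([z1, z2], axis=0)             # (2B, d)
    n = z.shape[0]
    b = n // 2
    sim = z @ z.T / tau                              # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                   # a point is never compared with itself
    positive_index = np.concatenate([np.arange(b, n), np.arange(0, b)])
    # Numerically stable log of the denominator (sum over all other points).
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    positive = sim[np.arange(n), positive_index]
    return float(np.mean(log_denominator - positive))


rng = np.random.default_rng(1)
views = l2_normalize(rng.normal(size=(6, 5)))

# Identical views of each sample versus views paired with the wrong sample.
loss_aligned = contrastive_loss(views, views)
loss_shuffled = contrastive_loss(views, np.roll(views, 1, axis=0))
print(loss_aligned, loss_shuffled)
```

Pairing each view with the correct partner gives the lower loss, since only the positive term changes between the two calls while the set of points in the denominator stays the same.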

Contrastive methods currently achieve the top performance for anomaly detection without utilization of externally trained network weights. We remind the reader that PANDA uses ImageNet pre-trained weights while contrastive loss methods, such as CSI Tack et al. (2020), only use the normal training data. The fact that contrastive losses achieve better self-supervised anomaly detection performance than the center loss might suggest that combining pre-trained features with the contrastive methods can result in further improvement over PANDA. However, we empirically find that this is not the case; rather, it results in very fast catastrophic collapse and no improvement over the pre-trained feature extractor $\phi_0$. We present a new loss function, the Mean-Shifted Contrastive Loss, to overcome this limitation in Sec. 4.

Figure 2: (a) The initialized feature space derived by $\phi_0$. (b) $\mathcal{L}_{con}$ forces the features to be equally distributed across the unit sphere, with the result that every anomalous sample will have a nearby normal sample. (c) $\mathcal{L}_{msc}$ operates in the space of angles around the center $c$, forming a uniform distribution of $\theta(x)$ across the unit sphere that surrounds the center. (d) Projecting the mean-shifted features to the unit sphere after optimizing $\mathcal{L}_{msc}$ yields an informative compact representation of normal sample features around the center.

4 The Mean-Shifted Contrastive Loss

In this section, we introduce our new approach for OCC feature adaptation. In Sec 4.1 we present our new loss function, the mean-shifted contrastive loss, where we operate in the angular space with respect to the extracted features center. In Sec 4.2 we explain the advantages of using the angular distance as a metric in place of the Euclidean distance and leverage it to propose the angular center loss. In Sec 4.3 we combine the above to construct our final approach.

4.1 Mean-Shifted Loss: Modifying the Contrastive Loss for OCC Transfer Learning

Previous contrastive learning methods were mostly evaluated by their ability to perform inter-class separation. While contrastive methods have achieved state-of-the-art performance on visual recognition tasks, they are not a priori designed for OCC feature adaptation. In order to minimize the contrastive loss, the angles between representations of negative pairs need to be maximized, even though both are normal samples, i.e., from the same class. By maximizing these angles, the distance to the normalized center increases as well, as illustrated in Fig. 1 (top). This behaviour is in contrast to the optimization of the center loss (Eq. 1), which learns representations by minimizing the Euclidean distance between normal representations and the center. PANDA Reiss et al. (2020) has shown that the optimized center loss results in high anomaly detection performance. Furthermore, forcing the representations of the normal training data to be uniformly distributed on the unit sphere is not well suited for OCC, since every anomalous sample will have a nearby normal sample.

We propose a modification to the contrastive loss that alleviates these issues. Instead of measuring the angular distance between samples in relation to the origin, we measure the angular distance in relation to the normalized center of the normal features. As can be seen in Fig. 1 (bottom), in our proposed mean-shifted representation the contrastive loss maximizes the angles between the negative pairs while maintaining their distance to the normalized center. We name our new loss function the Mean-Shifted Contrastive Loss.

By a slight abuse of notation, let us denote the normalized center of the feature representation of the training set by $c$ (that is, the normalized form of Eq. 2). For each image $x$, we create two different augmentations of the image, denoted $x', x''$. All images are passed through a feature extractor, then, as in most contrastive learning methods, the extracted features are scaled to the unit sphere (by normalization), resulting in their respective feature representations. For each representation, we define its mean-shifted counterpart by subtracting the center and normalizing to the unit sphere. For a sample $x$, its mean-shifted representation is defined as:

$$\theta(x) = \frac{\phi(x) - c}{\|\phi(x) - c\|} \tag{4}$$

The mean-shifted loss for two augmentations $x', x''$ of image $x$ from an augmented mini-batch of size $2B$ is defined as follows:

$$\mathcal{L}_{msc}(x', x'') = -\log \frac{\exp\left(\theta(x') \cdot \theta(x'') / \tau\right)}{\sum_{i=1}^{2B} \mathbb{1}_{[x_i \neq x']} \exp\left(\theta(x') \cdot \theta(x_i) / \tau\right)} \tag{5}$$

where $\tau$ denotes a temperature hyper-parameter. Fig. 2 illustrates the training process. Since the mean-shifted representation operates in the space of angles around $c$, it does not directly encourage increasing the distance between the center and the features of the training images. Furthermore, the loss forms a uniform distribution around the center of normal data, rather than forming a uniform distribution of the normal data on the unit sphere, thus allowing discrimination of anomalies (which will often be farther from the center). This behaviour mostly eliminates the collapse encountered when adapting using the standard contrastive loss.
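The mean-shifted representation and loss described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the loss is the standard contrastive loss applied to the mean-shifted vectors, and the helper names, the temperature `tau=0.25` and the synthetic clustered features are assumptions made for the example.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row to unit length."""
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)


def mean_shift(features, center):
    """Subtract the normalized center, then project back onto the unit sphere."""
    return l2_normalize(features - center)


def contrastive_loss(z1, z2, tau):
    """Contrastive loss over 2B unit vectors; z1[i] and z2[i] are two views of sample i."""
    z = np.concatenate([z1, z2], axis=0)
    n = z.shape[0]
    b = n // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                   # exclude self-comparisons
    positive_index = np.concatenate([np.arange(b, n), np.arange(0, b)])
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(n), positive_index]))


def mean_shifted_contrastive_loss(f1, f2, center, tau=0.25):
    """Contrastive loss measured on angles around the center instead of the origin."""
    return contrastive_loss(mean_shift(f1, center), mean_shift(f2, center), tau)


def mean_offdiag_cosine(u):
    """Average cosine similarity between distinct rows of a unit-norm matrix."""
    gram = u @ u.T
    n = gram.shape[0]
    return float((gram.sum() - np.trace(gram)) / (n * (n - 1)))


rng = np.random.default_rng(0)
center = l2_normalize(rng.normal(size=(1, 32)))[0]

# Hypothetical unit-norm features of normal samples, clustered near the center.
f1 = l2_normalize(center + 0.05 * rng.normal(size=(8, 32)))
f2 = l2_normalize(f1 + 0.01 * rng.normal(size=(8, 32)))     # second "augmentation"

theta = mean_shift(f1, center)
loss = mean_shifted_contrastive_loss(f1, f2, center)

raw_similarity = mean_offdiag_cosine(f1)         # angles measured from the origin
shifted_similarity = mean_offdiag_cosine(theta)  # angles measured from the center
print(loss, raw_similarity, shifted_similarity)
```

In this toy setting the raw features are nearly parallel when viewed from the origin, whereas their mean-shifted counterparts are spread over many directions around the center, which is the space in which the loss operates.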

4.2 Angular Center Loss for Anomaly Detection

One limitation of the mean-shifted loss is that although it does not directly encourage increasing the distance of features of training images from the center, it might still do so indirectly. Specifically, if the Euclidean distance between $\phi(x)$ and $c$ is small, the angle of the mean-shifted representation $\theta(x)$ is very sensitive to changes in $\phi(x)$: a small change in $\phi(x)$ can induce a large change in the angle. This may eventually increase the distance to $c$. In accordance with the intuition of the center loss used by DeepSVDD Ruff et al. (2018) and PANDA Reiss et al. (2020), we expect that having the normal data lie in a small region around the center will be more discriminative than when it is allowed to occupy larger distances. We propose to support the shrinkage of the distance of normal samples from the center by adding a center loss.
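The sensitivity argument can be checked with a small numerical example. The sketch below is a simplified illustration with hand-picked 3-D vectors (it ignores the unit-norm constraint on the features for readability); all values are hypothetical.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def angle_between(u, v):
    """Angle in radians between two vectors."""
    return float(np.arccos(np.clip(np.dot(unit(u), unit(v)), -1.0, 1.0)))


def shifted_angle_change(feature, center, delta):
    """Rotation of the direction (feature - center) when the feature moves by delta."""
    return angle_between(feature - center, feature + delta - center)


center = np.array([0.0, 0.0, 1.0])
direction = np.array([1.0, 0.0, 0.0])
delta = np.array([0.0, 0.01, 0.0])        # the same small perturbation in both cases

near = center + 0.02 * direction          # feature very close to the center
far = center + 0.5 * direction            # feature farther from the center

change_near = shifted_angle_change(near, center, delta)
change_far = shifted_angle_change(far, center, delta)
print(change_near, change_far)
```

The same perturbation rotates the mean-shifted direction by roughly `arctan(0.01 / 0.02)` radians for the nearby feature but only about `arctan(0.01 / 0.5)` radians for the distant one, so features close to the center are far more sensitive in angle.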

Breaking away from previous deep learning approaches that used the center loss, we propose to use the angular center loss. The angular center loss encourages the angular distance between each sample and the center to be minimal. This contrasts with the standard center loss, which aims for minimal Euclidean distance to the center. Although a simple change, the angular center loss achieves much better results than the regular center loss (see Sec. 5.3). We remind the reader that our feature extractor $\phi$ yields unit vectors, and that $c$ is the normalized center. Therefore, the angular center loss is defined as:

$$\mathcal{L}_{angular}(x) = -\phi(x) \cdot c \tag{6}$$

Figure 3: Confidence histogram of CIFAR-10 "Bird" class. The confidence (norm) of the extracted features derived by $\phi_0$ does not differentiate between normal and anomalous samples.

In what follows we analyse the reasons for the strong performance of the angular center loss. Our initial feature extractor $\phi_0$ is pre-trained on a classification task (specifically ImageNet classification). To obtain class probabilities, the features $\phi_0(x)$ are multiplied by a classifier matrix $C$ and subsequently passed through a softmax layer. The logits are therefore given by $C \cdot \phi_0(x)$. As softmax is a monotonic function, scaling of the logits does not change the order of probabilities. However, scaling does determine the degree of confidence in the decision. We propose to disambiguate the representation into two components: i) the semantic class $\phi_0(x) / \|\phi_0(x)\|$, and ii) the confidence $\|\phi_0(x)\|$. The confidence acts as a per-sample temperature that determines how confident the discrimination between the classes is. In Fig. 3, we compare the histograms of confidence values of the normal and anomalous samples on a particular class of the CIFAR-10 dataset ("Bird"). We observe that confidence does not discriminate between normal and anomalous images in this dataset. A thorough investigation that we conducted showed that the confidence of an ImageNet pre-trained feature representation did not help the anomaly detection performance. It is possible that this is partially caused by the protocol used to convert multi-class datasets into anomaly detection datasets. This motivates the angular center loss, which is only sensitive to the semantic similarity with the center of the normal images, and does not use classifier confidence, which acts as a nuisance factor.
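The separation of a feature into direction and norm can be sketched numerically. In the example below, the random classifier matrix and feature are hypothetical stand-ins, and the angular center loss is taken to be the negative cosine similarity to the normalized center, which is an assumed reading of the description in this section.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def decompose(feature):
    """Split a feature into a unit direction (semantics) and its norm (confidence)."""
    norm = np.linalg.norm(feature)
    return feature / norm, norm


def angular_center_loss(unit_features, unit_center):
    """Negative mean cosine similarity between unit features and the unit center."""
    return float(-np.mean(unit_features @ unit_center))


rng = np.random.default_rng(0)
classifier = rng.normal(size=(5, 8))      # hypothetical classifier matrix (5 classes)
feature = rng.normal(size=8)              # hypothetical pre-trained feature
direction, confidence = decompose(feature)

# Same direction, two different norms: class order is kept, sharpness changes.
p_small = softmax(classifier @ direction)
p_large = softmax(classifier @ (10.0 * direction))

# The angular loss only sees the direction, so rescaling the feature has no effect.
unit_center, _ = decompose(rng.normal(size=8))
loss_original = angular_center_loss(direction[None, :], unit_center)
loss_rescaled = angular_center_loss(decompose(10.0 * feature)[0][None, :], unit_center)
print(p_small.max(), p_large.max(), loss_original, loss_rescaled)
```

Scaling the feature leaves the ranking of the classes untouched and only makes the prediction more peaked, while the angular loss is unchanged, which is the sense in which it is confidence-invariant.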

4.3 The best of both worlds: A new feature adaptation approach for OCC

Our entire approach is formed by combining the two losses: i) the mean-shifted contrastive loss (Sec. 4.1) and ii) the angular center loss (Sec. 4.2):

$$\mathcal{L}(x', x'') = \mathcal{L}_{msc}(x', x'') + \mathcal{L}_{angular}(x') + \mathcal{L}_{angular}(x'') \tag{7}$$

This combination allows us to enjoy the best of both worlds and alleviates some of the limitations of each loss. The ability of the mean-shifted loss to arbitrarily increase the angular distance of the normal sample features from the center is kept in check by the angular center loss. On the other hand, the trivial solution of the angular center loss, where $\phi(x) = c$ independently of the sample $x$, is mitigated by the combination with the mean-shifted contrastive loss. Therefore, the combination of the two losses eliminates these faults, achieving high accuracy and training stability. In fact, we show in Sec. 5 that our combined losses encounter virtually no collapse after hundreds of training epochs. This should be compared to the collapse encountered after less than ten epochs for the standard contrastive loss, and after tens of epochs for the center loss in PANDA Reiss et al. (2020).
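A compact sketch of how the two terms might be summed over a batch is given below. It assumes equal weighting of the terms, batch averaging, a temperature of 0.25 and the negative-cosine form of the angular center loss; these choices and all names are assumptions of this example, not details confirmed by the text.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)


def contrastive_loss(z1, z2, tau):
    """Contrastive loss over 2B unit vectors; z1[i] and z2[i] are two views of sample i."""
    z = np.concatenate([z1, z2], axis=0)
    n = z.shape[0]
    b = n // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)
    positive_index = np.concatenate([np.arange(b, n), np.arange(0, b)])
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(n), positive_index]))


def combined_loss(f1, f2, center, tau=0.25):
    """Return (total, mean-shifted term, angular term) for two batches of unit features."""
    msc = contrastive_loss(l2_normalize(f1 - center), l2_normalize(f2 - center), tau)
    angular = float(-np.mean(f1 @ center) - np.mean(f2 @ center))
    return msc + angular, msc, angular


rng = np.random.default_rng(0)
center = l2_normalize(rng.normal(size=(1, 16)))[0]
f1 = l2_normalize(center + 0.1 * rng.normal(size=(8, 16)))   # hypothetical view 1
f2 = l2_normalize(f1 + 0.01 * rng.normal(size=(8, 16)))      # hypothetical view 2

total, msc_term, angular_term = combined_loss(f1, f2, center)
print(total, msc_term, angular_term)
```

The angular term pulls features toward the center direction, while the mean-shifted term keeps them spread in angle around it, so neither term's trivial solution minimizes the sum.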

Anomaly criterion: In order to classify a sample as normal or anomalous, we use a simple criterion based on kNN using the cosine distance. We first compute the cosine distance between the features of the target image $x$ and those of all training images. The anomaly score is given by:

$$s(x) = \sum_{\phi(y) \in N_k(x)} \left(1 - \phi(x) \cdot \phi(y)\right) \tag{8}$$

where $N_k(x)$ denotes the $k$ nearest features to $\phi(x)$ in the training feature set $\{\phi(y)\}_{y \in \mathcal{X}_{train}}$. By checking if the anomaly score $s(x)$ is larger than a threshold, we determine if the image $x$ is normal or anomalous.
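The kNN criterion can be sketched in a few lines of NumPy. The example assumes the score is the sum of cosine distances to the `k` nearest training features; the value `k=2`, the threshold and the toy 2-D features are hypothetical.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)


def knn_anomaly_score(test_features, train_features, k=2):
    """Sum of cosine distances from each test feature to its k nearest training features."""
    test_unit = l2_normalize(test_features)
    train_unit = l2_normalize(train_features)
    cosine_distance = 1.0 - test_unit @ train_unit.T      # shape (n_test, n_train)
    nearest = np.sort(cosine_distance, axis=1)[:, :k]     # k smallest distances per row
    return nearest.sum(axis=1)


# Toy features of normal training samples, all pointing roughly along the x-axis.
train = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]])
test = np.array([
    [1.0, 0.05],   # similar direction to the training data
    [0.0, 1.0],    # nearly orthogonal to the training data
])

scores = knn_anomaly_score(test, train, k=2)
threshold = 0.5    # hypothetical threshold, chosen only for this toy example
is_anomalous = scores > threshold
print(scores, is_anomalous)
```

Only the second test point is flagged, because its direction is far in angle from every training feature regardless of the vectors' norms.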

5 Experiments

Dataset | D-SVDD Ruff et al. (2018) | MHRot Hendrycks et al. (2019) | DROC Sohn et al. (2020) | CSI Tack et al. (2020) | PANDA Reiss et al. (2020) | Ours
Type | Self-supervised | Self-supervised | Self-supervised | Self-supervised | Pre-trained | Pre-trained
CIFAR-10 | 64.8 | 90.1 | 92.5 | 94.3 | 96.2 | 97.5
CIFAR-100 | 67.0 | 80.1 | 86.5 | 89.6 | 94.1 | 96.5
CatsVsDogs | 50.5 | 86.0 | 89.6 | 86.3 | 97.3 | 99.4
DIOR | 70.0 | 73.3 | - | 78.5 | 94.3 | 97.2

Table 1: Anomaly detection performance (mean ROC-AUC %) on datasets containing more than 500 train samples. For our method we report the means and standard deviations averaged over five runs. Bold denotes the best results.

In this section, we extensively evaluate our method and demonstrate that it outperforms the state-of-the-art. In Sec 5.2, we report our OCC results with a comparison to previous works on the standard benchmark datasets. In Sec 5.3 we analyse the ability of our method to mitigate catastrophic collapse and present an ablation study.

Building on the framework suggested in Reiss et al. (2020), we use ResNet152 pre-trained on the ImageNet classification task as $\phi_0$, and add a final normalization layer; this is our initialized feature extractor $\phi$. We adopt the data augmentation module proposed by Chen et al. (2020b): we sequentially apply a fixed-size crop from a randomly resized image, random color jittering, random grayscale conversion, random Gaussian blur and random horizontal flip. By default, we fine-tune our model with the loss function in Eq. 7. For inference we use the criterion described in Sec. 4.3. We adopt the ROC-AUC metric as the detection performance score. Full training and implementation details are provided in Appendix A.3.
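For readers unfamiliar with the metric, ROC-AUC can be computed directly from anomaly scores as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The sketch below is a simple pairwise implementation written for illustration; the function name and the toy scores are assumptions of this example.

```python
import numpy as np


def roc_auc(scores, labels):
    """ROC-AUC from anomaly scores; labels are 1 for anomalous, 0 for normal.

    Counts the fraction of (anomalous, normal) pairs that are ranked correctly,
    with ties counted as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    anomalous = scores[labels]
    normal = scores[~labels]
    if anomalous.size == 0 or normal.size == 0:
        raise ValueError("need at least one normal and one anomalous sample")
    greater = (anomalous[:, None] > normal[None, :]).sum()
    ties = (anomalous[:, None] == normal[None, :]).sum()
    return float((greater + 0.5 * ties) / (anomalous.size * normal.size))


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = roc_auc(scores, labels)
print(auc)  # 3 of the 4 (anomalous, normal) pairs are ranked correctly -> 0.75
```

Because the metric depends only on the ranking of the scores, it does not require choosing a threshold.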

5.1 Benchmarks

We evaluated our approach on a wide range of anomaly detection benchmarks. Following Golan and El-Yaniv (2018); Hendrycks et al. (2019), we run our experiments on commonly used datasets: CIFAR-10 Krizhevsky et al. (2009), the CIFAR-100 coarse-grained version that consists of 20 classes Krizhevsky et al. (2009), and CatsVsDogs Elson et al. (2007). In order to demonstrate different challenges in image anomaly detection, we further extend our results to small datasets from different domains. Following the setting presented in Reiss et al. (2020), we tested our method on: 102 Category Flowers Nilsback and Zisserman (2008), Caltech-UCSD Birds 200 Wah et al. (2011), MVTec Bergmann et al. (2019), WBC Zheng et al. (2018), and DIOR Li et al. (2020). Following standard protocol, multi-class datasets are converted to anomaly detection by setting one class as normal and all other classes as anomalies. This is performed for all classes, in practice turning a single multi-class dataset into as many anomaly detection datasets as it has classes. For the full dataset descriptions and details see Appendix A.1.
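The one-class conversion protocol can be sketched as follows. The helper name and the toy label arrays are hypothetical; the sketch only shows how each class in turn becomes the normal class.

```python
import numpy as np


def one_vs_rest_splits(train_labels, test_labels):
    """Yield (normal_class, train_indices, test_is_anomaly) once per class.

    Training uses only samples of the normal class; at test time every sample
    from any other class is labelled anomalous.
    """
    train_labels = np.asarray(train_labels)
    test_labels = np.asarray(test_labels)
    for normal_class in np.unique(train_labels):
        train_indices = np.flatnonzero(train_labels == normal_class)
        test_is_anomaly = test_labels != normal_class
        yield int(normal_class), train_indices, test_is_anomaly


train_labels = [0, 1, 2, 0, 1, 2]
test_labels = [0, 1, 2, 2]
splits = list(one_vs_rest_splits(train_labels, test_labels))

for normal_class, train_indices, test_is_anomaly in splits:
    print(normal_class, train_indices.tolist(), test_is_anomaly.tolist())
```

A dataset-level result is then typically the average of the per-class scores over all such splits.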

5.2 Comparison on Standard Datasets

We compare our approach with the top current self-supervised and pre-trained feature adaptation methods Ruff et al. (2018); Hendrycks et al. (2019); Tack et al. (2020); Sohn et al. (2020); Reiss et al. (2020). Results that were reported in the original papers were copied. When the results were not reported in the original papers, we ran the experiments (where possible).

Tab. 1 shows that our proposed approach surpasses the previous state-of-the-art on the common OCC benchmarks. This establishes the superiority of our approach over previous self-supervised and pre-trained methods. We attribute this to our new loss function and the removal of confidence information (see Sec. 5.3). For the full class-wise results for CIFAR-10, CIFAR-100 and CatsVsDogs, see Appendix A.5.

5.3 Analysis & Ablation Study

Small datasets: In Tab. 2 we present a comparison between (i) the top self-supervised contrastive-learning based method, CSI Tack et al. (2020), (ii) PANDA, based on the Euclidean metric Reiss et al. (2020), and (iii) our method, which uses normalized representations. We see that the self-supervised method does not perform well on such small datasets, whereas our method achieves very strong performance. The reason for the poor performance of self-supervised methods on small datasets is that the only training data they see is the small dataset, and they cannot learn strong features from such a small amount of data. This is particularly severe for contrastive methods (but is also the case for all other self-supervised methods). As pre-trained methods transfer features from external datasets, they do not have this failure mode. Note that, similarly to PANDA Reiss et al. (2020), our method does not benefit from adaptation on small datasets and relies on the initial feature extractor $\phi_0$, after normalizing away the confidence $\|\phi_0(x)\|$. Moreover, unlike PANDA, which uses kNN with Euclidean distance (DN2), we use the angular distance. We find that our confidence-invariant approach outperforms PANDA in all benchmarks but one, in which we also achieve comparable performance.

Method | Birds | Flowers | MVTec | WBC
CSI Tack et al. (2020) | 52.4 | 60.8 | 63.6 | 50.4
PANDA Reiss et al. (2020) | 95.3 | 94.1 | 86.5 | 87.4
PANDA + norm (ours) | 96.7 | 96.5 | 87.2 | 87.0

Table 2: Anomaly detection accuracy (mean ROC-AUC %) on various small datasets (fewer than 250 images). Self-supervised methods fail while pre-trained features achieve strong results.
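Dataset | DN2 (raw) | DN2 (angular) | Center loss | Angular center loss | Mean-shifted loss | Mean-shifted + angular center loss
CIFAR-10 | 92.5 | 95.8 | 96.2 | 96.8 | 97.3 | 97.5

Table 3: Training objective ablation study (CIFAR-10, mean ROC-AUC %).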

Training objective: The individual effects of the components that comprise our final training loss are appraised in Tab. 3, alongside the unadapted ImageNet pre-trained ResNet features combined with kNN as anomaly scoring (DN2). We note that both the confidence-invariant form of the pre-trained model and the angular center loss achieve better performance than the raw features and the vanilla center loss, respectively. We further notice that the mean-shifted loss outperforms the rest, and combining it with the angular center loss results in further improvements.

Catastrophic collapse: In Fig. 4.a, we evaluate the collapse of different training objectives on the CIFAR-10 "Bird" class. We notice that the contrastive loss is unsuitable for OCC feature adaptation as it results in very fast catastrophic collapse. PANDA-ES (early stopping) results in an initial improvement in accuracy, but after a few epochs the features degrade and become uninformative. PANDA-EWC follows the same trajectory; it postpones the collapse, but does not prevent it. Finally, we see that the mean-shifted loss nearly entirely eliminates catastrophic collapse. In Fig. 4.b-c we zoom in on the collapse slopes of the different components that comprise our final loss. While our final loss results in higher accuracy than its components, its feature degradation is similar to that of the mean-shifted loss. Moreover, the feature degradation of the angular center loss is faster than that of the above but is still much slower than that of the standard center loss.

Figure 4: CIFAR-10 "Bird" class: (a) While the contrastive loss and PANDA suffer from catastrophic collapse, the mean-shifted loss eliminates it nearly entirely. (b) We zoom in on the ROC-AUC per epoch for different training objectives. (c) We zoom in further on the maximum of the curves and observe a faster decline in accuracy for the angular center loss than for the others.

Negative samples: In addition to contrastive self-supervised learning methods such as SimCLR Chen et al. (2020a) and MoCo He et al. (2019), non-contrastive methods have been proposed (e.g. BYOL Grill et al. (2020) and SimSiam Chen and He (2020)) which use only positive pairs and no negative pairs. We evaluated our method with SimSiam (using the mean-shifted representations), which is the same as using our loss without negative examples. We found that the method experiences an immediate catastrophic collapse. This indicates that negative examples are necessary for good performance when using mean-shifted representations. To give some intuition, note that the SimSiam objective (with or without mean-shifted representations) can in fact be optimized by having all representations mapped to a constant value. Although this does not happen when SimSiam is initialized from scratch, it appears that in the OCC case it does degrade to the trivial solution. This establishes the need for a contrastive approach.

6 Conclusion

We presented a novel feature adaptation approach for deep anomaly detection. We improved over PANDA Reiss et al. (2020) by introducing two new components. First, we presented a new loss function for feature adaptation for one-class classification that measures the angular distance in relation to the center of the extracted features rather than the origin. Second, we used the angular distance as a metric in place of the Euclidean distance, as the latter was shown to be sensitive to prediction confidence, which was found to be uncorrelated with anomalies. Our method achieves the top anomaly detection performance and appears to eliminate catastrophic collapse nearly entirely.


  • [1] L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. In ICLR, Cited by: §2.
  • [2] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger (2019) MVTec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9592–9600. Cited by: §A.1, §5.1.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2, §5.3.
  • [4] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §5.
  • [5] X. Chen and K. He (2020) Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566. Cited by: §5.3.
  • [6] J. Elson, J. R. Douceur, J. Howell, and J. Saul (2007) Asirra: a captcha that exploits interest-aligned manual image categorization.. In ACM Conference on Computer and Communications Security, Vol. 7, pp. 366–374. Cited by: §A.1, §5.1.
  • [7] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo (2002) A geometric framework for unsupervised anomaly detection. In Applications of data mining in computer security, pp. 77–101. Cited by: §2.
  • [8] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §2, §3.1, §3.3.
  • [9] M. Glodek, M. Schels, and F. Schwenker (2013) Ensemble gaussian mixture models for probability density estimation. Computational Statistics 28 (1), pp. 127–138. Cited by: §2.
  • [10] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. In NeurIPS, Cited by: §2, §5.1.
  • [11] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §5.3.
  • [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §5.3.
  • [13] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, Cited by: §A.2, Table 4, Table 5, Table 6, §2, §2, §5.1, §5.2, Table 1.
  • [14] M. Huh, P. Agrawal, and A. A. Efros (2016) What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614. Cited by: §2.
  • [15] I. Jolliffe (2011) Principal component analysis. Springer. Cited by: §2.
  • [16] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2.
  • [17] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §A.1, §5.1.
  • [18] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In ECCV, Cited by: §2.
  • [19] L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007) Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 61–75. Cited by: §2.
  • [20] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han (2020) Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 296–307. Cited by: §A.1, §5.1.
  • [21] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. ICLR. Cited by: §2.
  • [22] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §A.1, §5.1.
  • [23] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • [24] P. Perera and V. M. Patel (2019) Learning deep features for one-class classification. IEEE Transactions on Image Processing 28 (11), pp. 5450–5463. Cited by: §1, §2.
  • [25] T. Reiss, N. Cohen, L. Bergman, and Y. Hoshen (2020) PANDA–adapting pretrained features for anomaly detection. arXiv preprint arXiv:2010.05903. Cited by: §A.2, Table 4, Table 5, Table 6, §1, §2, §3.2, §3.2, §4.1, §4.2, §4.3, §5.1, §5.2, §5.3, Table 1, Table 2, §5, §6.
  • [26] L. Ruff, N. Gornitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In ICML, Cited by: §A.2, Table 4, Table 5, Table 6, §1, §2, §3.2, §4.2, §5.2, Table 1.
  • [27] B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt (2000) Support vector method for novelty detection. In NIPS, Cited by: §2.
  • [28] K. Sohn, C. Li, J. Yoon, M. Jin, and T. Pfister (2020) Learning and evaluating representations for deep one-class classification. arXiv preprint arXiv:2011.02578. Cited by: §A.2, Table 4, Table 5, Table 6, §2, §5.2, Table 1.
  • [29] J. Tack, S. Mo, J. Jeong, and J. Shin (2020) Csi: novelty detection via contrastive learning on distributionally shifted instances. arXiv preprint arXiv:2007.08176. Cited by: §A.2, Table 4, Table 5, Table 6, §2, §3.3, §5.2, §5.3, Table 1, Table 2.
  • [30] D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning. Cited by: §2.
  • [31] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §A.1, §5.1.
  • [32] T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. Cited by: §3.3.
  • [33] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §2.
  • [34] X. Zheng, Y. Wang, G. Wang, and J. Liu (2018) Fast and robust segmentation of white blood cell images by self-supervised learning. Micron 107, pp. 55–71. Cited by: §A.1, §5.1.

Appendix A Experimental details

A.1 Dataset Descriptions

Standard datasets: We evaluate our method on a set of commonly used datasets: CIFAR-10 Krizhevsky et al. (2009): Consists of RGB images of 10 object classes. CIFAR-100 Krizhevsky et al. (2009): We use the coarse-grained version that consists of 20 classes. DogsVsCats: High-resolution color images of two classes: cats and dogs. The data were extracted from the ASIRRA dataset Elson et al. (2007); for each class, we use the first 10,000 images for training and the last 2,500 for testing.

Small datasets: To further extend our results, we compared the methods on a number of small datasets from different domains: Category Flowers & Caltech-UCSD Birds Nilsback and Zisserman (2008); Wah et al. (2011): For each of those datasets we evaluated the methods using only each of the first classes as normal, and using the entire test set for evaluation. MvTec Bergmann et al. (2019): This dataset contains 15 different industrial products, with normal images of proper products for training and several types of manufacturing errors as anomalies. The anomalies in MvTec are in-class, i.e. the anomalous images come from the same class as the normal images, with subtle variations.

Symmetric datasets: We evaluated our method on datasets that contain symmetries, such as images that have no preferred angle (microscopy, aerial images): WBC Zheng et al. (2018): We used the big classes in "Dataset 1" of microscopy images of white blood cells, and a train-test split. DIOR Li et al. (2020): We pre-processed the DIOR aerial image dataset by taking the segmented object in classes that have more than images with size larger than pixels.

A.2 Baselines

DROC Sohn et al. (2020): We used the numbers reported in the paper.

For the evaluation of the other competing methods, we trained using the official repositories of their authors and made an effort to select the best configurations available.

DeepSVDD Ruff et al. (2018): We resize all the images to pixels and use the official PyTorch implementation with the CIFAR-10 configuration.

MHRot Hendrycks et al. (2019): An improved version of the original RotNet approach. For high-resolution images we used the current GitHub implementation. For low resolution images, we modified the code to the architecture described in the paper, replicating the numbers in the paper on CIFAR-10.

CSI Tack et al. (2020), PANDA Reiss et al. (2020): We ran the code using the exact protocol described in the official repositories.

A.3 Implementation details

We fine-tune the two last blocks of an ImageNet pre-trained ResNet152 with an additional normalization layer for 100 epochs by minimizing where the temperature is set as 0.25. We use SGD optimizer with weight decay of , and no momentum. The size of the mini-batches is set to be . Finally, for anomaly scoring we use kNN with nearest neighbours.
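The kNN scoring step can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes cosine distance between normalized features, the feature vectors are toy 2-D values, and the value of K below is only a placeholder for the number of neighbours used in the experiments. The function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance for this sketch)."""
    return 1.0 - sum(a * b for a, b in zip(normalize(u), normalize(v)))

def knn_anomaly_score(query, train_features, k):
    """Mean distance from the query to its k nearest normal training features.
    Larger scores indicate more anomalous samples."""
    distances = sorted(cosine_distance(query, t) for t in train_features)
    return sum(distances[:k]) / k

# Toy features of normal training samples (in practice: adapted ResNet features).
train_features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.95, -0.1]]
K = 2  # placeholder value, chosen only for this example

normal_query = [1.0, 0.05]
anomalous_query = [0.0, 1.0]
print("normal score:   ", knn_anomaly_score(normal_query, train_features, K))
print("anomalous score:", knn_anomaly_score(anomalous_query, train_features, K))
```

A test sample that lies close to the normal training features receives a small score, while a sample pointing in a different direction receives a large one. The ROC-AUC is then computed over these scores.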

A.4 Training Resources

Training each dataset class presented in this paper takes approximately 3 hours on a single NVIDIA RTX-2080 TI.

A.5 Per-class results

Class  DeepSVDD            MHRot                    DROC                CSI                 PANDA                Ours
       Ruff et al. (2018)  Hendrycks et al. (2019)  Sohn et al. (2020)  Tack et al. (2020)  Reiss et al. (2020)
0      61.7                77.5                     90.9                89.9                97.4                 97.7
1      65.9                96.9                     98.9                99.1                98.4                 98.9
2      50.8                87.3                     88.1                93.1                93.9                 95.8
3      59.1                80.9                     83.1                86.4                90.6                 94.5
4      60.9                92.7                     89.9                93.9                97.5                 97.3
5      65.7                90.2                     90.3                93.2                94.4                 97.1
6      67.7                90.9                     93.5                95.1                97.5                 98.4
7      67.3                96.5                     98.2                98.7                97.5                 98.3
8      75.9                95.2                     96.5                97.9                97.6                 98.7
9      73.1                93.3                     95.2                95.5                97.4                 98.4
Mean   64.8                90.1                     92.5                94.3                96.2                 97.5
Table 4: CIFAR-10 anomaly detection performance (mean ROC-AUC %). Bold denotes the best results.
Class  DeepSVDD            MHRot                    DROC                CSI                 PANDA                Ours
       Ruff et al. (2018)  Hendrycks et al. (2019)  Sohn et al. (2020)  Tack et al. (2020)  Reiss et al. (2020)
0      66.0                77.6                     82.9                86.3                91.5                 96.2
1      60.1                72.8                     84.3                84.8                92.6                 95.9
2      59.2                71.9                     88.6                88.9                98.3                 98.4
3      58.7                81.0                     86.4                85.7                96.6                 97.7
4      60.9                81.1                     92.6                93.7                96.3                 97.6
5      54.2                66.7                     84.5                81.9                94.1                 96.5
6      63.7                87.9                     73.4                91.8                96.4                 98.6
7      66.1                69.4                     84.2                83.9                91.2                 94.1
8      74.8                86.8                     87.7                91.6                94.7                 97.1
9      78.3                91.7                     94.1                95.0                94.0                 96.6
10     80.4                87.3                     85.2                94.0                96.4                 97.4
11     68.3                85.4                     87.8                90.1                92.6                 96.3
12     75.6                85.1                     82.0                90.3                93.1                 95.6
13     61.0                60.3                     82.7                81.5                89.4                 93.0
14     64.3                92.7                     93.4                94.4                98.0                 98.9
15     66.3                70.4                     75.8                85.6                89.7                 92.6
16     72.0                78.3                     80.3                83.0                92.1                 95.4
17     75.9                93.5                     97.5                97.5                97.7                 98.5
18     67.4                89.6                     94.4                95.9                94.7                 97.4
19     65.8                88.1                     92.4                95.2                92.7                 97.0
Mean   67.0                80.1                     86.5                89.6                94.1                 96.5
Table 5: CIFAR-100 coarse-grained version anomaly detection performance (mean ROC-AUC %). Bold denotes the best results.
Class  DeepSVDD            MHRot                    DROC                CSI                 PANDA                Ours
       Ruff et al. (2018)  Hendrycks et al. (2019)  Sohn et al. (2020)  Tack et al. (2020)  Reiss et al. (2020)
Cat    49.2                87.7                     91.7                85.7                99.2                 99.7
Dog    51.8                84.2                     87.5                86.9                95.4                 99.1
Mean   50.5                86.0                     89.6                86.3                97.3                 99.4
Table 6: CatsVsDogs anomaly detection performance (mean ROC-AUC %). Bold denotes the best results.
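
For reference, the sketch below shows one way the reported metric can be computed from anomaly scores. It is an illustration under stated assumptions, not the evaluation code used for the tables: label 1 marks anomalous samples, higher scores mean more anomalous, and ROC-AUC is computed through its pairwise-ranking form. The toy scores are made up; the per-class values are the "Ours" column of Table 4.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a random anomalous sample scores
    higher than a random normal sample (ties count as one half)."""
    anomalous = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in anomalous:
        for n in normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous) * len(normal))

# Toy anomaly scores for one "normal" class (illustrative values only).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print("toy ROC-AUC:", roc_auc(toy_scores, toy_labels))

# The "Mean" row averages the per-class ROC-AUC values, here for the
# "Ours" column of Table 4 (CIFAR-10).
ours_cifar10 = [97.7, 98.9, 95.8, 94.5, 97.3, 97.1, 98.4, 98.3, 98.7, 98.4]
mean_auc = sum(ours_cifar10) / len(ours_cifar10)
print("mean ROC-AUC (%):", round(mean_auc, 1))
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable for full test sets.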