Deep Nearest Neighbor Anomaly Detection

02/24/2020 ∙ by Liron Bergman, et al.

Nearest neighbors is a successful and long-standing technique for anomaly detection. Significant progress has recently been achieved by self-supervised deep methods (e.g. RotNet). Self-supervised features, however, typically underperform Imagenet-pretrained features. In this work, we investigate whether the recent progress can indeed outperform nearest-neighbor methods operating on an Imagenet-pretrained feature space. The simple nearest-neighbor-based approach is experimentally shown to outperform self-supervised methods in accuracy, few-shot generalization, training time and noise robustness, while making fewer assumptions on image distributions.




1 Introduction

Agents interacting with the world are constantly exposed to a continuous stream of data. Agents can benefit from classifying particular data as anomalous i.e. particularly interesting or unexpected. Such discrimination is helpful in allocating resources to the observations that require it. This mechanism is used by humans to discover opportunities or alert of dangers. Anomaly detection by artificial intelligence has many important applications such as fraud detection, cyber intrusion detection and predictive maintenance of critical industrial equipment.

In machine learning, the task of anomaly detection consists of learning a classifier that can label a data point as normal or anomalous. In supervised classification, methods attempt to perform well on normal data, whereas anomalous data is considered noise. The goal of anomaly detection methods is specifically to detect extreme cases, which are highly variable and hard to predict. This makes the task of anomaly detection challenging (and often poorly specified).

The three main settings for anomaly detection are: supervised, semi-supervised and unsupervised. In the supervised setting, labelled training examples exist for normal and anomalous data. It is therefore not fundamentally different from other classification tasks. This setting is also too restrictive for many anomaly detection tasks, as many anomalies of interest have never been seen before, e.g. the emergence of new diseases. In the more interesting semi-supervised setting, all training images are normal with no included anomalies. The task of learning a normal-anomaly classifier is now one-class classification. The most difficult setting is unsupervised, where an unlabelled training set of both normal and anomalous data exists. The typical assumption is that the proportion of anomalous data is significantly smaller than that of normal data. In this paper, we deal with both the semi-supervised and the unsupervised settings.

Anomaly detection methods are typically based on distance, distribution or classification. The emergence of deep neural networks has brought significant improvements to each category. In the last two years, deep classification-based methods have significantly outperformed all other methods. They mainly rely on the principle that classifiers trained to perform a certain task on normal data will perform this task well on unseen normal data, but will fail on anomalous data due to poor generalization to a different data distribution.

In a recent paper, Gu et al. (2019) demonstrated that a K nearest-neighbors (kNN) approach on the raw data is competitive with the state-of-the-art methods on tabular data. Surprisingly, kNN is not used or compared against in most current image anomaly detection papers. In this paper, we show that although kNN on raw image data does not perform well, it outperforms the state of the art when combined with a strong off-the-shelf generic feature extractor. Specifically, we embed every (train and test) image using an Imagenet-pretrained ResNet feature extractor. We compute the kNN distance between the embedding of each test image and the training set, and use a simple threshold-based criterion to determine if a datum is anomalous.

We evaluate this baseline extensively, both on commonly used datasets and on datasets that are quite different from Imagenet. We find that it has significant advantages over existing methods: i) higher than state-of-the-art accuracy ii) extremely low sample complexity iii) it can utilize very strong external feature extractors at minimal cost iv) it makes few assumptions on the images, e.g. images can be rotation invariant and of arbitrary size v) it is robust to anomalies in the training set, i.e. it can handle the unsupervised case (when coupled with our two-stage approach) vi) it is plug and play and has no training stage.

Another contribution of our paper is a novel adaptation of kNN to image group anomaly detection, a task that has received scant attention from the deep learning community.

Although using kNN for anomaly detection is not a new method, it is not often used or compared against by most recent image anomaly detection works. Our aim is to bring awareness to this simple but highly effective and general image anomaly detection method. We believe that every new work should compare to this simple method due to its simplicity, robustness, low sample complexity and generality.

2 Previous Work

Pre-deep learning methods: The two classical paradigms for anomaly detection are reconstruction-based and distribution-based. Reconstruction-based methods use the training set to learn a set of basis functions, which represent the normal data in an effective way. At test time, they attempt to reconstruct a new sample using the learned basis functions. The method assumes that normal data will be reconstructed well, while anomalous data will not. By thresholding the reconstruction cost, the sample is classified as normal or anomalous. Choices of basis functions include: sparse combinations of other samples (e.g. kNN) (Eskin et al., 2002), principal components (Jolliffe, 2011; Candès et al., 2011), and K-means (Hartigan and Wong, 1979). Reconstruction metrics include Euclidean distance and perceptual losses such as SSIM (Wang et al., 2004). The main weaknesses of reconstruction-based methods are i) the difficulty of learning discriminative basis functions ii) the difficulty of finding effective similarity measures.

Semi-supervised distribution-based approaches attempt to learn the probability density function (PDF) of the normal data. Given a new sample, its probability is evaluated, and the sample is designated as anomalous if the probability is lower than a certain threshold. Such methods include parametric models, e.g. mixture of Gaussians (GMM), and non-parametric methods such as Kernel Density Estimation (Latecki et al., 2007) and kNN (Eskin et al., 2002) (which we also consider reconstruction-based). The main weakness of distributional methods is the difficulty of density estimation for high-dimensional data. Another popular approach is one-class SVM (Scholkopf et al., 2000) and the related SVDD (Tax and Duin, 2004). SVDD can be seen as fitting the minimal volume sphere that includes at least a certain percentage of the normal data points. As this method is very sensitive to the feature space, kernel methods were used to learn an effective feature space.

Augmenting classical methods with deep networks: The success of deep neural networks has prompted research combining deep learned features with classical methods. PCA methods were extended to deep auto-encoders (Yang et al., 2017), while their reconstruction costs were extended to deep perceptual losses (Zhang et al., 2018). GANs were also used as a basis function for reconstruction in images. One approach (Zong et al., 2018) to improving distributional models is to first learn to embed data in a semantic, low-dimensional space and then model its distribution using standard methods, e.g. GMM. SVDD was extended by Ruff et al. (2018) to learn deep features as a superior alternative to kernel methods. This method suffers from a "mode collapse" issue, which has been the subject of follow-up work. The approach investigated in this paper can be seen as belonging to this category, as classical kNN is extended with deep learned features.

Self-supervised Deep Methods: Instead of using supervision for learning deep representations, self-supervised methods train neural networks to solve an auxiliary task for which obtaining data is free or at least very inexpensive. It should be noted that self-supervised representations typically underperform those learned from large supervised datasets such as Imagenet. Auxiliary tasks for learning high-quality image features include: video frame prediction (Mathieu et al., 2016), image colorization (Zhang et al., 2016; Larsson et al., 2016), and puzzle solving (Noroozi and Favaro, 2016). Recently, Gidaris et al. (2018) applied a set of image processing transformations (rotations around the image axis) and predicted the true image orientation, using this task to learn high-quality image features. Golan and El-Yaniv (2018) used similar image-processing task prediction for detecting anomalies in images. This method has shown good performance on detecting images from anomalous classes. Its performance was improved by Hendrycks et al. (2019), and it was combined with openset classification and extended to tabular data by Bergman and Hoshen (2020). In this work, we show that self-supervised methods underperform simpler kNN-based methods that use strong generic feature extractors on image anomaly detection tasks.

3 Deep Nearest-Neighbors for Image Anomaly Detection

We investigate a simple K nearest-neighbors (kNN) based method for image anomaly detection. We call this method Deep Nearest-Neighbors (DN2).

3.1 Semi-supervised Anomaly Detection

DN2 takes a set of input images $X_{train} = \{x_1, x_2, \ldots, x_N\}$. In the semi-supervised setting we assume that all input images are normal. DN2 uses a pre-trained feature extractor $F$ to extract features from the entire training set:

$$f_i = F(x_i)$$

In this paper, we use a ResNet feature extractor that was pretrained on the Imagenet dataset. At first sight it might appear that this supervision is a strong requirement; however, such feature extractors are widely available. We will later show experimentally that the normal or anomalous images do not need to be particularly closely related to Imagenet.

The training set is now summarized as a set of embeddings $F_{train} = \{f_1, f_2, \ldots, f_N\}$. After the initial stage, the embeddings can be stored, amortizing the inference of the training set.

To infer whether a new sample $y$ is anomalous, we first extract its feature embedding: $f_y = F(y)$. We then compute its kNN distance and use it as the anomaly score:

$$d(y) = \frac{1}{k} \sum_{f \in N_k(f_y)} \|f - f_y\|^2$$

$N_k(f_y)$ denotes the $k$ nearest embeddings to $f_y$ in the training set $F_{train}$. We elected to use the Euclidean distance, which often achieves strong results on features extracted by deep networks, but other distance measures can be used in a similar way. By verifying whether the distance $d(y)$ is larger than a threshold, we determine if an image is normal or anomalous.
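The scoring rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the default of two neighbors, the aggregation (mean of squared Euclidean distances) and the toy 2-D vectors standing in for ResNet embeddings are all hypothetical choices, not details confirmed by the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_score(query, train_embeddings, k=2):
    """Mean squared distance from `query` to its k nearest training embeddings.

    The mean-of-squares aggregation is an assumption made for this sketch.
    """
    nearest = sorted(sq_dist(query, f) for f in train_embeddings)[:k]
    return sum(nearest) / len(nearest)

def is_anomalous(query, train_embeddings, threshold, k=2):
    """Flag the sample when its kNN score exceeds the chosen threshold."""
    return knn_score(query, train_embeddings, k) > threshold

# Toy 2-D vectors standing in for stored training embeddings (hypothetical).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(knn_score((0.5, 0.5), train))  # 0.5, close to the training set
print(knn_score((5.0, 5.0), train))  # 36.5, far from the training set
```

Because the training embeddings are computed once and stored, scoring a new image only requires one forward pass of the feature extractor followed by the distance computation.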

3.2 Unsupervised Anomaly Detection

In the fully unsupervised case, we can no longer assume that all input images are normal. Instead, we assume that only a small proportion of input images are anomalous. To deal with this more difficult setting (and in line with previous works on unsupervised anomaly detection), we propose to first conduct a cleaning stage on the input images. After the feature extraction stage, we compute the kNN distance between each input image and the rest of the input images. Assuming that anomalous images lie in low-density regions, we remove a fraction of the images with the largest kNN distances. This fraction should be chosen such that it is larger than the estimated proportion of anomalous input images. It will be shown later in our experiments that DN2 requires very few training images. We can therefore be very aggressive in the percentage of removed images, and keep only the images most likely to be normal. After removal of the suspected anomalous input images, the remaining set is assumed to contain a very high proportion of normal images. We can therefore proceed exactly as in the semi-supervised case.
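A minimal sketch of this cleaning stage follows. The function names, the `remove_fraction` parameter, the choice of two neighbors and the toy 2-D vectors are assumptions made for illustration; the distance aggregation (mean of squared Euclidean distances) is likewise an assumption.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def leave_one_out_score(i, embeddings, k=2):
    """kNN score of embedding i against all the other embeddings."""
    dists = sorted(sq_dist(embeddings[i], f)
                   for j, f in enumerate(embeddings) if j != i)[:k]
    return sum(dists) / len(dists)

def clean_training_set(embeddings, remove_fraction, k=2):
    """Drop the given fraction of embeddings with the largest kNN scores."""
    n = len(embeddings)
    scores = [leave_one_out_score(i, embeddings, k) for i in range(n)]
    n_keep = n - int(round(remove_fraction * n))
    keep = sorted(range(n), key=lambda i: scores[i])[:n_keep]
    return [embeddings[i] for i in sorted(keep)]  # preserve original order

# Four tightly clustered points and one isolated point (hypothetical data).
noisy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (9.0, 9.0)]
cleaned = clean_training_set(noisy, remove_fraction=0.2)
print(cleaned)  # the isolated point (9.0, 9.0) is removed
```

The cleaned list can then be used as the training set of the semi-supervised procedure without further changes.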

3.3 Group Image Anomaly Detection

Group anomaly detection tackles the setting where the input sample consists of a set of images. The particular combination is important, but not the order. It is possible that each image in the set is individually normal but the set as a whole is anomalous. As an example, let us assume normal sets consisting of $M$ images, one randomly sampled image from each of $M$ classes. If we train a point (per-image) anomaly detector, it will be able to detect anomalous sets containing pointwise anomalous images, e.g. images taken from classes not seen in training. An anomalous set containing multiple images from one seen class and no images from another will, however, be classified as normal, as all its images are individually normal. Previously, several deep autoencoder methods were proposed (e.g. D’Oro et al. (2019)) to tackle group anomaly detection in images. Such methods suffer from multiple drawbacks: i) high sample complexity ii) sensitivity to the reconstruction metric iii) potential lack of sensitivity to the groups. We propose an effective kNN-based approach. The proposed method embeds the set by orderless pooling (we chose averaging) over all the features of the images in the set:

  1. Feature extraction from all images $x_g^i$ in the group $g$ with the pre-trained feature extractor $F$: $f_g^i = F(x_g^i)$

  2. Orderless pooling of features across the group: $f_g = \frac{1}{|g|} \sum_i f_g^i$

Having extracted the group feature described above, we proceed to detect anomalies using DN2.

Class  OC-SVM  Deep-SVDD    Geometric    GOAD         MHRot  DN2
0      70.6    61.7 ± 1.3   74.7 ± 0.4   77.2 ± 0.6   77.5   93.9
1      51.3    65.9 ± 0.7   95.7 ± 0.0   96.7 ± 0.2   96.9   97.7
2      69.1    50.8 ± 0.3   78.1 ± 0.4   83.3 ± 1.4   87.3   85.5
3      52.4    59.1 ± 0.4   72.4 ± 0.5   77.7 ± 0.7   80.9   85.5
4      77.3    60.9 ± 0.3   87.8 ± 0.2   87.8 ± 0.7   92.7   93.6
5      51.2    65.7 ± 0.8   87.8 ± 0.1   87.8 ± 0.6   90.2   91.3
6      74.1    67.7 ± 0.8   83.4 ± 0.5   90.0 ± 0.6   90.9   94.3
7      52.6    67.3 ± 0.3   95.5 ± 0.1   96.1 ± 0.3   96.5   93.6
8      70.9    75.9 ± 0.4   93.3 ± 0.0   93.8 ± 0.9   95.2   95.1
9      50.6    73.1 ± 0.4   91.3 ± 0.1   92.0 ± 0.6   93.3   95.3
Avg    62.0    64.8         86.0         88.2         90.1   92.5
Table 1: Anomaly Detection Accuracy on Cifar10 (ROCAUC %)
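The two steps above, followed by nearest-neighbor scoring of the pooled feature, can be illustrated with a small sketch. The helper names and the toy 2-D per-image features are hypothetical; real features would come from the pre-trained extractor.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_pool(features):
    """Average the per-image features of a group; the result ignores image order."""
    n = len(features)
    return tuple(sum(col) / n for col in zip(*features))

def knn_score(query, train, k=1):
    """Mean squared distance to the k nearest pooled training features."""
    nearest = sorted(sq_dist(query, f) for f in train)[:k]
    return sum(nearest) / len(nearest)

# Hypothetical per-image features for two seen classes, A and B.
A, B = (1.0, 0.0), (0.0, 1.0)
normal_groups = [[A, B], [B, A]]              # one image per class, any order
train = [mean_pool(g) for g in normal_groups]

print(knn_score(mean_pool([B, A]), train))  # 0.0, a normal combination
print(knn_score(mean_pool([A, A]), train))  # 0.5, each image is normal but the set is not
```

Averaging makes the group feature invariant to the order of the images, so no reordering of the training sets is needed.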

4 Experiments

In this section, we present extensive experiments showing that the simple kNN approach described above achieves better than state-of-the-art performance. The conclusions generalize across tasks and datasets. We extend this method to be more robust to noise, making it applicable to the unsupervised setting. We further extend this method to be effective for group anomaly detection.

4.1 Unimodal Anomaly Detection

The most common setting for evaluating anomaly detection methods is unimodal. In this setting, a classification dataset is adapted by designating one class as normal and the other classes as anomalous. The normal training set is used to train the method, and all the test data are used to evaluate its inference performance. In line with previous works, we report the ROC area under the curve (ROCAUC).
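For readers unfamiliar with the metric, ROCAUC equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal sample. The sketch below computes it by direct pairwise comparison; the function name and the example scores are illustrative, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """ROCAUC from anomaly scores; labels are 1 for anomalous, 0 for normal.

    Counts the fraction of (anomalous, normal) pairs ranked correctly,
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical anomaly scores
labels = [0, 0, 1, 1]
print(roc_auc(scores, labels))   # 0.75
```

Since the metric depends only on the ranking of the scores, no threshold has to be chosen to report it.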

Dataset       OC-SVM  Geometric  GOAD  DN2
FashionMNIST  92.8    93.5       94.1  94.4
CIFAR100      62.6    78.7       -     89.3
Table 2: Anomaly Detection Accuracy on Fashion MNIST and CIFAR100 (ROCAUC %)

We conduct experiments against the following state-of-the-art methods: Deep-SVDD (Ruff et al., 2018), which combines OCSVM with deep feature learning, Geometric (Golan and El-Yaniv, 2018), GOAD (Bergman and Hoshen, 2020), and Multi-Head RotNet (MHRot) (Hendrycks et al., 2019). The latter three all use variations of RotNet.

For all methods except DN2, we report the results from the original papers if available. In the case of Geometric (Golan and El-Yaniv, 2018) and the multi-head RotNet (MHRot) (Hendrycks et al., 2019), for datasets that were not reported by the authors, we run the Geometric code release for low-resolution experiments and MHRot for high-resolution experiments (as no code was released for the low-resolution experiments).

Cifar10: This is the most common dataset for evaluating unimodal anomaly detection. CIFAR10 contains 32×32 color images from 10 object classes. Each class has 5,000 training images and 1,000 test images. The results are presented in Tab. 1. Note that the performance of DN2 is deterministic for a given train and test set (no variation between runs). We can observe that OC-SVM and Deep-SVDD are the weakest performers. This is because neither the raw pixels nor the features learned by Deep-SVDD are discriminative enough for the distance to the center of the normal distribution to be successful. Geometric and the later approaches GOAD and MHRot perform fairly well, but their average ROCAUC does not exceed 90.1. DN2 significantly outperforms all other methods.

In this paper, we choose to evaluate performance without finetuning between the dataset and simulated anomalies (which improves performance on all methods including DN2). Outlier Exposure is one technique for such finetuning. Although it does not achieve the top performance by itself, it reported improvements in average ROCAUC on CIFAR10 when combined with MHRot. This and other ensembling methods can also improve the performance of DN2 but are out of the scope of this paper.

Fashion MNIST: We evaluate Geometric, GOAD and DN2 on the Fashion MNIST dataset, consisting of 6000 training images per class and a test set of 1000 images per class. We present a comparison of DN2 vs. OCSVM, Deep SVDD, Geometric and GOAD. We can see that DN2 outperforms all other methods, despite the data being visually quite different from Imagenet, on which the feature extractor was trained.

CIFAR100: We evaluate Geometric, GOAD and DN2 on the CIFAR100 dataset. CIFAR100 has 100 fine-grained classes with 500 train images each, or 20 coarse-grained classes with 2500 train images each. Following previous papers, we use the coarse-grained version. The protocol is the same as for CIFAR10. We present a comparison of DN2 vs. OCSVM, Deep SVDD, Geometric and GOAD. The results are in line with those obtained for CIFAR10.

Comparisons against MHRot:

We present a further comparison between DN2 and MHRot (Hendrycks et al., 2019) on several commonly-used datasets. The experiments give further evidence for the generality of DN2, in datasets where RotNet-based methods are not restricted by low-resolution, or by image invariance to rotations.

We compute the ROCAUC score with each of the first categories in alphabetical order (all categories if the dataset has fewer than the cutoff) designated in turn as normal for training. The standard train and test splits are used. All test images from all classes are used for inference, with the appropriate class designated as normal and all the rest as anomalies. For brevity of presentation, the average ROCAUC score of the tested classes is reported.

102 Category Flowers (Nilsback and Zisserman, 2008): This dataset consists of 102 categories of flowers, with the same number of training images in each category. The number of test images varies per class.

Caltech-UCSD Birds 200 (Wah et al., 2011): This dataset consists of 200 categories of bird species. The images of each class are split evenly between train and test.

CatsVsDogs (Elson et al., 2007): This dataset consists of two categories, dogs and cats, with the same number of training images each. The test set likewise contains the same number of images for each class. Each image contains either a dog or a cat in various scenes, taken from different angles. The data was extracted from the ASIRRA dataset; for each class, we use the first images as the training set and the last images as the test set.

The results are shown in Tab. 3. DN2 significantly outperforms MHRot on all datasets.

Dataset          MHRot  DN2
Oxford Flowers   65.9   93.9
UCSD Birds 200   64.4   95.2
CatsVsDogs       88.5   97.5
Table 3: MHRot vs. DN2 on Flowers, Birds, CatsVsDogs (Average Class ROCAUC %)
Figure 1: Network depth (number of ResNet layers) improves both Cifar10 and FashionMNIST results.
Figure 2: Number of neighbors vs. ROCAUC. The optimal number of neighbors K is around 2.

Effect of network depth:

Deeper networks trained on large datasets such as Imagenet learn features that generalize better than those of shallow networks. We investigated the performance of DN2 when using features from networks of different depths. Specifically, we plot the average ROCAUC for ResNets with 50, 101 and 152 layers in Fig. 1. DN2 works well with all networks, but performance improves with greater network depth.

Effect of the number of neighbors:

The only free parameter in DN2 is the number of neighbors used in kNN. We present in Fig. 2 a comparison of average CIFAR10 and FashionMNIST ROCAUC for different numbers of nearest neighbors. The differences are not particularly large, but 2 neighbors are usually best.

Effect of data invariance:

Figure 3: (left) A chimney image from the DIOR dataset (right) An image from the WBC Dataset.
Dataset  MHRot  DN2
DIOR     83.2   92.2
WBC      60.5   82.9
Table 4: Anomaly Detection Accuracy on DIOR and WBC (ROCAUC %)

Methods that rely on predicting geometric transformations, e.g. (Golan and El-Yaniv, 2018; Hendrycks et al., 2019; Bergman and Hoshen, 2020), use a strong data prior: that images have a predetermined orientation (for rotation prediction) and centering (for translation prediction). This assumption is often false for real images. Two interesting cases that do not satisfy this assumption are aerial and microscope images, as they do not have a preferred orientation, making rotation prediction ineffective.

DIOR (Li et al., 2020): An aerial image dataset. The images are registered but do not have a preferred orientation. We use the object categories that contain a sufficient number of images above a minimum resolution. We use the bounding boxes provided with the data, and take each object whose bounding box exceeds a minimum number of pixels along each axis. We resize each object to a fixed size. We follow the same protocol as in the earlier datasets. As the images are of high resolution, we use the public code release of Hendrycks et al. (2018) as a self-supervised baseline. The results are summarized in Tab. 4. We can see that DN2 significantly outperforms MHRot. This is due both to the generally stronger performance of the feature extractor and to the lack of the rotational prior that is strongly used by RotNet-type methods. Note that the images are centered, a prior used by the MHRot translation heads.

WBC (Zheng et al., 2018): To further investigate the performance on difficult real-world data, we performed an experiment on the WBC Image Dataset, which consists of high-resolution microscope images of different categories of white blood cells. The data do not have a preferred orientation. Additionally, the dataset is very small, with only a few tens of images per class. We use the subset of the data that was obtained from Jiangxi Telecom Science Corporation, China, and keep the classes that contain a sufficient number of images. In each class, we assign the first images to the train set and the last images to the test set. The results are presented in Tab. 4. As expected, DN2 outperforms MHRot by a significant margin, showing its greater applicability to real-world data.

4.2 Multimodal Anomaly Detection

Dataset   Geometric  DN2
CIFAR10   61.7       71.7
CIFAR100  57.3       71.0
Table 5: Anomaly Detection Accuracy on Multimodal Normal Image Distributions (ROCAUC %)

It has been argued (e.g. Ahmed and Courville (2019)) that unimodal anomaly detection is less realistic, as in practice normal distributions contain multiple classes. While we believe that both settings occur in practice, we also present results on the scenario where all classes are designated as normal apart from a single class that is taken as anomalous (e.g. all CIFAR10 classes are normal apart from "Cat"). Note that we do not provide the class labels of the different classes that compose the normal class; rather, we consider them to be a single multimodal class. We believe this simulates the realistic case of having a complex normal class consisting of many different unlabelled types of data.

We compared DN2 against Geometric on CIFAR10 and CIFAR100 in this setting. We provide the average ROCAUC across all the classes in Tab. 5. DN2 achieves significantly stronger performance than Geometric. We believe this occurs because Geometric requires the network not to generalize to the anomalous data. However, once the training data is sufficiently varied, the network can generalize even to unseen classes, making the method less effective. This is particularly evident on CIFAR100.

Figure 4: Number of images per group vs. detection ROCAUC. Group anomaly detection with mean pooling is better than simple feature concatenation for larger groups.
Figure 5: Number of training images vs. ROCAUC. (left) CIFAR10: strong performance is achieved by DN2 even from 10 images, whereas Geometric deteriorates critically. (center) FashionMNIST: similarly strong performance by DN2. (right) Impurity ratio vs. ROCAUC on CIFAR10. The training set cleaning procedure significantly improves performance.

4.3 Generalization from Small Training Datasets

One of the advantages of DN2, which does not utilize learning on the normal dataset, is its ability to generalize from very small datasets. This is not possible with self-supervised learning-based methods, which do not learn general enough features to generalize to normal test images. A comparison between DN2 and Geometric on CIFAR10 is presented in Fig. 5. We plotted the number of training images vs. average ROCAUC. We can see that DN2 can detect anomalies very accurately even from 10 images, while Geometric deteriorates quickly with a decreasing number of training images. We also present a similar plot for FashionMNIST in Fig. 5. Geometric is not shown, as it suffered from numerical issues for small numbers of images. DN2 again achieved strong performance from very few images.

4.4 Unsupervised Anomaly Detection

There are settings where the training set does not consist of purely normal images, but rather of a mixture of unlabelled normal and anomalous images. In this case we assume that anomalous images are only a small fraction of the number of normal images. The performance of DN2 as a function of the percentage of anomalies in the training set is presented in Fig. 5. The performance degrades somewhat as the percentage of training set impurities increases. To improve the performance, we proposed a cleaning stage, which removes the fraction of training set images that have the most distant nearest neighbors inside the training set. We then run DN2 as usual. The performance is also presented in Fig. 5. Our cleaning procedure is clearly shown to significantly reduce the performance degradation as the percentage of impurities increases.

4.5 Group Anomaly Detection

To compare to existing baselines, we first tested our method on the task in D’Oro et al. (2019). The data consists of normal sets containing MNIST images of the same digit, and anomalous sets containing images of different digits. By simply computing the trace of the covariance matrix of the per-image ResNet features in each set of images, we achieved a higher ROCAUC than that reported in the previous paper (without using the training set at all).
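The covariance-trace score mentioned above needs no training data: it is the sum of the per-dimension variances of the features in a set, so a set of similar images scores low and a set of dissimilar images scores high. The sketch below is one possible reading; the function name, the use of the population variance (dividing by the set size) and the toy 2-D features are assumptions.

```python
def covariance_trace(features):
    """Trace of the covariance matrix of a set of feature vectors.

    Equal to the sum of per-dimension variances; population variance
    (division by n) is an assumption made for this sketch.
    """
    n = len(features)
    total = 0.0
    for column in zip(*features):
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total

# Hypothetical features: a tight set (same digit) and a spread set (mixed digits).
same_digit = [(1.0, 0.0), (1.1, 0.0), (0.9, 0.0)]
mixed_digits = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

print(covariance_trace(same_digit) < covariance_trace(mixed_digits))  # True
```

Only the diagonal of the covariance matrix is required, so the score costs a single pass over each feature dimension.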

As a harder task for group anomaly detection in unordered image sets, we designate the normal class as sets consisting of exactly one image from each of $M$ CIFAR10 classes, while each anomalous set consists of $M$ images selected randomly among the same classes (some classes have more than one image and some have none). As a simple baseline, we report the average ROCAUC (Fig. 4) for anomaly detection using DN2 on the concatenated features of each individual image in the set. As expected, this baseline works well for small values of $M$, where we have enough examples of all possible permutations of the class ordering, but as $M$ grows larger its performance decreases, as the number of permutations grows exponentially. We compare this method, with 1000 image sets for training, to nearest neighbours of the orderless max-pooled and average-pooled features, and see that mean-pooling significantly outperforms the baseline for large values of $M$. While we may improve the performance of the concatenated features by augmenting the dataset with all possible orderings of the training sets, the augmented dataset will grow exponentially for a non-trivial number of images per set, making this an ineffective approach.

4.6 Implementation

In all instances of DN2, we first resize the input image, then take a center crop, and finally use an Imagenet-pretrained ResNet (of a fixed depth unless otherwise specified) to extract the features just after the global pooling layer. This feature is the image embedding.

5 Analysis

In this section, we perform an analysis of DN2, both by comparing kNN to other classification methods, as well as comparing the features extracted by the pretrained networks vs. features learned by self-supervised methods.

5.1 kNN vs. one-class classification

In our experiments, we found that kNN achieved very strong performance for anomaly detection tasks. Let us try to gain a better understanding of the reasons for this strong performance. In Fig. 6 we can observe t-SNE plots of the test set features of CIFAR10. The normal class is colored in yellow, while the anomalous data is marked in blue. It is clear that the pre-trained features embed images from the same class into a fairly compact region. We therefore expect the density of normal training images to be much higher around normal test images than around anomalous test images. This is responsible for the success of kNN methods.

C=1    C=3    C=5    C=10   kNN
91.94  92.00  91.87  91.64  92.52
Table 6: Accuracy on CIFAR10 using K-means approximations and full kNN (ROCAUC %)
Figure 6: t-SNE plots of the features learned by SVDD (left), Geometric (center) and Imagenet pre-trained (right) on CIFAR10, where the normal class is Airplane (top), Automobile (bottom). We can see that Imagenet-pretrained features clearly separate the normal class (yellow) and anomalies (blue). Geometric learns poor features of Airplane and reasonable features on Automobile. Deep-SVDD does not learn features that allow clean separation.

kNN has linear complexity in the number of training data samples. Methods such as One-Class SVM or SVDD attempt to learn a single hypersphere, and use the distance to the center of the hypersphere as a measure of anomaly. In this case the inference runtime is constant in the size of the training set, rather than linear as in the kNN case. The drawback is the typically lower performance. Another popular way (Fukunaga and Narendra, 1975) of decreasing the inference time is using K-means clustering of the training features. With $N$ training samples and $C$ clusters, this speeds up inference by a ratio of $N/C$. We therefore suggest speeding up DN2 by clustering the training features into $C$ clusters and then performing kNN on the cluster centers rather than the original features. Tab. 6 presents a comparison of the performance of DN2 and its K-means approximations with different numbers of means (we use the sum of the distances to the 2 nearest neighbors). We can see that for a small loss in accuracy, the retrieval time can be reduced significantly.
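The K-means approximation can be sketched as follows. This is an illustration only: it uses plain Lloyd iterations with a deliberately simple initialization (the first C points), and the function names, toy 2-D data and the use of a single neighbor in the example are assumptions rather than the configuration used in the experiments.

```python
import math

def kmeans(points, c, iters=20):
    """Plain Lloyd's algorithm; the first c points serve as initial centers."""
    centers = [tuple(p) for p in points[:c]]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

def knn_sum(query, references, k=2):
    """Sum of Euclidean distances from the query to its k nearest references."""
    return sum(sorted(math.dist(query, r) for r in references)[:k])

# Two well-separated blobs of hypothetical training features.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers = kmeans(train, c=2)
print(sorted(centers))  # [(0.5, 0.5), (10.5, 10.5)]

# Scoring against 2 centers instead of 8 points preserves the ranking here.
normal, anomalous = (0.5, 0.5), (5.0, 5.0)
print(knn_sum(normal, centers, k=1) < knn_sum(anomalous, centers, k=1))  # True
```

Each query is compared with the C centers only, which is where the reduction in retrieval time comes from.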

5.2 Pretrained vs. self-supervised features

To understand the improvement in performance by pretrained feature extractors, we provide t-SNE plots of normal and anomalous test features extracted by Deep-SVDD, Geometric and DN2 (Resnet50 pretrained on Imagenet). The top plots are of a normal class that achieves moderate detection accuracy, while the bottom plots are of a normal class that achieves high accuracy. We can immediately observe that the normal class in Deep-SVDD is scattered among the anomalous classes, explaining its lower performance. In Geometric the features of the normal class are a little more localized, however the density of the normal region is still only moderately concentrated. We believe that the fairly good performance of Geometric is achieved by the massive ensembling that it performs (combination of augmentations). We can see that Imagenet pretrained features preserve very strong locality. This explains the strong performance of DN2.

6 Discussion

A general paradigm for anomaly detection: Recent papers (e.g. Golan and El-Yaniv (2018)) advocated the paradigm of self-supervision, possibly augmented by an external dataset, e.g. outlier exposure. The results in this paper give strong evidence for an alternative paradigm: i) learn general features using all the available supervision on vaguely related datasets; ii) the learned features are expected to be general enough to allow the use of standard anomaly detection methods (e.g. kNN, k-means). The pretrained paradigm is much faster to deploy than self-supervised methods and has many other advantages, investigated extensively in Sec. 4. We expect that for image data that has no similarity whatsoever to Imagenet, using pre-trained features may be less effective. Nevertheless, in our experiments we found that Imagenet-pretrained features were effective on aerial images as well as microscope images, even though both settings are very different from Imagenet. We therefore expect DN2-like methods to be very broadly applicable.

External supervision: The key enabler for DN2’s success is the availability of a high quality external feature extractor. The ResNet extractor that we used was previously trained on Imagenet. Using supervision is typically seen as being more expensive and laborious than self-supervised methods. In this case, however, we do not see it as a disadvantage at all. We used networks that have already been trained and are as commoditized as free open-source software libraries. They are available completely free, require no new supervision at all to be used on a new dataset, and incur minimal time and storage costs for training. The whole process consists of merely a single PyTorch line; we therefore believe that in this case, the discussion of whether these methods can be considered supervised is purely philosophical.

Scaling up to very large datasets: Nearest neighbors are famously slow for large datasets, as the runtime increases linearly with the amount of training data. The complexity is less severe for parametric classifiers such as neural networks. As this is a well known issue with nearest neighbors classification, much work has been devoted to circumventing it. One solution is fast kNN retrieval, e.g. by kd-trees. Another solution, used in Sec. 5, speeds up kNN by reducing the training set to its k-means and computing kNN on them. This is generalized further by an established technique that approximates NN by a recursive K-means algorithm (Fukunaga and Narendra, 1975). We expect that in practice, most of the runtime will be a result of the neural network inference on the test image, rather than of nearest neighbor retrieval.
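To make the kd-tree option mentioned above concrete, here is a minimal sketch of exact nearest-neighbor retrieval with a kd-tree. It is a toy example under stated assumptions: the helper names `build_kdtree` and `nearest` are hypothetical, the points are low-dimensional random data rather than deep features, and only the single nearest neighbor is retrieved. It is not the retrieval code used in the experiments.

```python
import numpy as np


def build_kdtree(points, indices=None, depth=0):
    """Recursively split the points at the median along a cycling coordinate axis."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    # Sort this subset along the split axis and store the median point at the node
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return {
        "index": int(order[mid]),
        "axis": axis,
        "left": build_kdtree(points, order[:mid], depth + 1),
        "right": build_kdtree(points, order[mid + 1:], depth + 1),
    }


def nearest(node, points, query, best=None):
    """Return (distance, index) of the stored point closest to the query."""
    if node is None:
        return best
    point = points[node["index"]]
    dist = float(np.linalg.norm(point - query))
    if best is None or dist < best[0]:
        best = (dist, node["index"])
    # Signed offset of the query from the splitting plane of this node
    diff = query[node["axis"]] - point[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query, best)
    # The far side can only contain a closer point if the splitting plane
    # is nearer to the query than the best distance found so far
    if abs(diff) < best[0]:
        best = nearest(far, points, query, best)
    return best


# Toy usage on random 3-dimensional points (illustrative data only)
rng = np.random.default_rng(0)
train_points = rng.normal(size=(1000, 3))
tree = build_kdtree(train_points)
query_point = rng.normal(size=3)
nn_distance, nn_index = nearest(tree, train_points, query_point)
```

The pruning test on the splitting plane is what lets a query skip large parts of the training set, instead of computing the distance to every training point as brute-force kNN does.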

Non-image data: Our investigation established a very strong baseline for image anomaly detection. This result, however, does not necessarily mean that all anomaly detection tasks can be performed this way. Generic feature extractors are very successful on images, and are emerging in other tasks, e.g. natural language processing (BERT (Devlin et al., 2018)). This is however not the case in some of the most important areas for anomaly detection, i.e. tabular data and time series. In those cases, general feature extractors do not exist, and due to the very high variance between datasets, there is no obvious path towards creating such feature extractors. Note however that as deep methods are generally less successful on tabular data, the baseline of kNN on raw data is a very strong one. Nevertheless, we believe that these data modalities present the most promising area for self-supervised anomaly detection. Bergman and Hoshen (2020) proposed a method along these lines.

7 Conclusion

We compare a simple method, kNN on deep image features, to current approaches for semi-supervised and unsupervised anomaly detection. Despite its simplicity, the method was shown to outperform the state-of-the-art methods in terms of accuracy, training time, robustness to input impurities, robustness to dataset type and sample complexity. Although we believe that more complex approaches will eventually outperform this simple approach, we think that DN2 is an excellent starting point for practitioners of anomaly detection as well as an important baseline for future research.


  • F. Ahmed and A. Courville (2019) Detecting semantic anomalies. arXiv preprint arXiv:1908.04388. Cited by: §4.2.
  • L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. In ICLR, Cited by: §2, §4.1, §4.1, §6.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis? JACM. Cited by: §2.
  • P. D’Oro, E. Nasca, J. Masci, and M. Matteucci (2019) Group anomaly detection via graph autoencoders. Cited by: §3.3, §4.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §6.
  • J. Elson, J. R. Douceur, J. Howell, and J. Saul (2007) Asirra: a captcha that exploits interest-aligned manual image categorization. In ACM Conference on Computer and Communications Security, Vol. 7, pp. 366–374. Cited by: §4.1.
  • E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo (2002) A geometric framework for unsupervised anomaly detection. In Applications of data mining in computer security, pp. 77–101. Cited by: §2.
  • K. Fukunaga and P. M. Narendra (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE transactions on computers 100 (7), pp. 750–753. Cited by: §5.1, §6.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ICLR. Cited by: §2.
  • I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. In NeurIPS, Cited by: §2, §4.1, §4.1, §4.1, §6.
  • X. Gu, L. Akoglu, and A. Rinaldo (2019) Statistical analysis of nearest neighbor methods for anomaly detection. In NeurIPS, Cited by: §1.
  • J. A. Hartigan and M. A. Wong (1979) Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics). Cited by: §2.
  • D. Hendrycks, M. Mazeika, and T. G. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §4.1.
  • D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, Cited by: §2, §4.1, §4.1, §4.1, §4.1.
  • I. Jolliffe (2011) Principal component analysis. Springer. Cited by: §2.
  • G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In ECCV, Cited by: §2.
  • L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007) Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 61–75. Cited by: §2.
  • K. Li, G. Wan, G. Cheng, L. Meng, and J. Han (2020) Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 296–307. Cited by: §4.1.
  • M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. ICLR. Cited by: §2.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §4.1.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • L. Ruff, N. Gornitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In ICML, Cited by: §2, §4.1.
  • B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt (2000) Support vector method for novelty detection. In NIPS, Cited by: §2.
  • D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning. Cited by: §2.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §2.
  • B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML, Cited by: §2.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §2.
  • X. Zheng, Y. Wang, G. Wang, and J. Liu (2018) Fast and robust segmentation of white blood cell images by self-supervised learning. Micron 107, pp. 55–71. Cited by: §4.1.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. ICLR. Cited by: §2.