Contrastive Predictive Coding for Anomaly Detection

07/16/2021 ∙ by Puck de Haan, et al.

Reliable detection of anomalies is crucial when deploying machine learning models in practice, but remains challenging due to the lack of labeled data. To tackle this challenge, contrastive learning approaches are becoming increasingly popular, given the impressive results they have achieved in self-supervised representation learning settings. However, while most existing contrastive anomaly detection and segmentation approaches have been applied to images, none of them can use the contrastive losses directly for both anomaly detection and segmentation. In this paper, we close this gap by making use of the Contrastive Predictive Coding model (arXiv:1807.03748). We show that its patch-wise contrastive loss can directly be interpreted as an anomaly score, and how this allows for the creation of anomaly segmentation masks. The resulting model achieves promising results for both anomaly detection and segmentation on the challenging MVTec-AD dataset.


1 Introduction

An anomaly (or outlier, novelty, out-of-distribution sample) is an observation that differs significantly from the vast majority of the data. Anomaly detection (AD) tries to distinguish anomalous samples from the samples that are deemed ‘normal’ in the data. Detecting these anomalies has become increasingly relevant to make machine learning methods more reliable and to improve their applicability in real-world scenarios, such as automated industrial inspection and medical diagnosis (Ruff et al., 2021). Typically, anomaly detection is treated as an unsupervised learning problem, since labelled data is generally unavailable and because this allows for the development of methods that can detect previously unseen anomalies.

One promising direction involves the adaptation of contrastive learning approaches (Hjelm et al., 2019; Oord et al., 2019; Chen et al., 2020; He et al., 2020) to the anomaly detection setting (Tack et al., 2020; Winkens et al., 2020; Kopuklu et al., 2021; Qiu et al., 2021; Sohn et al., 2021). However, even though most of these approaches have been applied to image data, none of them can use the contrastive losses directly for both anomaly detection and segmentation.

In this paper, we demonstrate that Contrastive Predictive Coding (CPC) (Oord et al., 2019; Hénaff et al., 2020) can be applied to detect and segment anomalies in images. We show that the InfoNCE loss introduced by Oord et al. (2019) can be directly interpreted as an anomaly score. Since in this loss patches from within an image are contrasted against one another, we can further use it to create accurate anomaly segmentation masks. This results in a compact and straightforward anomaly detection and segmentation approach.

Figure 1: A schematic overview of Contrastive Predictive Coding for anomaly detection and segmentation in images. After extracting (sub-)patches from the input image, we contrast the encoded representations from within the same image against randomly matched negative representations. The resulting InfoNCE loss is used to determine whether a sub-patch is anomalous or not.

To improve the performance of the CPC model for anomaly detection, we introduce two adjustments. First, we adapt the setup of negative samples during testing such that anomalous patches can only appear within the positive sample. Second, we omit the autoregressive part of the CPC model. With these adjustments, our proposed method achieves promising performance on real-world data, such as the challenging MVTec-AD dataset (Bergmann et al., 2019).

2 Related Work

In this section, we will give an overview of contrastive learning approaches and different methods for anomaly detection.

2.1 Contrastive Learning

Lately, impressive results have been achieved with self-supervised methods based on contrastive learning (Wu et al., 2018; Oord et al., 2019; Hjelm et al., 2019; He et al., 2020; Chen et al., 2020; Li et al., 2021b). Overall, these methods work by making a model decide whether two (randomly) transformed inputs originated from the same input sample, or from two samples that have been randomly drawn from across the dataset. Different transformations can be chosen depending on the domain and downstream task. For example, on image data, random data augmentation such as random cropping and color jittering has proven useful (Chen et al., 2020; He et al., 2020). In this paper, we use the Contrastive Predictive Coding model (Oord et al., 2019; Hénaff et al., 2020), which makes use of temporal transformations. Generally, these approaches are evaluated by training a linear classifier on top of the created representations and by measuring the performance that this linear classifier can achieve on downstream tasks.

2.2 Anomaly Detection

Anomaly detection methods can roughly be divided into three categories: density-based, reconstruction-based and discriminative-based methods (Ruff et al., 2021). Density-based methods predict anomalies by estimating the probability distribution of the data (e.g. GANs, VAEs, or flow-based models) (Schlegl et al., 2017; Winkens et al., 2020; Liu et al., 2020); reconstruction-based methods are based on models that are trained with a reconstruction objective (e.g. autoencoders) (Zhou & Paffenroth, 2017; Bergmann et al., 2018; Luo et al., 2020); discriminative-based methods learn a decision boundary between anomalous and normal data (e.g. SVM, one-class classification) (Ruff et al., 2020; Tack et al., 2020; Liznerski et al., 2021; Li et al., 2021a). The method proposed in this paper can be seen as a density-based method with a discriminative one-class objective.

Several previous works investigate the use of contrastive learning for AD. Tack et al. (2020); Winkens et al. (2020); Sohn et al. (2021) make use of the SimCLR framework (Chen et al., 2020) to learn representations of the data. Then, they calculate a separate anomaly score by using these representations for density estimation, one-class classification, or by applying metric measures such as the cosine similarity and the norm of the representations. The downsides of this approach are that it requires extensive data augmentations and multiple different measures, or multiple models. Another comparable contrastive learning AD method (Kopuklu et al., 2021) uses noise contrastive estimation for training, similar to our method. Unlike our method, however, it maps the samples to multiple latent spaces and uses anomalous samples as negatives during training, resulting in a more complex model with a supervised training phase. NeuTraL AD (Qiu et al., 2021) makes use of a contrastive loss with learnable transformations, and reuses this loss as an anomaly score. In contrast to our method, their approach has been evaluated on time-series and tabular data.

Figure 2: Localization of anomalous regions for different classes in the MVTec-AD dataset. The top row shows the original input image, the middle row depicts the superimposition of the image and corresponding InfoNCE loss values (brighter colors represent higher loss values) and the bottom row shows the ground truth annotation. We find that our model consistently highlights anomalous regions across many classes. One notable exception is the screw class (right), for which the model assigns high loss values to the background in many cases.

3 Contrastive Predictive Coding

Contrastive Predictive Coding (Oord et al., 2019) is a self-supervised representation learning approach that leverages the structure of the data and enforces temporally nearby inputs to be encoded similarly in latent space. It achieves this by making the model decide whether a pair of samples is made up of temporally nearby samples or randomly assigned samples. This approach can also be applied to static image data by splitting the images up into patches, and interpreting each row of patches as a separate time-step.

The CPC model makes use of a contrastive loss function, coined InfoNCE, that is based on Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010) and is designed to optimize the mutual information between the latent representations of patches ($z_{t+k}$) and their surrounding patches ($c_t$):

$\mathcal{L}_N = -\,\mathbb{E}_X\left[\log \frac{\exp\left(z_{t+k}^\top W_k c_t\right)}{\sum_{z_j \in X} \exp\left(z_j^\top W_k c_t\right)}\right]$   (1)

where $z_t = g_\text{enc}(x_t)$ with $g_\text{enc}$ a non-linear encoder, and $c_t = g_\text{ar}(z_{\leq t})$ with $g_\text{ar}$ an autoregressive model. Furthermore, $W_k$ describes a linear transformation used for predicting $k$ time-steps ahead. The set of samples $X$ consists of one positive sample $z_{t+k}$ and $N$ negative samples $z_j$, for which $x_j$ is randomly sampled from across the current batch.
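To make Eq. (1) concrete, the sketch below shows how this InfoNCE loss could be computed in PyTorch for a single prediction step; the tensor shapes, variable names and the way negatives are gathered are our own assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_pos, c_t, W_k, z_neg):
    """Sketch of the InfoNCE loss in Eq. (1) for one prediction step k.

    z_pos: (B, D)     encoding z_{t+k} of the patch k steps ahead (positive)
    c_t:   (B, D)     context vector from the autoregressive model
    W_k:   nn.Linear  linear map used to predict k steps ahead
    z_neg: (B, N, D)  N negative encodings drawn from across the batch
    """
    pred = W_k(c_t)                                            # prediction of z_{t+k}, (B, D)
    pos_logit = (z_pos * pred).sum(dim=-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bnd,bd->bn", z_neg, pred)       # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1)         # (B, 1 + N)
    # The positive sits at index 0, so cross-entropy against label 0
    # equals the negative log-softmax term of Eq. (1).
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels, reduction="none")   # one loss value per sample
```

Keeping the per-sample losses (reduction="none") is what later allows the same quantity to be reused as a patch-wise anomaly score (Section 4).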

4 CPC for Anomaly Detection

We propose to apply the CPC model for anomaly detection and segmentation (Fig. 1). In order to improve the performance of the CPC model in this setting, we introduce two adjustments to its architecture: (1) We omit the autoregressive model $g_\text{ar}$. As a result, our loss function changes to:

$\mathcal{L}_N = -\,\mathbb{E}_X\left[\log \frac{\exp\left(z_{t+k}^\top W_k z_t\right)}{\sum_{z_j \in X} \exp\left(z_j^\top W_k z_t\right)}\right]$   (2)

This formulation is equivalent to the loss used in the Greedy InfoMax model (Löwe et al., 2019). This adjustment results in a simpler model which, according to preliminary results, is still able to learn useful latent representations. (2) We change the setup of the negative samples during testing. Previous implementations of the CPC model use random patches from within the same test-batch (Hénaff et al., 2020) as negative samples. However, this may result in negative samples containing anomalous patches, which could make it harder for the model to detect anomalous patches in the positive sample based on the contrastive loss. To avoid this, during testing, we draw negative samples from the (non-anomalous) training data.
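As a sketch of how the two adjustments could look in code (with assumed shapes and names, not taken from the paper), the scoring routine below predicts directly from the encoding $z_t$ instead of a context vector and draws its negatives from a bank of encoded, non-anomalous training sub-patches:

```python
import torch
import torch.nn.functional as F

def cpc_ad_patch_scores(z_t, z_pos, W_k, train_bank, num_negatives=16):
    """Per-patch anomaly scores based on the simplified loss in Eq. (2).

    z_t:        (B, D)    encoding of the current sub-patch (replaces c_t)
    z_pos:      (B, D)    encoding of the sub-patch k steps ahead
    W_k:        nn.Linear prediction head for step k
    train_bank: (M, D)    encodings of non-anomalous training sub-patches
    """
    pred = W_k(z_t)                                            # (B, D)
    # Negatives are sampled from the training bank, so at test time
    # anomalous content can only appear in the positive sample.
    idx = torch.randint(0, train_bank.size(0), (z_t.size(0), num_negatives))
    z_neg = train_bank[idx]                                    # (B, N, D)
    pos_logit = (z_pos * pred).sum(dim=-1, keepdim=True)
    neg_logits = torch.einsum("bnd,bd->bn", z_neg, pred)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(z_t.size(0), dtype=torch.long, device=z_t.device)
    return F.cross_entropy(logits, labels, reduction="none")   # higher = more anomalous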

In the test-phase, we use the loss function in Eq. 2 to decide whether an image patch can be classified as anomalous:

$\text{anomalous}(x_{t+k}) = \mathbb{1}\left[\mathcal{L}(x_{t+k}) > \tau\right]$   (3)

The threshold value $\tau$ remains implicit, since we use the area under the receiver operating characteristic curve (AUROC) as performance measure. While we can create anomaly segmentation masks by making use of the anomaly scores per patch, we can also apply our approach to decide whether a sample is anomalous – either by averaging over the scores of all patches within an image, or by examining the patch with the highest score.
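A minimal sketch of this aggregation step, assuming the per-sub-patch losses of one image are already available as a flat array (the function and argument names are ours):

```python
import numpy as np

def aggregate_patch_scores(patch_losses, grid_shape, reduce="mean"):
    """Turn per-sub-patch InfoNCE losses into an image-level anomaly score
    and a coarse score map for segmentation.

    patch_losses: 1D array with one loss per sub-patch
    grid_shape:   (rows, cols) layout of the sub-patches within the image
    reduce:       "mean" averages all scores, "max" takes the highest one
    """
    score_map = np.asarray(patch_losses, dtype=np.float32).reshape(grid_shape)
    image_score = score_map.max() if reduce == "max" else score_map.mean()
    return image_score, score_map
```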

5 Experiments

We evaluate the proposed Contrastive Predictive Coding model for anomaly detection and segmentation on the MVTec-AD dataset (Bergmann et al., 2019). This dataset contains high-resolution images of ten objects and five textures with pixel-accurate annotations and provides between 60 and 391 training images per class. During training, we randomly crop every image to a fraction of its original dimensions. Then, both train and test images are resized to 768×768 pixels. The resulting image is split into patches of size 256×256, where each patch has 50% overlap with its neighbouring patches. These patches are further divided into sub-patches of size 64×64, also with 50% overlap. These sub-patches are used in the InfoNCE loss (Fig. 1) to detect anomalies. The cropped and resized images are horizontally flipped with a probability of 50% during training.
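The patching scheme above could be implemented with PyTorch's unfold, as in the sketch below; the function is our own illustration and omits the random cropping and flipping augmentations.

```python
import torch

def extract_sub_patches(image, patch=256, sub=64):
    """Split a (C, 768, 768) image into overlapping patches and sub-patches.

    Both levels use a stride of half the window size, i.e. 50% overlap.
    Returns a tensor of shape (num_patches, num_sub_patches, C, sub, sub).
    """
    c = image.size(0)
    # (C, nH, nW, patch, patch) -> (nH*nW, C, patch, patch)
    patches = image.unfold(1, patch, patch // 2).unfold(2, patch, patch // 2)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
    # Split every patch into 64x64 sub-patches, again with 50% overlap.
    subs = patches.unfold(2, sub, sub // 2).unfold(3, sub, sub // 2)
    subs = subs.permute(0, 2, 3, 1, 4, 5).reshape(patches.size(0), -1, c, sub, sub)
    return subs
```

For the grayscale setting used in the paper (C = 1), a 768×768 image yields 5×5 = 25 overlapping patches, each containing 7×7 = 49 sub-patches.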

We use a ResNet-18 v2 (He et al., 2016) up until the third residual block as encoder $g_\text{enc}$. We train a separate model from scratch for each class with a batch size of 16 for 150 epochs using the Adam optimizer (Kingma & Ba, 2015) with a fixed learning rate. As proposed by Oord et al. (2019), we train and evaluate the model on grayscale images. For both training and evaluation we use 16 negative samples in the InfoNCE loss. To increase the accuracy of the InfoNCE loss as an indicator for anomalous patches, we apply four separate models in four different directions – predicting patches using context from above, below, left and right – and combine their losses in the test-phase.
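A rough sketch of such an encoder, using torchvision's standard ResNet-18 truncated after the third residual block as a stand-in for the ResNet-18 v2 used in the paper (the grayscale input adaptation, pooling and optimizer setup are our assumptions; the four directional models and prediction heads are omitted):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SubPatchEncoder(nn.Module):
    """Encodes a 64x64 grayscale sub-patch into a single feature vector."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)  # trained from scratch, no pre-training
        # Adapt the first convolution to single-channel (grayscale) input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,  # stop after block 3
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):               # x: (B, 1, 64, 64)
        h = self.features(x)            # (B, 256, 4, 4)
        return self.pool(h).flatten(1)  # (B, 256)

encoder = SubPatchEncoder()
optimizer = torch.optim.Adam(encoder.parameters())  # learning rate omitted here
```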

5.1 Anomaly Detection

To evaluate our model’s performance for detecting anomalies, we average the top-$k$ InfoNCE loss values across all sub-patches within an image and use this value to calculate the AUROC score. In Table 1, we compare against previously published works from peer-reviewed venues that do not make use of pre-trained feature extractors. We find that our proposed CPC-AD model substantially improves upon a kernel density estimation model (KDE) and an autoencoding model (Auto) as presented in Kauffmann et al. (2020). We also improve upon the contrastive learning approach combined with a KDE model (avg. AUROC: 0.865) as proposed by Sohn et al. (2021). The performance of our model lags behind the CutPaste model (Li et al., 2021a). However, we argue that CPC-AD provides a more generally applicable approach for anomaly detection. The CutPaste model relies heavily on randomly sampled artificial anomalies that are designed to resemble the anomalies encountered in the dataset. As a result, it is not applicable to a $k$-classes-out task, where anomalies differ semantically from the normal data. For comparison, the current state-of-the-art model on this dataset, which makes use of a pre-trained feature extractor, achieves 0.979 AUROC averaged across all classes (Defard et al., 2020).

Class KDE Auto CutPaste CPC-AD
Bottle 0.833 0.950 0.983 0.998
Cable 0.669 0.573 0.806 0.880
Capsule 0.562 0.525 0.962 0.641
Carpet 0.348 0.368 0.931 0.809
Grid 0.717 0.746 0.999 0.983
Hazelnut 0.699 0.905 0.973 0.996
Leather 0.415 0.640 1.000 0.990
Metal nut 0.333 0.455 0.993 0.845
Pill 0.691 0.760 0.924 0.921
Screw 0.369 0.779 0.863 0.897
Tile 0.689 0.518 0.934 0.957
Toothbrush 0.933 0.494 0.983 0.878
Transistor 0.724 0.512 0.955 0.925
Wood 0.947 0.885 0.986 0.803
Zipper 0.614 0.350 0.994 0.993
Mean 0.636 0.631 0.952 0.901
Table 1: Anomaly detection AUROC score on the MVTec-AD test-set per category. We find that the proposed CPC-AD approach substantially outperforms the kernel density estimation model (KDE) and the autoencoding model (Auto) presented by Kauffmann et al. (2020). It is outperformed by the CutPaste model (Li et al., 2021a), which relies heavily on dataset-specific augmentations for its training.
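The detection numbers above are AUROC values computed from one score per image; a minimal sketch of that evaluation step is given below (the top-k value is our assumption, as it is not specified here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_auroc(per_image_losses, labels, k=5):
    """Image-level detection AUROC from per-sub-patch InfoNCE losses.

    per_image_losses: list of 1D arrays, one array of sub-patch losses per image
    labels:           1 for anomalous images, 0 for normal ones
    k:                number of highest losses to average per image (assumed)
    """
    scores = [np.sort(np.asarray(losses))[-k:].mean() for losses in per_image_losses]
    return roc_auc_score(labels, scores)
```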

5.2 Anomaly Segmentation

For the evaluation of the proposed CPC-AD model’s anomaly segmentation performance, we up-sample the sub-patch-wise InfoNCE loss values to match the pixel-wise ground truth annotations. To do so, we average the InfoNCE losses of overlapping sub-patches and assign the resulting values to all affected pixels. This allows us to create anomaly segmentation masks at the resolution of half a sub-patch (32×32 pixels) that are of the same dimensions as the resized images (768×768).
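A sketch of this up-sampling step, assuming the sub-patch losses have already been arranged on a single 50%-overlap grid covering the full 768×768 image (the handling of the overlapping 256×256 patches is simplified away):

```python
import numpy as np

def losses_to_pixel_map(sub_patch_losses, image_size=768, sub=64):
    """Spread per-sub-patch losses over the pixels they cover and average
    wherever sub-patches overlap, yielding a pixel-wise anomaly map.

    sub_patch_losses: 2D array of losses on the 50%-overlap sub-patch grid
    """
    stride = sub // 2
    acc = np.zeros((image_size, image_size), dtype=np.float32)
    count = np.zeros((image_size, image_size), dtype=np.float32)
    rows, cols = sub_patch_losses.shape
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            acc[y:y + sub, x:x + sub] += sub_patch_losses[i, j]
            count[y:y + sub, x:x + sub] += 1.0
    return acc / np.maximum(count, 1.0)   # same dimensions as the resized image
```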

In Table 2 in the Appendix, we compare the anomaly segmentation performance of the proposed CPC-AD method against previously published works from peer-reviewed venues. The best results on the MVTec-AD dataset are achieved with extensive models that are pre-trained on ImageNet, such as FCDD and PaDiM (Liznerski et al., 2021; Defard et al., 2020), or make use of additional artificial anomalies and ensemble methods, such as CutPaste (Li et al., 2021a). Our model is trained from scratch and uses merely the provided training data, making for a less complex and more general method. The proposed CPC-AD approach is further outperformed by one autoencoding approach (AE-SSIM) and a partially contrastive approach (DistAug), but is on par with another autoencoding approach (AE-L2). Our proposed method outperforms the GAN-based approach (AnoGAN) (Bergmann et al., 2019; Schlegl et al., 2017). Interestingly, the CPC-AD model scores relatively well on textures compared to similar models.

Nonetheless, although the quantitative results achieved with CPC-AD are not state-of-the-art, the model succeeds in generating accurate segmentation masks for most classes (Fig. 2). Even for classes with a low pixel-wise AUROC score, such as pill, the created segmentation masks correctly highlight anomalous input regions, although there is some background noise. This corresponds with the comparatively high detection performance that the CPC-AD method achieves for this class (Table 1). These results indicate that part of the low segmentation scores (compared to the detection scores) could be due to small spatial deviations from the ground truth. This effect might be exacerbated by the relatively low resolution of the segmentation masks that our patch-wise approach creates. Still, we argue that this resolution would be sufficient in practice to provide interpretable results for human inspection. Overall, CPC-AD provides a promising first step towards anomaly segmentation methods that are based on contrastive learning.

6 Conclusion

Overall, the CPC-AD model shows that contrastive learning can be used not just for anomaly detection, but also for anomaly segmentation. The proposed method performs well on the anomaly detection task, with competitive results for a majority of the data. Additionally, the generated segmentation masks provide a promising first step towards anomaly segmentation methods that are based on contrastive losses.

References

Appendix A Additional Results

A.1 Anomaly Segmentation

In Table 2, we compare the anomaly segmentation performance of the proposed CPC-AD method against previously published works from peer-reviewed venues.