Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation

06/29/2020 ∙ by Jihun Yi, et al. ∙ 0

In this paper, we tackle the problem of image anomaly detection and segmentation. Anomaly detection makes a binary decision as to whether an input image contains an anomaly, and anomaly segmentation aims to locate the defect at the pixel level. SVDD is a long-standing algorithm for anomaly detection. We extend its deep learning variant to the patch level using self-supervised learning. The extension enables anomaly segmentation and improves detection performance as well. As a result, we achieve state-of-the-art performance on a standard industrial dataset, MVTec AD. A detailed analysis of the proposed method offers useful insights into its behavior.




1 Introduction

Anomaly detection is a binary classification problem that determines whether an input contains an anomaly. Detecting anomalies is a critical and long-standing problem in manufacturing industries and financial companies. Typically, anomaly detection is formulated as one-class classification, because abnormal examples are either inaccessible at training time or too few to model their distribution, given its huge diversity. When dealing with an image, the detected anomalies can also be localized; anomaly segmentation is the problem of localizing the anomalies at the pixel level. In this paper, we consider the problem of image anomaly detection and segmentation.

One-class SVM (OC-SVM) [ocsvm] and support vector data description (SVDD) [svdd] are classic algorithms for one-class classification. Given a kernel function, OC-SVM seeks a hyperplane that separates the data from the origin, and SVDD seeks a data-enclosing hypersphere, both in the kernel space. They are closely related, and Vert et al. [ocsvm_consistency] showed their equivalence under a Gaussian kernel. Ruff et al. [deepSVDD] proposed a deep learning variant, Deep SVDD, by deploying a deep neural network in place of the kernel function. The neural network is trained to extract a favorable representation from high-dimensional and structured data, which saves the effort of choosing an appropriate kernel function by hand. Furthermore, Ruff et al. [deep_sad] re-interpreted Deep SVDD from an information-theoretic perspective and applied it to semi-supervised scenarios.

In this paper, we extend Deep SVDD to a patch-wise detection method, thereby proposing Patch SVDD. Due to the relatively high level of intra-class variation among patches, the extension is not trivial, and self-supervised learning facilitates it. The proposed method enables anomaly segmentation and improves the anomaly detection performance. Fig 1 shows an example of anomaly localization by the proposed method. Besides, some previous research [patch_location, lens] reported that the features of an untrained encoder can be useful for distinguishing anomalies. We describe the behavior of untrained encoders in more detail and investigate why they work.

Figure 1: The proposed method localizes the anomaly in an image. Patch SVDD performs multi-scale inspection and aggregates the results. The resulting anomaly map pinpoints the defect (contoured in red line) in the image.

2 Background

2.1 Anomaly detection and segmentation

2.1.1 Problem formulation

Anomaly detection makes a binary decision as to whether an input is an anomaly. The definition of an anomaly ranges from a tiny defect to an out-of-distribution image. We focus on detecting a defect in an image. A typical detection method trains a scoring function, $A_\theta$, which measures the abnormality of each input. At test time, the inputs with high $A_\theta(x)$ are deemed to be anomalies. A de facto standard metric for the scoring function is AUROC, as written in Eq 1 [auroc].

$$\mathrm{AUROC}[A_\theta] = \mathbb{P}\left[A_\theta(X_{\text{normal}}) < A_\theta(X_{\text{abnormal}})\right] \tag{1}$$

Therefore, a good scoring function is one that assigns low anomaly scores to normal data and high anomaly scores to abnormal data. The anomaly segmentation problem is formulated similarly. It generates an anomaly score for every pixel (i.e., an anomaly map) and measures the AUROC over the pixels.
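The probabilistic reading of AUROC in Eq 1 can be illustrated with a small sketch. It estimates the probability that a normal input receives a lower anomaly score than an abnormal one by counting correctly ordered pairs. The function name `auroc`, the half-credit rule for ties, and the toy scores are assumptions made for this illustration, not part of the paper.

```python
from itertools import product


def auroc(normal_scores, abnormal_scores):
    """Estimate P[score(normal) < score(abnormal)] over all pairs.

    Ties are counted as half a correct ordering (a common convention,
    assumed here).
    """
    pairs = list(product(normal_scores, abnormal_scores))
    correct = sum(1.0 if n < a else 0.5 if n == a else 0.0 for n, a in pairs)
    return correct / len(pairs)


# Toy anomaly scores, invented for illustration only.
normal = [0.1, 0.2, 0.35]
abnormal = [0.3, 0.8]
print(auroc(normal, abnormal))  # 5 of the 6 pairs are ordered correctly
```

The same function applies to the segmentation setting if the per-pixel scores of normal and defective pixels are passed in place of per-image scores.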

2.1.2 Auto encoder-based methods

Early deep learning approaches for anomaly detection used auto encoders [vae_ad, recon_and_detect, ocgan]. The auto encoders are trained with the normal training data, so they do not reconstruct abnormal images accurately. Therefore, the difference between the reconstruction and the input provides a distinguishable signal of abnormality. Further variants have been proposed that utilize the structural similarity index [ssim_ae], adversarial training [recon_and_detect], negative mining [ocgan], and iterative projection [iterative_project]. Some previous work utilized the learned latent feature of the auto encoder for anomaly detection. Akcay et al. [ganomaly] defined the reconstruction of the latent feature as an anomaly score, and Yarlagadda et al. [satellite] trained OC-SVM [ocsvm] on top of the latent features. More recently, several methods have made use of signals other than the reconstruction loss, such as a restoration loss [itae] and an attention map [ve_vae].

Figure 2: Comparison of Deep SVDD [deepSVDD] and the proposed method. Patch SVDD performs inspection for every patch so that it can localize the defect. Furthermore, the representations are not necessarily uni-modal, owing to the self-supervised learning. The multi-modal representation enhances the anomaly detection capability.

2.1.3 Classifier-based methods

Starting from Golan et al. [geom], discriminative approaches have been proposed for anomaly detection. They exploit the observation that classifiers lose their confidence [oodbaseline] on abnormal input images. Given an unlabeled dataset, they train a classifier to predict synthetic labels. For example, Golan et al. [geom] randomly flip, rotate, and translate an image, and the classifier is trained to predict the particular transformation. If the classifier does not make a confident and correct prediction, the input image is deemed to be abnormal. Wang et al. [e3outlier] proved that such an approach can be extended to an unsupervised scenario, where the training data also contains a few anomalies. Bergman et al. [goad] adopted an open-set classification method and generalized the method to non-image data.

2.1.4 SVDD-based methods

SVDD [svdd] is a classic one-class classification algorithm. It maps all the normal training data into a predefined kernel space and seeks the smallest enclosing hypersphere in that space. The outliers are expected to lie outside the learned hypersphere. Since the kernel function determines the kernel space, the training procedure merely decides the radius and center of the hypersphere.

Ruff et al. [deepSVDD] improved this approach using a deep neural network. They adopted a neural network in place of the kernel function and trained it along with the radius and center of the hypersphere. This modification lets the encoder learn a data-dependent transformation, which enhances the detection performance on high-dimensional and structured data. Ruff et al. [deep_sad] further applied the method to the semi-supervised scenario, thereby proposing Deep SAD.

2.2 Self-supervised representation learning

Learning a representation of an image is a core problem of computer vision. A series of methods have been proposed to learn a representation of data without annotation by training on a pretext task. They obtain the learning signal from a self-labeled task. When a network is trained to solve the pretext task well, the network is expected to extract useful features. The pretext tasks include predicting relative patch location [patch_location], solving jigsaw puzzles [jigsaw], colorizing images [colorization], counting objects [count], and predicting rotations [rotation].

3 Methods

3.1 Patch-wise Deep SVDD

Deep SVDD [deepSVDD] learns an encoder that maps the whole training data to a small hypersphere in the representation space. At test time, the distance of the representation of the input from the center of the hypersphere is used as an anomaly score. The encoder, $f_\theta$, is trained using the following loss function:

$$\mathcal{L}_{\text{SVDD}} = \sum_{i} \left\| f_\theta(x_i) - c \right\|_2 \tag{2}$$

The center $c$ is calculated in advance of the training as in Eq 3, where $N$ denotes the number of training examples.

$$c \doteq \frac{1}{N} \sum_{i=1}^{N} f_\theta(x_i) \tag{3}$$

Therefore, the training pushes the features toward a single center. In this paper, we extend this approach to patches; the encoder transforms each patch, not the entire image, as in Fig 2. Patch-wise inspection has several advantages. First, the inspection is performed at each position, and hence we can localize the position of the defects. Moreover, such fine-grained examination improves the overall detection performance.
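As a concrete reading of the Deep SVDD objective described above, the following sketch computes the center as the mean of the features and the loss as the sum of L2 distances to that center. It operates on precomputed toy feature vectors rather than on a trained network; the helper names (`mean_vector`, `l2`, `svdd_loss`) and the numbers are assumptions for illustration only.

```python
import math


def mean_vector(features):
    """Center of the hypersphere: the mean of the features (Eq 3)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def l2(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def svdd_loss(features, center):
    """Sum of distances from every feature to the fixed center (Eq 2)."""
    return sum(l2(f, center) for f in features)


# Toy encoder outputs standing in for the features of three training inputs.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
center = mean_vector(features)
loss = svdd_loss(features, center)

# At test time, the distance of a query feature to the center is its anomaly score.
query_score = l2([4.0, 5.0], center)
print(center, loss, query_score)
```

In actual training, the center would be computed once and then held fixed while the encoder parameters are updated to reduce the loss.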

However, the extension is not trivial. Unlike images, patches contain distinct contents and have a high intra-class variation; some patches correspond to the background, while the others contain the object. Accordingly, mapping all the representations of dissimilar patches to a uni-modal cluster weakens their connections to the corresponding contents. Therefore, using a single center is inappropriate, yet deciding the number of multiple centers and the allocation of the patches to each center is cumbersome.

To bypass the above issues, we neither define the centers explicitly nor allocate the patches to them. Instead, we train the encoder to cluster semantically similar patches by itself. Here, semantically similar patches are obtained by sampling spatially adjacent patches, and the encoder is trained to minimize the distances between their features, using the following loss function:

$$\mathcal{L}_{\text{SVDD'}} = \sum_{i, i'} \left\| f_\theta(p_i) - f_\theta(p_{i'}) \right\|_2 \tag{4}$$

where $p_{i'}$ is a patch near $p_i$.

Figure 3: Self-supervision task. The encoder is trained to extract informative features to correctly predict the relative position of the patches.

Optimizing Eq 4 alone brings the representations of all patches together, and the consequence may be a collapse of the representation into a single cluster. To make the representation capture the semantics of the patch, we adopt the following self-supervised learning.
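The collapse risk mentioned above can be made concrete with a short sketch. It evaluates the neighbor-pair loss of Eq 4 for two hypothetical encoders: an identity map and a degenerate encoder that sends every patch to the same point. The degenerate encoder attains zero loss although its features carry no information, which is why an additional learning signal is needed. The encoders, patch values, and function names are illustrative assumptions.

```python
import math


def l2(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def svdd_prime_loss(encoder, patch_pairs):
    """Sum of feature distances over pairs of spatially adjacent patches."""
    return sum(l2(encoder(p), encoder(q)) for p, q in patch_pairs)


def identity_encoder(patch):
    """Hypothetical encoder that keeps the patch unchanged."""
    return patch


def collapsed_encoder(patch):
    """Hypothetical degenerate encoder: every patch maps to the origin."""
    return [0.0, 0.0]


# Toy pairs of neighboring patches, flattened to short vectors.
pairs = [([1.0, 2.0], [1.5, 2.5]), ([0.0, 4.0], [0.0, 3.0])]

print(svdd_prime_loss(identity_encoder, pairs))   # positive
print(svdd_prime_loss(collapsed_encoder, pairs))  # zero, yet uninformative
```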

3.2 Self-supervised learning

Doersch et al. [patch_location] trained a feature extractor and a classifier to predict the relative position of two patches from their representations, as depicted in Fig 3. A well-performing model implies that the trained encoder successfully extracts useful features for the location prediction. Beyond this particular task, previous research [jigsaw, rotation, revisit_ssl] reported that the self-supervised encoder functions as a powerful feature extractor for downstream tasks.

For a randomly sampled patch $p_1$, they sample another patch $p_2$ from one of its eight neighborhoods in a 3 × 3 grid. Letting the true relative position be $y \in \{0, \ldots, 7\}$, the classifier $C_\phi$ is trained to correctly predict $y = C_\phi(f_\theta(p_1), f_\theta(p_2))$. The size of the patch is the same as the receptive field size of the encoder. To prevent the classifier from exploiting shortcuts (e.g., color aberration), we randomly perturb the RGB channels of the patches. Following their approach, we add a self-supervised learning signal through the following loss term:

$$\mathcal{L}_{\text{SSL}} = \text{Cross-entropy}\left(y, C_\phi(f_\theta(p_1), f_\theta(p_2))\right) \tag{5}$$

As a result, the encoder is trained using a combination of the two losses with the scaling hyperparameter $\lambda$, as presented in Eq 6.

$$\mathcal{L}_{\text{Patch SVDD}} = \lambda \mathcal{L}_{\text{SVDD'}} + \mathcal{L}_{\text{SSL}} \tag{6}$$
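The pretext task and the combined objective can be sketched as follows. The code samples the top-left corner of a patch and of one of its eight neighbors in a 3 × 3 grid, returns the relative-position label, and combines the two loss values with a scaling factor. The ordering of the eight positions, the function names, and the omission of the RGB perturbation are simplifying assumptions of this sketch; the encoder and the classifier are not modeled.

```python
import math
import random

# Offsets (row, col) of the eight neighbors in a 3x3 grid. The list index is
# the label y; this particular ordering is an arbitrary choice for the sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_pair(height, width, patch_size, rng):
    """Return the top-left corners of p1 and p2 and the label y."""
    k = patch_size
    # Keep p1 far enough from the border that every neighbor fits in the image.
    r1 = rng.randrange(k, height - 2 * k + 1)
    c1 = rng.randrange(k, width - 2 * k + 1)
    y = rng.randrange(8)
    dr, dc = NEIGHBOR_OFFSETS[y]
    return (r1, c1), (r1 + dr * k, c1 + dc * k), y


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def patch_svdd_loss(loss_svdd_prime, loss_ssl, lam):
    """Combined objective: lam * L_SVDD' + L_SSL."""
    return lam * loss_svdd_prime + loss_ssl


rng = random.Random(0)
p1, p2, y = sample_pair(256, 256, 32, rng)
uniform_logits = [0.0] * 8  # an untrained classifier that guesses uniformly
print(p1, p2, y, cross_entropy(uniform_logits, y))
```

Setting `lam` to zero leaves only the self-supervised term, which corresponds to training with the relative-position task alone.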

Figure 4: Hierarchical encoding. Patches are hierarchically encoded.

3.3 Hierarchical encoding

Since anomalies vary in size, deploying multiple encoders with various receptive fields helps deal with the variety in size. The experimental results in Section 4.3.2 show that enforcing a hierarchical structure on the encoder boosts the anomaly detection performance. Therefore, we compose the encoders hierarchically so that the multiple encoders constitute a single large encoder, as in Fig 4. More concretely, the encoder with the bigger receptive field is defined as

$$f_{\text{big}}(p) = g_{\text{big}}(f_{\text{small}}(p)) \tag{7}$$

where $g_{\text{big}}$ operates on the outputs of the small encoder $f_{\text{small}}$.
Each encoder with receptive field size $K$ is trained with the self-supervised task of patch size $K$. Throughout the experiments, the receptive field sizes of the big and small encoders are 64 and 32, respectively.

3.4 Generating anomaly maps

After training the encoders as discussed in the previous sections, the representations from the encoders are used to detect anomalies. First, the representation of every normal training patch, $\{f_\theta(p_{\text{normal}}) \mid p_{\text{normal}}\}$, is calculated and stored. Then, given a query image $x$, for every patch $p$ extracted from $x$ with a stride $S$, the L2 distance to the nearest normal patch in the feature space is defined to be its anomaly score, as in Eq 8. To mitigate the computational cost of the nearest neighbor (NN) search, we adopted an approximate NN algorithm; the inspection of a single image of MVTec AD [mvtecad] takes about 0.48 seconds.

$$A^{\text{patch}}_\theta(p) \doteq \min_{p_{\text{normal}}} \left\| f_\theta(p) - f_\theta(p_{\text{normal}}) \right\|_2 \tag{8}$$
The patch-wise anomaly scores are then distributed to the pixels: each pixel receives the average anomaly score of all the patches to which it belongs. We denote the resulting anomaly map as $\mathcal{M}$.
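The two steps described above, scoring each patch by its distance to the nearest normal feature and then averaging the patch scores over the pixels they cover, can be sketched as below. The sketch uses an exhaustive nearest neighbor search instead of the approximate search mentioned in the text, and it works on toy features for a 3 × 3 image with 2 × 2 patches; all names and values are assumptions for illustration.

```python
import math


def l2(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def patch_score(feature, normal_features):
    """Distance to the nearest stored normal feature (exhaustive search)."""
    return min(l2(feature, g) for g in normal_features)


def anomaly_map(patch_scores, height, width, patch_size):
    """Average, per pixel, the scores of all patches covering that pixel.

    patch_scores maps the top-left corner (row, col) of a patch to its score.
    Pixels covered by no patch keep a score of 0.0.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (r, c), score in patch_scores.items():
        for i in range(r, r + patch_size):
            for j in range(c, c + patch_size):
                total[i][j] += score
                count[i][j] += 1
    return [[total[i][j] / count[i][j] if count[i][j] else 0.0
             for j in range(width)] for i in range(height)]


# Stored features of normal training patches (toy values).
normal_bank = [[0.0, 0.0], [0.0, 1.0]]
# Features of the query patches, keyed by top-left corner; stride is 1.
query_features = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0],
                  (1, 0): [0.0, 0.0], (1, 1): [4.0, 0.0]}

scores = {pos: patch_score(f, normal_bank) for pos, f in query_features.items()}
amap = anomaly_map(scores, height=3, width=3, patch_size=2)
for row in amap:
    print(row)
```

Only the bottom-right patch is far from the normal features, so the map peaks at the bottom-right pixel and fades toward pixels that are shared with normal-looking patches.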

Figure 5: Overall flow of the proposed method. For a given test image, Patch SVDD extracts patches of size $K$ with stride $S$ and extracts their features using the trained encoder. The L2 distance to the nearest normal patch in the feature space is defined to be the anomaly score of each patch.

The multiple encoders in Section 3.3 constitute multiple feature spaces, thereby yielding multiple anomaly maps. We aggregate the multiple maps by element-wise multiplication (Eq 9), and the resulting anomaly map, $\mathcal{M}_{\text{multi}}$, is the answer to the anomaly segmentation problem.

$$\mathcal{M}_{\text{multi}} \doteq \mathcal{M}_{\text{small}} \odot \mathcal{M}_{\text{big}} \tag{9}$$

The answer to the anomaly detection problem is straightforward. The maximum anomaly score of the pixels in the image is its anomaly score, as in Eq 10. Fig 5 illustrates the overall flow of the proposed method.

$$A^{\text{image}}_\theta(x) \doteq \max_{i, j} \mathcal{M}_{\text{multi}}(x)_{ij} \tag{10}$$
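The aggregation and the image-level score described above amount to an element-wise product followed by a maximum. A minimal sketch, with two invented 2 × 2 anomaly maps standing in for the outputs of the small and big encoders:

```python
def aggregate(map_small, map_big):
    """Element-wise product of two anomaly maps of equal shape."""
    return [[a * b for a, b in zip(row_small, row_big)]
            for row_small, row_big in zip(map_small, map_big)]


def image_score(anomaly_map):
    """Image-level anomaly score: the maximum pixel score in the map."""
    return max(max(row) for row in anomaly_map)


# Toy maps: the small-scale map is noisy overall, the big-scale map is coarse.
map_small = [[0.5, 0.5],
             [0.5, 2.0]]
map_big = [[0.1, 0.1],
           [1.0, 1.0]]

map_multi = aggregate(map_small, map_big)
print(map_multi)
print(image_score(map_multi))
```

The product suppresses pixels that only one scale considers suspicious, so the surviving peak marks the location on which both scales agree.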

Figure 6: Anomaly maps generated by the proposed method. Patch SVDD generated anomaly maps of each image in fifteen classes of MVTec AD [mvtecad] dataset. The ground truth segmentation annotations are depicted as red contours in the image, and the darker heatmap indicates higher anomaly score.
Task: Anomaly Detection
Method                  AUROC
Deep SVDD [deepSVDD]    0.592
GEOM [geom]             0.672
GANomaly [ganomaly]     0.762
ITAE [itae]             0.839
Patch-SVDD (Ours)       0.920

Task: Anomaly Segmentation
Method                  AUROC
L2-AE                   0.804
SSIM-AE                 0.818
[ve_vae]                0.861
[iterative_project]     0.893
Patch-SVDD (Ours)       0.957
Table 1: Anomaly detection (left) and segmentation (right) results on MVTec AD [mvtecad] dataset.

4 Results and Discussion

To verify the validity of the proposed method, we applied it to the MVTec AD [mvtecad] dataset. The dataset consists of 15 classes of industrial images, each class categorized as either object or texture. The ten object classes contain regularly positioned objects, while the texture classes are full of repetitive patterns. Each class contains 60 to 390 normal training images and 40 to 167 test images. Test images include both normal and abnormal examples, and the defects of the abnormal images are annotated at the pixel level. We downsampled every image to a resolution of 256 × 256, and we refer to [mvtecad] for more details about the dataset.

Figure 7: t-SNE of learned features. The color and size of each point represent the position ($\theta$ and $r$ of the polar coordinates) within an image (b). From the color and size of a feature point in (a) and (c), we can infer the position of the corresponding patch within the image.

4.1 Anomaly detection and segmentation results

Fig 6 shows the anomaly maps generated by the proposed method. Defects in both object classes and texture classes are properly localized by the anomaly maps. Table 1 shows the detection and segmentation performance in AUROC on the MVTec AD dataset compared with state-of-the-art baselines. Patch SVDD achieves a new state-of-the-art performance over the powerful baselines and beats Deep SVDD [deepSVDD] by a large margin.

4.2 Detailed analysis

4.2.1 t-SNE visualization

Fig 7 shows the t-SNE visualization [tsne] of the learned features from multiple images. Patches located at the colored points in Fig 7(b) are mapped to the points with the same color and size in Fig 7(a) and Fig 7(c). Fig 7(a) clearly shows that the points with similar color and size form clusters in the feature space. Since the objects in the images of the cable class are regularly positioned, the patches from the same position have similar contents even if they come from different images. On the other hand, the features of the leather class in Fig 7(c) show the opposite tendency. This is because the patches in texture classes are analogous regardless of their position in the image.

4.2.2 Random encoder


              Random Encoder     Raw Patch          AE
Classes       Det.     Seg.      Det.     Seg.      Seg.
bottle        0.888    0.799     0.924    0.818     0.909
cable         0.564    0.703     0.572    0.839     0.732
capsule       0.672    0.925     0.666    0.937     0.786
carpet        0.416    0.655     0.480    0.735     0.539
grid          0.754    0.609     0.830    0.756     0.960
hazelnut      0.831    0.931     0.826    0.951     0.976
leather       0.687    0.851     0.687    0.875     0.751
metal_nut     0.370    0.787     0.521    0.888     0.880
pill          0.723    0.915     0.767    0.917     0.885
screw         0.694    0.931     0.562    0.927     0.979
tile          0.677    0.539     0.701    0.508     0.476
toothbrush    0.808    0.952     0.919    0.970     0.971
transistor    0.534    0.876     0.675    0.917     0.906
wood          0.902    0.737     0.944    0.784     0.630
zipper        0.700    0.837     0.938    0.919     0.680


Table 2: The nearest neighbor algorithm using random encoders and raw patches. For some classes, the nearest neighbor algorithm using the features from a random encoder can detect anomalies very well. For those classes, using the raw patches yields high performance as well.

Doersch et al. [patch_location] showed that randomly initialized encoders perform quite well in image retrieval; given an image, its nearest images in the random feature space look semantically similar to humans as well. Inspired by this observation, we examined the anomaly detection performance of random encoders and provide the results in Table 2. As in Eq 8, the anomaly score is defined to be the distance to the nearest normal patch, but in the random feature space. Surprisingly, for some classes, the features of the random encoder are very effective at distinguishing between normal and abnormal inputs. Some results even outperform the trained deep neural network model (AE).

Here, we investigate the reason for the high separability of the random features. For simplicity, let us assume the encoder to be a single convolutional layer parametrized by $W$ and followed by a nonlinearity, $\sigma$. Given two patches $p_1$ and $p_2$, their features $h_1$ and $h_2$ are given as Eq 11, where $\circledast$ denotes the convolution operation.

$$h_1 = \sigma(W \circledast p_1), \qquad h_2 = \sigma(W \circledast p_2) \tag{11}$$

As Eq 12 suggests, when the features are close, so are the patches, and vice versa.

$$\left\| h_1 - h_2 \right\|_2 \approx 0 \;\Leftrightarrow\; \left\| \sigma(W \circledast p_1) - \sigma(W \circledast p_2) \right\|_2 \approx 0 \;\Leftrightarrow\; \left\| p_1 - p_2 \right\|_2 \approx 0 \tag{12}$$

Therefore, retrieving the nearest patch in the feature space is analogous to doing so in the image space. In Table 2, we also provide the anomaly detection results using the nearest patch distance in the raw patch space (i.e., $f_\theta(p) = p$ in Eq 8). The classes that are well separated by the random encoder are also well separated by the raw patch nearest neighbor algorithm, and vice versa. This observation coincides with the conclusion of Eq 12. To summarize, the random features of anomalies are easily separable because they behave like the raw patches, and the raw patches are easily separable.
4.3 Ablation study

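Loss                                    Det.     Seg.
$\mathcal{L}_{\text{SVDD}}$             0.748    0.821
$\mathcal{L}_{\text{SVDD'}}$            0.756    0.871
$\mathcal{L}_{\text{Patch SVDD}}$       0.920    0.957
Table 3: The effect of the losses. Using $\mathcal{L}_{\text{Patch SVDD}}$ yields better performance than using $\mathcal{L}_{\text{SVDD}}$ or $\mathcal{L}_{\text{SVDD'}}$.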
Figure 8: The effects of the losses vary among classes.
Figure 9: Features of transistor class by the encoders trained with different losses.

4.3.1 Training losses

Patch SVDD trains an encoder with two losses: $\mathcal{L}_{\text{SVDD'}}$ and $\mathcal{L}_{\text{SSL}}$, where $\mathcal{L}_{\text{SVDD'}}$ is a variant of $\mathcal{L}_{\text{SVDD}}$. To compare the roles of the proposed loss terms, we conduct an ablation study. Table 3 suggests that modifying $\mathcal{L}_{\text{SVDD}}$ to $\mathcal{L}_{\text{SVDD'}}$ and adopting $\mathcal{L}_{\text{SSL}}$ both improve the anomaly detection performance. However, Fig 8 shows that the effect of the losses varies among classes. The texture classes (e.g., tile and wood) are insensitive to the choice of loss, while the object classes, including cable and transistor, benefit a lot from $\mathcal{L}_{\text{SSL}}$.

In Fig 9, we provide t-SNE visualizations of the features of an object class (transistor) when the encoders are trained with $\mathcal{L}_{\text{SVDD}}$, $\mathcal{L}_{\text{SVDD'}}$, and $\mathcal{L}_{\text{Patch SVDD}}$, respectively. In Fig 9(a, b), the resulting features form a single cluster. In contrast, in Fig 9(c), where $\mathcal{L}_{\text{SSL}}$ is additionally used, the features are clustered separately according to their semantics. This observation implies that self-supervised learning allows the representations to be not necessarily uni-modal, so that the features stay more informative about the corresponding patches. This effect is particularly important for object classes, in which the patches from an image have high variation. Since the features are clustered according to their contents, the use of $\mathcal{L}_{\text{SSL}}$ enables more accurate and deliberate anomaly inspection.

Figure 10: Intrinsic dimensions of the features under different losses.

The intrinsic dimensions [intrinsic_dimension] (ID) of the features also support the effectiveness of $\mathcal{L}_{\text{SSL}}$. The ID is the minimal number of coordinates needed to describe the points without significant information loss [intrinsic_dimension2]. A larger ID denotes that the points are spread in every direction, while a smaller ID indicates that the points lie on low-dimensional manifolds with high separability. In Fig 10, we provide the average IDs of the features in each class trained with the three different losses. Training the encoder with the proposed $\mathcal{L}_{\text{Patch SVDD}}$ yields features with the lowest ID, implying that they are neatly distributed.

Figure 11: Multi-scale inspection. Patch SVDD performs a multi-scale inspection and aggregates the results.

4.3.2 Hierarchical encoding

In Section 3.3, we proposed the use of hierarchical encoders. We provide an example of multi-scale inspection results and their aggregated anomaly map in Fig 11. The anomaly maps from various scales provide complementary inspection results. The encoder with the large receptive field coarsely locates the defect, and the one with the smaller receptive field refines the result. Since the smaller encoder deals with patches of higher variation, it assigns high anomaly scores overall. Therefore, an element-wise multiplication of the two maps localizes the accurate position of the defect. Fig 12 quantitatively shows that aggregating multi-scale results improves the inspection result.

Figure 12: The effect of hierarchical encoding. Aggregating the results from multi-scale inspection boosts the performance, and adopting hierarchical structure to the encoder is helpful as well.

An ablation study with a non-hierarchical encoder in Fig 12 shows that the hierarchical structure itself boosts the performance as well. We postulate that the specific structure of the encoder functions as a regularization for the feature extraction. Note that the non-hierarchical encoder has a similar number of parameters to its hierarchical counterpart.

4.3.3 Hyperparameter sensitivity

In Eq 6, the hyperparameter $\lambda$ balances $\mathcal{L}_{\text{SVDD'}}$ and $\mathcal{L}_{\text{SSL}}$. A large $\lambda$ puts more importance on squashing the features, while a small $\lambda$ promotes their informativeness. Interestingly, the most favorable value of $\lambda$ varies among the classes. Anomalies in the object classes are well detected under a smaller $\lambda$, while the texture classes prefer a larger one. Fig 13 shows an example; the anomaly detection performance on the cable class (object) improves as $\lambda$ decreases, while the wood class (texture) shows the opposite trend. This result coincides with Fig 8 because using $\mathcal{L}_{\text{SSL}}$ as a loss is equivalent to using $\mathcal{L}_{\text{Patch SVDD}}$ with $\lambda = 0$.

The number of feature dimensions, $D$, is another hyperparameter of the encoder. In Fig 14(a), we provide the anomaly inspection performance as $D$ varies. A larger $D$ improves the performance, and this trend has been discussed in the self-supervised learning literature [revisit_ssl]. Fig 14(b) shows that the ID of the resulting features grows as $D$ grows. The black dashed line is the $y = x$ graph, which is the upper bound of the ID. The average ID of the features among classes saturates from $D = 64$. Therefore, we chose $D = 64$ and used that value throughout the paper.

Figure 13: The effect of $\lambda$. The anomaly detection performances for the two classes show opposite tendencies.
Figure 14: The effect of the embedding dimension, $D$. A larger $D$ yields better inspection results (a) and larger intrinsic dimensions (b).

5 Conclusion

In this paper, we proposed Patch SVDD, a method for anomaly detection and segmentation. Unlike Deep SVDD [deepSVDD], we inspect the image at the patch level, and hence we can localize the defect as well. Moreover, the additional self-supervised learning improves the detection performance. As a result, the proposed method achieved a new state-of-the-art performance on the industrial anomaly detection dataset.

Due to their high-dimensional and structured nature, images have been featurized prior to the downstream tasks in previous work [jigsaw, colorization]. However, the results of our analysis suggest that the nearest neighbor algorithm on raw patches often discriminates anomalies surprisingly well. Moreover, since the distance in the random feature space is closely related to that in the raw image space, random features can be useful for distinguishing anomalies.