Anomaly detection is a binary classification problem: determining whether an input contains an anomaly. Detecting anomalies is a critical and long-standing problem in the manufacturing and financial industries. Typically, anomaly detection is formulated as one-class classification because abnormal examples are either inaccessible at training time or too few to model their distribution, given their huge diversity. When the input is an image, the detected anomalies can also be localized; anomaly segmentation is the problem of localizing anomalies at the pixel level. In this paper, we address both image anomaly detection and segmentation.
One-class SVM (OC-SVM) [ocsvm] and support vector data description (SVDD) [svdd] are classic algorithms for one-class classification. Given a kernel function, OC-SVM seeks a hyperplane separating the data from the origin in the kernel space, while SVDD seeks the smallest data-enclosing hypersphere there. The two are closely related, and Vert et al. [ocsvm_consistency] showed their equivalence under the Gaussian kernel. Ruff et al. [deepSVDD] proposed a deep learning variant, Deep SVDD, which deploys a deep neural network in place of the kernel function. The network is trained to extract a favorable representation from high-dimensional, structured data, removing the need to hand-pick an appropriate kernel function. Furthermore, Ruff et al. [deep_sad] re-interpreted Deep SVDD from an information-theoretic perspective and applied it to semi-supervised scenarios.
In this paper, we extend Deep SVDD to a patch-wise detection method, thereby proposing Patch SVDD. Due to the relatively high intra-class variation of patches, the extension is not trivial, and self-supervised learning facilitates it. The proposed method enables anomaly segmentation and improves anomaly detection performance. Fig 1 shows an example of anomaly localization by the proposed method. In addition, previous research [patch_location, lens] reported that the features of an untrained encoder can be useful for distinguishing anomalies. We examine the behavior of untrained encoders in more detail and investigate why their features are discriminative.
2.1 Anomaly detection and segmentation
2.1.1 Problem formulation
Anomaly detection is the task of making a binary decision on whether an input is anomalous. The definition of anomaly ranges from a tiny defect to an out-of-distribution image; we focus on detecting defects in images. A typical detection method trains a scoring function, $A(x)$, which measures the abnormality of each input. At test time, inputs with high $A(x)$ are deemed anomalous. The de facto standard metric for a scoring function is AUROC, as written in Eq 1 [auroc]:

$$\mathrm{AUROC}[A] = \Pr_{x_{\text{normal}},\, x_{\text{abnormal}}}\left[ A(x_{\text{normal}}) < A(x_{\text{abnormal}}) \right]. \tag{1}$$
Therefore, a good scoring function assigns low anomaly scores to normal data and high anomaly scores to abnormal data. The anomaly segmentation problem is formulated similarly: it generates an anomaly score for every pixel (i.e., an anomaly map), and AUROC is measured over the pixels.
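As a concrete illustration (not the paper's evaluation code), AUROC can be computed as the probability that a randomly chosen abnormal example receives a higher score than a randomly chosen normal one; the helper below is a minimal sketch of Eq 1:

```python
import numpy as np

def auroc(scores_normal, scores_abnormal):
    """AUROC = P(A(x_abnormal) > A(x_normal)), counting ties as 0.5."""
    s_n = np.asarray(scores_normal, dtype=float)
    s_a = np.asarray(scores_abnormal, dtype=float)
    # Compare every abnormal score against every normal score.
    greater = (s_a[:, None] > s_n[None, :]).sum()
    ties = (s_a[:, None] == s_n[None, :]).sum()
    return (greater + 0.5 * ties) / (len(s_a) * len(s_n))
```

A perfectly separating scoring function yields 1.0, and a fully inverted one yields 0.0.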
2.1.2 Auto encoder-based methods
Early deep learning approaches for anomaly detection used auto encoders [vae_ad, recon_and_detect, ocgan]. The auto encoders are trained with the normal training data, and they do not reconstruct abnormal images accurately. Therefore, the difference between the reconstruction and the input provides a distinguishable signal of abnormality. Further variants have been proposed to utilize structural similarity index [ssim_ae], adversarial training [recon_and_detect], negative mining [ocgan], and iterative projection [iterative_project]. Some previous work utilized the learned latent feature of the auto encoder for anomaly detection. Akcay et al. [ganomaly] defined the reconstruction of latent feature as an anomaly score, and Yarlagadda et al. [satellite] trained OC-SVM [ocsvm] on top of the latent features. More recently, several methods make use of other than the reconstruction loss, such as restoration loss [itae] and an attention map [ve_vae].
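The reconstruction-based scoring described above can be sketched as follows; `encode` and `decode` are hypothetical stand-ins for any trained auto encoder:

```python
import numpy as np

def reconstruction_anomaly_score(x, encode, decode):
    """Anomaly score = squared L2 distance between the input and its
    reconstruction; abnormal inputs reconstruct poorly."""
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))
```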
2.1.3 Classifier-based methods
Starting with Golan et al. [geom], discriminative approaches have been proposed for anomaly detection. They exploit the observation that classifiers lose confidence [oodbaseline] on abnormal input images. Given an unlabeled dataset, they train a classifier to predict synthetic labels. For example, Golan et al. [geom] randomly flip, rotate, and translate an image, and the classifier is trained to predict which transformation was applied. If the classifier does not make a confident and correct prediction, the input image is deemed abnormal. Wang et al. [e3outlier] proved that this approach extends to an unsupervised scenario, where the training data also contains a few anomalies. Bergman et al. [goad] adopted an open-set classification method and generalized the approach to non-image data.
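The scoring principle of these classifier-based methods can be sketched as below; `transforms` and `predict_proba` are hypothetical stand-ins for the transformation set and the trained classifier:

```python
import numpy as np

def transformation_anomaly_score(image, transforms, predict_proba):
    """Score an image by how confidently a classifier recovers which
    self-labeled transformation (e.g., a rotation) was applied.
    Low confidence on the correct label -> high anomaly score."""
    correct_probs = []
    for label, t in enumerate(transforms):
        probs = predict_proba(t(image))   # distribution over transform labels
        correct_probs.append(probs[label])
    return 1.0 - float(np.mean(correct_probs))
```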
2.1.4 SVDD-based methods
SVDD [svdd] is a classic one-class classification algorithm. It maps all the normal training data into a predefined kernel space and seeks the smallest hypersphere enclosing the data in that space. Outliers are expected to lie outside the learned hypersphere. Since the kernel function determines the kernel space, the training procedure amounts to deciding the radius and center of the hypersphere.
Ruff et al. [deepSVDD] improved this approach using a deep neural network. They adopted a neural network in place of the kernel function and trained it along with the radius and center of the hypersphere. This modification lets the encoder learn a data-dependent transformation, thereby enhancing detection performance on high-dimensional and structured data. Ruff et al. [deep_sad] further applied the method to the semi-supervised scenario, proposing Deep SAD.
2.2 Self-supervised representation learning
Learning a representation of an image is a core problem of computer vision. A series of methods have been proposed to learn representations without annotation by solving a pretext task, obtaining the learning signal from self-generated labels. When a network is trained to solve the pretext task well, it is expected to extract useful features. Pretext tasks range from predicting the relative patch location [patch_location] and solving jigsaw puzzles [jigsaw] to colorization [colorization], counting objects [count], and predicting rotations [rotation].
3.1 Patch-wise Deep SVDD
Deep SVDD [deepSVDD] learns an encoder that maps the whole training data to a small hypersphere in the representation space. At test time, the distance between the representation of the input and the center of the hypersphere is used as the anomaly score. The encoder, $f_\theta$, is trained using the following loss function:

$$\mathcal{L}_{\text{SVDD}} = \frac{1}{N} \sum_{i=1}^{N} \left\| f_\theta(x_i) - c \right\|_2^2. \tag{2}$$

The center $c$ is calculated in advance of the training as in Eq 3, where $N$ denotes the number of training examples:

$$c = \frac{1}{N} \sum_{i=1}^{N} f_\theta(x_i). \tag{3}$$
Therefore, the training pushes the features toward a single center. In this paper, we extend this approach to patches: the encoder transforms each patch, rather than the entire image, as in Fig 2. Patch-wise inspection has several advantages. First, the inspection is performed at each position, so we can localize the defects. Moreover, such fine-grained examination improves overall detection performance.
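As a minimal numpy sketch (the feature vectors stand in for the output of the trained encoder $f_\theta$), the Deep SVDD objective of Eqs 2-3 can be written as:

```python
import numpy as np

def svdd_center(features):
    """Eq 3: the center is the mean of the training representations."""
    return features.mean(axis=0)

def svdd_loss(features, center):
    """Eq 2: mean squared L2 distance of each representation to the center."""
    return float(np.mean(np.sum((features - center) ** 2, axis=1)))
```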
However, the extension is not trivial. Unlike whole images, patches have distinct contents and high intra-class variation; some patches correspond to the background, while others contain the object. Accordingly, mapping the representations of all these dissimilar patches to a single uni-modal cluster weakens their connection to the corresponding contents. Therefore, using a single center is inappropriate, yet deciding the number of centers and allocating patches to them is cumbersome.
To bypass the above issues, we do not explicitly define the centers or allocate the patches. Instead, we train the encoder to cluster semantically similar patches by itself. Here, semantically similar patches are obtained by sampling spatially adjacent patches, and the encoder is trained to minimize the distances between their features using the following loss function:

$$\mathcal{L}_{\text{SVDD}'} = \sum_{i,i'} \left\| f_\theta(p_i) - f_\theta(p_{i'}) \right\|_2^2, \tag{4}$$

where $p_{i'}$ is a patch near $p_i$.
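A minimal sketch of this objective, assuming a generic `encode` function and pre-sampled pairs of adjacent patches (both hypothetical stand-ins for the actual training pipeline):

```python
import numpy as np

def patch_svdd_loss(patch_pairs, encode):
    """Eq 4: sum of squared feature distances between spatially adjacent
    patches. `patch_pairs` is a list of (p_i, p_i_prime) arrays; `encode`
    maps a patch to its feature vector."""
    loss = 0.0
    for p, p_near in patch_pairs:
        diff = encode(p) - encode(p_near)
        loss += float(np.sum(diff ** 2))
    return loss
```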
Optimizing Eq 4 alone pulls the representations of all patches together, which may collapse them into a single cluster. To force the representations to capture the semantics of the patches, we adopt the following self-supervised learning.
3.2 Self-supervised learning
Doersch et al. [patch_location] trained a feature extractor and a classifier to predict the relative position of two patches from their representations, as depicted in Fig 3. A well-performing model implies that the trained encoder extracts features useful for the location prediction. Beyond this particular task, previous research [jigsaw, rotation, revisit_ssl] reported that self-supervised encoders function as powerful feature extractors for downstream tasks.
For a randomly sampled patch $p_1$, they sample another patch $p_2$ from one of its eight neighbors in a 3×3 grid. Letting the true relative position be $y \in \{0, \dots, 7\}$, the classifier $C$ is trained to predict $y$ correctly. The size of the patch equals the receptive field size of the encoder. To prevent the classifier from exploiting shortcuts (e.g., color aberration), we randomly perturb the RGB channels of the patches. Following their approach, we add a self-supervised learning signal via the following loss term:

$$\mathcal{L}_{\text{SSL}} = \text{CrossEntropy}\left( y,\, C\left( f_\theta(p_1), f_\theta(p_2) \right) \right). \tag{5}$$
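The label-generation step of this pretext task can be sketched as follows; the sampling ranges are illustrative, not the paper's exact data pipeline:

```python
import numpy as np

# Eight neighbor offsets in the 3x3 grid, indexed 0..7 (the self-supervised label).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_position_pair(image, patch_size, rng):
    """Sample an anchor patch and one of its eight neighbors; return
    (anchor, neighbor, label) for the relative-position pretext task."""
    h, w = image.shape[:2]
    i = rng.integers(patch_size, h - 2 * patch_size + 1)
    j = rng.integers(patch_size, w - 2 * patch_size + 1)
    label = int(rng.integers(8))
    di, dj = OFFSETS[label]
    anchor = image[i:i + patch_size, j:j + patch_size]
    ni, nj = i + di * patch_size, j + dj * patch_size
    neighbor = image[ni:ni + patch_size, nj:nj + patch_size]
    return anchor, neighbor, label
```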
As a result, the encoder is trained using the combination of the two losses with a scaling hyperparameter $\lambda$, as presented in Eq 6:

$$\mathcal{L}_{\text{Patch SVDD}} = \lambda \mathcal{L}_{\text{SVDD}'} + \mathcal{L}_{\text{SSL}}. \tag{6}$$
3.3 Hierarchical encoding
Since anomalies vary in size, deploying multiple encoders with various receptive fields helps deal with the variety in size. The experimental results in Section 4.3.2 show that enforcing a hierarchical structure on the encoder also boosts anomaly detection performance. We therefore compose the encoders hierarchically, so that multiple encoders constitute a single large encoder, as in Fig 4. More concretely, an input patch is divided into a 2×2 grid of sub-patches, and the encoder with the bigger receptive field is defined as

$$f_{\text{big}}(p) = g_{\text{big}}\left( f_{\text{small}}(p) \right), \tag{7}$$

where $f_{\text{small}}$ embeds each sub-patch and $g_{\text{big}}$ aggregates the resulting features.
Each encoder with receptive field size $K$ is trained with the self-supervised task of patch size $K$. Throughout the experiments, the receptive field sizes of the big and small encoders are 64 and 32, respectively.
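A minimal sketch of this composition, with toy stand-in networks (the real encoders are CNNs):

```python
import numpy as np

def make_hierarchical_encoder(f_small, g_big):
    """Compose a large-receptive-field encoder from a small one (Eq 7):
    split the patch into a 2x2 grid, embed each sub-patch with f_small,
    then aggregate the four embeddings with g_big."""
    def f_big(patch):
        h, w = patch.shape[:2]
        subs = [patch[i:i + h // 2, j:j + w // 2]
                for i in (0, h // 2) for j in (0, w // 2)]
        feats = np.stack([f_small(s) for s in subs])  # shape (4, d)
        return g_big(feats)
    return f_big
```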
3.4 Generating anomaly maps
After training the encoders as described in the previous sections, the representations from the encoders are used to detect anomalies. First, the representation of every normal training patch, $f_\theta(p_{\text{normal}})$, is calculated and stored. Then, given a query image, for every patch $p$ with stride $S$, the L2 distance to the nearest normal patch in the feature space is defined to be its anomaly score, as in Eq 8:

$$A_\theta(p) = \min_{p_{\text{normal}}} \left\| f_\theta(p) - f_\theta(p_{\text{normal}}) \right\|_2. \tag{8}$$

To mitigate the computational cost of the nearest neighbor (NN) search, we adopted an approximate NN algorithm (NGT, https://github.com/yahoojapan/NGT). The inspection of a single image from MVTec AD [mvtecad] takes about 0.48 seconds.
The patch-wise anomaly scores are then distributed to the pixels: each pixel receives the average anomaly score of every patch to which it belongs, and we denote the resulting anomaly map as $\mathcal{M}$.
The multiple encoders in Section 3.3 constitute multiple feature spaces, thereby yielding multiple anomaly maps. We aggregate the maps by element-wise multiplication, as in Eq 9, and the resulting anomaly map, $\mathcal{M}_{\text{multi}}$, is the answer to the anomaly segmentation problem:

$$\mathcal{M}_{\text{multi}} = \mathcal{M}_{\text{small}} \odot \mathcal{M}_{\text{big}}. \tag{9}$$
The answer to the anomaly detection problem follows directly: the maximum anomaly score over the pixels of an image is defined as its anomaly score, as in Eq 10:

$$A_\theta(x) = \max_{i,j} \mathcal{M}_{\text{multi}}(x)_{ij}. \tag{10}$$

Fig 5 illustrates the overall flow of the proposed method.
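A brute-force numpy sketch of Eqs 8-10 (the paper uses an approximate NN index instead of exhaustive search; patch extraction and encoding are assumed to have happened already):

```python
import numpy as np

def patch_anomaly_scores(query_feats, normal_feats):
    """Eq 8: for each query patch feature, the L2 distance to the
    nearest stored normal patch feature."""
    # (n_query, n_normal) pairwise distances, then min over normal patches.
    d = np.linalg.norm(query_feats[:, None, :] - normal_feats[None, :, :], axis=-1)
    return d.min(axis=1)

def aggregate_maps(map_small, map_big):
    """Eq 9: element-wise product of the multi-scale anomaly maps."""
    return map_small * map_big

def image_anomaly_score(anomaly_map):
    """Eq 10: the maximum pixel-wise anomaly score."""
    return float(anomaly_map.max())
```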
4 Results and Discussion
To verify the validity of the proposed method, we applied it to the MVTec AD [mvtecad] dataset. The dataset consists of 15 classes of industrial images, each categorized as either object or texture. The ten object classes contain regularly positioned objects, while the texture classes are full of repetitive patterns. Each class provides 60 to 390 normal training images and 40 to 167 test images. The test images include both normal and abnormal examples, and the defects in the abnormal images are annotated at the pixel level. We downsampled every image to a resolution of 256×256; we refer the reader to [mvtecad] for further details about the dataset.
4.1 Anomaly detection and segmentation results
Fig 6 shows anomaly maps generated by the proposed method. Defects in both the object and texture classes are properly localized. Table 1 compares the detection and segmentation performance in AUROC on the MVTec AD dataset with state-of-the-art baselines. Patch SVDD achieves new state-of-the-art performance over the powerful baselines and beats Deep SVDD [deepSVDD] by a large margin.
4.2 Detailed analysis
4.2.1 t-SNE visualization
Fig 7 shows t-SNE visualizations [tsne] of the learned features from multiple images. Patches located at the colored points in Fig 7(b) are mapped to the points with the same color and size in Fig 7(a) and Fig 7(c). Fig 7(a) clearly shows that points with similar color and size form clusters in the feature space. Since images in the cable class are regularly positioned, patches from the same position have similar content even when they come from different images. By contrast, the features of the leather class in Fig 7(c) show the opposite tendency, because patches in the texture classes are analogous regardless of their position in the image.
4.2.2 Random encoder
[Table 2: anomaly detection performance using the random encoder, the raw-patch nearest neighbor, and the auto encoder (AE).]
Doersch et al. [patch_location] showed that randomly initialized encoders perform quite well in image retrieval: given a query image, the nearest images in the random feature space look semantically similar to humans. Inspired by this observation, we examined the anomaly detection performance of random encoders and provide the results in Table 2. As in Eq 8, the anomaly score is defined as the distance to the nearest normal patch, but measured in the random feature space. Surprisingly, for some classes, the features of a random encoder are quite effective in distinguishing the normal from the abnormal, and some results even outperform the trained deep neural network model (AE).
Here, we investigate the reason for the high separability of the random features. For simplicity, let us assume the encoder to be a single convolutional layer parametrized by a weight $W$ and followed by a nonlinearity $\sigma$. Given two patches $p_1$ and $p_2$, their features are given as in Eq 11:

$$f(p_1) = \sigma(W p_1), \qquad f(p_2) = \sigma(W p_2). \tag{11}$$
As Eq 12 suggests, when the features are close, so are the patches, and vice versa. Therefore, retrieving the nearest patch in the feature space is analogous to doing so in the image space. In Table 2, we also provide the anomaly detection results using the nearest raw-patch distance (i.e., setting $f_\theta$ to the identity in Eq 8). Classes that are well separated by the random encoder are also well separated by the raw-patch nearest neighbor algorithm, and vice versa. This observation coincides with the conclusion drawn from Eq 12. To summarize, the random features of anomalies are easily separable because they resemble the raw patches, and the raw patches themselves are easily separable.
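The relationship between feature distance and patch distance can be sanity-checked numerically. The bound below, that a 1-Lipschitz nonlinearity composed with a linear map contracts distances by at most the largest singular value of $W$, is a standard fact used here for illustration, not necessarily the paper's exact Eq 12:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 27))     # random "untrained" one-layer encoder
relu = lambda z: np.maximum(z, 0.0)
f = lambda p: relu(W @ p)              # Eq 11 with sigma = ReLU

p1, p2 = rng.standard_normal(27), rng.standard_normal(27)
feat_dist = np.linalg.norm(f(p1) - f(p2))
patch_dist = np.linalg.norm(p1 - p2)

# ReLU is 1-Lipschitz, so the feature distance is bounded by the largest
# singular value of W times the patch distance.
sigma_max = np.linalg.svd(W, compute_uv=False)[0]
assert feat_dist <= sigma_max * patch_dist
```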
4.3 Ablation study
4.3.1 Training losses
Patch SVDD trains an encoder with two losses, $\mathcal{L}_{\text{SVDD}'}$ and $\mathcal{L}_{\text{SSL}}$, where $\mathcal{L}_{\text{SVDD}'}$ is a variant of $\mathcal{L}_{\text{SVDD}}$. To compare the roles of the proposed loss terms, we conducted an ablation study. Table 3 suggests that modifying $\mathcal{L}_{\text{SVDD}}$ to $\mathcal{L}_{\text{SVDD}'}$ and adopting $\mathcal{L}_{\text{SSL}}$ both improve anomaly detection performance. However, Fig 8 shows that the effect of the losses varies among classes. The texture classes (e.g., tile and wood) are insensitive to the choice of loss, while the object classes, including cable and transistor, benefit greatly from $\mathcal{L}_{\text{SSL}}$.
In Fig 9, we provide t-SNE visualizations of the features of an object class (transistor) when the encoder is trained with $\mathcal{L}_{\text{SVDD}}$, $\mathcal{L}_{\text{SVDD}'}$, and $\mathcal{L}_{\text{Patch SVDD}}$, respectively. In Fig 9(a, b), the resulting features form a single cluster, unlike in Fig 9(c), where $\mathcal{L}_{\text{SSL}}$ is additionally used and the features are clustered separately according to their semantics. This observation implies that the self-supervised learning keeps the representations from being uni-modal, so the features stay more informative about the corresponding patches. This effect is particularly important for object classes, in which the patches of an image have high variation. Since the features are clustered according to their contents, the use of $\mathcal{L}_{\text{SSL}}$ enables more accurate and deliberate anomaly inspection.
Intrinsic dimensions [intrinsic_dimension] (ID) of the features also support the effectiveness of $\mathcal{L}_{\text{SSL}}$. The ID is the minimal number of coordinates needed to describe the points without significant information loss [intrinsic_dimension2]. A larger ID denotes that the points are distributed in every direction, while a smaller ID indicates that the points lie on low-dimensional manifolds with high separability. In Fig 10, we provide the average IDs of the features in each class trained with the three different losses. Training the encoder with the proposed $\mathcal{L}_{\text{Patch SVDD}}$ yields the features with the lowest ID, implying that they are neatly distributed.
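As a rough illustration of the concept (this PCA-based proxy is simpler than the estimators used in the cited works), the intrinsic dimension of a point cloud can be approximated by the number of principal directions needed to retain most of the variance:

```python
import numpy as np

def pca_intrinsic_dimension(points, var_ratio=0.95):
    """Rough proxy for intrinsic dimension: the number of principal
    components needed to retain `var_ratio` of the total variance."""
    x = points - points.mean(axis=0)
    s = np.linalg.svd(x, compute_uv=False)   # singular values, descending
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, var_ratio) + 1)
```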
4.3.2 Hierarchical encoding
In Section 3.3, we proposed to make use of hierarchical encoders. We provide an example of the multi-scale inspection results and their aggregated anomaly map in Fig 11. The anomaly maps from the various scales provide complementary inspection results: the encoder with the large receptive field coarsely locates the defect, and the one with the smaller receptive field refines the result. Since the smaller encoder deals with patches of higher variation, it tends to assign high anomaly scores overall. Therefore, an element-wise multiplication of the two maps localizes the accurate position of the defect. Fig 12 quantitatively shows that aggregating the multi-scale results improves the inspection result.
An ablation study with a non-hierarchical encoder in Fig 12 shows that the hierarchical structure itself also boosts performance. We postulate that this specific structure functions as a regularizer for the feature extraction. Note that the non-hierarchical encoder has a similar number of parameters to its hierarchical counterpart.
4.3.3 Hyperparameter sensitivity
In Eq 6, the hyperparameter $\lambda$ balances $\mathcal{L}_{\text{SVDD}'}$ and $\mathcal{L}_{\text{SSL}}$. A large $\lambda$ puts more importance on squashing the features, while a small $\lambda$ promotes their informativeness. Interestingly, the most favorable value of $\lambda$ varies among the classes: anomalies in the object classes are better detected with a smaller $\lambda$, while the texture classes prefer a larger one. Fig 14 shows an example; the anomaly detection performance on the cable class (object) improves as $\lambda$ decreases, while the wood class (texture) shows the opposite trend. This result coincides with Fig 8, because using $\mathcal{L}_{\text{SVDD}'}$ alone as the loss is equivalent to using $\mathcal{L}_{\text{Patch SVDD}}$ with $\lambda = \infty$.
The number of feature dimensions, $D$, is another hyperparameter of the encoder. In Fig 14(a), we provide the anomaly inspection performance as $D$ varies. A larger $D$ improves performance, a trend that has also been discussed in the self-supervised learning literature [revisit_ssl]. Fig 14(b) depicts that the ID of the resulting features grows with $D$. The black dashed line is the graph of $\mathrm{ID} = D$, the upper bound of the ID. The average ID of the features among classes saturates around $D = 64$; we therefore chose $D = 64$ and used this value throughout the paper.
In this paper, we proposed Patch SVDD, a method for anomaly detection and segmentation. Unlike Deep SVDD [deepSVDD], we inspect the image at the patch level, and hence we can also localize defects. Moreover, the additional self-supervised learning improves detection performance. As a result, the proposed method achieved new state-of-the-art performance on the industrial anomaly detection dataset.
Due to their high-dimensional and structured nature, images have typically been featurized before the downstream task in previous work [jigsaw, colorization]. However, our analysis suggests that a nearest neighbor algorithm on raw patches often discriminates anomalies surprisingly well. Moreover, since distances in a random feature space are closely related to those in the raw image space, random features can also be useful for distinguishing anomalies.