Visual inspection is essential in many industrial manufacturing pipelines to ensure high production quality and increased cost effectiveness by quickly discarding defective parts. Since manual inspection by humans is slow, expensive, and error prone, the usage of fully automated computer vision systems is becoming increasingly popular. Supervised methods, where the system learns how to segment defective regions by training on both defective and non-defective samples, are commonly used. However, this involves a high amount of effort to generate labeled data and all possible defect types need to be known beforehand. Furthermore, in some production processes the scrap rate might be too small to produce a sufficient number of defective samples for training, especially for data-hungry deep learning models.
In this work, we focus on unsupervised defect segmentation for visual inspection. Our goal is to segment defective regions in images after having trained on exclusively non-defective samples. It has been shown that architectures based on convolutional neural networks (CNNs), such as autoencoders or generative adversarial networks (GANs), can be used for this task. We give a brief overview of such methods in Section 2. These models aim to reconstruct their inputs in the presence of certain constraints such as a bottleneck and thereby manage to capture the essence of high-dimensional data (e.g. images) in a lower-dimensional space. Thus, anomalies in the test data deviate from the training data manifold and the model fails to reproduce them. As a result, large reconstruction errors indicate the presence of defects. Typically, the error measure that is employed is a pixel-wise distance such as the $\ell^2$ distance. However, such measures yield high novelty scores in locations where the reconstruction is only slightly inaccurate, for example due to small localization imprecisions of edges. They also fail to detect structural differences between input and reconstructed images when the respective pixels' color values are roughly consistent. This limits the usefulness of these methods when employed in complex real-world scenarios.
To alleviate these problems, we propose to compare input and reconstructed images using the structural similarity (SSIM) metric, a distance measure designed to capture perceptual similarity. By applying this method to a real-world inspection dataset of industrial relevance, we show that it solves the aforementioned problems and yields a performance that is on par with other state-of-the-art unsupervised defect segmentation approaches (cf. Section 4.3). In contrast to these, we do not rely on any model priors, such as handcrafted features or pretrained networks. Figure 1 shows some qualitative results of our method.
2 Related Work
Detecting anomalies that deviate from the training data has been a long-standing problem in machine learning. Pimentel et al. give a comprehensive overview of the field. In computer vision, one needs to distinguish between two setups of this task. First, there is the classification scenario, where novel samples appear as entirely different object classes that shall be labeled as outliers. Second, there is a scenario where novelties manifest themselves in subtle deviations from otherwise known structures and a segmentation of these deviations is required. For the first subproblem, a number of approaches have been proposed [5, 6]. We will limit ourselves to an overview of methods that attempt to tackle the latter problem.
Napoletano et al. extract features from a CNN that has been pretrained on a classification task. The features are clustered in a dictionary during training and anomalous structures are identified when the extracted features strongly deviate from the learned cluster centers. General applicability of this approach is not guaranteed since the pretrained network might not extract useful features for the new task at hand and it is unclear which features of the network should be selected for clustering. The results achieved with this method are the current state of the art on the NanoTWICE dataset (cf. Section 4.1) we use in our experiments. They improve upon previous results by Carrera et al., who build a dictionary that yields a sparse representation of the normal data. Similar approaches using sparse representations for novelty detection are [9, 10, 11].
Schlegl et al. train a GAN on optical coherence tomography images of the retina and detect anomalies such as retinal fluid by searching for a latent sample that minimizes the pixel-wise reconstruction error as well as a discriminator loss. The rather large number of optimization steps that must be performed to find a suitable latent sample makes this approach very slow. Therefore, it is only of use in practical applications that are not time-critical. Recently, Zenati et al. proposed to use bi-directional GANs to add the missing encoder network for faster inference. However, GANs are prone to mode collapse, meaning that there is no guarantee that all modes of the distribution of non-defective images are captured by the model. Furthermore, they are more difficult to train than autoencoders since the adversarial loss can typically not be optimized to convergence. Instead, the training results must be judged manually after regular optimization intervals.
Baur et al. propose a general framework for defect segmentation using autoencoding architectures and a per-pixel reconstruction loss. To circumvent the disadvantages of their loss function, they improve the reconstruction quality by requiring aligned input data and adding an adversarial loss to enhance the visual quality of the reconstructed images. However, for many applications that work on unstructured data, prior alignment is impossible. Moreover, adversarial losses not only introduce instabilities during training but might also alter the visual appearance of the reconstruction, which further discourages the use of a per-pixel error function.
Other approaches take into account the structure of the latent space of variational autoencoders in order to define measures for outlier detection. An et al. [19] disregard the decoder output entirely and instead compute the KL divergence as a novelty measure between the prior and the encoder distribution. This is based on the underlying assumption that defective inputs will manifest themselves in mean and variance values that are very different from those of the prior. Similarly, Vasilev et al. define multiple novelty measures, either purely considering latent space behavior or combined measures with pixel-wise reconstruction losses. Obtaining only a single scalar value that indicates novelty can quickly become a performance bottleneck in a segmentation scenario, where a separate forward pass would be required for each image pixel to obtain an accurate segmentation result. Furthermore, we show that pixel-wise reconstruction probabilities obtained from variational autoencoders suffer from the same problems as pixel-wise deterministic losses (cf. Section 4.3).
Ridgeway et al. show that SSIM and the closely related multi-scale version MS-SSIM can be used as differentiable loss functions to generate sharper reconstructions in autoencoding architectures. Autoencoders are straightforward to train and reliably reconstruct non-defective images while visually altering defective regions to keep the reconstruction close to the learned manifold of the training data. While pixel-wise loss functions are not designed to detect such structural changes, SSIM performs much better at identifying these alterations since it is designed to measure perceptual similarity.
Autoencoders attempt to reconstruct an input image $x \in \mathbb{R}^{k \times h \times w}$ through a bottleneck, effectively projecting the input image into a lower-dimensional space, called latent space. An autoencoder consists of an encoder function $E: \mathbb{R}^{k \times h \times w} \to \mathbb{R}^{d}$ and a decoder function $D: \mathbb{R}^{d} \to \mathbb{R}^{k \times h \times w}$, where $d$ denotes the dimensionality of the latent space and $k$, $h$, $w$ denote the number of channels, height, and width of the input image, respectively. Choosing $d \ll k \times h \times w$ prevents the architecture from simply copying its input and forces the encoder to extract useful features from the input patches that facilitate accurate reconstruction by the decoder. The overall process can be summarized as
$$\hat{x} = D(E(x)) = D(z), \qquad (1)$$
where $z$ denotes the latent vector and $\hat{x}$ is the reconstruction of the input. In the following, the functions $E$ and $D$ are parameterized by CNNs. Strided convolutions are used to down-sample the input feature maps in the encoder and to up-sample them in the decoder.
To force the autoencoder to reconstruct its input, a loss function must be defined that guides it towards this behavior. For simplicity, one often chooses a per-pixel error measure, such as the $\ell^2$ loss
$$L_2(x, \hat{x}) = \sum_{r=0}^{h-1} \sum_{c=0}^{w-1} \bigl(x(r, c) - \hat{x}(r, c)\bigr)^2, \qquad (2)$$
where $x(r, c)$ denotes the intensity value of image $x$ at row and column indices $r$ and $c$. This loss function is also widely used for both the training and the evaluation of unsupervised defect segmentation autoencoders. We will discuss the usefulness of such a pixel-wise error measure and present a better alternative, the structural similarity index, in Section 3.2.
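As a minimal sketch of such a per-pixel error measure, the following snippet assumes the measure is the squared intensity difference and that images are plain nested lists of gray values. The function names `l2_error_map` and `l2_loss` are illustrative and not taken from the original work.

```python
def l2_error_map(x, x_hat):
    """Per-pixel squared difference between an image and its reconstruction."""
    return [[(a - b) ** 2 for a, b in zip(row_x, row_r)]
            for row_x, row_r in zip(x, x_hat)]


def l2_loss(x, x_hat):
    """Sum of the per-pixel squared differences over all rows and columns."""
    return sum(sum(row) for row in l2_error_map(x, x_hat))


# Toy example: a vertical edge that the reconstruction places one pixel
# too far to the left in the first row.
x = [[0, 0, 1, 1],
     [0, 0, 1, 1]]
x_hat = [[0, 1, 1, 1],
         [0, 0, 1, 1]]

print(l2_error_map(x, x_hat))  # the misplaced edge pixel receives the full error
print(l2_loss(x, x_hat))
```

The toy input hints at the weakness discussed in the text: a one-pixel localization error of an edge is penalized as strongly as a genuinely wrong pixel.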
There exist various extensions to the deterministic autoencoder framework. Some works, such as the recently introduced variational autoencoder (VAE), impose constraints on the latent variables to follow a certain distribution $P(z)$. For simplicity, the distribution is typically chosen to be a unit-variance Gaussian. This turns the entire framework into a probabilistic model that enables efficient posterior inference and also allows generating new data from the training manifold by sampling from the latent distribution. The approximate posterior distribution $Q(z \mid x)$ obtained by encoding an input image can be used to define further novelty measures. One option is to compute a distance between the two distributions, such as the KL divergence, and indicate novelty for large deviations from the prior. However, this approach by itself does not yield a pixel-accurate segmentation, and a forward pass needs to be performed for a patch centered around each pixel of the entire input image. A second approach for utilizing the posterior, which yields a novelty score for each pixel, is to decode latent samples drawn from $Q(z \mid x)$ and then to evaluate the per-pixel reconstruction probability as described in [18].
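The KL-based novelty measure can be sketched as follows, assuming the encoder outputs the mean and standard deviation of a diagonal Gaussian posterior and the prior is a zero-mean, unit-variance Gaussian. The function name `kl_to_standard_normal` is hypothetical; the closed form used is the standard one for this pair of distributions.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian.

    Closed form: 0.5 * sum(mu_i^2 + sigma_i^2 - 1 - ln(sigma_i^2)).
    """
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


# A posterior that coincides with the prior has zero divergence ...
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# ... while shifted means or changed variances increase the novelty score.
print(kl_to_standard_normal([2.0, -1.0], [0.5, 1.5]))
```

Note that this yields one scalar per encoded patch, which is why a separate forward pass per pixel would be needed for a dense segmentation.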
Another extension to standard autoencoders was proposed by Dosovitskiy et al. They increase the quality of the produced reconstructions by extracting features from both the input image $x$ and its reconstruction $\hat{x}$ and enforcing them to be equal. Let $F: \mathbb{R}^{k \times h \times w} \to \mathbb{R}^{f}$ be a feature extractor that obtains an $f$-dimensional feature vector from an input image. Then a regularizer can be added to the loss function of the autoencoder, yielding the feature matching autoencoder (FM-AE) loss
$$L_{\mathrm{FM}}(x, \hat{x}) = L_2(x, \hat{x}) + \lambda \, \lVert F(x) - F(\hat{x}) \rVert_2^2, \qquad (3)$$
where $\lambda > 0$ denotes the weighting factor between the two loss terms. $F$ can be parametrized using the first layers of a CNN pretrained on an image classification task. We show that employing such more elaborate architectures does not yield satisfactory improvements over deterministic autoencoders trained and evaluated with a pixel-wise distance.
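The structure of such a feature matching loss can be illustrated with a small sketch. It assumes that both loss terms are squared Euclidean distances and replaces the pretrained CNN by a deliberately trivial, hypothetical feature extractor (`toy_features`) on flattened images; only the way the two terms are combined is meant to carry over.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def toy_features(image):
    """Hypothetical stand-in for a pretrained feature extractor:
    mean intensity and total absolute difference of neighboring pixels."""
    mean = sum(image) / len(image)
    variation = sum(abs(b - a) for a, b in zip(image, image[1:]))
    return [mean, variation]


def fm_ae_loss(x, x_hat, features, lam):
    """Pixel-wise term plus lam times the feature-space term."""
    pixel_term = squared_distance(x, x_hat)
    feature_term = squared_distance(features(x), features(x_hat))
    return pixel_term + lam * feature_term


x = [0, 1, 0, 1]               # flattened striped patch
x_hat = [0.5, 0.5, 0.5, 0.5]   # blurry reconstruction with the same mean

print(fm_ae_loss(x, x_hat, toy_features, lam=0.0))  # pixel term only
print(fm_ae_loss(x, x_hat, toy_features, lam=1.0))  # feature term penalizes lost structure
```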
3.2 Structural Similarity
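The SSIM metric defines a symmetric distance measure between two $K \times K$ image patches $p$ and $q$, taking into account their similarity in luminance $l(p, q)$, contrast $c(p, q)$, and structure $s(p, q)$. These are combined as a product
$$\mathrm{SSIM}(p, q) = l(p, q)^{\alpha} \, c(p, q)^{\beta} \, s(p, q)^{\gamma}, \qquad (4)$$
where $\alpha$, $\beta$, and $\gamma$ are weight factors for the three terms. They are typically set to $\alpha = \beta = \gamma = 1$ to simplify the expression. Based on the mean values $\mu_p$ and $\mu_q$, variances $\sigma_p^2$ and $\sigma_q^2$, and covariance $\sigma_{pq}$, the above equation can then be compactly rewritten as
$$\mathrm{SSIM}(p, q) = \frac{(2 \mu_p \mu_q + c_1)(2 \sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)}. \qquad (5)$$
The constants $c_1$ and $c_2$ ensure numerical stability and are typically set to small positive values. It holds that $\mathrm{SSIM}(p, q) \in [-1, 1]$. In particular, $\mathrm{SSIM}(p, q) = 1$ if and only if $p$ and $q$ are identical.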
To compute the structural similarity between an entire image $x$ and its reconstruction $\hat{x}$, one slides a $K \times K$ window across the image and computes an SSIM value at each pixel location. Since Equation 5 is differentiable, it can be employed as a loss function in deep learning architectures that are optimized using gradient descent.
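The sliding-window computation can be sketched in a few lines. This is an illustration under several assumptions, not the authors' implementation: it uses the standard compact SSIM form from Wang et al. with unweighted (population) statistics, the default values of `c1` and `c2` are placeholders, and only window positions that fit entirely inside the image are evaluated, so the resulting map is smaller than the input.

```python
def ssim(p, q, c1=0.01, c2=0.03):
    """SSIM of two equally sized patches given as flat lists of intensities.

    c1 and c2 are stabilizing constants; the defaults are assumptions.
    """
    n = len(p)
    mu_p = sum(p) / n
    mu_q = sum(q) / n
    var_p = sum((a - mu_p) * (a - mu_p) for a in p) / n
    var_q = sum((b - mu_q) * (b - mu_q) for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    numerator = (2 * mu_p * mu_q + c1) * (2 * cov + c2)
    denominator = (mu_p * mu_p + mu_q * mu_q + c1) * (var_p + var_q + c2)
    return numerator / denominator


def ssim_map(x, x_hat, k):
    """One SSIM value per position of a k x k window that fits in the image."""
    h, w = len(x), len(x[0])
    result = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            p = [x[r + i][c + j] for i in range(k) for j in range(k)]
            q = [x_hat[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(ssim(p, q))
        result.append(row)
    return result


# Checkerboard input; the reconstruction replaces the top-left 2x2 block by gray.
x = [[(r + c) % 2 for c in range(6)] for r in range(6)]
x_hat = [[0.5 if r < 2 and c < 2 else x[r][c] for c in range(6)] for r in range(6)]

similarity = ssim_map(x, x_hat, k=3)
print(similarity[0][0])  # window covering the altered block: clearly below 1
print(similarity[3][3])  # untouched window: 1
```

A novelty map can then be derived from the similarity map, for example as one minus the SSIM value, so that structurally altered regions receive high scores.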
Figure 2 shows the advantages that SSIM has over pixel-wise error functions such as $\ell^2$. In the left image of Figure 2, we see the input to an autoencoder that contains four gray strokes that simulate defects. The right image shows the corresponding reconstruction created by an autoencoder trained on defect-free checkerboard patterns. Figure 2 also shows the error maps obtained when computing the SSIM distance (left) and the $\ell^2$ distance (right) between the two images. For the $\ell^2$ distance, both the defects and the inaccuracies in the reconstruction of the edges are weighted equally in the error map, which makes them indistinguishable. In contrast, SSIM gives more weight to the actual defects, assigning less importance to the small inaccuracies in the reconstruction of the edges. This ultimately enables us to detect and segment defects in complex structures.
We evaluate our method on a dataset of nanofibrous materials and compare it to $\ell^2$-loss-based deterministic, variational, and feature matching autoencoders. Figure 1 shows two images of the dataset where red contours outline the ground truth of present defects and green areas indicate defective regions found by our method.
4.1 The NanoTWICE Dataset
The dataset consists of 45 gray-scale images of nanofibrous materials acquired by a scanning electron microscope and is publicly available at http://www.mi.imati.cnr.it/ettore/NanoTWICE/. A detailed description of the acquisition process can be found in [8]. All images have the same size, and the dataset is composed of two disjoint subsets. The first set consists of five images that do not contain any anomalies. We use four of these images for training. The fifth can be used as a validation image for setting the threshold during test time by fixing a certain false positive rate. The remaining 40 images constitute the second subset, which is used for testing. These images contain various defects such as beads, specks of dust, or flattened areas, which are annotated with a pixel-wise segmentation map.
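Setting the threshold from the defect-free validation image can be sketched as follows. The snippet assumes that the novelty map of the validation image has been flattened into a list of scores in which larger values mean "more anomalous", so every validation pixel above the threshold is a false positive. The helper name `threshold_for_fpr` is hypothetical.

```python
import math


def threshold_for_fpr(validation_scores, target_fpr):
    """Return a threshold t such that at most target_fpr of the defect-free
    validation scores lie strictly above t."""
    scores = sorted(validation_scores)
    n = len(scores)
    # Number of validation pixels that are allowed to exceed the threshold.
    allowed = min(math.floor(target_fpr * n + 1e-9), n - 1)
    return scores[n - allowed - 1]


validation_scores = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
t = threshold_for_fpr(validation_scores, target_fpr=0.2)
false_positive_rate = sum(s > t for s in validation_scores) / len(validation_scores)
print(t, false_positive_rate)
```

At test time, pixels whose novelty score exceeds `t` would be marked as defect candidates.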
General outline of our autoencoder architecture. The depicted values correspond to the structure of the encoder; the decoder is built as a reversed version of this. Leaky rectified linear units (ReLUs) are applied as activation functions after each layer except for the output layers of both the encoder and the decoder, in which linear activation functions are used.
4.2 Training and Testing Procedure
For the training of our autoencoder, we employ the following steps. First, we extract 20,000 patches from the given training images, since the input images are comparably large and only few of them are available. Based on our general autoencoding structure as shown in Figure 3, we set up four different architectures for training and evaluation. First, we train three networks using the $\ell^2$ error metric: a deterministic, a variational, and a feature matching autoencoder. The fourth architecture is a deterministic autoencoder using SSIM. We train each network for 200 epochs, using the ADAM optimizer with a learning rate of 0.0002 and weight decay.
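The patch extraction step might look like the following sketch. The text does not state how patch positions are chosen; the snippet assumes uniformly random crop positions, and the function name, the seed, and the toy sizes are illustrative only.

```python
import random


def extract_random_patches(image, patch_size, count, seed=0):
    """Crop `count` square patches at uniformly random positions (assumption)."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    patches = []
    for _ in range(count):
        r = rng.randint(0, h - patch_size)  # randint bounds are inclusive
        c = rng.randint(0, w - patch_size)
        patches.append([row[c:c + patch_size] for row in image[r:r + patch_size]])
    return patches


# Toy image whose pixel value encodes its position, to make crops easy to check.
image = [[r * 10 + c for c in range(10)] for r in range(8)]
patches = extract_random_patches(image, patch_size=4, count=5)
print(len(patches), patches[0][0])
```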
In order to improve the quality of our reconstructions, which might enable the $\ell^2$ error metric to find defects more reliably, we train a deterministic autoencoder with the feature matching loss defined in Equation 3. For calculating the features to be compared between the input and reconstructed image, we use the first three convolutional layers of an AlexNet pretrained on ImageNet.
| Latent dimension | AUC | SSIM window size | AUC | Patch size | AUC |
Table 1: Area under the ROC curve for different hyperparameters.
The evaluation is performed by striding over the test images and reconstructing image patches using the trained autoencoder. In principle, it would be possible to set the horizontal and vertical stride to the patch size. We noted, however, that at different spatial locations the autoencoder produces slightly different reconstructions of the same data, which leads to some striding artifacts. Therefore, we decreased the stride so that neighboring patches overlap and averaged the reconstructed pixel values. Then, we compare the input to the reconstruction using the respective error metric that was used for training ($\ell^2$ or SSIM). In the case of the variational autoencoder, we decode latent samples from the approximate posterior distribution and evaluate the reconstruction probability for each pixel as a novelty score. We expect larger variance for defective input patches, yielding lower reconstruction probabilities, which might improve the performance in comparison to the deterministic autoencoder. The resulting novelty maps are thresholded to obtain candidate regions where a defect might be present. An opening with a circular structuring element of diameter four is applied as a morphological post-processing step to delete outlier regions that are only a few pixels wide. We take the convex hull of each region found in order to close spurious holes in the segmentation result. An overview of the final novelty detection pipeline is depicted in Figure 3.
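The averaging of overlapping patch reconstructions can be sketched as below. The `reconstruct` callable stands in for the trained autoencoder, and the handling of the image border (adding a final patch aligned with the bottom and right edges) is an assumption that the text does not specify.

```python
def reconstruct_with_overlap(image, patch_size, stride, reconstruct):
    """Reconstruct an image patch by patch and average overlapping pixels."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]

    rows = list(range(0, h - patch_size + 1, stride))
    cols = list(range(0, w - patch_size + 1, stride))
    # Assumption: add a last patch so the bottom and right borders are covered.
    if rows[-1] != h - patch_size:
        rows.append(h - patch_size)
    if cols[-1] != w - patch_size:
        cols.append(w - patch_size)

    for r in rows:
        for c in cols:
            patch = [row[c:c + patch_size] for row in image[r:r + patch_size]]
            rec = reconstruct(patch)
            for i in range(patch_size):
                for j in range(patch_size):
                    total[r + i][c + j] += rec[i][j]
                    count[r + i][c + j] += 1

    return [[total[i][j] / count[i][j] for j in range(w)] for i in range(h)]


# With an identity "autoencoder", averaging must return the input unchanged.
image = [[r * 6 + c for c in range(6)] for r in range(5)]
averaged = reconstruct_with_overlap(image, patch_size=3, stride=2,
                                    reconstruct=lambda patch: patch)
print(averaged[0])
```

A stride smaller than the patch size means every interior pixel is reconstructed several times, which is what smooths out the striding artifacts mentioned above.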
Using this setup, a forward pass through our architecture for a single patch takes 14.1 milliseconds (ms) on a Tesla K40c GPU and the inference on a full input image takes around 9.6 seconds. This is comparable to the runtime reported by Napoletano et al. One should keep in mind, however, that the segmentations produced in their experiments are made up of blocks of several pixels each. For their method to achieve a truly pixel-accurate segmentation, a much higher runtime would be required. Additionally, as argued in prior work, the computational time achieved with our method falls well below the time needed to produce a nanofiber sample and is therefore sufficient for the applicability of our algorithm.
We tested different hyperparameter settings using the deterministic autoencoder trained with the SSIM loss, before using the same values for all architectures to ensure comparability. We varied the latent space dimension of the autoencoder, the window size of the SSIM similarity measure, and the size of the patches that the autoencoder was trained on. Table 1 shows the respective areas under the receiver operating characteristic (ROC) curves when evaluating the trained networks. Here, the true positive rate is defined as the percentage of defective pixels that were correctly classified as defective across the entire dataset. The false positive rate is the percentage of non-defective pixels that were wrongly classified as defective. Our approach is rather insensitive to different hyperparameter settings. However, if the latent space dimension is not set to a sufficiently large value, the autoencoder fails to reconstruct non-defective images and therefore its performance decreases. Nevertheless, increasing the latent space dimension does not improve the performance indefinitely. As it weakens the effect of the bottleneck, it ultimately enables the network to copy its inputs and thus perfectly reconstruct defective regions, rendering their detection impossible.
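The pixel-wise ROC evaluation can be illustrated with a small sketch. It assumes flattened novelty scores and binary ground-truth labels (1 for defective pixels), thresholds at every distinct score, and integrates with the trapezoidal rule; the function names are illustrative, and both classes must be present.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs for all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.1]   # novelty score per pixel
labels = [1, 0, 1, 0]           # ground truth: 1 = defective pixel

print(roc_points(scores, labels))
print(area_under_curve(roc_points(scores, labels)))
```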
In Figure 4, we see an example that visualizes the difference in performance of autoencoders using the $\ell^2$ error metric and SSIM. Both approaches manage to reconstruct the non-defective parts of the image and significantly alter the appearance of the defect in the reconstruction. The $\ell^2$ distance fails to segment the defect since it cannot be distinguished from the large novelty scores that are produced around the reconstructed non-defect edges. Moreover, since the defect is replaced by a structure that has similar color values as the input, the $\ell^2$ error fails to detect a large portion of the defect surface. In contrast, SSIM gives more weight to the visually altered area such that the defect can be reliably segmented.
This general behavior manifests itself in our numerical results as well. Figure 5 compares the ROC curves and their respective area under the curve (AUC) values of our approach using SSIM to the ones of deterministic, variational, and feature matching autoencoders that employ the pixel-wise $\ell^2$ distance. The performance of the deterministic and variational autoencoder is only marginally better than classifying each pixel randomly. We found the reconstructions obtained by different latent samples from the posterior of the VAE not to vary greatly. Thus, the VAE could not improve on the deterministic framework. Feature matching yields a better performance as it manages to produce better reconstructions with more accurate edge locations. This enables the $\ell^2$ error metric to detect some of the anomalies. However, the results are still not competitive with other state-of-the-art methods on this dataset. Our method using SSIM outperforms all other tested architectures, indicating that altering the loss function can indeed boost performance on complex, unstructured datasets. The achieved AUC of 0.966 is comparable to the state of the art as given in [7], where values of up to 0.974 are reported. In contrast to their method, our approach does not rely on any model priors such as handcrafted features or pretrained networks.
Since defects of smaller size contribute less to the overall true positive rate when weighting all pixels equally, we further evaluated the overlap of each detected anomaly region with the ground truth and report the quantiles of this overlap in Figure 5. We can see that, even at low false positive rates, a large share of the defects overlaps substantially with the ground truth. Therefore, we outperform the previously reported results in this setting.
Figure 6 shows four close-ups of test images together with reconstructions produced by our autoencoder and the corresponding detection results. Our approach manages to find defects of various sizes as well as broken fibers. Note how the autoencoder alters the visual appearance of the defects in the reconstructed images, which ultimately enables us to detect them using SSIM.
We propose to use a structural similarity measure in combination with autoencoders for unsupervised defect segmentation. This measure is less sensitive to small inaccuracies of edge locations and instead focuses on structural differences that are more salient for humans. Employing it for the comparison of input images and reconstructions produced by an autoencoder, we manage to achieve state-of-the-art performance on a challenging dataset of nanofibrous materials which is of industrial relevance. We show that our approach constructs accurate error maps and manages to reliably detect defects of various scales. In contrast to the present state-of-the-art on this dataset, our method does not require the existence and selection of a layer of a pretrained CNN suited to the task at hand. Furthermore, it provides a pixel-accurate segmentation with an acceptable runtime.
In comparison, we evaluate the performance of autoencoders using the commonly used pixel-wise reconstruction error. We show that this approach is not well suited for the segmentation of defects in complex, real-world data. Even if we employ more sophisticated probabilistic novelty measures obtained from variational autoencoders or if we improve the quality of our reconstructions by employing a feature matching loss, per-pixel error metrics still perform significantly worse.
- Goodfellow et al.  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
- Goodfellow et al.  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- Wang et al.  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- Pimentel et al.  M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215–249, 2014.
- Perera and Patel  P. Perera and V. M. Patel, “Learning Deep Features for One-Class Classification,” arXiv preprint arXiv:1801.05365, 2018.
- Sabokrou et al.  M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially Learned One-Class Classifier for Novelty Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
- Napoletano et al.  P. Napoletano, F. Piccoli, and R. Schettini, “Anomaly Detection in Nanofibrous Materials by CNN-Based Self-Similarity,” Sensors, vol. 18, no. 1, p. 209, 2018.
- Carrera et al.  D. Carrera, F. Manganini, G. Boracchi, and E. Lanzarone, “Defect Detection in SEM Images of Nanofibrous Materials,” IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 551–561, 2017.
- Boracchi et al.  G. Boracchi, D. Carrera, and B. Wohlberg, “Novelty Detection in Images by Sparse Representations,” in 2014 IEEE Symposium on Intelligent Embedded Systems (IES). IEEE, 2014, pp. 47–54.
- Carrera et al.  D. Carrera, G. Boracchi, A. Foi, and B. Wohlberg, “Detecting anomalous structures by convolutional sparse models,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
- Carrera et al.  ——, “Scale-invariant anomaly detection with multiscale group-sparse models,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3892–3896.
- Schlegl et al.  T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 146–157.
- Zenati et al.  H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, “Efficient GAN-Based Anomaly Detection,” arXiv preprint arXiv:1802.06222, 2018.
- Donahue et al.  J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial Feature Learning,” International Conference on Learning Representations, 2017.
- Arjovsky and Bottou  M. Arjovsky and L. Bottou, “Towards Principled Methods for Training Generative Adversarial Networks,” International Conference on Learning Representations, 2017.
- Baur et al.  C. Baur, B. Wiestler, S. Albarqouni, and N. Navab, “Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images,” arXiv preprint arXiv:1804.04488, 2018.
- Kingma and Welling  D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” International Conference on Learning Representations, 2014.
- An and Cho  J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” SNU Data Mining Center, Tech. Rep., 2015.
- Soukup and Pinetz  D. Soukup and T. Pinetz, “Reliably Decoding Autoencoders’ Latent Spaces for One-Class Learning Image Inspection Scenarios,” in OAGM Workshop 2018. Verlag der Technischen Universität Graz, 2018.
- Vasilev et al.  A. Vasilev, V. Golkov, I. Lipp, E. Sgarlata, V. Tomassini, D. K. Jones, and D. Cremers, “q-Space Novelty Detection with Variational Autoencoders,” arXiv preprint arXiv:1806.02997, 2018.
- Ridgeway et al.  K. Ridgeway, J. Snell, B. Roads, R. S. Zemel, and M. C. Mozer, “Learning to generate images with perceptual similarity metrics,” arXiv preprint arXiv:1511.06409, 2015.
- Wang et al.  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
- Dosovitskiy and Brox  A. Dosovitskiy and T. Brox, “Generating Images with Perceptual Similarity Metrics based on Deep Networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
- Kingma and Ba  D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, 2015.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification With Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- Russakovsky et al.  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- Steger et al.  C. Steger, M. Ulrich, and C. Wiedemann, Machine Vision Algorithms and Applications, 2nd ed. Weinheim: Wiley-VCH, 2018.