In the era of large-scale datasets, the ability to automatically detect outliers (or anomalies) in data is relevant in many application fields. Recognizing anomalous textures is often required, e.g., for the detection of defects in industrial manufacturing [tout2017automatic, bergmann2019mvtec] and for the large-scale maintenance of infrastructure such as roads, bridges, and rails, while other applications require the detection of more complex shapes and structures, like biomedical applications [prokopetc2017slim, schlegl2017unsupervised] dealing, for example, with early disease detection via medical imaging. Even if this problem could be tackled with discriminative approaches (e.g. as a supervised binary classification problem in which samples are classified as normal or anomalous), the cost of collecting a representative training dataset, especially of anomalous samples, is often prohibitive. Thus, interest has grown in one-class anomaly detection. This is often cast as a learning problem where normal data is modelled with generative approaches, relying on an unlabeled dataset mostly comprised of non-anomalous samples.
Recent developments in deep learning and image representation have introduced visual anomaly detection in many applications, for instance biomedical (e.g. diagnosis aids) and industrial (e.g. quality assurance) ones. In these applications, classical approaches to anomaly detection, such as statistical-based [yang2009outlier, cohen2008novelty], proximity-based [radovanovic2014reverse, chehreghani2016k], or clustering-based [aggarwal2015outlier, manzoor2016fast] approaches, often perform poorly when applied directly to images, due to the high dimensionality of this type of data, and usually have to rely on the extraction of lower-dimensional hand-crafted or learned features.
Recent approaches directly adopt deep models to contextually map and model the data in a feature space. In this context, deep generative models for images based on Autoencoders (AEs) [bergmann2018improving, wang2020advae] or Generative Adversarial Networks (GANs) [schlegl2017unsupervised, schlegl2019f, akcay2018ganomaly] have proved to be effective in anomaly detection. Most approaches in this category are reconstruction-based: starting from a given sample, they reconstruct the nearest sample on the normal data manifold learned by the model and measure the deviation in a predefined space (e.g. pixel, latent, or combinations) to assess an anomaly. Autoencoder-based approaches implement this strategy in a straightforward way: the encoder projects the given sample onto the normal data manifold, while the decoder performs the reconstruction. However, generative models based on autoencoders, like VAEs, are known to produce blurry reconstructions for photo-realistic images [dumoulin2017adversarially] and are often outperformed by GANs; thus, in this work, we focus our attention on GAN-based approaches for one-class anomaly detection.
[Figure 1. Panels: Texture - Grid; Object - Screw.]
Input reconstruction in GAN-based approaches poses a challenge as, in the standard formulation, there is no encoder that projects the sample into the latent space, and an expensive latent space optimization is often required [schlegl2017unsupervised]. Efficient approaches based on Bidirectional GANs (BiGANs) solve this problem by jointly learning an encoder that complements the decoder (i.e. the generator) and provides the projection needed (e.g. EGBAD [zenati2018efficient]). However, reconstructed samples are often misaligned: the reconstruction goal is only defined by the discriminator, and there is no guarantee that encoding and then decoding a sample will reconstruct it precisely (see Figure 1).
In this work, we tackle the aforementioned problems and propose a novel method for anomaly detection in images, where we introduce a consistency constraint as a regularization term on both the encoding and decoding parts of a BiGAN. In the remainder of the paper, we call our model CBiGAN. Our new formulation improves the reconstruction ability with respect to a BiGAN. It also generalizes both EGBAD [zenati2018efficient] and AEs by combining the modelling power of the former and the reconstruction consistency of the latter. We evaluate the proposed method on MVTec AD, a real-world benchmark for unsupervised anomaly detection on high-resolution images, and compare against standard baselines and state-of-the-art approaches. Our results show that the proposed method solves misalignment problems commonly occurring in GAN formulations, improving the performance of BiGAN for complex objects by a large margin. Moreover, our proposal performs comparably to expensive state-of-the-art iterative methods while reducing the computational cost, as it requires only a single-pass evaluation strategy. We observe that our model is particularly effective on texture-type anomaly detection, where it sets a new state of the art, also outperforming models that use additional data.
II Related Work
Most recent approaches for anomaly detection on images adopt reconstruction-based techniques relying on some sort of deep generative model. Autoencoders (standard [zhou2017anomaly, chen2017outlier, bergmann2018improving] and variational ones [dehaene2020iterative, wang2020advae]) and generative adversarial networks [schlegl2017unsupervised, akcay2018ganomaly, zenati2018efficient] are the most commonly adopted techniques: the former are trained to minimize a reconstruction loss, usually in the pixel space, while the latter focus on generating samples indistinguishable from normal data, indirectly leading to reconstruction. Anomaly detection is then implemented by defining a score based on the reconstruction error and additional metrics, such as feature matching in latent or intermediate space. Concerning AE techniques, bergmann2018improving point out that dependencies between pixels are not properly modelled in autoencoders and show that vanilla AEs using a structural similarity index metric (SSIM) loss outperform even complex architectures (e.g. AEs with feature matching and VAEs) that rely on L2 pixel-wise losses. In [pawlowski2018unsupervised], the authors instead propose Bayesian convolutional autoencoders: they show that sampling in variational models smooths the noisy reconstruction error compared to other AE architectures on medical data. zimmerer2019unsupervised suggest that the gradient of the KL loss of a VAE w.r.t. the input image provides useful information on the normality of data and adopt it for anomaly localization. Building on this concept, dehaene2020iterative propose to iteratively integrate these gradients to perform reconstruction by moving samples towards the normality manifold in pixel space, in a manner similar to adversarial example crafting.
golan2018deep set up a self-supervised multi-class pretext task of geometric transformation recognition and use the softmax output on transformed samples as features to characterize abnormality. Similarly, huang2019inverse propose to train a deep architecture to invert geometric transformations of the input and use the reconstruction error as the anomaly score for the input image.
In our work, we build upon GAN-based anomaly detection, i.e. techniques that exploit GANs to learn the anomaly-free distribution of data, assuming a uniformly or normally distributed latent space explaining it. We focus on the scenario in which we assume a mostly anomaly-free training dataset, and we refer the interested reader to berg2019unsupervised for an analysis of GAN-based detectors when this assumption does not hold. Among the seminal works in this category, schlegl2017unsupervised proposed AnoGAN: the generator is used to perform reconstruction, and the discriminator to extract features of original and reconstructed samples; both are then adopted to define an anomaly score. The major drawback of this approach is the need for a computationally expensive latent space optimization using backpropagation to find the latent space representation that encodes the given sample. Recent works solve this problem by introducing an encoding module that learns to map samples from the input to the latent space. schlegl2019f propose fast-AnoGAN, an enhanced AnoGAN with a learned encoding module that adopts the more stable Wasserstein GAN formulation. akcay2018ganomaly propose an adversarially trained encoder-decoder-encoder architecture named GANomaly and define their anomaly score as the L1 error in the latent space between the original and reconstructed sample. zenati2018efficient propose EGBAD (Efficient GAN Based Anomaly Detection), which directly adopts a Bidirectional GAN (BiGAN [donahue2017adversarial]), an improved GAN formulation that learns the joint distribution of the latent and input spaces, and adopt a combination of reconstruction loss and discriminator loss as the anomaly score. In a similar vein, several works propose different architectures that implement some sort of adversarial training [sabokrou2018avid, venkataramanan2019attention, wang2020advae, zenati2018adversarially]. sabokrou2018avid propose a self-supervised approach for anomaly localization that adopts an adversarially-trained fully convolutional network plus a discriminator to detect anomalous pixels. venkataramanan2019attention propose an adversarially-trained variational autoencoder combined with a specialized attention regularization that encourages the network to model all the parts of the normal input and thus perform precise anomaly localization.
III-A Generative Adversarial Networks
In its basic formulation, a Generative Adversarial Network (GAN [goodfellow2014generative]) is comprised of a generator $G$ that generates data starting from a latent variable $z$ and a discriminator $D$ that discerns whether its input is generated by $G$ or coming from the real data distribution $p_{\mathcal{X}}$. $G$ and $D$ are adversarially trained to compete: $G$ is optimized to fool $D$, while $D$ is optimized to tell apart data generated by $G$ from real data. In a game-theoretic framework, $G$ and $D$ play the following two-player minimax game
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathcal{X}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\mathcal{Z}}}[\log(1 - D(G(z)))] \tag{1}$$
where $D(x)$ indicates how real the discriminator thinks its input is. At the Nash equilibrium of the game, $D$ is not able to discern fake samples from real ones, and thus, $G$ is producing samples from $p_{\mathcal{X}}$.
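As a small illustration of the value function of this minimax game, the sketch below estimates it from discriminator outputs on a batch of real and generated samples. The function name `gan_value` and the toy inputs are assumptions made for this example only; the check at the end uses the known result that a discriminator outputting 0.5 everywhere yields a value of $-\log 4$.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    if not d_real or not d_fake:
        raise ValueError("both batches must be non-empty")
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes the value towards 0 from below.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
print(undecided, confident)
```

The generator lowers this value by making `d_fake` larger, which is the sense in which it "fools" the discriminator.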
III-B Wasserstein GAN
Reaching the Nash equilibrium is known to be hard due to training instabilities [salimans2016improved]. Wasserstein GAN (WGAN [arjovsky2017wasserstein]) facilitates GAN training by measuring the distance between real and fake data distributions using the Wasserstein distance, which assures more stable gradients. In this formulation, $D$ is defined as a scalar helper function used to compute the Wasserstein distance that substitutes the value function in the minimax game
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathcal{X}}}[D(x)] - \mathbb{E}_{z \sim p_{\mathcal{Z}}}[D(G(z))] \tag{2}$$
The function of $D$ is shifted from a discriminating classifier to a critic that produces authenticity scores and tends to give high scores to real samples and low ones to fake data. To ensure Lipschitz-continuity of $D$, which is a prerequisite for obtaining Equation 2, we regularize the norm of the gradient of $D$ with respect to its inputs when $D$ is optimized, as in [gulrajani2017improved].
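For reference, the gradient-norm regularization of [gulrajani2017improved] is usually written as a penalty added to the critic loss. The following is a sketch in that general form, not necessarily the exact variant used here; the penalty weight $\lambda_{GP}$, the interpolated sample $\hat{x}$, and the mixing coefficient $\epsilon$ are symbols introduced for illustration only.

```latex
L_D^{GP} = \mathbb{E}_{z \sim p_{\mathcal{Z}}}\left[D(G(z))\right]
         - \mathbb{E}_{x \sim p_{\mathcal{X}}}\left[D(x)\right]
         + \lambda_{GP}\, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],
\qquad \hat{x} = \epsilon x + (1 - \epsilon)\, G(z), \quad \epsilon \sim U[0, 1]
```

The first two terms are the negated critic objective, and the last term pulls the gradient norm of the critic towards 1 on points sampled between real and generated data.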
Bidirectional GANs (BiGANs [donahue2017adversarial]) improve the modeling of the latent space by exposing it to the discriminator together with images generated from it. An encoder module $E$ is introduced to map real samples to the corresponding latent space and is trained together with $G$. The discriminator is trained to discern whether the couple $(x, z)$ comes from a real or generated image. The minimax problem for the BiGAN is
$$\min_{G, E} \max_D V(D, E, G) = \mathbb{E}_{x \sim p_{\mathcal{X}}}[\log D(x, E(x))] + \mathbb{E}_{z \sim p_{\mathcal{Z}}}[\log(1 - D(G(z), z))] \tag{3}$$
Thus, fooling $D$ leads $G$ and $E$ to minimize the difference between $(G(z), z)$ and $(x, E(x))$ couples.
We tackle anomaly detection as a one-class classification problem: we assume a training dataset $X \subset \mathcal{X}$ ($\mathcal{X} = \mathbb{R}^{h \times w \times c}$ for images) comprised of only non-anomalous samples. Given a test sample, we want to label it as normal/non-anomalous/defect-free (we consider it the negative class) or anomalous (positive). We rely on a GAN-based model to capture the distribution of normal data conditioned on the latent space $\mathcal{Z}$. Similarly to [zenati2018efficient], we adopt a BiGAN as generative model, but we instantiate it with the Wasserstein distance formulation; its minimax problem is defined as
$$\min_{G, E} \max_D \; \mathbb{E}_{x \sim p_{\mathcal{X}}}[D(x, E(x))] - \mathbb{E}_{z \sim p_{\mathcal{Z}}}[D(G(z), z)] \tag{4}$$
where $G$ is the generator producing fake images from latent representations, $E$ is the encoder projecting an image into its corresponding latent representation, and $D$ is the discriminator/critic that scores samples and helps implement the Wasserstein distance computation.
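All three modules are defined as deep neural networks and are optimized alternately (once $E$ and $G$, once $D$) with mini-batch gradient descent. Given a mini-batch of $n$ elements comprised of real samples $\{x_i\}_{i=1}^{n}$ and randomly sampled latent representations $\{z_i\}_{i=1}^{n}$, the losses for the players $(E, G)$ and $D$ are defined as
$$L_{E,G} = \frac{1}{n} \sum_{i=1}^{n} \left[ D(x_i, E(x_i)) - D(G(z_i), z_i) \right] \tag{5}$$
$$L_D = \frac{1}{n} \sum_{i=1}^{n} \left[ D(G(z_i), z_i) - D(x_i, E(x_i)) \right] \tag{6}$$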
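As a minimal sketch of the two mini-batch losses, assume the critic scores for the real couples $(x_i, E(x_i))$ and for the generated couples $(G(z_i), z_i)$ have already been computed. The function name and the sign convention (the critic scores real couples higher, as in the Wasserstein formulation above) are illustrative assumptions, not the authors' implementation.

```python
def wasserstein_bigan_losses(real_scores, fake_scores):
    """Return (loss_eg, loss_d) from critic scores on one mini-batch.

    real_scores[i] stands for D(x_i, E(x_i)); fake_scores[i] for D(G(z_i), z_i).
    Assumption: the critic maximizes mean(real) - mean(fake), so its loss is
    the negated gap, while the encoder/generator loss is the gap itself.
    """
    if not real_scores or len(real_scores) != len(fake_scores):
        raise ValueError("expected two non-empty score lists of equal length")
    n = len(real_scores)
    gap = sum(r - f for r, f in zip(real_scores, fake_scores)) / n
    loss_eg = gap   # E and G are updated to shrink the gap
    loss_d = -gap   # D is updated to widen it
    return loss_eg, loss_d


loss_eg, loss_d = wasserstein_bigan_losses([2.0, 1.0], [0.5, -0.5])
print(loss_eg, loss_d)  # 1.5 -1.5
```

Because the two losses are exact negatives of each other, alternating the updates (once $E$ and $G$, once $D$) plays the minimax game one step at a time.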
Once trained on normal data, the anomaly detection procedure is reconstruction-based: given a test sample $x$, we compute $E(x)$ to find its closest latent representation on the normal data manifold learned by the model and then compute $G(E(x))$ to build its reconstruction. Following the AnoGAN approach [schlegl2017unsupervised], we define the anomaly score $A(x)$ as a linear combination of two terms: a) the pixel-based L1 reconstruction error $L_R(x)$, and b) the feature-based discriminator error $L_{f_D}(x)$. Formally,
$$A(x) = (1 - \lambda)\, L_R(x) + \lambda\, L_{f_D}(x) \tag{7}$$
$$L_R(x) = \lVert x - G(E(x)) \rVert_1 \tag{8}$$
$$L_{f_D}(x) = \lVert f_D(x, E(x)) - f_D(G(E(x)), E(x)) \rVert_1 \tag{9}$$
where $\lVert \cdot \rVert_1$ is the $L_1$ norm, $f_D(\cdot, \cdot)$ is the feature vector extracted from an intermediate output of the discriminator, and $\lambda$ is a balancing hyperparameter whose value is commonly domain-specific. Intuitively, for an anomalous image both the reconstruction error and the distance between discriminator features increase, as the encoder usually maps the input into an out-of-distribution latent representation; therefore, the generator fails to reconstruct the input, and the discriminator extracts different representations for the original and reconstructed inputs.
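The scoring step can be sketched as follows, assuming the reconstruction and the two discriminator feature vectors are already available as flat sequences of numbers. The convex weighting `(1 - lam) * pixel + lam * feature` follows the AnoGAN convention and is an assumption here, as are the function names.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - q) for p, q in zip(a, b))


def anomaly_score(x, x_rec, feat_x, feat_rec, lam):
    """Blend the pixel error and the discriminator-feature error.

    x, x_rec:         input image and its reconstruction (flattened).
    feat_x, feat_rec: discriminator features of the input and reconstruction.
    lam:              balancing weight in [0, 1] (assumed convex combination).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    pixel_error = l1_distance(x, x_rec)            # reconstruction term
    feature_error = l1_distance(feat_x, feat_rec)  # feature term
    return (1.0 - lam) * pixel_error + lam * feature_error


score = anomaly_score(
    x=[0.0, 1.0, 1.0, 0.0], x_rec=[0.0, 1.0, 0.5, 0.0],
    feat_x=[1.0, 2.0], feat_rec=[1.0, 4.0], lam=0.5,
)
print(score)  # 1.25
```

A threshold on this score then separates normal from anomalous samples; a perfectly reconstructed input with identical features scores 0.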
Unfortunately for anomaly detection, the BiGAN training procedure does not put any constraint on $E$ and $G$ being aligned (i.e. $G = E^{-1}$ and vice versa are not guaranteed), and this can lead to errors and misalignment in reconstructed samples of normal data and thus to a high false positive rate. Figure 1 shows an example of this phenomenon; the reconstructed sample presents a slight rotation with respect to the input sample, and this results in an erroneously high anomaly score. To cope with this problem, we add a cycle consistency regularization term $L_C$ to both $E$ and $G$ to promote their alignment. Formally, over a mini-batch of $n$ samples,
$$L_C = \frac{1}{n} \sum_{i=1}^{n} \left[ \lVert x_i - G(E(x_i)) \rVert_1 + \lVert z_i - E(G(z_i)) \rVert_1 \right] \tag{10}$$
The new loss for $E$ and $G$ is a linear combination of their adversarial loss $L_{E,G}$ and $L_C$
$$L^{*}_{E,G} = (1 - \alpha)\, L_{E,G} + \alpha\, L_C \tag{11}$$
where $\alpha \in [0, 1]$ controls the weight of each contribution. We refer to the model trained with this formulation as Consistency BiGAN (CBiGAN), whose architecture is depicted in Figure 2. Note that setting $\alpha = 0$ we obtain the standard BiGAN approach (EGBAD [zenati2018efficient]), while with $\alpha = 1$ our model collapses into a non-adversarially-trained double autoencoder (on both latent and input space).
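The regularized objective can be sketched in a few lines, assuming the cycled samples $G(E(x_i))$ and $E(G(z_i))$ have already been computed. The use of the L1 norm for both cycles and the convex weighting with `alpha` are assumptions of this sketch (they match the two limit cases described above), and all names are hypothetical.

```python
def l1(a, b):
    """Sum of absolute differences between two equal-length flat sequences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def consistency_loss(xs, xs_cycled, zs, zs_cycled):
    """Batch mean of |x - G(E(x))|_1 + |z - E(G(z))|_1 (L1 norm assumed)."""
    n = len(xs)
    if n == 0 or not (n == len(xs_cycled) == len(zs) == len(zs_cycled)):
        raise ValueError("expected four non-empty batches of equal size")
    total = 0.0
    for x, xc, z, zc in zip(xs, xs_cycled, zs, zs_cycled):
        total += l1(x, xc) + l1(z, zc)
    return total / n


def regularized_loss(loss_eg, loss_c, alpha):
    """Convex combination of the adversarial and consistency terms."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * loss_eg + alpha * loss_c


loss_c = consistency_loss(
    xs=[[1.0, 2.0]], xs_cycled=[[1.0, 1.0]],  # image cycle: x -> E -> G
    zs=[[0.0]], zs_cycled=[[0.5]],            # latent cycle: z -> G -> E
)
print(loss_c)                              # 1.5
print(regularized_loss(2.0, loss_c, 0.0))  # 2.0, purely adversarial
print(regularized_loss(2.0, loss_c, 1.0))  # 1.5, purely autoencoding
```

The two printed limit cases mirror the text: `alpha = 0` keeps only the adversarial term, and `alpha = 1` keeps only the two reconstruction cycles.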
We tested and compared our approach on MVTec AD [bergmann2019mvtec], a recent real-world benchmark specifically tailored for unsupervised anomaly detection that is increasingly adopted in recent literature on this topic [venkataramanan2019attention, huang2019inverse, liu2019towards]. It is comprised of over 5,000 high-resolution industrial images divided into 10 object and 5 texture categories; for each category, MVTec AD provides a training set of non-anomalous (defect-free) images and a labelled test set containing both anomalous and normal images. Details of MVTec AD are reported in Table I, and examples (together with the reconstruction performed by our model) are depicted in Figure 4.
V-B Implementation Details
We follow the evaluation procedure commonly adopted in related work [schlegl2017unsupervised] in the unsupervised setting. We implemented $G$, $E$, and $D$ as standard residual convolutional networks. Architectural details are reported in Figure 3. We train CBiGAN separately for each category using the corresponding anomaly-free training set. Different preprocessing steps are adopted depending on the type of category considered:
for ‘Object’ categories, we resize the input images to 128x128, and we apply data augmentation in the form of random rotation in the range whenever the orientation of the object is independent of its abnormality — for Bottle, Hazelnut, Metal nut, and Screw categories;
for ‘Texture’ categories, we resize the input image to 512x512 and process it in 64x64 patches; when training, we randomly crop 64x64 patches from the input image and apply random clockwise rotation in the range, and during testing, we divide the input image into 64x64 patches, derive a local anomaly score for each patch, and obtain a global score by picking the maximum local score.
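The test-time procedure for textures can be sketched as below. The sketch assumes non-overlapping patches and takes the per-patch scorer as an argument; `local_score` is a hypothetical stand-in for the model's anomaly score, and a small 4x4 image with 2x2 patches replaces the 512x512 image with 64x64 patches for readability.

```python
def iter_patches(image, size):
    """Yield non-overlapping size x size patches of a 2-D list-of-lists image."""
    height, width = len(image), len(image[0])
    if height % size or width % size:
        raise ValueError("image sides must be multiples of the patch size")
    for top in range(0, height, size):
        for left in range(0, width, size):
            yield [row[left:left + size] for row in image[top:top + size]]


def global_score(image, size, local_score):
    """Global anomaly score: the maximum of the local patch scores."""
    return max(local_score(patch) for patch in iter_patches(image, size))


def toy_local_score(patch):
    """Placeholder scorer (sum of absolute values), NOT the model's score."""
    return sum(abs(v) for row in patch for v in row)


image = [[0.0] * 4 for _ in range(4)]
image[3][3] = 5.0  # a single "defect" in the bottom-right patch
print(global_score(image, 2, toy_local_score))  # 5.0
```

Taking the maximum means that one strongly anomalous patch is enough to flag the whole image, regardless of how many defect-free patches surround it.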
For all models and categories, we use a 64-dimensional latent space ($\mathcal{Z} = \mathbb{R}^{64}$) sampled using a normal distribution. We tuned the weight $\alpha$ of the consistency loss experimentally; good values are for ‘Objects’ and for ‘Textures’ depending on the importance of pixel-level details in detecting an anomaly. Note that small values of $\alpha$ are needed to balance the usually large values of the consistency loss term, which depends on the input and latent dimensions. We also produce the results for EGBAD [zenati2018efficient] by setting $\alpha = 0$. All models are trained using the Adam optimizer with a learning rate of . Following [schlegl2017unsupervised], we set when computing the anomaly score for testing images.
V-C Evaluation Metrics
To quantitatively evaluate the quality of the tested approaches, for each category we compute
the maximum Balanced Accuracy obtained when varying the threshold on the anomaly score; this is often reported by previous work [bergmann2019mvtec, venkataramanan2019attention] and referred to as “the mean of the ratio of correctly classified samples of anomaly-free (TNR) and anomalous images (TPR)”. The choice of the threshold is equivalent to the maximization of Youden’s index [sokolova2006beyond], $J = \mathrm{TPR} + \mathrm{TNR} - 1$.
the Area Under the ROC Curve (auROC), commonly adopted as a threshold-independent quality metric for classification.
We also report per-type and overall means of the above metrics as a unique indicator of the quality of the compared methods.
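Both metrics can be computed directly from anomaly scores and binary labels. The sketch below is a plain-Python illustration with hypothetical function names; it treats a sample as anomalous when its score is at or above the threshold and computes the auROC as the probability that an anomalous sample scores higher than a normal one, counting ties as one half.

```python
def confusion_rates(scores, labels, threshold):
    """Return (TPR, TNR); a score >= threshold is flagged as anomalous (label 1)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / pos, tn / neg


def max_balanced_accuracy(scores, labels):
    """Maximum of (TPR + TNR) / 2 over all distinct thresholds."""
    # Each observed score, plus one value above the maximum (nothing flagged).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best = 0.0
    for threshold in candidates:
        tpr, tnr = confusion_rates(scores, labels, threshold)
        best = max(best, (tpr + tnr) / 2.0)
    return best


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
best_ba = max_balanced_accuracy(scores, labels)
print(best_ba, 2 * best_ba - 1, auroc(scores, labels))  # 0.75 0.5 0.75
```

Since Youden's index equals twice the balanced accuracy minus one, the threshold that maximizes one also maximizes the other.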
V-D Comparison with State of the Art
We compare our approach with several methods tackling one-class anomaly detection and obtaining state-of-the-art results on MVTec AD. We divide the methods into
iterative methods that require an iterative optimization for each sample on which anomaly detection is performed and include AnoGAN [schlegl2017unsupervised] and VAE-grad [dehaene2020iterative], and
single-pass methods that perform anomaly detection in the testing phase with a single forward evaluation of the model and include AE$_{L2}$ and AE$_{SSIM}$ [bergmann2018improving], AVID [sabokrou2018avid], LSA [abati2019latent], EGBAD [zenati2018efficient], GeoTrans [golan2018deep], GANomaly [akcay2018ganomaly], ITAE [huang2019inverse], and our CBiGAN.
In the latter, we also include CAVGA [venkataramanan2019attention], even if the authors report results using models trained with additional data, which are thus not directly comparable with the other methods. For EGBAD, results come from our implementation, as it is generalized by CBiGAN and can be obtained with $\alpha = 0$.
Tables II and III report results for each compared method and for each category of MVTec AD. Compared to EGBAD, which is the approach we extend, we observe that the consistency constraint added in our model improves the detection of anomalies in all the categories (with Zipper being the only exception) by up to 49%, achieving a +15% improvement on the overall balanced accuracy; the same conclusion can also be drawn from the auROC metric. These performance gains are achieved while maintaining the exact same computational cost of EGBAD during prediction (a single forward pass of $E$ and $G$) and avoiding expensive iterative methods. The only additional cost with respect to the BiGAN approach is the computation of the consistency loss during the offline training phase.
Figure 5 reports some relevant examples of the differences between EGBAD and CBiGAN. Note that the higher quality of our reconstructions is due to the recovered alignment and the higher fidelity of background colors, which overall induce smaller differences in non-anomalous areas. The Zipper class is challenging for both EGBAD and our model, as we deem that the anomalous parts of the image (usually dents) are small with respect to other sources of variability (the border of the zipper) that need to be modeled (see Figure 5, last row); thus, both models tend to have a high reconstruction loss for both normal and anomalous samples.
Both metrics also show that CBiGAN improves on all the compared methods when dealing with texture anomalies, reaching a mean balanced accuracy and a mean auROC of 0.84 and 0.85, respectively, and also outperforming methods using additional data. When all categories are considered, our method performs comparably to both the other single-pass methods and the iterative methods in terms of overall balanced accuracy, obtaining +3% (vs AVID, LSA) and -1% (vs VAE-grad), respectively, with respect to the second-best methods.
Experiments on objects display a more complex situation; among single-pass methods, an absolute best does not emerge, and different categories benefit from specific peculiarities of each method. CBiGAN does improve on vanilla BiGAN (EGBAD) on objects, but overall, our method offers an average performance comparable or slightly degraded (2-3%) with respect to other single-pass methods. Note that the performance of the vanilla L2 autoencoder suggests that tuning the hyperparameter $\alpha$ for each particular object category could further increase the performance of CBiGAN (e.g. Zipper could benefit from a higher $\alpha$), but for the sake of simplicity, we refrain from exploring class-specific parameter values in this work and prefer presenting results for fixed reasonable values. Among the few methods reporting the auROC on MVTec AD data, our model still achieves state-of-the-art performance on textures and offers a better or comparable performance with respect to other methods on objects, with the only exception being ITAE [huang2019inverse], a data-augmentation-based denoising autoencoder. At the time of writing, we abstain from discussing the performance of ITAE due to reproducibility issues.
[Figure and table residue. Columns: EGBAD [zenati2018efficient]; CBiGAN (ours). Panels: Texture - Grid; Obj. - Toothbrush. Table row group: Methods using additional data.]
We tackled one-class anomaly detection of images using deep generative models and reconstruction-based approaches. We proposed CBiGAN, an improved Bidirectional GAN model with a consistency regularization on both the encoder and decoder modules. Our model generalizes and combines both BiGANs and Autoencoders to retain the modelling power of the former and the reconstruction accuracy of the latter. We evaluated our proposal on MVTec AD, a real-world benchmark for unsupervised visual anomaly detection with a focus on industrial applications. The results of our experiments showed that our proposal greatly improves the reconstruction ability (and thus the performance) of vanilla bidirectional GANs on both texture and object categories while maintaining their efficiency at test time. Our CBiGAN is particularly effective on texture-type images, where it sets the new state of the art, obtaining the best mean accuracy and auROC among competing methods, including expensive iterative ones and approaches using additional data. Concerning object-type images, we observed that no particular method prevails over the others, as each object category comes with different peculiarities. In this context, our model provides an average performance comparable with other efficient (single-pass) methods. In future work, we plan to evaluate our model also on the task of anomaly localization and to gain insight into the effect of $\alpha$ in that context.
This work was partially funded by “Automatic Data and documents Analysis to enhance human-based processes” (ADA, CUP CIPE D55F17000290009) and the AI4EU project (funded by the EC, H2020 - Contract n. 825619). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a Jetson TX2 used for this research.