I Introduction
In the era of large-scale datasets, the ability to automatically detect outliers (or anomalies) in data is relevant in many application fields. Recognizing anomalous textures is often required, e.g., for the detection of defects in industrial manufacturing [tout2017automatic, bergmann2019mvtec] and for large-scale infrastructural maintenance of roads, bridges, rails, etc., while other applications require the detection of more complex shapes and structures, like biomedical applications [prokopetc2017slim, schlegl2017unsupervised] dealing, for example, with early disease detection via medical imaging. Even if this problem could be tackled with discriminative approaches — e.g., as a supervised binary classification problem in which samples are classified as normal or anomalous — the cost of collecting a representative training dataset, especially of anomalous samples, is often prohibitive. Thus, interest has grown in one-class anomaly detection. This is often cast as a learning problem where normal data is modelled with generative approaches, relying on an unlabeled dataset mostly comprised of non-anomalous samples.
Recent developments in deep learning and image representation have brought visual anomaly detection to many applications, for instance biomedical (e.g., diagnostic aids) and industrial (e.g., quality assurance) ones. In these applications, classical approaches to anomaly detection, such as statistics-based [yang2009outlier, cohen2008novelty], proximity-based [radovanovic2014reverse, chehreghani2016k], or clustering-based [aggarwal2015outlier, manzoor2016fast] approaches, often offer poor performance when applied directly to images, due to the high dimensionality of this type of data, and usually have to rely on the extraction of lower-dimensional handcrafted or learned features. Recent approaches directly adopt deep models to contextually map and model the data in a feature space. In this context, deep generative models for images based on Autoencoders (AEs) [bergmann2018improving, wang2020advae] or Generative Adversarial Networks (GANs) [schlegl2017unsupervised, schlegl2019f, akcay2018ganomaly] proved to be effective in anomaly detection. Most approaches in this category are reconstruction-based: starting from a given sample, they reconstruct the nearest sample on the normal data manifold learned by the model and measure the deviation in a predefined space (e.g., pixel, latent, or combinations) to assess an anomaly. Autoencoder-based approaches implement this strategy in a straightforward way: the encoder projects the given sample onto the normal data manifold, while the decoder performs the reconstruction. However, generative models based on autoencoders like VAEs are known to produce blurry reconstructions of photorealistic images [dumoulin2017adversarially] and are often outperformed by GANs; thus, in this work, we focus our attention on GAN-based approaches for one-class anomaly detection.

[Figure 1: Query, reconstruction, and absolute difference for a 'Grid' texture sample and a 'Screw' object sample.]
Input reconstruction in GAN-based approaches poses a challenge: in the standard formulation, there is no encoder that projects a sample into the latent space, and an expensive latent-space optimization is often required [schlegl2017unsupervised]. Efficient approaches based on Bidirectional GANs (BiGANs) solve this problem by jointly learning an encoder that complements the decoder (i.e., the generator) and provides the needed projection (e.g., EGBAD [zenati2018efficient]). However, reconstructed samples are often misaligned: the reconstruction goal is defined only by the discriminator, and there is no guarantee that encoding and then decoding a sample will reconstruct it precisely (see Figure 1).
In this work, we tackle the aforementioned problems and propose a novel method for anomaly detection in images, in which we introduce a consistency constraint as a regularization term on both the encoding and decoding parts of a BiGAN. In the remainder of the paper, we call our model CBiGAN. Our new formulation improves the reconstruction ability with respect to a BiGAN. It also generalizes both EGBAD [zenati2018efficient] and AEs by combining the modelling power of the former and the reconstruction consistency of the latter. We evaluate the proposed method on MVTec AD — a real-world benchmark for unsupervised anomaly detection on high-resolution images — and compare against standard baselines and state-of-the-art approaches. Our results show that the proposed method solves the misalignment problems commonly occurring in GAN formulations, improving the performance of BiGANs on complex objects by a large margin. Moreover, our proposal performs comparably to expensive state-of-the-art iterative methods while reducing computational cost thanks to a single-pass evaluation strategy. We observe that our model is particularly effective on texture-type anomaly detection, where it sets a new state of the art, outperforming even models that use additional data.
II Related Work
Most recent approaches to anomaly detection on images adopt reconstruction-based techniques relying on deep generative models. Autoencoders (standard [zhou2017anomaly, chen2017outlier, bergmann2018improving] and variational [dehaene2020iterative, wang2020advae]) and generative adversarial networks [schlegl2017unsupervised, akcay2018ganomaly, zenati2018efficient] comprise the most commonly adopted techniques: the former are trained to minimize a reconstruction loss, usually in pixel space, while the latter focus on generating samples indistinguishable from normal data, indirectly leading to reconstruction. Anomaly detection is then implemented by defining a score based on the reconstruction error and additional metrics, such as feature matching in a latent or intermediate space. Concerning AE techniques, bergmann2018improving point out that dependencies between pixels are not properly modelled in autoencoders and show that vanilla AEs using a structural similarity index metric (SSIM) loss outperform even complex architectures (e.g., AEs with feature matching and VAEs) that rely on L2 pixel-wise losses. In [pawlowski2018unsupervised], the authors instead propose Bayesian convolutional autoencoders: they show that sampling in variational models smooths noisy reconstruction errors compared to other AE architectures on medical data. zimmerer2019unsupervised suggest that the gradient of the KL loss of a VAE w.r.t. the input image provides useful information on the normality of data and adopt it for anomaly localization. Building on this concept, dehaene2020iterative propose to iteratively integrate these gradients to perform reconstruction, moving samples towards the normality manifold in pixel space in a manner similar to adversarial example crafting. golan2018deep set up a self-supervised multi-classification pretext task of geometric transformation recognition and use the softmax output on transformed samples as a feature to characterize abnormality. Similarly, huang2019inverse propose to train a deep architecture to invert geometric transformations of the input and use the reconstruction error as the anomaly score for the input image.
In our work, we build upon GAN-based anomaly detection — techniques that exploit GANs to learn the anomaly-free distribution of data, assuming a uniformly or normally distributed latent space explaining it. We focus on the scenario in which the training dataset is assumed to be mostly anomaly-free, and we refer the interested reader to berg2019unsupervised for an analysis of GAN-based detectors when this assumption does not hold. Among the seminal works in this category, schlegl2017unsupervised proposed AnoGAN: the generator is used to perform reconstruction, and the discriminator to extract features of the original and reconstructed samples; both are then adopted to define an anomaly score. The major drawback of this approach is the need for a computationally expensive latent-space optimization using backpropagation to find the latent representation that encodes the given sample. Recent works solve this problem by introducing an encoding module that learns to map samples from the input to the latent space. schlegl2019f propose f-AnoGAN — an enhanced AnoGAN with a learned encoding module that adopts the more stable Wasserstein GAN formulation. akcay2018ganomaly propose an adversarially trained encoder-decoder-encoder architecture named GANomaly and define their anomaly score as the L1 error in the latent space between the original and reconstructed sample. zenati2018efficient propose EGBAD (Efficient GAN Based Anomaly Detection), which directly adopts a Bidirectional GAN (BiGAN [donahue2017adversarial]) — an improved GAN formulation that learns the joint distribution of the latent and input spaces — and uses a combination of reconstruction loss and discriminator loss as the anomaly score. In a similar vein, several works propose different architectures that implement some form of adversarial training [sabokrou2018avid, venkataramanan2019attention, wang2020advae, zenati2018adversarially]. sabokrou2018avid propose a self-supervised approach for anomaly localization that adopts an adversarially trained fully convolutional network plus a discriminator to detect anomalous pixels. venkataramanan2019attention propose an adversarially trained variational autoencoder combined with a specialized attention regularization that encourages the network to model all the parts of the normal input and thus perform precise anomaly localization.

III Background
III-A Generative Adversarial Networks
In its basic formulation, a Generative Adversarial Network (GAN [goodfellow2014generative]) is comprised of a generator G that generates data G(z) starting from a latent variable z ~ p_Z and a discriminator D that discerns whether its input is generated by G or comes from the real data distribution p_X. G and D are adversarially trained to compete — G is optimized to fool D, while D is optimized to tell apart data generated by G from real data. In a game-theoretic framework, G and D play the following two-player minimax game

\min_G \max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]   (1)

where D(x) indicates how real the discriminator thinks its input is. At the Nash equilibrium of the game, D is not able to discern fake samples from real ones, and thus G is producing samples from p_X.
III-B Wasserstein GAN
Reaching the Nash equilibrium is known to be hard due to training instabilities [salimans2016improved]. The Wasserstein GAN (WGAN [arjovsky2017wasserstein]) facilitates GAN training by measuring the distance between the real and fake data distributions using the Wasserstein distance, which assures more stable gradients. In this formulation, the discriminator is replaced by a scalar helper function C used to compute the Wasserstein distance, and the minimax game becomes

\min_G \max_C \; \mathbb{E}_{x \sim p_X}[C(x)] - \mathbb{E}_{z \sim p_Z}[C(G(z))]   (2)

The function of C is shifted from a discriminating classifier to a critic that produces authenticity scores, tending to give high scores to real samples and low ones to fake data. To ensure the Lipschitz continuity of C, which is a prerequisite for obtaining Equation 2, we regularize the norm of the gradient of C with respect to its inputs when optimizing it, as in [gulrajani2017improved].
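As a toy illustration (not the paper's implementation), the critic objective with gradient penalty can be sketched for a linear critic, whose input gradient is available in closed form; the penalty weight of 10 is the common choice from [gulrajani2017improved] and an assumption here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic C(x) = w . x: its gradient w.r.t. any input is w,
# so the gradient penalty can be computed without automatic differentiation.
w = rng.normal(size=4)

def critic(x):
    return x @ w

real = rng.normal(loc=1.0, size=(8, 4))   # stand-in for real samples
fake = rng.normal(loc=-1.0, size=(8, 4))  # stand-in for generated samples

# The critic maximizes E[C(real)] - E[C(fake)]; as a loss to minimize:
wasserstein_loss = critic(fake).mean() - critic(real).mean()

# Gradient penalty: push the gradient norm (on interpolates) towards 1.
# For a linear critic the gradient everywhere is w, hence:
grad_norm = np.linalg.norm(w)
penalty = (grad_norm - 1.0) ** 2

lambda_gp = 10.0
critic_loss = wasserstein_loss + lambda_gp * penalty
```

In a real model the gradient is taken at random interpolations of real and fake samples via automatic differentiation; the closed-form gradient here only serves to show how the penalty enters the loss.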
III-C BiGAN
Bidirectional GANs (BiGANs [donahue2017adversarial]) improve the modeling of the latent space by exposing it to the discriminator together with images generated from it. An encoder module E is introduced to map real samples x to their corresponding latent representations E(x) and is trained together with G. The discriminator D is trained to discern whether a couple comes from the encoder, (x, E(x)), or from the generator, (G(z), z). The minimax problem for the BiGAN is

\min_{G,E} \max_D \; \mathbb{E}_{x \sim p_X}[\log D(x, E(x))] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z), z))]   (3)

Thus, fooling D leads E and G to minimize the difference between (x, E(x)) and (G(z), z) couples.
IV Method
We tackle anomaly detection as a one-class classification problem — we assume a training dataset X (of images, in our case) comprised of only non-anomalous samples. Given a test sample, we want to label it as normal/non-anomalous/defect-free (which we consider the negative class) or anomalous (positive). We rely on a GAN-based model to capture the distribution of normal data conditioned on the latent space Z. Similarly to [zenati2018efficient], we adopt a BiGAN as the generative model, but we instantiate it with the Wasserstein distance formulation; its minimax problem is defined as

\min_{E,G} \max_C \; \mathbb{E}_{x \sim p_X}[C(x, E(x))] - \mathbb{E}_{z \sim p_Z}[C(G(z), z)]   (4)

where G is the generator producing fake images from latent representations, E is the encoder projecting an image into its corresponding latent representation, and C is the discriminator/critic that scores (image, latent) couples and implements the Wasserstein distance computation.
All three modules are defined as deep neural networks and are optimized alternately (once E and G, once C) with minibatch gradient descent. Given a minibatch of m elements comprised of real samples x_i and randomly sampled latent representations z_i, the losses for the players (E, G) and C are defined as

\mathcal{L}_{EG} = \frac{1}{m} \sum_{i=1}^{m} \left[ C(x_i, E(x_i)) - C(G(z_i), z_i) \right]   (5)

and

\mathcal{L}_{C} = \frac{1}{m} \sum_{i=1}^{m} \left[ C(G(z_i), z_i) - C(x_i, E(x_i)) \right]   (6)
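Ignoring the gradient-penalty term, the two minibatch losses can be sketched on toy critic scores; note that they are exact negatives of each other, reflecting the adversarial game (a sketch with made-up scores, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 16  # minibatch size

# Toy critic scores for the two kinds of couples.
c_real = rng.normal(loc=0.5, size=m)   # C(x_i, E(x_i)) for encoder couples
c_fake = rng.normal(loc=-0.5, size=m)  # C(G(z_i), z_i) for generator couples

# Loss for E and G (Eq. 5): lower the scores of encoder couples,
# raise those of generator couples.
loss_eg = (c_real - c_fake).mean()

# Loss for C (Eq. 6, without the gradient penalty): the opposite objective.
loss_c = (c_fake - c_real).mean()
```

Alternating the two updates implements the minimax game of Equation 4 in practice.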
Once trained on normal data, the anomaly detection procedure is reconstruction-based: given a test sample x, we compute E(x) to find its closest latent representation on the normal data manifold learned by the model and then compute G(E(x)) to build its reconstruction. Following the AnoGAN approach [schlegl2017unsupervised], we define the anomaly score A(x) as a linear combination of two terms: a) the pixel-based L1 reconstruction error L_G(x), and b) the feature-based discriminator error L_D(x). Formally,

A(x) = (1 - \lambda) \, L_G(x) + \lambda \, L_D(x)   (7)

with

L_G(x) = \lVert x - G(E(x)) \rVert_1   (8)

L_D(x) = \lVert f(x) - f(G(E(x))) \rVert_1   (9)

where ||.||_1 is the L1 norm, f(.) is the feature vector extracted from an intermediate output of the discriminator C, and lambda is a balancing hyperparameter whose value is commonly domain-specific. Intuitively, for an anomalous image both the reconstruction error and the distance between discriminator features increase, as the encoder usually maps the input to an out-of-distribution latent representation; therefore, the generator fails to reconstruct the input, and the discriminator extracts different representations for the original and reconstructed inputs. Unfortunately for anomaly detection, the BiGAN training procedure does not put any constraint on G and E being aligned (i.e., G(E(x)) ≈ x and E(G(z)) ≈ z are not guaranteed), and this can lead to errors and misalignments in reconstructed samples of normal data and thus to a high false-positive rate. Figure 1 shows an example of this phenomenon: the reconstructed sample presents a slight rotation with respect to the input sample, and this results in an erroneously high anomaly score. To cope with this problem, we add a cycle-consistency regularization term to both E and G to promote their alignment. Formally,
\mathcal{L}_{CON} = \mathcal{L}_x + \mathcal{L}_z   (10)

where

\mathcal{L}_x = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - G(E(x_i)) \rVert_1   (11)

\mathcal{L}_z = \frac{1}{m} \sum_{i=1}^{m} \lVert z_i - E(G(z_i)) \rVert_1   (12)

The new loss for E and G is a linear combination of \mathcal{L}_{EG} and \mathcal{L}_{CON}

\mathcal{L}'_{EG} = (1 - \alpha) \, \mathcal{L}_{EG} + \alpha \, \mathcal{L}_{CON}   (13)
where alpha controls the weight of each contribution. We refer to the model trained with this formulation as Consistency BiGAN (CBiGAN), whose architecture is depicted in Figure 2. Note that setting alpha = 0 we obtain the standard BiGAN approach (EGBAD [zenati2018efficient]), while with alpha = 1 our model collapses into a non-adversarially-trained double autoencoder (on both the latent and input spaces).
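A minimal NumPy sketch of the scoring and regularization mechanics follows; the dummy encoder/generator/feature extractor, the 8x8 "images", the 16-dimensional latent space, and the value lam = 0.1 are illustrative assumptions, not the paper's architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.normal(size=(64, 16))  # fixed weights for the toy feature extractor

def encoder(x):    # E: 8x8 image -> 16-d latent (lossy: keeps 16 of 64 pixels)
    return x.reshape(x.shape[0], -1)[:, :16] * 0.1

def generator(z):  # G: 16-d latent -> 8x8 image (exact inverse of E on Z)
    img = np.zeros((z.shape[0], 64))
    img[:, :16] = z * 10.0
    return img.reshape(z.shape[0], 8, 8)

def features(x):   # f: stand-in for intermediate critic features
    return x.reshape(x.shape[0], -1) @ W_f / 8.0

def anomaly_score(x, lam=0.1):
    """A(x) = (1 - lam) * L_G(x) + lam * L_D(x), cf. Eqs. (7)-(9)."""
    x_rec = generator(encoder(x))
    l_g = np.abs(x - x_rec).sum(axis=(1, 2))                 # pixel-space L1
    l_d = np.abs(features(x) - features(x_rec)).sum(axis=1)  # feature-space L1
    return (1 - lam) * l_g + lam * l_d

def consistency_loss(x, z):
    """L_CON = L_x + L_z, cf. Eqs. (10)-(12)."""
    l_x = np.abs(x - generator(encoder(x))).mean()
    l_z = np.abs(z - encoder(generator(z))).mean()
    return l_x + l_z

x = rng.normal(size=(4, 8, 8))  # batch of toy test "images"
z = rng.normal(size=(4, 16))    # batch of sampled latents
scores = anomaly_score(x)
l_con = consistency_loss(x, z)
```

Since the toy E inverts G exactly, the latent cycle error L_z vanishes, while the pixel cycle error stays non-zero because E discards information; in CBiGAN, minimizing L_CON alongside L_EG pushes both cycles towards identity.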
V Evaluation
V-A Dataset
We tested and compared our approach on MVTec AD [bergmann2019mvtec] — a recent real-world benchmark specifically tailored to unsupervised anomaly detection that is increasingly adopted in the recent literature on this topic [venkataramanan2019attention, huang2019inverse, liu2019towards]. It is comprised of over 5,000 high-resolution industrial images divided into 10 object and 5 texture categories; for each category, MVTec AD provides a training set of non-anomalous (defect-free) images and a labelled test set containing both anomalous and normal images. Details of MVTec AD are reported in Table I, and examples (together with the reconstructions performed by our model) are depicted in Figure 4.
TABLE I: Details of MVTec AD (N = normal/defect-free samples, P = anomalous samples; image side in pixels).

Category  Train (N)  Test (N)  Test (P)  Image Side
Textures
Carpet  280  28  89  1024
Grid  264  21  57  1024
Leather  245  32  92  1024
Tile  230  33  84  840
Wood  247  19  60  1024
Objects
Bottle  209  20  63  900
Cable  224  58  92  1024
Capsule  219  23  109  1000
Hazelnut  391  40  70  1024
Metal nut  220  22  93  700
Pill  267  26  141  800
Screw  320  41  119  1024
Toothbrush  60  12  30  1024
Transistor  213  60  40  1024
Zipper  240  32  119  1024
V-B Implementation Details
We follow the evaluation procedure commonly adopted in related work [schlegl2017unsupervised] in the unsupervised setting. We implemented E, G, and C as standard residual convolutional networks; architectural details are reported in Figure 3. We train CBiGAN separately for each category using the corresponding anomaly-free training set. Different preprocessing is adopted depending on the type of object considered:

- for 'Object' categories, we resize the input images to 128x128, and we apply data augmentation in the form of random rotation whenever the orientation of the object is independent of its abnormality — i.e., for the Bottle, Hazelnut, Metal nut, and Screw categories;

- for 'Texture' categories, we resize the input images to 512x512 and process them in 64x64 patches: when training, we randomly crop 64x64 patches from the input image and apply a random clockwise rotation; during testing, we divide the input image into 64x64 patches, derive a local anomaly score for each patch, and obtain a global score by picking the maximum local score.
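The patch-based testing strategy for textures can be sketched as follows; local_score stands in for the per-patch anomaly score A(x) of Equation 7, and the toy score used here is an illustrative assumption:

```python
import numpy as np

def global_anomaly_score(image, patch_size, local_score):
    """Divide a (H, W) image into non-overlapping patches, score each
    patch, and return the maximum local score as the global score."""
    h, w = image.shape
    scores = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            scores.append(local_score(image[i:i + patch_size, j:j + patch_size]))
    return max(scores)

# Toy local score: mean absolute intensity of the patch (a stand-in for A(x)).
toy_score = lambda p: float(np.abs(p).mean())

img = np.zeros((512, 512))
img[100:120, 300:330] = 1.0  # a small synthetic "defect"
score = global_anomaly_score(img, 64, toy_score)  # driven by the defect patch
```

Taking the maximum over patches makes the global score sensitive to small, localized defects that would be averaged away by a whole-image score.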
For all models and categories, we use a 64-dimensional latent space sampled from a normal distribution. We tuned the weight alpha of the consistency loss experimentally; good values differ between 'Objects' and 'Textures', depending on the importance of pixel-level details in detecting an anomaly. Note that small values of alpha are needed to balance the usually large values of the consistency loss term, which depend on the input and latent dimensions. We also produce the results for EGBAD [zenati2018efficient] by setting alpha = 0. All models are trained using the Adam optimizer with a fixed learning rate. Following [schlegl2017unsupervised], we set the value of lambda used when computing the anomaly score of testing images.
V-C Evaluation Metrics
To quantitatively evaluate the quality of the tested approaches, for each category we compute:

- the maximum Balanced Accuracy obtained when varying the threshold on the anomaly score; this is often reported by previous work [bergmann2019mvtec, venkataramanan2019attention] and referred to as "the mean of the ratio of correctly classified samples of anomaly-free (TNR) and anomalous images (TPR)". This choice of threshold is equivalent to the maximization of Youden's index J = TPR + TNR - 1 [sokolova2006beyond];

- the Area Under the ROC Curve (auROC), commonly adopted as a threshold-independent quality metric for classification.

We also report per-type and overall means of the above metrics as unique indicators of the quality of the compared methods.
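As an illustration of the first metric, the maximum balanced accuracy can be computed by sweeping thresholds on the anomaly score (equivalently, maximizing Youden's J); this sketch assumes labels with 1 = anomalous and 0 = normal:

```python
import numpy as np

def max_balanced_accuracy(scores, labels):
    """Return the best (TPR + TNR) / 2 over all thresholds on the score;
    a sample is predicted anomalous when its score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = max((labels == 1).sum(), 1)
    n_neg = max((labels == 0).sum(), 1)
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / n_pos   # recall on anomalies
        tnr = (~pred & (labels == 0)).sum() / n_neg  # recall on normal data
        best = max(best, (tpr + tnr) / 2)
    return best

scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
bacc = max_balanced_accuracy(scores, labels)
```

Only the observed scores need to be tried as thresholds, since balanced accuracy changes only when the threshold crosses a sample's score.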
V-D Comparison with the State of the Art
We compare our approach with several methods tackling one-class anomaly detection and obtaining state-of-the-art results on MVTec AD. We divide the methods into:

- iterative methods, which require an iterative optimization for each sample on which anomaly detection is performed; these include AnoGAN [schlegl2017unsupervised] and VAE-grad [dehaene2020iterative];

- single-pass methods, which perform anomaly detection in the testing phase with a single forward evaluation of the model; these include AE (SSIM) and AE (L2) [bergmann2018improving], AVID [sabokrou2018avid], LSA [abati2019latent], EGBAD [zenati2018efficient], GeoTrans [golan2018deep], GANomaly [akcay2018ganomaly], ITAE [huang2019inverse], and our CBiGAN.

In the latter group, we also include CAVGA [venkataramanan2019attention], even if the authors report results using models trained with additional data that are thus not directly comparable with the other methods. For EGBAD, results come from our implementation, as it is generalized by CBiGAN and can be obtained with alpha = 0.
V-E Results
Tables II and III report results for each compared method and for each category of MVTec AD. Compared to EGBAD, the approach we extend, we observe that the consistency constraint added in our model consistently improves the detection of anomalies in all categories (with Zipper being the only exception) by up to 49%, achieving a +15% improvement in overall balanced accuracy; the same conclusion can be drawn from the auROC metric. These performance gains are achieved while maintaining the exact same computational cost as EGBAD during prediction (a single forward pass of E and G) and avoiding expensive iterative procedures. The only additional cost with respect to the BiGAN approach is the computation of the consistency loss during the offline training phase.
Figure 5 reports some relevant examples of the differences between EGBAD and CBiGAN. Note that the higher quality of our reconstructions is due to the recovered alignment and the higher fidelity of background colors, which overall induce smaller differences in non-anomalous areas. The Zipper class is challenging for both EGBAD and our model: we deem that the anomalous parts of the image (usually dents) are small with respect to other sources of variability (the border of the zipper) that need to be modeled (see Figure 5, last row); thus, both models tend to have a high reconstruction loss for both normal and anomalous samples.
Both metrics also show that CBiGAN improves on all the compared methods when dealing with texture anomalies, reaching a mean balanced accuracy of 0.84 and a mean auROC of 0.85 and outperforming even methods that use additional data. When all categories are considered, our method performs comparably to both other single-pass and iterative methods in terms of overall balanced accuracy, obtaining +3% with respect to the second-best single-pass methods (AVID, LSA) and -1% with respect to the best iterative method (VAE-grad).
Experiments on objects display a more complex situation: among single-pass methods, an absolute best does not emerge, and different categories benefit from the specific peculiarities of each method. CBiGAN does improve on the vanilla BiGAN (EGBAD) on objects, but overall our method offers an average performance comparable to, or slightly worse (by 2-3%) than, other single-pass methods. Note that the performance of the vanilla L2 autoencoder suggests that tuning the hyperparameter alpha for each particular object category could further increase the performance of CBiGAN (e.g., Zipper could benefit from a higher alpha), but for the sake of simplicity, we refrain from exploring class-specific parameter values in this work and prefer presenting results for fixed, reasonable values. Among the few methods reporting the auROC on MVTec AD data, our model still achieves state-of-the-art performance on textures and offers better or comparable performance with respect to other methods on objects, with the only exception being ITAE [huang2019inverse] — a data-augmentation-based denoising autoencoder. At the time of writing, we abstain from discussing the performance of ITAE due to reproducibility issues.
[Figure 4: Query, reconstruction, and absolute difference produced by our model for a 'Grid' texture sample and a 'Toothbrush' object sample.]
TABLE II: Maximum balanced accuracy on MVTec AD.

Method  Carpet  Grid  Leather  Tile  Wood  Textures Mean  Bottle  Cable  Capsule  Hazelnut  Metal Nut  Pill  Screw  Toothbrush  Transistor  Zipper  Objects Mean  Overall Mean
Iterative methods  
AnoGAN [schlegl2017unsupervised]  0.49  0.51  0.52  0.51  0.68  0.54  0.69  0.53  0.58  0.50  0.50  0.62  0.35  0.57  0.67  0.59  0.56  0.55 
VAEgrad [dehaene2020iterative]  0.67  0.83  0.71  0.81  0.89  0.78  0.86  0.56  0.86  0.74  0.78  0.80  0.71  0.89  0.70  0.67  0.76  0.77 
Single-pass methods  
AE (SSIM) [bergmann2018improving]  0.67  0.69  0.46  0.52  0.83  0.63  0.88  0.61  0.61  0.54  0.54  0.60  0.51  0.74  0.52  0.80  0.64  0.63 
AE (L2) [bergmann2018improving]  0.50  0.78  0.44  0.77  0.74  0.65  0.80  0.56  0.62  0.88  0.73  0.62  0.69  0.98  0.71  0.80  0.74  0.71 
AVID [sabokrou2018avid]  0.70  0.59  0.58  0.66  0.83  0.67  0.88  0.64  0.85  0.86  0.63  0.86  0.66  0.73  0.58  0.84  0.75  0.73 
LSA [abati2019latent]  0.74  0.54  0.70  0.70  0.75  0.69  0.86  0.61  0.71  0.80  0.67  0.85  0.75  0.89  0.50  0.88  0.75  0.73 
EGBAD [zenati2018efficient]  0.60  0.50  0.65  0.73  0.80  0.66  0.68  0.66  0.55  0.50  0.55  0.63  0.50  0.48  0.68  0.59  0.58  0.61 
CBiGAN (ours)  0.60  0.99  0.87  0.84  0.88  0.84  0.84  0.73  0.58  0.75  0.67  0.76  0.67  0.97  0.74  0.55  0.73  0.76 
Methods using additional data  
CAVGA-D [venkataramanan2019attention]  0.73  0.75  0.71  0.70  0.85  0.75  0.89  0.63  0.83  0.84  0.67  0.88  0.77  0.91  0.73  0.87  0.80  0.78 
CAVGA-R [venkataramanan2019attention]  0.78  0.78  0.75  0.72  0.88  0.78  0.91  0.67  0.87  0.87  0.71  0.91  0.78  0.97  0.75  0.94  0.84  0.82 
TABLE III: auROC on MVTec AD.

Method  Carpet  Grid  Leather  Tile  Wood  Textures Mean  Bottle  Cable  Capsule  Hazelnut  Metal Nut  Pill  Screw  Toothbrush  Transistor  Zipper  Objects Mean  Overall Mean
AE  0.64  0.83  0.80  0.74  0.97  0.80  0.65  0.64  0.62  0.73  0.64  0.77  1.00  0.77  0.65  0.87  0.74  0.75 
GeoTrans [golan2018deep]  0.44  0.62  0.84  0.42  0.61  0.59  0.74  0.78  0.67  0.36  0.81  0.63  0.50  0.97  0.87  0.82  0.71  0.67 
GANomaly [akcay2018ganomaly]  0.70  0.71  0.84  0.79  0.83  0.77  0.89  0.76  0.73  0.79  0.70  0.74  0.75  0.65  0.79  0.75  0.76  0.76 
ITAE [huang2019inverse]  0.71  0.88  0.86  0.74  0.92  0.82  0.94  0.83  0.68  0.86  0.67  0.79  1.00  1.00  0.84  0.88  0.85  0.84 
EGBAD [zenati2018efficient]  0.52  0.54  0.55  0.79  0.91  0.66  0.63  0.68  0.52  0.43  0.47  0.57  0.46  0.64  0.73  0.58  0.57  0.60 
CBiGAN (ours)  0.55  0.99  0.83  0.91  0.95  0.85  0.87  0.81  0.56  0.77  0.63  0.81  0.58  0.94  0.77  0.53  0.73  0.77 
[Figure 5: Query images with reconstructions and absolute differences for EGBAD [zenati2018efficient] and CBiGAN (ours).]
VI Conclusion
We tackled one-class anomaly detection in images using deep generative models and reconstruction-based approaches. We proposed CBiGAN — an improved Bidirectional GAN model with a consistency regularization on both the encoder and decoder modules. Our model generalizes and combines BiGANs and Autoencoders to retain the modelling power of the former and the reconstruction accuracy of the latter. We evaluated our proposal on MVTec AD — a real-world benchmark for unsupervised visual anomaly detection with a focus on industrial applications. The results of our experiments show that our proposal greatly improves the reconstruction ability (and thus the performance) of vanilla bidirectional GANs on both texture and object categories while maintaining their efficiency at test time. CBiGAN is particularly effective on texture-type images, where it sets the new state of the art, obtaining the best mean accuracy and auROC among competing methods, including expensive iterative ones and approaches using additional data. Concerning object-type images, we observed that no particular method prevails over the others, as each object category comes with different peculiarities. In this context, our model provides an average performance comparable with other efficient (single-pass) methods. In future work, we plan to evaluate our model on the task of anomaly localization and to gain insight into the effect of the consistency regularization in that context.
Acknowledgment
This work was partially funded by “Automatic Data and documents Analysis to enhance humanbased processes” (ADA, CUP CIPE D55F17000290009) and the AI4EU project (funded by the EC, H2020  Contract n. 825619). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a Jetson TX2 used for this research.