Codebase for Unsupervised Anomaly Detection with Adversarial Mirrored AutoEncoders paper
Use of deep generative models for unsupervised anomaly detection has shown great promise, partially owing to their ability to learn proper representations of complex input data distributions. Current methods, however, lack a strong latent representation of the data, thereby resulting in sub-optimal unsupervised anomaly detection results. In this work, we propose a novel representation learning technique using deep autoencoders to tackle the problem of unsupervised anomaly detection. Our approach replaces the L_p reconstruction loss in the autoencoder optimization objective with a novel adversarial loss to enforce semantic-level reconstruction. In addition, we propose a novel simplex interpolation loss to improve the structure of the latent space representation in the autoencoder. Our technique improves the state-of-the-art unsupervised anomaly detection performance by a large margin on several image datasets including MNIST, Fashion-MNIST, CIFAR and Coil-100, as well as on several non-image datasets including KDD99, Arrhythmia and Thyroid. For example, on the CIFAR-10 dataset, using a standard leave-one-out evaluation protocol, our method achieves a substantial performance gain of 0.23 AUC points compared to the state-of-the-art.
Data distributions encountered in different applications are typically noisy and may contain out-of-distribution samples, also called outliers or anomalies. Detecting these anomalous patterns is crucial in many applications. In medical imaging, detecting anomalous patterns in X-ray and MRI scans could help doctors diagnose patients more effectively. Detecting anomalous behaviour in credit card usage patterns helps banks identify fraudulent users. Detecting anomalous objects such as guns in baggage scans can identify hazardous materials in airport screening systems. Anomaly detection systems are also used as a pre-processing step in many machine learning pipelines. For instance, a system for classifying cats vs. dogs could assign high probability scores to car samples, as it is required to predict car samples as one of the two classes. An anomaly detection step can weed out such anomalous samples before the data is passed to the recognition system.
Anomaly detection is a long-standing problem in machine learning and computer vision [7, 33]. The problem is typically addressed in a supervised, semi-supervised or unsupervised framework. In supervised and semi-supervised anomaly detection, access to a few (or many) labeled anomalous samples is assumed, and the task is typically solved as a supervised learning problem. Unsupervised anomaly detection, on the other hand, is a much harder problem, as anomalous samples are not available at training time. Instead, we are given an input dataset, with the goal of detecting any out-of-distribution sample that does not belong to the provided input dataset. The absence of out-of-distribution samples during training makes the unsupervised anomaly detection problem a challenging one.
The unsupervised setting in anomaly detection is important to address, primarily for the following reasons. In many applications, like detecting fraudulent credit card users, the number of labeled anomalous samples can be much smaller than the number of normal samples. Supervised classification in this case typically leads to over-fitting. In other applications, such as medical imaging, annotating data can be expensive. Even in cases where a few types of anomalous samples are labeled (e.g., broken arms in X-ray images), supervised models can achieve high performance only in detecting the same type of anomaly; if a new type of anomaly is presented at test time, these models fail to generalize. By addressing anomaly detection in the unsupervised regime, we focus on detecting any class of out-of-distribution samples. Additionally, we do not rely on any labeled information, hence data scarcity is no longer an issue.
The use of deep generative models has recently received much attention for unsupervised anomaly detection. The core idea is to learn the input distribution using a generative model such as a GAN or an autoencoder, and to flag a sample as anomalous if it lies far away from the generative manifold. Since estimating the distance of a sample from a generative manifold is a hard problem, proxy measures are typically used as anomaly scores. One approach uses a linear combination of the distance between images and the distance between discriminator feature representations as the anomaly score. Other approaches learn an encoder network with an auto-encoding objective, and use the distance between the encoded feature representations of the input and the reconstructed image as the anomaly score. In another line of work, the distance of a test sample from a GAN manifold, together with the latent likelihood and an entropy term, is proven to be an estimator of the sample likelihood, which is a natural candidate for an anomaly score.
In this work, we propose a novel approach to the unsupervised anomaly detection problem by learning a powerful autoencoder model that contains two novel components. First, we introduce the Mirrored Adversarial Autoencoder, a variant of the autoencoder in which we replace the reconstruction loss with an adversarial loss on the joint distribution of the input and its mirrored reconstructed samples. As shown in Fig. 3, the autoencoder model employs a discriminator network that discriminates between (input, input) and (input, reconstruction) pairs, while the encoder-generator pair is trained using an adversarial loss derived from the discriminator network. Second, we extend the interpolation idea proposed in prior work and introduce a novel interpolation scheme for autoencoders, called Simplex Interpolation, in which we make the reconstructions corresponding to simplex interpolations of real latent samples look realistic. This is realized using an adversarial loss, where a discriminator is trained to predict simplex coefficients given the reconstructed images, and the autoencoder is trained to fool the discriminator (see Fig. 3). The proposed interpolation scheme yields a better-clustered latent representation.
The resulting autoencoder performs extremely well on the unsupervised anomaly detection task. On the CIFAR-10 dataset, using a leave-one-out evaluation protocol, our approach achieves a substantial performance gain of 0.23 AUC points over the best-performing prior approach (see Table 1), thus achieving a new state of the art for the problem. In particular, even for harder classes like Bird, on which prior approaches consistently under-perform, our approach achieves a performance gain. Our approach is versatile and can be applied to non-image datasets as well. We achieve state-of-the-art performance on three non-image datasets: KDD99, Thyroid and Arrhythmia, with a particularly notable improvement on the Thyroid dataset.
In summary, our key contributions are as follows:
We propose a novel autoencoder model, called the Mirrored Adversarial Autoencoder, in which we replace the reconstruction loss with an adversarial loss involving the joint distribution of the original image and the reconstructed one.
We propose a novel interpolation scheme, called Simplex adversarial interpolation, to obtain a richly clustered and semantically meaningful latent representation in an autoencoder.
The two schemes are used in the unsupervised anomaly detection problem, where we achieve state-of-the-art results on the CIFAR-10, KDD99, Thyroid and Arrhythmia datasets.
Traditional methods for anomaly detection have been surveyed in detail in prior work. Some techniques for unsupervised anomaly detection include using a one-class SVM to find the classification boundary of the normal data, using clustering methods to force similarity between members of the same cluster, etc. Eskin et al. project data points into a feature space and find anomalous points in the sparse regions of that feature space. However, these methods can only be used on low-dimensional data distributions and perform poorly in high-dimensional settings.
Recently, there has been much interest in using deep generative models for unsupervised anomaly detection. Approaches are based on either a GAN, an autoencoder or a variational autoencoder model. Zhou et al. build a robust denoising autoencoder model and detect anomalous samples using the reconstruction error. Zong et al. and Zhai et al. directly learn a generative model of the normal data distribution using a mixture of Gaussians.
In one of the first works to use a GAN model for anomaly detection, a GAN is trained on normal samples, and a technique for inverting images to the latent space is proposed. At test time, both normal and abnormal samples are mapped into the latent space and the generator reconstructs them. The anomaly score is calculated using a norm of the difference between the samples and their reconstructions. Subsequent works train a GAN model simultaneously with an encoder network for mapping images back into the latent space. Zenati et al. propose the ALAD model, which is a BiGAN network for anomaly detection. FGAN trains a GAN model to generate images along the boundary of the normal distribution, and directly uses the discriminator score for anomaly thresholding.
Interpolation is a way to enhance the structure of the latent space in an autoencoder. By forcing intermediate points along the interpolation to be indistinguishable from real data distribution, Berthelot et al.  find that the representation in latent space gets enhanced, leading to improved performance on downstream tasks such as supervised learning and clustering.
First, let us understand why interpolation can improve anomaly detection. Consider Figure 2(a), the t-SNE visualization of normal and anomalous latents of a vanilla autoencoder. Even though the autoencoder is trained only on normal samples, we find that the latents of anomalous samples are mixed up with those of normal samples. This results in poor anomaly detection performance. Ideally, we would like to have a loss function that separates the manifolds of normal and anomalous samples. However, the absence of anomalous samples in the training phase prohibits using such a loss term. Instead, we can perform space filling, where we force the space between normal latents to be occupied by in-distribution samples. This produces a tight clustering of the normal distribution, and anomalous samples will inevitably fall outside this cluster. Simplex interpolation is an exact realization of this space filling. By forcing reconstructions of points in the convex hulls of normal samples to look realistic, we fill the space between the latent distributions of normal samples.
Berthelot et al. investigate the use of an adversarial loss to enforce semantic consistency in image space using interpolation in latent space. First, latents corresponding to pairs of input images are generated, and a convex combination of these latents is formed and decoded. A critic network takes this decoded image as input and attempts to recover the coefficient of the convex combination. The autoencoder is then trained so that the critic fails (assigns a coefficient of 0).
Let us denote the encoder and decoder networks as $E$ and $G$ respectively. For two data points $x_1$ and $x_2$, $z_1 = E(x_1)$ and $z_2 = E(x_2)$ are their latent representations. Then, the linear interpolation of these two points can be represented as $z_\alpha = \alpha z_1 + (1 - \alpha) z_2$, where the coefficient $\alpha$ is constrained to a fixed range. $z_\alpha$ is first decoded as $\hat{x}_\alpha = G(z_\alpha)$, which is then passed to the critic network. The critic is trained to distinguish real samples from interpolated ones by predicting 0 for non-interpolated inputs, and $\alpha$ for interpolated samples. The loss that optimizes the critic can be written as:
Meanwhile, the autoencoder is trained to fool the critic into predicting 0 for interpolations.
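The two-point scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, latents are plain lists of floats, and the squared-error form of both losses is an assumption, since the text only specifies the critic's targets (0 for real inputs, the mixing coefficient for interpolated ones).

```python
def interpolate(z1, z2, alpha):
    """Convex combination alpha * z1 + (1 - alpha) * z2 of two latent vectors."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z1, z2)]


def critic_loss(pred_on_interpolated, alpha, pred_on_real):
    """Critic objective (assumed squared error): recover alpha on decoded
    interpolations and predict 0 on non-interpolated inputs."""
    return (pred_on_interpolated - alpha) ** 2 + pred_on_real ** 2


def autoencoder_fooling_loss(pred_on_interpolated):
    """Autoencoder objective (assumed squared error): push the critic's
    prediction on decoded interpolations towards 0."""
    return pred_on_interpolated ** 2


if __name__ == "__main__":
    z_alpha = interpolate([1.0, 0.0], [0.0, 1.0], alpha=0.25)
    print(z_alpha)
    print(critic_loss(pred_on_interpolated=0.25, alpha=0.25, pred_on_real=0.0))
```

In a real training loop the two predictions would come from a critic network applied to decoded images; here they are passed in as numbers so that the loss structure stays visible.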
In this section, we introduce our simplex interpolation scheme. Our method includes a number of modifications to Berthelot et al.'s interpolation method. First, we train the discriminator on the joint distribution of the training images and the decoded interpolated images, to force the autoencoder to generate semantically similar images from points that are close in latent space, rather than simply forcing all interpolated images to be indistinguishable from the training set as a whole, as in the Berthelot et al. formulation. Second, we extend line interpolation to simplex interpolation to cover more points in the latent space. This results in improved space filling.
Two-point Simplex Interpolation. In order to estimate the distance of the image generated from an interpolated point to the two non-interpolated images, we introduce a discriminator trained on the joint distribution of real and decoded interpolated images. For a given pair of training images, an interpolated image is first generated by decoding a convex combination of their latents. The discriminator is trained separately on the pairs formed by each training image and the interpolated image, to recover the distance in latent space between the encoding of each of the two training images and the interpolated point. This can be formalized as:
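The discriminator always outputs 0 for a pair of points that share the same semantics. Let $\hat{x}_\alpha$ denote the image decoded from $\alpha z_1 + (1 - \alpha) z_2$, where $z_1$ and $z_2$ are the latents of the training images $x_1$ and $x_2$. When $\alpha$ equals 1, the discriminator should output 0 for the pair $(x_1, \hat{x}_\alpha)$, since $\hat{x}_\alpha$ is decoded from $z_1$ and should have the same semantic meaning as $x_1$. On the other hand, it is supposed to output 1 for the pair $(x_2, \hat{x}_\alpha)$, since $x_2$ and $\hat{x}_\alpha$ have totally different semantic meanings.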
A general case. Sainburg et al. argue that pairwise interpolation between samples does not reach all points within the latent distribution, and will not necessarily make the latent distribution convex. Simplex interpolation can cover points that line interpolation cannot. However, the loss function defined in Berthelot et al.'s algorithm (Eq. (1)) is tailored for two-point interpolation, and adapting Eq. (1) to predict a vector of coefficients instead of the scalar coefficient did not converge. Our approach (Eq. (2)), on the other hand, can be directly extended to $n$-point simplex interpolation, since it measures how far the interpolant is from each vertex of the simplex. The equations can be written as:
Meanwhile, the autoencoder is trained to fool the discriminator into outputting 0 for interpolated points, which can be written as
where $n$ is the number of images used to interpolate ($n$ images correspond to $n$-point simplex interpolation) and $\alpha_i$ is the simplex coefficient of the $i$-th image $x_i$. Note that in Berthelot et al.'s formulation there is no coefficient term before the discriminator loss, since they only consider the distance of the decoded interpolated image to one of the original images. However, in our algorithm, the coefficient $\alpha_i$ is crucial for the following reason: if $\alpha_i$ is large, then the decoded interpolated image is closer to $x_i$. Hence, the encoder-decoder loss corresponding to $x_i$ should receive a higher weight. Similarly, if $\alpha_i$ equals 0, the decoded interpolated image has no relation to $x_i$, and therefore there is no need to force the encoder and decoder to generate an image close to $x_i$. So, we propose scaling the discriminator loss with the $\alpha_i$ term.
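The weighting argument above can be made concrete with a short sketch. Everything here is illustrative and rests on assumptions that the text does not fix: coefficients are drawn uniformly from the simplex (a flat Dirichlet, sampled by normalising exponential draws), the discriminator target for the pair formed with vertex $i$ is taken to be $1 - \alpha_i$ (consistent with the two-point case, where $\alpha = 1$ gives target 0), and squared error is used for the coefficient-weighted autoencoder loss. All function names are hypothetical.

```python
import random


def sample_simplex(n, rng):
    """Draw n non-negative coefficients that sum to 1 (assumed flat Dirichlet)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def simplex_interpolate(latents, coeffs):
    """Convex combination sum_i coeffs[i] * latents[i] of n latent vectors."""
    dim = len(latents[0])
    return [sum(c * z[k] for c, z in zip(coeffs, latents)) for k in range(dim)]


def discriminator_targets(coeffs):
    """Assumed per-vertex target: 0 when the interpolant coincides with the
    vertex (coefficient 1), 1 when it is unrelated to it (coefficient 0)."""
    return [1.0 - c for c in coeffs]


def weighted_fooling_loss(preds, coeffs):
    """Autoencoder loss: push each per-vertex prediction to 0, scaled by that
    vertex's coefficient so unrelated vertices contribute nothing."""
    return sum(c * p ** 2 for c, p in zip(coeffs, preds))


if __name__ == "__main__":
    rng = random.Random(0)
    alphas = sample_simplex(3, rng)
    z = simplex_interpolate([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], alphas)
    print(alphas, z, discriminator_targets(alphas))
```

With coefficients `[1.0, 0.0]` the weighted loss ignores the second vertex entirely, which is the behaviour the scaling term is meant to produce.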
|Fence GAN ||0.67||0.71||0.68||0.75||0.66||0.79||0.75||0.51||0.52||0.73||0.68|
For autoencoder training, a pixel-level $L_p$ reconstruction loss between the original image and its reconstruction is typically used. We propose to replace such pixel-level losses with a semantic-level reconstruction loss that is suited for unsupervised anomaly detection.
Pixel-level reconstruction losses typically result in blurry reconstructions. Moreover, using them as an anomaly score provides poor estimates, as pixel-level distances do not measure the semantic similarity between images. Additionally, a high reconstruction loss between the input and the decoded image can be an outcome of poor reconstruction quality rather than of the image being an outlier, and hence results in poor anomaly scores. Our proposal is to replace the reconstruction loss with a novel adversarial loss, which is motivated by two goals: (1) to improve the quality of reconstructions, and (2) to use the discriminator to obtain a semantically meaningful anomaly score.
We use a discriminator to measure the Wasserstein distance between the joint distributions of (input, input) pairs $(x, x)$ and (input, reconstruction) pairs $(x, \hat{x})$. This approach differs from conventional Wasserstein GAN-based architectures in that the distance between the joint distributions of images and reconstructed images is minimized, instead of the distance between the marginal distributions. The reason for using such a discriminator is as follows. When training autoencoders, we are required to reconstruct a sample that looks similar to the input sample. Just minimizing the Wasserstein distance between the marginals of real and generated samples might result in a situation where the input and the generated sample both belong to the same distribution, yet are semantically different. For example, a cat image in the CIFAR dataset can be reconstructed as an airplane. This will still be a feasible solution since both airplane and cat belong to the same input distribution, hence the Wasserstein distance will be small.
To resolve this issue, we perform Wasserstein minimization between the joint distributions of $(x, x)$ and $(x, \hat{x})$ pairs, where $\hat{x}$ is the reconstruction of the input $x$. The discriminator now takes in pairs of images, $(x, x)$ and $(x, \hat{x})$. This avoids the problem discussed above, as the distribution of $(x, x)$ pairs always contains pairs of samples that look alike. If a car image is reconstructed as an airplane, the generated distribution will contain a (car, airplane) sample, which is never found in the distribution of $(x, x)$ pairs. Hence, the model will always generate samples sharing the same semantics as the input. We would like to point out that the formulation presented here is equivalent to matching the corresponding conditional distributions given the input. This model also shares similarities with the discriminator architectures used in conditional image-to-image translation, such as Pix2Pix.
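The pairing idea can be illustrated with a toy example. The sketch below is not the authors' model: the critic is a hypothetical hand-written function (negative L1 distance between the two elements of a pair) standing in for a learned network, the Lipschitz constraint of a Wasserstein critic is omitted, and images are plain lists of floats. It only shows why a semantically wrong reconstruction is exposed once the critic sees pairs rather than single images.

```python
def mirrored_pairs(inputs, reconstructions):
    """Build the (x, x) 'real' pairs and the (x, x_hat) 'fake' pairs."""
    real = [(x, x) for x in inputs]
    fake = [(x, r) for x, r in zip(inputs, reconstructions)]
    return real, fake


def critic_gap(critic, real_pairs, fake_pairs):
    """Mean critic score on real pairs minus mean score on fake pairs.
    The critic would be trained to increase this gap and the autoencoder
    to decrease it."""
    real_score = sum(critic(a, b) for a, b in real_pairs) / len(real_pairs)
    fake_score = sum(critic(a, b) for a, b in fake_pairs) / len(fake_pairs)
    return real_score - fake_score


def toy_critic(a, b):
    """Hypothetical stand-in critic: higher score for more similar pairs."""
    return -sum(abs(u - v) for u, v in zip(a, b))


if __name__ == "__main__":
    car = [0.0, 1.0]
    airplane = [1.0, 0.0]
    # A 'car' reconstructed as an 'airplane' yields a mismatched fake pair.
    real, fake = mirrored_pairs([car], [airplane])
    print(critic_gap(toy_critic, real, fake))
    # A faithful reconstruction makes the fake pair indistinguishable.
    real, fake = mirrored_pairs([car], [car])
    print(critic_gap(toy_critic, real, fake))
```

A critic that only saw the second element of each pair could not separate the two cases, since both the car and the airplane are valid samples of the input distribution.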
Mathematically, our formulation can be written as:
If E and G are optimal encoder and generator networks, i.e., , then =
In addition to Wasserstein minimization between the joint distributions of image-reconstruction pairs, we use a latent space regularization to regularize the norm of the latent codes. We find this regularization useful in practice for obtaining good anomaly detection scores.
where $d$ is the dimensionality of the latent space representation.
The previous sections discussed two techniques for training autoencoders with improved latent representations: Simplex Interpolation and Mirrored Adversarial Autoencoders. In this section, we discuss how such autoencoder models can be used for the unsupervised anomaly detection problem. The use of simplex interpolation helps obtain a compact and clustered latent space for normal samples. As discussed in Section 3, interpolation performs space filling, where the space between the latent distributions of normal samples is made to look like the normal distribution. Hence, the latent codes of anomalous samples have to lie outside this distribution, which naturally gives a good separation between normal and anomalous regions in the latent space. This results in improved anomaly detection performance. Mirrored Adversarial Autoencoders, as discussed in Section 4, learn autoencoders using an adversarial loss based on Wasserstein minimization between the joint distributions of real and decoded samples. The learnt discriminator network provides a good feature representation for detecting whether an (input, reconstruction) tuple belongs to the input distribution. We show that this discriminator representation provides a good estimate of the anomaly score.
First, we would like to point out that two discriminator models are used in our training pipeline: an interpolation discriminator, used in the interpolation step of simplex adversarial interpolation, and a reconstruction discriminator, used in the reconstruction step of autoencoder training. The two discriminators are updated according to Eq. (3) and Eq. (4), respectively. The encoder-decoder pair, on the other hand, has the following two objectives: (1) Autoencoder update: minimizing the Wasserstein distance between the joint distributions of $(x, x)$ and $(x, \hat{x})$ pairs, where $\hat{x}$ is the reconstruction of the input $x$, and (2) Interpolation update: forcing the interpolated points to look realistic. The overall objective can be written as:
where a scalar hyper-parameter controls the weight of the interpolation loss, and $n$ denotes the number of images used to interpolate ($n = 2$ corresponds to 2-point interpolation).
Let $f(a, b)$ denote the response of the penultimate layer of the discriminator network when the pair $(a, b)$ is used as input. This gives the feature embedding of the pair of points $(a, b)$. We define the anomaly score of a sample $x$ with reconstruction $\hat{x}$ as the norm of the difference between the feature embeddings $f(x, x)$ and $f(x, \hat{x})$:
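The scoring rule can be sketched as follows. This is a hedged illustration only: `features` is a hypothetical callable standing in for the penultimate-layer response of the trained discriminator, the choice of norm is left as a parameter `p` because the text does not fix it here, and inputs are plain lists of floats.

```python
def anomaly_score(features, x, x_hat, p=1):
    """p-norm of the difference between the feature embedding of the
    (x, x) pair and that of the (x, x_hat) pair."""
    real = features(x, x)
    fake = features(x, x_hat)
    return sum(abs(a - b) ** p for a, b in zip(real, fake)) ** (1.0 / p)


def toy_features(a, b):
    """Hypothetical stand-in for discriminator features: concatenation."""
    return list(a) + list(b)


if __name__ == "__main__":
    x = [0.0, 0.0]
    good_reconstruction = [0.0, 0.0]
    poor_reconstruction = [3.0, 4.0]
    print(anomaly_score(toy_features, x, good_reconstruction))
    print(anomaly_score(toy_features, x, poor_reconstruction, p=2))
```

A sample whose reconstruction pair is embedded far from its mirrored pair receives a high score and is flagged as anomalous; a faithfully reconstructed sample scores close to 0.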
|3- Simplex (Ours)||0.9527||0.9677||0.9601|
|3- Simplex (Ours)||0.5294||0.5625||0.5455|
|3- Simplex (Ours)||0.6875||0.7021||0.6947|
|Outlier||Fence GAN ||EGBAD||Ano-GAN||GANomaly||Ours|
|AUC score||F1 score||AUC score||F1 score||AUC score||F1 score|
|Outlier Pursuit ||0.908||0.902||0.837||0.686||0.822||0.528|
|Date||ALOCC DR ||ALOCC D||DCAE||GPND||OCGAN ||Ours|
Mean One-class novelty detection on FMNIST dataset
CIFAR10. To test unsupervised anomaly detection on the CIFAR-10 dataset, we use the commonly used leave-one-out protocol, in which samples from one of the CIFAR-10 classes are used as anomalous samples, and all other classes are used as normal samples (training data). Since the setting is unsupervised, the training data consists only of normal data, and anomalous samples are not used during training. Experiments are repeated for 10 trials, each time using one of the CIFAR-10 classes as the anomaly.
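The leave-one-out protocol can be sketched as a small data-splitting helper. The function name and the representation of a dataset as a list of `(sample, class_label)` tuples are assumptions made for illustration; no dataset loading is shown.

```python
def leave_one_out_split(train_set, test_set, anomaly_class):
    """Drop the held-out class from training and relabel the test set:
    1 for the held-out (anomalous) class, 0 for every other class."""
    train_normal = [x for x, y in train_set if y != anomaly_class]
    test_labeled = [(x, int(y == anomaly_class)) for x, y in test_set]
    return train_normal, test_labeled


if __name__ == "__main__":
    train = [("img_a", 0), ("img_b", 1), ("img_c", 2)]
    test = [("img_d", 1), ("img_e", 2)]
    # One trial per class; each class plays the anomaly exactly once.
    for held_out in range(3):
        normal, labeled = leave_one_out_split(train, test, held_out)
        print(held_out, normal, labeled)
```

Running one trial per class label and averaging the per-trial metric reproduces the structure of the evaluation, under the assumptions above.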
Our approach is optimized using the objective function in Eq. (5.1). For the exact algorithm, please refer to the supplementary material. In all our models, the scalar hyper-parameter weighting the interpolation loss is set to 0.5. The decoder and the two discriminator models are optimized using the Adam optimizer with an initial learning rate of 3e-4 and $\beta_1 = 0$, $\beta_2 = 0.999$. The encoder model is optimized using the Adam optimizer with an initial learning rate of 3e-4 and $\beta_1 = 0.5$, $\beta_2 = 0.999$. Among all models trained, we pick the best model as the one that gives the lowest discriminator feature difference loss (the norm of the difference between the discriminator features of the input samples and those of the reconstructed ones) on training samples after 60 epochs. Experiments are performed using two NVIDIA RTX 2080 Ti GPUs.
To evaluate our models, we compute the anomaly scores for normal and anomalous samples, and measure the resulting AUC scores. A plot of our AUC scores compared with those of other approaches is reported in Figure 5. We find that our approach significantly improves the AUC scores compared to prior approaches. On average, we get an improvement of 0.25 AUC points. Additionally, most of the prior approaches perform poorly on hard classes like bird (in CIFAR-10, the bird class is similar to the airplane class), whereas our approach achieves a large improvement in performance on such hard classes.
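For reference, the AUC of a set of anomaly scores can be computed directly from its rank interpretation: it is the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The sketch below uses this quadratic-time definition for clarity; the function name is illustrative, and a library routine would normally be used for large test sets.

```python
def auc_score(normal_scores, anomaly_scores):
    """AUC as the fraction of (anomalous, normal) pairs that are ranked
    correctly, counting ties as half a correct pair."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomaly_scores))


if __name__ == "__main__":
    # Perfect separation: every anomaly scores above every normal sample.
    print(auc_score([0.1, 0.2], [0.3, 0.4]))
    # Partial overlap between the two score distributions.
    print(auc_score([0.1, 0.4], [0.3, 0.2]))
```

An AUC of 0.5 corresponds to scores that carry no information about which samples are anomalous.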
MNIST. We also evaluate our simplex interpolation model on the MNIST dataset. We use the same leave-one-out protocol as in the CIFAR-10 experiment, with data points from one class as anomalous samples and data points from the other 9 classes as normal samples. The training data consists only of normal data, and anomalous samples are not used during training. Experiments are repeated for 10 trials, each time using one of the MNIST classes as the anomaly. The results in Table 3 show that our method reaches an AUC of around 0.97 on MNIST. Please refer to the supplementary material for more details on model architectures and hyper-parameters.
Coil-100 and FMNIST. GPND and OCGAN also evaluate their performance on Coil-100 and FMNIST. We use the same experimental design to test our model on these two datasets. We randomly take $n$ categories as inliers, where $n \in \{1, 4, 7\}$, and randomly sample outliers from the rest of the categories. We repeat this procedure 30 times. The results in Table 4 show that our method is competitive with them on Coil-100.
For FMNIST, 80% of the in-class samples are used for training and the remaining 20% for testing. Negative samples are randomly selected so that they make up 50% of the test dataset. We take one class as normal and the others as anomalous, and the final AUC score is calculated as the average over the 10 class labels. As shown in Table 5, our method is competitive with OCGAN on the FMNIST dataset.
Experiments on non-image datasets. Our approach is versatile and can be applied to non-image datasets as well. We evaluate our simplex interpolation model on the publicly available tabular datasets KDDCup99 10%, Arrhythmia and Thyroid. In each of these datasets, a fraction of the samples is labelled as anomalous. We evaluate anomaly detection performance using Precision, Recall and F1 score metrics, as done in previous approaches [30, 31, 34]. We randomly sample a portion of the data as the training set and remove the anomalous samples from it. The resulting dataset is used as our training data.
At test time, we assume that the fraction of anomalous samples in each dataset is known. This is the protocol used in [30, 31, 34]. For the test set, we compute the anomaly score of each sample, sort the samples by anomaly score and label the top $k$% of samples as anomalous, where $k$ is the percentage of anomalous samples in the dataset. With these assignments as predictions, we compute the evaluation metrics. Results are reported in Table 2. We observe that our approach achieves state-of-the-art performance on all three datasets. In particular, we obtain significant gains of 0.25 performance points on the Thyroid dataset. Please refer to the supplementary material for more details on model architectures and hyper-parameters.
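The thresholding and metric computation described above can be sketched as follows. The function names are illustrative, the anomaly fraction is passed in as a number because it is assumed known, and rounding the cut-off to the nearest integer count is an assumption about a detail the text leaves open.

```python
def flag_top_fraction(scores, fraction):
    """Label the highest-scoring `fraction` of samples as anomalous (1)."""
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:k])
    return [1 if i in flagged else 0 for i in range(len(scores))]


def precision_recall_f1(labels, predictions):
    """Precision, recall and F1 with the anomalous class as positive."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2]
    labels = [1, 0, 0, 1]
    preds = flag_top_fraction(scores, fraction=0.5)
    print(preds, precision_recall_f1(labels, preds))
```

Because the number of flagged samples equals the number of true anomalies under this protocol (up to rounding), precision and recall tend to be close to each other.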
Our objective function consists of three main components, as shown in Eq. (5.1): (1) the autoencoder training loss, (2) simplex interpolation and (3) latent space regularization. In this experiment, we perform an ablation study of each of these components. For all experiments, we use the Mirrored Autoencoder as our base architecture. For simplex interpolation, we compare 2-point and 3-point interpolation; in our experiments, we observe that performance saturates when more points are used. Results of the ablation study for CIFAR-10 are provided in Fig. 5. We make the following observations: (1) interpolation improves performance compared to not using any interpolation, (2) among the different interpolation techniques, simplex interpolation outperforms Berthelot et al.'s interpolation, and (3) 3-point simplex interpolation achieves improvements over 2-point interpolation. The best performance is obtained by using 3-point simplex interpolation with a combination of all three terms in the objective of Eq. (5.1).
In this paper, we introduced a new method for the unsupervised anomaly detection problem based on a novel representation learning technique using deep autoencoders that contains two novel components: (1) the Mirrored Adversarial Autoencoder, which replaces the reconstruction loss in the autoencoder optimization objective with a novel adversarial loss to enforce semantic-level reconstruction, and (2) Simplex Interpolation, which extends the interpolation idea of Berthelot et al. to improve the structure of the latent space representation in the autoencoder. We showed that our proposed method improves the state of the art by a large margin on benchmark anomaly detection datasets. We note that the ideas proposed in this work can potentially be used in the semi-supervised anomaly detection problem, where we have access to a few anomalous samples during training. We leave this for future work.
This work was supported in part by NSF CAREER AWARD 1942230, a sponsorship from Capital One and an IBM Faculty Award.
Image-to-image translation with conditional adversarial networks. CVPR.
Credit card fraud detection in e-commerce: an outlier detection approach. CoRR abs/1811.02196.
Adversarially learned one-class classifier for novelty detection. CoRR abs/1802.09088.
Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579–584.
Deep structured energy based models for anomaly detection. CoRR abs/1605.07717.
Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR.