GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training

05/17/2018 ∙ by Samet Akcay, et al.

Anomaly detection is a classical problem in computer vision, namely the determination of the normal from the abnormal when datasets are highly biased towards one class (normal) due to the insufficient sample size of the other class (abnormal). While this can be addressed as a supervised learning problem, a significantly more challenging problem is that of detecting the unknown/unseen anomaly case that takes us instead into the space of a one-class, semi-supervised learning paradigm. We introduce such a novel anomaly detection model, by using a conditional generative adversarial network that jointly learns the generation of high-dimensional image space and the inference of latent space. Employing encoder-decoder-encoder sub-networks in the generator network enables the model to map the input image to a lower dimension vector, which is then used to reconstruct the generated output image. The use of the additional encoder network maps this generated image to its latent representation. Minimizing the distance between these images and the latent vectors during training aids in learning the data distribution for the normal samples. As a result, a larger distance metric from this learned data distribution at inference time is indicative of an outlier from that distribution - an anomaly. Experimentation over several benchmark datasets, from varying domains, shows the model efficacy and superiority over previous state-of-the-art approaches.


1 Introduction

Despite yielding encouraging performance over various computer vision tasks, supervised approaches depend heavily on large, labeled datasets. In many real-world problems, however, samples from the more unusual classes of interest are too few to be modeled effectively. Instead, the task of anomaly detection is to identify such cases by training only on samples considered to be normal and then identifying the unusual, insufficiently available samples (abnormal) that differ from the learned distribution of normality. A tangible application, considered here within our evaluation, is X-ray screening for aviation or border security, where anomalous items posing a security threat are not commonly encountered, exemplary data of such items can be difficult to obtain in any quantity, and the nature of any anomaly posing a potential threat may evolve due to a range of external factors. Within this challenging context, however, human security operators are still competent and adaptable anomaly detectors against new and emerging anomalous threat signatures.

As illustrated in Figure 1, a formal problem definition of the anomaly detection task is as follows: given a dataset $\mathcal{D}$ containing a large number of normal samples $X$ for training, and relatively few abnormal examples $Y$ for the test, a model $\mathcal{M}$ is optimized over its parameters $\theta$. $\mathcal{M}$ learns the data distribution $p_X$ of the normal samples during training while identifying abnormal samples as outliers during testing by outputting an anomaly score $\mathcal{A}(x)$, where $x$ is a given test example. A larger $\mathcal{A}(x)$ indicates possible abnormalities within the test image since $\mathcal{M}$ learns to minimize the output score during training. $\mathcal{A}(x)$ is general in that it can detect unseen anomalies as being non-conforming to $p_X$.

Figure 1: Overview of our anomaly detection approach within the context of an X-ray security screening problem. Our model is trained on normal samples (a), and tested on normal and abnormal samples (b). Anomalies are detected when the output of the model is greater than a certain threshold, $\mathcal{A}(x) > \phi$.

There is a large volume of studies proposing anomaly detection models within various application domains [2, 4, 3, 39, 23]. Besides, a considerable amount of work has taxonomized the approaches within the literature [28, 29, 19, 9, 33]. In parallel to the recent advances in this field, Generative Adversarial Networks (GAN) have emerged as a leading methodology across both unsupervised and semi-supervised problems. Goodfellow et al. [16] first proposed this approach by co-training a pair of networks (generator and discriminator). The former network models high dimensional data from a latent vector to resemble the source data, while the latter distinguishes the modeled (i.e., approximated) and original data samples. Several approaches followed this work to improve the training and inference stages [8, 17]. As reviewed in [23], adversarial training has also been adopted by recent work within anomaly detection.

Schlegl et al.[39] hypothesize that the latent vector of a GAN represents the true distribution of the data and remap to the latent vector by optimizing a pre-trained GAN based on the latent vector. The limitation is the enormous computational complexity of remapping to this latent vector space. In a follow-up study, Zenati et al.[40] train a BiGAN model [14], which maps from image space to latent space jointly, and report statistically and computationally superior results albeit on the simplistic MNIST benchmark dataset [25].

Motivated by [39, 40, 6], here we propose a generic anomaly detection architecture comprising an adversarial training framework. In a similar vein to [39], we use single color images as the input to our approach, drawn only from an example set of normal (non-anomalous) training examples. However, in contrast, our approach does not require two-stage training and is efficient for both model training and later inference (run-time testing). As with [40], we also learn image and latent vector spaces jointly. Our key novelty comes from the fact that we employ an adversarial autoencoder within an encoder-decoder-encoder pipeline, capturing the training data distribution within both image and latent vector space. An adversarial training architecture such as this, practically based on only normal training data examples, produces superior performance over challenging benchmark problems. The main contributions of this paper are as follows:

  • semi-supervised anomaly detection: a novel adversarial autoencoder within an encoder-decoder-encoder pipeline, capturing the training data distribution within both image and latent vector space, yielding superior results to contemporary GAN-based and traditional autoencoder-based approaches.

  • efficacy: an efficient and novel approach to anomaly detection that yields both statistically and computationally better performance.

  • reproducibility: a simple and effective algorithm whose results can be reproduced via the publicly available code (https://github.com/samet-akcay/ganomaly).

2 Related Work

Anomaly detection has long been a question of great interest in a wide range of domains including but not limited to biomedical [39], financial [3] and security such as video surveillance [23], network systems [4] and fraud detection [2]. Besides, a considerable amount of work has been published to taxonomize the approaches in the literature [28, 29, 19, 9, 33]. The narrower scope of the review is primarily focused on reconstruction-based anomaly techniques.

The vast majority of the reconstruction-based approaches have been employed to investigate anomalies in video sequences. Sabokrou et al. [37] investigate the use of Gaussian classifiers on top of autoencoders (global) and nearest neighbor similarity (local) feature descriptors to model non-overlapping video patches. A study by Medel and Savakis [30] employs convolutional long short-term memory networks for anomaly detection. Trained on normal samples only, the model predicts the future frame of a possible normal example, which exposes abnormalities during inference. In another study on the same task, Hasan et al. [18] consider a two-stage approach, first using local features with a fully connected autoencoder, followed by a fully convolutional autoencoder for end-to-end feature extraction and classification. Experiments yield competitive results on anomaly detection benchmarks. To determine the effects of adversarial training in anomaly detection in videos, Dimokranitou [13] uses adversarial autoencoders, producing comparable performance on benchmarks.

More recent attention in the literature has been focused on the provision of adversarial training. The seminal work of Ravanbakhsh et al.[35] utilizes image to image translation [21] to examine the abnormality detection problem in crowded scenes and achieves state-of-the-art on the benchmarks. The approach is to train two conditional GANs. The first generator produces optical flow from frames, while the second generates frames from optical-flow.

The generalisability of the approach mentioned above is problematic since in many cases datasets do not have temporal features. One of the most influential accounts of anomaly detection using adversarial training comes from Schlegl et al.[39]. The authors hypothesize that the latent vector of the GAN represents the distribution of the data. However, mapping to the vector space of the GAN is not straightforward. To achieve this, the authors first train a generator and discriminator using only normal images. In the next stage, they utilize the pre-trained generator and discriminator by freezing the weights and remap to the latent vector by optimizing the GAN based on the vector. During inference, the model pinpoints an anomaly by outputting a high anomaly score, reporting significant improvement over the previous work. The main limitation of this work is its computational complexity since the model employs a two-stage approach, and remapping the latent vector is extremely expensive. In a follow-up study, Zenati et al.[40] investigate the use of BiGAN [14] in an anomaly detection task, examining joint training to map from image space to latent space simultaneously, and vice-versa. Training the model via [39] yields superior results on the MNIST [25] dataset.

Overall, prior work strongly supports the hypothesis that the use of autoencoders and GAN demonstrates promise in anomaly detection problems [23, 39, 40]. Motivated by the idea of GAN with inference studied in [39] and [40], we introduce a conditional adversarial network whose generator comprises encoder-decoder-encoder sub-networks, learning representations in both image and latent vector space jointly, and achieving state-of-the-art performance both statistically and computationally.

3 Our Approach: GANomaly

Figure 2: Pipeline of the proposed approach for anomaly detection.

To explain our approach in detail, it is essential to briefly introduce the background of GAN.

3.0.1 Generative Adversarial Networks (GAN)

GANs are an unsupervised machine learning algorithm initially introduced by Goodfellow et al. [16]. The original primary goal of the work is to generate realistic images. The idea is that two networks (generator and discriminator) compete with each other during training such that the former tries to generate an image, while the latter decides whether the generated image is real or fake. The generator is a decoder-like network that learns the distribution of input data from a latent space. The primary objective here is to model high dimensional data that captures the original real data distribution. The discriminator network usually has a classical classification architecture, reading an input image, and determining its validity (i.e., real vs. fake).
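For reference, the competition between the two networks is commonly written as the minimax objective of Goodfellow et al. [16]. The symbols $p_{data}$ (data distribution) and $p_z$ (latent prior) are notation introduced here for illustration and are not taken from this paper:

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{data}} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_z} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator $D$ is rewarded for assigning high probability to real samples and low probability to generated ones, while the generator $G$ is rewarded for the opposite.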

GANs have been intensively investigated recently due to their future potential [12]. To address training instability issues, several empirical methodologies have been proposed [38, 7]. One well-known study that has received attention in the literature is Deep Convolutional GAN (DCGAN) by Radford and Chintala [34], who introduce a fully convolutional generative network by removing fully connected layers and using convolutional layers and batch-normalization [20] throughout the network. The training performance of GAN is improved further via the use of Wasserstein loss [8, 17].

3.0.2 Adversarial Auto-Encoders (AAE)

Adversarial autoencoders consist of two sub-networks, namely an encoder and a decoder. This structure maps the input to latent space and remaps back to input data space, known as reconstruction. Training autoencoders in an adversarial setting enables not only better reconstruction but also control over latent space [31, 27, 12].

3.0.3 GAN with Inference

GANs are also used within discrimination tasks by exploiting latent space variables [10]. For instance, the research by [11] suggests that networks are capable of generating a similar latent representation for related high-dimensional image data. Lipton and Tripathi [26] also investigate the idea of inverse mapping by introducing a gradient-based approach, mapping images back to the latent space. This has also been explored in [15] with a specific focus on joint training of generator and inference networks. The former network maps from latent space to high-dimensional image space, while the latter maps from image to latent space. Another study by Donahue et al. [14] suggests that with the additional use of an encoder network mapping from image space to latent space, a vanilla GAN network is capable of learning inverse mapping.

3.1 Proposed Approach

3.1.1 Problem Definition.

Our objective is to train an unsupervised network that detects anomalies using a dataset that is highly biased towards a particular class - i.e., comprising normal non-anomalous occurrences only for training. The formal definition of this problem is as follows:

We are given a large training dataset $\mathcal{D}$ comprising only $M$ normal images, $\mathcal{D} = \{x_1, \ldots, x_M\}$, and a smaller testing dataset $\hat{\mathcal{D}}$ of $N$ normal and abnormal images, $\hat{\mathcal{D}} = \{(\hat{x}_1, y_1), \ldots, (\hat{x}_N, y_N)\}$, where $y_i \in [0, 1]$ denotes the image label. In the practical setting, the training set is significantly larger than the test set such that $M \gg N$.

Given the dataset, our goal is first to model $\mathcal{D}$ to learn its manifold, then detect the abnormal samples in $\hat{\mathcal{D}}$ as outliers during the inference stage. The model both learns the normal data distribution and minimizes the output anomaly score $\mathcal{A}(x)$. For a given test image $\hat{x}$, a high anomaly score $\mathcal{A}(\hat{x})$ indicates possible anomalies within the image. The evaluation criterion for this is to threshold ($\phi$) the score, where $\mathcal{A}(\hat{x}) > \phi$ indicates an anomaly.
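The thresholding step can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `flag_anomalies`, the example scores, and the threshold value 0.5 are hypothetical and are not taken from the paper.

```python
from typing import List


def flag_anomalies(scores: List[float], phi: float) -> List[bool]:
    """Return True for every anomaly score strictly above the threshold phi."""
    return [s > phi for s in scores]


# Hypothetical anomaly scores for four test images.
scores = [0.05, 0.12, 0.81, 0.40]
flags = flag_anomalies(scores, phi=0.5)  # only the third image is flagged
```

In practice the threshold is a free parameter, which is why a threshold-independent metric such as the ROC AUC is a natural way to compare models.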

3.1.2 GANomaly Pipeline.

Figure 2 illustrates the overview of our approach, which contains two encoders, a decoder, and discriminator networks, employed within three sub-networks.

The first sub-network is a bow tie autoencoder network behaving as the generator part of the model. The generator learns the input data representation and reconstructs the input image via the use of an encoder and a decoder network, respectively. The formal principle of the sub-network is the following: the generator $G$ first reads an input image $x$, where $x \in \mathbb{R}^{w \times h \times c}$, and forward-passes it to its encoder network $G_E$. With the use of convolutional layers followed by batch-norm and leaky ReLU activation, respectively, $G_E$ downscales $x$ by compressing it to a vector $z$, where $z \in \mathbb{R}^{d}$. $z$ is also known as the bottleneck features of $G$ and is hypothesized to have the smallest dimension containing the best representation of $x$. The decoder part $G_D$ of the generator network $G$ adopts the architecture of a DCGAN generator [34], using convolutional transpose layers, ReLU activation and batch-norm together with a tanh layer at the end. This approach upscales the vector $z$ to reconstruct the image $x$ as $\hat{x}$. Based on these, the generator network $G$ generates the image $\hat{x}$ via $\hat{x} = G_D(z)$, where $z = G_E(x)$.

The second sub-network is the encoder network $E$ that compresses the image $\hat{x}$ that is reconstructed by the network $G$. With different parametrization, it has the same architectural details as $G_E$. $E$ downscales $\hat{x}$ to find its feature representation $\hat{z} = E(\hat{x})$. The dimension of the vector $\hat{z}$ is the same as that of $z$ for consistent comparison. This sub-network is one of the unique parts of the proposed approach. Unlike the prior autoencoder-based approaches, in which the minimization of the latent vectors is achieved via the bottleneck features, this sub-network explicitly learns to minimize the distance with its parametrization. Moreover, at test time, anomaly detection is performed with this minimization.

The third sub-network is the discriminator network $D$, whose objective is to classify the input $x$ and the output $\hat{x}$ as real or fake, respectively. This sub-network is the standard discriminator network introduced in DCGAN [34].
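The data flow through the encoder-decoder-encoder part of the pipeline can be traced with a toy sketch. Everything below is an assumption made for illustration: the sub-networks are replaced by random linear maps rather than the convolutional layers described above, the sizes are arbitrary, the names `W_GE`, `W_GD` and `W_E` stand in for the generator's encoder, the generator's decoder and the second encoder, and the discriminator is omitted.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_pixels, d = 16, 4  # hypothetical sizes: flattened image and latent vector

W_GE = random_matrix(d, n_pixels, rng)  # generator encoder: image -> latent
W_GD = random_matrix(n_pixels, d, rng)  # generator decoder: latent -> image
W_E = random_matrix(d, n_pixels, rng)   # second encoder, with its own parameters

x = [rng.random() for _ in range(n_pixels)]  # toy input image
z = matvec(W_GE, x)         # bottleneck features of the input
x_hat = matvec(W_GD, z)     # reconstructed image
z_hat = matvec(W_E, x_hat)  # latent representation of the reconstruction
```

The point of the sketch is the shapes: `z` and `z_hat` live in the same low-dimensional space, so they can be compared directly, while `x` and `x_hat` live in image space.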

Having defined our overall multi-network architecture, as depicted in Figure 2, we now move on to discuss how we formulate our objective for learning.

3.2 Model Training

We hypothesize that when an abnormal image is forward-passed into the network $G$, $G_D$ is not able to reconstruct the abnormalities even though $G_E$ manages to map the input $x$ to the latent vector $z$. This is because the network is modeled only on normal samples during training and its parametrization is not suitable for generating abnormal samples. An output $\hat{x}$ that has missed abnormalities can lead to the encoder network $E$ mapping $\hat{x}$ to a vector $\hat{z}$ that has also missed the abnormal feature representation, causing dissimilarity between $z$ and $\hat{z}$. When there is such dissimilarity within latent vector space for an input image $x$, the model classifies $x$ as an anomalous image. To validate this hypothesis, we formulate our objective function by combining three loss functions, each of which optimizes individual sub-networks.

3.2.1 Adversarial Loss.

Following the current trend within the new anomaly detection approaches [39, 40], we also use feature matching loss for adversarial learning. Proposed by Salimans et al. [38], feature matching is shown to reduce the instability of GAN training. Unlike the vanilla GAN, where the generator $G$ is updated based on the output of the discriminator $D$ (real/fake), here we update $G$ based on the internal representation of $D$. Formally, let $f$ be a function that outputs an intermediate layer of the discriminator $D$ for a given input $x$ drawn from the input data distribution $p_X$. Feature matching computes the $\mathcal{L}_2$ distance between the feature representation of the original and the generated images, respectively. Hence, our adversarial loss $\mathcal{L}_{adv}$ is defined as:

$$\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_X} \left\| f(x) - \mathbb{E}_{x \sim p_X} f(G(x)) \right\|_2 \tag{1}$$
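A minimal sketch of feature matching follows, assuming the batch-mean form described by Salimans et al. [38]: the intermediate discriminator features are averaged over a batch of original images and over a batch of generated images, and the Euclidean distance between the two mean vectors is the loss. The function names and the toy feature values are hypothetical, and the feature vectors are supplied directly rather than computed by a discriminator.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(batch: List[Vector]) -> Vector:
    """Element-wise mean over a batch of equally sized feature vectors."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]


def feature_matching_loss(real_feats: List[Vector], fake_feats: List[Vector]) -> float:
    """Euclidean distance between the mean features of real and generated batches."""
    mu_real = mean_vector(real_feats)
    mu_fake = mean_vector(fake_feats)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake)))


# Toy discriminator features for two original and two generated images.
real = [[1.0, 2.0], [3.0, 4.0]]
fake = [[1.0, 2.0], [3.0, 0.0]]
loss = feature_matching_loss(real, fake)  # mean features differ only in the second dimension
```

Because the generator is scored on feature statistics rather than on the discriminator's final real/fake output, it receives a smoother training signal.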

3.2.2 Contextual Loss.

The adversarial loss is adequate to fool the discriminator with generated samples. However, with only an adversarial loss, the generator is not optimized towards learning contextual information about the input data. It has been shown that penalizing the generator by measuring the distance between the input and the generated images remedies this issue [21]. Isola et al. [21] show that the use of $\mathcal{L}_1$ yields less blurry results than $\mathcal{L}_2$. Hence, we also penalize $G$ by measuring the $\mathcal{L}_1$ distance between the original $x$ and the generated images ($\hat{x} = G(x)$) using a contextual loss $\mathcal{L}_{con}$ defined as:

$$\mathcal{L}_{con} = \mathbb{E}_{x \sim p_X} \left\| x - G(x) \right\|_1 \tag{2}$$

3.2.3 Encoder Loss.

The two losses introduced above can force the generator to produce images that are not only realistic but also contextually sound. Moreover, we employ an additional encoder loss $\mathcal{L}_{enc}$ to minimize the distance between the bottleneck features of the input ($z = G_E(x)$) and the encoded features of the generated image ($\hat{z} = E(G(x))$). $\mathcal{L}_{enc}$ is formally defined as

$$\mathcal{L}_{enc} = \mathbb{E}_{x \sim p_X} \left\| G_E(x) - E(G(x)) \right\|_2 \tag{3}$$

In so doing, the generator learns how to encode features of the generated image for normal samples. For anomalous inputs, however, it will fail to minimize the distance between the input and the generated images in the feature space since both the $G$ and $E$ networks are optimized towards normal samples only.

Overall, our objective function for the generator becomes the following:

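$$\mathcal{L} = w_{adv} \mathcal{L}_{adv} + w_{con} \mathcal{L}_{con} + w_{enc} \mathcal{L}_{enc} \tag{4}$$

where $w_{adv}$, $w_{con}$ and $w_{enc}$ are the weighting parameters adjusting the impact of individual losses on the overall objective function.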
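How the three terms combine can be sketched as follows. This is a hedged illustration, not the released implementation: the helper names are hypothetical, the adversarial term is passed in as a precomputed number, the contextual term is taken as an L1 distance in image space and the encoder term as a Euclidean distance in latent space, and the weights used in the example are placeholders rather than the values chosen in the paper.

```python
import math
from typing import List

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Sum of absolute differences (used here for the contextual term)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance (used here for the encoder term)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def generator_objective(l_adv: float, l_con: float, l_enc: float,
                        w_adv: float, w_con: float, w_enc: float) -> float:
    """Weighted sum of the adversarial, contextual and encoder losses."""
    return w_adv * l_adv + w_con * l_con + w_enc * l_enc


# Toy values: an input image, its reconstruction, and the two latent vectors.
x, x_hat = [0.0, 1.0, 0.5], [0.25, 0.75, 0.5]
z, z_hat = [0.0, 0.0], [3.0, 4.0]

l_con = l1_distance(x, x_hat)   # distance in image space
l_enc = l2_distance(z, z_hat)   # distance in latent space
total = generator_objective(l_adv=2.0, l_con=l_con, l_enc=l_enc,
                            w_adv=1.0, w_con=2.0, w_enc=1.0)  # placeholder weights
```

Raising one weight relative to the others shifts the generator's emphasis towards that term, which is what the hyper-parameter study on loss weighting explores.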

Figure 3: Comparison of the three models. A) AnoGAN [39], B) Efficient-GAN-Anomaly [40], C) Our Approach: GANomaly

3.3 Model Testing

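During the test stage, the model uses $\mathcal{L}_{enc}$ given in Eq 3 for scoring the abnormality of a given image. Hence, for a test sample $\hat{x}$, our anomaly score $\mathcal{A}(\hat{x})$ or $s_{\hat{x}}$ is defined as

$$\mathcal{A}(\hat{x}) = \left\| G_E(\hat{x}) - E(G(\hat{x})) \right\|_2 \tag{5}$$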

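To evaluate the overall anomaly performance, we compute the anomaly score for each individual test sample $\hat{x}$ within the test set $\hat{\mathcal{D}}$, which in turn yields a set of anomaly scores $S = \{ s_i : \mathcal{A}(\hat{x}_i), \hat{x}_i \in \hat{\mathcal{D}} \}$. We then apply feature scaling to bring the anomaly scores within the probabilistic range of $[0, 1]$.

$$s'_i = \frac{s_i - \min(S)}{\max(S) - \min(S)} \tag{6}$$

The use of Eq 6 ultimately yields an anomaly score vector $S'$ for the final evaluation of the test set $\hat{\mathcal{D}}$.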
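The scoring and scaling steps can be sketched as below. The sketch assumes the latent vectors of each test image and of its reconstruction are already available, uses a Euclidean distance between them as the raw score (the choice of norm is an assumption here), and then applies min-max scaling. The function names and the toy latent vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def anomaly_score(z: Vector, z_hat: Vector) -> float:
    """Distance between the latent vector of the input and of its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_hat)))


def min_max_scale(scores: List[float]) -> List[float]:
    """Rescale scores linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores identical: avoid division by zero
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Toy (z, z_hat) pairs for three test images.
latent_pairs: List[Tuple[Vector, Vector]] = [
    ([0.0, 0.0], [0.0, 2.0]),
    ([1.0, 1.0], [1.0, 5.0]),
    ([0.0, 0.0], [6.0, 8.0]),
]
raw_scores = [anomaly_score(z, z_hat) for z, z_hat in latent_pairs]
scaled_scores = min_max_scale(raw_scores)
```

Min-max scaling is computed over the whole test set, so a scaled score is only meaningful relative to the other samples scored in the same run.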

4 Experimental Setup

To evaluate our anomaly detection framework, we use three types of dataset: the simplistic benchmark of MNIST [25], the reference benchmark of CIFAR [24], and the operational context of anomaly detection within X-ray security screening [5].

4.0.1 MNIST.

To replicate the results presented in [40], we first experiment on MNIST data [25] by treating one class as an anomaly, while the rest of the classes are considered the normal class. In total, we have ten sets of data, each of which considers an individual digit as the anomaly.

4.0.2 CIFAR10.

Within our use of the CIFAR dataset, we again treat one class as abnormal and the rest as normal. We then detect the outlier anomalies as instances drawn from the former class by training the model on the latter labels.
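The one-class-as-anomaly protocol can be sketched as follows. This is an illustrative reading of the setup rather than the released data loader: the function name and the `test_fraction` parameter are hypothetical, a portion of the normal samples is held out so that the test set contains both normal and abnormal examples, and the fraction used in the example is a placeholder.

```python
import random
from typing import List, Sequence, Tuple


def one_vs_rest_split(samples: Sequence[int], labels: Sequence[int],
                      anomalous_class: int, test_fraction: float,
                      seed: int = 0) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Train on normal samples only; test on held-out normals plus all anomalies.

    Returns (train_samples, test_pairs), where each test pair is
    (sample, y) with y = 0 for normal and y = 1 for abnormal.
    """
    rng = random.Random(seed)
    normal = [s for s, y in zip(samples, labels) if y != anomalous_class]
    abnormal = [s for s, y in zip(samples, labels) if y == anomalous_class]
    rng.shuffle(normal)
    n_test = int(len(normal) * test_fraction)
    test_normal, train = normal[:n_test], normal[n_test:]
    test = [(s, 0) for s in test_normal] + [(s, 1) for s in abnormal]
    return train, test


# Toy dataset: 20 sample indices with class labels 0..9, class 2 treated as anomalous.
samples = list(range(20))
labels = [i % 10 for i in range(20)]
train, test = one_vs_rest_split(samples, labels, anomalous_class=2, test_fraction=0.25)
```

Repeating the split once per class gives one experiment for each choice of anomalous class.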

4.0.3 University Baggage Anomaly Dataset — (UBA).

This sliding window patch-based dataset comprises 230,275 image patches. Normal samples are extracted via an overlapping sliding window from a full X-ray image, constructed using single conventional X-ray imagery with associated false color materials mapping from dual-energy [36]. Abnormal classes () comprise 3 sub-classes, knife (), gun () and gun component (), and contain manually cropped threat objects together with sliding window patches whose intersection over union with the ground truth is greater than .

4.0.4 Full Firearm vs. Operational Benign — (FFOB).

In addition to these datasets, we also use the UK government evaluation dataset [1], comprising both expertly concealed firearm (threat) items and operational benign (non-threat) imagery from commercial X-ray security screening operations (baggage/parcels). Denoted as FFOB, this dataset comprises firearm full-weapons as full abnormal and operational benign as full normal images, respectively.

The procedure for train and test set split for the above datasets is as follows: we split the normal samples such that and of the samples are considered as part of the train and test sets, respectively. We then resize MNIST to , UBA and FFOB to , respectively.

Following Schlegl et al. [39] (AnoGAN) and Zenati et al. [40] (EGBAD), our adversarial training is also based on the standard DCGAN approach [34] for a consistent comparison. As such, we aim to show the superiority of our multi-network architecture regardless of using any tricks to improve the GAN training. In addition, we also compare our method against the traditional variational autoencoder architecture [6] (VAE) to show the advantage of our multi-network architecture. We implement our approach in PyTorch [32] (v0.4.0 with Python 3.6.5) by optimizing the networks using Adam [22] with an initial learning rate , and momentums , . Our model is optimized based on the weighted loss (defined in Equation 4) using the weight values , and , which were empirically chosen to yield optimum results (Figure 5 (b)). We train the model for 15, 25, 25 epochs for MNIST, UBA and FFOB datasets, respectively. Experimentation is performed using a dual-core Intel Xeon E5-2630 v4 processor and NVIDIA GTX Titan X GPU.

5 Results

We report results as the area under the curve (AUC) of the Receiver Operating Characteristic (ROC), which plots the true positive rate (TPR) as a function of the false positive rate (FPR); each point on the curve is the TPR-FPR pair obtained at a different threshold.
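The AUC can be computed without tracing the curve explicitly, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one (ties counted as one half). The sketch below uses that pairwise formulation; the function name and the toy scores are hypothetical, and this is not necessarily how the paper's evaluation code computes the metric.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison of abnormal and normal scores.

    labels: 1 for abnormal (positive), 0 for normal (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one abnormal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # abnormal sample ranked above normal sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy anomaly scores for two normal and two abnormal test images.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = roc_auc(scores, labels)  # three of the four abnormal/normal pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to a perfect separation of normal and abnormal scores.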

Figure 4: Results for MNIST (a) and CIFAR (b) datasets. Variations due to the use of 3 different random seeds are depicted via error bars. All but GANomaly results in (a) were obtained from [40].

Figure 4 (a) presents the results obtained on MNIST data using 3 different random seeds, where we observe the clear superiority of our approach over previous contemporary models [6, 39, 40]. For each digit chosen as anomalous, our model achieves higher AUC than EGBAD [40], AnoGAN [39] and the variational autoencoder pipeline VAE [6]. Because VAE performs poorly even on this relatively unchallenging dataset, we do not include it in the rest of the experiments. Figure 4 (b) shows the performance of the models trained on the CIFAR10 dataset. We see that our model achieves the best AUC performance for any class chosen as anomalous. The reason for the relatively lower quantitative results on this dataset is that, for a selected abnormal category, there exists a normal class that is similar to the abnormal one (plane vs. bird, cat vs. dog, horse vs. deer and car vs. truck).

                 UBA                                        FFOB
Method           gun      gun-parts    knife    overall     full-weapon
AnoGAN [39]      0.598    0.511        0.599    0.569       0.703
EGBAD [40]       0.614    0.591        0.587    0.597       0.712
GANomaly         0.747    0.662        0.520    0.643       0.882
Table 1: AUC results for UBA and FFOB datasets

For the UBA and FFOB datasets shown in Table 1, our model again outperforms the other approaches, excluding the case of the knife. In fact, the performance of the models for knife is comparable. The relatively lower performance on this class is due to its shape simplicity, which causes overfitting and hence high false positives. For the overall performance, however, our approach surpasses the other models, yielding AUC of 0.643 and 0.882 on the UBA and FFOB datasets, respectively.

Figure 5 depicts how the choice of hyper-parameters ultimately affects the overall performance of the model. In Figure 5 (a), we see that the optimal performance is achieved when the size of the latent vector is for the MNIST dataset with an abnormal digit-2. Figure 5 (b) demonstrates the impact of tuning the loss function in Equation 4 on the overall performance. The model achieves the highest AUC when , and . We empirically observe the same tuning pattern for the rest of the datasets.

Figure 5: (a) Overall performance of the model based on varying size of the latent vector . (b) Impact of weighting the losses on the overall performance. Model is trained on MNIST dataset with an abnormal digit-2

Figure 6 provides the histogram of the anomaly scores during the inference stage (a) and t-SNE visualization of the features extracted from the last convolutional layer of the discriminator network (b). Both of the figures demonstrate a clear separation within the latent vector and feature spaces.

Figure 6: (a) Histogram of the scores for both normal and abnormal test samples. (b) t-SNE visualization of the features extracted from the last conv. layer of the discriminator

Table 2 illustrates the runtime performance of the GAN-based models. Compared to the rest of the approaches, AnoGAN [39] is computationally rather expensive since optimization of the latent vector is needed for each example. For EGBAD [40], we report similar runtime performance to that of the original paper. Our approach, on the other hand, achieves the lowest runtime. Runtime on both the UBA and FFOB datasets is comparable to that on MNIST even though their image and network sizes are double those of MNIST.

Model            MNIST    CIFAR    UBA      FFOB
AnoGAN [39]      7120     7120     7110     7223
EGBAD [40]       8.92     8.71     8.88     8.87
GANomaly         2.79     2.21     2.66     2.53
Table 2: Computational performance of the approaches (runtime in milliseconds)

A set of examples in Figure 7 depicts real and fake images that are respectively the input and output of our model. We expect the model to fail when generating anomalous samples. As can be seen in Figure 7 (a), this is not the case for the class of 2 in the MNIST data. This stems from the fact that the MNIST dataset is relatively unchallenging, and the model learns sufficient information to be able to generate samples not seen during training. Another conclusion that could be drawn is that distance in the latent vector space provides adequate detail for detecting anomalies even though the model cannot distinguish abnormalities in the image space. In contrast to the MNIST experiments, this is not the case for the other datasets: Figures 7 (b-c) illustrate that the model is unable to produce abnormal objects.

Figure 7: Exemplar real and generated samples containing normal and abnormal objects in each dataset. The model fails to generate abnormal samples not being trained on.

Overall, these results suggest that our approach yields both statistically and computationally superior results to leading state-of-the-art approaches [39, 40].

6 Conclusion

We introduce a novel encoder-decoder-encoder architectural model for general anomaly detection enabled by an adversarial training framework. Experimentation across dataset benchmarks of varying complexity, and within the operational anomaly detection context of X-ray security screening, shows that the proposed method outperforms both contemporary state-of-the-art GAN-based and traditional autoencoder-based anomaly detection approaches with generalization ability to any anomaly detection task. Future work will consider employing emerging contemporary GAN optimizations [38, 17, 7], known to improve generalized adversarial training.

References