Skip-GANomaly: Skip Connected and Adversarially Trained Encoder-Decoder Anomaly Detection

01/25/2019 ∙ by Samet Akcay, et al. ∙ 0

Despite inherent ill-definition, anomaly detection is a research endeavor of great interest within machine learning and visual scene understanding alike. Most commonly, anomaly detection is considered as the detection of outliers within a given data distribution based on some measure of normality. The most significant challenge in real-world anomaly detection problems is that available data is highly imbalanced towards normality (i.e. non-anomalous) and contains at most a subset of all possible anomalous samples, hence limiting the use of well-established supervised learning methods. By contrast, we introduce an unsupervised anomaly detection model, trained only on the normal (non-anomalous, plentiful) samples in order to learn the normality distribution of the domain and hence detect abnormality based on deviation from this model. Our proposed approach employs an encoder-decoder convolutional neural network with skip connections to thoroughly capture the multi-scale distribution of the normal data in high-dimensional image space. Furthermore, utilizing an adversarial training scheme for this chosen architecture provides superior reconstruction both within high-dimensional image space and a lower-dimensional latent vector space encoding. Minimizing the reconstruction error metric within both the image and hidden vector spaces during training helps the model to learn the distribution of normality as required. Higher reconstruction metrics during subsequent test and deployment are thus indicative of a deviation from this normal distribution, hence indicative of an anomaly. Experimentation over established anomaly detection benchmarks and challenging real-world datasets, within the context of X-ray security screening, shows the unique promise of such a proposed approach.




I Introduction

Anomaly detection is an increasingly important area within visual image understanding. Following recent trends in the field, there has been a significant increase in the availability of large datasets. However, in most cases such data resources are highly imbalanced towards examples of normality (non-anomalous), whilst lacking in examples of abnormality (anomalous) and offering only partial coverage of all the possibilities that this latter class could encompass. This variation, and the somewhat unknown nature, of the anomalous class means such datasets lack the capacity and diversity to train traditional supervised detection approaches. In many application scenarios, such as the X-ray screening example illustrated in Figure 1, the availability of anomalous cases may be limited and may evolve over time due to external factors. Within such scenarios, unsupervised anomaly detection has become instrumental in modeling such data distributions, whereby the model is trained only on normal (non-anomalous) samples to capture the distribution of normality, and then evaluated on both unseen normal and abnormal (anomalous) examples to find their deviation from the distribution.

Fig. 1: Sub-sample of the X-ray screening application dataset used to train the proposed approach: (a) training data contains normal samples only, while the test data (b) comprises both normal and abnormal samples.

A significant body of prior work exists within anomaly detection for visual scene understanding [1, 2, 3, 4, 5] with a wide range of application domains [6, 7, 8, 9, 10]. A common hypothesis in such anomaly detection approaches is that abnormal samples differ from normality not only in high-dimensional image space but also within a lower-dimensional latent space encoding. Hence, mapping high-dimensional images to lower-dimensional latent space becomes essential. The critical issue here is that capturing the distribution of the normal samples is rather challenging. Recent developments in Generative Adversarial Networks (GAN) [11], shown to be highly capable of capturing the input data distribution, have led to a renewed interest in the anomaly detection problem. Several contemporary studies demonstrate that the use of GAN has great promise to address this anomaly detection problem since they are inherently adept at mapping high-dimensional data to a lower-dimensional latent encoding and vice-versa with minimal information loss [9, 12, 13].

Schlegl et al. [9] train a pre-trained GAN in reverse to map from image space to lower-dimensional latent space, hypothesizing that differences in latent space would yield anomalies. Zenati et al. [12] jointly train two networks to capture the normal distribution by mapping from image space to latent space, and vice-versa. Akçay et al. [13] train an encoder-decoder-encoder network with the adversarial scheme to capture the normal distribution within the image and latent space. Sabokrou et al. [14] also train an adversarial network to capture the normal distribution, hypothesizing that the model would fail to generate abnormal samples, where the difference between the original and generated images would yield the abnormality. This prior work in the field [9, 12, 13, 14] empirically illustrates both the importance and promise of detecting anomalies within both image and latent space.

Here we propose a new method for anomaly detection via adversarial training over a skip-connected encoder-decoder (convolutional neural) network architecture. Whilst adversarial training has shown the promise of GAN in this domain [13], skip-connections within such UNet style (encoder-decoder) [15] generator networks are known to enable the multi-scale capture of image space detail with sufficient capacity to generate high-quality normal images drawn from the distribution the model has learned. Similar to [9, 12, 13], the proposed approach also seeks to learn the normal distribution in both the image and latent spaces via a GAN generator-discriminator paradigm. The discriminator network not only forces the generator to learn an improved model of the distribution but also works as a feature extractor such that it learns the reconstruction of the normal distribution within a lower-dimensional latent space. Evaluation of the model on various established benchmarks [16, 17] statistically illustrates superior anomaly detection task performance over prior work [9, 12, 13]. Subsequently, the main contributions of this paper are as follows:

  • unsupervised anomaly detection — a unique unsupervised adversarial training regime, over a skip-connected encoder-decoder convolutional network architecture, yields superior reconstruction within the image and latent vector spaces.

  • efficacy — an efficient anomaly detection algorithm achieving quantitatively and qualitatively superior performance against prior state-of-the-art approaches.

  • reproducibility — a simple yet effective algorithmic approach that can be readily reproduced.

II Related Work

Anomaly detection is a major area of interest within the field of machine learning with various real-world applications spanning from biomedical[9] to video surveillance[10]. Recently, a considerable literature has grown up in the field, leading to a proliferation of taxonomy papers [1, 2, 3, 4, 5]. Due to the current trends, the review in the paper primarily focuses on reconstruction-based anomaly detection approaches.

One of the most influential accounts of anomaly detection using adversarial training comes from Schlegl et al. [9]. The authors hypothesize that the latent vector of the GAN represents the distribution of the data. However, mapping to the vector space of the GAN is not straightforward. To achieve this, the authors first train a generator and discriminator using only normal images. In the next stage, they utilize the pre-trained generator and discriminator by freezing the weights and remap to the latent vector by optimizing the GAN based on the vector. During inference, the model pinpoints an anomaly by outputting a high anomaly score, reporting significant improvement over the previous work. The main limitation of this work is its computational complexity since the model employs a two-stage approach, and remapping the latent vector is extremely expensive. In a follow-up study, Zenati et al. [12] investigate the use of BiGAN [18] in an anomaly detection task, examining joint training to map from image space to latent space simultaneously, and vice-versa. Training the model via [9] yields superior results on the MNIST [19] dataset. In a similar study in which image and latent vector spaces are optimized for anomaly detection, Akçay et al. [13] propose an adversarial network such that the generator comprises encoder-decoder-encoder sub-networks. The objective of the model is not only to minimize the distance between the real and fake normal images, but also to jointly minimize the distance between their latent vector representations. The proposed approach achieves state-of-the-art performance both statistically and computationally.

Taken together, these studies support the notion that the use of reconstruction-based approaches shows promise within the field [10, 9, 12, 13, 14]. Motivated by the previous methods in which latent vectors are optimized [9, 12, 13], we propose an anomaly detection approach that utilizes adversarial autoencoders with skip connections. The proposed approach learns representations within both image and latent vector space jointly and achieves numerically superior performance.

III Proposed Approach

Before proceeding to explain our proposed approach, it is important to introduce the fundamental concepts.

III-A Background

III-A1 Generative Adversarial Networks (GAN)

GAN are unsupervised deep neural architectures that learn to capture any input data distribution by predicting features from an initially hidden representation. Initially proposed in [11], the theory behind GAN is based on a competition of two networks within a zero-sum game framework, as initially used in game theory. The task of the first network, called Generator ($G$), is to capture the distribution of the input dataset for a given class label, by predicting features (or images) from a hidden representation, which is commonly a random noise vector. Hence the generator network has a decoder network architecture such that it up-samples the arbitrary input latent representation to generate high dimensional features. The task of the second network, called Discriminator ($D$), on the other hand, is to predict the correct class (i.e., real vs. fake) based on the given features (or images). The discriminator network usually adopts an encoder network architecture such that for a given high dimensional feature, it predicts its class label. With optimization based on a zero-sum game framework, each network strengthens its prediction capability until they reach an equilibrium.
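For reference, the zero-sum game described above is usually written as the minimax value function of the original GAN formulation [11]. The block below restates that standard objective; the symbols $p_{data}$ (data distribution), $p_z$ (noise prior) and $V(D, G)$ follow common GAN notation and are not taken from this text.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{data}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator $D$ raises the value by scoring real samples near 1 and generated samples near 0, while the generator $G$ lowers it by producing samples that $D$ scores near 1.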

Due to their inherent potential for capturing data distributions, there is a growing body of literature that recognizes the importance of GAN [20]. Training two networks jointly to reach an equilibrium, however, is not a straightforward procedure, causing training instability issues. Recently, there has been a surge of interest in addressing the instability issues via several empirical methodologies [21, 22]. An innovative and seminal work of Radford and Chintala [23] pioneered a new approach to stabilize GAN training by using fully convolutional layers and batch normalization [24] throughout the network. Another well-known attempt to stabilize GAN training is the use of Wasserstein loss in the training objective, which significantly improves the training issues [25, 26].

III-A2 Adversarial Auto-Encoders (AAE)

Conceptually similar to GAN, AAE consist of a generator and a discriminator network. The generator has a bow-tie architectural style comprising both an encoder and a decoder. The task of the generator is to reconstruct the input data by first down-sampling it into a latent representation, and then up-sampling the latent vector into the reconstructed data (image). The task of the discriminator network is to predict whether the input is a latent vector from the auto-encoder or from the arbitrarily initialized prior distribution. Training AAE provides superior reconstruction as well as the capability of controlling the latent space [27, 28, 20].

III-A3 Inference within GAN

A strong correlation has been demonstrated between the manipulation of the input noise vector and the output of the generator network [23, 29]. Similar latent space variables have demonstrably produced visually similar high-dimensional images [30]. One approach to finding the optimal latent vectors to create similar images is to inversely map images back to their hidden space via their gradients [31]. Alternatively, with an additional encoder network that down-samples high dimensional images into lower dimensional latent space, vanilla GAN are reported to be capable of learning inverse mapping [18]. Another way to learn inference via inverse mapping is to jointly train two networks such that the former maps images to latent space, while the latter maps this latent space representation back into higher dimensional image space [32]. Based on these previous findings, the primary aim of this paper is to explore inference within GAN by exploiting the latent vector representation in order to find a unique representation for a normal (non-anomalous) data distribution such that it can be statistically differentiated from unseen, unknown and varying abnormal (anomalous) data samples.

III-B Proposed Approach

III-B1 Problem Definition

This work proposes an unsupervised approach for anomaly detection.

We adversarially train our proposed convolutional network architecture in an unsupervised manner such that the conceptual model is trained on normal samples only, and yet tested on both normal and abnormal ones. Mathematically, we define and formulate our problem as follows:

An input dataset $\mathcal{D}$ is split into train $\mathcal{D}_{trn}$ and test $\mathcal{D}_{tst}$ sets such that $\mathcal{D}_{trn} = \{(x_i, y_i)\}_{i=1}^{m}$ contains $m$ normal samples, where $y_i = 0$ denotes the normal class. The test set $\mathcal{D}_{tst} = \{(\dot{x}_i, \dot{y}_i)\}_{i=1}^{n}$ comprises $n$ normal and abnormal samples, where $\dot{y}_i \in \{0, 1\}$ for normal and abnormal classes, respectively. In a practical setting, $m \gg n$.

Fig. 2: Overview of the proposed adversarial training procedure.

Based on the dataset defined above, we are to train our model on the training set $\mathcal{D}_{trn}$ and evaluate its performance on the test set $\mathcal{D}_{tst}$. The training objective ($\mathcal{L}$) of the model is to capture the distribution of $\mathcal{D}_{trn}$ within not only image space but also hidden latent vector space. Capturing the distribution within both dimensions by minimizing $\mathcal{L}$ enables the network to learn higher and lower level features that are unique to normal images. We hypothesize that defining an anomaly score $\mathcal{A}(\cdot)$ based on the training objective $\mathcal{L}$ would yield minimum anomaly scores for training (normal) samples, but higher scores for abnormal images. Hence a higher anomaly score $\mathcal{A}(\dot{x})$ for a given sample $\dot{x}$ would indicate that $\dot{x}$ is abnormal with respect to the distribution of normal data learned by the generator $G$ from $\mathcal{D}_{trn}$ during training.

III-B2 Pipeline

Fig. 3: Details of the proposed network architecture.

Figure 2 shows a high-level overview of the proposed approach, which comprises a generator ($G$) and a discriminator ($D$) network. The network $G$ adopts a bow-tie architecture using an encoder ($G_E$) and a decoder ($G_D$) network. The encoder network captures the distribution of the input data by mapping a high-dimensional image ($x$) into a lower-dimensional latent representation ($z$) such that $G_E : x \rightarrow z$, where $x \in \mathbb{R}^{w \times h \times c}$ and $z \in \mathbb{R}^{d}$. As illustrated in Figure 3, the network $G_E$ reads the input $x$ through five blocks containing Convolutional and BatchNorm layers as well as LeakyReLU activation functions and outputs the latent representation $z$, also known as the bottleneck features, which carries a unique representation of the input.

Being symmetrical to $G_E$, the decoder network $G_D$ up-samples the latent vector $z$ back to the input image dimension and reconstructs the output, denoted as $\hat{x}$. Motivated by [15], the decoder adopts a skip-connection approach such that the output of each down-sampling layer in the encoder network is concatenated to its corresponding up-sampling decoder layer (Figure 3). This use of skip connections provides substantial advantages via direct information transfer between the layers, preserving both local and global (multi-scale) information, and hence yielding better reconstruction.
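The concatenation pattern of such skip connections can be illustrated with a toy example. The sketch below is not the network of this paper: it works on 1-D lists, has no learned weights, and replaces convolution with pair averaging and transposed convolution with value repetition. All function names are hypothetical. It only shows how each encoder feature map is stored and later concatenated, channel-wise, with the decoder feature map of the same size.

```python
# Toy 1-D illustration of a skip-connected encoder-decoder (no learned weights).
# A "feature map" is a list of channels; each channel is a list of floats.

def downsample(channels):
    """Halve the length of every channel by averaging neighbouring pairs."""
    return [[(c[i] + c[i + 1]) / 2.0 for i in range(0, len(c) - 1, 2)]
            for c in channels]

def upsample(channels):
    """Double the length of every channel by repeating each value."""
    return [[v for v in c for _ in range(2)] for c in channels]

def concat(a, b):
    """Skip connection: stack the channels of two equally sized feature maps."""
    assert all(len(ch) == len(a[0]) for ch in a + b), "spatial sizes must match"
    return a + b

def encode_decode(x, depth=2):
    """Run `depth` down-sampling steps, then mirror them with skip concatenation."""
    skips = []
    h = [x]                        # a single input channel
    for _ in range(depth):
        skips.append(h)            # keep the encoder feature map at this scale
        h = downsample(h)
    bottleneck = h
    for skip in reversed(skips):
        # The decoder sees the coarse, up-sampled features and the fine
        # encoder features of the same resolution side by side.
        h = concat(upsample(h), skip)
    return bottleneck, h

if __name__ == "__main__":
    z, out = encode_decode([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    print("bottleneck:", z)
    print("output channels:", len(out))
```

Because the last concatenation includes the input-resolution encoder features, fine detail reaches the decoder output directly rather than only through the bottleneck.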

The second network within the pipeline, shown in Figure 3 (b), called discriminator ($D$), predicts the class label of the given input. In this context, its task is to classify real images ($x$) from the fake ones ($\hat{x}$), generated by the network $G$. The network architecture of the discriminator follows the same structure as the discriminator of the DCGAN approach presented in [23]. Besides being a classifier, the network $D$ is also used as a feature extractor such that latent representations of the input image $x$ and the reconstructed image $\hat{x}$ are computed. Extracting the features from the discriminator to perform inference within the latent space is the novel part of the proposed approach compared to the previous approaches [9, 12, 13].

Based on this multi-network architecture, explained above and shown in Figure 3, the next section describes the proposed training objective and inference scheme.

III-C Training Objective

As explained in Section III-B1, the idea proposed in this work is to train the model only on normal samples, and test on both normal and abnormal ones. The motivation is that we expect the model to be able to correctly reconstruct the normal samples in both image and latent vector space. The hypothesis is that the network is conversely expected to fail to reconstruct the abnormal samples as it is never trained on such abnormal examples. Hence, for abnormal samples, one would expect a higher loss for the reconstruction of the output image representation $\hat{x}$ or the latent representation $\hat{z}$. To validate this, we propose to combine three loss values (Adversarial, Contextual, Latent), each of which has its own contribution to make within the overall training objective.

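III-C1 Adversarial Loss

In order to maximize the reconstruction capability for the normal images during training, we utilize the adversarial loss proposed in [11]. This loss, shown in Equation 1, ensures that the network $G$ reconstructs a normal image $x$ to $\hat{x}$ as realistically as possible, while the discriminator network $D$ classifies the real and the (fake) generated samples. The task here is to minimize this objective for $G$, and maximize for $D$ to achieve $\min_{G} \max_{D} \mathcal{L}_{adv}$, where $\mathcal{L}_{adv}$ is denoted as

$$\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_x} \big[ \log D(x) \big] + \mathbb{E}_{x \sim p_x} \big[ \log \big( 1 - D(\hat{x}) \big) \big] \tag{1}$$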
III-C2 Contextual Loss

The adversarial loss defined in Section III-C1 pushes the model to generate realistic samples, but does not guarantee that it learns contextual information regarding the input. To explicitly learn this contextual information and sufficiently capture the input data distribution for the normal samples, we apply $L_1$ normalization to the input $x$ and the reconstructed output $\hat{x}$. This normalization ensures that the model is capable of generating images that are contextually similar to normal samples. The contextual loss of the training objective is shown below:

$$\mathcal{L}_{con} = \mathbb{E}_{x \sim p_x} \lVert x - \hat{x} \rVert_1 \tag{2}$$
III-C3 Latent Loss

With the adversarial and contextual losses defined above, the model is able to generate realistic and contextually similar images. In addition to these objectives, we aim to make the latent representations of the input $x$ and the generated normal samples $\hat{x}$ as similar as possible. This is to ensure that the network is capable of producing contextually sound latent representations for common examples. As depicted in Figure 3(b), we use the final convolutional layer $f(\cdot)$ of the discriminator $D$, and extract the features of $x$ and $\hat{x}$ to reconstruct their latent representations such that $z = f(x)$ and $\hat{z} = f(\hat{x})$. The latent representation loss therefore becomes:

$$\mathcal{L}_{lat} = \mathbb{E}_{x \sim p_x} \lVert f(x) - f(\hat{x}) \rVert_2 \tag{3}$$
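Finally, the total training objective becomes a weighted sum of the losses above:

$$\mathcal{L} = \lambda_{adv} \mathcal{L}_{adv} + \lambda_{con} \mathcal{L}_{con} + \lambda_{lat} \mathcal{L}_{lat} \tag{4}$$

where $\lambda_{adv}$, $\lambda_{con}$ and $\lambda_{lat}$ are the weighting parameters adjusting the dominance of the individual losses in the overall objective function.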
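As a worked illustration of how the three terms combine, the following minimal sketch evaluates the weighted objective for a single sample. It rests on several assumptions that the text here does not confirm: an L1 distance for the contextual term, an L2 distance for the latent term, and the standard GAN log-loss of [11] for the adversarial term. The function names and the default weights of 1.0 are placeholders, not the values used in the paper.

```python
import math

def l1_distance(a, b):
    """Sum of absolute differences (assumed form of the contextual term)."""
    return sum(abs(p - q) for p, q in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance (assumed form of the latent term)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """log D(x) + log(1 - D(x_hat)) for discriminator outputs in (0, 1)."""
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)

def total_objective(x, x_hat, f_x, f_x_hat, d_real, d_fake,
                    w_adv=1.0, w_con=1.0, w_lat=1.0):
    """Weighted sum of the adversarial, contextual and latent terms.

    x, x_hat     : flattened input and reconstructed image
    f_x, f_x_hat : discriminator features of x and x_hat
    d_real/fake  : discriminator probabilities for x and x_hat
    w_*          : placeholder weights (hypothetical defaults)
    """
    l_adv = adversarial_loss(d_real, d_fake)
    l_con = l1_distance(x, x_hat)
    l_lat = l2_distance(f_x, f_x_hat)
    return w_adv * l_adv + w_con * l_con + w_lat * l_lat

if __name__ == "__main__":
    loss = total_objective(
        x=[0.0, 1.0, 0.5], x_hat=[0.1, 0.9, 0.5],
        f_x=[1.0, 2.0], f_x_hat=[1.0, 2.5],
        d_real=0.8, d_fake=0.3,
    )
    print(round(loss, 4))
```

In an actual training loop the generator and discriminator would be updated in alternation on this objective; the sketch only shows the arithmetic of the weighted sum.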

III-D Inference

To find the anomalies during testing and subsequent deployment, we adopt the anomaly score $\mathcal{A}(\cdot)$, proposed in [9] and also employed in [12]. For a given test image $\dot{x}$, its anomaly score $\mathcal{A}(\dot{x})$ becomes:

$$\mathcal{A}(\dot{x}) = \lambda R(\dot{x}) + (1 - \lambda) L(\dot{x}) \tag{5}$$

where $R(\dot{x})$ is the reconstruction score measuring the contextual similarity between the input and the generated images based on Equation 2. $L(\dot{x})$ denotes the latent representation score measuring the difference between the input and generated images based on Equation 3. $\lambda$ is the weighting parameter controlling the relative importance of the score functions.

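Based on Equation 5, we then compute the anomaly scores for each individual test sample $\dot{x}$ in the test set $\mathcal{D}_{tst}$, and denote the result as the anomaly score vector $\mathbf{A}$ such that $\mathbf{A} = \{A_i : \mathcal{A}(\dot{x}_i), \dot{x}_i \in \mathcal{D}_{tst}\}$. Finally, following the same procedure proposed in [13], we also apply feature scaling to $\mathbf{A}$ to scale the anomaly scores within the probabilistic range of $[0, 1]$. Hence, the updated anomaly score for an individual test sample $\dot{x}$ becomes:

$$\acute{\mathcal{A}}(\dot{x}) = \frac{\mathcal{A}(\dot{x}) - \min(\mathbf{A})}{\max(\mathbf{A}) - \min(\mathbf{A})} \tag{6}$$

Equation 6 finally yields an anomaly score vector $\acute{\mathbf{A}}$ for the final evaluation of the test set $\mathcal{D}_{tst}$, which is explained in Sections IV-C and V.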
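The scoring and scaling steps can be sketched in a few lines. The example assumes that the per-sample score is a convex combination of a reconstruction score and a latent score controlled by a single weight, and that the feature scaling is min-max scaling over the whole test set. The names `anomaly_score`, `min_max_scale` and the default weight of 0.5 are hypothetical.

```python
def anomaly_score(recon_score, latent_score, lam=0.5):
    """Weighted combination of the two scores (assumed convex form)."""
    return lam * recon_score + (1.0 - lam) * latent_score

def min_max_scale(scores):
    """Rescale a list of raw anomaly scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                     # all scores equal: nothing to separate
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

if __name__ == "__main__":
    # (reconstruction score, latent score) pairs for four test samples
    raw_pairs = [(0.10, 0.20), (0.12, 0.25), (0.90, 1.10), (0.50, 0.40)]
    raw = [anomaly_score(r, l, lam=0.9) for r, l in raw_pairs]
    print(min_max_scale(raw))
```

Since the scaling uses the minimum and maximum over the test set, the scaled score of a sample depends on which other samples are evaluated with it.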

IV Experimental Setup

This section introduces the datasets, training and implementational details as well as the evaluation criteria used within the experimentation.

Model          bird   car    cat    deer   dog    frog   horse  plane  ship   truck
AnoGAN [9]     0.411  0.492  0.399  0.335  0.393  0.321  0.399  0.516  0.567  0.511
EGBAD [12]     0.383  0.514  0.448  0.374  0.481  0.353  0.526  0.577  0.413  0.555
GANomaly [13]  0.510  0.631  0.587  0.593  0.628  0.683  0.605  0.633  0.616  0.617
Proposed       0.448  0.953  0.607  0.602  0.615  0.931  0.788  0.797  0.659  0.907
TABLE I: AUC results for CIFAR-10 dataset (columns denote the abnormal class)

IV-A Datasets

To demonstrate the proof of concept of the proposed approach, we validate the model on three different datasets, each of which is explained in the following subsections.

We perform our evaluation using the benchmark CIFAR-10 dataset [16] and also the UBA and FFOB datasets [13]. Using CIFAR-10 we formulate a leave one class out anomaly detection problem. For the application context of X-ray baggage screening [33], the UBA and FFOB datasets from [13] are used to formulate an anomaly detection problem based on the concept of weapon threat items being an anomaly within the security screening process.

IV-A1 CIFAR-10

Experiments on the CIFAR-10 dataset follow the one-versus-the-rest approach, in which one class is treated as abnormal and the remaining classes as normal. Following this procedure yields ten different anomaly cases for CIFAR-10, each of which is trained on normal samples only and tested on a mix of normal and abnormal samples.
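A leave-one-class-out split of this kind can be sketched as follows. The helper `one_vs_rest_split` and its parameter `n_test_normal` are hypothetical; in particular, how many normal samples are held out for testing is an assumption of the sketch, not a figure from this text.

```python
def one_vs_rest_split(samples, labels, abnormal_class, n_test_normal):
    """Build one anomaly case: train on normal data only, test on a mix.

    Returns (train, test) as lists of (sample, is_abnormal) pairs, where
    is_abnormal is 0 for normal and 1 for abnormal samples.
    """
    normal = [s for s, y in zip(samples, labels) if y != abnormal_class]
    abnormal = [s for s, y in zip(samples, labels) if y == abnormal_class]
    # Hold out the first n_test_normal normal samples for testing.
    test = [(s, 0) for s in normal[:n_test_normal]] + [(s, 1) for s in abnormal]
    train = [(s, 0) for s in normal[n_test_normal:]]
    return train, test

if __name__ == "__main__":
    samples = list(range(10))             # stand-ins for images
    labels = [i % 5 for i in range(10)]   # five toy classes
    for cls in range(5):                  # one anomaly case per class
        train, test = one_vs_rest_split(samples, labels, cls, n_test_normal=3)
        print(cls, len(train), len(test))
```

Repeating the call once per class gives one anomaly case per class, which for CIFAR-10 corresponds to the ten cases reported in Table I.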

IV-A2 University Baggage Dataset (UBA)

This in-house dataset comprises 230,275 dual energy X-ray security image patches extracted via an overlapping sliding window approach. The dataset contains 3 abnormal sub-classes: knife (63,496), gun (45,855) and gun component (13,452). The normal class comprises 107,472 benign X-ray patches, split via an 80:20 train-test ratio.
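Overlapping sliding-window patch extraction can be illustrated with a small sketch. The window size and stride used for the dataset are not stated here, so `window` and `stride` below are hypothetical parameters; windows overlap whenever `stride` is smaller than `window`.

```python
def sliding_window_patches(image, window, stride):
    """Return (row, col, patch) for every full window over a 2-D image.

    image  : list of rows, each a list of pixel values
    window : side length of the square patch
    stride : step between neighbouring windows (stride < window => overlap)
    """
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height - window + 1, stride):
        for c in range(0, width - window + 1, stride):
            patch = [row[c:c + window] for row in image[r:r + window]]
            patches.append((r, c, patch))
    return patches

if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
    for r, c, patch in sliding_window_patches(image, window=2, stride=1):
        print((r, c), patch)
```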

IV-A3 Full Firearm vs Operational Benign (FFOB)

As presented in [13], we also evaluate the performance of the model on the UK government evaluation dataset [17], comprising both expertly concealed firearm (threat) items and operational benign (non-threat) imagery from commercial X-ray security screening operations (baggage/parcels). Denoted as FFOB, this dataset comprises 4,680 firearm full-weapons as full abnormal and 67,672 operational benign as full normal images, respectively.

IV-B Training Details

The training objective $\mathcal{L}$ from Equation 4 is optimized via the Adam [34] optimizer with an initial learning rate subject to a lambda decay, and momentum terms $\beta_1$ and $\beta_2$. The weighting parameters of $\mathcal{L}$, namely $\lambda_{adv}$, $\lambda_{con}$ and $\lambda_{lat}$, are set to the values empirically shown to yield the optimal performance (see Figure 9). The model is initially set to be trained for 15 epochs; however, in most cases it learns sufficient information within fewer training cycles. Therefore, we save the parameters of the network when the performance of the model starts to decrease, since this reduction is a strong indication of over-fitting. The model is implemented using PyTorch [35] (v0.5.1, Python 3.7.1, CUDA 9.3 and CUDNN 7.1). Experiments are performed using an NVIDIA Titan X GPU.
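The checkpoint-selection rule described above (keep the parameters from just before performance starts to drop) can be sketched as a simple early-stopping loop. This is one plausible reading of the procedure rather than the authors' code. The function name `best_checkpoint`, the `patience` parameter and the AUC values in the usage example are illustrative only.

```python
def best_checkpoint(auc_per_epoch, patience=1):
    """Return (epoch, auc) of the checkpoint to keep.

    Training is considered finished once the monitored AUC has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch, best_auc, bad_epochs = None, float("-inf"), 0
    for epoch, auc in enumerate(auc_per_epoch):
        if auc > best_auc:
            best_epoch, best_auc, bad_epochs = epoch, auc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break            # performance started to decrease: stop here
    return best_epoch, best_auc

if __name__ == "__main__":
    history = [0.60, 0.70, 0.65, 0.90]   # made-up validation AUC values
    print(best_checkpoint(history))
    print(best_checkpoint(history, patience=2))
```

A larger `patience` tolerates brief dips before stopping, at the cost of more training epochs.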

IV-C Evaluation

The performance of the model is evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) [36], a function plotted by the true positive rates (TPR) and false positive rates (FPR) with varying threshold values (as per prior work in the field [9, 12, 13]).
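For intuition, the AUC of the ROC curve equals the probability that a randomly chosen abnormal sample receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half. The sketch below computes it directly from that definition. It is a plain-Python illustration with a hypothetical function name, quadratic in the number of samples, and not the evaluation code of the paper.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of abnormal (1) and normal (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one abnormal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # abnormal sample correctly ranked higher
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of normal and abnormal scores.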

V Results

For the CIFAR-10 dataset, Table I and Figure 4 demonstrate that, with the exception of the abnormal classes bird and dog, the proposed model yields superior results to the prior work.

Fig. 4: AUC results for CIFAR-10 dataset. Shaded areas in the plot represent variations due to the use of 3 random seeds.

Table II presents the experimental results for the UBA and FFOB datasets. It is apparent from this table that the proposed method outperforms the prior work in every anomaly case of the datasets. Of significance, the best AUC of the prior work is 0.599 for the most challenging abnormality case, knife, while the method proposed here achieves an AUC of 0.904.

Method         gun    gun-parts  knife  overall  full-weapon
AnoGAN [9]     0.598  0.511      0.599  0.569    0.703
EGBAD [12]     0.614  0.591      0.587  0.597    0.712
GANomaly [13]  0.747  0.662      0.520  0.643    0.882
Proposed       0.972  0.945      0.904  0.940    0.903
TABLE II: AUC results for UBA (gun, gun-parts, knife, overall) and FFOB (full-weapon) datasets

Figure 7 depicts exemplar test images for the datasets used in the experimentation. A significant result emerging from the examples presented within Figure 7 is that the proposed model is capable of generating both normal and abnormal reconstructed outputs at test time, meaning that it captures the distribution of both domains. This is probably due to the use of skip connections enabling reconstruction even for the abnormal test samples.

The qualitative results of Figure 7, supported by the quantitative results of Table II, reveal that abnormality detection is successfully made in the latent object space of the model that emerges from our adversarial training over the proposed skip-connected architecture.

Fig. 5: (a) Histogram of the normal and abnormal scores for the test data.
Fig. 6: (b) t-SNE plot of the 1000 subsampled normal and abnormal features extracted from the last convolutional layer of the discriminator (Figure 3).

Figures 5 and 6 show the histogram plot (a) of the normal and abnormal scores for the test data, and the t-SNE plot (b) of the normal and abnormal features extracted from the last convolutional layer of the discriminator (see Figure 3). Closer inspection of the figures reveals that the model yields promising separation within both the output anomaly (reconstruction) score and the preceding convolutional feature spaces.

Overall, these results indicate that the proposed approach yields superior anomaly detection performance to the previous state-of-the-art approaches.

Fig. 7: Exemplar test images for CIFAR-10, UBA and FFOB datasets when the abnormalities are car, gun-gun component-knife and gun, respectively. Despite the model’s capability of generating even abnormal samples, the proposed model is able to detect abnormality within latent object space.
Fig. 8: Hyper-parameter tuning for the model.
Fig. 9: Hyper-parameter tuning for the weighting parameters of the training objective.

VI Conclusion

This paper introduces a novel unsupervised anomaly detection architecture within an adversarial training scheme. The proposed approach examines the role of skip connections within the generator and feature extraction from the discriminator for the manipulation of hidden features. Based on an evaluation across multiple datasets of differing domain and complexity, the findings indicate that skip connections provide more stable training, and that inference based on features learned by the discriminator achieves numerically superior results to the previous state-of-the-art methods. The empirical findings in this study provide an insight into the generalization capability of the proposed method to other anomaly detection tasks. Further research could also be conducted to determine the effectiveness of the proposed approach on both higher resolution images and various other anomaly detection tasks containing temporal information.