Inverse-Transform AutoEncoder for Anomaly Detection

11/25/2019 ∙ by Chaoqing Huang, et al. ∙ Shanghai Jiao Tong University

Reconstruction-based methods have recently shown great promise for anomaly detection. We here propose a new transform-based framework for anomaly detection. A selected set of transformations based on human priors is used to erase certain targeted information from input data. An inverse-transform autoencoder is trained with the normal data only to embed the corresponding erased information during the restoration of the original data. The normal and anomalous data are thus expected to be differentiable based on restoration errors. Extensive experiments demonstrate that the proposed method significantly outperforms several state-of-the-art methods on multiple benchmark datasets, especially on ImageNet, increasing the AUROC of the top-performing baseline by 10.1%. We also evaluate our method on a real-world anomaly detection dataset MVTec AD and a video anomaly detection dataset ShanghaiTech to validate the effectiveness of the method in real-world environments.




1 Introduction

Anomaly detection, with broad application in medical diagnosis, network intrusion detection, credit card fraud detection, sensor network fault detection and numerous other fields [8], has recently received significant attention in the machine learning community. Considering the scarcity and diversity of anomalous data, anomaly detection is usually modeled as an unsupervised learning problem or a one-class classification problem [31], i.e. the training dataset contains only “normal” data and the anomalous data is not available during training.

Figure 1: A selected set of transformations based on human priors is used to erase certain targeted information from the original images. ITAE needs to restore the original images from the information-erased inputs, which forces ITAE to embed the targeted information.

With the recent advances in deep neural networks, reconstruction-based methods [35, 1, 33] have shown great promise for anomaly detection. Most reconstruction-based methods adopt an autoencoder [26] and assume that normal and anomalous samples lead to significantly different embeddings, so that the corresponding reconstruction errors can be leveraged to differentiate the two types of samples [34]. However, this assumption may not always hold. Reconstruction-based methods can fail when the autoencoder captures mostly the patterns shared between normal and anomalous data and thus “generalizes” too well [12, 42], which results in good reconstruction of both normal and anomalous data. To mitigate this drawback, Perera et al. [27] proposed to explicitly constrain the latent space to exclusively represent the given normal class but not the anomalous class, which enlarges the gap of reconstruction errors between normal and anomalous data. This suggests that it is important to control what information is embedded in the autoencoder.

As shown in Figure 1, we here propose a transform-based method for anomaly detection, which leverages transformations as a way to control the information embedded in the autoencoder and further enlarges the gap of reconstruction errors between normal and anomalous data. Different from [27], which controls the embedding by constraining the latent space, we apply a selected set of transformations (e.g. graying and random rotation) to erase certain targeted information from images, and further leverage the reconstruction capability of the autoencoder to embed the targeted information during the restoration of the original images. The inverse-transform autoencoders (named ITAE hereafter) are trained with normal samples only and thus embed only key information for the normal class.

For ITAE to succeed at anomaly detection, the transformations need to satisfy a vital criterion: the information erased by the transformations must be able to distinguish between normal and anomalous data. Otherwise, the indistinguishable erased information will lead to similar information embedding, resulting in similar restoration errors for both normal and anomalous data. Because anomaly detection is unsupervised in nature, it is difficult to know ahead of time which transformation will work. We thus propose to simultaneously adopt a simple set of universal transformations based on human priors. For example, for the task of handwritten digit recognition, orientation is important for the normal class “6” based on human priors. Applying a random rotation transformation to the digits can force the ITAE to embed orientation information during restoration. When an anomalous sample (such as the digit “9”) is fed to the ITAE, it is likely to be restored to “6”, as the model cannot tell whether it is a “9” or a rotated “6”, hence leading to large restoration errors.

To validate the effectiveness of ITAE, we conduct extensive experiments on several benchmarks and compare it with state-of-the-art methods. Our experimental results show that ITAE outperforms state-of-the-art methods in terms of model accuracy and model stability on different tasks. To further evaluate on more challenging tasks, we experiment with the large-scale dataset ImageNet [32] and show that ITAE improves the AUROC of the top-performing baseline by 10.1%. Experiments on a real-world anomaly detection dataset, MVTec AD [4], and a recent video anomaly detection benchmark dataset, ShanghaiTech [23], show that ITAE is more adaptable to complex real-world environments.

2 Related Works

For anomaly detection on images or videos, a large variety of methods have been developed in recent years [7, 24, 25, 8, 28, 19]. Popular methods focusing on anomaly detection in still images, which we study in this paper, can be grouped into three types: statistics-based, reconstruction-based and classification-based approaches.

Statistics-based approaches: To distinguish the normal data from anomalous data, some conventional methods [10, 39, 29, 38] tended to describe the normal data with statistical approaches. Through training, a distribution function was fitted to the features extracted from the normal data to represent them in a shared latent space. During testing, samples mapped to different statistical representations are considered anomalous.

Reconstruction-based approaches: Many works approached the anomaly detection task through reconstruction [3, 36, 35, 42, 9]. Based on autoencoders, these methods compressed normal samples into a lower-dimensional representation space and then reconstructed higher-dimensional outputs. The normal and anomalous samples were distinguished through their reconstruction errors. Sabokrou et al. [33] and Akcay et al. [2] employed adversarial training to optimize the autoencoder and leveraged its discriminator to further enlarge the reconstruction error gap between normal and anomalous data. Furthermore, Akcay et al. [2] leveraged another encoder to embed the reconstruction results into a subspace in which the reconstruction error is calculated. Gong et al. [12] augmented the autoencoder with a memory module and developed an improved autoencoder, called the memory-augmented autoencoder, to strengthen reconstruction errors on anomalies. Perera et al. [27] applied two adversarial discriminators and a classifier on a denoising autoencoder. By adding constraints and forcing each randomly drawn latent code to reconstruct examples like the normal data, it obtained high reconstruction errors for the anomalous data.

Classification-based approaches: Some approaches tackled the anomaly detection problem through classification. Hendrycks [15] observed that anomaly detection can still benefit from introducing extra data under a classification framework, even though the extra data is limited in quantity and only weakly correlated with the normal data. Lee et al. [22] used Kullback-Leibler (KL) divergence to guide a GAN to generate anomalous data closer to the normal data, leading to a better training set for the classification method. Golan et al. [11] applied dozens of geometric image transformations and created a self-labeled dataset for transformation classification, so that the model tends to capture the patterns of different transformations. Different from [11], the transformations in our work are used to erase information from the input data.

3 Inverse-Transform AutoEncoder

3.1 Problem Statement

In this section, we first formulate the problem of anomaly detection on images. Let $\mathcal{X}$, $\mathcal{X}_n$ and $\mathcal{X}_a$ denote the entire dataset, the normal dataset and the anomalous dataset, respectively, where $\mathcal{X}_n \cup \mathcal{X}_a = \mathcal{X}$ and $\mathcal{X}_n \cap \mathcal{X}_a = \emptyset$. Given any image $x \in \mathcal{X}$, where $x \in \mathbb{R}^{C \times H \times W}$ and $C$, $H$ and $W$ denote the number of image channels, the height and the width, the goal is to build a model for discriminating whether $x \in \mathcal{X}_n$ or $x \in \mathcal{X}_a$. We formulate this process from a probabilistic view: (1) draw an image $x$ from the dataset $\mathcal{X}$; (2) test whether this sample comes from the distribution of normal data or of anomalous data, i.e. $x \sim p_n$ or $x \sim p_a$. Thus, our model parameterizes the posterior and simultaneously tests whether the given sample belongs to the marginal distribution.

Figure 2: Pipeline for anomaly detection with Inverse-Transform AutoEncoder with mathematical expression.

3.2 Model Architectures

In this section, we present the Inverse-Transform AutoEncoder (ITAE) in detail. ITAE is based on an encoder-decoder framework to restore the samples from some information-erasing transformations. The restoration errors are used for anomaly detection.

To achieve effective anomaly detection, we design some efficient transformations based on human priors to erase the crucial and distinctive information of input samples. Commonly, the transformation satisfies two conditions:

  • The transformations need to erase information; otherwise the transformations are linearly invertible and ITAE degenerates to the linear inverse transformations. The composition of the transformation and ITAE is then an identity mapping, causing low restoration error for both normal and anomalous samples.

  • The transformations should erase distinctive information, which is the key to differentiating normal and anomalous data; otherwise the model tends to restore more common information, leading to similar restoration errors for any data.
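The first condition can be illustrated with a small sketch. This is only an illustration under simple assumptions, not the authors' code: the helper names `permute_channels`, `unpermute_channels` and `gray` are hypothetical, and a pixel is modeled as a tuple of integer channel values. A channel permutation is exactly invertible and erases nothing, whereas channel averaging maps different colors to the same value.

```python
def permute_channels(pixel):
    """Invertible linear transform: reorder (R, G, B) -> (B, R, G)."""
    r, g, b = pixel
    return (b, r, g)


def unpermute_channels(pixel):
    """Exact inverse of permute_channels."""
    b, r, g = pixel
    return (r, g, b)


def gray(pixel):
    """Information-erasing transform: average over the channel dimension."""
    return sum(pixel) / len(pixel)


reddish = (200, 50, 50)
bluish = (50, 50, 200)

# The permutation loses nothing: the input is recovered exactly, so a
# restorer could simply learn the fixed inverse mapping.
recovered = unpermute_channels(permute_channels(reddish))

# Graying collapses both colors to one value, so the gray value alone
# cannot determine the original color.
same_gray = gray(reddish) == gray(bluish)
print(recovered, same_gray)
```

Because the gray value no longer determines the color, a restorer has to fall back on what it learned from normal data, which is the behavior the two conditions aim for.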

The proposed ITAE forces the autoencoder to embed the erased information in the model and restore the normal samples. Suppose we have a set of transformations $\mathcal{T} = \{T_1, \dots, T_K\}$, where $T_i$ denotes the $i$th transformation and $K$ is the number of operations. In the training phase, given $x$ sampled from the normal dataset $\mathcal{X}_n$ and any transformation $T_i \in \mathcal{T}$, let the transformed sample be $\tilde{x}_i = T_i(x)$. The proposed ITAE takes the transformed samples $\tilde{x}_i$ as inputs and attempts to inversely restore the original training samples $x$. Mathematically, given $\tilde{x}_i$, the restored sample $\hat{x}_i$ is formulated as

$$\hat{x}_i = D(E(\tilde{x}_i)) = D(E(T_i(x))),$$

where $E$ and $D$ indicate the encoder and decoder of ITAE. By minimizing the likelihood-based restoration loss, ITAE is forced to capture the specific pattern of the inverse transformation to obtain the original normal training data. The corresponding structure is shown in Figure 2. Note that while ITAE is employed for the inverse transform, it differs from existing autoencoder-based anomaly detection methods in that the inputs and outputs are asymmetrical, i.e. ITAE needs to restore information erased by the transformations.

In the testing phase, both normal and anomalous data are fed into the model. We design a metric based on the restoration error to decide whether a sample belongs to the normal set. We expect the restorations of normal samples to show much smaller errors than those of anomalous samples, owing to the specific inverse-transform scheme.

3.3 Training Loss and Error Measurement

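To train ITAE for effective anomaly detection, the $\ell_2$ loss is used to measure the distance between the restored samples and the targets, since it is smoother and puts more penalty on the dimensions with larger generation errors. Let the target image be $x$; the training loss is formulated as

$$\mathcal{L} = \mathbb{E}_{x \sim p_n,\, T_i \in \mathcal{T}} \, \| \hat{x}_i - x \|_2^2,$$

where $\hat{x}_i$ is the ITAE restoration of the transformed sample $T_i(x)$, $p_n$ is the distribution of normal data and $\|\cdot\|_2$ denotes the $\ell_2$ norm. We use Monte Carlo sampling to approximate the expectation by averaging the costs over the samples and transformations in each mini-batch.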
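The mini-batch approximation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a squared-error (ℓ2-type) loss, which matches the description of a smooth loss that penalizes large errors more heavily, it represents images as flat lists of floats, and the names `l2_loss`, `batch_training_loss` and the `restore` callable (standing in for encoder plus decoder) are hypothetical.

```python
import random


def l2_loss(restored, target):
    """Mean squared error between two flat lists of pixel values."""
    return sum((r - t) ** 2 for r, t in zip(restored, target)) / len(target)


def batch_training_loss(batch, transforms, restore, rng):
    """Monte Carlo estimate of the training loss over one mini-batch.

    For every image one transformation is sampled, the transformed image is
    passed through `restore`, and the output is compared with the ORIGINAL
    image, not with the transformed input.
    """
    total = 0.0
    for image in batch:
        transform = rng.choice(transforms)
        total += l2_loss(restore(transform(image)), image)
    return total / len(batch)


# Toy usage: the only transformation reverses the pixels, and the stand-in
# "model" is the identity, so asymmetric images are penalized.
def reverse(image):
    return image[::-1]


def identity(image):
    return image


batch = [[0.0, 1.0], [1.0, 1.0]]
loss = batch_training_loss(batch, [reverse], identity, random.Random(0))
print(loss)
```

The point of the sketch is the asymmetry between input and target: the loss is always measured against the untransformed image.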

In the test phase, we calculate the restoration error of each input image for anomaly detection. We note that the $\ell_1$ loss is more suitable for measuring the distance between outputs and original images. For each $T_i$ in the transformation set $\mathcal{T}$, we first calculate the expected restoration error of the normal training data; then we use this error to normalize the test restoration error corresponding to $T_i$; finally, we calculate the expectation of the normalized errors across the different transformations. Let the test sample be $x_t$; the restoration error is formulated as

$$\mathcal{E}(x_t) = \mathbb{E}_{T_i \in \mathcal{T}} \left[ \frac{\| \hat{x}_{t,i} - x_t \|_1}{\mathbb{E}_{x \sim p_n} \| \hat{x}_i - x \|_1} \right],$$

where $\hat{x}_{t,i}$ and $\hat{x}_i$ are the ITAE restorations of $T_i(x_t)$ and $T_i(x)$, $p_n$ indicates the distribution of normal data, which is consistent with the distribution of the training set, and $\|\cdot\|_1$ denotes the $\ell_1$ norm. We again approximate the expectations by averaging. A normal sample leads to a low restoration error; the higher $\mathcal{E}(x_t)$ is, the more likely the sample is anomalous.
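The three steps of the test-time score (per-transformation training error, normalization, averaging over transformations) can be sketched as below. This is a hedged illustration, not the authors' code: it assumes an absolute-error (ℓ1-type) distance, deterministic transformations, images stored as flat lists, and hypothetical names `l1_error`, `mean_train_error`, `anomaly_score` and `restore`.

```python
def l1_error(restored, original):
    """Mean absolute error between two flat lists of pixel values."""
    return sum(abs(r - o) for r, o in zip(restored, original)) / len(original)


def mean_train_error(train_images, transform, restore):
    """Average restoration error of normal training data for one transform."""
    errors = [l1_error(restore(transform(x)), x) for x in train_images]
    return sum(errors) / len(errors)


def anomaly_score(test_image, train_images, transforms, restore):
    """Average over transforms of (test error / mean training error)."""
    ratios = []
    for transform in transforms:
        reference = mean_train_error(train_images, transform, restore)
        test_error = l1_error(restore(transform(test_image)), test_image)
        ratios.append(test_error / reference)
    return sum(ratios) / len(ratios)


# Toy usage with two hand-made transformations and an identity "model".
def halve(image):
    return [0.5 * v for v in image]


def zero_out(image):
    return [0.0 for _ in image]


def identity(image):
    return image


train = [[1.0, 1.0]]
normal_like = anomaly_score([1.0, 1.0], train, [halve, zero_out], identity)
unusual = anomaly_score([2.0, 2.0], train, [halve, zero_out], identity)
print(normal_like, unusual)
```

Dividing by the training error puts transformations with very different error scales on a comparable footing before they are averaged. The sketch does not guard against a zero training error, which a real implementation would need to handle.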

4 Experiments

In this section, we conduct substantial experiments to validate our method. ITAE is first evaluated on multiple commonly used benchmark datasets under unsupervised anomaly detection settings, and on the large-scale dataset ImageNet [32], which is rarely examined in previous anomaly detection studies. Next, we conduct experiments on real anomaly detection datasets to evaluate the performance in real-world environments. Then we present the respective effects of different designs (e.g. different types of image-level transformation and loss function design) through an ablation study. Finally, the stability of our models is validated by monitoring performance fluctuation during the training process and comparing the final performance after convergence in multiple training attempts, all from random weights and with the same training configuration.

MNIST
Method             0     1     2     3     4     5     6     7     8     9   avg     SD
VAE [18]        92.1  99.9  81.5  81.4  87.9  81.1  94.3  88.6  78.0  92.0  87.7   7.05
AnoGAN [35]     99.0  99.8  88.8  91.3  94.4  91.2  92.5  96.4  88.3  95.8  93.7   4.00
ADGAN [9]       99.5  99.9  93.6  92.1  94.9  93.6  96.7  96.8  85.4  95.7  94.7   4.15
GANomaly [1]    97.2  99.6  85.1  90.6  94.9  94.9  97.1  93.9  79.7  95.4  92.8   6.12
OCGAN [27]      99.8  99.9  94.2  96.3  97.5  98.0  99.1  98.1  93.9  98.1  97.5   2.10
GeoTrans [11]   98.2  91.6  99.4  99.0  99.1  99.6  99.9  96.3  97.2  99.2  98.0   2.50
AE              98.8  99.3  91.7  88.5  86.2  85.8  95.4  94.0  82.3  96.5  91.9   5.90
OURS            98.6  99.9  99.0  99.1  98.1  98.1  99.7  99.0  93.6  97.8  98.3   1.78

Fashion-MNIST
Method             0     1     2     3     4     5     6     7     8     9   avg     SD
DAGMM [40]      42.1  55.1  50.4  57.0  26.9  70.5  48.3  83.5  49.9  34.0  51.8  16.47
DSEBM [42]      91.6  71.8  88.3  87.3  85.2  87.1  73.4  98.1  86.0  97.1  86.6   8.61
ADGAN [9]       89.9  81.9  87.6  91.2  86.5  89.6  74.3  97.2  89.0  97.1  88.4   6.75
GANomaly [1]    80.3  83.0  75.9  87.2  71.4  92.7  81.0  88.3  69.3  80.3  80.9   7.37
GeoTrans [11]   99.4  97.6  91.1  89.9  92.1  93.4  83.3  98.9  90.8  99.2  93.5   5.22
AE              71.6  96.9  72.9  78.5  82.9  93.1  66.7  95.4  70.0  80.7  80.9  11.03
OURS            92.7  99.3  89.1  93.6  90.8  93.1  85.0  98.4  97.8  98.4  93.9   4.70

CIFAR-10
Method             0     1     2     3     4     5     6     7     8     9   avg     SD
VAE [18]        62.0  66.4  38.2  58.6  38.6  58.6  56.5  62.2  66.3  73.7  58.1  11.50
DAGMM [40]      41.4  57.1  53.8  51.2  52.2  49.3  64.9  55.3  51.9  54.2  53.1   5.95
DSEBM [42]      56.0  48.3  61.9  50.1  73.3  60.5  68.4  53.3  73.9  63.6  60.9   9.10
AnoGAN [35]     61.0  56.5  64.8  52.8  67.0  59.2  62.5  57.6  72.3  58.2  61.2   5.68
ADGAN [9]       63.2  52.9  58.0  60.6  60.7  65.9  61.1  63.0  74.4  64.4  62.4   5.56
GANomaly [1]    93.5  60.8  59.1  58.2  72.4  62.2  88.6  56.0  76.0  68.1  69.5  13.08
OCGAN [27]      75.7  53.1  64.0  62.0  72.3  62.0  72.3  57.5  82.0  55.4  65.6   9.52
GeoTrans [11]   74.7  95.7  78.1  72.4  87.8  87.8  83.4  95.5  93.3  91.3  86.0   8.52
AE              57.1  54.9  59.9  62.3  63.9  57.0  68.1  53.8  64.4  48.6  59.3   5.84
OURS            78.5  89.8  86.1  77.4  90.5  84.5  89.2  92.9  92.0  85.5  86.6   5.35

ImageNet
Method             0     1     2     3     4     5     6     7     8     9   avg     SD
GANomaly [1]    58.9  57.5  55.7  57.9  47.9  61.2  56.8  58.2  49.7  48.8  55.3   4.46
GeoTrans [11]   72.9  61.0  66.8  82.0  56.7  70.1  68.5  77.2  62.8  83.6  70.1   8.43
AE              57.1  51.3  47.7  57.4  43.8  54.9  54.6  51.3  48.3  41.5  50.8   5.16
OURS            71.9  85.8  70.7  78.8  69.5  83.3  80.6  72.4  74.9  84.3  77.2   5.77

CIFAR-100 (superclasses 0 to 10)
Method             0     1     2     3     4     5     6     7     8     9    10
DAGMM [40]      43.4  49.5  66.1  52.6  56.9  52.4  55.0  52.8  53.2  42.5  52.7
DSEBM [42]      64.0  47.9  53.7  48.4  59.7  46.6  51.7  54.8  66.7  71.2  78.3
ADGAN [9]       63.1  54.9  41.3  50.0  40.6  42.8  51.1  55.4  59.2  62.7  79.8
GANomaly [1]    57.9  51.9  36.0  46.5  46.6  42.9  53.7  59.4  63.7  68.0  75.6
GeoTrans [11]   74.7  68.5  74.0  81.0  78.4  59.1  81.8  65.0  85.5  90.6  87.6
AE              66.7  55.4  41.4  49.2  44.9  40.6  50.2  48.1  66.1  63.0  52.7
OURS            77.5  70.0  62.4  76.2  77.7  64.0  86.9  65.6  82.7  90.2  85.9

CIFAR-100 (superclasses 11 to 19)
Method            11    12    13    14    15    16    17    18    19   avg     SD
DAGMM [40]      46.4  42.7  45.4  57.2  48.8  54.4  36.4  52.4  50.3  50.5   6.55
DSEBM [42]      62.7  66.8  52.6  44.0  56.8  63.1  73.0  57.7  55.5  58.8   9.36
ADGAN [9]       53.7  58.9  57.4  39.4  55.6  63.3  66.7  44.3  53.0  54.7  10.08
GANomaly [1]    57.6  58.7  59.9  43.9  59.9  64.4  71.8  54.9  56.8  56.5   9.94
GeoTrans [11]   83.9  83.2  58.0  92.1  68.3  73.5  93.8  90.7  85.0  78.7  10.76
AE              62.1  59.6  49.8  48.1  56.4  57.6  47.2  47.1  41.5  52.4   8.11
OURS            83.5  84.6  67.6  84.2  74.1  80.3  91.0  85.3  85.4  78.8   8.82

Table 1: Average area under the ROC curve (AUROC) in % of anomaly detection methods. For every dataset, each model is trained on a single class and tested against all other classes. “SD” means standard deviation among classes. The best performing method is in bold.

4.1 Experiments on Popular Benchmarks

4.1.1 Experimental Setups

Datasets. Our experiments involve five popular image datasets: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100 and ImageNet. For all datasets, the default training and test partitions are kept. In addition, pixel values of all images are normalized to a fixed range. We introduce these five datasets briefly as follows:

  • MNIST [21]: consists of 70,000 handwritten grayscale digit images.

  • Fashion-MNIST [37]: a relatively new dataset comprising grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category.

  • CIFAR-10 [20]: consists of 60,000 RGB images of 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, divided in a uniform proportion across all classes.

  • CIFAR-100 [20]: consists of 100 classes, each of which contains 600 RGB images. The 100 classes in CIFAR-100 are grouped into 20 “superclasses” to make the experiment more concise and the data volume of each selected “normal class” larger.

  • ImageNet [32]: We group the data from the ILSVRC 2012 classification dataset [32] into 10 superclasses by merging similar category labels using Latent Dirichlet Allocation (LDA) [5], a natural language processing method (see the appendix for more details). We note that little anomaly detection research has been conducted on ImageNet, since its images have higher resolution and more complex backgrounds.

Model configuration. The detailed structure of the autoencoders used as our baseline model can be found in the appendix. We follow the settings in [30, 2, 17] and add skip-connections between some encoder layers and the corresponding decoder layers to facilitate the backpropagation of the gradient, in an attempt to improve the performance of image restoration. We use the stochastic gradient descent (SGD) optimizer with default hyperparameters in PyTorch. ITAE is trained using a batch size of 32 for a number of epochs that depends on the number of transformations we used. The learning rate is initially set to 0.1, and is divided by 2 every epoch. In our experiments, we use a transformation which contains two tandem operations:

  • Graying: This operation averages each pixel value along the channel dimension of images.

  • Random rotation: This operation rotates each image anticlockwise by an angle around the center of each image channel. The rotation angle is randomly selected from a fixed set of angles.

The graying operation erases channel information, and the random rotation operation erases the orientation of objects. Both of them meet the conditions introduced in Section 3.2. For example, random rotation is not linearly invertible (first condition) because of the randomness. Meanwhile, it removes the orientation of objects, which may be an important characteristic of an object (second condition).
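The two tandem operations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: images are nested lists in channel-height-width order, the angle set `ANGLES` of right-angle rotations is an assumption made for this example, and the function names `gray`, `rotate90_ccw`, `rotate` and `erase` are hypothetical.

```python
import random

# Assumed angle set for this example only; right-angle rotations keep the
# pixel grid intact without interpolation.
ANGLES = (0, 90, 180, 270)


def gray(image):
    """Average a C x H x W image (nested lists) over channels -> H x W."""
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    return [[sum(image[c][i][j] for c in range(channels)) / channels
             for j in range(width)] for i in range(height)]


def rotate90_ccw(plane):
    """Rotate an H x W plane by 90 degrees anticlockwise."""
    height, width = len(plane), len(plane[0])
    return [[plane[i][width - 1 - j] for i in range(height)]
            for j in range(width)]


def rotate(plane, angle):
    """Rotate anticlockwise by a multiple of 90 degrees."""
    for _ in range((angle // 90) % 4):
        plane = rotate90_ccw(plane)
    return plane


def erase(image, rng):
    """Graying followed by a random rotation; returns (model input, angle)."""
    angle = rng.choice(ANGLES)
    return rotate(gray(image), angle), angle


# Toy 3 x 2 x 2 image.
image = [[[0, 3], [6, 9]],
         [[3, 3], [3, 3]],
         [[6, 3], [0, 0]]]
model_input, angle = erase(image, random.Random(0))
print(angle, model_input)
```

During training, the original `image` rather than `model_input` would serve as the restoration target, so the model has to supply both the color and the orientation itself.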

Evaluation protocols. In our experiments, we quantify the model performance using the area under the Receiver Operating Characteristic (ROC) curve metric (AUROC). It is commonly adopted as performance measurement in anomaly detection tasks and eliminates the subjective decision of threshold value to divide the “normal” samples from the anomalous ones.
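For readers unfamiliar with the metric, the following sketch computes AUROC directly from its probabilistic interpretation: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The function name `auroc` and the label convention (1 for anomalous, 0 for normal) are assumptions of this example; a library routine would normally be used instead of this quadratic-time loop.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; labels: 1 = anomalous, 0 = normal.

    Counts the fraction of (anomalous, normal) pairs in which the anomalous
    sample has the higher score, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Restoration errors used as anomaly scores for four toy samples.
example = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example)
```

No threshold appears anywhere in the computation, which is why the metric avoids the subjective choice of a decision boundary.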

4.1.2 Comparison with State-of-the-art Methods

For a dataset with $n$ classes, we conduct a batch of $n$ experiments, with each of the classes set as the “normal” class once. We then evaluate performance on an independent test set, which contains samples from all classes, including normal and anomalous data. As all classes have equal numbers of samples in our selected datasets, the overall ratio of normal to anomalous samples is simply $1 : (n - 1)$.
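The one-class-versus-rest protocol can be sketched as a relabeling of the test set, once per class. This is a minimal illustration with hypothetical helper names (`one_vs_rest_labels`, `experiments`) and a toy balanced label list, not the authors' evaluation script.

```python
def one_vs_rest_labels(class_labels, normal_class):
    """Map class labels to 0 (normal) or 1 (anomalous) for one experiment."""
    return [0 if c == normal_class else 1 for c in class_labels]


def experiments(class_labels):
    """Yield (normal_class, binary_labels) once per class in the dataset."""
    for normal_class in sorted(set(class_labels)):
        yield normal_class, one_vs_rest_labels(class_labels, normal_class)


# Toy balanced test set: 10 classes with 3 samples each.
test_labels = [c for c in range(10) for _ in range(3)]
runs = list(experiments(test_labels))

for normal_class, binary in runs:
    normal = binary.count(0)
    anomalous = binary.count(1)
    print(normal_class, normal, anomalous)
```

With balanced classes every run contains one share of normal samples and nine shares of anomalous ones, so the test set is heavily skewed toward anomalies.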

In Table 1, we provide results on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 in detail. Several popular methods are included in the comparison: VAE [18], DAGMM [40], DSEBM [42], AnoGAN [35], ADGAN [9], GANomaly [1], OCGAN [27], GeoTrans [11] and our baseline backbone AE. Results of VAE, AnoGAN and ADGAN are borrowed from [9]. Results of DAGMM, DSEBM and GeoTrans are borrowed from [11]. We use the officially released source code of GANomaly to fill in the incomplete results reported in [1] with our experimental settings. For RGB datasets, such as CIFAR-10 and CIFAR-100, we use graying and random rotation operations in tandem, together with some standard data augmentations (flipping/mirroring/shifting), which are widely used in [14, 16]. For grayscale datasets, such as MNIST and Fashion-MNIST, we only use the random rotation transformation, without any data augmentation.

Figure 3: Comparison of frames per second (FPS) (horizontal coordinates), GPU memory usage (circle sizes) and AUROC for anomaly detection (vertical coordinates) of various methods tested on CIFAR-10. ITAE takes up relatively little GPU memory, and its FPS is relatively high.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 avg
GeoTrans [11] 74.4 67.0 61.9 84.1 63.0 41.7 86.9 82.0 78.3 43.7 35.9 81.3 50.0 97.2 61.1 67.2
GANomaly [1] 89.2 73.2 70.8 84.2 74.3 79.4 79.2 74.5 75.7 69.9 78.5 70.0 74.6 65.3 83.4 76.2
AE 65.4 61.9 82.5 79.9 77.3 73.8 64.6 86.8 63.9 64.1 73.1 63.7 99.9 76.9 97.0 75.4
OURS 94.1 68.1 88.3 86.2 78.6 73.5 84.3 87.6 83.2 70.6 85.5 66.7 100 100 92.3 83.9
Table 2: Average area under the ROC curve (AUROC) in % of anomaly detection methods on MVTec AD [4] dataset. The best performing method in each experiment is in bold.

On all involved datasets, the average AUROC of ITAE exceeds that of all other methods, to different extents. For each individual image class, we also obtain competitive performance, showing the effectiveness of ITAE for anomaly detection. To further validate the effectiveness of our method, we conduct experiments on a subset of the ILSVRC 2012 classification dataset [32]. Table 1 also shows the performance of GANomaly, GeoTrans, the baseline AE and our method on ImageNet. As can be seen, our method significantly outperforms the other three methods and maintains stable performance on more difficult datasets.

In addition, GeoTrans [11] takes up more GPU memory and computation time. For testing on CIFAR-10 (10,000 images in total), GeoTrans needs 285.45s (35fps, NVIDIA GTX 1080Ti, averaged over 10 runs) and 1389MB of GPU memory. ITAE takes only 36.97s (270fps, same experimental environment) and 713MB of GPU memory, i.e. it is roughly 7.7 times faster than GeoTrans, thanks to its efficient pipeline and network structure. Figure 3 shows the comparison of frames per second (FPS), GPU memory usage and AUROC of various anomaly detection methods tested on CIFAR-10. ITAE takes up relatively little GPU memory, and its FPS is relatively high.
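The throughput figures follow directly from the reported wall-clock times and the 10,000 CIFAR-10 test images. The short calculation below only re-derives them from the numbers quoted in the text; the variable and function names are chosen for this example.

```python
def frames_per_second(num_images, seconds):
    """Throughput in images per second."""
    return num_images / seconds


NUM_TEST_IMAGES = 10_000          # CIFAR-10 test set size
GEOTRANS_SECONDS = 285.45         # reported test time for GeoTrans
ITAE_SECONDS = 36.97              # reported test time for ITAE

geotrans_fps = frames_per_second(NUM_TEST_IMAGES, GEOTRANS_SECONDS)
itae_fps = frames_per_second(NUM_TEST_IMAGES, ITAE_SECONDS)
speedup = GEOTRANS_SECONDS / ITAE_SECONDS
memory_ratio = 713 / 1389         # ITAE memory relative to GeoTrans

print(f"GeoTrans: {geotrans_fps:.0f} fps, ITAE: {itae_fps:.0f} fps")
print(f"speedup: {speedup:.1f}x, memory ratio: {memory_ratio:.2f}")
```

The rounded throughputs agree with the 35fps and 270fps quoted above, and the memory ratio shows that ITAE uses about half of the GPU memory of GeoTrans.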

4.2 Experiments on Real-world Anomaly Detection

Previous works [11, 9] experiment on multi-class classification datasets since there is a lack of comprehensive real-world datasets available for anomaly detection. By defining anomalous events as occurrences of different object classes and splitting the datasets based on unsupervised settings, the multi-class datasets can be used for anomaly detection experiments. However, the real anomalous data does not necessarily meet the above settings, e.g. damaged objects. In this section, we experiment on the most recent real-world anomaly detection benchmark dataset MVTec AD [4].

Methods          Temporal Dependency?  AUROC
TSC [23]         yes                   67.9
StackRNN [23]    yes                   68.0
AE-Conv3D [41]   yes                   69.7
MemAE [12]       yes                   71.2
AE-Conv2D [13]   no                    60.9
OURS             no                    72.5
Table 3: Average area under the ROC curve (AUROC) in % of anomaly detection methods on ShanghaiTech [23] dataset. The best performing method in each experiment is in bold.

MVTec anomaly detection dataset. MVTec Anomaly Detection (MVTec AD) dataset [4] contains 5354 high-resolution color images of different object and texture categories. It contains normal images intended for training and images with anomalies intended for testing. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, and various structural changes. In this paper, we conduct image-level anomaly detection tasks on MVTec AD dataset to classify normal and anomalous objects.

Comparison with state-of-the-art methods. Table 2 shows that ITAE performs better than the baseline AE, GANomaly and GeoTrans. The advantage of ITAE over GeoTrans grows when moving from idealized benchmark datasets to the real-world dataset MVTec AD. We conclude that ITAE is more adaptable to complex real-world environments.

0 1 2 3 4 5 6 7 8 9 avg
AE 57.1 54.9 59.9 62.3 63.9 57.0 68.1 53.8 64.4 48.6 59.3
AE+S 72.8 41.8 66.4 57.5 71.0 62.8 68.4 48.5 56.8 31.9 57.8
AE+G 67.4 60.9 60.5 67.1 67.0 65.5 70.7 69.3 69.7 61.0 65.6
AE+R 76.1 80.0 83.6 77.1 89.2 83.0 82.6 85.0 90.0 75.9 82.2
AE+G+R 78.5 89.8 86.1 77.4 90.5 84.5 89.2 92.9 92.0 85.5 86.6
Table 4: Average area under the ROC curve (AUROC) in % of anomaly detection methods for different components on CIFAR-10. “S”, “G” and “R” represent scaling, graying and random rotation operations. The best performing method in each experiment is in bold.
0 74.2 74.1 77.8 78.5
1 82.0 80.7 86.8 89.8
2 82.6 81.9 85.2 86.1
3 77.2 77.1 76.0 77.4
Table 5: Average area under the ROC curve (AUROC) in % of anomaly detection methods for different losses on part of CIFAR-10. “$\ell_1$” means $\ell_1$ loss and “$\ell_2$” means $\ell_2$ loss. For example, “$\ell_2$-$\ell_1$” means using $\ell_2$ loss as the training loss to train autoencoders and using $\ell_1$ loss to calculate the restoration error when testing. The best performing method in each experiment is in bold.

4.3 Experiments on Video Anomaly Detection

Video anomaly detection, as distinguished from image-level anomaly detection, requires the detection of anomalous objects and strenuous motions in video data. We experiment on a recent video anomaly detection benchmark dataset, ShanghaiTech [23], comparing our method with other state-of-the-art methods.

ShanghaiTech. ShanghaiTech [23] contains multiple scenes with complex light conditions and camera angles, together with annotated anomalous events and a large number of training frames. In the dataset, objects other than pedestrians (e.g. vehicles) and strenuous motion (e.g. fighting and chasing) are treated as anomalies.

Comparison with state-of-the-art methods. Since ITAE is designed for image-level anomaly detection, unlike some state-of-the-art methods [23, 41, 12], we use single frames rather than stacks of neighboring frames as inputs. In order to apply the random rotation transformation, we resize all the images to a fixed size. We use ResNet34 [14] as our encoder. Following [13, 23, 12], we obtain the normality score $s_t$ of the $t$th frame by normalizing the errors to the range $[0, 1]$:

$$s_t = 1 - \frac{e_t - \min_t e_t}{\max_t e_t - \min_t e_t},$$

where $e_t$ denotes the restoration error of the $t$th frame in a video episode. A value of $s_t$ closer to $0$ indicates that the frame is more likely to be anomalous. Table 3 shows the AUROC values on the ShanghaiTech dataset. The results show that ITAE outperforms all the compared state-of-the-art methods, including some temporally dependent methods [23, 41, 12].
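The per-video normalization can be sketched as follows. This is an illustration under an explicit assumption: the score is taken to be one minus the min-max normalized restoration error, a common convention for normality scores in which values near 0 flag anomalous frames. The function name `normality_scores` and the handling of constant-error videos are choices made for this example.

```python
def normality_scores(errors):
    """Min-max normalize per-frame restoration errors and flip them.

    Assumed convention: score 1 for the frame with the smallest error and
    score 0 for the frame with the largest error within one video episode.
    """
    lowest, highest = min(errors), max(errors)
    span = highest - lowest
    if span == 0:
        # All frames have the same error; treat them as equally normal.
        return [1.0 for _ in errors]
    return [1.0 - (e - lowest) / span for e in errors]


# Restoration errors of three consecutive frames in one toy episode.
scores = normality_scores([2.0, 4.0, 6.0])
print(scores)
```

Because the normalization is done per episode, scores are comparable between frames of the same video but not directly between different videos.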

4.4 Ablation Study and Discussion

In this part, we study the contribution of each proposed component of ITAE independently. Table 4 shows the experimental results of the ablation study on CIFAR-10. Both the graying and the random rotation operations improve the performance significantly, especially the random rotation operation. Table 5 shows the ablation study on the choice of restoration loss. It shows that using the $\ell_2$ loss as the training loss and the $\ell_1$ loss to calculate the restoration error performs best. Based on the ablation study, we claim that the image transformations, the network architecture and the loss functions we use all contribute independently to the model performance.

We use image scaling to study the degradation problem of the ITAE caused by ill-selected transformations. Downsampling of images can delete part of the image information. However, the second condition is not met since the deleted pixel-level information can be inferred from neighboring pixels and this rule is the same between normal and anomalous data. We test on CIFAR10 with a 0.5x scaling and obtain 58.8% AUROC for ITAE, while that of AE is 59.3%, showing that the ITAE degenerates into a vanilla AE with ill-selected transformations.

Methods #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 avg SD
GeoTrans  [11] 91.55 72.38 81.26 82.94 87.04 87.95 87.24 81.77 85.51 85.68 84.33 5.22
OURS 99.93 99.94 99.95 99.94 99.95 99.93 99.93 99.94 99.92 99.93 99.94 0.01


Table 6: Average area under the ROC curve (AUROC) in % of anomaly detection methods with digit “1” on MNIST for ten runs. Our stability is much higher than GeoTrans.

4.5 Model Stability

Anomaly detection places greater emphasis on the stability of model performance than traditional classification tasks do, because the lack of anomalous data makes it impossible to perform validation during training. Model stability therefore becomes more important, since without validation we cannot be confident of selecting the best checkpoint of an anomaly detection model during training.

The stability of model performance is mainly reflected in three aspects: 1) whether the model can stably reach convergence after an acceptable number of training epochs; 2) whether the model can reach a stable performance level in multiple independent training attempts under the same training configuration; 3) whether the model can stably achieve good performance on various datasets and training configurations. Figure 4 shows how the AUC changes during one run, revealing that our model performs stably in the late training phase instead of fluctuating.

Figure 4: Training process under three methods. Both logs are achieved on the MNIST dataset. It shows the case when the digit “7” is the normal class. We attach complete logs for Fashion-MNIST and MNIST datasets in the appendix.

Thus, we can be confident of obtaining a robust model after a training period for this task, in which validation is practically unavailable. To test the stability across multiple training attempts, we rerun GeoTrans [11] and our method 10 times on MNIST. Table 6 shows that GeoTrans suffers a larger performance fluctuation than our method. For the third aspect, the standard deviation (SD) among classes is a good measure. The SD values in Table 1 indicate that our method has the strongest stability of this type.
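The run-to-run stability statistics can be recomputed from the ten AUROC values listed in Table 6. The sketch below uses the sample standard deviation from Python's `statistics` module, which is an assumption about how the SD column was computed; the variable names are chosen for this example.

```python
from statistics import mean, stdev

# AUROC (%) of ten independent runs with digit "1" as the normal class,
# copied from Table 6.
geotrans_runs = [91.55, 72.38, 81.26, 82.94, 87.04,
                 87.95, 87.24, 81.77, 85.51, 85.68]
ours_runs = [99.93, 99.94, 99.95, 99.94, 99.95,
             99.93, 99.93, 99.94, 99.92, 99.93]

summary = {
    "GeoTrans": (mean(geotrans_runs), stdev(geotrans_runs)),
    "OURS": (mean(ours_runs), stdev(ours_runs)),
}

for name, (average, spread) in summary.items():
    print(f"{name}: avg={average:.2f}, SD={spread:.2f}")
```

Under this assumption the rounded averages and standard deviations agree with the "avg" and "SD" columns of Table 6.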

4.6 Visualization Analysis

Anomaly detection on images. In order to demonstrate the effectiveness of transformations for anomaly detection in a simple and straightforward way, we visualize some restoration outputs from ITAE, comparing with GANomaly in Figure 5. All visualization results are based on the number “6” as normal samples.

Figure 5: Visualization analysis comparing with GANomaly on MNIST. “Ori”, “I” and “O” represent original images, inputs and outputs, respectively. Cases with outputs similar to “Ori” are considered normal, otherwise anomalous. All visualization results are based on the number “6” as normal samples.
(a) Frame (b) AE-Conv2D (c) ITAE(G) (d) ITAE(G+R)
Figure 6: Restoration error maps of AE and ITAE on an anomalous frame of ShanghaiTech. Chasing is the anomalous event in this frame (red bounding box). “G” means graying and “R” means random rotation transformation. ITAE can significantly highlight the anomalous parts in the scene.

The first column “Ori” represents original images. “I” means images after transformation. Note that outputs should always be compared with the original images, not with the inputs. Cases with outputs similar to “Ori” are considered normal, otherwise anomalous. The bottom row in Figure 5 shows the test of the digit “9”. Four outputs differ greatly from “Ori” and are thus recognized as anomalous. For every digit other than “6”, ITAE produces restoration outputs that have the wrong orientation or are ambiguous, which enlarges the gap in restoration error between normal and anomalous data. In contrast, all the outputs from GANomaly are similar to the ground truth, meaning that it is less capable of distinguishing between normal and anomalous data. We conclude that guiding information embedding with transformations is successful, since the outputs show that the model attempts to restore every image using the orientation distribution of the digits learned from the training set.

Anomaly detection on videos. Figure 6 shows restoration error maps of AE and ITAE on an anomalous frame of ShanghaiTech. In this frame, chasing is the anomalous event (red bounding box in Figure 6 (a)). AE generalizes so “well” that it can reconstruct this frame, including the chasing humans, and therefore fails to detect this anomalous event. ITAE significantly highlights the anomalous parts of the scene, which is why ITAE outperforms state-of-the-art methods on video anomaly detection.
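As a rough illustration of how a restoration error map and a scalar anomaly score relate, the sketch below compares a restored image with the original (not with the transformed input). The function names are hypothetical, images are plain nested lists to keep the example self-contained, and the mean absolute (L1) error is an assumption about the scoring rule rather than the exact formulation used in the experiments.

```python
def error_map(original, restored):
    """Per-pixel absolute restoration error for two equally sized 2-D images."""
    return [[abs(o - r) for o, r in zip(row_o, row_r)]
            for row_o, row_r in zip(original, restored)]

def anomaly_score(original, restored):
    """Mean absolute (L1) restoration error; larger values suggest an anomaly."""
    emap = error_map(original, restored)
    n_pixels = sum(len(row) for row in emap)
    return sum(sum(row) for row in emap) / n_pixels

# Toy 2 x 2 "images" (hypothetical values).
original = [[0.0, 1.0], [1.0, 0.0]]
good_restoration = [[0.0, 1.0], [1.0, 0.0]]   # stands in for a normal sample
poor_restoration = [[0.5, 1.0], [0.0, 0.0]]   # stands in for an anomalous sample

score_normal = anomaly_score(original, good_restoration)
score_anomalous = anomaly_score(original, poor_restoration)
```

The large entries of `error_map` play the role of the highlighted regions in Figure 6, while `anomaly_score` collapses the map into one number per image or frame.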

5 Conclusion and Future Work

In this paper, we propose a novel technique named Inverse-Transform AutoEncoder (ITAE) for anomaly detection. Simple transformations are employed to erase certain information. The ITAE learns the inverse transform to restore the original data. The restoration error is expected to be a good indicator of anomalous data. We experiment with two simple transformations: graying and random rotation, and show that our method not only outperforms state-of-the-art methods but also achieves high stability. Notably, there are still more transformations to explore. These transformations, when added to the ITAE, are likely to further improve the performance for anomaly detection. We look forward to the addition of more transformations and the exploration of a more intelligent transformation selection strategy. In addition, this way of feature embedding can also be applied to more fields, opening avenues for future research.
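The two information-erasing transformations named above, graying and random rotation, can be sketched as follows. This is a minimal illustration under stated assumptions: graying is taken to be a plain channel average, rotations are restricted to multiples of 90 degrees, and all function names are hypothetical. Images are nested lists so the example runs without extra libraries.

```python
import random

def to_gray(image):
    """Erase color: average the channels of an H x W x C image (nested lists)."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in image]

def rotate90(image, k):
    """Rotate a 2-D image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image

def erase(image, rng=random):
    """Apply graying and a random rotation; return the network input and the rotation index."""
    k = rng.randrange(4)   # assumed rotation set: 0, 90, 180, 270 degrees
    return rotate90(to_gray(image), k), k

# A 2 x 2 RGB toy image (hypothetical values).
rgb = [[[0, 3, 6], [9, 9, 9]],
       [[1, 2, 3], [4, 4, 4]]]
network_input, k = erase(rgb, random.Random(0))
```

In training, the model would receive `network_input` and be asked to restore `rgb`, so that color and orientation information must be embedded in the model itself.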


  • [1] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon (2018) GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training. ACCV. Cited by: Appendix D, §1, §4.1.2, Table 1, Table 1, Table 2.
  • [2] S. Akçay, A. Atapour-Abarghouei, and T. P. Breckon (2019) Skip-ganomaly: skip connected and adversarially trained encoder-decoder anomaly detection. arXiv preprint arXiv:1901.08954. Cited by: §2, §4.1.1.
  • [3] J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE. Cited by: §2.
  • [4] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger (2019) MVTec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, Cited by: §A.2, §1, §4.2, §4.2, Table 2.
  • [5] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. Journal of machine Learning research. Cited by: §A.1, 5th item.
  • [6] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, Cited by: §4.1.1.
  • [7] R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. Cited by: §2.
  • [8] V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR). Cited by: §1, §2.
  • [9] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft (2018) Anomaly detection with generative adversarial networks. Cited by: §2, §4.1.2, §4.2, Table 1, Table 1.
  • [10] E. Eskin (2000) Anomaly detection over noisy data using learned probability distributions. In ICML, Cited by: §2.
  • [11] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. In NeurIPS, Cited by: §2, §4.1.2, §4.1.2, §4.2, §4.5, Table 1, Table 1, Table 2, Table 6.
  • [12] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. Cited by: §1, §2, §4.3, §4.3, Table 3.
  • [13] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In CVPR, Cited by: §4.3, Table 3.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.2, §4.3.
  • [15] D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. In ICLR, Cited by: §2.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §4.1.2.
  • [17] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §4.1.1.
  • [18] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.1.2, Table 1.
  • [19] B. R. Kiran, D. M. Thomas, and R. Parakkal (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging. Cited by: §2.
  • [20] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §A.3, Table 9, Figure 7, Appendix C, 3rd item, 4th item.
  • [21] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: Figure 8, Appendix D, 1st item.
  • [22] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, Cited by: §2.
  • [23] W. Luo, W. Liu, and S. Gao (2017) A revisit of sparse coding based anomaly detection in stacked rnn framework. In ICCV, Cited by: §1, §4.3, §4.3, §4.3, §4.3, Table 3.
  • [24] M. Markou and S. Singh (2003) Novelty detection: a review—part 1: statistical approaches. Signal Processing. Cited by: §2.
  • [25] M. Markou and S. Singh (2003) Novelty detection: a review—part 2: neural network based approaches. Signal Processing. Cited by: §2.
  • [26] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, Cited by: Appendix D, §1.
  • [27] P. Perera, R. Nallapati, and B. Xiang (2019) OCGAN: one-class novelty detection using gans with constrained latent representations. Cited by: §1, §1, §2, §4.1.2, Table 1.
  • [28] M. A. F. Pimentel, D. A. Clifton, C. Lei, and L. Tarassenko (2014) A review of novelty detection. Signal Processing. Cited by: §2.
  • [29] M. Rahmani and G. K. Atia (2017) Coherence pursuit: fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing. Cited by: §2.
  • [30] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Cited by: §4.1.1.
  • [31] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In ICML, Cited by: §1.
  • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: §A.1, §1, 5th item, §4.1.2, §4.
  • [33] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli (2018) Adversarially learned one-class classifier for novelty detection. In CVPR, Cited by: §1, §2.
  • [34] M. Sakurada and T. Yairi (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Mlsda Workshop on Machine Learning for Sensory Data Analysis, Cited by: §1.
  • [35] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, Cited by: §1, §2, §4.1.2, Table 1.
  • [36] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun (2015) Learning discriminative reconstructions for unsupervised outlier removal. In ICCV, Cited by: §2.
  • [37] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §A.3, Table 9, Figure 9, Appendix D, 2nd item.
  • [38] H. Xu, C. Caramanis, and S. Sanghavi (2012) Robust pca via outlier pursuit. IEEE Transactions on Information Theory. Cited by: §2.
  • [39] K. Yamanishi, J. I. Takeuchi, G. Williams, and P. Milne (2000) On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining & Knowledge Discovery. Cited by: §2.
  • [40] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Deep structured energy based models for anomaly detection. Cited by: §4.1.2, Table 1, Table 1.
  • [41] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. Hua (2017) Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, Cited by: §4.3, §4.3, Table 3.
  • [42] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, Cited by: §1, §2, §4.1.2, Table 1, Table 1.

Appendix A Class Names and Index for Datasets

A.1 ImageNet [32]

Although the ImageNet website provides a class tree, subjectively picking classes from it is not fully convincing, and the fact that not all image labels sit at the same tree level causes additional trouble. To make the division of samples into super-clusters as objective as possible and to avoid cherry-picking, we used LDA [5], a popular language processing tool, to cluster the categories automatically. Because of limited computing resources, we randomly select 10 categories in our experiment. Table 7 shows the specific category index.

Label | Class | Index
0 | Snake | n01728920, n01728572, n01729322, n01734418, n01737021, n01740131
1 | Finch | n01530575, n01531178, n01532829, n01534433, n01795545, n01796340
2 | Spider | n01773157, n01773549, n01774384, n01775062, n01773797, n01774750
3 | Big cat | n02128385, n02128925, n02129604, n02130308, n02128757, n02129165
4 | Beetle | n02165105, n02165456, n02169497, n02177972, n02167151
5 | Wading bird | n02007558, n02012849, n02013706, n02018795, n02006656
6 | Monkey | n02486261, n02486410, n02488291, n02489166
7 | Fungus | n12985857, n13037406, n13054560, n13040303
8 | Cat | n02123045, n02123394, n02124075, n02123159
9 | Dog | n02088364, n02105412, n02106030, n02106166, n02106662, n02106550, n02088466, n02093754, n02091635
Table 7: Index of clustering results for ImageNet.

A.2 MVTec AD [4]

The MVTec AD dataset contains 5354 high-resolution color images of different object and texture categories. It contains normal images intended for training and images with anomalies intended for testing. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, contaminations, and various structural changes. Table 8 shows the class names and anomalous types for each category.

Index | Class Name | Anomalous Types
0 | Bottle | broken large, broken small, contamination
1 | Capsule | crack, faulty imprint, poke, scratch, squeeze
2 | Grid | bent, broken, glue, metal contamination, thread
3 | Leather | color, cut, fold, glue, poke
4 | Pill | color, combined, contamination, crack, faulty imprint, pill type, scratch
5 | Tile | crack, glue strip, gray stroke, oil, rough
6 | Transistor | bent, cut, damaged, misplaced
7 | Zipper | broken teeth, combined, fabric border, fabric interior, rough, split teeth, squeezed teeth
8 | Cable | bent wire, cable swap, combined, cut inner insulation, cut outer insulation, missing cable, missing wire, poke insulation
9 | Carpet | color, cut, hole, metal contamination, thread
10 | Hazelnut | crack, cut, hole, print
11 | Metal nut | bent, color, flip, scratch
12 | Screw | manipulated front, scratch head, scratch neck, thread side, thread top
13 | Toothbrush | defective
14 | Wood | color, combined, hole, liquid, scratch
Table 8: Class names and anomalous types of MVTec AD.

A.3 Other Datasets

Table 9 describes the content of all single classes on Fashion-MNIST [37], CIFAR-10 [20] and CIFAR-100 [20].

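Dataset | Index | Class Name
Fashion-MNIST | 0 | Ankle-boot
Fashion-MNIST | 1 | Bag
Fashion-MNIST | 2 | Coat
Fashion-MNIST | 3 | Dress
Fashion-MNIST | 4 | Pullover
Fashion-MNIST | 5 | Sandal
Fashion-MNIST | 6 | Shirt
Fashion-MNIST | 7 | Sneaker
Fashion-MNIST | 8 | T-shirt
Fashion-MNIST | 9 | Trouser
CIFAR-10 | 0 | Airplane
CIFAR-10 | 1 | Car
CIFAR-10 | 2 | Bird
CIFAR-10 | 3 | Cat
CIFAR-10 | 4 | Deer
CIFAR-10 | 5 | Dog
CIFAR-10 | 6 | Frog
CIFAR-10 | 7 | Horse
CIFAR-10 | 8 | Ship
CIFAR-10 | 9 | Truck
CIFAR-100 | 0 | Aquatic mammals
CIFAR-100 | 1 | Fish
CIFAR-100 | 2 | Flowers
CIFAR-100 | 3 | Food containers
CIFAR-100 | 4 | Fruit and vegetables
CIFAR-100 | 5 | Household electrical devices
CIFAR-100 | 6 | Household furniture
CIFAR-100 | 7 | Insects
CIFAR-100 | 8 | Large carnivores
CIFAR-100 | 9 | Large man-made outdoor things
CIFAR-100 | 10 | Large natural outdoor scenes
CIFAR-100 | 11 | Large omnivores and herbivores
CIFAR-100 | 12 | Medium-sized mammals
CIFAR-100 | 13 | Non-insect invertebrates
CIFAR-100 | 14 | People
CIFAR-100 | 15 | Reptiles
CIFAR-100 | 16 | Small mammals
CIFAR-100 | 17 | Trees
CIFAR-100 | 18 | Vehicles 1
CIFAR-100 | 19 | Vehicles 2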
Table 9: Class names of Fashion-MNIST [37], CIFAR-10 [20] and CIFAR-100 [20].

Appendix B Model Structure of ITAE

Table 10 shows the model structure of ITAE. It is based on an encoder-decoder framework with 4 blocks in the encoder and 4 blocks in the decoder. Each block contains a max-pooling or an upsampling operation and two convolutional layers. Skip connections are added to facilitate the backpropagation of gradients and improve the quality of image reconstruction.
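To make the block layout concrete, the sketch below traces the spatial resolution through four encoder blocks and four decoder blocks. It is not the layer table itself: the 32 x 32 input size, the factor-2 pooling and upsampling, and the function name are all assumptions made for illustration.

```python
def trace_resolutions(size, n_blocks=4):
    """Spatial size after each encoder block (halving) and decoder block (doubling)."""
    encoder = []
    for _ in range(n_blocks):
        size //= 2          # max-pooling with stride 2 (assumed)
        encoder.append(size)
    decoder = []
    for _ in range(n_blocks):
        size *= 2           # upsampling by a factor of 2 (assumed)
        decoder.append(size)
    return encoder, decoder

encoder_sizes, decoder_sizes = trace_resolutions(32)

# Resolutions shared by encoder and decoder, i.e. where a skip connection
# could join feature maps of equal spatial size.
skip_sizes = [s for s in decoder_sizes if s in encoder_sizes]
```

Under these assumptions the decoder returns to the input resolution after the fourth block, which is what restoring the original image requires.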

Figure 7: Visualization analysis on CIFAR-10 [20]. “Ori”, “I” and “O” represent original images, inputs and outputs, respectively. Cases with outputs similar to “Ori” are considered normal, otherwise anomalous. All visualization results are based on the class “horse” as the normal class.
Layer Input Output
Table 10: Structure of ITAE.

Appendix C Visualization Analysis on CIFAR-10

In order to further demonstrate the effectiveness of transformations for anomaly detection, we visualize some restoration outputs of CIFAR-10 [20] from ITAE in Figure 7. All visualization results are based on the class “horse” as the normal class. The first column “Ori” represents original images. “I” means images after transformation. Note that outputs should always be compared with original images but not inputs. Cases with outputs similar to “Ori”, i.e. lower loss, are considered normal, otherwise anomalous. ITAE enlarges the gap of restoration error between normal and anomalous data.

Appendix D Model Stability

We argue that our proposed method achieves more robust performance. The main challenge in anomaly detection is the lack of negative samples. Without validation, model stability is more important than in traditional classification tasks. We train three models, ITAE, a traditional autoencoder [26] and GANomaly [1], on each category of the MNIST [21] and Fashion-MNIST [37] datasets and test them every 5 epochs during training. The traditional autoencoder [26] and GANomaly [1] serve as our baselines. The test performance during training is shown in Figure 8 and Figure 9, from which we can see that the performance of ITAE always converges to a high level; moreover, ITAE shows the highest performance stability at the end of training.
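The monitoring procedure described here, scoring the test set at regular checkpoints and looking at how much the curve still moves late in training, can be sketched as below. The AUROC is computed by direct pairwise comparison, which is adequate for small examples; the function names, the toy scores, and the use of the max-min range over the last checkpoints as a fluctuation measure are assumptions for illustration and not the evaluation code behind Figure 8 and Figure 9.

```python
def auroc(scores, labels):
    """AUROC as P(anomaly score > normal score); ties count as 0.5. Label 1 = anomalous."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def late_phase_range(auc_log, last_k=5):
    """Max minus min AUROC over the last `last_k` checkpoints; smaller means more stable."""
    tail = auc_log[-last_k:]
    return max(tail) - min(tail)

# Toy restoration errors (hypothetical) for two normal and two anomalous samples.
labels = [0, 0, 1, 1]
perfect = auroc([0.1, 0.2, 0.8, 0.9], labels)
partial = auroc([0.1, 0.8, 0.2, 0.9], labels)

# Hypothetical AUROC log, one entry per checkpoint (e.g. every 5 epochs).
auc_log = [0.60, 0.75, 0.88, 0.90, 0.91, 0.90, 0.91, 0.91]
fluctuation = late_phase_range(auc_log, last_k=5)
```

A small `fluctuation` indicates that any late checkpoint would perform about equally well, which matters when no validation set is available to choose among them.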

Figure 8: Reported accuracy under the L1 metric on the test dataset of MNIST [21]. The ten sub-images show, in order, the cases where each of the digits “0”-“9” is set as the normal category.
Figure 9: Reported accuracy under the L1 metric on the test dataset of Fashion-MNIST [37]. The ten sub-images show, in order, the cases where each of the classes 0-9 is set as the normal category.